Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu,...

50
Research in Industrial Projects for Students Sponsor Ant Financial Forgery Detection in Alipay Student Members Catherine Sullivan (Project Manager), Siena College , [email protected] David Liu, HKUST, ECE [email protected] Ryan Chan, HKUST, CPEG [email protected] Terrence Alsup, Georgia Institute of Technology [email protected] Academic Mentor Dr. Haixia Liu, [email protected] Sponsoring Mentors Xunyang, [email protected] Xiaoyu, [email protected] Date: August 12, 2016 This project was jointly supported by Ant Financial and NSF Grant 1460018.

Transcript of Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu,...

Page 1: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Research in Industrial Projects for Students

Sponsor

Ant Financial

Forgery Detection in Alipay

Student Members

Catherine Sullivan (Project Manager), Siena College,

[email protected]

David Liu, HKUST, ECE

[email protected]

Ryan Chan, HKUST, CPEG

[email protected]

Terrence Alsup, Georgia Institute of [email protected]

Academic Mentor

Dr. Haixia Liu, [email protected]

Sponsoring Mentors

Xunyang, [email protected]

Xiaoyu, [email protected]

Date: August 12, 2016

This project was jointly supported by Ant Financial and NSF Grant 1460018.

Page 2: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant
Page 3: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Abstract

Improvements in encryption have enabled online pay services, such as Alipay, to becomemore widely used. One of the biggest challenges these services face is verifying new informa-tion, particularly images uploaded by users. Many different techniques have been studiedin the literature that approach the problem of detecting images that have been digitallyaltered. It is our goal to implement and compare the performance of these methods as wellas to use ensemble learning to combine them in order to provide a more reliable and robustclassifier. Various statistical features of digital image are studied and analyzed in this arti-cle. Specifically we check the correlation between noise in the image and certain EXIF data,check for double compression using both Benford’s Law and a Block Grain Analysis, andinvestigate the color filter array. At the end of this article, Benford model and Block GrainAnalysis model are combined using ensemble learning to give a much better estimation witha respectable AUC score of 0.91.

3

Page 4: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant
Page 5: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Acknowledgments

We would like to express our great appreciation to Dr. Haixia Liu for hersuggestions and encouragement throughout the project. We would also liketo extend our gratitude to Xunyang and Xiaoyu for their advice and supportthroughout the project. Finally, we would additionally like to extend our thanksto Albert Ku for overseeing the whole program.

We would like to thank the following organizations for their support in theprogram: Ant Financial, IPAM, NSF, and HKUST.

5

Page 6: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant
Page 7: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Contents

Abstract 3

Acknowledgments 5

1 Introduction 13

2 EXIF Data Based Detection 152.1 Software Tag Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.2 Noise Features Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3 JPEG Double Compression Detection 213.1 Background on JPEG Compression . . . . . . . . . . . . . . . . . . . . . . . 213.2 Benford’s Law Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223.3 Block Grained Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4 Color Filter Array Algorithm 334.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334.2 Estimation of Interpolation Correlation . . . . . . . . . . . . . . . . . . . . 334.3 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5 Combining all of the Algorithms 375.1 Stacking and the Combiner Model . . . . . . . . . . . . . . . . . . . . . . . 375.2 Combining all of the Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 375.3 Combining Benford’s Law Algorithm and Block Grain Analysis Algorithm . 385.4 Conclusion and Further Work . . . . . . . . . . . . . . . . . . . . . . . . . . 41

APPENDIXES

A Data Sets Used 43A.1 Data Set - Noise Feature Algorithm . . . . . . . . . . . . . . . . . . . . . . 43A.2 Data Set - Benford’s Law Algorithm . . . . . . . . . . . . . . . . . . . . . . 43A.3 Data Set - Block Grain Analysis . . . . . . . . . . . . . . . . . . . . . . . . 44A.4 Data Set - CFA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45A.5 Data Set - Combo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

B Abbreviations 47

REFERENCES

Selected Bibliography Including Cited Works 49

7

Page 8: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant
Page 9: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

List of Figures

2.1 Visualization of the Errors for Noise Feature Algorithm (Note: The entiredataset is presented here.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.2 kNN ROC Curve and CV Accuracy for Noise Feature Algorithm . . . . . . 202.3 Random Forest ROC Curve and CV Accuracy for Noise Feature Algorithm 20

3.1 Histograms of the Leading Non-zero Digit of a DCT Coefficient . . . . . . . 223.2 Random Forest ROC Curve and CV Accuracy for Benford’s Law Algorithm 233.3 kNN ROC Curve and CV Accuracy for Benford’s Law Algorithm . . . . . . 243.4 Accuracy versus Percentage of Image Altered for Benford’s Algorithm. . . . 243.5 Probability Map for an Image Being A-DJPG . . . . . . . . . . . . . . . . . 293.6 An Example of Binary Map and 3 Different Types of Perimeter Used . . . . 303.7 Random forest ROC curve and CV accuracy for Block Grained Analysis . . 303.8 kNN ROC curve and CV accuracy for Block Grained Analysis . . . . . . . . 31

4.1 Authentic Image and Result of CFA Algorithm . . . . . . . . . . . . . . . . 344.2 Tampered Image and Result of CFA Algorithm . . . . . . . . . . . . . . . . 354.3 Random Forest ROC Curve and CV Accuracy for CFA Algorithm . . . . . 35

5.1 Visualization of the Features for Combining Block Grain Analysis and Ben-ford’s Law Algorithm (Note: The entire dataset is presented here.) . . . . . 39

5.2 Contour Maps and Estimate Probability Density Functions for Real and Al-tered Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.3 Random Forest ROC Curve and CV Accuracy for Combining Block GrainAnalysis and Benford’s Law Algorithm . . . . . . . . . . . . . . . . . . . . . 41

9

Page 10: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant
Page 11: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

List of Tables

2.1 Results of Noise Feature Algorithm using 4 ML Algorithm for AuthenticImages from [11] and Images with Manipulations of Adding Text, ChangingBrightness, or Changing Contrast . . . . . . . . . . . . . . . . . . . . . . . . 19

3.1 Results of Benford’s Law Algorithm using 4 ML algorithm . . . . . . . . . . 233.2 Accuracy on Individual Classification and FAR/FRR Achieved by Benford’s

Law Algorithm for Different Modification Types . . . . . . . . . . . . . . . 253.3 Accuracy Achieved by Benford’s Law Algorithm for Unaltered Double Com-

pressed Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.4 Accuracy Achieved by Benford’s Law Algorithm for Modified Double Com-

pressed Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.5 Table of Notation for Block Grained Analysis . . . . . . . . . . . . . . . . . 263.6 AUC Achieved by Block Grained Analysis for Images with Added Text . . 283.7 AUC Achieved by Block Grained Analysis for Shifted Images . . . . . . . . 283.8 Result of Block Grain Analysis using 4 ML Algorithms . . . . . . . . . . . . 31

4.1 Results of CFA Algorithm using 4 ML Algorithms . . . . . . . . . . . . . . 354.2 Accuracy on Individual Classification and FAR/FRR Achieved by CFA Al-

gorithm for Different Modification Types . . . . . . . . . . . . . . . . . . . . 36

5.1 Results of Combining Block Grain Analysis, Benford’s Law Algorithm, NoiseFeature Algorithm, and CFA Algorithm . . . . . . . . . . . . . . . . . . . . 38

5.2 Performance Comparison for Individual Algorithms . . . . . . . . . . . . . . 385.3 Results of Combining Block Grain Analysis and Benford’s Law Algorithm . 41

A.1 Number of Images used for Noise Feature Algorithm . . . . . . . . . . . . . 43A.2 Number of Authentic Images . . . . . . . . . . . . . . . . . . . . . . . . . . 44A.3 Number of Fraudulent Images . . . . . . . . . . . . . . . . . . . . . . . . . . 45A.4 Number of Images for Different Quality Factor Combinations . . . . . . . . 45

11

Page 12: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant
Page 13: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Chapter 1

Introduction

Ant Financial Services Group, formerly known as Alipay, is an affiliate company of theChinese Alibaba Group based in Hangzhou, China. It is well known for operating theAlipay payment platform, an online payment platform. As of June 2016, Alipay is recordedto have 270 million active registered users (compared to its competitor PayPal, which has184 million) and 400 million certificated users in total. Alipay payment services are largelyextended to plenty offline scenarios to cover restaurants, supermarkets, convenient stores,taxis and public transports. In wealth management, Alipay users can now purchase financialproducts, including Yu’e Bao and Zhao Cai Bao where it has over 200 million users. It alsoruns Sesame Credit - an online credit rating system based on big data and cloud computing.Ant Financial Group aims to provide fast and secure financial services, especially to smallbusinesses and individual consumers.

Submission of ID documentation is important for an Alipay user to acquire full privilegeof their account, meaning identity recognition is a core part of providing secure financialservices to protect against identity fraud. According to Credit Suisse in 2014, it is estimatedthat half of China’s online payment transactions go through Alipay [20], which shows howcrucial it is to be able to detect against identity fraud. In this project, we aimed to developan image forgery detection algorithm for Alipay to include in their application.

The problem is simple to define: create an algorithm to determine whether an imagehas been tampered with using image-editing software. In the past, image editing wasleft to the professionals. Increasing popularity of new software such as Adobe Photoshop,for example, has made it much easier to manipulate images. The use of such softwarecan be done innocently but can also be done for illegal purposes such as creating fakeidentifications. For companies like Ant Financial who rely on images of identification, ithas become important to determine how to detect fraudulent images.

Ant Financial currently implements technology with image recognition. Although thesecurrently employed algorithms are useful in verifying the image information, they cannotidentify the authenticity of the image data. Therefore, Ant Financial is in need of technologythat can differentiate between fraudulent and authentic images.

To approach this problem, we start off by examining the software EXIF data tag tosee written clues for forgery. We then look at the f-number, exposure time, and ISO speedratings and compare it to the noise features that were observed in the image. Since imagesare not required to have EXIF data, we move on to studying JPEG images to look fordouble compression. Our first approach is to observe how the leading non-zero digit ofthe DCT coefficients followed Benford’s Law. Our second approach is to conduct a blockgrained analysis. After that, we try analyzing the color filter array. At last, we combine all

13

Page 14: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

algorithms to give a more reliable result.We begin in Chapter 2 by discussing the software tag algorithm and the noise features

algorithm and their results. We continue in Chapter 3 with our two approaches in lookingfor double JPEG compression. We then discuss about camera based detection in Chapter4. We show our overall result from the combined algorithm and conclude our project inChapter 5.

14

Page 15: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Chapter 2

EXIF Data Based Detection

EXIF stands for ”Exchangeable Image File Format” and this type of information is for-matted according to the TIFF specification, and may be found in JPG, TIFF, PNG, JP2,PGF, MIFF, HDP, PSP and XCF images, as well as many TIFF-based RAW images, andeven some AVI and MOV videos. The EXIF meta information is organized into differentimage file directories within an image [12]. The EXIF data was started by Japan ElectronicIndustries Development Association in 1998 and today is widely adopted mainly supportingTIFF and JPEG formats [13].

When the picture is edited or modified, there is a possibility for the EXIF data to bedeleted or changed [13]. In this case, our first algorithm was to look directly at the metadatatag called ”Software.” Using the last software opened with the image, we wanted the chanceto use it to detect if Photoshop or other editing software was used. Our second algorthmhad followed Fan, Cao and Kot’s [13] proposal by statistically analyzing the photo andcomparing it to the values listed in the Aperture, Shutter Speed, and ISO Speed RatingEXIF metadata tags.

2.1 Software Tag Algorithm

When looking at the EXIF data, we had noticed how there was a software tag that updatedthe most recent software that had saved the photo. That gave us the idea of creating aexpandable dictionary that was able determine if the software would indicate if the imagewas probably authentic (1) or probably fraudulent (−1). Since there is the case wherea photo does not contain EXIF data, we decided to return this as 0. The expandabledictionary works by returning the value of the software or if the software is not in the list thenprompting the user to identify which category the software should go under. The dictionaryis then saved as a pickle file. An additional script was written to aid in maintaining thedictionary by allowing the user to manually add an entry, delete an entry, change the valueof an entry, or view the current dictionary.

Creating this algorithm, we knew that it could never be perfect as people who knowabout EXIF data can easy change it or delete it. As a result, we did not test the algorithm.Since the entries in the dictionary are hashed, our algorithm runs in constant time andbecause of that reason, we felt it is a relatively quick check to make at the beginning forcatching some of the obviously fraudulent images.

15

Page 16: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

2.2 Noise Features Algorithm

For the second algorithm, we looked at the meta-tags in the EXIF data that people mightusually not think to change: the f-number, exposure time and ISO speed rating. Althoughthis algorithm relies on the fact that the image contains EXIF data and it contains values forthese three tags. These three features can affect the photo in certain ways. For example, thebrighter the image and more visual noise are associated with higher ISO speed ratings [13].In addition, past research on the photon transfer curve has shown that during the shutter-opening period, the effective light photons impinging onto a sensor, or digital camera signalintensity, correlates to the standard deviation of total noise. By finding the approximateamount of total noise of an image and estimating the DCSI from several shot settings likeaperture size, exposure time and ISO, it has been proposed by Fan, Cao and Kot [13] thatthere is a correlation between the EXIF header parameters and the image. If the correlationis broken then we can determine that the image is fraudulent. It is through the methodoutlined in [13] that we seek to implement and evaluate for detecting fraudulent images.

2.2.1 Outline of Extracting the Noise Features

To extract the noise features, we began by estimating the sharp area map using its gray-scale image using the computed values: 0.2989 ∗Ar + 0.5870 ∗Ag + 0.1140 ∗Ab where A isthe color image and r, g, and b correspond to the red, green and blue values for that pixel.From there, we estimate the sharpness residual image by taking the absolute value of thegray-scale image and the gray-scale image convoluted with a 3x3 Gaussian filter.

Using the cumulative distribution of the sharpness residual, we calculate out the valuethat corresponds to 0.9 to get the threshold. The top 10% of the cumulative distributionis the sharp area. Using the threshold, we then decide whether or not a pixel location isin the sharp area by comparing it’s sharpness residual value to the threshold. The sharpregions are then further dilated as to not affect the noise estimation using a 3x3 structureelement of all ones. Using this map, the statistical noise features within the non-sharp areacan be computed [13].

Following the same pattern as [13], we used the averaging filter and Gaussian filter forremoving high-frequency noise, the median filter for salt and pepper noise, and the Wienerfilter to remove the noise associated with the local pixels. Using these filters, we are ableto extract different types of residuals.

Then for each combination of the color channel (red, blue, and green) and the fourdenoising features, we calculated our noise features as the mean and standard deviation ofthe log-scale residuals. This gives us twenty-four different noise features in total for eachimage.

2.2.2 Outline of Modeling the Correlation

To model the correlation between noise and EXIF features, we start with transforming thenoise features into both first-order and second-order terms. Then, the least square approachis applied on the authentic data to find the weights and to model the correlation. Finally,the error between the estimated value and the real value of the EXIF features are calculatedfor classification.

Let za,1, . . . , za,m and zf,1, . . . , zf,n be the noise feature vectors extracted from m authen-tic images and n fraud images respectively. The original feature vector z = [z1, z2, . . . , z24]are changed to x = [z1, z2, . . . , z24, z1z2, z1z3, . . . , z23z24, z

21 , z

22 , . . . , z

224]. The dimension of

16

Page 17: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

the new feature vector will become 24 +(242

)+ 24 = 324. By using the second-order terms,

we can model the non-linearity between the noise features and the EXIF features. However,the inclusion of second-order terms brings the problem of more data required and increas-ing computational complexity. Moreover, some features may have a negative effect on theregression result. Therefore, feature selection techniques are needed to select the usefulfeatures. In our experiment, we choose to use sequential floating forward selection togetherwith Fisher’s ratio as the criterion function to obtain the optimal features subset [13].

To compute the Fisher’s ratio of a set of d features, we first obtain the least squaresolution on the authentic data, i.e. solve w for the equation:

Xaw = ya,

where Xa = [xa,1 . . .xa,m]T is a m-by-d matrix containing noise features in the given setfrom the authentic images, and ya is the EXIF feature vector of the authentic images.

Using weights w, we can get the error of the authentic data and fraud data respectively:

ea = ya −Xaw,

ef = yf −Xfw.

Hence, the Fisher’s ratio of the error is given by

r =(µa − µf )2

σ2a + σ2f

where µa = 1m

∑|ea|, µf = 1

n

∑|ef |, σ2a = 1

m−1∑

(|ea| − µa)2 and σ2f = 1n−1

∑(|ef | − µf )2.

ea and ef are entries in the error vectors ea and ef respectively [13].

The ratio can measure the discriminating power of the error between authentic andfraudulent images. With a higher ratio, better discriminating results can be achieved.Therefore, we try to find a subset of features with the highest Fisher’s ratio. We begin withan empty set and then do features inclusion and conditional exclusion alternately. In eachstep of inclusion or exclusion, Fisher’s ratio of all possible outcomes are evaluated and thebest feature is added to the set or the worst feature is removed if the remaining set performsbetter than the original set [13].

Finally, an optimal set of features are selected and the corresponding errors are calcu-lated again for classifications. Generally, an authentic image should get a small error whilea fraudulent image should get a large error since modifications on images may change theirnoise features and hence weaken the correlation between the noise and EXIF information.Then, we can combine errors from 3 noise-EXIF correlations to classify whether an imageis authentic or fraud.

2.2.3 Classification

After applying linear regression on the noise features we obtain an M -by-3 error matrixwhere M is the number of images. The error of our data can be seen in the scatter plotin Figure 2.1. In order to classify the error results, we looked into different classificationmethods.

• Logistic Regression

• Support Vector Machine (SVM)

17

Page 18: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Figure 2.1: Visualization of the Errors for Noise Feature Algorithm (Note: The entire dataset ispresented here.)

• K-Nearest Neighbors (k-NN)

• Random Forest

The logistics regression is originally a regression model which measure the relationshipbetween the categorical dependent variables by estimating probabilities using a logistic func-tion [8]. In our application, binomial logistic regression is performed for our classificationpurpose.

In machine learning, support vector machines (SVM) with associated learning algo-rithms falls into the category of supervised learning models. A SVM tries to construct ahyper-plane of a set of hyper-planes in a high dimensional space, which can be used forour classification purpose [4]. A good separation is achieved by finding the hyper-planewhich has the largest margin. The margin is computed by taking the largest distance to thenearest training data points. In general, by maximizing the margin, we can find a desiredhyper-lane thus lower the generalization error of the classifier.

K-Nearest Neighbors (k-NN) algorithm is a non-parametric method in pattern recog-nition which can be used for classification [1]. The output is a class membership whichis decided by majority votes of its neighbors. k-NN is among the simplest of all machinelearning algorithms and although it may not produce the highest accuracy, it can be usedto evaluate the feasibility and practicability of other classification methods.

Another classification technique we employed was random forest. Random forest is anensemble learner where many different decision tree classifiers cast a vote on the predictedoutput for any given input. In this case, our input is the vector of errors. To construct theforest we begin by choosing the parameter b to be the number of trees. An optimal b shouldbe obtained through validation. We randomly draw with replacement b different samplesfrom the overall training set to use to grow each tree. The difference between growing anindividual tree in a random forest versus a stand-alone decision tree rests in the selectionof features to split on. In a random forest only a randomly chosen subset is considered ateach stage, selecting the best feature in the subset to split on. The individual trees are

18

Page 19: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

grown until all examples in the training set have been classified. In other words, we neitherpreemptively nor retroactively prune the trees. The idea of the random forest is to reducethe correlation between the trees. This is done through several steps, such as when it israndomly selecting different training sets or randomly selecting the different features toconsider splitting on. One nice property of random forests is that they tend to not over fit,due to the fact that we are considering many votes from uncorrelated (or loosely correlated)trees [3].

2.2.4 Results of Noise Feature Algorithm

Different classification models mentioned in the previous section are used to evaluate thequality of our noise features, and in general, to improve overall performance of the algorithm.The performance is mainly evaluated by looking at the receiver operating curve (ROC) curvefor the FAR rate plotted against 1− FRR where the FAR is the false acceptation rate andFRR is the false rejection rate. The FAR rate is defined by:

Fraudulent Photos Accepted

Total Fraudulent Photos

and the FRR rate is defined by:

Authentic Photos Rejected

Total Authentic Photos.

The data set used can be found in Appendix A.1. After running the errors as featuresthrough several machine learning algorithms, the results can be seen in Table 2.1 wherelogistic regression produced the highest area under the curve (AUC). The classifiers optimalparameters were determined by 10-fold cross validation. Figures 2.2 and 2.3 show the ROCand CV for both the kNN and random forest learning algorithms. The ROC curves for bothSVM-linear and logistic regression look similar to those of random forest and kNN.

Algorithm Accuracy FAR FRR AUCLogistic Regression 0.7772 0.4319 0.0200 0.7957SVM-linear 0.7620 0.4781 0.0050 0.7955kNN (k = 62) 0.7658 0.4576 0.0175 0.7834Random Forest (b = 81) 0.6962 0.3650 0.2444 0.7480

Table 2.1: Results of Noise Feature Algorithm using 4 ML Algorithm for Authentic Images from[11] and Images with Manipulations of Adding Text, Changing Brightness, or Changing Contrast

We notice immediately that all of the algorithms, with the exception of random forest,have very low FRRs. In contrast the FARs are relatively high at about 45%. This suggeststhat each learning algorithm is tending to classify everything as a true image. This claimis further supported by the CV accuracy plots in Figures 2.3 and 2.4 as well as the plotof the errors in Figure 2.1. We see that the error points of real and fake images areextremely close together and are not easily distinguishable. There is also an enormousamount of variance, shown by a large confidence interval of the CV accuracy. The factthat the accuracy can range from a 40% to 100% means the algorithms are struggling todifferentiate forged and authentic images given the three error features. Furthermore, theshallow ROC curves suggest that to some degree any reduction in the false positive ratewill result in a comparable reduction of the true positive rate. The y-intercept seems tocorrespond with the percentage of fake data that lies in a patch in the upper-left corner of

19

Page 20: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Figure 2.2: kNN ROC Curve and CV Accuracy for Noise Feature Algorithm

AFigure 2.3: Random Forest ROC Curve and CV Accuracy for Noise Feature Algorithm

Figure 2.1, since there does not seem to be any data points corresponding real images inthat location.

For a suitable range of the parameter (k for kNN and b for random forest), both algo-rithms seem to avoid the problem of overfitting as accuracy appears to level out, which isa good indicator that these are robust to noise in the dataset.

In the sequel, when we are looking at the results of all of our algorithms in aggregate,this one will be particularly useful to avoid falsely rejecting a given image. Indeed, if thisalgorithm were to falsely reject an image it has only about a 2% chance of being wrong.Practically, this will mean users who upload a real image will be less likely to be deniedservice.

20

Page 21: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Chapter 3

JPEG Double CompressionDetection

Double compression based detection is among one of the most commonly used detectionmethod for image forensics. The process double compression refers to the decompression ofan existing compressed JPEG image and the subsequent re-compression action applied tothe image contents to produce a new JPEG image. This process is a natural and to someextent, an inextricable part of the existing JPEG image forgery procedures. Compared tothe EXIF data based algorithm we discussed in the previous chapter, double compressiondetection has a wider range of application as EXIF information is usually discarded in manyimage rendering procedures.

In this chapter, we will begin by discussing the JPEG compression process followed byhow we established our data set. We will then discuss our first method tried which is to seeif the DCT coefficients directly follow a prescribed distribution known as Benford’s Law.This will then be followed by our algorithm to look at a unified statistical model built todetect both aligned and non-aligend double JPEG compression.

3.1 Background on JPEG Compression

“JPEG” is a still picture coding standard created by Joint Photographic Experts Group.The JPEG standard defines how a image is compressed into a stream of bytes and de-compressed back into image. The EXIF standard is added for additional information forexchange of JPEG-compressed images. JPEG uses a lossy image compression form knownas DCT (discrete cosine transform). An image is first divided into 8 by 8 nonoverlappingblocks. Then, the DCT operation converts each block of the image from the spacial domaininto the frequency domain (transform domain). In the transform domain, high frequencyinformation is discarded. This process of information reduction is called quantization. Thequantized coefficients are then sequenced and losslessly packed into a output bit stream.

Double compression refers to when an image is compressed more than once. There areseveral ways for an image to be double compressed, for example, when the photo is takenthe camera may store the image in a JPEG format the first time and if you go to edit thephoto and save it again in JPEG format, parts of the image will be compressed a secondtime. In general, the double compression occurs on the non-tampered region while thetampered region experience only single compression. Therefore, the tampered region canbe localized by double compression detection.

21

Page 22: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Figure 3.1: Histograms of the Leading Non-zero Digit of a DCT Coefficient

Depending on the DCT grid used by the second compression, double compression canbe categorized into two classes. The second compression which uses the aligned DCT grid isreferred to as aligned double JPEG (A-DJPG) compression while the second case is referredto as non-aligned double JPEG (NA-DJPG) compression [2]. One observable feature of A-DJPG is the double quantization (DQ) effect. Double quantization effect refers to thechanges on the distribution of the DCT coefficients by consecutive quantization procedures.[17] shows that periodic artifacts are introduced to the histograms of DCT coefficients byA-DJPG compression. Different techniques such as statistical modeling, Fourier analysisor machine learning algorithms can be applied to detect such effects and hence recognizeA-DJPG images.

3.2 Benford’s Law Algorithm

Instead of analyzing the histogram of the DCT coefficients directly, our first approachfollows [15] and looks at the distribution of leading non-zero digits of the DCT coefficients.They should follow a prescribed distribution, known as generalized Benford’s Law [9], aftera single compression. Same as the distribution of the DCT coefficients, the distributionof the first non-zero digits is perturbed after a second compression. Such a difference canbe observed in the Figure 3.1. To utilize this difference for A-DJPG detection, we use thesame set of classification methods listed in Section 2.2.3 to classify images based on theirdistribution features of leading non-zero digits.

3.2.1 Outline of Distribution Features Extraction

To extract the distribution features of leading non-zero digits, we begin with transformingthe DCT coefficients x into theirs first digits d using the formula,

d =⌊ x

10blog10 xc

⌋.

By the nature of DCT, it packed the image information in the low frequencies. Thehigh frequencies are dominated by zeros and it affects the observation of the generalizedBenford’s Law. Also, the DC (direct current) term which is the first coefficient of the DCTis dependent on the images. Therefore, we only consider the histograms of the first 20AC (alternating current) terms, i.e. the first 21 frequencies excluding the first. Since we

22

Page 23: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Figure 3.2: Random Forest ROC Curve and CV Accuracy for Benford’s Law Algorithm

are looking at the first non-zero digit of each coefficient, our histogram has the domain ofthe integers {1, 2, . . . , 9}. As a consequence, for each image we have a distribution featurevector of 9× 20 = 180 dimensions.

3.2.2 Results of Benford’s Law Algorithm

The data set used can be found in Appendix A.2. As you can see in Table 3.1, among the4 algorithm we used, random forest has the best combined sum of FAR and FRR rates. Ifyou look at Figure 3.2, you can see that the CV for Random Forest has low variance. Thisleads us to believe that this algorithm has a lot of potential.

Algorithm Accuracy FAR FRR AUCRandom Forest (b = 240) 0.8234 0.1157 0.3316 0.8314kNN (k = 11) 0.8074 0.1059 0.4129 0.8238Logistic Regression 0.7347 0.0402 0.8374 0.7739SVM - Linear 0.7330 0.0227 0.8879 0.7610

Table 3.1: Results of Benford’s Law Algorithm using 4 ML algorithm

It is noticeable that all algorithms produce relatively low FAR results comparing tothe FRR. One of the reasons for the low FAR is that the authentic images that followa single compression have similar distributions as seen in Figure 3.1. Compressing theimage a second time ruins this nice distribution so we would like to perform further tests todetermine whether the current FAR rate is raised due to the double compressed authenticimages.

Although Logistic Regression and SVM - Linear provide extremely low FAR rates, usingthem as learning models for this algorithm would be unreasonable when they have high FRRrates. More than 80% of authentic images will get rejected and this will leave lots of honestcustomers upset. Especially since the majority of people applying to be full privilegedmembers will be providing authentic images.

Looking at Figure 3.3, after around 11 neighbors, the mean for the accuracy decreases.One of the reasons for this is over-fitting. Similar to the cross validation for random forest,the variance was also rather low. Compared to random forest, the FAR rate for kNN waslower by a percent, however, the FRR rate was higher by 8%.

23

Page 24: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Figure 3.3: kNN ROC Curve and CV Accuracy for Benford’s Law Algorithm

(a) Copy and Paste Not Random (b) Copy and Paste Random

Figure 3.4: Accuracy versus Percentage of Image Altered for Benford’s Algorithm.

To further analyze our algorithm, in Table 3.2 rather than running a binary classification,we wanted to see how well it could classify each type of modification. Single compressedunaltered images had the highest individual classification accuracy. This was expected as thetheory behind the algorithm is that the distribution gets perturbed after being compresseda second time. In addition, it appeared that the images that had been compressed a thirdtime had low FRR rates. However, for the double compressed images and the copy and pasteimages, the algorithm did not perform as well. We then looked into the double compressedmodified images to see if the combination of percentages for quality factors from beingcompressed once or twice had an affect.

Table 3.4 does not show a pattern for when it performs terribly. The accuracy for whenthe modified image was compressed at the highest quality factor twice is extremely low.This is probably due to the fact that the information lost in both compressions was not alot and the DCT coefficients were able to maintain the distribution. Table 3.3 only had ahigh accuracy for the same combination of quality factors. From this, we do not tink thecombination of quality factors matters for the algorithm.

Since the single and double compressed images did not do well, we looked into whetherthe percentage of the image altered mattered. As you can see in Figure 3.4, the accuracydoes not change all that much depending on the percentage of the image that is singlycompressed.

24

Page 25: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Authentic ImagesIndividual Classification

AccuracyFRR

Single Compressed Unaltered 0.9386 0.0614Double Compressed Unaltered 0.0000 0.8084

Double CompressedIndividual Classification

AccuracyFAR

Text - Fixed 0.0206 0.2045Text - Random 0.1801 0.3879Copy and Move Rectangle 0.0507 0.1949Copy and Move Circle 0.0583 0.2199Non-Aligned 0.0669 0.2045

Single / Double CompressedIndividual Classification

AccuracyFAR

Copy and Paste Not Random 0.0840 0.4414Copy and Paste Random 0.1957 0.4960

Triple CompressedIndividual Classification

AccuracyFAR

Text - Fixed 0.2473 0.0073Copy and Move Rectangle 0.1223 0.0132Copy and Move Circle 0.2148 0.0105Non-Aligned 0.1056 0.0055

Table 3.2: Accuracy on Individual Classification and FAR/FRR Achieved by Benford’s Law Algo-rithm for Different Modification Types

3.3 Block Grained Analysis

Besides the analysis based on the double quantization effect, our second approach follows[2] to take the NA-DJPG case into account. A unified statistical model is built to detectboth A-DJPG and NA-DJPG compression. In this part, we take a statistical approachinstead of classifying images by machine learning algorithms. Our goal is to estimate theprobabilities of a DCT block being single compressed, aligned doubly compressed, andnon-aligned doubly compressed, and hence generate a likelihood map of the images.

3.3.1 Outline of Modeling Double JPEG Compression

With the notation in Table 3.5, we first model a single JPEG compression process as

I1 = T −100 (D1(Q1(T00(I)))) + E1 = I + O1.

For the A-DJPG case, the distribution of the quantized DCT coefficients is considered.They are given by

C1 = Q1(T00(I)) = Q1(U),

C2 = Q2(T00(I1)) = Q2(D1(Q1(U)) + T00(E1))

respectively, where U = T00(I) is the unquantized DCT coefficients of the uncompressedimages.

Assume the distribution of the unquantized DCT coefficients of the uncompressed im-ages U be p0(u). By [17], we know that quantization acts as a many-to-one function on the

25

Page 26: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Quality Factor 250 60 70 80 90 100

Quality Factor 1

50 0.6267 0.4400 0.5067 0.5867 0.4533 0.266760 0.4933 0.6400 0.3867 0.6000 0.4533 0.320070 0.6533 0.4133 0.6533 0.4000 0.5867 0.306780 0.5733 0.6267 0.3600 0.6400 0.4267 0.333390 0.6000 0.6533 0.5867 0.4800 0.6533 0.2667100 0.6267 0.6267 0.6133 0.6267 0.6133 0.8667

Table 3.3: Accuracy Achieved by Benford’s Law Algorithm for Unaltered Double Compressed Images

Quality Factor 250 60 70 80 90 100

Quality Factor 1

50 0.7671 0.7436 0.7763 0.8243 0.8875 0.918960 0.7837 0.8243 0.7901 0.7671 0.8133 0.961570 0.4730 0.8442 0.7432 0.8289 0.7297 0.810180 0.8472 0.8312 0.7368 0.6579 0.8442 0.844290 0.8684 0.8553 0.8649 0.8026 0.6133 0.8684100 0.8442 0.8333 0.7368 0.6164 0.6364 0.3205

Table 3.4: Accuracy Achieved by Benford’s Law Algorithm for Modified Double Compressed Images

distribution of U, i.e. it maps several neighboring bins from the histogram of U to one binin that of C1. Hence, the distribution of C1 can be expressed as

pC1(x;Q1) =

Q1x+Q1/2∑u=Q1x−Q1/2

p0(u),

where Q1 is the quantization step from the quantization matrix.The dequantization process then up-samples the distribution by Q1 and introduces zeros

between the multiples ofQ1. Therefore, the distribution of the unquantized DCT coefficientsof singly compressed images is given by

pU1(v;Q1) =

{∑v+Q1/2u=v−Q1/2

p0(u), v = kQ1

0, otherwise.

Table 3.5: Table of Notation for Block Grained Analysis

I ImageU Unquantized DCT coefficientsC Quantized DCT coefficientsQ Quantization processD Dequantization processTab DCT transformation using a grid shifted left by a and

up by b pixelsE Rounding errors after decompressionO Overall errors resulted from single compression and de-

compression process

26

Page 27: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Hence, the distribution of C2 is given by

pC2(x;Q1, Q2) =

Q2x+Q2/2∑v=Q2x−Q2/2

pU1(v;Q1) ∗ gA(v), (3.1)

where gA(v) = 1σe√2πe−(v−µc)

2/σ2e is used to model the error [2]. It will be the distribution

for A-DJPG compression using quantization steps Q1 and then Q2.In the actual scenario, we are going to distinguish between A-DJPG compression and a

singly compression using quantization step Q2. The distribution of the singly compressionshould follow the distribution of C1 but using quantization steps Q2 instead of Q1, i.e. thedistribution is

pC1(x;Q2) =

Q2x+Q2/2∑u=Q2x−Q2/2

p0(u). (3.2)

For the NA-DJPG case, we begin with modeling the single compression process using ashifted grid as

I1 = T −1ab (D1(Q1(Tab(I)))) + E1.

After the second compression using a normal grid, the new image I2 = I1 + O2 =T −1ab (D1(Q1(Tab(I)))) +E1 +O2. To eliminate the inverse DCT function T −1ab , we apply theDCT with a grid shifted with a by b pixels on I2. As a result,

Tab(I2) = D1(Q1(Tab(I))) + Tab(E1 + O2).

It is noticeable that Tab(I) should also follow the distribution of U which is p0(u). Hence,the distribution used for detect NA-DJPG compression is

pNA(x;Q1) = pU1(x;Q1) ∗ gNA(x), (3.3)

where gNA(x) = 1√2π(σ2

e+Q22/12)

e−(x−µc)2/(σ2

e+Q22/12) is used to model the overall error [2].

If the NA-DJPG compression is absent, we simply assume the distribution of the DCTcoefficients after DCT Tab follows that of U, i.e.

pNNA(x) = p0(x). (3.4)

The underlying reason of this assumption is that misaligned DCT usually destroys thequantization effects on the distribution of DCT coefficients [2].

After getting the formulas, the only missing part is estimating the parameters neededfor modeling from the images. Using an expectation maximization approach [2], we can getthe estimated values of p0(x), Q1, µe, σe, and the grid shift a, b. Then, the likelihoods ofa DCT coefficient (quantized for A-DJPG and Tab shift-transformed for NA-DJPG) beingfrom A-DJPG and NA-DJPG are given by

LA(x) =pC2(x;Q1, Q2)

pC1(x;Q2),

LNA(x) =pNA(x;Q1)

pNNA(x),

respectively.We choose to use the first 6 DCT frequencies for detection rather than 20 which we did

in Benford’s Law algorithm in Section 3.2.1. The likelihood of a block being A-DJPG andNA-DJPG can be computed by multiplying that of individual frequencies together. Finally,the likelihood can be compared to a threshold to give the detection result.

27

Page 28: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

3.3.2 Quality Factor Differences in Block Grained Analysis

By setting different thresholds, we can get the ROC curves of the above algorithm. It isimportant to know that the AUC (area under curve) depends on the quality factors of twocompression processes, so AUC obtained from different quality factors are listed in Tables3.6 and 3.7 as the experimental results.

For text altering, one can easily observe that good detection results are only obtainedwhen the first quality factors is smaller than the second quality factors, i.e. QF 1 < QF 2.[2] suggests that it may be caused by the inaccurate estimation of the model parameterssuch as Q1. When QF 1 ≥ QF 2, the double quantization effect cannot be noticed by theprobability model and hence it gives erroneous results.

Quality Factor 2

Quality Factor 1

50 60 70 80 90 10050 0.4930 0.7773 0.8477 0.8615 0.8413 0.801160 0.6286 0.5197 0.7796 0.8491 0.8191 0.787370 0.6421 0.6486 0.5182 0.7552 0.8087 0.793680 0.4984 0.6031 0.6290 0.5447 0.8009 0.780490 0.5206 0.5631 0.5031 0.4435 0.5143 0.7504100 0.5351 0.5514 0.5469 0.5507 0.5097 0.5006

Table 3.6: AUC Achieved by Block Grained Analysis for Images with Added Text

For shift modification, the algorithm is generally unreliable. A possible reason is thatthe grid shift a and b may be wrongly estimated due to a large portion of the image beingthe fixed region [2]. Since the model get the estimated parameters based on the wholeimage, it is hard for it to know the correct gird shift if the shifted region only occupies 1/16of the whole image. In reality, an image is usually forged by tampering with only a smallportion. As a result, this algorithm need some improvements before real world application.

Quality Factor 2

Quality Factor 1

50 60 70 80 90 10050 0.4011 0.4405 0.3978 0.4611 0.4983 0.473060 0.3885 0.4204 0.4822 0.4652 0.5060 0.521170 0.3701 0.4104 0.4756 0.4669 0.4862 0.546280 0.4190 0.4023 0.4660 0.5082 0.4565 0.586490 0.3993 0.4169 0.4691 0.4882 0.4982 0.5922100 0.3971 0.4176 0.4719 0.4934 0.4894 0.5086

Table 3.7: AUC Achieved by Block Grained Analysis for Shifted Images

3.3.3 Features from Block Grained Analysis

As mentioned in Section 3.3.1, we were able to get the likelihood of a block being A-DJPGwhich gives us a probability map that is one sixty-fourth of the image size. A heat map ofthese probabilities can be seen in Figure 3.5.

From this probability map, we took a total of 13 features. First, we considered themean and the variance of the probability from each of the blocks. Another feature was thepercentage of the image that was below the threshold. The natural threshold was 0 since the

28

Page 29: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Figure 3.5: Probability Map for an Image Being A-DJPG

probability was the natural log of the percentage that the image was singly compressed overthe percentage that the image was doubly compressed. This meant if the value was negativethen there was more likely a chance that the image was doubly compressed. Another featurewas counting the number of areas that was below the threshold and then we also took theaverage size of these areas.

From [19] they proposed that one of the features be the ratio of the perimeter to thearea. We counted the perimeter 3 different ways to get 3 features. The first way we countedthe perimeter was if there was counting the number of edges between either the edge of themap and the block below the threshold or an edge between a block below the threshold anda block above the threshold. The other two ways involved erosion where the probabilitymap is transformed into a binary map indicating whether it is below, 1, or above, 0, thethreshold. The perimeter is then counted by using a three by three matrix filled with one’sor by a three by three matrix where only the corners are zero’s. An example is shown inFigure 3.6.

The last 5 features came from correlations between the distributions of DCT coefficientsof the whole photo and that of the blocks below the threshold, i.e. the singly compressedblocks. They basically check the validity of the probability map. If the algorithm worksproperly, the blocks identified as singly compressed should have a different DCT coefficientsdistributions from a doubly compressed image. The correlation is used as a measurementof such differences between the distributions, and the first 5 AC frequencies are consideredto give 5 features.

3.3.4 Results of Testing Block Grained Features

Using the features discussed in Section 3.3.3, we trained on four different learning algo-rithms. Our results are shown in Table 3.8; random forest had the highest accuracy andAUC by at least 6%. As you can see though in Figure 3.7, the variance spans over 20%.Although the variance of kNN shown in Figure 3.8 spans over 15%, the accuracy is muchlower. Although kNN has the lowest FRR rate, it has the highest FAR rate meaning thatit accepts more images than it should.

29

Page 30: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

(a) A binary map(b) Perimeter = 68 given by counting edges

(orange)

(c) Perimeter = 48 given by boundary pixels(red)

(d) Perimeter = 33 given by boundary pixels(green)

Figure 3.6: An Example of Binary Map and 3 Different Types of Perimeter Used

Figure 3.7: Random forest ROC curve and CV accuracy for Block Grained Analysis

30

Page 31: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Algorithm Accuracy FAR FRR AUCRandom Forest (b = 260) 0.8072 0.2845 0.1072 0.8565kNN (k = 36) 0.7420 0.4483 0.0804 0.7808Logistic Regression 0.7434 0.3937 0.1287 0.8047SVM - Linear 0.6824 0.2960 0.3378 0.7822

Table 3.8: Result of Block Grain Analysis using 4 ML Algorithms

Figure 3.8: kNN ROC curve and CV accuracy for Block Grained Analysis

31

Page 32: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant
Page 33: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Chapter 4

Color Filter Array Algorithm

Camera based detection algorithm examine traces left by the camera during photo capturingprocess. One of these algorithms looks at the color filter array (CFA). A color filter arrayis a mosaic of small color filtering micro lens placed over the pixels sensors array to capturecolor images. A digital color image consists of three color channels containing samples fromdifferent bands of spectrum, i.e., red, green and blue. Typical photo-sensors, charge-coupleddevice or complementary metal-oxide semiconductor, only detect light intensity with littleor no wavelength specificity, thus cannot obtain color information. The raw image datacaptured by the sensor array only contains one third of the color information and the restis calculated by CFA interpolation. For example, the green and blue components of red-filtered pixels are interpolated by checking the adjacent green and blue-filtered pixels. Afterthis process the raw image is converted into a full-color image.

Since a specified interpolation, or mosaicing algorithm is used for all the pixels in a colordigital image, a globally correlation between the adjacent pixels should be observed. Varioustypes of images forgery techniques, including copy and paste, smoothing and rotating willlocally destroy this correlation. In this chapter, we will try to utilized this property for ourforgery detection process.

4.1 Outline

A commonly applied color filter array is the Bayer Array with a color pattern of 0.5 green,0.25 red and 0.25 blue, hence also called RGBG. The pixels in the image generated byapplying Bayer fliting will have a relatively simple relationship with its adjacent pixels.An initial model and corresponding parameters is estimated and expectation-maximization(EM) algorithm is used to optimize our prediction. Probability is also calculated to producea probability map which can be used as an additional tool for forgery indication [6].

4.2 Estimation of Interpolation Correlation

Our model assumes that in single channel of the image, f(x, y), each sample belongs to oneof the two models:

M1 : f1(x, y) =

N∑u,v=−N

αu,vf(x+ u, y + v),

33

Page 34: Research in Industrial Projects for Students Ant Financialalsup/projects/Alipay.pdfXiaoyu, xiaoyu.ft@alibaba-inc.com Date: August 12, 2016 This project was jointly supported by Ant

Figure 4.1: Authentic Image and Result of CFA Algorithm

where α = {αu,v| − N ≤ u, v ≤ N}. The model’s linear coefficients are given by ourprediction and M2 if the sample is from the sensors and not correlated to its neighbors, i.e.

M2 : f2(x, y) is given by an estimation of uniform distribution

.In the expectation process, the residue error for the first model is calculated:

r1(x, y) = f(x, y)−N∑

u,v=−Nαu,vf(x+ u, y + v),

By assuming that the probabilities of observing a sample generated from M1 and M2

follow a Gaussian distribution and a uniform distribution respectively, the likelihood thatit is from M1 can be calculated as follow:

w1(x, y) = Pr(α|r1(x, y)) =e−r

21(x,y)/2σ

e−r21(x,y)/2σ +

√2πσp0

.

In the maximization process, we will obtain a new estimation of α by trying to minimizethe quadratic error function:

E(α) =∑x,y

w1(x, y)r21(x, y).

The error function is minimized by setting ∂E∂αs,t

= 0, it yields

N∑u,v=−N

αu,v

(∑x,y

w1(x, y)f(x+ s, y + t)f(x+ u, y + v)

)=∑x,y

w1(x, y)f(x+s, y+t)f(x, y).

By solving the above set of linear equations, we can obtain αu,v. We then will repeatthe expectation and maximization processes until αu,v converge.

4.3 Results and Analysis

For authentic pictures, the expectation-maximization (EM) algorithm usually converges to a nice set of values. The correlation can also be observed from the generated probability map, in which white regions indicate higher probability and black regions indicate lower probability. In Figure 4.1, the probability map shows the natural edges (high frequency parts) as black boundary lines with a smooth transition relative to the highly correlated regions, and the 2-D Fourier transform shows single, concentrated local peaks.

Figure 4.2: Tampered Image and Result of CFA Algorithm

Figure 4.3: Random Forest ROC Curve and CV Accuracy for CFA Algorithm

The CFA analysis of the tampered image, shown in Figure 4.2, exhibits several distinguishing features. The region tampered by the copy and paste operation appears in the probability map with rather sharp edges. Several peaks in the higher frequency domain can be observed in the Fourier transform, which indicates the forgery.

We tried to use the estimated weights α_{u,v} from the different color channels as features to predict the authenticity of photos. Applying the different machine learning algorithms from Section 2.2.3 gives the results shown in Table 4.1 (a sketch of this classification pipeline follows the table). The best algorithm is random forest, which gives an accuracy of 0.5811 and an AUC of 0.6008, which is still unreliable: the algorithm can barely distinguish real and fake images. The high variance, shown by the large confidence interval of the CV accuracy in Figure 4.3, further illustrates the ineffectiveness of this method.

Algorithm                  Accuracy   FAR      FRR      AUC
Random Forest (b=120)      0.5811     0.4286   0.4093   0.6008
kNN (k=8)                  0.5465     0.5322   0.3764   0.5871
Logistic Regression        0.4785     0.3585   0.6813   0.4961
SVM - rbf                  0.4896     0.1092   0.9038   0.5246

Table 4.1: Results of CFA Algorithm using 4 ML Algorithms
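As a minimal sketch of the classification pipeline referenced above (our exact scripts are not reproduced here), the following assumes a feature matrix X of estimated α weights per image and a label vector y with 1 marking fraudulent images; it reports the same accuracy, FAR, FRR, and AUC metrics as Table 4.1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_cfa_features(X, y, seed=0):
    """Train a random forest on CFA weight features and report Accuracy/FAR/FRR/AUC.

    X : (n_images, n_features) array of estimated alpha weights per color channel
    y : (n_images,) array with 1 = fraudulent, 0 = authentic
    """
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    clf = RandomForestClassifier(n_estimators=120, random_state=seed).fit(X_tr, y_tr)

    prob = clf.predict_proba(X_te)[:, 1]      # probability of being fraudulent
    pred = (prob >= 0.5).astype(int)

    acc = np.mean(pred == y_te)
    far = np.mean(pred[y_te == 1] == 0)       # fraudulent photos accepted as authentic
    frr = np.mean(pred[y_te == 0] == 1)       # authentic photos rejected as fraudulent
    auc = roc_auc_score(y_te, prob)
    return acc, far, frr, auc
```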

In order to explain the poor performance of this algorithm, we investigate the different types of image forgery individually. According to Table 4.2, the CFA algorithm works well for the authentic images and the copy and paste images, but it did not work well on our text-based modifications or on copy and move.


One possible reason for this observation is that text-based modification and copy and move do not significantly change the CFA interpolation. Characters added to a picture may be too thin for the algorithm to notice the differences, while parts copied from the same image retain the same CFA interpolation pattern.

Image Type         Individual Classification Accuracy   FRR or FAR   FRR/FAR Value
Authentic          0.7718                                FRR          0.1416
Text               0.1316                                FAR          0.7544
Copy and Move      0.3446                                FAR          0.5946
Copy and Paste     0.6987                                FAR          0.1416

Table 4.2: Accuracy on Individual Classification and FAR/FRR Achieved by CFA Algorithm for Different Modification Types

Moreover, integrating more features may enhance the performance of image forgery detection. As mentioned in the previous paragraph, peaks are observed in the high frequency domain of the tampered images. Given more time, a possible improvement would be to extract more features from the frequency domain instead of using the estimated weights α_{u,v} alone. For instance, the number of high frequency peaks and the energy of the high frequency components may help pick out the forged images.


Chapter 5

Combining all of the Algorithms

No single algorithm by itself will be able to detect every type of forgery. What is needed is a solution built out of several algorithms. In this chapter, we discuss how we combined our algorithms to produce a solution to the proposed problem. The data set used in combining all of the algorithms is described in Appendix A.5.

5.1 Stacking and the Combiner Model

We have seen many different algorithms thus far, all of which have returned a binary classification result. We will now modify these results to instead return a probability that the image has been forged. Doing so has several advantages; for example, we get a better sense of the confidence behind our results. Suppose a simple threshold is set where an image is classified as authentic below 0.5 probability and labeled fake above it. If some model determines that an image has a 0.51 probability of being fake, then it is classified as fake even though the model is extremely uncertain.

We use the models learned from each algorithm to produce such a probability. These probabilities then become the new features when we combine the models. This procedure is a type of ensemble learning known as "stacking". Using a newly constructed data set, we run the data through each of the individual models to obtain the probabilities, split this data into training and testing sets, and proceed to learn from the training data as we would with any other learning algorithm.
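As an illustrative sketch only, the stacking step can be written as follows; benford_prob, bga_prob, noise_prob, and cfa_prob are assumed to be arrays holding the per-image forgery probabilities returned by the individual models, and y the corresponding labels. Logistic regression is used here as the level-1 learner, but any of the classifiers from Table 5.1 could take its place.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def stack_models(benford_prob, bga_prob, noise_prob, cfa_prob, y, seed=0):
    """Level-1 'stacking' combiner: the base models' probabilities become the features.

    Each *_prob argument is a 1-D array with one forgery probability per image,
    produced by running the corresponding base model on a new data set;
    y holds the true labels (1 = fraudulent, 0 = authentic).
    """
    X = np.column_stack([benford_prob, bga_prob, noise_prob, cfa_prob])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)

    combiner = LogisticRegression().fit(X_tr, y_tr)        # level-1 model
    stacked_prob = combiner.predict_proba(X_te)[:, 1]      # combined forgery probability
    return roc_auc_score(y_te, stacked_prob)
```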

In our case we look both at combining all of the models and at combining only the Benford's Law model with the Block Grain Analysis model. In the latter case we reduce our data set of images to a 2-dimensional continuous feature space.

5.2 Combining all of the Models

As seen in Table 5.1, the performance of the combiner model built from all of the individual models was poor. A possible explanation is to attribute the poor performance to the noise feature and CFA models. Recall that the noise feature model has an extremely high variance, due to the closely overlapping data points, as well as a tendency to almost always falsely reject an image. Naturally this will bias the results, and since it is only rejecting images it should be considered irrelevant for determining the authenticity of an image anyway. With regard to the CFA model, the ROC curves were extremely close to baseline randomness. Keeping such a model in the combiner will only serve to introduce unnecessary noise and artificially increase the variance of our results.


As was already discussed, the features selected for the CFA were likely very poor indicators of the authenticity of the image and would only obscure the classification procedure. Our hope is to leverage the benefits of each individual model while mitigating their drawbacks in the combiner. As we shall see, the combiner model does in fact perform better than either of the individual models did, indicating that the learning was a success.

Algorithm                  Accuracy   FAR      FRR      AUC
Random Forest (b = 190)    0.6400     0.3267   0.4600   0.6440
kNN (k = 65)               0.4425     0.6867   0.1700   0.6007
Logistic Regression        0.6400     0.3267   0.4600   0.6503
SVM - Linear               0.6400     0.3267   0.4600   0.6503

Table 5.1: Results of Combining Block Grain Analysis, Benford's Law Algorithm, Noise Feature Algorithm, and CFA Algorithm

Table 5.2 shows the results when we test our individual models on the newly constructed data set. We see that the CFA model performs with roughly 50% accuracy, indicating close to random performance. Moreover, the FRR for the noise feature model is 100%, meaning that it has no bearing on determining the actual authenticity of an image. Although the Benford and Block Grain Analysis models did not perform stellarly here either, they performed the best individually and had acceptable AUC scores earlier. It is for these reasons that we decided to train the combiner model using only the results from these two models as features.

Algorithm                  Accuracy   FAR      FRR
CFA                        0.5250     0.5233   0.3300
Block Grain Analysis       0.3450     0.7800   0.2800
EXIF Noise Feature         0.7500     0.0000   1.0000
Benford's Law Algorithm    0.5950     0.3767   0.4900

Table 5.2: Performance Comparison for Individual Algorithms

5.3 Combining Benford's Law Algorithm and Block Grain Analysis Algorithm

One major advantage of combining only the Benford and Block Grain Analysis models is that the new feature space is 2-dimensional and therefore easy to visualize. Figure 5.1 does exactly this, and we can immediately see the advantages of using a combiner model. Previously, the individual algorithms used a threshold to differentiate between real and fake images. These thresholds correspond to strictly vertical or horizontal lines in the new feature space. Since the models are now combined, we have a much richer geometry at our disposal. For example, we can classify images based on regions, such as the patch of real images in the upper-left or the group of fake images in the lower-right.

Figure 5.1 still shows a lot of overlap between the real and fake images. However, there is a noticeable difference in the concentration of real versus fake images in any given region. This prompts us to plot the joint probability density functions (p.d.f.'s) in the new feature space for both real and fake images. Since these p.d.f.'s are unknown, we will need to estimate them ourselves.


Figure 5.1: Visualization of the Features for Combining Block Grain Analysis and Benford's Law Algorithm (Note: The entire dataset is presented here.)


Figure 5.2: Contour Maps and Estimated Probability Density Functions for Real and Altered Images. (a) Contour map for real images; (b) contour map for altered images; (c) estimated probability density function for real images; (d) estimated probability density function for altered images.

This was done using a kernel density estimator with a Gaussian kernel. The bandwidth was determined using Scott's Rule:

Bandwidth: h = n^{-1/(d+4)},

where n is the number of data points and d is the dimension (2 in this case).

When looking at the contour maps, it can easily be seen that for the real images there is a high density around 0.45 for the Benford's Law algorithm and 0.2 for the Block Grain Analysis algorithm. This high density can also be seen in the spike in the 3-D depiction (Figure 5.2 (c)). For the altered images, the density does not have the same peak but is instead much flatter (Figure 5.2 (d)).
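A minimal sketch of this density estimation, assuming the two combiner features for a group of images are stored in arrays benford_score and bga_score: SciPy's gaussian_kde uses Scott's factor n^{-1/(d+4)} by default, which matches the bandwidth rule above.

```python
import numpy as np
from scipy.stats import gaussian_kde

def estimate_joint_pdf(benford_score, bga_score, grid_size=100):
    """Estimate the joint p.d.f. of the two combiner features with a Gaussian KDE.

    SciPy's default bandwidth is Scott's factor n ** (-1 / (d + 4)), with d = 2 here.
    Returns a grid (XX, YY) and the estimated density on it, suitable for contour plots.
    """
    data = np.vstack([benford_score, bga_score])          # shape (2, n)
    kde = gaussian_kde(data, bw_method="scott")

    xs = np.linspace(data[0].min(), data[0].max(), grid_size)
    ys = np.linspace(data[1].min(), data[1].max(), grid_size)
    XX, YY = np.meshgrid(xs, ys)
    density = kde(np.vstack([XX.ravel(), YY.ravel()])).reshape(XX.shape)
    return XX, YY, density
```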

Using a combiner of these two models gave us much better accuracy and improved scores overall. This can be explained largely by the fact that the authentic and inauthentic images are concentrated in different areas. As seen in Table 5.3, both random forest and kNN performed well, with accuracies above 80%. Looking at the 95% confidence interval in Figure 5.3, we can see that the variance is fairly tight and that the true accuracy is likely between about 75% and 87%. The FAR and FRR scores for the random forest were respectable, and when coupled with an impressive AUC of 0.91, these results are clearly better than using either model individually.


Algorithm                  Accuracy   FAR      FRR      AUC
Random Forest (b = 250)    0.8375     0.1566   0.1800   0.9059
kNN (k = 5)                0.8025     0.2133   0.1500   0.8848
Logistic Regression        0.5750     0.4633   0.3100   0.5948
SVM - RBF                  0.5625     0.5133   0.2100   0.7019

Table 5.3: Results of Combining Block Grain Analysis and Benford’s Law Algorithm

Figure 5.3: Random Forest ROC Curve and CV Accuracy for Combining Block Grain Analysis and Benford's Law Algorithm

5.4 Conclusion and Further Work

While stacking the models improved the results, the approach is still not infallible and requires refinement. With regard to the CFA algorithm, we would like to examine the Fourier transform maps in order to extract features, similar to those from the Block Grain Analysis, relating to the peaked regions. We would also like to investigate non-aligned double JPEG compressed images, a case we did not consider in our experiments. Perhaps a more comprehensive Block Grain Analysis model can be developed by including this case.

There are also many other types of detection methods. For example, there are geometric and physics-based detections. Real environments are subject to physical laws which naturally manifest themselves in photos. If an image is tampered with, then there is a chance that inconsistencies with these laws arise. Geometric and physics-based detection algorithms attempt to find these inconsistencies in order to distinguish between real and fake images. One such method is the examination of lighting and shadowing in a natural scene: the idea is to estimate the light directions of different objects in a photo mathematically and then search for discrepancies between them. If the light directions are inconsistent, then the image can be classified as fraudulent. Implementing several different physics- and geometry-based algorithms could prove useful when combined. Finally, our work was done mainly from the perspective of shallow learning. In the future, it may be beneficial to attempt to solve this problem from a deep-learning angle. Deep learning is an exciting field with many applications to analyzing images, from image recognition to self-driving cars, and it seems reasonable that such technology could be useful for solving our problem as well.


Appendix A

Data Sets Used

A.1 Data Set - Noise Feature Algorithm

To test our algorithm, we used photographs from the 'Dresden Image Database' [11], which provided us with photos containing EXIF data. For the altered photographs, we wrote our own scripts. The first script changed the brightness of a randomly selected rectangular or elliptical area in the photograph; the amount by which the brightness increased or decreased was also randomly selected. The second script was similar to the first except that it changed the contrast of the image. The third script placed a string of numbers in a random shade of gray at a random location on the image. The total number of images used of each type is outlined in Table A.1 (a simplified sketch of these scripts is given after the table). The training and testing split was a 60/40 ratio.

Authentic                    1,002
Fraudulent
    Contrast                   334
    Brightness                 334
    Text                       300
    Total Fraudulent           968
Total                        1,970

Table A.1: Number of Images used for Noise Feature Algorithm
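A simplified sketch of alteration scripts along these lines, written with Pillow, is shown below; our actual scripts and parameter ranges differed, and the function names and numeric ranges here are illustrative only.

```python
import random
from PIL import Image, ImageDraw, ImageEnhance

def alter_brightness(path, out_path):
    """Change the brightness of a randomly placed rectangular region by a random amount."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    x0, y0 = random.randint(0, w // 2), random.randint(0, h // 2)
    x1, y1 = random.randint(x0 + 1, w), random.randint(y0 + 1, h)
    factor = random.uniform(0.5, 1.5)          # < 1 darkens, > 1 brightens
    region = ImageEnhance.Brightness(img.crop((x0, y0, x1, y1))).enhance(factor)
    img.paste(region, (x0, y0))
    img.save(out_path)

def add_random_text(path, out_path):
    """Place a random digit string in a random shade of gray at a random position."""
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    gray = random.randint(0, 255)
    text = str(random.randint(10**6, 10**7 - 1))
    draw.text((random.randint(0, img.width // 2), random.randint(0, img.height // 2)),
              text, fill=(gray, gray, gray))
    img.save(out_path)
```

A contrast script would follow the same pattern as alter_brightness, with ImageEnhance.Contrast in place of ImageEnhance.Brightness.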

A.2 Data Set - Benford’s Law Algorithm

The data set used for this algorithm was different from the one for the noise features algorithm because we needed to know the quality factor of the images and how many times each image had been compressed. Following the data set process in [2], we acquired RAW images from RAISE, a raw images data set for digital image forensics [5]. RAW images differ from JPEG images in that JPEG compresses the pixel information, whereas a RAW image saves all of the information in the pixels. By starting with RAW images, we can perform the first compression ourselves, knowing the quality factor.

We downloaded a total of 998 RAW images and compressed each image with a quality factor of 50, 60, 70, 80, 90, and 100 for a total of 5,988 authentic images. For the double compressed images, we started with a selection of 50 RAW images and compressed each of them twice with every possible combination of quality factors from the set {50, 60, 70, 80, 90, 100}.


The total number of authentic images can be found in Table A.2.

                       Number of Images
Single Compressed      5,988
Double Compressed      1,800
Total                  7,788

Table A.2: Number of Authentic Images

For the double compressed fraudulent images, we took the same 50 RAW images and performed modifications on them. Starting from the RAW image, these images were created in the order: compress, modify, and then compress again.
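For illustration, this compress-modify-compress pipeline can be sketched with Pillow as below; modify stands for any of the alteration functions, and the RAW frame is assumed to have already been decoded to an RGB image (for example with a library such as rawpy).

```python
from PIL import Image

QUALITY_FACTORS = [50, 60, 70, 80, 90, 100]

def compress_modify_compress(rgb_image, modify, q1, q2, out_path, tmp_path="tmp.jpg"):
    """Compress the decoded RAW frame, apply a tampering step, then compress again."""
    rgb_image.save(tmp_path, "JPEG", quality=q1)     # first compression (quality factor q1)
    modified = modify(Image.open(tmp_path))          # modification (text, copy-move, ...)
    modified.save(out_path, "JPEG", quality=q2)      # second compression (quality factor q2)

# Example usage over every (q1, q2) combination from the quality factor set:
# for q1 in QUALITY_FACTORS:
#     for q2 in QUALITY_FACTORS:
#         compress_modify_compress(raw_rgb, add_random_text, q1, q2, f"forged_{q1}_{q2}.jpg")
```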

The first type of modification was to write a string with fixed size, color, and position in the top-left corner. Every image in this set had the string "HELLO WORLD" printed at position (0, 0) in 50 pt Arial font with an RGB value of (0, 0, 0). The second type of modification used the same RGB value, font, and string but positioned it randomly with a random font size. The one requirement was that the area of the picture covered by the text had to be larger than 1% of the image.

The copy and move modifications took either a rectangular or an elliptical portion of the image and copied it to another area. The area being copied from and the area being moved to were disjoint. In addition, both locations and the sizes of the areas were randomly selected.

To obtain non-aligned images, a sixteenth of the image was copied and moved up by 3 pixels and to the left by 3 pixels. The copied area was the rectangle with corners at pixel locations (3, 3), (3 + 0.25 * width, 3), (3, 3 + 0.25 * height), and (3 + 0.25 * width, 3 + 0.25 * height). This rectangle was then moved to the rectangular area with corners at (0, 0), (0.25 * width, 0), (0, 0.25 * height), and (0.25 * width, 0.25 * height).
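A sketch of this non-aligned shift using Pillow; the 3-pixel offset is what misaligns the 8x8 JPEG block grid of the second compression relative to the first.

```python
from PIL import Image

def make_non_aligned(path, out_path):
    """Copy a quarter-width by quarter-height block offset by (3, 3) and paste it at the origin."""
    img = Image.open(path)
    w, h = img.size
    box = (3, 3, 3 + w // 4, 3 + h // 4)   # source rectangle with top-left corner at (3, 3)
    patch = img.crop(box)
    img.paste(patch, (0, 0))               # destination rectangle with top-left corner at (0, 0)
    img.save(out_path)
```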

For the triple compressed images, we took the first 300 images from each of the double compressed modifications, except for the random text, and compressed them again with quality factors from the set {50, 60, 70, 80, 90, 100}.

The data set for the single and double compressed images involves randomly selecting two different images from the RAW data set and compressing one of them at a random quality factor from the set {50, 60, 70, 80, 90, 100}. A patch of the second (still RAW) image, sized to cover a randomly chosen percentage of the area, was then pasted into the compressed image at either a fixed or a randomly chosen location, corresponding to the two copy and paste variants in Table A.3. After placing part of the RAW image into the compressed image, the result was compressed and saved using a random quality factor from the same set used for the first compression. The quality factors used, the percentage, and the two source images were recorded in the image file name.

The total number of each type of modification and compression performed is listed in Table A.3. The training and testing split was a 70/30 ratio.

A.3 Data Set - Block Grain Analysis

For this algorithm, we used only the 1,800 text-modified images and the 1,800 non-aligned images described in Section A.2.

For evaluating the effect of the quality factor on performance, we used all of the double compressed images mentioned in Table A.3, for a total of 9,000 images. Additionally, we included 9,000 unaltered double compressed images. Table A.4 shows the breakdown of these images for the different quality factor combinations.


                                   Number of Images
Double Compressed
    Text - Fixed                   1,800
    Text - Random                  1,800
    Copy and Move Rectangle        1,800
    Copy and Move Circle           1,800
    Non-Aligned                    1,800
Single/Double Compressed
    Copy and Paste Not Random      1,800
    Copy and Paste Random          1,800
Triple Compressed
    Text - Fixed                   1,800
    Copy and Move Rectangle        1,800
    Copy and Move Circle           1,800
    Non-Aligned                    1,800
Total                              19,800

Table A.3: Number of Fraudulent Images

                                 Quality Factor 2
Quality Factor 1       50     60     70     80     90    100
              50      242    257    251    244    264    246
              60      245    244    267    241    248    259
              70      245    254    246    251    246    261
              80      239    254    251    251    255    254
              90      252    252    245    252    249    252
             100      256    237    253    243    256    260

Table A.4: Number of Images for Different Quality Factor Combinations

The training and testing split was kept constant at 70/30.

For extracting the features, we used a total of 2,002 images, a subset of the images mentioned in Table A.3: 1,200 images were single compressed unaltered photos, 401 images came from double compressed copy and move, 401 images came from copy and paste random, and 400 images came from text - random. We trained using 70 percent of each type and tested using the other 30 percent.

A.4 Data Set - CFA

We used a total of 2,002 images, a subset of the images mentioned in Table A.3: 1,200 images were single compressed unaltered photos, 401 images came from double compressed copy and move, 401 images came from copy and paste random, and 400 images came from text - random. We trained using 70 percent of each type and tested using the other 30 percent.


A.5 Data Set - Combo

We used a total of 1,000 images. The 400 authentic images were split evenly among our own phones (we all had different phones and thus different cameras). We then created 200 double compressed copy and move images, 200 double compressed random text images, and 200 copy and paste random images from our own photos together with random RAW images from [5]. Each of these types of forgeries was created by the processes described in Appendix A.2. For training, we used 300 authentic images and 100 from each set of fraudulent images; for testing, we used the remaining 100 of each type.


Appendix B

Abbreviations

A-DJPG. Aligned Double JPEG. Double JPEG compression in which the second compression uses a block grid aligned with the first one.

CFA. Color Filter Array. A mosaic of small color-filtering micro-lenses placed over the pixel sensor array to capture color images.

CV. Cross Validation. A technique used to assess how well a model generalizes.

DCSI. Digital Camera Signal Intensity. Sensor of the photons in the camera.

DCT. Discrete Cosine Transformation. Operation used in JPEG compression.

EM. Expectation Maximization. Algorithm for approximating statistical parameters.

EXIF. Exchangeable Image File Format. The standard for attaching meta-data tags to images, typically used in .JPG and .TIF image files.

FAR. False Accept Rate. The number of fraudulent photographs accepted over the total number of fraudulent photographs.

FRR. False Reject Rate. The number of authentic photographs rejected over the total number of authentic photographs.

JPG/JPEG. Joint Photographic Experts Group. A method for compressing a photograph and the one most commonly used by digital cameras.

kNN. k-Nearest Neighbors. A type of instance-based learning algorithm.

ML. Machine Learning. A way for the computer to learn rather than being explicitly programmed.

NA-DJPG. Non-Aligned Double JPEG. Double JPEG compression in which the second compression uses a block grid misaligned with the first one.

ROC. Receiver Operating Characteristic. Graphical representation of a binary classifier system.

SVM. Support Vector Machine. Supervised learning models used for classification and regression analysis.


Selected Bibliography Including Cited Works

[1] N. S. Altman, An introduction to kernel and nearest-neighbor nonparametric regression, The American Statistician, 46 (1992), pp. 175–185.

[2] T. Bianchi and A. Piva, Image forgery localization via block-grained analysis of JPEG artifacts, IEEE Transactions on Information Forensics and Security, 7 (2012), pp. 1003–1017.

[3] L. Breiman and A. Cutler, Random forests.

[4] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, (1995).

[5] D.-T. Dang-Nguyen, C. Pasquini, V. Conotter, and G. Boato, RAISE: a raw images dataset for digital image forensics.

[6] H. Farid, Digital image forensics.

[7] H. Farid, Image forgery detection – a survey, 2009.

[8] D. A. Freedman, Statistical models: Theory and practice, Cambridge University Press, (2009), p. 128.

[9] D. Fu, Y. Q. Shi, and W. Su, A generalized Benford's law for JPEG coefficients and its applications in image forensics, in SPIE Conference on Security, Steganography, and Watermarking of Multimedia Contents, E. J. Delp and P. W. Wong, eds., vol. 6505, 2007.

[10] D. C. Garcia and R. L. de Queiroz, Face-spoofing 2D-detection based on moire-pattern analysis, IEEE Transactions on Information Forensics and Security, 10 (2015), pp. 778–786.

[11] T. Gloe, The 'Dresden Image Database'.

[12] P. Harvey, EXIF tags, May 2016.

[13] J. Fan, H. Cao, and A. C. Kot, Estimating EXIF parameters based on noise features for image manipulation detection, IEEE Transactions on Information Forensics and Security, 8 (2013), pp. 608–618.

[14] J. Fan, A. C. Kot, H. Cao, and F. Sattar, Modeling the EXIF-image correlation for image manipulation detection, 18th IEEE International Conference on Image Processing, (2011), pp. 1945–1948.

[15] B. Li, Y. Q. Shi, and J. Huang, Detecting doubly compressed JPEG images by using mode based first digit features, in Multimedia Signal Processing, 2008 IEEE 10th Workshop on, Oct 2008, pp. 730–735.

[16] A. Piva, An overview on image forensics, ISRN Signal Processing, 2013 (2013).

[17] A. C. Popescu and H. Farid, Statistical Tools for Digital Forensics, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 128–147.

[18] A. J. Smola and B. Schölkopf, A tutorial on support vector regression, 2004.

[19] W. Wang, J. Dong, and T. Tan, Exploring DCT coefficient quantization effect for image tampering localization, in 2011 IEEE International Workshop on Information Forensics and Security, Nov 2011, pp. 1–6.

[20] J. Watling, China's internet giants lead in online finance, The Financialist, (2014).
