
Comparative Analysis of Machine Learning Algorithms through Credit Card Fraud Detection

Rishi Banerjee Gabriela [email protected] [email protected]

Steven Chen Mehal Kashyap Sonia [email protected] [email protected] sonia [email protected]

*Jacob [email protected]

New Jersey’s Governor’s School of Engineering and Technology
July 27, 2018

*Corresponding Author

Abstract—With the increase of e-commerce and online transactions throughout the twenty-first century, credit card fraud is a serious and growing problem. Such malicious practices can affect millions of people across the world through identity theft and loss of money. Data science has emerged as a means of identifying fraudulent behavior. Contemporary methods rely on applying data mining techniques to skewed datasets with confidential variables. This paper examines numerous classification models trained on a public dataset to analyze the correlation of certain factors with fraudulence. It also proposes better metrics for determining the false negative rate and measures the effectiveness of random sampling in diminishing the imbalance of the dataset. Finally, this paper explains the best algorithms to utilize in datasets with high class imbalances. It was determined that the Support Vector Machine algorithm had the highest performance rate for detecting credit card fraud under realistic conditions.

I. INTRODUCTION

Credit-card fraud is a general term for the unauthorized use of funds in a transaction, typically by means of a credit or debit card [1]. Incidents of fraud have increased significantly in recent years with the rising popularity of online shopping and e-commerce. Credit-card fraud can be classified into two different types: card-not-present fraud and card-present fraud. Card-not-present fraud takes place when a customer’s card details, including card number, expiration date, and card-verification-code (CVC), are compromised and then used without physically presenting a credit card to a vendor, such as in online transactions. Card-present fraud occurs when credit card information is stolen directly from a physical credit card [2]. Since 2015, credit card companies have issued chip-payment (EMV) cards to combat card-present fraud. Although this measure has been effective at reducing point-of-sale fraud by 28% within the last three years, card-not-present fraud has risen by 106%, increasing the need for online security to prevent data breaches. Although less than 0.1% of all credit-card transactions are fraudulent, analysts predict that credit-card fraud losses incurred by banks and credit-card companies can surpass $12 billion in the United States in 2020. Evidently, there is a dire need for robust detection of card-present and card-not-present fraudulent transactions to minimize monetary losses.

Currently, credit-card companies attempt to predict the legitimacy of a purchase by analyzing anomalies in various fields such as purchase location, transaction amount, and user purchase history. However, with the recent increases in cases of credit card fraud, it is crucial for credit card companies to optimize their algorithmic solutions [3].

This paper compares various deep learning and regression algorithmic models to explore which algorithm and combination of factors provides the most accurate method of classifying a credit-card transaction as fraudulent or non-fraudulent (normal).

II. BACKGROUND

A. Data Mining and Data Science

Data Mining is the process of determining useful patterns and trends from large sets of data [4], and it combines various fields of study such as machine learning, information science, and statistics. It requires skills in analysis and data manipulation [5].

Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. A classification task begins with a data set in which the class assignments are known; the other attributes of each case serve as predictors of the target. The simplest type of classification problem is binary classification. In binary classification, the target attribute has only two possible values, such as high risk or low risk for fraud. Thus, the most suitable algorithms for detecting credit card fraud are binary classifiers [6].

B. Machine Learning

Machine Learning is a type of Artificial Intelligence in which computers are trained to recognize patterns within large data sets and improve upon those patterns automatically without the need for human intervention. The training process involves starting out with a basic machine-learning algorithm that processes training data to analyze the relationship of various factors with a target value. The target value is explicitly provided to the machine-learning algorithm in the training stage. Once trained, the model can then be used to predict unknown target values for other instances of the data.

Machine learning can be classified as supervised or unsupervised depending on whether the training data provided is labeled. Supervised learning focuses on finding a relationship between an input value and an output value to predict further output values when more input is provided. A supervised learning problem can further be grouped into either classification or regression [7]. Classification problems categorize the output (such as fraud vs. not fraud) whereas regression problems provide the output as a specific value (e.g., a dollar amount). Machine learning algorithms that are not given a target output, but rather look for structure within the input data itself, are referred to as unsupervised because the training data is neither labeled nor classified [8]. This project implements supervised machine learning algorithms for classification of a credit-card transaction as either fraudulent or not fraudulent [9].

C. Classification Models

1) K-Nearest Neighbors (KNN): The K-Nearest Neighbors algorithm is an instance-based classification algorithm which predicts a data point’s attributes based on its relative position to other data points.

To discover the unknown attribute, or factor, of a testing data point, its Euclidean distance, as seen in Equation 1, in reference to every data point in the training set must be found. The data point in the training set which has the shortest Euclidean distance to the testing point is assumed to contain the same unknown attribute as the testing point. For example, in this paper, we use “hour1” and “field3” to calculate Euclidean distance. Then, fraudulence can be determined using the training point which has the shortest Euclidean distance to the testing data point. The equation for Euclidean distance is seen in Equation 1, where x, y, and n are known numerical and binary attributes of the target set and the training set.

$$E_d = \sqrt{(\Delta x)^2 + (\Delta y)^2 + \dots + (\Delta n)^2} \tag{1}$$

However, when the Euclidean distance is calculated, attributes with larger numerical ranges can have a greater impact. In order to reduce the impact of these large numerical attributes, the data can be normalized by subtracting the mean of each attribute and then dividing by its standard deviation, thus bringing the mean to 0 and the standard deviation to 1. Normalizing ensures that all attributes bear equal weight when calculating distance, so that the calculated distance is not biased [10].
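The effect of normalization on a nearest-neighbor prediction can be illustrated with a minimal sketch. The toy (amount, hour) values, the labels, and the helper names `fit_scaler`, `transform`, and `nearest_label` are hypothetical and are not taken from the dataset or from the code used in this paper; the sketch only assumes the single-nearest-neighbor rule and the mean/standard-deviation scaling described above.

```python
import math
from statistics import mean, pstdev

def fit_scaler(points):
    """Return a (mean, standard deviation) pair per attribute, from the training points."""
    columns = list(zip(*points))
    return [(mean(c), pstdev(c)) for c in columns]

def transform(point, stats):
    """Subtract the mean, then divide by the standard deviation, attribute by attribute."""
    return tuple((v - mu) / sigma for v, (mu, sigma) in zip(point, stats))

def euclidean(p, q):
    """Euclidean distance between two points with the same number of attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_label(train_points, train_labels, query):
    """Label of the single training point closest to the query."""
    distances = [euclidean(p, query) for p in train_points]
    return train_labels[distances.index(min(distances))]

# Hypothetical toy data: (amount, hour); label 1 = fraudulent, 0 = normal.
train = [(10.0, 2.0), (20.0, 3.0), (500.0, 14.0), (480.0, 15.0)]
labels = [1, 1, 0, 0]
query = (300.0, 2.5)

# Without scaling, the large "amount" values dominate the distance.
raw_prediction = nearest_label(train, labels, query)

# With scaling, "hour" carries comparable weight.
stats = fit_scaler(train)
scaled_train = [transform(p, stats) for p in train]
scaled_prediction = nearest_label(scaled_train, labels, transform(query, stats))

print(raw_prediction, scaled_prediction)
```

In this toy case the unscaled prediction follows the amount alone, while the scaled prediction also reflects the early-morning hour of the query.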

2) Logistic Regression: The logistic regression algorithm combines a linear model with the sigmoid function to perform binary classification based on different factors within the data set. Displayed below is the sigmoid function:

$$y' = \frac{1}{1 + e^{-z}} \tag{2}$$

The sigmoid function is used to find the probability of a binary classification. In this equation, y′ is the output probability, and z is the log-odds of the example; z is defined with the equation

$$z = b + w_1 x_1 + w_2 x_2 + \dots + w_N x_N \tag{3}$$

in which b is the intercept (bias) of the linear model, each w represents a learned weight, and each x represents a feature value. The probability provided by the sigmoid function predicts the likelihood of a certain outcome [11].
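Equations 2 and 3 can be combined into a short prediction routine. The sketch below is illustrative only: the weights, intercept, feature values, and the 0.5 decision threshold are hypothetical assumptions and are not the parameters learned in this paper.

```python
import math

def sigmoid(z):
    """Equation 2: map log-odds z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, intercept):
    """Equation 3 followed by Equation 2: z = b + w1*x1 + ... + wN*xN, then sigmoid(z)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def classify(probability, threshold=0.5):
    """Convert a probability into a binary class (assumed threshold of 0.5)."""
    return 1 if probability >= threshold else 0

# Hypothetical parameters for a model with two features.
weights = [0.8, -0.5]
intercept = -1.0
features = [2.0, 1.0]

probability = predict_probability(features, weights, intercept)
print(round(probability, 4), classify(probability))
```

Here z = -1.0 + 0.8(2.0) - 0.5(1.0) = 0.1, so the predicted probability is slightly above one half and the example is assigned to the positive class.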

Figure 1. General Form of Logistic (Sigmoid) Function [12]

3) Random Forest Classifier: The Random Forest algorithm is a supervised classification algorithm which randomly generates a collection of decision trees. If a training dataset with targets and features is inputted into a decision tree, it will formulate a set of rules which are used to generate predictions. The difference between the Random Forest algorithm and the decision tree algorithm is that in Random Forest, the process of finding the root node and splitting the feature nodes runs randomly, and, as seen in Figure 2, the Random Forest algorithm is composed of an ensemble of numerous decision trees.

Figure 2. Architecture of Random Forest Ensemble [13]

The Random Forest model has many advantageous features. For example, the classifier is resistant to overfitting, and thus does not match closely with random, irrelevant fluctuations in the dataset: the more decision trees the algorithm generates, the less likely the classifier is to overfit. Other advantages of the classifier include its ability to handle missing values and to be modeled for categorical or quantitative values. The advantage of the classifier with the greatest relevance to this project is its ability to quantify the importance of the attributes it uses, since the splits chosen by the decision trees in a random forest are naturally ranked by how well they improve the purity of the node. This algorithm is applicable to credit card fraud detection since it is able to use multiple attributes at once to predict values and measure the importance of each attribute [12].
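The idea of ranking attributes by how much they improve node purity can be sketched with the Gini impurity of a single split. This is a simplified illustration under stated assumptions: the toy values, the thresholds, and the function names are hypothetical, and a real random forest aggregates such decreases over many splits and many trees rather than evaluating one split per attribute.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 = normal, 1 = fraudulent)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def impurity_decrease(values, labels, threshold):
    """Parent impurity minus the size-weighted impurity of the two child nodes."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# Hypothetical toy node with 3 fraudulent and 5 normal transactions.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
hour = [2, 3, 1, 12, 14, 9, 16, 11]        # separates the classes cleanly at 5
amount = [20, 95, 40, 30, 80, 55, 60, 25]  # separates them poorly at 50

decrease_hour = impurity_decrease(hour, labels, threshold=5)
decrease_amount = impurity_decrease(amount, labels, threshold=50)
print(decrease_hour, decrease_amount)
```

An attribute whose splits remove more impurity, as the hypothetical hour attribute does here, would receive a higher importance.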

4) Support Vector Machine: Support Vector Machines are examples of supervised Machine Learning algorithms that can be applied to classification and regression problems. In the case of a classification problem, a support vector machine will determine the best-fitting method for categorizing the data [13].

Figure 3 depicts the overall goal of a support vector machine tasked with classifying a credit card transaction as either fraudulent or non-fraudulent. After plotting the training data in an n-dimensional space, with n being the number of factors being analyzed, the support vector machine will generate equations for multiple hyperplanes that can linearly separate the data points by category. A hyperplane exists as a line, a plane, or a higher-dimensional hyperplane if two, three, or more than three factors are analyzed, respectively. A sample of generated hyperplanes is represented by letters A and B in Figure 3. Data points that fall to the right of the hyperplanes are classified as non-fraudulent while others fall under the fraudulent category. Both hyperplanes in the figure correctly separate the given data points by fraudulence, but the most effective hyperplane will achieve a similar level of accuracy when unknown data points are in need of classification. Thus the optimal hyperplane is chosen based on the distance of the hyperplane to the nearest point on either side. This distance is referred to as the margin, and the points that determine the margin are known as support vectors. In general, the hyperplane with the greatest margin is chosen to make predictions in the support vector machine algorithm [14].

Figure 3. Support Vector Machine Separation [14]
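The margin criterion can be illustrated by comparing two candidate hyperplanes directly. The sketch below is not how a support vector machine is trained (a real solver searches over all hyperplanes); it only measures the margin of two hand-picked, hypothetical separating lines, labeled A and B by analogy with the figure, on hypothetical toy points.

```python
import math

def signed_distance(point, w, b):
    """Signed distance from a point to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm

def margin(points, labels, w, b):
    """Smallest distance from any point to the hyperplane.

    Returns None if the hyperplane puts some point on the wrong side.
    Labels are +1 (normal, positive side) and -1 (fraudulent, negative side).
    """
    distances = []
    for p, y in zip(points, labels):
        d = signed_distance(p, w, b)
        if d * y <= 0:
            return None
        distances.append(abs(d))
    return min(distances)

# Hypothetical toy data in two dimensions.
points = [(1.0, 1.0), (2.0, 1.0), (5.0, 4.0), (6.0, 5.0)]
labels = [-1, -1, 1, 1]

margin_a = margin(points, labels, w=(1.0, 0.0), b=-3.0)  # line x1 = 3
margin_b = margin(points, labels, w=(1.0, 1.0), b=-6.0)  # line x1 + x2 = 6
best = "A" if margin_a > margin_b else "B"
print(margin_a, margin_b, best)
```

Both lines separate the toy points, but the second leaves more room on either side and would therefore be preferred under the maximum-margin rule.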

5) Naive Bayes: The Naive Bayes Classifier is a supervised machine learning algorithm based upon Bayes’ Theorem, which states:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{4}$$

Bayes’ Theorem provides a method for calculating the posterior probability (P(A|B)), the probability of an outcome (A) provided certain conditions (B). The theorem calculates the posterior probability by relating it to the prior probability (P(A)), the probability of the outcome without any knowledge of influential conditions, through a likelihood ratio (P(B|A)/P(B)) [15]. The Naive Bayes classifier is based on the assumption that every factor independently affects the outcome, and is thus naive. The Naive Bayes Classifier allows for a simple yet powerful method of classifying fraudulent credit card transactions [16].
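A small numeric sketch may help make the theorem and the independence assumption concrete. The 2% prior mirrors the roughly 98:2 normal-to-fraudulent ratio of the dataset used in this paper; the likelihood values and the function name `naive_posterior` are hypothetical and chosen only for illustration.

```python
from math import prod

def naive_posterior(prior, likelihoods_given_a, likelihoods_given_not_a):
    """P(A | evidence), treating each piece of evidence as independent given the class.

    The denominator P(evidence) is expanded over the two classes A and not-A.
    """
    joint_a = prior * prod(likelihoods_given_a)
    joint_not_a = (1.0 - prior) * prod(likelihoods_given_not_a)
    return joint_a / (joint_a + joint_not_a)

prior_fraud = 0.02  # about 2% of transactions are fraudulent

# One hypothetical condition: P(B | fraud) = 0.6, P(B | normal) = 0.1.
posterior_one = naive_posterior(prior_fraud, [0.6], [0.1])

# Two hypothetical conditions, multiplied together under the naive assumption.
posterior_two = naive_posterior(prior_fraud, [0.6, 0.7], [0.1, 0.2])

print(round(posterior_one, 4), round(posterior_two, 4))
```

With a single condition the posterior rises from 0.02 to roughly 0.11; adding a second, independent condition raises it to 0.3, which shows how the classifier accumulates evidence across factors.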

6) Multilayer Perceptron: A Multilayer Perceptron (MLP) is the simplest form of a deep, artificial neural network, consisting of three or more layers of nonlinearly-activating nodes. It is composed of an input layer to receive the signal, an output layer that makes a decision or prediction about the input, and, in between those two, an arbitrary number of hidden layers that are the true computational engine of the MLP. MLPs with one hidden layer are capable of approximating any continuous function. Since MLPs are fully connected, each node in one layer connects with a certain weight to every node in the following layer [17].

Multilayer perceptrons are often applied to supervised learning problems: they train on a set of input-output pairs and learn to model the correlation (or dependencies) between those inputs and outputs. Training involves adjusting the parameters, or the weights and biases, of the model in order to minimize error. Backpropagation is used to make those weight and bias adjustments relative to the error. The network trains in batches, uses information from all records in the training dataset, and updates the synaptic weights only after passing all training data records [19] [20].

Figure 4. General Architecture of Multilayer Perceptron Network [18]

D. Python Programming

Python, although a general-purpose programming language, is often used in data analysis [19]. Libraries such as scikit-learn, pandas, and matplotlib aid in analysis and visualization. More specifically, Numpy allows for storage of information [21], pandas formats data from Excel and CSV files into analyzable DataFrames [22], scikit-learn has numerous built-in machine learning algorithms [23], matplotlib graphs data in various plot types [24], and Keras, a high-level neural networks API which is run on Python, focuses on fast experimentation [25]. These formatting, analysis, and visualization techniques help analyze credit-card transactions and detect fraud.

E. Libraries

Several libraries were utilized in the creation of these algorithms.

1) Numpy: Numpy is integral to Python programming. Its most useful feature is a multidimensional array that can store large amounts of data. The Numpy library includes several functions for linear algebra, random number generation, Fourier transforms, and sorting and searching [22].

2) Pandas: Similar to Numpy, Pandas (Python Data Analysis Library) provides functions for data organization. Pandas, however, allows Excel and CSV files to be read and formatted into table-like data structures called DataFrames, enhancing code legibility and data-processing speed [23].

3) Scikit-Learn: Scikit-learn is one of Python’s most notable libraries for machine learning. Unlike Numpy and Pandas, which are used for data manipulation, Scikit-learn focuses on data modeling. Some of its most popular features are clustering, cross-validation, supervised models, and feature selection [24].

4) Matplotlib: Matplotlib provides tools for 2D data visualization. A variety of graphs including bar graphs, scatter plots, and line graphs can be made from provided library functions. In this project, a histogram is used to assess the frequency of fraudulent data within the dataset. Furthermore, predictions made by the algorithm are charted using the library and categorized as true positive, true negative, false positive, or false negative in a confusion matrix [25].

5) Keras: Keras is a Python library that is used for deep learning algorithms and is capable of utilizing TensorFlow as a backend. Its popularity stems from its modularity, minimalism, and extensibility [26].

III. PROCEDURE: CREATING AND TRAINING CLASSIFICATION MODELS

The process for finding usable data in this paper follows the typical Data Analysis Pipeline, as seen in Figure 5.

Figure 5. Data Analysis Pipeline [27]

A. Dataset

1) Data Acquisition: The dataset utilized was part of a 2009 competition in coordination with the University of California, San Diego and the Fair Isaac Corporation. It contains 94,682 data points with sixteen known fields, including amount spent, hour of the transaction, and location of the transaction based upon zip code, and thirteen other unknown fields, titled with encrypted names such as field1, flag1, and indicator1, that are concealed for privacy. Although these factors are concealed, they can be analyzed for trends which are indicative of fraudulent transactions.

2) Hidden Fields: Regulation P is a federal privacy law which prevents financial institutions from releasing information on credit card transactions to a third party, except in a court case. Furthermore, agreements such as the Payment Card Industry Data Security Standard prohibit organizations which handle credit card transactions from handling any private financial information. These preventive measures are obstacles which limit the extent to which this project can analyze credit card fraud [28] [29]. The FICO dataset contains encrypted column headings, and so the exact factors being analyzed to detect fraud are unknown. However, trends within the provided data can still be discovered through machine learning to show that machine learning may be a powerful tool for fraud detection.

3) Known Fields: There are several fields within the dataset that are not encrypted by the data provider.
• amount: The monetary cost of the transaction.
• hour1: The hour at which the transaction occurred, based on a 24-hour clock.
• zip1: The state and zip code from which the transaction occurred.
• domain1: Identification of the consumer based on email domain.
• hour2: The hour at which the transaction was completed, based on a 24-hour clock.
• targets: Identifies a transaction as fraudulent or not fraudulent based on a binary classification.

B. Preprocessing

Due to the high non-fraudulent to fraudulent ratio within the dataset, as seen in Figure 6, predictions made from the initial training set, which had a normal to fraudulent ratio of 49:1, were greatly skewed. Many of the utilized algorithms classified the test data with ninety-eight percent accuracy by predicting every transaction as normal, producing only true negative and false negative cases.

Figure 6. Count of Fraudulent to Non-Fraudulent Datapoints [30]

In order to resolve this issue, the data was processed with a lower non-fraudulent to fraudulent ratio. This was done by first dividing the initial dataset into two separate datasets based on fraudulence. The 2094 data values from the fraudulent dataset were split evenly between a training set and a testing set. Through random subsampling, normal data could be randomly split into the training and testing sets. This action was an example of random undersampling, or the removal of negatives in order to make the data positive for fraud more significant to the classifier [31]. This method of data processing provided greater flexibility in manipulating the ratio of normal to fraudulent data within each dataset simply by increasing or decreasing the amount of normal data that was added to each dataset. Splitting the data into both a training and a testing set also guarded against overfitting, a condition where an algorithm can only function properly with the particular data set on which it was trained. Ultimately, eight unique pairs of testing and training datasets were formed with differing ratios. Although this processing method greatly diminished the number of training cases from 63122 cases, the predictions made were less skewed, as the algorithms were able to make true positive predictions based on the increased percentage of fraudulent cases within the dataset.
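The undersampling step can be sketched as follows. This is a simplified illustration rather than the preprocessing code used in this paper: it builds a single dataset instead of a training and testing pair, the rows are placeholder integers, and the function name `undersample`, its parameters, and the fixed seed are hypothetical.

```python
import random

def undersample(normal_rows, fraud_rows, normal_to_fraud_ratio, seed=0):
    """Keep every fraudulent row and a random subset of normal rows.

    The number of normal rows kept is len(fraud_rows) * normal_to_fraud_ratio,
    capped at the number of normal rows available.
    """
    rng = random.Random(seed)
    n_normal = min(len(normal_rows), round(len(fraud_rows) * normal_to_fraud_ratio))
    sampled_normal = rng.sample(normal_rows, n_normal)
    dataset = [(row, 0) for row in sampled_normal] + [(row, 1) for row in fraud_rows]
    rng.shuffle(dataset)  # avoid having all fraudulent rows at the end
    return dataset

# Placeholder rows at the original 49:1 ratio: 980 normal, 20 fraudulent.
normal_rows = list(range(980))
fraud_rows = list(range(1000, 1020))

# Target a 75:25 split, i.e. three normal rows per fraudulent row.
dataset = undersample(normal_rows, fraud_rows, normal_to_fraud_ratio=3)
fraud_count = sum(label for _, label in dataset)
print(len(dataset), fraud_count)
```

Changing `normal_to_fraud_ratio` is all that is needed to produce datasets at the other ratios, which reflects the flexibility described above.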

C. Evaluation Metrics

Because of the large imbalance between non-fraudulent and fraudulent data points, accuracy, or the percent of correctly predicted data points out of the total dataset, was not a viable metric on which to base results. As a result, the paper will use the metrics of precision, recall, and F-1 score. Precision is the ratio of the number of true positives to the number of all data points predicted as positive. It can be seen as a measure of the quality of the data returned as positive. The equation for precision can be seen in Equation 5.

$$\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \tag{5}$$

Recall is the ratio of correct positive predictions to all actual positive entries. The recall assesses the completeness of the program, checking how many of the actual positives were detected as positive. The equation for recall can be seen in Equation 6.

$$\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \tag{6}$$

The Fβ score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0. The β parameter determines the weight of recall relative to precision in the combined score: β < 1 lends more weight to precision, while β > 1 favors recall. Because both precision and recall are equally important to the accuracy of the model, the F-1 score, or Fβ score when β is set to 1, was used. The equation of the Fβ score can be represented by Equation 7 [32].

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \tag{7}$$
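The three metrics, and the reason accuracy is misleading on this data, can be shown with a short worked example. The confusion counts below are hypothetical and sized to a 98:2 split of 1000 transactions; returning 0 when a denominator is zero is an assumed convention, not something specified in this paper.

```python
def precision(tp, fp):
    """Equation 5; returns 0.0 when nothing was predicted positive (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Equation 6; returns 0.0 when there are no actual positives (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_beta(p, r, beta=1.0):
    """Equation 7; beta = 1 gives the F-1 score."""
    denominator = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denominator if denominator else 0.0

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Classifier 1: predicts every transaction as normal (980 normal, 20 fraudulent).
acc_all_normal = accuracy(tp=0, tn=980, fp=0, fn=20)
f1_all_normal = f_beta(precision(0, 0), recall(0, 20))

# Classifier 2: hypothetical model that catches 15 of the 20 fraudulent cases.
acc_model = accuracy(tp=15, tn=950, fp=30, fn=5)
f1_model = f_beta(precision(15, 30), recall(15, 5))

print(acc_all_normal, f1_all_normal)
print(acc_model, round(f1_model, 4))
```

The all-normal classifier has the higher accuracy (0.98 against 0.965) yet an F-1 score of 0, while the second classifier, which actually detects fraud, scores about 0.46.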

D. Model Training

Many of the machine learning algorithms were created from the scikit-learn library with regressor classes. Many of the columns that contain string attributes were dropped from the training and testing datasets, as the algorithms were only able to accept numbers as inputs. The models were trained and tested on two CSV files containing identical attributes. For all algorithms except for the Random Forest Classifier and the Multilayer Perceptron, the precision, recall, and Fβ score were calculated for each field and for all cumulative fields. The counts of true positives, true negatives, false positives, and false negatives were organized in a confusion matrix. A general format of a confusion matrix can be seen in Figure 7. To assess the effect of undersampling, numerous datasets were created with different normal transaction to fraudulent transaction ratios. The models trained on these datasets were then tested on multiple fields, both with test sets of the same ratio as the training set and with a test set at the 98:2 ratio of the initial dataset.


Figure 7. General Form of a Confusion Matrix [33]
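The four cells of a confusion matrix can be tallied directly from paired lists of actual and predicted labels. The sketch below uses hypothetical labels and a hypothetical helper name, `confusion_counts`; it is an illustration of the bookkeeping, not the plotting code used for the figures in this paper.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = fraudulent)."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "tp": counts[(1, 1)],  # fraudulent, predicted fraudulent
        "tn": counts[(0, 0)],  # normal, predicted normal
        "fp": counts[(0, 1)],  # normal, predicted fraudulent
        "fn": counts[(1, 0)],  # fraudulent, predicted normal
    }

# Hypothetical actual and predicted labels for eight transactions.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

cells = confusion_counts(y_true, y_pred)
print(cells)
```

These four counts are the only inputs needed for the precision, recall, and Fβ calculations defined in Equations 5 through 7.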

1) K Nearest Neighbor: In order to implement KNN, the KNeighborsRegressor class was used. KNN was trained with numerical attributes found in the data including amount, hour1, field1, field2, field3, field4, field5, indicator1, indicator2, flag2, flag3, flag4, and flag5. To test the overall effectiveness of the algorithm, the Euclidean distance was calculated using all of the factors. Then, the Euclidean distance for each factor alone was calculated.

2) Logistic Regression: Since logistic regression performs binary classification, it works well with this specific data set, which has targets as the binary dependent variable. When given a specific factor, or column, to analyze, the output probability is then converted to a 0 or 1 for negative or positive, respectively. This output can be easily checked with the targets in the test dataset since both are represented by 1s and 0s in the same way.

3) Naive Bayes Classifier: The Naive Bayes Classifier was run using the GaussianNB() class from the scikit-learn library. As the Naive Bayes Classifier assumes every factor functions independently, every numerical factor was tested individually. The Naive Bayes Classifier provides a binary classification of the testing set depending on probabilistic predictive trends observed in the training set.

4) Support Vector Machine: The Support Vector Machine algorithm was run using the SVM() class from the scikit-learn library. The factors were tested both independently and in various combinations. For each testing cycle, the Support Vector Machine plotted the various data points to calculate the optimal dividing hyperplane before reaching a conclusion as to whether a transaction was fraudulent or not.

5) Random Forest Classifier: The Random Forest algorithm was run using the RandomForestRegressor() class of the Python scikit-learn library. The algorithm was run with a combination of all fields, and a Random Forest algorithm tree was generated, as seen in Figure 8. The Random Forest algorithm creates a metric of importance for the attributes in the dataset.

Figure 8. Snippet of Generated Random Forest Tree

6) Multilayer Perceptron Model: Multilayer perceptron neural networks were implemented using Keras 2.2.0 with TensorFlow as a backend for matrix and tensor-based computations. The original neural network consisted of a Sequential model with 3 Dense layers, a Dropout layer, and an Activation layer.

Listing 1. Snippet of Baseline MP Code

    from keras.models import Sequential
    from keras.layers import Dense, Activation, Dropout

    model = Sequential()
    model.add(Dense(10, input_dim=14, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(2, activation='softmax'))
    model.add(Activation(activation='softmax'))

The Dense layer of the neural network implements the operation output = activation(dot(input, kernel) + bias), where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer. The model also utilized Dropout layers, which randomly set a fraction of the units passed between layers to zero during training in order to prevent overfitting. Dropout rates were initialized at 0.5, meaning that each unit had a one in two chance of being randomly dropped. These networks were run for different numbers of epochs, where an epoch is one complete iteration over the dataset.
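The Dense and Dropout operations can be sketched for a single input vector in plain Python. This is an illustration of the arithmetic, not the Keras implementation: the kernel, bias, and input values are hypothetical, and the rescaling of surviving units by 1 / (1 - rate) follows the usual "inverted dropout" convention, which is assumed here.

```python
import random

def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]

def dense(inputs, kernel, bias, activation):
    """output = activation(dot(input, kernel) + bias) for one input vector.

    kernel[i][j] is the weight from input i to unit j.
    """
    pre_activation = [
        sum(x * kernel[i][j] for i, x in enumerate(inputs)) + bias[j]
        for j in range(len(bias))
    ]
    return activation(pre_activation)

def dropout(values, rate, rng):
    """Training-time dropout: zero each unit with probability `rate`,
    and rescale the surviving units by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

# Hypothetical layer with two inputs and two units.
inputs = [1.0, 2.0]
kernel = [[0.5, -1.0],
          [0.25, 0.5]]
bias = [0.0, -0.5]

hidden = dense(inputs, kernel, bias, relu)
dropped = dropout(hidden, rate=0.5, rng=random.Random(0))
print(hidden, dropped)
```

The first unit receives 1.0(0.5) + 2.0(0.25) + 0.0 = 1.0 and the second receives 1.0(-1.0) + 2.0(0.5) - 0.5 = -0.5, which the ReLU activation clips to 0. At prediction time dropout is switched off, so the layer outputs pass through unchanged.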

IV. RESULTS AND ANALYSIS

A. Finding Feature Importance

The data was first run through the Random Forest Classifier with both the training and testing sets at the 98:2 ratio. The importance metric created as a result of this is shown below in Table I. According to the feature importance metric, the most significant fields were field3, hour1, and field1.

Table I
FEATURE IMPORTANCE MEASURES FOR RANDOM FOREST

Attribute(s)    Importance
field3          0.34
hour1           0.16
field1          0.12
field4          0.1
amount          0.06
field5          0.06
flag5           0.04
field2          0.03
flag1           0.02
indicator1      0.02
flag2           0.02
flag3           0.02
flag4           0.01
indicator2      0.0

B. Bulk Implementation of Classifiers

The models were trained on the datasets created using the undersampling method, using all attributes and the combinations of attributes found important by the Random Forest selection. The F-1 score was then calculated from each result after being tested on datasets that had the same ratio of fraudulent to non-fraudulent data as the datasets on which they were trained, as seen in Figure 9 below.

Figure 9. F-1 scores of Algorithms applied to testing datasets with controlled Normal-to-Fraudulent Transaction Ratios

Figure 9 demonstrates the effectiveness, measured by F-1 score, of the algorithms as the non-fraudulent to fraudulent ratio increases. The curves of all of these algorithms reveal themselves to be u-curves. Most of these algorithms, except for Support Vector Machine, decreased in effectiveness as the ratio became higher. Support Vector Machine, however, increased in effectiveness as the ratio increased.

Next, the algorithms trained on the datasets created by undersampling were tested on a dataset of 31,560 datapoints with a normal-to-fraudulent data ratio of 98:2. The F-1 scores versus the normal-to-fraud ratio of the training data are shown in Figure 10.

Figure 10. F-1 scores of Algorithms applied to testing datasets with uncontrolled Normal-to-Fraudulent Transaction Ratios

Through this, it can be seen that for datasets with high skew, the Support Vector Machine algorithm produces the highest F-1 scores, while in balanced datasets, the Random Forest algorithm produces the highest F-1 score.

In systems where a large dataset is tested, for many algorithms there is an optimal normal-to-fraudulent ratio for the training data at which the predictions for the testing set will produce the highest F-1 score. This trend can be seen in the Random Forest, KNN, and Naive Bayes algorithms. If the models are trained on datasets with normal-to-fraud ratios close to 1:1, they tend to predict positives and negatives in roughly the same ratio as in the training set. Testing such a model on a highly imbalanced dataset therefore results in a high number of false positives, a very low precision, and a very high recall. As the imbalance of the training set increases, the precision increases as the recall decreases. As a result, the normal-to-fraudulent ratio reached its optimal value as precision and recall approached each other.

C. Random Forest Classifier

For the data tested on sets of the same ratios, the classifier experienced the smallest decrease in the F-1 score. While the precision of the Random Forest model experienced only a slight drop, the recall experienced large decreases as the normal to fraudulent ratio increased. While testing on the large dataset, the Random Forest algorithm reached the optimal normal-to-fraud ratio at around the 30:1 ratio. As seen in Figure 11, the Random Forest algorithm was indicative of the pattern observed in Figure 10: where the F-1 score reaches its peak, the precision and recall lines intersect each other at the same ratio. As a result, for the Random Forest algorithm, it is necessary to find the optimal ratio on which to train the model if random undersampling is to be used to reduce bias. The Random Forest is the most efficient algorithm for testing on biased datasets at the optimal ratio because, unlike the SVM, the Random Forest algorithm does not require much processing power and time to train efficiently.


Figure 11. F-1 score, Precision, and Recall in the Random Forest Classifier

D. K Nearest Neighbor (KNN)

As demonstrated in Figure 9, when tested with the 25:75, 40:60, 50:50, 60:40, 75:25, and 90:10 non-fraudulent to fraudulent test splits, all factors consistently showed that, with an increasing ratio, the F-1 score decreased. For example, in the 25:75 split, the F-1 score was about 80%. However, at the 90:10 split, the F-1 score dropped to about 20%.

As demonstrated in Figure 10, however, when the algorithm was trained with the 25:75, 40:60, 50:50, 60:40, 75:25, and 90:10 non-fraudulent to fraudulent splits and tested on the whole dataset, the F-1 score was higher when the ratio was higher. This is possibly because the testing set contains part of the training set. In other words, for many of the cases, the Euclidean distance calculated was zero. Thus, the testing set assumed fraudulence based on itself.

Unlike with the Random Forest algorithm, KNN’s precision and recall both decreased as the ratio increased. Thus, unlike with the Random Forest algorithm, the intersection of recall and precision does not correlate with the training dataset at which KNN is most effective.

In terms of efficiency, because this algorithm calculates the Euclidean distance between all of the points in the data set and all of the points in the testing set, it is not very efficient. However, as compared to the Support Vector Machine and the Multilayer Perceptron, the algorithm takes less time to acquire results; thus, although it is inefficient, it is less inefficient than many other algorithms.

E. Logistic Regression

As shown in Figure 9, for logistic regression, the F-1 score decreases as the normal to fraud ratio increases when the testing and training datasets have the same ratios. The graph shows a relationship similar to exponential decay. When tested with skewed data, as seen in Table II, the F-1 score increases as the normal to fraud ratio of the training set increases until it peaks at the 75-25 ratio, and then decreases. Also, this time the logistic regression performed much worse, with all F-1 scores lower than 20%. These results show that logistic regression works much better when it is trained and tested on the same data set ratio and size. If both data sets are similar in this way, the lowest possible normal to fraud ratio is ideal. However, if forced to test on a very biased data set, a training data set split with 75% normal transactions and 25% fraudulent transactions is best.

Table II
F-1 SCORES OF LOGISTIC REGRESSION FUNCTION AT DIFFERENT NORMAL-TO-FRAUD RATIOS (%)

Ratios   Hour1   Field3   Hour1 + Field3
25-75    6       4        6
40-60    7       5        8
50-50    9       6        10
60-40    14      7        14
75-25    17      7        18
90-10    0       0        12
98-2     0       0        0
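A training set with a chosen normal-to-fraud ratio, such as the 75:25 split discussed above, can be built by random undersampling. The sketch below is one minimal way to do this and is not the project's code: the function name `undersample`, the toy rows, and the choice to keep every fraud row while sampling only normal rows are all assumptions made for illustration.

```python
import random


def undersample(normal, fraud, normal_pct, fraud_pct, seed=0):
    """Keep all fraud rows and draw normal rows to reach normal_pct:fraud_pct.

    Assumes there are enough normal rows to reach the requested ratio.
    """
    n_normal = round(len(fraud) * normal_pct / fraud_pct)
    if n_normal > len(normal):
        raise ValueError("not enough normal rows for this ratio")
    rng = random.Random(seed)
    # Sample normal rows without replacement, then mix in every fraud row.
    sample = rng.sample(normal, n_normal) + list(fraud)
    rng.shuffle(sample)
    return sample


# Hypothetical toy data: each row is (row_id, label), where label 1 = fraud.
normal_rows = [(i, 0) for i in range(10_000)]
fraud_rows = [(i, 1) for i in range(10_000, 10_200)]

train = undersample(normal_rows, fraud_rows, normal_pct=75, fraud_pct=25)
n_fraud = sum(label for _, label in train)
print(len(train), n_fraud / len(train))
```

Fixing the seed keeps the sampled subset reproducible between runs, which matters when comparing algorithms on the same split.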

F. Naive Bayes Classifier

When the Naive Bayes classifier was used to predict fraud with a training and testing data set of equal bias and length, there was a clear trend in the F-1 score.

Table III
F-1 SCORES OF NAIVE BAYES CLASSIFIER AT DIFFERENT NORMAL-TO-FRAUD RATIOS (%)

Ratios   Field1   Hour1   Field3   Hour1 + Field3   Average
25-75    83.5     86.3    85.7     86.5             85.5
40-60    70       76      70       76.7             73.12
50-50    64.3     68      44       71               61.82
60-40    58.7     56      33       62               52.4
75-25    0        43      20       51               28.5
90-10    0        19      10       20               12.2

As seen in Table III, the Naive Bayes Classifier had the highest average F-1 score (85.5) when trained and tested with the dataset with the lowest normal-to-fraud ratio of 25:75. The F-1 score dropped steeply, following a roughly logistic curve, as the normal-to-fraud ratio increased, as shown in Figure 9. When training and testing with the 90:10 ratio, the average F-1 score reached its lowest value (12.2). This trend can be explained by the functionality of the Naive Bayes Classifier. As the normal-to-fraud ratio grows, the algorithm is provided with more evidence of trends pointing towards normal transactions relative to evidence of fraudulent transactions. With a greater percentage of normal transactions within the dataset, there is also greater variability in the range of values within each field that indicate a normal transaction. Thus, there is overlap between the patterns indicating fraudulent and normal transactions. Together, these two effects make the algorithm overwhelmingly predict every transaction as fraudulent, decreasing the F-1 score.
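The link between predicting every transaction as fraudulent and a falling F-1 score can be checked with a short calculation. The sketch below assumes a hypothetical classifier that labels all transactions as fraud; it is a reference baseline, not the Naive Bayes model trained in this project. For such a classifier recall is 1 and precision equals the fraud share p of the test set, so F-1 = 2p / (1 + p).

```python
def f1_always_fraud(normal_pct, fraud_pct):
    """F-1 of a hypothetical classifier that labels every transaction as fraud.

    Recall is 1 (no fraud is missed) and precision equals the fraud share.
    """
    precision = fraud_pct / (normal_pct + fraud_pct)
    recall = 1.0
    return 2 * precision * recall / (precision + recall)


ratios = [(25, 75), (40, 60), (50, 50), (60, 40), (75, 25), (90, 10)]
baseline = {f"{n}:{f}": round(100 * f1_always_fraud(n, f), 1) for n, f in ratios}
for ratio, score in baseline.items():
    print(f"{ratio}  F-1 = {score}%")
```

This baseline falls from roughly 86% at 25:75 to roughly 18% at 90:10, the same direction as the trend in Table III.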

When the Naive Bayes classifier was used to predict fraud with a training and testing data set of different bias and length, a different pattern emerged, as indicated in Figure 10. When graphed, the average F-1 scores of the fields that were trained form an inverted-U curve. There is a clear peak in the F-1 score right after the ratio 75:25. This shape can be explained by the nature of the training and testing data. When the algorithm was trained on data sets with lower ratios, it expected a testing set of a similar ratio, which decreased the precision of the classifier but increased the recall. As the normal-to-fraud ratio increased closer to the peak, precision increased and recall decreased as the ratio in the training set grew closer to that of the testing set. However, past the peak, which denotes the optimal ratio for the Naive Bayes Classifier, an effect similar to the one seen in Table III took place and decreased the efficacy of the classifier. The Naive Bayes classifier performed well for this project; however, the simplicity of the classifier prevented it from observing patterns of great complexity within the data, which ultimately reduced its efficacy in detecting credit card fraud.

G. Multilayer Perceptron

The Multilayer Perceptron model was initially created using 3 Dense layers of 10 neurons, a Dropout layer with 0.5 probability, and an activation layer. This layer configuration led to the model having a precision and recall of 0 at high normal-to-fraud ratios, and the removal of the Dropout layer led to the model producing different recalls and precisions whenever it was retrained. As a result, the Dropout probability was retested until the optimal value of 0.2 was found. As the normal-to-fraud ratio increased, the network maintained a recall of 43% while the precision dropped. At a certain normal-to-fraud ratio, the recall and precision both converged to zero; as a result, the Multilayer Perceptron model proved to be very unstable on imbalanced datasets. While testing on a dataset with a large imbalance, the Multilayer Perceptron had a 5% precision and a 65% recall, and precision slightly increased while recall sharply decreased. At a 10:1 normal-to-fraudulent ratio, the model experienced the same drop-off, where both precision and accuracy approached zero. For both trials, the model failed at any ratio when more than one attribute was tested at a time, such as hour1 and field3. The Multilayer Perceptron model would not be a viable model to use for imbalanced datasets in conjunction with random subsampling.
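To make the role of the Dropout probability concrete, the following sketch implements inverted dropout in plain Python. It is an illustration rather than the Keras layer used in this project: the function name `dropout` and the constant stand-in activations are assumptions. Each unit is zeroed with probability `rate` during training, and the surviving units are scaled by 1 / (1 - rate) so the expected activation is unchanged, which mirrors the scaling described in the Keras documentation for its Dropout layer.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate`,
    and scale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)
layer_output = [1.0] * 10_000   # stand-in for one Dense layer's output

for rate in (0.5, 0.2):
    dropped = dropout(layer_output, rate, rng)
    zeroed = sum(1 for v in dropped if v == 0.0) / len(dropped)
    mean = sum(dropped) / len(dropped)
    print(f"rate={rate}: zeroed {zeroed:.2f} of units, mean activation {mean:.2f}")
```

With only 10 neurons per layer, a rate of 0.5 silences about five units on each training step, while a rate of 0.2 silences about two; this may be one reason the lower rate trained more reliably, although that interpretation is a conjecture and not a result of this project.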

H. Support Vector Machine

Results of the Support Vector Machine algorithm tested and trained with adjusted datasets of approximately equal fraud to non-fraud ratios indicated an average F-1 Score of 79.96%, regardless of the combination of factors. Unlike most models presented in this paper, the F-1 score did not decrease as the normal-to-fraud ratio increased. It stayed constant at approximately 84%, as seen in Figure 10.

Results of the Support Vector Machine algorithm varied when it was trained with an adjusted dataset but tested with the unadjusted and highly skewed data set. F-1 scores averaged 94.07% for the SVM with all combinations of factors, which is significantly higher than the scores for the five other algorithms. The F-1 scores of the tests under different ratio combinations but the same factor combination were averaged to determine the ideal combination of factors for analysis. The average F-1 scores for the Support Vector Machine analyzing the "hour1", "field3", and "hour1 and field3" factor combinations of transactions were calculated to be 94.51%, 94.4%, and 93.27%, respectively.

In contrast to other algorithms, the SVM algorithm took much more time and computing power to complete the fitting of the model. Compared to the Random Forest Classifier, the model takes much more time to execute. As a result, when processing real-time data such as credit card transactions on datasets that would be much larger than the UCSD-FICO set, the SVM model would have to be made more efficient in order to process and classify fraud data in a reasonable amount of time after the transaction.

V. CONCLUSION

Overall, the main project goal involved determining the optimal algorithm for analysis as well as the best-performing combination of factors to detect credit-card fraud. Based on the results of Figure 9, it can be concluded that the best algorithm for analysis of datasets with a close to 1:1 ratio of fraudulent to non-fraudulent transactions is the Random Forest Classifier, assuming the fraud-to-non-fraud distribution of the testing and training sets is the same. However, the presence of a balanced training dataset, as well as testing and training datasets of the same distribution, is an unrealistic expectation. Therefore, the optimal machine learning algorithm that a credit card company should use depends on the F-1 Scores of algorithms tested with highly skewed datasets. According to Figure 10, the Support Vector Machine was the most successful in the detection of credit card fraud when tested under more realistic conditions. The F-1 scores of all algorithms under multiple combinations of factors were analyzed as described earlier in this paper, and it was determined that the ideal factor for analysis is the hour1 field. Based on this research, a credit-card company should consider implementing a Support Vector Machine algorithm that analyzes the purchase time in order to most accurately detect whether a credit-card transaction is fraudulent or not.

A. Future Work

This research on detecting credit card fraud has great potential for future implications. If a dataset with unencrypted fields were released to the public, the true factors that can be traced for credit card fraud detection could be known. Credit card companies could then be informed about the most important factors to analyze when predicting credit card fraud and improve the efficiency of their notification systems. Furthermore, the results of this project were limited by the small sample size of fraudulent cases provided by the data set. By using a larger dataset with a greater number of fraudulent cases, the algorithms can be trained to make predictions of greater precision. In order to pursue these goals, more computing power may be required. It may be important to consider using a Graphical Processing Unit like the Nvidia Jetson II to improve the productivity of training and testing each algorithm with a larger, more complex data set. Other methods for bias prevention, such as other resampling techniques, cost-sensitive learning methods, and ensemble learning methods, could also be tested on future datasets to discover the best method of dealing with skewed data sets. Ultimately, the results of this research project can provide insight into the best algorithm to be used in other cases of data analysis on skewed data sets, such as in natural disaster prediction.

VI. ACKNOWLEDGMENTS

The authors of this paper gratefully acknowledge the following: Project mentor Jacob Battipaglia for his valuable knowledge of machine learning and data science; Residential Teaching Assistant liaison Siddhi Shah for her invaluable assistance; research coordinator Brian Lai for his assistance in conducting proper research; Nicholas Ferraro, head counselor, for his continual guidance and support; Dean Ilene Rosen, the Director of GSET, and Dean Jean Patrick Antoine, the Associate Director of GSET, for their management and guidance; Rutgers University, Rutgers School of Engineering, and the State of New Jersey for the chance to advance knowledge, explore engineering, and open up new opportunities; Lockheed Martin, Silverline, Rubik's, and other Corporate Sponsors; and lastly NJ GSET Alumni, for their continued participation and support.

