Data Analysis for Credit Card Fraud Detection Alejandro Correa Bahnsen Luxembourg University

Page 1

Data Analysis for Credit Card Fraud Detection

Alejandro Correa Bahnsen, Luxembourg University

Page 2

Introduction

[Figure: Europe fraud evolution, Internet transactions (millions of euros), 2007 to 2012E; vertical axis from €500 to €800]

Page 3

Introduction

[Figure: US fraud evolution, online revenue lost due to fraud (billions of dollars), 2001 to 2012; vertical axis from $0 to $5.0]

Page 5

Agenda

• Introduction
• Database
• Evaluation of algorithms
• Logistic Regression
• Financial measure
• Cost Sensitive Logistic Regression

Page 6

Database

• Large European card processing company

• 2012 card present transactions

• 750,000 transactions
• 3,500 frauds
• 0.467% fraud rate

• 148,562 EUR lost due to fraud on test dataset

[Figure: monthly timeline of 2012 (Jan to Dec) divided into Train and Test sets]

Page 7

Database

• Raw attributes

TRXID  Client ID  Date         Amount  Location  Type      Merchant Group  Fraud
1      1          2/1/12 6:00  580     Lux       Internet  Airlines        No
2      1          2/1/12 6:15  120     Lux       Present   Car Renting     No
3      2          2/1/12 8:20  12      Bel       Present   Hotel           Yes
4      1          3/1/12 4:15  60      Lux       ATM       ATM             No
5      2          3/1/12 9:18  8       Fra       Present   Retail          No
6      1          3/1/12 9:55  1210    Lux       Internet  Airlines        Yes

• Other attributes: age, country of residence, postal code, type of card

Page 8

Database

• Derived attributes

ID  Num CC  Date         Amt   Location  Type      Merchant Group  Fraud  No. of Trx (same client, last 6 hours)  Sum (same client, last 7 days)
1   1       2/1/12 6:00  580   Lux       Internet  Airlines        No     0                                       0
2   1       2/1/12 6:15  120   Lux       Present   Car Renting     No     1                                       580
3   2       2/1/12 8:20  12    Bel       Present   Hotel           Yes    0                                       0
4   1       3/1/12 4:15  60    Lux       ATM       ATM             No     0                                       700
5   2       3/1/12 9:18  8     Fra       Present   Retail          No     0                                       12
6   1       3/1/12 9:55  1210  Lux       Internet  Airlines        Yes    1                                       760

Combination of the following criteria:

By           Group              Last      Function
Client       None               hour      Count
Credit Card  Transaction Type   day       Sum(Amount)
             Merchant           week      Avg(Amount)
             Merchant Category  month
             Merchant Group 1   3 months
             Merchant Group 2
             Merchant Country
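The two derived columns in the example table can be reproduced with a short Python sketch. It assumes the dates are day/month/year and that each window covers only strictly earlier transactions of the same client; the function and variable names are illustrative and not taken from the talk.

```python
from datetime import datetime, timedelta

# Toy transactions from the raw-attributes table: (trx_id, client_id, timestamp, amount).
# Reading the dates as day/month/year is an assumption about the slide's format.
transactions = [
    (1, 1, datetime(2012, 1, 2, 6, 0), 580),
    (2, 1, datetime(2012, 1, 2, 6, 15), 120),
    (3, 2, datetime(2012, 1, 2, 8, 20), 12),
    (4, 1, datetime(2012, 1, 3, 4, 15), 60),
    (5, 2, datetime(2012, 1, 3, 9, 18), 8),
    (6, 1, datetime(2012, 1, 3, 9, 55), 1210),
]


def derived_features(trxs, count_window=timedelta(hours=6), sum_window=timedelta(days=7)):
    """For each transaction, count and sum the same client's earlier transactions.

    Returns a list of (trx_id, n_trx_in_count_window, amount_sum_in_sum_window).
    """
    rows = []
    for trx_id, client, when, _amount in trxs:
        # Only strictly earlier transactions of the same client are considered.
        earlier = [t for t in trxs if t[1] == client and t[2] < when]
        n_recent = sum(1 for t in earlier if when - t[2] <= count_window)
        amt_recent = sum(t[3] for t in earlier if when - t[2] <= sum_window)
        rows.append((trx_id, n_recent, amt_recent))
    return rows


features = derived_features(transactions)
for row in features:
    print(row)
```

Other combinations from the criteria table (for example, average amount per merchant group over the last month) would follow the same pattern with a different grouping key, window and aggregation.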

Page 9

Evaluation

• Confusion matrix

                                  True class
                                  Fraud (=1)  Legitimate (=0)
Predicted class  Fraud (=1)       TP          FP
                 Legitimate (=0)  FN          TN

• Misclassification
• Recall
• Precision
• F-Score
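The four measures named on this slide follow directly from the confusion-matrix counts. A minimal Python sketch using the standard textbook definitions; the counts in the example are made up for illustration and are not results from the talk.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard measures computed from confusion-matrix counts."""
    total = tp + fp + fn + tn
    misclassification = (fp + fn) / total
    # Guard against empty denominators (e.g. a model that never predicts fraud).
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "misclassification": misclassification,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, chosen only to mimic a heavily imbalanced problem.
metrics = classification_metrics(tp=20, fp=30, fn=15, tn=9935)
print(metrics)
```

With a fraud rate as low as the 0.467% reported for this database, a model that labels every transaction as legitimate already has a misclassification rate of only 0.467%, which is why misclassification alone says little here.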

Page 10

Agenda

• Introduction
• Database
• Evaluation of algorithms
• Logistic Regression
• Financial measure
• Cost Sensitive Logistic Regression

Page 11

Logistic Regression

• Model

• Cost Function

• Cost Matrix

                                  True class
                                  Fraud (=1)  Legitimate (=0)
Predicted class  Fraud (=1)       0           1
                 Legitimate (=0)  1           0
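For reference, the standard textbook form of the logistic regression model and its cost function is given below. This is presumably close to what the slide showed, but it is a reconstruction under assumed notation: θ for the parameter vector, x_i for the features of transaction i, y_i for its true class, and m for the number of training examples.

```latex
% Model: estimated probability that transaction i is fraud
h_\theta(x_i) = \frac{1}{1 + e^{-\theta^{T} x_i}}

% Cost function: average negative log-likelihood over m training examples
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \Big[ y_i \log h_\theta(x_i) + (1 - y_i) \log\big(1 - h_\theta(x_i)\big) \Big]
```

The 0/1 cost matrix on this slide expresses the same idea: both kinds of error are penalized equally and correct predictions cost nothing.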

Page 12

Logistic Regression

Under sampling procedure

Select all the frauds and a random sample of the legitimate transactions.

[Figure: fraud share of the training set, from the original 0.467% up to under-sampled sets at 1%, 5%, 10%, 20% and 50%]
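The under-sampling procedure can be sketched in a few lines of Python. The function name, the seed handling and the synthetic labels are assumptions for illustration; the talk does not say how the sampling was implemented.

```python
import random


def undersample(labels, target_fraud_rate, seed=0):
    """Keep every fraud and a random sample of legitimate transactions.

    The legitimate sample is sized so that frauds make up roughly
    `target_fraud_rate` of the returned index set.
    """
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    legit_idx = [i for i, y in enumerate(labels) if y == 0]
    # Solve n_fraud / (n_fraud + n_legit) = target_fraud_rate for n_legit.
    n_legit = round(len(fraud_idx) * (1 - target_fraud_rate) / target_fraud_rate)
    n_legit = min(n_legit, len(legit_idx))
    rng = random.Random(seed)
    return sorted(fraud_idx + rng.sample(legit_idx, n_legit))


# Synthetic labels with the same 0.467% fraud rate as the database (35 of 7,500).
labels = [1] * 35 + [0] * 7465
sample = undersample(labels, target_fraud_rate=0.05)
print(len(sample), sum(labels[i] for i in sample) / len(sample))
```

Raising the target fraud rate shrinks the training set: at 50% the sample holds only as many legitimate transactions as there are frauds.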

Page 13

Logistic Regression

Results

[Figure: Recall, Precision, misclassification and F1-Score (0% to 70%) for No Model, All, and under-sampled training sets at 1%, 5%, 10%, 20% and 50%]

Page 14

Financial evaluation

• Motivation

• False positives carry a different cost than false negatives

• Frauds range from a few euros to thousands of euros (or dollars, pounds, etc.)

There is a need for a real comparison measure

Page 15

Financial evaluation

• Cost matrix

                                  True class
                                  Fraud (=1)  Legitimate (=0)
Predicted class  Fraud (=1)       Ca          Ca
                 Legitimate (=0)  Amt         0

where:

Ca   Administrative costs
Amt  Amount of transaction i

• Evaluation measure
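Read as an evaluation measure, the cost matrix gives the total cost of a set of predictions: Ca for every transaction flagged as fraud, the transaction amount for every missed fraud, and nothing for legitimate transactions left alone. A minimal Python sketch follows; the value of Ca and the predictions are hypothetical, and the amounts and labels come from the six-row example table earlier in the deck.

```python
def total_cost(y_true, y_pred, amounts, admin_cost):
    """Sum the cost-matrix entry of every transaction."""
    cost = 0.0
    for y, c, amt in zip(y_true, y_pred, amounts):
        if c == 1:
            cost += admin_cost  # flagged as fraud (TP or FP): administrative cost
        elif y == 1:
            cost += amt         # missed fraud (FN): the amount is lost
        # legitimate and not flagged (TN): no cost
    return cost


y_true = [0, 0, 1, 0, 0, 1]
amounts = [580, 120, 12, 60, 8, 1210]
ADMIN_COST = 5.0  # hypothetical value of Ca, not given in the talk

no_model = total_cost(y_true, [0] * 6, amounts, ADMIN_COST)
flag_large = total_cost(y_true, [1, 0, 0, 0, 0, 1], amounts, ADMIN_COST)
print(no_model, flag_large)
```

When nothing is flagged, the cost equals the sum of the fraud amounts. This matches the deck's figures, where the No Model cost of €148,562 equals the amount lost to fraud on the test dataset.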

Page 16

Logistic Regression

Results

[Figure: Cost (€0 to €160,000) together with Recall, Precision and F1-Score (0% to 70%) for No Model, All, and under-sampled training sets at 1%, 5%, 10%, 20% and 50%. Cost data labels in the order they appear: €148,562, €148,196, €142,510, €112,103, €79,838, €65,870, €46,530. Annotations: "Selecting the algorithm by F1-Score" and "Selecting the algorithm by Cost".]

Page 17

Logistic Regression

• The best model selected by the traditional F1-Score does not give the best results in terms of cost

• The model selected by cost is trained on less than 1% of the database, so a lot of information is excluded

• The algorithm is trained to (approximately) minimize misclassification but is then evaluated on cost

• Why not train the algorithm to minimize the cost instead?

Page 18

Cost Sensitive Logistic Regression

• Cost Matrix

                                  True class
                                  Fraud (=1)  Legitimate (=0)
Predicted class  Fraud (=1)       Ca          Ca
                 Legitimate (=0)  Amt         0

• Cost Function

• Objective: find the parameters that minimize the cost function (using genetic algorithms)
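One cost function consistent with the cost matrix on this slide weights each matrix entry by the model's predicted fraud probability, giving an average expected cost per transaction. The Python sketch below is written under that assumption and is not necessarily the exact formula used in the talk; the parameter vector, the single scaled feature and the value of Ca are all hypothetical. It only evaluates the cost. The minimization itself, which the slide attributes to genetic algorithms, is left out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost_sensitive_loss(theta, X, y, amounts, admin_cost):
    """Average expected cost under the cost matrix.

    theta[0] is the intercept; theta[1:] are the feature weights.
    For a fraud:      p * Ca + (1 - p) * Amt_i
    For a legitimate: p * Ca
    where p is the predicted probability of fraud.
    """
    total = 0.0
    for x_i, y_i, amt_i in zip(X, y, amounts):
        z = theta[0] + sum(t * v for t, v in zip(theta[1:], x_i))
        p = sigmoid(z)
        if y_i == 1:
            total += p * admin_cost + (1 - p) * amt_i
        else:
            total += p * admin_cost
    return total / len(y)


# Toy data from the six-row example table; the single feature is amount / 1000.
amounts = [580, 120, 12, 60, 8, 1210]
y = [0, 0, 1, 0, 0, 1]
X = [[a / 1000] for a in amounts]
ADMIN_COST = 5.0  # hypothetical value of Ca

loss_uninformed = cost_sensitive_loss([0.0, 0.0], X, y, amounts, ADMIN_COST)
loss_never_flag = cost_sensitive_loss([-10.0, 0.0], X, y, amounts, ADMIN_COST)
print(loss_uninformed, loss_never_flag)
```

Unlike the usual log-likelihood cost, this function penalizes a missed fraud in proportion to its amount, so any optimizer applied to it is pushed toward catching the expensive frauds first.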

Page 19

Cost sensitive Logistic Regression

Results

[Figure: Cost (€0 to €160,000) together with Recall, Precision and F1-Score (0% to 100%) for No Model, All, and under-sampled training sets at 1%, 5%, 10%, 20% and 50%. Cost data labels in the order they appear: €148,562, €31,174, €37,785, €66,245, €67,264, €73,772, €85,724.]

Page 20

Cost sensitive Logistic Regression

Results

[Figure: Cost (axis from €0 to €12) together with Recall, Precision and F1-Score (axis from 0% to 80%)]

Page 21

Conclusion

• Selecting models based on traditional statistics does not give the best results in terms of cost

• Models should be evaluated taking into account real financial costs of the application

• Algorithms should be developed to incorporate those financial costs

Page 22

Thank you!

Page 23

Contact information

Alejandro Correa Bahnsen

University of Luxembourg, Luxembourg

[email protected]

http://www.linkedin.com/in/albahnsen

http://www.slideshare.net/albahnsen