
Data Analysis for Credit Card Fraud Detection

Alejandro Correa Bahnsen
Luxembourg University

Introduction

[Figure: Europe fraud evolution, Internet transactions (millions of euros), 2007 to 2012E; vertical axis from € 500 to € 800]

Introduction

[Figure: US fraud evolution, online revenue lost due to fraud (billions of dollars), 2001 to 2012; vertical axis from $- to $5.0]

Agenda

• Introduction
• Database
• Evaluation of algorithms
• Logistic Regression
• Financial measure
• Cost Sensitive Logistic Regression

Database

• Large European card processing company
• 2012 card present transactions
• 750,000 transactions
• 3,500 frauds
• 0.467% fraud rate
• 148,562 EUR lost due to fraud on test dataset

[Figure: 2012 timeline, Jan to Dec, showing how the data is split into Train and Test sets]

Database

• Raw attributes

TRXID  Client ID  Date         Amount  Location  Type      Merchant Group  Fraud
1      1          2/1/12 6:00  580     Lux       Internet  Airlines        No
2      1          2/1/12 6:15  120     Lux       Present   Car Renting     No
3      2          2/1/12 8:20  12      Bel       Present   Hotel           Yes
4      1          3/1/12 4:15  60      Lux       ATM       ATM             No
5      2          3/1/12 9:18  8       Fra       Present   Retail          No
6      1          3/1/12 9:55  1210    Lux       Internet  Airlines        Yes

• Other attributes: age, country of residence, postal code, type of card

Database

• Derived attributes

ID  Num CC  Date         Amt   Location  Type      Merchant Group  Fraud  No. of Trx (same client, last 6 hours)  Sum (same client, last 7 days)
1   1       2/1/12 6:00  580   Lux       Internet  Airlines        No     0                                       0
2   1       2/1/12 6:15  120   Lux       Present   Car Renting     No     1                                       580
3   2       2/1/12 8:20  12    Bel       Present   Hotel           Yes    0                                       0
4   1       3/1/12 4:15  60    Lux       ATM       ATM             No     0                                       700
5   2       3/1/12 9:18  8     Fra       Present   Retail          No     0                                       12
6   1       3/1/12 9:55  1210  Lux       Internet  Airlines        Yes    1                                       760

Combination of the following criteria:

By                 Group             Last      Function
Client             None              hour      Count
Credit Card        Transaction Type  day       Sum(Amount)
Merchant                             week      Avg(Amount)
Merchant Category                    month
Merchant Group 1                     3 months
Merchant Group 2
Merchant Country
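The two derived columns in the example table can be reproduced with a short sketch. It assumes the dates are day/month/year, that transactions arrive in time order, and that each window counts only the client's earlier transactions; the function and variable names are illustrative and not taken from the slides.

```python
from datetime import datetime, timedelta

# (trx_id, client_id, timestamp, amount) rows from the example table.
# Assumption: dates are day/month/year.
TRANSACTIONS = [
    (1, 1, "2/1/12 6:00", 580),
    (2, 1, "2/1/12 6:15", 120),
    (3, 2, "2/1/12 8:20", 12),
    (4, 1, "3/1/12 4:15", 60),
    (5, 2, "3/1/12 9:18", 8),
    (6, 1, "3/1/12 9:55", 1210),
]


def derive(transactions):
    """Return (trx_id, count in last 6 hours, sum in last 7 days) per row.

    Both aggregates look only at earlier transactions of the same client.
    Assumes the input is sorted by timestamp.
    """
    history = {}  # client_id -> list of (time, amount) already seen
    rows = []
    for trx_id, client, stamp, amount in transactions:
        now = datetime.strptime(stamp, "%d/%m/%y %H:%M")
        past = history.setdefault(client, [])
        count_6h = sum(1 for t, _ in past if now - t <= timedelta(hours=6))
        sum_7d = sum(a for t, a in past if now - t <= timedelta(days=7))
        rows.append((trx_id, count_6h, sum_7d))
        past.append((now, amount))
    return rows


DERIVED = derive(TRANSACTIONS)
for row in DERIVED:
    print(row)
```

Other combinations from the criteria table (for example, average amount per merchant over the last month) would follow the same pattern with a different grouping key, window and aggregate.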

Evaluation

• Misclassification
• Recall
• Precision
• F-Score

• Confusion matrix

                     True class
Predicted class      Fraud (=1)  Legitimate (=0)
Fraud (=1)           TP          FP
Legitimate (=0)      FN          TN
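The formulas for the four measures did not survive in the transcript. As a minimal sketch, assuming the standard textbook definitions and taking the F-Score to be the F1-Score, they can be computed from the confusion matrix counts as follows. The counts in the example are made up for illustration.

```python
def evaluation_measures(tp, fp, fn, tn):
    """Standard measures computed from confusion matrix counts."""
    total = tp + fp + fn + tn
    misclassification = (fp + fn) / total
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "misclassification": misclassification,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, not taken from the slides.
MEASURES = evaluation_measures(tp=30, fp=20, fn=10, tn=940)
print(MEASURES)
```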


Logistic Regression

• Model

• Cost Function

• Cost Matrix

                     True class
Predicted class      Fraud (=1)  Legitimate (=0)
Fraud (=1)           0           1
Legitimate (=0)      1           0
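The model and cost function formulas are missing from the transcript. The sketch below assumes the standard logistic regression model (a sigmoid of a linear combination of the features) and the usual log-loss cost function, which, in line with the 0/1 cost matrix, penalizes both kinds of error equally. The names `theta`, `X` and `y` and the toy data are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, x):
    """Estimated probability of fraud for one feature vector x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def log_loss(theta, X, y):
    """Average log-loss over the examples (standard form, assumed here)."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(theta, xi), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)


# Toy data: the first feature is a constant intercept term.
X = [[1.0, 0.2], [1.0, 0.9], [1.0, 0.4]]
y = [0, 1, 0]
LOSS_AT_ZERO = log_loss([0.0, 0.0], X, y)
print(LOSS_AT_ZERO)
```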

Logistic Regression

Under sampling procedure

[Figure: the original data set (0.467% fraud rate) is under-sampled into training sets with fraud rates of 1%, 5%, 10%, 20% and 50%]

Select all the frauds and a random sample of the legitimate transactions.
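The procedure can be sketched as follows: keep every fraud and draw just enough legitimate transactions at random to reach a chosen fraud rate. The function name, the `target_fraud_rate` parameter and the toy label list are assumptions made for illustration.

```python
import random


def under_sample(labels, target_fraud_rate, seed=0):
    """Return indices of all frauds plus a random sample of legitimate rows.

    labels: list of 0/1 class labels (1 = fraud).
    target_fraud_rate: desired share of frauds in the sample, e.g. 0.10.
    """
    frauds = [i for i, y in enumerate(labels) if y == 1]
    legit = [i for i, y in enumerate(labels) if y == 0]
    # Number of legitimate rows needed to reach the target fraud rate.
    n_legit = round(len(frauds) * (1 - target_fraud_rate) / target_fraud_rate)
    n_legit = min(n_legit, len(legit))
    rng = random.Random(seed)
    sample = frauds + rng.sample(legit, n_legit)
    rng.shuffle(sample)
    return sample


# Toy label vector with a 0.467% fraud rate (35 frauds in 7,500 rows).
LABELS = [1] * 35 + [0] * 7465
SAMPLE_10 = under_sample(LABELS, 0.10)
SAMPLE_50 = under_sample(LABELS, 0.50)
print(len(SAMPLE_10), len(SAMPLE_50))
```

At a 50% target the sample holds only twice as many rows as there are frauds, so most of the legitimate transactions are left out of training.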

Logistic Regression

Results

[Figure: Recall, Precision, Misclassification and F1-Score for the training sets No Model, All, 1%, 5%, 10%, 20% and 50%; vertical axis from 0% to 70%]

Financial evaluation

• Motivation

• False positives carry a different cost than false negatives

• Frauds range from a few euros to thousands of euros (dollars, pounds, etc.)

There is a need for a real comparison measure

Financial evaluation

• Cost matrix

                     True class
Predicted class      Fraud (=1)  Legitimate (=0)
Fraud (=1)           Ca          Ca
Legitimate (=0)      Amt         0

where:

Ca   Administrative costs
Amt  Amount of transaction i

• Evaluation measure
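The formula for the evaluation measure is not preserved in the transcript. A minimal sketch, assuming it is the sum of the cost matrix entries over all transactions: every transaction flagged as fraud costs Ca, every missed fraud costs its amount, and correctly accepted legitimate transactions cost nothing. The labels and amounts below come from the six example transactions shown earlier; the value Ca = 5 and the predictions are made up.

```python
def total_cost(y_true, y_pred, amounts, admin_cost):
    """Sum of cost matrix entries over all transactions."""
    cost = 0.0
    for y, c, amt in zip(y_true, y_pred, amounts):
        if c == 1:
            cost += admin_cost  # flagged as fraud: true or false positive
        elif y == 1:
            cost += amt         # missed fraud: false negative
        # true negatives cost nothing
    return cost


Y_TRUE = [0, 0, 1, 0, 0, 1]
AMOUNTS = [580, 120, 12, 60, 8, 1210]
CA = 5  # hypothetical administrative cost

# "No model": nothing is flagged, so every fraud amount is lost.
COST_NO_MODEL = total_cost(Y_TRUE, [0] * 6, AMOUNTS, CA)
# Hypothetical predictions: one false positive, one caught fraud, one miss.
COST_MODEL = total_cost(Y_TRUE, [0, 0, 0, 0, 1, 1], AMOUNTS, CA)
print(COST_NO_MODEL, COST_MODEL)
```

Under this reading, the cost of using no model equals the total amount lost to fraud.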

Logistic Regression

Results

[Figure: Cost (€ - to € 160,000) together with Recall, Precision and F1-Score (0% to 70%) for each training set. Annotations mark the algorithm selected by F1-Score and the algorithm selected by Cost.]

Cost labels, in chart order:

Training set  Cost
No Model      € 148,562
All           € 148,196
1%            € 142,510
5%            € 112,103
10%           € 79,838
20%           € 65,870
50%           € 46,530


Logistic Regression

• The best model selected using the traditional F1-Score does not give the best results in terms of cost

• The model selected by cost is trained on less than 1% of the database, so a lot of information is excluded

• The algorithm is trained to (approximately) minimize the misclassification but is then evaluated based on cost

• Why not train the algorithm to minimize the cost instead?

Cost Sensitive Logistic Regression

• Cost Matrix

                     True class
Predicted class      Fraud (=1)  Legitimate (=0)
Fraud (=1)           Ca          Ca
Legitimate (=0)      Amt         0

• Cost Function

• Objective: find the model parameters that minimize the cost function (Genetic Algorithms)
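The cost function itself is missing from the transcript. The sketch below assumes one natural form: the expected value of the cost matrix entries, with the predicted fraud probability in place of the hard prediction. It then minimizes that function with a very small evolutionary search (elitism, uniform crossover, Gaussian mutation) as a stand-in for the genetic algorithm named on the slide. All names, the toy data, Ca = 5 and the search settings are assumptions, not the author's implementation.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cs_cost(theta, X, y, amounts, admin_cost):
    """Average expected cost under the cost matrix (assumed form)."""
    total = 0.0
    for xi, yi, amt in zip(X, y, amounts):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += yi * (h * admin_cost + (1 - h) * amt)  # fraud rows
        total += (1 - yi) * h * admin_cost              # legitimate rows
    return total / len(y)


def genetic_minimize(cost, n_params, pop_size=30, generations=60,
                     sigma=0.5, seed=0):
    """Tiny evolutionary search; the best half always survives."""
    rng = random.Random(seed)
    population = [[0.0] * n_params] + [
        [rng.gauss(0, 1) for _ in range(n_params)]
        for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=cost)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover followed by Gaussian mutation.
            children.append([rng.choice(pair) + rng.gauss(0, sigma)
                             for pair in zip(a, b)])
        population = parents + children
    return min(population, key=cost)


# Toy data: intercept plus one feature; amounts in euros; Ca = 5.
X = [[1, 0.1], [1, 0.2], [1, 0.9], [1, 0.8], [1, 0.3], [1, 0.95]]
Y = [0, 0, 1, 1, 0, 1]
AMOUNTS = [50, 20, 300, 120, 10, 900]


def objective(theta):
    return cs_cost(theta, X, Y, AMOUNTS, admin_cost=5)


BEST = genetic_minimize(objective, n_params=2)
print(objective([0.0, 0.0]), objective(BEST))
```

Because a missed fraud is charged its own amount, this objective pushes the model harder toward catching large frauds than small ones, which the log-loss does not do.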

Cost sensitive Logistic Regression

Results

[Figure: Cost (€ - to € 160,000) together with Recall, Precision and F1-Score (0% to 100%) for each training set]

Cost labels, in chart order:

Training set  Cost
No Model      € 148,562
All           € 31,174
1%            € 37,785
5%            € 66,245
10%           € 67,264
20%           € 73,772
50%           € 85,724

Cost sensitive Logistic Regression

Results

[Figure: Cost (€ - to € 12) together with Recall, Precision and F1-Score (0% to 80%)]


Conclusion

• Selecting models based on traditional statistics does not give the best results in terms of cost

• Models should be evaluated taking into account real financial costs of the application

• Algorithms should be developed to incorporate those financial costs


Thank you!


Contact information

Alejandro Correa Bahnsen

University of Luxembourg, Luxembourg

[email protected]

http://www.linkedin.com/in/albahnsen

http://www.slideshare.net/albahnsen