Making Data Mining Models Useful to Model Non-paying Customers of Exchange Carriers Wei Fan, IBM...

Making Data Mining Models Useful to Model Non-paying Customers of Exchange Carriers Wei Fan, IBM T.J.Watson Janek Mathuria, and Chang-tien Lu Virginia Tech, Northern Virginia Center


Making Data Mining Models Useful to Model Non-paying Customers of Exchange Carriers

Wei Fan, IBM T.J.Watson

Janek Mathuria and Chang-tien Lu, Virginia Tech, Northern Virginia Center


Our Selling Points

A real practical problem for an actual CLEC company.

A whole process: start with a great goal, learn a lesson from reality, then settle down with a realistic solution.

A new set of algorithms to calibrate probability outputs (as distinguished from Zadrozny and Elkan’s calibration methods)


Challenging Problem

Differentiate between “Late” and “Default”. Late: one month past due. Default: two months past due.

Default percentage: 20%.

Designed feature set (details in the paper): calling summary and billing summary, the obvious ones. Other ones out there? Maybe.


Failure

Failure of commonly used methods: they predict nearly every customer as “paying on time” and still reach 80% accuracy.

What this means: Is our feature set incomplete? Probably. Or the problem itself is just stochastic in nature.

Natural next step: cost-sensitive learning? The costs are impossible to define precisely due to complexity.


A Compromised Solution

Predict a reliable probability score. A customer is uniquely distinguished by its feature vector. If the model predicts that a customer has a 20% chance to default, and the customer indeed has a 20% chance to default, the predicted score is considered “reliable”.
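One way to check this notion of reliability on held-out data is to group customers by predicted score and compare each group's mean score with its observed default rate. The sketch below is only an illustration under assumed names (`reliability_table`, `n_bins`, equal-width bins); it is not taken from the paper.

```python
def reliability_table(scores, outcomes, n_bins=5):
    """Group predictions into equal-width score bins (an assumed choice).

    scores   : predicted default probabilities in [0, 1]
    outcomes : 1 if the customer defaulted, 0 otherwise
    Returns one (mean predicted score, observed default rate, count)
    tuple per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, outcomes):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes to the last bin
        bins[idx].append((s, y))
    table = []
    for b in bins:
        if b:
            mean_score = sum(s for s, _ in b) / len(b)
            observed = sum(y for _, y in b) / len(b)
            table.append((mean_score, observed, len(b)))
    return table
```

A score is reliable in the sense above when the first two entries of each row are close, for instance ten customers scored at 0.2 of whom two actually default.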


Previously Proposed Calibration Methods

Scores output by existing approaches are not reliable (Zadrozny and Elkan): decision trees, naïve Bayes, SVM, logistic regression.

Use a “function” mapping to calibrate unreliable scores into reliable ones. Assumption: the original unreliable scores need to be monotonic. Otherwise, the method is not applicable.
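To make the idea of a function mapping concrete, here is a minimal sketch of one such mapping: a monotone step function fitted with the pool-adjacent-violators procedure. It is an assumed illustration, and the names `fit_monotone_mapping` and `calibrate` are hypothetical. It shows why the monotonicity assumption matters: the mapping can only reorder nothing, so if the raw scores rank customers wrongly, the calibrated scores do too.

```python
def fit_monotone_mapping(scores, labels):
    """Fit a non-decreasing step function from raw scores to probabilities.

    Returns a list of (upper score bound, calibrated probability) pairs,
    sorted by score. Labels are 0/1.
    """
    blocks = []  # each block: [sum of labels, count, largest score in block]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean is not below the new one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def calibrate(mapping, score):
    """Look up the calibrated probability for a raw score."""
    for upper, p in mapping:
        if score <= upper:
            return p
    return mapping[-1][1]  # scores above every bound use the last step
```

Usage: fit the mapping on a held-out set of raw scores and observed outcomes, then pass new raw scores through `calibrate`.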


A Good Calibration


A Bad Calibration


Random Decision Trees

Amazingly simple and counter-intuitive: Do not use any purity check function. Pick a feature “randomly”. For a continuous feature, pick a random splitting point. A discrete feature can be picked only once in one decision path. A continuous feature can be picked multiple times. Tree depth is up to the number of features. Original feature set. No bootstrap! Each tree computes a probability at the leaf node.

For example, with 10 fraud and 90 normal transactions at a leaf, p(fraud|x) = 0.1. Use multiple trees (10 minimum, 30 is enough) and average their probabilities.
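The rules on this slide can be sketched roughly as follows. This is a simplified illustration, not the authors' software: the data layout (rows as lists, `kinds` marking each feature as "c" for continuous or "d" for discrete), the function names, and the fallback for unseen branches are all assumptions.

```python
import random


def build_tree(rows, labels, kinds, depth, rng, used=frozenset()):
    """Grow one random tree; labels are 0/1 and every node stores p(positive)."""
    p = sum(labels) / len(labels)
    # Continuous features stay available; discrete ones only once per path.
    candidates = [i for i, k in enumerate(kinds) if k == "c" or i not in used]
    if depth == 0 or len(rows) < 2 or not candidates:
        return {"p": p}
    f = rng.choice(candidates)  # random pick, no purity or gain check
    threshold = None
    if kinds[f] == "c":
        values = [r[f] for r in rows]
        threshold = rng.uniform(min(values), max(values))  # random split point
    else:
        used = used | {f}
    groups = {}
    for r, y in zip(rows, labels):
        key = (r[f] <= threshold) if threshold is not None else r[f]
        group = groups.setdefault(key, ([], []))
        group[0].append(r)
        group[1].append(y)
    children = {
        key: build_tree(rs, ys, kinds, depth - 1, rng, used)
        for key, (rs, ys) in groups.items()
    }
    return {"p": p, "feature": f, "threshold": threshold, "children": children}


def predict_tree(node, row):
    """Walk down to a leaf and return its probability."""
    while "children" in node:
        value = row[node["feature"]]
        key = (value <= node["threshold"]) if node["threshold"] is not None else value
        child = node["children"].get(key)
        if child is None:  # branch not seen in training: use this node's estimate
            return node["p"]
        node = child
    return node["p"]


def fit_random_trees(rows, labels, kinds, n_trees=30, seed=0):
    """Every tree sees the full original data (no bootstrap); depth = number of features."""
    rng = random.Random(seed)
    return [build_tree(rows, labels, kinds, len(kinds), rng) for _ in range(n_trees)]


def predict_proba(trees, row):
    """Average the per-tree leaf probabilities."""
    return sum(predict_tree(t, row) for t in trees) / len(trees)
```

A node holding 10 fraud and 90 normal transactions stores 0.1, matching the example above.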


Random Forest +

Marriage between Random Decision Tree and Random Forest

Pick a feature subset randomly. Compute info gain for each feature. Choose the one with the highest info gain. Original dataset. No bootstrap. Leaf node computes probability. 10 to 30 trees.
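The split choice described here differs from the purely random pick only in its last step. A minimal sketch of that step for discrete features follows; the function names and the `subset_size` parameter are assumptions for illustration.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(rows, labels, f):
    """Entropy reduction from splitting on discrete feature index f."""
    groups = {}
    for r, y in zip(rows, labels):
        groups.setdefault(r[f], []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def choose_split(rows, labels, subset_size, rng):
    """Draw a random feature subset, then keep the feature with the highest gain."""
    subset = rng.sample(range(len(rows[0])), subset_size)
    return max(subset, key=lambda f: info_gain(rows, labels, f))
```

Each tree would call `choose_split` at every node on the original dataset, without bootstrap sampling, and store class probabilities at its leaves.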


Availability.

Software available upon request.