Kaggle "Give me some credit" challenge overview

Predicting delinquency on debt

Description

Full description of the work associated with this project can be found at: http://www.npcompleteheart.com/project/kaggle-give-me-some-credit/

Transcript of Kaggle "Give me some credit" challenge overview

Page 1: Kaggle "Give me some credit" challenge overview

Predicting delinquency on debt

Page 5: Kaggle "Give me some credit" challenge overview

What is the problem?

• X Store has a retail credit card available to customers

• There can be a number of sources of loss from this product, but one is customers defaulting on their debt

• This prevents the store from collecting payment for products and services rendered

Page 10: Kaggle "Give me some credit" challenge overview

Is this problem big enough to matter?

• Examining a slice of the customer database (150,000 customers), we find that 6.6% of customers were seriously delinquent in payment within the last two years

• If only 5% of their carried debt is on the store credit card, this is potentially an:

• Average loss of $8.12 per customer

• Potential overall loss of $1.2 million
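As a quick sanity check on these figures, a minimal sketch: the 6.6% rate, the $8.12 per customer, and the $1.2 million total come from the slides above, while the per-delinquent loss is only the value they imply.

```python
# Hedged back-of-the-envelope check of the loss figures quoted above.
customers = 150_000
delinquency_rate = 0.066           # 6.6% seriously delinquent (from the slide)
avg_loss_per_customer = 8.12       # dollars per customer (from the slide)

delinquent_customers = customers * delinquency_rate            # ~9,900 customers
total_potential_loss = customers * avg_loss_per_customer       # ~$1.22M, i.e. ~$1.2 million
implied_loss_per_delinquent = total_potential_loss / delinquent_customers  # ~$123

print(delinquent_customers, total_potential_loss, implied_loss_per_delinquent)
```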

Page 14: Kaggle "Give me some credit" challenge overview

What can be done?

• There are numerous models that can be used to predict which customers will default

• This could be used to decrease credit limits or cancel credit lines for current risky customers to minimize potential loss

• Or better screen which customers are approved for the card

Page 18: Kaggle "Give me some credit" challenge overview

How will I do this?

• This is a basic classification problem with important business implications

• We’ll examine a few simple models to get an idea of performance

• Explore decision tree methods to achieve better performance

Page 23: Kaggle "Give me some credit" challenge overview

How will the models predict delinquency?

Each customer has a number of attributes

John Smith (Delinquent: Yes, Age: 23, Income: $1,600, Number of Lines: 4)

Mary Rasmussen (Delinquent: No, Age: 73, Income: $2,200, Number of Lines: 2)

...

We will use the customer attributes to predict whether they were delinquent

Page 28: Kaggle "Give me some credit" challenge overview

How do we make sure that our solution actually has predictive power?

We have two slices of the customer dataset

Train: 150,000 customers, delinquency in dataset

Test: 101,000 customers, delinquency not in dataset

None of the customers in the test dataset are used to train the model

Page 32: Kaggle "Give me some credit" challenge overview

Internally, we validate our model performance with k-fold cross-validation

Using only the train dataset we can get a sense of how well our model performs without externally validating it

[Diagram: the Train set is split into three folds (Train 1, Train 2, Train 3); the algorithm is trained on two folds (e.g. Train 1 and Train 2) and tested on the held-out fold (Train 3)]
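A minimal sketch of this kind of cross-validation in Python with scikit-learn. The file and column names (cs-training.csv, SeriousDlqin2yrs) assume the public Kaggle data for this challenge, and the three folds mirror the Train 1/2/3 diagram.

```python
# Cross-validation sketch: split the training data into folds, train on some
# folds and test on the held-out fold, rotating through all folds.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

train = pd.read_csv("cs-training.csv", index_col=0)        # assumed file name
y = train["SeriousDlqin2yrs"]                               # assumed label column
X = train.drop(columns=["SeriousDlqin2yrs"]).fillna(0)      # crude fill, just for the sketch

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=3)                 # 3 folds, as in the diagram
print(scores.mean(), scores.std())
```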

Page 33: Kaggle "Give me some credit" challenge overview

What matters is how well we can predict the test dataset

We judge this using accuracy: the number of correct predictions out of the total number of predictions made

So with 100,000 customers and an 80% accuracy we will have correctly predicted whether 80,000 customers will default or not in the next two years

Page 36: Kaggle "Give me some credit" challenge overview

Putting accuracy in context

We could save $600,000 over two years if we correctly predicted 50% of the customers who would default and changed their accounts to prevent it

The potential loss decreases by ~$8,000 per 100,000 customers for each percentage-point increase in accuracy
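A rough sketch of the arithmetic behind these dollar figures, reusing the numbers quoted earlier in the deck; the mapping from accuracy points to avoided losses is the slides' own approximation, not a precise claim.

```python
# Rough dollar-value arithmetic implied by the earlier slides.
avg_loss_per_customer = 8.12        # dollars, from the earlier slide
total_potential_loss = 1_200_000    # dollars over 150,000 customers, two years

# Preventing 50% of defaults avoids roughly half of the potential loss.
savings_at_50_percent = 0.5 * total_potential_loss                      # ~$600,000

# One accuracy point on 100,000 customers is ~1,000 customers at ~$8.12 each.
savings_per_accuracy_point = 0.01 * 100_000 * avg_loss_per_customer     # ~$8,120

print(savings_at_50_percent, savings_per_accuracy_point)
```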

Page 41: Kaggle "Give me some credit" challenge overview

Looking at the actual data

[Data table excerpt: missing values are filled in, assuming $2,500 for the missing income values and 0 for the other missing field]
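A minimal imputation sketch in pandas. The slide only says to assume $2,500 and 0 for the missing values; mapping those defaults to the MonthlyIncome and NumberOfDependents columns (the two fields with missing values in the public Kaggle data) is an assumption here.

```python
# Hedged missing-value imputation sketch for the raw data.
import pandas as pd

train = pd.read_csv("cs-training.csv", index_col=0)                  # assumed file name
train["MonthlyIncome"] = train["MonthlyIncome"].fillna(2500)         # assume $2,500 where income is missing
train["NumberOfDependents"] = train["NumberOfDependents"].fillna(0)  # assume 0 where dependents are missing
```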

Page 46: Kaggle "Give me some credit" challenge overview

There is a continuum of algorithmic choices to tackle the problem

Simpler, Quicker → Complex, Slower

• Random Chance: 50%

• Simple Classification

Page 54: Kaggle "Give me some credit" challenge overview

For simple classification we pick a single attribute and find the best split in the customers

[Histogram: Number of Customers vs. Times Past Due, with candidate split points (1, 2, ...) and the resulting true positives, true negatives, false positives, and false negatives]
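A sketch of this single-attribute split search. The file and column names assume the public Kaggle data, and the scan over integer thresholds mirrors the 1, 2, ... split points in the figure.

```python
# Simple classification sketch: predict "delinquent" when the past-due count
# meets or exceeds a threshold, and scan thresholds for the best train accuracy.
import pandas as pd

train = pd.read_csv("cs-training.csv", index_col=0)          # assumed file name
y = train["SeriousDlqin2yrs"]
x = train["NumberOfTime30-59DaysPastDueNotWorse"]            # assumed column name

best_acc, best_threshold = max(
    (((x >= t).astype(int) == y).mean(), t) for t in range(0, 20)
)
print(best_acc, best_threshold)
```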

Page 60: Kaggle "Give me some credit" challenge overview

We evaluate possible splits using accuracy, precision, and sensitivity

Acc = Number correct / Total number

Prec = True positives / Number of people predicted delinquent

Sens = True positives / Number of people actually delinquent

[Plot: accuracy, precision, and sensitivity versus the split threshold on Number of Times 30-59 Days Past Due (0-100)]

0.61 KGI on Test Set
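A small helper showing how those three quantities fall out of the confusion-matrix counts for one candidate split; the function and variable names here are illustrative, not from the slides.

```python
# Accuracy, precision, and sensitivity from confusion-matrix counts,
# where y_true and y_pred are 0/1 arrays for one candidate split.
import numpy as np

def split_metrics(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0      # of those predicted delinquent
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0    # of those actually delinquent
    return accuracy, precision, sensitivity
```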

Page 62: Kaggle "Give me some credit" challenge overview

However, not all fields are as informative

Using the number of times past due 60-89 days, we achieve a KGI of 0.5

The approach is naive and could be improved, but our time is better spent on different algorithms

Page 64: Kaggle "Give me some credit" challenge overview

Exploring algorithmic choices further

Simpler, Quicker → Complex, Slower

• Random Chance: 0.50

• Simple Classification: 0.50-0.61

• Random Forests

Page 70: Kaggle "Give me some credit" challenge overview

A random forest starts from a decision tree

Customer Data

Find the best split in a set of randomly chosen attributes

Is age <30?

No → 75,000 customers > 30

Yes → 25,000 customers < 30

...

Page 74: Kaggle "Give me some credit" challenge overview

A random forest is composed of many decision trees

[Diagram: many decision trees, each built from the customer data by finding a best split (Yes → Customer Data Set 1, No → Customer Data Set 2) and continuing to split further]

We use a large number of trees to avoid over-fitting to the training data

The class assigned to a customer is based on how many of the decision trees “vote” for each class

Page 77: Kaggle "Give me some credit" challenge overview

The Random Forest algorithm is easily implemented

In Python or R for initial testing and validation

It can also be parallelized with Mahout and Hadoop, since there is no dependence between one tree and the next
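A minimal scikit-learn sketch of the random-forest approach described above; the tree count echoes the results on the next slide, and the file and column names again assume the public Kaggle data.

```python
# Random forest sketch: many trees, each trained on random subsets of the data
# and attributes; their votes are averaged into a delinquency probability.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("cs-training.csv", index_col=0)          # assumed file name
y = train["SeriousDlqin2yrs"]
X = train.drop(columns=["SeriousDlqin2yrs"]).fillna(0)

forest = RandomForestClassifier(n_estimators=150, n_jobs=-1, random_state=0)
forest.fit(X, y)
rf_probs = forest.predict_proba(X)[:, 1]   # probability of delinquency per customer
```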

Page 82: Kaggle "Give me some credit" challenge overview

A random forest performs well on the test set

Random Forest
• 10 trees: 0.779 KGI
• 150 trees: 0.843 KGI
• 1000 trees: 0.850 KGI

[Bar chart: accuracy (0.4-0.9) for Random, Classification, and Random Forests]

Page 84: Kaggle "Give me some credit" challenge overview

Exploring algorithmic choices further

Simpler, Quicker → Complex, Slower

• Random Chance: 0.50

• Simple Classification: 0.50-0.61

• Random Forests: 0.78-0.85

• Gradient Tree Boosting

Page 86: Kaggle "Give me some credit" challenge overview

Boosting Trees is similar to a Random Forest

Customer Data

Is age <30?

No → Customers > 30 data

Yes → Customers < 30 data

...

Do an exhaustive search for the best split

Page 90: Kaggle "Give me some credit" challenge overview

How Gradient Boosting Trees differs from Random Forest

[Diagram: a decision tree built from the customer data by finding the best split (Yes → Customer Data Set 1, No → Customer Data Set 2), and so on]

The first tree is optimized to minimize a loss function describing the data

The next tree is then optimized to fit whatever variability the first tree didn’t fit

This is a sequential process, in contrast to the random forest

We also run the risk of over-fitting to the data, hence the learning-rate parameter

Page 92: Kaggle "Give me some credit" challenge overview

Implementing Gradient Boosted Trees

Initial testing and validation are easy in Python or R

There are implementations that use Hadoop, but it is more complicated to achieve the best performance
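A corresponding scikit-learn sketch for gradient tree boosting; the 100-tree, 0.1 learning-rate setting matches the results on the next slide, and X and y are assumed to be prepared as in the earlier sketches.

```python
# Gradient boosted trees sketch: trees are fit sequentially, each to the error
# the ensemble so far has not explained; the learning rate damps each step.
from sklearn.ensemble import GradientBoostingClassifier

gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
gbt.fit(X, y)                                # X, y prepared as in the earlier sketches
gbt_probs = gbt.predict_proba(X)[:, 1]       # delinquency probability per customer
```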

Page 96: Kaggle "Give me some credit" challenge overview

Gradient Boosting Trees performs well on the dataset

• 100 trees, 0.1 learning rate: 0.865022 KGI
• 1000 trees, 0.1 learning rate: 0.865248 KGI

[Plot: KGI (0.75-0.85) versus learning rate (0-0.8)]

[Bar chart: accuracy (0.4-0.9) for Random, Classification, Random Forests, and Boosting Trees]

Page 97: Kaggle "Give me some credit" challenge overview

Moving one step further in complexity

Simpler, Quicker → Complex, Slower

• Random Chance: 0.50

• Simple Classification: 0.50-0.61

• Random Forests: 0.78-0.85

• Gradient Tree Boosting: 0.71-0.8659

• Blended Method

Page 104: Kaggle "Give me some credit" challenge overview

Or, more accurately, an ensemble of ensemble methods

Algorithm Progression → Train Data Probabilities

Random Forest

Extremely Random Forest

Gradient Tree Boosting

Each model produces a column of probabilities, e.g. 0.1, 0.5, 0.01, 0.8, 0.7, ... and 0.15, 0.6, 0.0, 0.75, 0.68, ...

Page 107: Kaggle "Give me some credit" challenge overview

Combine all of the model information

Train data probabilities from each model, e.g. 0.1, 0.5, 0.01, 0.8, 0.7, ... and 0.15, 0.6, 0.0, 0.75, 0.68, ...

Optimize how the train probabilities are combined against the known delinquencies

Apply the same weighting scheme to the set of test data probabilities
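One possible way to implement this weighting, assuming each model's probabilities have already been computed on the train and test sets (the variable names below are hypothetical); fitting a logistic regression on the stacked probabilities is just one way to learn the weighting described above.

```python
# Blending sketch: stack each model's probabilities, learn a weighting against
# the known train delinquencies, and reuse it on the test probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical arrays of per-customer probabilities from the three models.
train_stack = np.column_stack([rf_train_probs, erf_train_probs, gbt_train_probs])
test_stack = np.column_stack([rf_test_probs, erf_test_probs, gbt_test_probs])

blender = LogisticRegression()
blender.fit(train_stack, y_train)                  # optimize against known delinquencies
blended_test_probs = blender.predict_proba(test_stack)[:, 1]
```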

Page 108: Kaggle "Give me some credit" challenge overview

Implementation can be done in a number of ways

Testing in Python or R is slower, due to the sequential nature of applying the algorithms

It could be made faster by parallelizing: running each algorithm separately and combining the results

Page 111: Kaggle "Give me some credit" challenge overview

Assessing model performance

Blending Performance, 100 trees: 0.864394 KGI

But this performance, and the possibility of additional gains, comes at a distinct time cost.

[Bar chart: accuracy (0.4-0.9) for Random, Classification, Random Forests, Boosting Trees, and Blended]

Page 112: Kaggle "Give me some credit" challenge overview

Examining the continuum of choices

Simpler, Quicker → Complex, Slower

• Random Chance: 0.50

• Simple Classification: 0.50-0.61

• Random Forests: 0.78-0.85

• Gradient Tree Boosting: 0.71-0.8659

• Blended Method: 0.864

Page 117: Kaggle "Give me some credit" challenge overview

What would be best to implement?

There is a large amount of optimization that could still be done on the blended method

However, this algorithm takes the longest to run. This constraint will also apply during testing and validation

Random Forests returns a reasonably good result. It is quick and easily parallelized

Gradient Tree Boosting returns the best result and runs reasonably fast. It is not as easily parallelized, though

Page 120: Kaggle "Give me some credit" challenge overview

Increases in predictive performance have real business value

Using any of the more complex algorithms, we achieve an increase of roughly 35 percentage points in comparison to random chance

Potential decrease of ~$420k in losses by identifying customers likely to default, in the training set alone

Page 121: Kaggle "Give me some credit" challenge overview

Thank you for your time