
Machine Learning Homework 8, 598 and 494

Rob McCulloch

3/26/2019

Contents

Trees on the Kaggle Data
  A Simple Tree
  Random Forests
  Boosting
  Comparing the Methods
Homework Problem

Trees on the Kaggle Data

Let’s try trees on the Kaggle data.

We are trying to predict whether an account will go delinquent.

ktr=read.csv("http://www.rob-mcculloch.org/data/kaggle-del-train.csv") #read in the train data
kte=read.csv("http://www.rob-mcculloch.org/data/kaggle-del-test.csv") #read in the test data
ktr$DelIn2Yr = as.factor(ktr$DelIn2Yr)
kte$DelIn2Yr = as.factor(kte$DelIn2Yr)
names(ktr)

## [1] "RevolvingUtilizationOfUnsecuredLines"
## [2] "age"
## [3] "NumberOfTime30.59DaysPastDueNotWorse"
## [4] "DebtRatio"
## [5] "NumberOfOpenCreditLinesAndLoans"
## [6] "NumberOfTimes90DaysLate"
## [7] "NumberRealEstateLoansOrLines"
## [8] "NumberOfTime60.89DaysPastDueNotWorse"
## [9] "DelIn2Yr"

dim(ktr)

## [1] 75000     9

dim(kte)

## [1] 75000     9

table(ktr$DelIn2Yr)/length(ktr$DelIn2Yr)

##
##          0          1
## 0.93294667 0.06705333

table(kte$DelIn2Yr)/length(kte$DelIn2Yr)

##
##          0          1
## 0.93337333 0.06662667

So ktr is our training data and kte is our test data.


A Simple Tree

Let’s fit a single tree to the data using the R package rpart.

First we fit a big tree by using a small cp (the .0001 below).

library(rpart)
set.seed(99)
big.tree = rpart(DelIn2Yr~.,data=ktr, control=rpart.control(cp=.0001))
nbig = length(unique(big.tree$where))
cat("size of big tree: ",nbig,"\n")

## size of big tree:  376

head(big.tree$where)

## 1 2 3 4 5 6
## 5 5 5 5 5 5

Remember, cp is the key cost-complexity parameter, which is α in the notes. A smaller cp gives you a bigger tree.

The where component of the list returned by rpart indicates the partitioning of the data into disjoint subsets. So there are 376 bottom nodes in the tree big.tree, and the first observation is in the 5th bottom node. The numbering of the bottom nodes is not itself meaningful; we just have a unique integer assigned to each bottom node.
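For example (a quick check, not part of the original output), tabulating where shows how many training observations land in each bottom node:

#counts of training observations per bottom node (first few shown)
head(table(big.tree$where))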

Let’s see what cross-validation tells us about a good size for the tree. The rpart package does this for us with the plotcp function.

plotcp(big.tree)


[Figure: plotcp(big.tree). x-axis: cp (Inf down to 0.00023), with the corresponding size of tree (1 up to 311) along the top; y-axis: X-val Relative Error (about 0.85 to 1.15).]

Let’s pull off the cp = α for the best size and prune our big tree pack using that cp and then prune the treeback using that cp.

iibest = which.min(big.tree$cptable[,"xerror"]) #which has the lowest error
bestcp=big.tree$cptable[iibest,"CP"]
bestsize = big.tree$cptable[iibest,"nsplit"]+1
cat("the best tree has size ",bestsize,"\n")

## the best tree has size  33

best.tree = prune(big.tree,cp=bestcp) #prune back big.tree using the best cp
#let's check the size
nbest = length(unique(best.tree$where))
cat("size of best tree: ",nbest,"\n")

## size of best tree:  33
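An alternative you sometimes see (a sketch, not what we use here) is the 1-SE rule: take the smallest tree whose xerror is within one standard error of the minimum.

#sketch of the 1-SE rule, using the cptable columns printed further below
cpt = big.tree$cptable
thresh = cpt[iibest,"xerror"] + cpt[iibest,"xstd"] #min xerror plus one SE
ii1se = min(which(cpt[,"xerror"] <= thresh)) #smallest tree under the threshold
tree.1se = prune(big.tree,cp=cpt[ii1se,"CP"])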

Now let’s look at our out-of-sample predictions and lift.

##out of sample
yhattest = predict(best.tree,kte)[,2] #first col is prob(y=0|x), second col is prob(y=1|x)
summary(yhattest)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.02245 0.02245 0.02245 0.06790 0.08155 0.81250
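If you want hard classifications instead of probabilities, you can threshold yhattest. Here is a minimal sketch; the 0.5 cutoff is just an illustration, and with only about 7% delinquent accounts a lower cutoff may make more sense.

#confusion matrix at a 0.5 cutoff (the cutoff is just an illustration)
yhatclass = ifelse(yhattest > .5, 1, 0)
table(predicted=yhatclass, actual=kte$DelIn2Yr)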

##lift
source("http://www.rob-mcculloch.org/2019_ml/webpage/notes/lift-loss.R")
lift.plot(yhattest,kte$DelIn2Yr,cex.lab=1.2)


[Figure: lift plot for the single tree on the test data. x-axis: % tried; y-axis: % of successes.]

Now let’s look at ROC/AUC.

##ROC, AUC
library(pROC)

## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
##     cov, smooth, var

delRoc = roc(response=kte$DelIn2Yr,predictor=yhattest)
delAuc = auc(delRoc)
cat("AUC for tree fit to Kaggle data is ",delAuc,"\n")

## AUC for tree fit to Kaggle data is  0.7945745

plot(delRoc)


[Figure: ROC curve for the tree. x-axis: Specificity (1.0 down to 0.0); y-axis: Sensitivity (0.0 to 1.0).]
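As an optional follow-up (a sketch, not in the original output), pROC’s coords function can pull off the cutoff at the ROC point closest to the top-left corner:

#cutoff, sensitivity, and specificity at the "best" ROC point
coords(delRoc,"best",ret=c("threshold","sensitivity","specificity"))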

Now let’s plot the tree. To get a tree that plots nicely, I’ll prune it back to be smaller than the best tree by using a bigger cp value.

library(rpart.plot)
best.tree10 = prune(big.tree,cp=.0015) #prune back big.tree using a bigger cp
rpart.plot(best.tree10,split.cex=0.5,cex=0.75,type=3,extra=4)


[Figure: rpart.plot of the pruned 10-rule tree. The splitting rules are NumberOfTimes90DaysLate < 1, NumberOfTimes90DaysLate < 2, RevolvingUtilizationOfUnsecuredLines < 0.41, NumberOfTime60.89DaysPastDueNotWorse < 1, DebtRatio < 0.018, NumberOfTime30.59DaysPastDueNotWorse < 1, NumberOfTime60.89DaysPastDueNotWorse < 2, age >= 51, NumberOfOpenCreditLinesAndLoans < 1, and RevolvingUtilizationOfUnsecuredLines < 0.93; each bottom node is labeled with its class probabilities.]

How did I know that cp=.0015 would give me a nice size tree? You get this kind of info from the cptable:

big.tree$cptable

##              CP nsplit rel error    xerror       xstd
## 1  0.0237621794      0 1.0000000 1.0000000 0.01362033
## 2  0.0040763571      2 0.9524756 0.9524756 0.01331542
## 3  0.0028832770      4 0.9443229 0.9582422 0.01335291
## 4  0.0021873136      6 0.9385564 0.9578445 0.01335033
## 5  0.0015907735      7 0.9363691 0.9562537 0.01334000
## 6  0.0014913502     10 0.9315967 0.9488964 0.01329208
## 7  0.0014582091     12 0.9286140 0.9488964 0.01329208
## 8  0.0012427918     15 0.9242394 0.9481010 0.01328689
## 9  0.0011930801     19 0.9192682 0.9479022 0.01328559
## 10 0.0011433685     22 0.9156890 0.9473056 0.01328169
## 11 0.0010936568     29 0.9073374 0.9469079 0.01327910
## 12 0.0009942334     32 0.9031617 0.9421356 0.01324785
## 13 0.0007953868     39 0.8962020 0.9429310 0.01325307
## 14 0.0006959634     45 0.8914297 0.9463114 0.01327520
## 15 0.0005965401     51 0.8862597 0.9457149 0.01327129
## 16 0.0005468284     62 0.8791012 0.9486976 0.01329079
## 17 0.0004971167     71 0.8723404 0.9506860 0.01330376
## 18 0.0004772321     76 0.8697554 0.9496918 0.01329728
## 19 0.0004639756     81 0.8673693 0.9512826 0.01330765
## 20 0.0004474051     99 0.8566315 0.9512826 0.01330765
## 21 0.0003976934    112 0.8504673 0.9568503 0.01334388
## 22 0.0003579240    182 0.8176576 0.9604295 0.01336710
## 23 0.0003314111    187 0.8158680 0.9657984 0.01340183
## 24 0.0002982700    194 0.8130841 0.9807119 0.01349768
## 25 0.0002651289    210 0.8079141 0.9876715 0.01354211
## 26 0.0001988467    226 0.8021475 0.9930404 0.01357624
## 27 0.0001590774    300 0.7852456 1.0220720 0.01375890
## 28 0.0001491350    310 0.7836548 1.0280374 0.01379603
## 29 0.0001325645    341 0.7784848 1.0334062 0.01382933
## 30 0.0001000000    375 0.7739113 1.0373832 0.01385393

From the cptable I can see that a cp of about .0015 corresponds to a tree with about 10 decision rules.
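For example (a quick way to eyeball this, not part of the original output), you can print just the cptable rows with roughly 10 splits:

#cptable rows with roughly 10 splits
big.tree$cptable[big.tree$cptable[,"nsplit"] %in% 7:12,]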

This is also where we got the best tree from. We find the row of the cptable with the smallest xerror and then pull off the cp value from that row (see above).

Here is the variable importance from rpart.

print(best.tree$variable.importance)

##              NumberOfTimes90DaysLate RevolvingUtilizationOfUnsecuredLines
##                          1207.146968                           332.129838
## NumberOfTime30.59DaysPastDueNotWorse NumberOfTime60.89DaysPastDueNotWorse
##                           235.299080                           178.574720
##                            DebtRatio      NumberOfOpenCreditLinesAndLoans
##                            39.374718                            31.100845
##                                  age         NumberRealEstateLoansOrLines
##                            24.360910                             2.720439

plot(best.tree$variable.importance)

[Figure: plot of best.tree$variable.importance. x-axis: Index (1 to 8); y-axis: importance (0 to about 1200).]
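To make these importances easier to read (a sketch, not in the original), you can rescale them to percentages:

#rpart importances rescaled to percentages
vi = best.tree$variable.importance
round(100*vi/sum(vi),1)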


Random Forests

Ok, let’s try Random Forests!!

library(randomForest)

## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.

set.seed(99)
rffit = randomForest(DelIn2Yr~.,data=ktr,mtry=3,ntree=500)
plot(rffit)

[Figure: plot(rffit), the OOB error curves. x-axis: trees (0 to 500); y-axis: Error (0.0 to 0.8).]

The plot suggests that it does not take many trees to get rid of the high variance, but the uncertainty is huge.
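If you want the numbers behind that plot (a sketch, not in the original output), randomForest stores the OOB error rates in the err.rate component; the first column is the overall OOB rate and the other two are the class-specific rates.

#OOB error rates behind plot(rffit); look at the last few rows
tail(rffit$err.rate)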

Let’s get the predictions and look at the lift.

rfyhattest = predict(rffit,newdata=kte,type="prob")[,2] #again, second column is p(y=1|x)
summary(rfyhattest)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.00000 0.00200 0.01200 0.06103 0.05000 0.99400

lift.plot(rfyhattest,kte$DelIn2Yr,cex.lab=1.2)


[Figure: lift plot for the random forest on the test data. x-axis: % tried; y-axis: % of successes.]

Now let’s look at the variable importance.

varImpPlot(rffit)

[Figure: varImpPlot(rffit). x-axis: MeanDecreaseGini (0 to 2000). From most to least important: RevolvingUtilizationOfUnsecuredLines, DebtRatio, age, NumberOfTimes90DaysLate, NumberOfOpenCreditLinesAndLoans, NumberOfTime30.59DaysPastDueNotWorse, NumberOfTime60.89DaysPastDueNotWorse, NumberRealEstateLoansOrLines.]

This does not agree with rpart!!


Boosting

library(gbm) #also xgboost is supposed to be good

## Loaded gbm 2.1.4

# first gbm needs a numeric y, weird
trDB = ktr; trDB$DelIn2Yr = as.numeric(trDB$DelIn2Yr)-1
teDB = kte; teDB$DelIn2Yr = as.numeric(teDB$DelIn2Yr)-1
# check the new y's make sense
table(trDB$DelIn2Yr,ktr$DelIn2Yr)

##
##         0     1
##   0 69971     0
##   1     0  5029

table(teDB$DelIn2Yr,kte$DelIn2Yr)

##
##         0     1
##   0 70003     0
##   1     0  4997

#fit boosting
bfit = gbm(DelIn2Yr~.,trDB, distribution="bernoulli",n.trees=500,interaction.depth=3,shrinkage=.05)
byhattest = predict(bfit,newdata=teDB,n.trees=500,type="response")
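I did not tune the boosting settings above. If you wanted to, gbm can pick the number of trees by cross-validation; here is a sketch (the refit and the 5-fold choice are just illustrations, not run here):

#possible tuning step (not run here): pick n.trees by 5-fold CV
bfit.cv = gbm(DelIn2Yr~.,trDB,distribution="bernoulli",n.trees=500,
              interaction.depth=3,shrinkage=.05,cv.folds=5)
best.iter = gbm.perf(bfit.cv,method="cv") #CV-optimal number of trees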

Ok, let’s look at the lift.

lift.plot(byhattest,kte$DelIn2Yr,cex.lab=1.2)

[Figure: lift plot for the boosting fit on the test data. x-axis: % tried; y-axis: % of successes.]

And the variable importance:


summary(bfit)

[Figure: barplot of relative influence from summary(bfit). x-axis: Relative influence (0 to 40); NumberRealEstateLoansOrLines has the smallest bar.]

##                                                                       var
## NumberOfTimes90DaysLate                           NumberOfTimes90DaysLate
## RevolvingUtilizationOfUnsecuredLines RevolvingUtilizationOfUnsecuredLines
## NumberOfTime60.89DaysPastDueNotWorse NumberOfTime60.89DaysPastDueNotWorse
## NumberOfTime30.59DaysPastDueNotWorse NumberOfTime30.59DaysPastDueNotWorse
## DebtRatio                                                       DebtRatio
## age                                                                   age
## NumberOfOpenCreditLinesAndLoans           NumberOfOpenCreditLinesAndLoans
## NumberRealEstateLoansOrLines                 NumberRealEstateLoansOrLines
##                                        rel.inf
## NumberOfTimes90DaysLate              40.937395
## RevolvingUtilizationOfUnsecuredLines 21.073632
## NumberOfTime60.89DaysPastDueNotWorse 13.369043
## NumberOfTime30.59DaysPastDueNotWorse 13.313170
## DebtRatio                             4.340049
## age                                   3.439496
## NumberOfOpenCreditLinesAndLoans       2.013411
## NumberRealEstateLoansOrLines          1.513805

Comparing the Methods

Which one was the best?

Let’s use lift to compare them.

yhatL = list(yhattest,rfyhattest,byhattest)
lift.many.plot(yhatL,kte$DelIn2Yr)
legend("topleft",legend=c("tree","random forests","boosting"),lwd=rep(3,1),col=1:3,bty="n",cex=.8)


[Figure: lift curves for all three fits on the test data. x-axis: % tried; y-axis: % of successes; legend: tree, random forests, boosting.]
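Another way to compare the fits numerically (a sketch, not in the original output) is AUC, computed the same way as for the single tree above:

#test-data AUC for each fit
cat("tree AUC:     ",auc(roc(response=kte$DelIn2Yr,predictor=yhattest)),"\n")
cat("rf AUC:       ",auc(roc(response=kte$DelIn2Yr,predictor=rfyhattest)),"\n")
cat("boosting AUC: ",auc(roc(response=kte$DelIn2Yr,predictor=byhattest)),"\n")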

Homework Problem

Try changing mtry in random forests to something bigger than the 3 I used in the code above.

Try a different setting for the boosting fit. Your choice.

Use plots (e.g. lift) to compare the new random forests and boosting fits with the old ones. Are they any better?
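A possible starting point (the settings below are just examples, not recommendations):

#one way to start; the parameter values are only examples
rffit2 = randomForest(DelIn2Yr~.,data=ktr,mtry=6,ntree=500)
bfit2 = gbm(DelIn2Yr~.,trDB,distribution="bernoulli",n.trees=1000,
            interaction.depth=4,shrinkage=.02)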
