Data Analytics and Business Intelligence (8696/8697)
Ensemble Decision Trees
Graham Williams
Data Scientist, Australian Taxation Office
Adjunct Professor, University of Canberra
http://datamining.togaware.com
http://togaware.com Copyright © 2014, [email protected] 1/36
Overview
Overview
1 Overview
2 Multiple Models
3 Boosting: Algorithm; Example
4 Random Forests: Forests of Trees; Introduction
5 Other Ensembles: Ensembles of Different Models
Multiple Models
Building Multiple Models
General idea developed in the Multiple Inductive Learning algorithm (Williams 1987).
Ideas were developed (ACJ 1987, PhD 1990) in the context of:
observe that variable selection methods don't discriminate; so build multiple decision trees; then combine into a single model.
Basic idea is that multiple models, like multiple experts, may produce better results when working together, rather than in isolation.
Two approaches covered: Boosting and Random Forests.
Meta learners.
Boosting Algorithm
Boosting Algorithms
Basic idea: boost observations that are "hard to model."
Algorithm: iteratively build weak models using a poor learner:
Build an initial model;
Identify mis-classified cases in the training dataset;
Boost (over-represent) training observations modelled incorrectly;
Build a new model on the boosted training dataset;
Repeat.
The result is an ensemble of weighted models.
"Best off-the-shelf model builder." (Leo Breiman)
Boosting Algorithm
Algorithm in Pseudo Code
adaBoost <- function(form, data, learner)
{
  w <- rep(1/nrow(data), nrow(data))  # equal starting weights
  e <- NULL                           # error rate at each iteration
  a <- NULL                           # model weight (alpha) at each iteration
  m <- list()                         # the ensemble of weak models
  i <- 0
  repeat
  {
    i  <- i + 1
    m  <- c(m, learner(form, data, w))
    ms <- which(predict(m[[i]], data) != data[[target(form)]])  # misclassified cases
    e  <- c(e, sum(w[ms])/sum(w))     # weighted error of the new model
    a  <- c(a, log((1-e[i])/e[i]))
    w[ms] <- w[ms] * exp(a[i])        # boost the misclassified observations
    if (e[i] >= 0.5) break            # stop once no better than chance
  }
  return(sum(a * sapply(m, predict, data)))  # weighted combination of predictions
}
Boosting Algorithm
Distributions
[Figure: learning rate — alpha and e^alpha (log scale, 1e-01 to 1e+03) plotted against the error rate epsilon from 0.0 to 0.5.]
Boosting Example
Example: First Iteration
n <- 10
w <- rep(1/n, n) # 0.1 0.1 ...
ms <- c(7, 8, 9, 10)
e <- sum(w[ms])/sum(w) # 0.4
a <- log((1-e)/e) # 0.4055
w[ms] <- w[ms] * exp(a) # 0.15 0.15 0.15 0.15
Boosting Example
Example: Second Iteration
ms <- c(1, 8) # 0.10 0.15
w[ms]
## [1] 0.10 0.15
e <- sum(w[ms])/sum(w) # 0.2083
a <- log((1-e)/e) # 1.335
(w[ms] <- w[ms] * exp(a))
## [1] 0.38 0.57
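The weight arithmetic of the two iterations above can be replayed end-to-end. A minimal sketch in Python (rather than the R of the slides) reproducing the same numbers:

```python
import math

n = 10
w = [1.0 / n] * n                    # equal starting weights: 0.1 each

# First iteration: observations 7, 8, 9, 10 are misclassified.
ms = [6, 7, 8, 9]                    # the same cases, 0-based
e1 = sum(w[i] for i in ms) / sum(w)  # weighted error: 0.4
a1 = math.log((1 - e1) / e1)         # model weight: log(1.5), about 0.4055
for i in ms:
    w[i] *= math.exp(a1)             # each boosted from 0.10 to 0.15

# Second iteration: observations 1 and 8 are misclassified.
ms = [0, 7]
e2 = sum(w[i] for i in ms) / sum(w)  # (0.10 + 0.15) / 1.2, about 0.2083
a2 = math.log((1 - e2) / e2)         # about 1.335
for i in ms:
    w[i] *= math.exp(a2)             # 0.38 and 0.57, as on the slide

print(round(e1, 4), round(a1, 4), round(e2, 4), round(a2, 4))
print(round(w[0], 2), round(w[7], 2))
```

Note that the total weight grows each iteration (here from 1.0 to 1.2), which is why the error rate divides by sum(w) rather than assuming it is 1.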
Boosting Example
Example: Ada on Weather Data
head(weather[c(1:5, 23, 24)], 3)
## Date Location MinTemp MaxTemp Rainfall RISK_MM...
## 1 2007-11-01 Canberra 8.0 24.3 0.0 3.6...
## 2 2007-11-02 Canberra 14.0 26.9 3.6 3.6...
## 3 2007-11-03 Canberra 13.7 23.4 3.6 39.8...
....
set.seed(42)
train <- sample(1:nrow(weather), 0.7 * nrow(weather))
(m <- ada(RainTomorrow ~ ., weather[train, -c(1:2, 23)]))
## Call:
## ada(RainTomorrow ~ ., data=weather[train, -c(1:2, 23)])
##
## Loss: exponential Method: discrete Iteration: 50
....
Boosting Example
Example: Error Rate
Notice the error rate decreases quickly, then flattens.
plot(m)
[Figure: training error over iterations 1 to 50, falling from about 0.14 to about 0.06.]
Boosting Example
Example: Variable Importance
Helps understand the knowledge captured.
varplot(m)
[Figure: variable importance plot — scores from about 0.04 to 0.10 across the 20 input variables, including Temp3pm, Pressure9am, MinTemp, Humidity3pm, RainToday and others.]
Boosting Example
Example: Sample Trees
There are 50 trees in all. Here are the first three.
fancyRpartPlot(m$model$trees[[1]])
fancyRpartPlot(m$model$trees[[2]])
fancyRpartPlot(m$model$trees[[3]])
[Figure: the first three boosted trees. Tree 1 splits on Cloud3pm < 7.5, Pressure3pm >= 1012, and Humidity3pm < 42; Trees 2 and 3 both split first on Pressure3pm >= 1012, then on Sunshine >= 11 and MaxTemp >= 27 respectively.]
Boosting Example
Example: Performance
predicted <- predict(m, weather[-train,], type="prob")[,2]
actual <- weather[-train,]$RainTomorrow
risks <- weather[-train,]$RISK_MM
riskchart(predicted, actual, risks)
[Figure: risk chart — Recall (88%), Risk (94%), and Precision curves of Performance (%) against Caseload (%).]
Boosting Example
Example Applications
ATO Application: What life events affect compliance?
First application of the technology — 1995.
Decision Stumps: Age > NN; Change in Marital Status.
Boosted Neural Networks
OCR using neural networks as base learners (Drucker, Schapire, Simard, 1993).
Boosting Example
Summary
1 Boosting is implemented in R in the ada library
2 AdaBoost uses the exponential loss e^(-m); LogitBoost uses log(1 + e^(-m)); Doom II uses 1 - tanh(m)
3 AdaBoost tends to be sensitive to noise (addressed by BrownBoost)
4 AdaBoost tends not to overfit, and as new models are added, generalisation error tends to improve.
5 Can be proved to converge to a perfect model if the learners are always better than chance.
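The three loss functions in point 2 can be compared directly. A small Python sketch (the slides use R) evaluating each loss over a range of margins m:

```python
import math

def adaboost_loss(m):
    return math.exp(-m)                 # exponential loss

def logitboost_loss(m):
    return math.log(1 + math.exp(-m))   # logistic loss

def doom2_loss(m):
    return 1 - math.tanh(m)             # Doom II loss

# All three penalise negative margins (confident mistakes) and
# shrink toward zero for positive margins (confident correct answers).
for m in (-2, 0, 2):
    print(m, adaboost_loss(m), logitboost_loss(m), doom2_loss(m))
```

The exponential loss grows fastest for negative margins, which is one way to see the noise sensitivity noted in point 3: a badly mislabelled case attracts a huge penalty.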
Random Forests
Random Forests
Original idea from Leo Breiman and Adele Cutler.
The name is licensed to Salford Systems!
Hence, the R package is randomForest.
Typically presented in the context of decision trees.
Random Multinomial Logit uses multiple multinomial logit models.
Random Forests
Random Forests
Build many decision trees (e.g., 500).
For each tree:
Select a random subset of the training set (N);
Choose a different random subset of the variables (m << M) at each node of the decision tree;
Build the tree without pruning (i.e., overfit).
Classify a new entity using every decision tree:
Each tree "votes" for the entity.
The decision with the largest number of votes wins!
The proportion of votes is the resulting score.
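The voting step can be sketched without any tree machinery. Assuming each tree's class prediction for one entity is already available, a hypothetical Python fragment (the slides use R) for majority voting and scoring:

```python
from collections import Counter

def forest_classify(tree_votes):
    """Combine the individual trees' class votes for one entity.

    tree_votes: a list with one predicted class label per tree.
    Returns (winning class, proportion of votes) — the proportion
    is the score the slide describes.
    """
    counts = Counter(tree_votes)
    winner, n = counts.most_common(1)[0]   # class with the most votes
    return winner, n / len(tree_votes)

# Hypothetical votes from a 10-tree forest:
votes = ["No"] * 7 + ["Yes"] * 3
label, score = forest_classify(votes)
print(label, score)   # "No" with a score of 0.7
```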
Random Forests
Example: RF on Weather Data
set.seed(42)
(m <- randomForest(RainTomorrow ~ ., weather[train, -c(1:2, 23)],
na.action=na.roughfix,
importance=TRUE))
##
## Call:
## randomForest(formula=RainTomorrow ~ ., data=weath...
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 13.67%
## Confusion matrix:
## No Yes class.error
## No 211 4 0.0186
## Yes 31 10 0.7561
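The error figures in this output can be reproduced from the confusion matrix alone. A quick Python check (values copied from the slide):

```python
# Confusion matrix from the slide (rows: actual, cols: predicted).
conf = {"No":  {"No": 211, "Yes": 4},
        "Yes": {"No": 31,  "Yes": 10}}

n = sum(sum(row.values()) for row in conf.values())   # 256 training cases
errors = conf["No"]["Yes"] + conf["Yes"]["No"]        # 35 misclassified

print(round(100 * errors / n, 2))                     # 13.67, the OOB estimate
print(round(conf["Yes"]["No"] / (31 + 10), 4))        # 0.7561, class error for Yes
```

The high class error for Yes reflects the class imbalance in the weather data: rainy days are rare, so the forest defaults toward No.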
Random Forests
Example: Error Rate
Error rate decreases quickly then flattens over the 500 trees.
plot(m)
[Figure: error rate of m over the 500 trees, dropping quickly then flattening.]
Random Forests
Example: Variable Importance
Helps understand the knowledge captured.
varImpPlot(m, main="Variable Importance")
[Figure: variable importance for the 20 input variables, plotted by MeanDecreaseAccuracy (0 to 15) and MeanDecreaseGini (0 to 8).]
Random Forests
Example: Sample Trees
There are 500 trees in all. Here are some rules from the first tree.
## Random Forest Model 1
##
## ------------------------------------------------------...
## Tree 1 Rule 1 Node 30 Decision No
##
## 1: Evaporation <= 9
## 2: Humidity3pm <= 71
## 3: Cloud3pm <= 2.5
## 4: WindDir9am IN ("NNE")
## 5: Sunshine <= 10.25
## 6: Temp3pm <= 17.55
## ------------------------------------------------------...
## Tree 1 Rule 2 Node 31 Decision Yes
##
## 1: Evaporation <= 9
## 2: Humidity3pm <= 71
....
Random Forests
Example: Performance
predicted <- predict(m, weather[-train,], type="prob")[,2]
actual <- weather[-train,]$RainTomorrow
risks <- weather[-train,]$RISK_MM
riskchart(predicted, actual, risks)
[Figure: risk chart — Recall (92%), Risk (97%), and Precision curves of Performance (%) against Caseload (%).]
Random Forests
Features of Random Forests: By Breiman
Most accurate of current algorithms.
Runs efficiently on large data sets.
Can handle thousands of input variables.
Gives estimates of variable importance.
Other Ensembles Ensembles of Different Models
Other Ensembles
Netflix
Movie rental business - 100M customer movie ratings.
$1M for 10% improved root mean square error.
First annual award (Dec '07) to KorBell (AT&T): 8.43%, $50K.
Aggregate of the best other models!
Linear combination of 107 other models.
http://stat-computing.org/newsletter/v182.pdf
A lot of the different model builders deliver similar performance.
So why not build one of each model and combine!
In Rattle: Generate a Score file from all the models, and reload that into Rattle to explore.
Other Ensembles Ensembles of Different Models
Build a Model of Each Type
ds <- weather[train, -c(1:2, 23)]
form <- RainTomorrow ~ .
m.rp <- rpart(form, data=ds)
m.ada <- ada(form, data=ds)
m.rf <- randomForest(form, data=ds, na.action=na.roughfix,
importance=TRUE)
m.svm <- ksvm(form, data=ds, kernel="rbfdot", prob.model=TRUE)
m.glm <- glm(form, data=ds, family=binomial(link="logit"))
m.nn <- nnet(form, data=ds, size=10, skip=TRUE,
MaxNWts=10000, trace=FALSE, maxit=100)
Other Ensembles Ensembles of Different Models
Calculate Probabilities
ds <- weather[-train, -c(1:2, 23)]
ds <- na.omit(ds, "na.action")
pr <- data.frame(
obs=row.names(ds),
rp=predict(m.rp, ds)[,2],
ada=predict(m.ada, ds, type="prob")[,2],
rf=predict(m.rf, ds, type="prob")[,2],
svm=predict(m.svm, ds, type="probabilities")[,2],
glm=predict(m.glm, type="response", ds),
nn=predict(m.nn, ds))
prw <- pr
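One simple way to combine the columns of pr into a single ensemble score is to average the probabilities. A hypothetical Python sketch of that combination step (the slides do this in R, via Rattle's Score file):

```python
def ensemble_score(scores):
    """Average one observation's predicted probability across models.

    scores: dict mapping a model name to its predicted probability,
    i.e. one row of the pr data frame built on the slide.
    """
    return sum(scores.values()) / len(scores)

# Hypothetical probabilities of RainTomorrow for one observation:
row = {"rp": 0.20, "ada": 0.35, "rf": 0.30, "svm": 0.25, "glm": 0.40, "nn": 0.10}
print(ensemble_score(row))   # about 0.267
```

A simple average weights every model equally; a weighted average (e.g. weights fitted on a holdout set, as Netflix's winning linear combination did) is the natural next step.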
Other Ensembles Ensembles of Different Models
Plots—Weather Dataset
[Figure: histograms of the score distributions from each of the six models (rp, ada, rf, svm, glm, nn) on the weather dataset, counts up to 100 over scores 0.00 to 1.00.]
Other Ensembles Ensembles of Different Models
Plots—Audit Dataset
[Figure: the corresponding score histograms for the six models on the audit dataset, counts up to several hundred over scores 0.0 to 0.8.]
Other Ensembles Ensembles of Different Models
Correlation of Scores
The correlations between the scores obtained by the different models suggest quite an overlap in their abilities to extract the same knowledge.
[Figure: pairwise correlation plots of the model scores (rp, ada, rf, svm, glm) on the weather and audit datasets.]
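A pairwise score correlation like the ones plotted can be computed directly. A minimal pure-Python sketch with hypothetical score vectors for two models (in R this is simply cor on the pr columns):

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores from two models on five observations:
rf_scores  = [0.1, 0.8, 0.4, 0.9, 0.2]
ada_scores = [0.2, 0.7, 0.5, 0.8, 0.1]
print(round(pearson(rf_scores, ada_scores), 3))   # close to 1: the models largely agree
```

High correlations mean the models rank the cases similarly even when their absolute score distributions (the histograms above) look quite different.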
Summary Resources
Reference Book
Data Mining with Rattle and R
Graham Williams
2011, Springer, Use R!
ISBN: 978-1-4419-9889-7.
Chapters 12 and 13.
Summary Summary
Summary
Ensemble: Multiple models working together
Often better than a single model
Variance and bias of the model are reduced
The best available models today - accurate and robust
In daily use in very many areas of application