Data Analytics and Business Intelligence (8696/8697)
Ensemble Decision Trees
Graham Williams
Data Scientist, Australian Taxation Office
Adjunct Professor, University of Canberra
http://datamining.togaware.com
http://togaware.com Copyright © 2014, [email protected] 1/36
Overview
Overview
1 Overview
2 Multiple Models
3 Boosting: Algorithm; Example
4 Random Forests: Forests of Trees; Introduction
5 Other Ensembles: Ensembles of Different Models
Multiple Models
Building Multiple Models
General idea developed in the Multiple Inductive Learning algorithm (Williams 1987).
Ideas were developed (ACJ 1987, PhD 1990) in the context of:
observe that variable selection methods don't discriminate; so build multiple decision trees; then combine into a single model.
Basic idea is that multiple models, like multiple experts, may produce better results when working together, rather than in isolation.
Two approaches covered: Boosting and Random Forests.
Meta learners.
Boosting Algorithm
Boosting Algorithms
Basic idea: boost observations that are "hard to model."
Algorithm: iteratively build weak models using a poor learner:
Build an initial model;
Identify mis-classified cases in the training dataset;
Boost (over-represent) training observations modelled incorrectly;
Build a new model on the boosted training dataset;
Repeat.
The result is an ensemble of weighted models.
"Best off-the-shelf model builder." (Leo Breiman)
Boosting Algorithm
Algorithm in Pseudo Code
adaBoost <- function(form, data, learner)
{
  w <- rep(1/nrow(data), nrow(data))  # equal starting weights
  e <- NULL                           # error rate at each iteration
  a <- NULL                           # model weight (alpha) at each iteration
  m <- list()                         # the ensemble of weak models
  i <- 0
  repeat
  {
    i  <- i + 1
    m  <- c(m, learner(form, data, w))
    ms <- which(predict(m[[i]], data) != data[[target(form)]])  # misclassified cases
    e  <- c(e, sum(w[ms])/sum(w))     # weighted error of the new model
    a  <- c(a, log((1-e[i])/e[i]))
    w[ms] <- w[ms] * exp(a[i])        # boost the misclassified observations
    if (e[i] >= 0.5) break            # stop once no better than chance
  }
  return(sum(a * sapply(m, predict, data)))  # weighted combination of predictions
}
Boosting Algorithm
Distributions
[Figure: learning rate — alpha and e^alpha (log scale, 1e-01 to 1e+03) plotted against the error rate epsilon from 0.0 to 0.5.]
Boosting Example
Example: First Iteration
n <- 10
w <- rep(1/n, n) # 0.1 0.1 ...
ms <- c(7, 8, 9, 10)
e <- sum(w[ms])/sum(w) # 0.4
a <- log((1-e)/e) # 0.4055
w[ms] <- w[ms] * exp(a) # 0.15 0.15 0.15 0.15
Boosting Example
Example: Second Iteration
ms <- c(1, 8) # 0.10 0.15
w[ms]
## [1] 0.10 0.15
e <- sum(w[ms])/sum(w) # 0.2083
a <- log((1-e)/e) # 1.335
(w[ms] <- w[ms] * exp(a))
## [1] 0.38 0.57
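The weight arithmetic of the two iterations above can be replayed end-to-end. A minimal sketch in Python (rather than the R of the slides) reproducing the same numbers:

```python
import math

n = 10
w = [1.0 / n] * n                    # equal starting weights: 0.1 each

# First iteration: observations 7, 8, 9, 10 are misclassified.
ms = [6, 7, 8, 9]                    # the same cases, 0-based
e1 = sum(w[i] for i in ms) / sum(w)  # weighted error: 0.4
a1 = math.log((1 - e1) / e1)         # model weight: log(1.5), about 0.4055
for i in ms:
    w[i] *= math.exp(a1)             # each boosted from 0.10 to 0.15

# Second iteration: observations 1 and 8 are misclassified.
ms = [0, 7]
e2 = sum(w[i] for i in ms) / sum(w)  # (0.10 + 0.15) / 1.2, about 0.2083
a2 = math.log((1 - e2) / e2)         # about 1.335
for i in ms:
    w[i] *= math.exp(a2)             # 0.38 and 0.57, as on the slide

print(round(e1, 4), round(a1, 4), round(e2, 4), round(a2, 4))
print(round(w[0], 2), round(w[7], 2))
```

Note that the total weight grows each iteration (here from 1.0 to 1.2), which is why the error rate divides by sum(w) rather than assuming it is 1.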
Boosting Example
Example: Ada on Weather Data
head(weather[c(1:5, 23, 24)], 3)
## Date Location MinTemp MaxTemp Rainfall RISK_MM...
## 1 2007-11-01 Canberra 8.0 24.3 0.0 3.6...
## 2 2007-11-02 Canberra 14.0 26.9 3.6 3.6...
## 3 2007-11-03 Canberra 13.7 23.4 3.6 39.8...
....
set.seed(42)
train <- sample(1:nrow(weather), 0.7 * nrow(weather))
(m <- ada(RainTomorrow ~ ., weather[train, -c(1:2, 23)]))
## Call:
## ada(RainTomorrow ~ ., data=weather[train, -c(1:2, 23)])
##
## Loss: exponential Method: discrete Iteration: 50
....
Boosting Example
Example: Error Rate
Notice the error rate decreases quickly, then flattens.
plot(m)
[Figure: training error over iterations 1 to 50, falling from about 0.14 to about 0.06.]
Boosting Example
Example: Variable Importance
Helps understand the knowledge captured.
varplot(m)
[Figure: variable importance plot — scores from about 0.04 to 0.10 across the 20 input variables, including Temp3pm, Pressure9am, MinTemp, Humidity3pm, RainToday and others.]
Boosting Example
Example: Sample Trees
There are 50 trees in all. Here are the first three.
fancyRpartPlot(m$model$trees[[1]])
fancyRpartPlot(m$model$trees[[2]])
fancyRpartPlot(m$model$trees[[3]])
[Figure: the first three boosted trees. Tree 1 splits on Cloud3pm < 7.5, Pressure3pm >= 1012, and Humidity3pm < 42; Trees 2 and 3 both split first on Pressure3pm >= 1012, then on Sunshine >= 11 and MaxTemp >= 27 respectively.]
Boosting Example
Example: Performance
predicted <- predict(m, weather[-train,], type="prob")[,2]
actual <- weather[-train,]$RainTomorrow
risks <- weather[-train,]$RISK_MM
riskchart(predicted, actual, risks)
[Figure: risk chart — Recall (88%), Risk (94%), and Precision curves of Performance (%) against Caseload (%).]
Boosting Example
Example Applications
ATO Application: What life events affect compliance?
First application of the technology — 1995.
Decision Stumps: Age > NN; Change in Marital Status.
Boosted Neural Networks
OCR using neural networks as base learners (Drucker, Schapire, Simard, 1993).
Boosting Example
Summary
1 Boosting is implemented in R in the ada library
2 AdaBoost uses the exponential loss e^(-m); LogitBoost uses log(1 + e^(-m)); Doom II uses 1 - tanh(m)
3 AdaBoost tends to be sensitive to noise (addressed by BrownBoost)
4 AdaBoost tends not to overfit, and as new models are added, generalisation error tends to improve.
5 Can be proved to converge to a perfect model if the learners are always better than chance.
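The three loss functions in point 2 can be compared directly. A small Python sketch (the slides use R) evaluating each loss over a range of margins m:

```python
import math

def adaboost_loss(m):
    return math.exp(-m)                 # exponential loss

def logitboost_loss(m):
    return math.log(1 + math.exp(-m))   # logistic loss

def doom2_loss(m):
    return 1 - math.tanh(m)             # Doom II loss

# All three penalise negative margins (confident mistakes) and
# shrink toward zero for positive margins (confident correct answers).
for m in (-2, 0, 2):
    print(m, adaboost_loss(m), logitboost_loss(m), doom2_loss(m))
```

The exponential loss grows fastest for negative margins, which is one way to see the noise sensitivity noted in point 3: a badly mislabelled case attracts a huge penalty.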
Random Forests
Random Forests
Original idea from Leo Breiman and Adele Cutler.
The name is licensed to Salford Systems!
Hence, the R package is randomForest.
Typically presented in the context of decision trees.
Random Multinomial Logit uses multiple multinomial logit models.
Random Forests
Random Forests
Build many decision trees (e.g., 500).
For each tree:
Select a random subset of the training set (N);
Choose a different random subset of the variables (m << M) at each node of the decision tree;
Build the tree without pruning (i.e., overfit).
Classify a new entity using every decision tree:
Each tree "votes" for the entity.
The decision with the largest number of votes wins!
The proportion of votes is the resulting score.
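The voting step can be sketched without any tree machinery. Assuming each tree's class prediction for one entity is already available, a hypothetical Python fragment (the slides use R) for majority voting and scoring:

```python
from collections import Counter

def forest_classify(tree_votes):
    """Combine the individual trees' class votes for one entity.

    tree_votes: a list with one predicted class label per tree.
    Returns (winning class, proportion of votes) — the proportion
    is the score the slide describes.
    """
    counts = Counter(tree_votes)
    winner, n = counts.most_common(1)[0]   # class with the most votes
    return winner, n / len(tree_votes)

# Hypothetical votes from a 10-tree forest:
votes = ["No"] * 7 + ["Yes"] * 3
label, score = forest_classify(votes)
print(label, score)   # "No" with a score of 0.7
```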
Random Forests
Example: RF on Weather Data
set.seed(42)
(m <- randomForest(RainTomorrow ~ ., weather[train, -c(1:2, 23)],
na.action=na.roughfix,
importance=TRUE))
##
## Call:
## randomForest(formula=RainTomorrow ~ ., data=weath...
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 13.67%
## Confusion matrix:
## No Yes class.error
## No 211 4 0.0186
## Yes 31 10 0.7561
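The error figures in this output can be reproduced from the confusion matrix alone. A quick Python check (values copied from the slide):

```python
# Confusion matrix from the slide (rows: actual, cols: predicted).
conf = {"No":  {"No": 211, "Yes": 4},
        "Yes": {"No": 31,  "Yes": 10}}

n = sum(sum(row.values()) for row in conf.values())   # 256 training cases
errors = conf["No"]["Yes"] + conf["Yes"]["No"]        # 35 misclassified

print(round(100 * errors / n, 2))                     # 13.67, the OOB estimate
print(round(conf["Yes"]["No"] / (31 + 10), 4))        # 0.7561, class error for Yes
```

The high class error for Yes reflects the class imbalance in the weather data: rainy days are rare, so the forest defaults toward No.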
Random Forests
Example: Error Rate
Error rate decreases quickly then flattens over the 500 trees.
plot(m)
[Figure: error rate of m over the 500 trees, dropping quickly then flattening.]
Random Forests
Example: Variable Importance
Helps understand the knowledge captured.
varImpPlot(m, main="Variable Importance")
[Figure: variable importance for the 20 input variables, plotted by MeanDecreaseAccuracy (0 to 15) and MeanDecreaseGini (0 to 8).]
Random Forests
Example: Sample Trees
There are 500 trees in all. Here are some rules from the first tree.
## Random Forest Model 1
##
## ------------------------------------------------------...
## Tree 1 Rule 1 Node 30 Decision No
##
## 1: Evaporation <= 9
## 2: Humidity3pm <= 71
## 3: Cloud3pm <= 2.5
## 4: WindDir9am IN ("NNE")
## 5: Sunshine <= 10.25
## 6: Temp3pm <= 17.55
## ------------------------------------------------------...
## Tree 1 Rule 2 Node 31 Decision Yes
##
## 1: Evaporation <= 9
## 2: Humidity3pm <= 71
....
Random Forests
Example: Performance
predicted <- predict(m, weather[-train,], type="prob")[,2]
actual <- weather[-train,]$RainTomorrow
risks <- weather[-train,]$RISK_MM
riskchart(predicted, actual, risks)
[Figure: risk chart — Recall (92%), Risk (97%), and Precision curves of Performance (%) against Caseload (%).]
Random Forests
Features of Random Forests: By Breiman
Most accurate of current algorithms.
Runs efficiently on large data sets.
Can handle thousands of input variables.
Gives estimates of variable importance.
Other Ensembles Ensembles of Different Models
Other Ensembles
Netflix
Movie rental business - 100M customer movie ratings.
$1M for 10% improved root mean square error.
First annual award (Dec '07) to KorBell (AT&T): 8.43%, $50K.
Aggregate of the best other models!
Linear combination of 107 other models.
http://stat-computing.org/newsletter/v182.pdf
A lot of the different model builders deliver similar performance.
So why not build one of each model and combine!
In Rattle: Generate a Score file from all the models, and reload that into Rattle to explore.
Other Ensembles Ensembles of Different Models
Build a Model of Each Type
ds <- weather[train, -c(1:2, 23)]
form <- RainTomorrow ~ .
m.rp <- rpart(form, data=ds)
m.ada <- ada(form, data=ds)
m.rf <- randomForest(form, data=ds, na.action=na.roughfix,
importance=TRUE)
m.svm <- ksvm(form, data=ds, kernel="rbfdot", prob.model=TRUE)
m.glm <- glm(form, data=ds, family=binomial(link="logit"))
m.nn <- nnet(form, data=ds, size=10, skip=TRUE,
MaxNWts=10000, trace=FALSE, maxit=100)
Other Ensembles Ensembles of Different Models
Calculate Probabilities
ds <- weather[-train, -c(1:2, 23)]
ds <- na.omit(ds, "na.action")
pr <- data.frame(
obs=row.names(ds),
rp=predict(m.rp, ds)[,2],
ada=predict(m.ada, ds, type="prob")[,2],
rf=predict(m.rf, ds, type="prob")[,2],
svm=predict(m.svm, ds, type="probabilities")[,2],
glm=predict(m.glm, type="response", ds),
nn=predict(m.nn, ds))
prw <- pr
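One simple way to combine the columns of pr into a single ensemble score is to average the probabilities. A hypothetical Python sketch of that combination step (the slides do this in R, via Rattle's Score file):

```python
def ensemble_score(scores):
    """Average one observation's predicted probability across models.

    scores: dict mapping a model name to its predicted probability,
    i.e. one row of the pr data frame built on the slide.
    """
    return sum(scores.values()) / len(scores)

# Hypothetical probabilities of RainTomorrow for one observation:
row = {"rp": 0.20, "ada": 0.35, "rf": 0.30, "svm": 0.25, "glm": 0.40, "nn": 0.10}
print(ensemble_score(row))   # about 0.267
```

A simple average weights every model equally; a weighted average (e.g. weights fitted on a holdout set, as Netflix's winning linear combination did) is the natural next step.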
Other Ensembles Ensembles of Different Models
Plots—Weather Dataset
[Figure: histograms of the score distributions from each of the six models (rp, ada, rf, svm, glm, nn) on the weather dataset, counts up to 100 over scores 0.00 to 1.00.]
Other Ensembles Ensembles of Different Models
Plots—Audit Dataset
[Figure: the corresponding score histograms for the six models on the audit dataset, counts up to several hundred over scores 0.0 to 0.8.]
Other Ensembles Ensembles of Different Models
Correlation of Scores
The correlations between the scores obtained by the different models suggest quite an overlap in their abilities to extract the same knowledge.
[Figure: pairwise correlation plots of the model scores (rp, ada, rf, svm, glm) on the weather and audit datasets.]
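A pairwise score correlation like the ones plotted can be computed directly. A minimal pure-Python sketch with hypothetical score vectors for two models (in R this is simply cor on the pr columns):

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores from two models on five observations:
rf_scores  = [0.1, 0.8, 0.4, 0.9, 0.2]
ada_scores = [0.2, 0.7, 0.5, 0.8, 0.1]
print(round(pearson(rf_scores, ada_scores), 3))   # close to 1: the models largely agree
```

High correlations mean the models rank the cases similarly even when their absolute score distributions (the histograms above) look quite different.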
Summary Resources
Reference Book
Data Mining with Rattle and R
Graham Williams
2011, Springer, Use R!
ISBN: 978-1-4419-9889-7.
Chapters 12 and 13.
Summary Summary
Summary
Ensemble: Multiple models working together
Often better than a single model
Variance and bias of the model are reduced
The best available models today - accurate and robust
In daily use in very many areas of application