Data Mining: Ensemble Learning
Business Analytics Practice, Winter Term 2015/16
Stefan Feuerriegel
Today’s Lecture
Objectives
1 Creating and pruning decision trees
2 Combining an ensemble of trees to form a Random Forest
3 Understanding the idea and usage of Boosting and AdaBoost
Outline
1 Decision Trees
2 Concepts of Ensemble Learning
3 Random Forests
4 Boosting
5 AdaBoosting
Decision Trees
- Flowchart-like structure in which nodes represent tests on attributes
- End nodes (leaves) of each branch represent class labels
- Example: decision tree for playing tennis
[Figure: decision tree for playing tennis. The root tests Outlook: Overcast → Yes; Sunny → test Humidity (High → No, Normal → Yes); Rain → test Wind (Strong → No, Weak → Yes)]
Decision Trees
- Issues
  - How deep to grow?
  - How to handle continuous attributes?
  - How to choose an appropriate attribute selection measure?
  - How to handle data with missing attribute values?
- Advantages
  - Simple to understand and interpret
  - Requires only a few observations
  - Best and expected values can be determined for different scenarios
- Disadvantages
  - The Information Gain criterion is biased in favor of attributes with more levels (a worked information-gain example follows below)
  - Calculations become complex if values are uncertain and/or outcomes are linked
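
To make the Information Gain criterion concrete, here is a small worked example in R, assuming the classic 14-observation playing-tennis dataset (9 "Yes", 5 "No"; Outlook splits into Sunny 2/3, Overcast 4/0, Rain 3/2):

# Entropy of a vector of class proportions
entropy <- function(p) {
  p <- p[p > 0]  # by convention, 0 * log2(0) counts as 0
  -sum(p * log2(p))
}

H_root     <- entropy(c(9, 5) / 14)  # ~0.940 bits
H_sunny    <- entropy(c(2, 3) / 5)   # ~0.971
H_overcast <- entropy(c(4, 0) / 4)   #  0 (pure node)
H_rain     <- entropy(c(3, 2) / 5)   # ~0.971

# Information gain of splitting on Outlook
H_root - (5/14)*H_sunny - (4/14)*H_overcast - (5/14)*H_rain  # ~0.247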
Decision Trees in R
- Loading required libraries rpart, party and partykit

library(rpart)
library(party)
library(partykit)

- Accessing credit scores

library(caret)
data(GermanCredit)

- Split data into index subsets for training (20 %) and testing (80 %) instances

inTrain <- runif(nrow(GermanCredit)) < 0.2

- Building a decision tree with rpart(formula, method="class", data=d)

dt <- rpart(Class ~ Duration + Amount + Age,
            method="class", data=GermanCredit[inTrain,])
Decision Trees in R
- Plot decision tree using plot(dt) and text(dt)

plot(dt)
text(dt)
[Figure: base-R plot of the fitted tree dt. The root splits on Amount >= 3864, followed by further splits on Duration, Amount and Age; leaves are labeled Bad or Good]
Drawing Decision Trees Nicely
plot(as.party(dt))
Amount
1
≥ 3864.5 < 3864.5
Duration
2
< 19 ≥ 19
Node 3 (n = 8)
Goo
dB
ad
00.6
Duration
4
< 34.5≥ 34.5
Duration
5
≥ 25.5< 25.5
Node 6 (n = 8)
Goo
dB
ad
00.6Node 7 (n = 14)
Goo
dB
ad
00.6Node 8 (n = 16)
Goo
dB
ad
00.6
Amount
9
< 3565≥ 3565
Duration
10
≥ 11.5 < 11.5
Amount
11
< 1561 ≥ 1561
Amount
12
≥ 786.5< 786.5
Amount
13
< 1102.5≥ 1102.5
Node 14 (n = 7)
Goo
dB
ad
00.6
Node 15 (n = 25)
Goo
dB
ad
00.6Node 16 (n = 9)
Goo
dB
ad
00.6
Duration
17
≥ 15.5< 15.5
Age
18
< 35.5≥ 35.5
Amount
19
≥ 3015.5< 3015.5
Node 20 (n = 9)G
ood
Bad
00.6
Node 21 (n = 19)
Goo
dB
ad00.6
Node 22 (n = 17)
Goo
dB
ad
00.6
Node 23 (n = 22)
Goo
dB
ad
00.6
Node 24 (n = 34)
Goo
dB
ad
00.6
Node 25 (n = 12)
Goo
dB
ad
00.6
9Ensembles: Decision Trees
Complexity Parameter
printcp(dt)
##
## Classification tree:
## rpart(formula = Class ~ Duration + Amount + Age, data = GermanCredit[inTrain,
##     ], method = "class")
##
## Variables actually used in tree construction:
## [1] Age      Amount   Duration
##
## Root node error: 58/200 = 0.29
##
## n= 200
##
##         CP nsplit rel error  xerror    xstd
## 1 0.051724      0   1.00000 1.00000 0.11064
## 2 0.034483      2   0.89655 0.96552 0.10948
## 3 0.012931      4   0.82759 1.08621 0.11326
## 4 0.010000     12   0.72414 1.13793 0.11465
- Rows show results for trees with different numbers of nodes
- Cross-validation error is in column xerror
- The complexity parameter is in column CP; each value corresponds to a tree size, given as the number of splits (nsplit)
Pruning Decision Trees
- Reduce tree size by removing nodes with little predictive power
- Aim: minimize the cross-validation error in column xerror
m <- which.min(dt$cptable[, "xerror"])
- Index of the row with the smallest cross-validation error

m

## 2
## 2
- Optimal number of splits

dt$cptable[m, "nsplit"]

## [1] 2

- Choose the corresponding complexity parameter

dt$cptable[m, "CP"]

## [1] 0.03448276
Pruning Decision Trees
p <- prune(dt, cp = dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"])
plot(as.party(p))
[Figure: the pruned tree. Node 1 splits on Amount (≥ 3864.5 left, < 3864.5 right); node 2 splits on Duration (< 19 → node 3, n = 8; ≥ 19 → node 4, n = 38); node 5 (n = 154) is the right-hand leaf. Each leaf shows the distribution of Good vs. Bad]
Prediction with Decision Trees
- predict(dt, test, type="class") predicts classes on new data test
pred <- predict(p, GermanCredit[!inTrain,], type="class")
pred[1:5]

##    2    3    4    5    6
## Good Good Good Good Good
## Levels: Bad Good
- Output: the predicted label for each observation, together with all possible labels (levels)
- Confusion matrix via table(pred=pred_classes, true=true_classes)
# horizontal: true class; vertical: predicted class
table(pred=pred, true=GermanCredit[!inTrain,]$Class)

##       true
## pred   Bad Good
##   Bad   20   28
##   Good 280  671
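
As a quick follow-up (not on the original slides), the overall accuracy can be read off the same objects:

# Share of correctly classified test instances
mean(pred == GermanCredit$Class[!inTrain])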
Ensemble Learning
- Combine predictions of multiple learning algorithms → an ensemble
- Often leads to a better predictive performance than a single learner
- Well-suited when small differences in the training data produce very different classifiers (e. g. decision trees)
- Drawbacks: increases computation time, reduces interpretability

Reasoning
- Take classifiers C1, ..., CK which are independent, i. e. cor(Ci, Cj) = 0
- Each has an error probability of Pi < 0.5 on the training data
- Then an ensemble of these classifiers should have an error probability lower than each individual Pi
Example: Ensemble Learning
- Given K classifiers, each with the same error probability Pε = 0.3
- The probability that exactly L classifiers make an error is
$$\binom{K}{L}\, P_\varepsilon^{L}\, (1-P_\varepsilon)^{K-L}$$
[Figure: "Ensemble of 20 Classifiers" — probability (0.00 to 0.20) of each possible number of misclassifying classifiers (0 to 20)]
- For example, the probability of 10 or more classifiers making an error is about 0.05 (verified in the sketch below)
- Only if Pε > 0.5 does the error rate of the ensemble increase
Ensemble Learning
→ Various methods exist for ensemble learning
Constructing ensembles: methods for obtaining a set of classifiers
- Bagging (also named Bootstrap Aggregation)
- Random Forest
- Cross-validation (covered as part of resampling)
→ Instead of different classifiers, train the same classifier on different data
→ Since training data is expensive, reuse data by subsampling

Combining classifiers: methods for combining different classifiers
- Stacking
- Bayesian Model Averaging
- Boosting
- AdaBoost
Bagging: Bootstrap Aggregation
- Meta strategy designed to improve the accuracy of machine learning algorithms
- Improves unstable procedures → neural networks, trees and linear regression with subset selection, rule learning (as opposed to k-NN, linear regression, SVM)
- Idea: reuse the same training algorithm several times on different subsets of the training data
- When the classifier needs random initialization (e. g. k-means), vary these initializations across each run
Bagging: Bootstrap Aggregation
[Figure: bagging workflow. Step 1: create separate datasets D1, D2, ..., DK from the original training data D. Step 2: train multiple classifiers C1, C2, ..., CK. Step 3: combine the classifiers into C]
Bagging: Bootstrap Aggregation
Algorithm
- Given a training set D of size N
- Bagging generates new training sets Di of size M by sampling with replacement from D
- Some observations may be repeated in each Di
- If M = N, then on average 63.2 % (Breiman, 1996) of the original training dataset D is represented, the rest being duplicates
- Afterwards, train a classifier Ci on each Di separately (a minimal sketch follows below)
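
A minimal bagging sketch for the credit scoring data (illustrative code under the assumptions above, not taken from the slides):

library(rpart)
library(caret)
data(GermanCredit)

K <- 25
N <- nrow(GermanCredit)

# Steps 1-2: train K trees, each on a bootstrap sample of size M = N
trees <- lapply(1:K, function(k) {
  idx <- sample(N, size = N, replace = TRUE)
  rpart(Class ~ Duration + Amount + Age,
        method = "class", data = GermanCredit[idx, ])
})

# Step 3: combine the K classifiers by majority vote
votes <- sapply(trees, function(t)
  as.character(predict(t, newdata = GermanCredit, type = "class")))
pred <- factor(apply(votes, 1, function(v) names(which.max(table(v)))))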
Random Forests
- Random Forests are an ensemble learning method for classification and regression
- They combine multiple individual decision trees by means of bagging
- This mitigates the overfitting problem of individual decision trees
[Figure: three decision trees (Tree 1, Tree 2, Tree 3), each predicting "rain" or "no rain"; the forest combines their votes]
Random Forest: Algorithm
1 Create many decision trees by bagging
2 Inject randomness into the decision trees
  a. Each tree grows to its maximum size and is left unpruned
     - Deliberate overfitting: i. e. each tree is a good model on its own
  b. Each split is based on a randomly selected subset of attributes
     - Reduces the correlation between different trees
     - Otherwise, many trees would just select the very strong predictors
3 The ensemble of trees (i. e. the random forest) votes on categories by majority
Random Forest: Algorithm
1 Split the training data into K bootstrap samples by drawing samples from the training data with replacement
2 Fit an individual tree ti to each sample
3 Every regression tree predicts a value for unseen data
4 Average those predictions (see the sketch below) via
$$\hat{y} = \frac{1}{K} \sum_{i=1}^{K} t_i(x)$$
with $\hat{y}$ as the predicted response and $x = [x_1, \ldots, x_N]^T \in \mathcal{X}$ as the input parameters.
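
A minimal sketch of this averaging step, assuming trees is a list of fitted regression trees and newdata holds the unseen observations (hypothetical names, not from the slides):

# Average the predictions of the K regression trees
preds <- sapply(trees, predict, newdata = newdata)  # N x K matrix
y_hat <- rowMeans(preds)                            # ensemble prediction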
Advantages and Limitations
- Increasing the number of trees tends to decrease the variance of the model without increasing the bias
- Averaging reveals real structure that persists across datasets
- Noisy signals of individual trees cancel out
Advantages
- Simple algorithm that learns non-linearity
- Good performance in practice
- Fast training algorithm
- Resistant to overfitting

Limitations
- High memory consumption during tree construction
- Little performance gain from large training data
Random Forests in R
- Load required library randomForest

library(randomForest)

- Load dataset with credit scores

library(caret)
data(GermanCredit)

- Split data into index subsets for training (20 %) and testing (80 %) instances

inTrain <- runif(nrow(GermanCredit)) < 0.2
Random Forests in R
- Learn a random forest on the training data with randomForest(...)

rf <- randomForest(Class ~ .,
                   data=GermanCredit[inTrain,],
                   ntree=100)

- Options to control the behavior (combined in the sketch below)
  - ntree controls the number of trees (default: 500)
  - mtry gives the number of variables to choose from at each node
  - na.action specifies how to handle missing values
  - importance=TRUE calculates a variable importance metric
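
A sketch combining these options in a single call (the object name rf.opts and the chosen values are illustrative assumptions, not from the slides):

rf.opts <- randomForest(Class ~ .,
                        data = GermanCredit[inTrain,],
                        ntree = 100,          # number of trees
                        mtry = 5,             # candidate variables per split
                        na.action = na.omit,  # drop rows with missing values
                        importance = TRUE)    # track variable importance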
Random Forest in R
- Plot the estimated error across the number of decision trees

plot(rf)

[Figure: error of rf (roughly 0.2 to 0.6) versus the number of trees (0 to 100)]

Dotted lines represent the corresponding error of the individual classes and the solid black line represents the overall error
Random Forest in R
- Calculate the confusion matrix

rf$confusion

##      Bad Good class.error
## Bad   12   46  0.79310345
## Good  10  132  0.07042254

- Predict credit scores for the testing instances

pred <- predict(rf, newdata=GermanCredit[!inTrain,])
table(pred=pred, true=GermanCredit$Class[!inTrain])

##       true
## pred   Bad Good
##   Bad   97   29
##   Good 203  670
Variable Importance
- Coefficients normally tell the effect, but not its relevance
- The frequency and position at which variables appear in decision trees can be used for measuring variable importance
- Computed based on the corresponding reduction in accuracy when the predictor of interest is permuted
- The variable importance is
$$VI^{(t)}(x) = \frac{\sum_{i=1}^{K} I\left(y_i = \hat{y}_i^{(t)}\right)}{K} \;-\; \frac{\sum_{i=1}^{K} I\left(y_i = \hat{y}_i^{(t)} \text{ learned from permuted } x\right)}{K}$$
for tree t, with $y_i$ being the true class and $\hat{y}_i^{(t)}$ the predicted class
- A frequent alternative is the Gini importance index
Variable Importance in R
- Learn a random forest and enable the calculation of variable importance metrics via importance=TRUE

rf2 <- randomForest(Class ~ .,
                    data=GermanCredit,  # with full dataset
                    ntree=100,
                    importance=TRUE)
Variable Importance in R
- Plot variable importance via varImpPlot(rf, ...)

varImpPlot(rf2, type=1, n.var=5)

[Figure: variable importance plot for rf2; the top five variables by mean decrease in accuracy (roughly 5 to 11) are CheckingAccountStatus.none, CheckingAccountStatus.lt.0, Duration, CreditHistory.Critical and Amount]
- type chooses the importance metric (type=1 is the mean decrease in accuracy if the variable were randomly permuted)
- n.var denotes the number of variables shown
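
The scores behind the plot can also be inspected numerically; a small sketch using randomForest's importance() accessor:

# Numeric importance scores; type = 1 is the mean decrease in accuracy
imp <- importance(rf2, type = 1)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 5)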
Boosting
- Combine multiple classifiers to improve classification accuracy
- Works together with many different types of classifiers
- None of the classifiers needs to be extremely good, only better than chance
→ Extreme case: decision stumps (sketched below), i. e. single-split classifiers of the form
$$y(x) = \begin{cases} 1, & x_i \geq \theta \\ 0, & \text{otherwise} \end{cases}$$
- Idea: train classifiers on a subset of the training data that is most informative given the current classifiers
- Yields a sequential classifier selection
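
A decision stump can be emulated in R by restricting rpart to a single split; a minimal sketch (an illustrative addition, not from the slides):

library(rpart)
# A decision stump: a tree with exactly one split on one attribute
stump <- rpart(Class ~ Duration, data = GermanCredit, method = "class",
               control = rpart.control(maxdepth = 1))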
Boosting
High-level algorithm
1 Fit a simple model to a subsample of the data
2 Identify misclassified observations, i. e. those that are hard to predict
3 Focus subsequent learners on these samples and get them right
4 Combine the weak learners to form a complex predictor

Application: spam filtering
- First classifier: distinguish between emails from contacts and others
- Subsequent classifiers: focus on examples wrongly classified as spam (i. e. emails from others) and find words/phrases appearing in spam
- Combine into a final classifier that predicts spam accurately
Boosting: Example
Given training data D from which we sample without replacement

1 Sample N1 < N training examples D1 from D
  - Train weak classifier C1 on D1
2 Sample N2 < N training examples D2 from D, half of which were misclassified by C1
  - Train weak classifier C2 on D2
3 Identify all data D3 in D on which C1 and C2 disagree
  - Train weak classifier C3 on D3
Boosting: Example
4 Combine C1, C2, C3 to get the final classifier C by majority vote
  - E. g. on the misclassified red point, C1 voted for red but C2 and C3 voted for blue

Optimal number of samples Ni
- A reasonable guess is N1 = N2 = N3 ≈ N/3, but this is problematic
  - Simple problem: C1 explains most of the data, and N2 and N3 are small
  - Hard problem: C1 explains only a small part, and N2 is large
- Solution: run the boosting procedure several times and adjust N1
Boosting in R
- Load the required package mboost

library(mboost)

- Fit a generalized linear model via glmboost(...)

m.boost <- glmboost(Class ~ Amount + Duration
                    + Personal.Female.Single,
                    family=Binomial(),  # needed for classification
                    data=GermanCredit)
coef(m.boost)

##   (Intercept)        Amount      Duration
##  4.104949e-01 -1.144369e-05 -1.703911e-02
## attr(,"offset")
## [1] 0.4236489
- Unlike the standard glm(...) routine, the boosted version inherently performs variable selection (here, Personal.Female.Single receives no coefficient)
Boosting in R: Convergence Plot
- Partial effects show how the estimated coefficients evolve across iterations
- Plot the convergence of selected coefficients

plot(m.boost, ylim=range(coef(m.boost,
                              which=c("Amount", "Duration"))))
[Figure: coefficient paths of Duration and Amount across 100 boosting iterations (y-axis roughly −0.015 to −0.005) for glmboost(Class ~ Amount + Duration + Personal.Female.Single, data = GermanCredit, family = Binomial())]
Boosting in R
- The main parameter for tuning is the number of iterations mstop
- Use cross-validated estimates of the empirical risk to find the optimal number
→ Default is 25-fold bootstrap cross-validation

cv.boost <- cvrisk(m.boost)
mstop(cv.boost)  # optimal no. of iterations to prevent overfitting

## [1] 16
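
The fitted model can then be reduced to this cross-validated stopping point; a short sketch using mboost's subset operator (an assumed follow-up step, not shown on the slides):

# Set the model to the optimal number of boosting iterations
# (note: mboost's subset operator also changes m.boost in place)
m.boost[mstop(cv.boost)]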
plot(cv.boost, main="Cross-validated estimates of empirical risk")
[Figure: "Cross-validated estimates of empirical risk" — negative binomial likelihood (roughly 0.56 to 0.62) versus the number of boosting iterations (1 to 94)]
Boosting in R

Alternative: fit a generalized additive model via component-wise boosting

m.boost <- gamboost(Class ~ Amount + Duration,
                    family=Binomial(),  # needed for classification
                    data=GermanCredit)
m.boost

## Model-based Boosting
##
## Call:
## gamboost(formula = Class ~ Amount + Duration, data = GermanCredit, family = Binomial())
##
## Negative Binomial Likelihood
##
## Loss function: {
##     f <- pmin(abs(f), 36) * sign(f)
##     p <- exp(f)/(exp(f) + exp(-f))
##     y <- (y + 1)/2
##     -y * log(p) - (1 - y) * log(1 - p)
## }
##
## Number of boosting iterations: mstop = 100
## Step size: 0.1
## Offset: 0.4236489
## Number of baselearners: 2
AdaBoosting
Instead of resampling, reweight misclassified training examples (the standard update equations are sketched after the illustration)
Illustration
[Figure: weak classifiers C1, C2 and C3, trained in sequence on reweighted data]
⇒ Combine the weak classifiers C1, C2, C3 into the final classifier by majority vote
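
For reference, the standard discrete AdaBoost update (a well-known textbook formulation, added here and not spelled out on the original slides): in round m, the weighted error $\varepsilon_m$ of weak classifier $C_m$ determines its vote $\alpha_m$ and the new example weights $w_i$:
$$\alpha_m = \frac{1}{2}\ln\frac{1-\varepsilon_m}{\varepsilon_m}, \qquad w_i \leftarrow \frac{w_i \exp\left(\alpha_m\, I\big(y_i \neq C_m(x_i)\big)\right)}{Z_m}, \qquad C(x) = \operatorname{sign}\left(\sum_m \alpha_m C_m(x)\right)$$
where $Z_m$ normalizes the weights to sum to one; misclassified points gain weight, so subsequent learners focus on them.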
AdaBoost
Benefits
- Simple combination of multiple classifiers
- Easy implementation
- Different types of classifiers can be used
- Commonly used across many domains

Limitations
- Sensitive to misclassified points in the training data
AdaBoost in R
- Load required package ada

library(ada)

- Fit the AdaBoost model on the training data with ada(..., iter) given a fixed number iter of iterations

m.ada <- ada(Class ~ .,
             data=GermanCredit[inTrain,],
             iter=50)

- Evaluate on test data test.x with response test.y

m.ada.test <- addtest(m.ada,
                      test.x=GermanCredit[!inTrain,],
                      test.y=GermanCredit$Class[!inTrain])
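
Predictions on new data work as for the other models; a short sketch using ada's predict method (an addition, not on the original slides):

# Predict classes for the test instances and tabulate against the truth
pred.ada <- predict(m.ada, newdata = GermanCredit[!inTrain, ])
table(pred = pred.ada, true = GermanCredit$Class[!inTrain])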
AdaBoost in R
m.ada.test
## Call:
## ada(Class ~ ., data = GermanCredit[inTrain, ], iter = 50)
##
## Loss: exponential  Method: discrete  Iteration: 50
##
## Final Confusion Matrix for Data:
##           Final Prediction
## True value Bad Good
##       Bad   33   25
##       Good   2  140
##
## Train Error: 0.135
##
## Out-Of-Bag Error:  0.18  iteration= 50
##
## Additional Estimates of number of iterations:
##
## train.err1 train.kap1 test.errs2 test.kaps2
##         36         36         43         37
AdaBoost in R

Plot the error on training and testing data via plot(m, test=TRUE) for a model m

plot(m.ada.test, test=TRUE)
[Figure: "Training And Testing Error" across iterations 1 to 50 (error roughly 0.15 to 0.35); curve 1 = train, curve 2 = test]
AdaBoost in R

Similar to random forests, varplot(...) plots the importance of the top variables

varplot(m.ada.test, max.var.show=5)  # first 5 variables
[Figure: "Variable Importance Plot" with scores of roughly 0.070 to 0.080 for Amount, Duration, Age, OtherDebtorsGuarantors.Guarantor and NumberExistingCredits]
Summary
Decision Trees
- A highly visual tool for decision support, though with a risk of overfitting

Ensemble Learning: Random Forest, Boosting and AdaBoost
- Idea: combine an ensemble of learners to improve performance
- A random forest combines independent decision trees
- Boosting resamples the training data, whereas AdaBoost reweights the training data → subsequent learners focus on previous misclassifications
- The combined weak learners usually vote by majority on new data