Lecture 20: Bagging, Random Forests, Boosting
Reading: Chapter 8
STATS 202: Data mining and analysis
Jonathan Taylor
November 12, 2018
Slide credits: Sergio Bacallado
Classification and regression trees, in a nutshell

- Grow the tree by recursively splitting the samples in the leaf Ri according to Xj > s, such that (Ri, Xj, s) maximize the drop in RSS.
  → Greedy algorithm.
- Create a sequence of subtrees T0, T1, ..., Tm using a pruning algorithm.
- Select the best tree Ti (or the best α) by cross-validation.
Example. Heart dataset.
How do we deal with categorical predictors?
[Figure: unpruned classification tree for the Heart data, with splits on Thal, Ca, MaxHR, ChestPain, RestBP, Chol, Sex, Slope, Age, Oldpeak, and RestECG and Yes/No leaves; a plot of training, cross-validation, and test error against tree size; and the pruned tree, which retains splits on Thal, Ca, MaxHR, and ChestPain.]
Categorical predictors

- If there are only 2 categories, then the split is obvious. We don't have to choose the splitting point s, as for a numerical variable.
- If there are more than 2 categories:
  - Order the categories according to the average of the response:

    ChestPain : a > ChestPain : c > ChestPain : b

  - Treat as a numerical variable with this ordering, and choose a splitting point s.
  - One can show that this is the optimal way of partitioning.
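The ordering step can be sketched in a few lines. This is an illustrative stand-alone function, not code from the slides; the levels and responses below are made up to mirror the ChestPain example.

```python
# Order a categorical predictor's levels by mean response, so the
# variable can then be split like a numeric one (K levels -> only
# K-1 ordered cut points to consider instead of 2^(K-1)-1 partitions).

def order_levels(categories, response):
    """Return the levels sorted by average response, lowest first."""
    totals = {}
    for c, y in zip(categories, response):
        s, n = totals.get(c, (0.0, 0))
        totals[c] = (s + y, n + 1)
    return sorted(totals, key=lambda c: totals[c][0] / totals[c][1])

x = ["a", "b", "a", "c", "b", "c", "a"]
y = [1.0, 0.0, 0.8, 0.5, 0.1, 0.4, 0.9]
print(order_levels(x, y))  # ['b', 'c', 'a']: matches the ordering a > c > b
```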
Missing data

- Suppose we can assign every sample to a leaf Ri despite the missing data.
- When choosing a new split on variable Xj (growing the tree):
  - Only consider the samples for which Xj is observed.
  - In addition to choosing the best split, record a second-best split using a different variable, a third best, ...
- To propagate a sample down the tree, if it is missing the variable needed at a decision, try the second-best decision, then the third best, ... This is called a "surrogate split".
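The routing rule can be sketched as follows. The node representation (a dict with a "primary" split, a list of "surrogates", and a "majority" direction) is purely illustrative, not from any tree library; the cut values echo the Heart tree.

```python
# Surrogate splits, sketched: each internal node stores its primary
# split plus backup splits on other variables found when the tree was
# grown. To route a sample, use the first split whose variable is observed.

def route(node, x):
    """Return 'left' or 'right' for sample x (a dict; missing keys or None = missing)."""
    for var, cutoff in [node["primary"]] + node["surrogates"]:
        if x.get(var) is not None:
            return "left" if x[var] < cutoff else "right"
    return node["majority"]  # all splitting variables missing: follow the majority

node = {"primary": ("Ca", 0.5),
        "surrogates": [("MaxHR", 161.5), ("Age", 52)],
        "majority": "left"}

print(route(node, {"Ca": 0.0}))                 # primary split applies -> 'left'
print(route(node, {"Ca": None, "MaxHR": 170}))  # falls back to surrogate -> 'right'
```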
Bagging

- Bagging = Bootstrap Aggregating.
- In the Bootstrap, we replicate our dataset by sampling with replacement:
  - Original dataset: x = c(x1, x2, ..., x100)
  - Bootstrap samples: boot1 = sample(x, 100, replace = TRUE), ..., bootB = sample(x, 100, replace = TRUE).
- We used these samples to get the standard error of a parameter estimate:

  SE(β̂1) ≈ sqrt( (1/(B−1)) Σ_{b=1}^{B} ( β̂1^{(b)} − β̄1 )² ),  where β̄1 = (1/B) Σ_{b=1}^{B} β̂1^{(b)}.
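The bootstrap SE recipe can be run end to end with the standard library. This is a sketch in Python rather than the slides' R; the statistic is the sample mean and the data are simulated, purely for illustration.

```python
# Bootstrap standard error of a statistic: resample with replacement B
# times, recompute the statistic each time, and take the standard
# deviation over the B replicates.
import random
import statistics

random.seed(0)
x = [random.gauss(0, 1) for _ in range(100)]  # "original dataset"

B = 1000
boot_means = []
for _ in range(B):
    boot = random.choices(x, k=len(x))        # sample with replacement
    boot_means.append(statistics.mean(boot))

se_boot = statistics.stdev(boot_means)        # SE estimate
print(round(se_boot, 3))  # close to the theoretical 1/sqrt(100) = 0.1
```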
Bagging

- In Bagging, we average the predictions of a model fit to many Bootstrap samples.

Example. Bagging the Lasso

- Let ŷL,b be the prediction of the Lasso applied to the bth bootstrap sample.
- Bagging prediction:

  ŷboot = (1/B) Σ_{b=1}^{B} ŷL,b.
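The same recipe can be sketched with a one-split regression tree (a stump) standing in for the Lasso, to keep the example dependency-free. Everything here (the stump fitter, the toy data) is illustrative, not from the slides.

```python
# Bagging, sketched: fit the base learner to B bootstrap samples and
# average the B predictions at a new point x_new.
import random
import statistics

def fit_stump(xs, ys):
    """Best single RSS-minimizing split on one predictor, or None if degenerate."""
    best = None
    for cut in sorted(set(xs))[1:]:          # cuts with non-empty sides
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        lm, rm = statistics.mean(left), statistics.mean(right)
        rss = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or rss < best[0]:
            best = (rss, cut, lm, rm)
    return best

def bagged_predict(xs, ys, x_new, B=50):
    n, preds = len(xs), []
    for _ in range(B):
        idx = [random.randrange(n) for _ in range(n)]   # bootstrap sample
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        if stump is None:                               # degenerate resample: one x value
            preds.append(statistics.mean(ys[i] for i in idx))
        else:
            _, cut, lm, rm = stump
            preds.append(lm if x_new < cut else rm)
    return statistics.mean(preds)                       # y_boot: average over B fits

random.seed(1)
xs = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
ys = [0.0, 0.1, 0.0, 1.0, 0.9, 1.1]
pred = bagged_predict(xs, ys, 0.95)
print(round(pred, 2))  # near 1: x = 0.95 lands on the high side of most stumps
```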
When does Bagging make sense?

When a regression method or a classifier has a tendency to overfit, Bagging reduces the variance of the prediction.

- When n is large, the empirical distribution is similar to the true distribution of the samples.
- Bootstrap samples are therefore like independent realizations of the data.
- Bagging amounts to averaging the fits from many independent datasets, which would reduce the variance by a factor of 1/B.
Bagging decision trees

- Disadvantage: Every time we fit a decision tree to a Bootstrap sample, we get a different tree T^b.
  → Loss of interpretability.
- To measure variable importance nonetheless: for each predictor, add up the total amount by which the RSS (or Gini index) decreases every time we use the predictor in T^b.
- Average this total over the Bootstrap estimates T^1, ..., T^B.

[Figure: variable importance for the Heart data on a 0–100 scale; Thal, Ca, and ChestPain rank highest, followed by Oldpeak, MaxHR, RestBP, Age, Chol, Slope, Sex, ExAng, RestECG, and Fbs.]
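The accumulation step can be sketched directly. The per-split records below are made-up stand-ins for fitted trees, not output from any library.

```python
# Variable importance, sketched: sum each predictor's RSS decrease over
# the splits of every tree, then average over the B bagged trees.
from collections import defaultdict

# Hypothetical trees: each is a list of (variable, RSS decrease) per split.
trees = [
    [("Thal", 40.0), ("Ca", 25.0), ("MaxHR", 5.0)],
    [("Thal", 35.0), ("ChestPain", 12.0), ("Ca", 20.0)],
]

totals = defaultdict(float)
for tree in trees:
    for var, drop in tree:
        totals[var] += drop

B = len(trees)
importance = {var: total / B for var, total in totals.items()}
print(sorted(importance.items(), key=lambda kv: -kv[1]))
# [('Thal', 37.5), ('Ca', 22.5), ('ChestPain', 6.0), ('MaxHR', 2.5)]
```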
Out-of-bag (OOB) error

- To estimate the test error of a bagging estimate, we could use cross-validation.
- Each time we draw a Bootstrap sample, we only use about 63% of the observations: each observation is left out with probability (1 − 1/n)^n ≈ e^{−1} ≈ 0.37.
- Idea: use the rest of the observations as a test set.
- OOB error:
  - For each sample xi, find the predictions ŷi^b for all bootstrap samples b which do not contain xi. There should be around 0.37B of them. Average these predictions to obtain ŷi^oob.
  - Compute the error (yi − ŷi^oob)².
  - Average the errors over all observations i = 1, ..., n.
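The OOB bookkeeping can be run end to end with the standard library. The "model" here is just the mean of the bootstrap sample, a deliberately trivial stand-in so the in-bag/out-of-bag logic stays visible; the data are simulated.

```python
# OOB error, sketched: for each sample, average predictions from the
# bootstrap fits whose in-bag set did not contain it.
import random
import statistics

random.seed(2)
y = [random.gauss(0, 1) for _ in range(50)]
n, B = len(y), 200

fits = []  # (predicted value, set of in-bag indices) per bootstrap fit
for _ in range(B):
    idx = [random.randrange(n) for _ in range(n)]
    fits.append((statistics.mean(y[i] for i in idx), set(idx)))

errors = []
for i in range(n):
    oob_preds = [pred for pred, inbag in fits if i not in inbag]
    if oob_preds:  # around 0.37*B fits leave sample i out
        y_oob = statistics.mean(oob_preds)
        errors.append((y[i] - y_oob) ** 2)

oob_error = statistics.mean(errors)
print(round(oob_error, 3))  # OOB estimate of the test MSE
```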
Out-of-bag (OOB) error
[Figure: test and OOB error against the number of trees B, for Bagging and for Random Forests, with errors ranging roughly from 0.10 to 0.30.]

The test error decreases as we increase B (the dashed line is the error for a plain decision tree).
Random Forests

Bagging has a problem:
→ The trees produced by different Bootstrap samples can be very similar.

Random Forests:
- We fit a decision tree to different Bootstrap samples.
- When growing the tree, we select a random sample of m < p predictors to consider at each step.
- This leads to very different (or "uncorrelated") trees from each sample.
- Finally, average the predictions of the trees.
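The random-forest twist on top of bagging is only the candidate-set restriction at each split, which can be sketched in isolation. The predictor names echo the Heart data; the function is illustrative, not from any forest library.

```python
# At every split, the tree may only use a random subset of m of the p
# predictors; m ≈ sqrt(p) is the common default for classification.
import math
import random

random.seed(3)
predictors = ["Thal", "Ca", "MaxHR", "RestBP", "Chol", "Age", "Sex", "Slope", "Oldpeak"]
p = len(predictors)
m = round(math.sqrt(p))  # here p = 9, so m = 3

def candidates_for_split():
    """Predictors eligible at one split of one tree."""
    return random.sample(predictors, m)

# Consecutive splits see different candidate sets, which decorrelates the trees:
print(candidates_for_split())
print(candidates_for_split())
```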
Random Forests vs. Bagging
[Figure: test and OOB error against the number of trees for Bagging and for Random Forests; the Random Forest curves lie below the Bagging curves.]
Random Forests, choosing m
[Figure: test classification error against the number of trees for m = p, m = p/2, and m = √p; m = √p gives the lowest error.]

The optimal m is usually around √p, but this can be used as a tuning parameter.
Boosting regression trees

1. Set f̂(x) = 0, and ri = yi for i = 1, ..., n.
2. For b = 1, ..., B, iterate:
   2.1 Fit a decision tree f̂^b with d splits to the response r1, ..., rn.
   2.2 Update the prediction:
       f̂(x) ← f̂(x) + λ f̂^b(x).
   2.3 Update the residuals:
       ri ← ri − λ f̂^b(xi).
3. Output the final model:
   f̂(x) = Σ_{b=1}^{B} λ f̂^b(x).
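The loop above can be sketched with depth-1 trees (stumps, d = 1) on a single predictor, standard library only. The stump fitter and toy data are illustrative; the three numbered steps map onto the comments.

```python
# Boosting regression trees, sketched: repeatedly fit a small tree to
# the current residuals and add a shrunken copy (factor λ) to the model.
import statistics

def fit_stump(xs, rs):
    """Best single RSS-minimizing split; returns (cut, left_mean, right_mean)."""
    best = None
    for cut in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, rs) if x < cut]
        right = [r for x, r in zip(xs, rs) if x >= cut]
        lm, rm = statistics.mean(left), statistics.mean(right)
        rss = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or rss < best[0]:
            best = (rss, cut, lm, rm)
    return best[1:]

def boost(xs, ys, B=200, lam=0.1):
    r = list(ys)                          # 1. residuals start at y_i
    stumps = []
    for _ in range(B):                    # 2. iterate b = 1, ..., B
        cut, lm, rm = fit_stump(xs, r)    # 2.1 fit tree to residuals
        stumps.append((cut, lm, rm))
        for i, x in enumerate(xs):        # 2.3 shrink residuals by λ·f^b(x_i)
            r[i] -= lam * (lm if x < cut else rm)
    def f(x):                             # 3. final model: sum of λ·f^b(x)
        return sum(lam * (lm if x < cut else rm) for cut, lm, rm in stumps)
    return f

xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ys = [0.1, 0.0, 0.2, 0.9, 1.1, 1.0]
f = boost(xs, ys)
print(round(f(0.9), 2))  # near 1, once the residuals have been driven down
```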
Boosting, intuitively

Boosting learns slowly: we first fit the samples that are easiest to predict, then gradually down-weight these cases, moving on to harder samples.
Boosting vs. random forests
[Figure: test classification error against the number of trees (up to 5000) for Boosting with depth 1, Boosting with depth 2, and a Random Forest with m = √p.]

The parameter λ = 0.01 in each case. We can tune the model by cross-validation over λ, d, and B.