Transcript of Lecture 20: Bagging, Random Forests, Boosting

Lecture 20: Bagging, Random Forests, Boosting

Reading: Chapter 8

STATS 202: Data mining and analysis

Jonathan Taylor
November 12, 2018

Slide credits: Sergio Bacallado



Classification and regression trees, in a nutshell

- Grow the tree by recursively splitting the samples in the leaf Ri according to Xj > s, such that (Ri, Xj, s) maximize the drop in RSS.

→ Greedy algorithm.

- Create a sequence of subtrees T0, T1, ..., Tm using a pruning algorithm.

- Select the best tree Ti (or the best α) by cross-validation.

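As a concrete illustration of this recipe (not part of the original slides), here is a minimal R sketch using the rpart package; the data frame heart, with binary response AHD, is assumed to be a locally loaded copy of the Heart dataset.

    library(rpart)

    # Grow a large tree by greedy recursive binary splitting.
    fit <- rpart(AHD ~ ., data = heart, method = "class",
                 control = rpart.control(cp = 0, minsplit = 10, xval = 10))

    # rpart stores the sequence of cost-complexity subtrees and their
    # cross-validated error in the cp table.
    printcp(fit)

    # Pick the subtree (equivalently, the value of alpha/cp) with the
    # smallest cross-validated error and prune back to it.
    best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
    pruned  <- prune(fit, cp = best_cp)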


Example. Heart dataset.

How do we deal with categorical predictors?

[Figure: the Heart data. Left: a large unpruned classification tree with splits on Thal, Ca, MaxHR, RestBP, Chol, ChestPain, Sex, Slope, Age, Oldpeak, and RestECG. Center: training, cross-validation, and test error (roughly 0.0 to 0.6) as a function of tree size (up to about 15 terminal nodes). Right: the pruned tree selected by cross-validation, splitting on Thal, Ca, MaxHR, and ChestPain.]


Categorical predictors

- If there are only 2 categories, then the split is obvious. We don't have to choose the splitting point s, as for a numerical variable.

- If there are more than 2 categories:

  - Order the categories according to the average of the response:

    ChestPain:a > ChestPain:c > ChestPain:b

  - Treat it as a numerical variable with this ordering, and choose a splitting point s.

  - One can show that this is the optimal way of partitioning.

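A small R sketch of this trick (an illustration, not from the slides): order the levels of a categorical predictor by their mean response, then treat the predictor as numeric. The data frame heart, its factor ChestPain, and a 0/1 response vector y are assumptions.

    # Mean of the response within each category.
    level_means <- tapply(y, heart$ChestPain, mean)
    ordered_levels <- names(sort(level_means))

    # Recode each category as its rank under this ordering; the predictor can
    # now be split with a single numeric threshold s.
    x_num <- match(heart$ChestPain, ordered_levels)

    # Candidate split points are the midpoints between consecutive ranks.
    candidate_s <- seq_along(ordered_levels)[-1] - 0.5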


Missing data

- Suppose we can assign every sample to a leaf Ri despite the missing data.

- When choosing a new split on variable Xj (growing the tree):

  - Only consider the samples for which Xj is observed.

  - In addition to choosing the best split, choose a second-best split using a different variable, a third best, and so on.

- To propagate a sample down the tree, if it is missing the variable needed to make a decision, try the second-best decision, or the third best, and so on. These backup rules are called "surrogate splits".

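The rpart package implements this idea through surrogate splits; a minimal sketch (same assumed heart data frame, now possibly containing NAs) follows.

    library(rpart)

    # maxsurrogate sets how many backup splits are stored at each node;
    # usesurrogate = 2 sends an observation that is missing the primary split
    # variable down the best available surrogate (or with the majority branch
    # if all surrogates are missing too).
    fit <- rpart(AHD ~ ., data = heart, method = "class",
                 control = rpart.control(maxsurrogate = 5, usesurrogate = 2))

    # summary(fit) lists the primary split and its surrogates at each node.
    summary(fit)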


Bagging

- Bagging = Bootstrap Aggregating.

- In the bootstrap, we replicate our dataset by sampling with replacement:

  - Original dataset: x = c(x1, x2, ..., x100)

  - Bootstrap samples: boot1 = sample(x, 100, replace = TRUE), ..., bootB = sample(x, 100, replace = TRUE).

- We used these samples to estimate the standard error of a parameter estimate:

SE(β1) ≈ sqrt( (1/(B−1)) Σ_{b=1}^{B} ( β1^(b) − β̄1 )² ),   where β̄1 is the average of the B bootstrap estimates β1^(b).

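A short R sketch of this bootstrap standard error (an illustration, assuming a simple linear model y = β0 + β1·x + ε with n = 100 observations):

    set.seed(1)
    n <- 100
    x <- rnorm(n)
    y <- 2 + 3 * x + rnorm(n)

    B <- 1000
    beta1_boot <- numeric(B)
    for (b in 1:B) {
      idx <- sample(n, n, replace = TRUE)            # bootstrap sample of row indices
      beta1_boot[b] <- coef(lm(y[idx] ~ x[idx]))[2]  # refit and keep the slope
    }

    sd(beta1_boot)   # bootstrap estimate of SE(beta1_hat)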


Bagging

- In bagging, we average the predictions of a model fit to many bootstrap samples.

Example. Bagging the Lasso

- Let ŷL,b be the prediction of the Lasso applied to the bth bootstrap sample.

- Bagging prediction:

ŷboot = (1/B) Σ_{b=1}^{B} ŷL,b.

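A minimal R sketch of bagging the Lasso with the glmnet package (an illustration; the training matrix X, response y, and test matrix X_test are assumptions):

    library(glmnet)

    B <- 100
    preds <- matrix(NA, nrow = nrow(X_test), ncol = B)

    for (b in 1:B) {
      idx <- sample(nrow(X), nrow(X), replace = TRUE)
      # Fit the Lasso to the b-th bootstrap sample, picking lambda by CV.
      cvfit <- cv.glmnet(X[idx, ], y[idx], alpha = 1)
      preds[, b] <- predict(cvfit, newx = X_test, s = "lambda.min")
    }

    y_boot <- rowMeans(preds)   # bagged prediction: average of the B Lasso fits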


When does Bagging make sense?

When a regression method or a classifier has a tendency to overfit, bagging reduces the variance of the prediction.

- When n is large, the empirical distribution is similar to the true distribution of the samples.

- Bootstrap samples are like independent realizations of the data.

- Bagging amounts to averaging the fits from many independent datasets, which would reduce the variance by a factor of 1/B.

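A tiny simulation (not from the slides) of the idealized 1/B variance reduction: averaging B truly independent estimates divides the variance by B. With bootstrap samples the fits are correlated, so the actual reduction is smaller.

    set.seed(1)
    B <- 25
    single <- replicate(5000, mean(rnorm(50)))                      # one fit
    bagged <- replicate(5000, mean(replicate(B, mean(rnorm(50)))))  # average of B independent fits

    var(single)   # about 1/50 = 0.02
    var(bagged)   # about 0.02 / B = 0.0008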


Bagging decision trees

- Disadvantage: every time we fit a decision tree to a bootstrap sample, we get a different tree T^b.

→ Loss of interpretability.

- For each predictor, add up the total amount by which the RSS (or Gini index) decreases every time we use the predictor in T^b.

- Average this total over the bootstrap estimates T^1, ..., T^B to get a measure of variable importance.

[Figure: variable importance for the Heart data predictors (scaled 0 to 100), in decreasing order: Thal, Ca, ChestPain, Oldpeak, MaxHR, RestBP, Age, Chol, Slope, Sex, ExAng, RestECG, Fbs.]

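In R, bagged trees and a variable importance plot like the one above can be produced with the randomForest package by letting every predictor be considered at every split (a sketch; the heart data frame with response AHD and p = 13 predictors is assumed):

    library(randomForest)

    # Bagging is a random forest with m = p.
    bag <- randomForest(AHD ~ ., data = heart, mtry = 13, ntree = 500,
                        importance = TRUE, na.action = na.omit)

    importance(bag)   # per-predictor decrease in accuracy / Gini, averaged over trees
    varImpPlot(bag)   # bar plot of variable importance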


Out-of-bag (OOB) error

- To estimate the test error of a bagging estimate, we could use cross-validation.

- Each time we draw a bootstrap sample, we only use about 63% of the observations: each observation is included with probability 1 − (1 − 1/n)^n ≈ 1 − e^(−1) ≈ 0.63.

- Idea: use the rest of the observations as a test set.

- OOB error:

  - For each sample xi, find the prediction ŷi^b for every bootstrap sample b that does not contain xi. There should be around 0.37B of them. Average these predictions to obtain ŷi^oob.

  - Compute the error (yi − ŷi^oob)².

  - Average the errors over all observations i = 1, ..., n.

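A from-scratch R sketch of the OOB error for bagged regression trees (an illustration; a data frame dat with numeric response y is assumed). The randomForest package reports the same quantity automatically.

    library(rpart)

    n <- nrow(dat); B <- 500
    pred_sum <- numeric(n)   # running sum of OOB predictions per observation
    pred_cnt <- numeric(n)   # number of trees for which each observation was OOB

    for (b in 1:B) {
      idx  <- sample(n, n, replace = TRUE)
      oob  <- setdiff(1:n, idx)              # the ~0.37 n observations left out
      tree <- rpart(y ~ ., data = dat[idx, ])
      pred_sum[oob] <- pred_sum[oob] + predict(tree, newdata = dat[oob, ])
      pred_cnt[oob] <- pred_cnt[oob] + 1
    }

    y_oob   <- pred_sum / pred_cnt           # average over the ~0.37 B OOB trees
    oob_mse <- mean((dat$y - y_oob)^2)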


Out-of-bag (OOB) error

[Figure: test and OOB error for bagging and random forests as a function of the number of trees B (0 to 300); errors roughly between 0.10 and 0.30.]

The test error decreases as we increase B (the dashed line is the error for a plain decision tree).


Random Forests

Bagging has a problem:

→ The trees produced by different bootstrap samples can be very similar.

Random Forests:

- We fit a decision tree to each of the bootstrap samples.

- When growing the tree, we select a random sample of m < p predictors to consider at each step.

- This will lead to very different (or "uncorrelated") trees from each sample.

- Finally, average the predictions of the trees.

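A minimal randomForest sketch (same assumed heart data frame); the only change relative to bagging is mtry, the number of predictors sampled at each split:

    library(randomForest)

    # m = sqrt(p) is the usual choice for classification (p = 13, so m = 4 here).
    rf <- randomForest(AHD ~ ., data = heart, mtry = 4, ntree = 500,
                       na.action = na.omit)
    rf   # printing the fit shows the OOB estimate of the error rate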


Random Forests vs. Bagging

[Figure: test and OOB error for bagging and random forests as a function of the number of trees (0 to 300); errors roughly between 0.10 and 0.30.]


Random Forests, choosing m

[Figure: test classification error as a function of the number of trees (0 to 500) for m = p, m = p/2, and m = √p.]

The optimal m is usually around √p, but this can be used as a tuning parameter.
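A sketch of treating m as a tuning parameter, comparing a few values by their OOB error (same assumptions as before):

    library(randomForest)

    p <- 13
    for (m in c(floor(sqrt(p)), floor(p / 2), p)) {
      rf <- randomForest(AHD ~ ., data = heart, mtry = m, ntree = 500,
                         na.action = na.omit)
      # OOB misclassification rate after the full set of trees
      cat("m =", m, " OOB error =", rf$err.rate[rf$ntree, "OOB"], "\n")
    }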


Boosting regression trees

1. Set f(x) = 0, and ri = yi for i = 1, ..., n.

2. For b = 1, ..., B, iterate:

2.1 Fit a decision tree f^b with d splits to the response r1, ..., rn.

2.2 Update the prediction to:

f(x) ← f(x) + λ f^b(x).

2.3 Update the residuals:

ri ← ri − λ f^b(xi).

3. Output the final model:

f(x) = Σ_{b=1}^{B} λ f^b(x).

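A direct R translation of this algorithm for a regression problem (a sketch; a data frame dat with numeric response y is assumed, and rpart's maxdepth is used as a stand-in for "d splits"):

    library(rpart)

    boost_trees <- function(dat, B = 1000, lambda = 0.01, d = 1) {
      r <- dat$y                      # step 1: f(x) = 0, residuals = response
      trees <- vector("list", B)
      for (b in 1:B) {                # step 2
        dat$r <- r
        # 2.1: fit a small tree to the current residuals
        trees[[b]] <- rpart(r ~ . - y, data = dat,
                            control = rpart.control(maxdepth = d, cp = 0))
        # 2.2 / 2.3: shrink the new tree's fit and update the residuals
        r <- r - lambda * predict(trees[[b]], newdata = dat)
      }
      trees                           # step 3: f(x) = sum_b lambda * f^b(x)
    }

    # Prediction: add up the shrunken trees.
    boost_predict <- function(trees, newdata, lambda = 0.01) {
      preds <- sapply(trees, predict, newdata = newdata)
      lambda * rowSums(preds)
    }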


Boosting, intuitively

Boosting learns slowly:

We first use the samples that are easiest to predict, then slowly down-weight these cases, moving on to the harder samples.


Boosting vs. random forests

[Figure: test classification error as a function of the number of trees (0 to 5,000) for boosting with interaction depth 1, boosting with depth 2, and a random forest with m = √p.]

The parameter λ = 0.01 in each case. We can tune the model by cross-validation over λ, d, and B.

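In practice one would usually reach for a boosting package rather than the hand-rolled loop above; here is a sketch with gbm (an assumption, not mentioned in the slides), tuning B by cross-validation for fixed λ and d. The data frames dat and dat_test are assumptions.

    library(gbm)

    fit <- gbm(y ~ ., data = dat, distribution = "gaussian",
               n.trees = 5000,          # B
               interaction.depth = 1,   # d
               shrinkage = 0.01,        # lambda
               cv.folds = 5)

    best_B <- gbm.perf(fit, method = "cv")   # number of trees minimizing the CV error
    yhat   <- predict(fit, newdata = dat_test, n.trees = best_B)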