Lecture 20: Bagging, Random Forests, Boosting
Reading: Chapter 8
STATS 202: Data mining and analysis
Jonathan Taylor
November 12, 2018
Slide credits: Sergio Bacallado
Classification and regression trees, in a nutshell

- Grow the tree by recursively splitting the samples in the leaf Ri according to Xj > s, such that (Ri, Xj, s) maximize the drop in RSS.
  → Greedy algorithm.
- Create a sequence of subtrees T0, T1, ..., Tm using a pruning algorithm.
- Select the best tree Ti (or the best α) by cross-validation.
Example. Heart dataset.
How do we deal with categorical predictors?
[Figure: unpruned classification tree for the Heart data, with splits on Thal, Ca, MaxHR, ChestPain, RestBP, Chol, Sex, Slope, Age, Oldpeak, and RestECG and Yes/No leaves; a plot of training, cross-validation, and test error against tree size; and the pruned tree, which retains splits on Thal, Ca, MaxHR, and ChestPain.]
Categorical predictors

- If there are only 2 categories, then the split is obvious. We don't have to choose the splitting point s, as for a numerical variable.
- If there are more than 2 categories:
  - Order the categories according to the average of the response:

    ChestPain : a > ChestPain : c > ChestPain : b

  - Treat as a numerical variable with this ordering, and choose a splitting point s.
  - One can show that this is the optimal way of partitioning.
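The ordering step can be sketched in a few lines. This is an illustrative stand-alone function, not code from the slides; the levels and responses below are made up to mirror the ChestPain example.

```python
# Order a categorical predictor's levels by mean response, so the
# variable can then be split like a numeric one (K levels -> only
# K-1 ordered cut points to consider instead of 2^(K-1)-1 partitions).

def order_levels(categories, response):
    """Return the levels sorted by average response, lowest first."""
    totals = {}
    for c, y in zip(categories, response):
        s, n = totals.get(c, (0.0, 0))
        totals[c] = (s + y, n + 1)
    return sorted(totals, key=lambda c: totals[c][0] / totals[c][1])

x = ["a", "b", "a", "c", "b", "c", "a"]
y = [1.0, 0.0, 0.8, 0.5, 0.1, 0.4, 0.9]
print(order_levels(x, y))  # ['b', 'c', 'a']: matches the ordering a > c > b
```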
Missing data

- Suppose we can assign every sample to a leaf Ri despite the missing data.
- When choosing a new split on variable Xj (growing the tree):
  - Only consider the samples for which Xj is observed.
  - In addition to choosing the best split, record a second-best split using a different variable, a third best, ...
- To propagate a sample down the tree, if it is missing the variable needed at a decision, try the second-best decision, then the third best, ... This is called a "surrogate split".
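The routing rule can be sketched as follows. The node representation (a dict with a "primary" split, a list of "surrogates", and a "majority" direction) is purely illustrative, not from any tree library; the cut values echo the Heart tree.

```python
# Surrogate splits, sketched: each internal node stores its primary
# split plus backup splits on other variables found when the tree was
# grown. To route a sample, use the first split whose variable is observed.

def route(node, x):
    """Return 'left' or 'right' for sample x (a dict; missing keys or None = missing)."""
    for var, cutoff in [node["primary"]] + node["surrogates"]:
        if x.get(var) is not None:
            return "left" if x[var] < cutoff else "right"
    return node["majority"]  # all splitting variables missing: follow the majority

node = {"primary": ("Ca", 0.5),
        "surrogates": [("MaxHR", 161.5), ("Age", 52)],
        "majority": "left"}

print(route(node, {"Ca": 0.0}))                 # primary split applies -> 'left'
print(route(node, {"Ca": None, "MaxHR": 170}))  # falls back to surrogate -> 'right'
```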
Bagging

- Bagging = Bootstrap Aggregating.
- In the Bootstrap, we replicate our dataset by sampling with replacement:
  - Original dataset: x = c(x1, x2, ..., x100)
  - Bootstrap samples: boot1 = sample(x, 100, replace = TRUE), ..., bootB = sample(x, 100, replace = TRUE).
- We used these samples to get the standard error of a parameter estimate:

  SE(β̂1) ≈ sqrt( (1/(B−1)) Σ_{b=1}^{B} ( β̂1^{(b)} − β̄1 )² ),  where β̄1 = (1/B) Σ_{b=1}^{B} β̂1^{(b)}.
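The bootstrap SE recipe can be run end to end with the standard library. This is a sketch in Python rather than the slides' R; the statistic is the sample mean and the data are simulated, purely for illustration.

```python
# Bootstrap standard error of a statistic: resample with replacement B
# times, recompute the statistic each time, and take the standard
# deviation over the B replicates.
import random
import statistics

random.seed(0)
x = [random.gauss(0, 1) for _ in range(100)]  # "original dataset"

B = 1000
boot_means = []
for _ in range(B):
    boot = random.choices(x, k=len(x))        # sample with replacement
    boot_means.append(statistics.mean(boot))

se_boot = statistics.stdev(boot_means)        # SE estimate
print(round(se_boot, 3))  # close to the theoretical 1/sqrt(100) = 0.1
```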
Bagging

- In Bagging, we average the predictions of a model fit to many Bootstrap samples.

Example. Bagging the Lasso

- Let ŷL,b be the prediction of the Lasso applied to the bth bootstrap sample.
- Bagging prediction:

  ŷboot = (1/B) Σ_{b=1}^{B} ŷL,b.
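The same recipe can be sketched with a one-split regression tree (a stump) standing in for the Lasso, to keep the example dependency-free. Everything here (the stump fitter, the toy data) is illustrative, not from the slides.

```python
# Bagging, sketched: fit the base learner to B bootstrap samples and
# average the B predictions at a new point x_new.
import random
import statistics

def fit_stump(xs, ys):
    """Best single RSS-minimizing split on one predictor, or None if degenerate."""
    best = None
    for cut in sorted(set(xs))[1:]:          # cuts with non-empty sides
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        lm, rm = statistics.mean(left), statistics.mean(right)
        rss = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or rss < best[0]:
            best = (rss, cut, lm, rm)
    return best

def bagged_predict(xs, ys, x_new, B=50):
    n, preds = len(xs), []
    for _ in range(B):
        idx = [random.randrange(n) for _ in range(n)]   # bootstrap sample
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        if stump is None:                               # degenerate resample: one x value
            preds.append(statistics.mean(ys[i] for i in idx))
        else:
            _, cut, lm, rm = stump
            preds.append(lm if x_new < cut else rm)
    return statistics.mean(preds)                       # y_boot: average over B fits

random.seed(1)
xs = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
ys = [0.0, 0.1, 0.0, 1.0, 0.9, 1.1]
pred = bagged_predict(xs, ys, 0.95)
print(round(pred, 2))  # near 1: x = 0.95 lands on the high side of most stumps
```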
When does Bagging make sense?

When a regression method or a classifier has a tendency to overfit, Bagging reduces the variance of the prediction.

- When n is large, the empirical distribution is similar to the true distribution of the samples.
- Bootstrap samples are therefore like independent realizations of the data.
- Bagging amounts to averaging the fits from many independent datasets, which would reduce the variance by a factor of 1/B.
Bagging decision trees

- Disadvantage: Every time we fit a decision tree to a Bootstrap sample, we get a different tree T^b.
  → Loss of interpretability.
- To measure variable importance nonetheless: for each predictor, add up the total amount by which the RSS (or Gini index) decreases every time we use the predictor in T^b.
- Average this total over the Bootstrap estimates T^1, ..., T^B.

[Figure: variable importance for the Heart data on a 0–100 scale; Thal, Ca, and ChestPain rank highest, followed by Oldpeak, MaxHR, RestBP, Age, Chol, Slope, Sex, ExAng, RestECG, and Fbs.]
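The accumulation step can be sketched directly. The per-split records below are made-up stand-ins for fitted trees, not output from any library.

```python
# Variable importance, sketched: sum each predictor's RSS decrease over
# the splits of every tree, then average over the B bagged trees.
from collections import defaultdict

# Hypothetical trees: each is a list of (variable, RSS decrease) per split.
trees = [
    [("Thal", 40.0), ("Ca", 25.0), ("MaxHR", 5.0)],
    [("Thal", 35.0), ("ChestPain", 12.0), ("Ca", 20.0)],
]

totals = defaultdict(float)
for tree in trees:
    for var, drop in tree:
        totals[var] += drop

B = len(trees)
importance = {var: total / B for var, total in totals.items()}
print(sorted(importance.items(), key=lambda kv: -kv[1]))
# [('Thal', 37.5), ('Ca', 22.5), ('ChestPain', 6.0), ('MaxHR', 2.5)]
```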
Out-of-bag (OOB) error

- To estimate the test error of a bagging estimate, we could use cross-validation.
- Each time we draw a Bootstrap sample, we only use about 63% of the observations: each observation is left out with probability (1 − 1/n)^n ≈ e^{−1} ≈ 0.37.
- Idea: use the rest of the observations as a test set.
- OOB error:
  - For each sample xi, find the predictions ŷi^b for all bootstrap samples b which do not contain xi. There should be around 0.37B of them. Average these predictions to obtain ŷi^oob.
  - Compute the error (yi − ŷi^oob)².
  - Average the errors over all observations i = 1, ..., n.
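The OOB bookkeeping can be run end to end with the standard library. The "model" here is just the mean of the bootstrap sample, a deliberately trivial stand-in so the in-bag/out-of-bag logic stays visible; the data are simulated.

```python
# OOB error, sketched: for each sample, average predictions from the
# bootstrap fits whose in-bag set did not contain it.
import random
import statistics

random.seed(2)
y = [random.gauss(0, 1) for _ in range(50)]
n, B = len(y), 200

fits = []  # (predicted value, set of in-bag indices) per bootstrap fit
for _ in range(B):
    idx = [random.randrange(n) for _ in range(n)]
    fits.append((statistics.mean(y[i] for i in idx), set(idx)))

errors = []
for i in range(n):
    oob_preds = [pred for pred, inbag in fits if i not in inbag]
    if oob_preds:  # around 0.37*B fits leave sample i out
        y_oob = statistics.mean(oob_preds)
        errors.append((y[i] - y_oob) ** 2)

oob_error = statistics.mean(errors)
print(round(oob_error, 3))  # OOB estimate of the test MSE
```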
Out-of-bag (OOB) error
[Figure: test and OOB error against the number of trees B, for Bagging and for Random Forests, with errors ranging roughly from 0.10 to 0.30.]

The test error decreases as we increase B (the dashed line is the error for a plain decision tree).
Random Forests

Bagging has a problem:
→ The trees produced by different Bootstrap samples can be very similar.

Random Forests:
- We fit a decision tree to different Bootstrap samples.
- When growing the tree, we select a random sample of m < p predictors to consider at each step.
- This leads to very different (or "uncorrelated") trees from each sample.
- Finally, average the predictions of the trees.
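The random-forest twist on top of bagging is only the candidate-set restriction at each split, which can be sketched in isolation. The predictor names echo the Heart data; the function is illustrative, not from any forest library.

```python
# At every split, the tree may only use a random subset of m of the p
# predictors; m ≈ sqrt(p) is the common default for classification.
import math
import random

random.seed(3)
predictors = ["Thal", "Ca", "MaxHR", "RestBP", "Chol", "Age", "Sex", "Slope", "Oldpeak"]
p = len(predictors)
m = round(math.sqrt(p))  # here p = 9, so m = 3

def candidates_for_split():
    """Predictors eligible at one split of one tree."""
    return random.sample(predictors, m)

# Consecutive splits see different candidate sets, which decorrelates the trees:
print(candidates_for_split())
print(candidates_for_split())
```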
Random Forests vs. Bagging
[Figure: test and OOB error against the number of trees for Bagging and for Random Forests; the Random Forest curves lie below the Bagging curves.]
Random Forests, choosing m
[Figure: test classification error against the number of trees for m = p, m = p/2, and m = √p; m = √p gives the lowest error.]

The optimal m is usually around √p, but this can be used as a tuning parameter.
Boosting regression trees

1. Set f̂(x) = 0, and ri = yi for i = 1, ..., n.
2. For b = 1, ..., B, iterate:
   2.1 Fit a decision tree f̂^b with d splits to the response r1, ..., rn.
   2.2 Update the prediction:
       f̂(x) ← f̂(x) + λ f̂^b(x).
   2.3 Update the residuals:
       ri ← ri − λ f̂^b(xi).
3. Output the final model:
   f̂(x) = Σ_{b=1}^{B} λ f̂^b(x).
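The loop above can be sketched with depth-1 trees (stumps, d = 1) on a single predictor, standard library only. The stump fitter and toy data are illustrative; the three numbered steps map onto the comments.

```python
# Boosting regression trees, sketched: repeatedly fit a small tree to
# the current residuals and add a shrunken copy (factor λ) to the model.
import statistics

def fit_stump(xs, rs):
    """Best single RSS-minimizing split; returns (cut, left_mean, right_mean)."""
    best = None
    for cut in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, rs) if x < cut]
        right = [r for x, r in zip(xs, rs) if x >= cut]
        lm, rm = statistics.mean(left), statistics.mean(right)
        rss = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or rss < best[0]:
            best = (rss, cut, lm, rm)
    return best[1:]

def boost(xs, ys, B=200, lam=0.1):
    r = list(ys)                          # 1. residuals start at y_i
    stumps = []
    for _ in range(B):                    # 2. iterate b = 1, ..., B
        cut, lm, rm = fit_stump(xs, r)    # 2.1 fit tree to residuals
        stumps.append((cut, lm, rm))
        for i, x in enumerate(xs):        # 2.3 shrink residuals by λ·f^b(x_i)
            r[i] -= lam * (lm if x < cut else rm)
    def f(x):                             # 3. final model: sum of λ·f^b(x)
        return sum(lam * (lm if x < cut else rm) for cut, lm, rm in stumps)
    return f

xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ys = [0.1, 0.0, 0.2, 0.9, 1.1, 1.0]
f = boost(xs, ys)
print(round(f(0.9), 2))  # near 1, once the residuals have been driven down
```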
Boosting, intuitively

Boosting learns slowly: we first fit the samples that are easiest to predict, then gradually down-weight these cases, moving on to harder samples.
Boosting vs. random forests
[Figure: test classification error against the number of trees (up to 5000) for Boosting with depth 1, Boosting with depth 2, and a Random Forest with m = √p.]

The parameter λ = 0.01 in each case. We can tune the model by cross-validation over λ, d, and B.