
15 Random Forests

15.1 Introduction

Bagging or bootstrap aggregation (Section 8.7) is a technique for reducing the variance of an estimated prediction function. Bagging seems to work especially well for high-variance, low-bias procedures, such as trees. For regression, we simply fit the same regression tree many times to bootstrap-sampled versions of the training data, and average the result. For classification, a committee of trees each cast a vote for the predicted class.

Boosting in Chapter 10 was initially proposed as a committee method as well, although unlike bagging, the committee of weak learners evolves over time, and the members cast a weighted vote. Boosting appears to dominate bagging on most problems, and became the preferred choice.

Random forests (Breiman, 2001) is a substantial modification of bagging that builds a large collection of de-correlated trees, and then averages them. On many problems the performance of random forests is very similar to boosting, and they are simpler to train and tune. As a consequence, random forests are popular, and are implemented in a variety of packages.

15.2 Definition of Random Forests

The essential idea in bagging (Section 8.7) is to average many noisy but approximately unbiased models, and hence reduce the variance. Trees are ideal candidates for bagging, since they can capture complex interaction


Algorithm 15.1 Random Forest for Regression or Classification.

1. For b = 1 to B:

(a) Draw a bootstrap sample Z* of size N from the training data.

(b) Grow a random-forest tree T_b to the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached.

i. Select m variables at random from the p variables.

ii. Pick the best variable/split-point among the m.

iii. Split the node into two daughter nodes.

2. Output the ensemble of trees $\{T_b\}_1^B$.

To make a prediction at a new point x:

Regression: $\hat{f}_{\mathrm{rf}}^{B}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$.

Classification: Let $\hat{C}_b(x)$ be the class prediction of the bth random-forest tree. Then $\hat{C}_{\mathrm{rf}}^{B}(x) = \text{majority vote}\ \{\hat{C}_b(x)\}_1^B$.

structures in the data, and if grown sufficiently deep, have relatively low bias. Since trees are notoriously noisy, they benefit greatly from the averaging. Moreover, since each tree generated in bagging is identically distributed (i.d.), the expectation of an average of B such trees is the same as the expectation of any one of them. This means the bias of bagged trees is the same as that of the individual trees, and the only hope of improvement is through variance reduction. This is in contrast to boosting, where the trees are grown in an adaptive way to remove bias, and hence are not i.d.

An average of B i.i.d. random variables, each with variance $\sigma^2$, has variance $\frac{1}{B}\sigma^2$. If the variables are simply i.d. (identically distributed, but not necessarily independent) with positive pairwise correlation $\rho$, the variance of the average is (Exercise 15.1)

\[
\rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2.
\tag{15.1}
\]

As B increases, the second term disappears, but the first remains, and hence the size of the correlation of pairs of bagged trees limits the benefits of averaging. The idea in random forests (Algorithm 15.1) is to improve the variance reduction of bagging by reducing the correlation between the trees, without increasing the variance too much. This is achieved in the tree-growing process through random selection of the input variables.
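Formula (15.1) can be checked numerically. The following R sketch is illustrative only; the values of B, $\sigma^2$ and $\rho$ are arbitrary assumptions, not settings used anywhere in the text.

```r
# Sketch: verify the variance formula (15.1) for equicorrelated variables.
# Assumptions (not from the text): sigma^2 = 1, rho = 0.5, B = 25.
library(MASS)   # for mvrnorm()

set.seed(1)
B      <- 25
sigma2 <- 1
rho    <- 0.5
Sigma  <- sigma2 * (rho + (1 - rho) * diag(B))   # equicorrelation covariance matrix

X    <- mvrnorm(n = 10000, mu = rep(0, B), Sigma = Sigma)
xbar <- rowMeans(X)                              # average of B correlated variables

c(empirical = var(xbar),
  formula   = rho * sigma2 + (1 - rho) / B * sigma2)
```

The two numbers agree closely, and letting B grow shows the variance leveling off at $\rho\sigma^2$ rather than going to zero.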

Specifically, when growing a tree on a bootstrapped dataset:

Before each split, select m ≤ p of the input variables at random as candidates for splitting.


Typically values for m are $\sqrt{p}$ or even as low as 1. After B such trees $\{T(x;\Theta_b)\}_1^B$ are grown, the random forest (regression) predictor is

\[
\hat{f}_{\mathrm{rf}}^{B}(x) = \frac{1}{B}\sum_{b=1}^{B} T(x;\Theta_b).
\tag{15.2}
\]

As in Section 10.9 (page 356), $\Theta_b$ characterizes the bth random forest tree in terms of split variables, cutpoints at each node, and terminal-node values. Intuitively, reducing m will reduce the correlation between any pair of trees in the ensemble, and hence by (15.1) reduce the variance of the average.
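For concreteness, here is a minimal R sketch of Algorithm 15.1 using the randomForest package referred to later in the chapter. The simulated regression data and the particular values of B, m, and n_min are illustrative assumptions, not the settings used for any figure.

```r
# Sketch: random forest regression in the spirit of Algorithm 15.1.
# The data-generating model below is a hypothetical stand-in.
library(randomForest)

set.seed(1)
p <- 10; N <- 500
X <- matrix(runif(N * p), N, p)
y <- 2 * X[, 1] + X[, 2]^2 + rnorm(N, sd = 0.5)

fit <- randomForest(x = X, y = y,
                    ntree    = 500,          # B: number of bootstrap trees
                    mtry     = floor(p / 3), # m: candidate variables per split
                    nodesize = 5)            # n_min: minimum terminal-node size

# Prediction at new points averages the B trees, as in (15.2).
xnew <- matrix(runif(5 * p), 5, p)
predict(fit, xnew)
```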

[Figure 15.1 here: test error on the spam data as a function of the number of trees (0 to 2500), for bagging, random forest, and gradient boosting (5 node).]

FIGURE 15.1. Bagging, random forest, and gradient boosting, applied to the spam data. For boosting, 5-node trees were used, and the number of trees were chosen by 10-fold cross-validation (2500 trees). Each “step” in the figure corresponds to a change in a single misclassification (in a test set of 1536).

Not all estimators can be improved by shaking up the data like this. It seems that highly nonlinear estimators, such as trees, benefit the most. For bootstrapped trees, $\rho$ is typically small (0.05 or lower is typical; see Figure 15.9), while $\sigma^2$ is not much larger than the variance for the original tree. On the other hand, bagging does not change linear estimates, such as the sample mean (hence its variance either); the pairwise correlation between bootstrapped means is about 50% (Exercise 15.4).


Random forests are popular. Leo Breiman’s¹ collaborator Adele Cutler maintains a random forest website² where the software is freely available, with more than 3000 downloads reported by 2002. There is a randomForest package in R, maintained by Andy Liaw, available from the CRAN website.

The authors make grand claims about the success of random forests: “most accurate,” “most interpretable,” and the like. In our experience random forests do remarkably well, with very little tuning required. A random forest classifier achieves 4.88% misclassification error on the spam test data, which compares well with all other methods, and is not significantly worse than gradient boosting at 4.5%. Bagging achieves 5.4%, which is significantly worse than either (using the McNemar test outlined in Exercise 10.6), so it appears on this example the additional randomization helps.
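A comparison of this kind can be run directly in R. The sketch below uses the copy of the spam data shipped with the kernlab package and a random 2/3 split; since the book's own fixed training/test split is not reproduced here, the error rates will differ somewhat from the numbers quoted above.

```r
# Sketch: random forest vs. bagging on the spam data (kernlab's copy).
# The random 2/3 train/test split is an assumption of this sketch.
library(randomForest)
data(spam, package = "kernlab")

set.seed(1)
train <- sample(nrow(spam), round(2/3 * nrow(spam)))
p     <- ncol(spam) - 1                 # 57 predictors; response is 'type'

rf  <- randomForest(type ~ ., data = spam[train, ], mtry = floor(sqrt(p)))
bag <- randomForest(type ~ ., data = spam[train, ], mtry = p)   # bagging: m = p

mean(predict(rf,  spam[-train, ]) != spam$type[-train])   # random forest test error
mean(predict(bag, spam[-train, ]) != spam$type[-train])   # bagging test error
```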

[Figure 15.2 here: boxplots of test misclassification error (0.00 to 0.15) for RF-1, RF-3, Bagging, GBM-1 and GBM-6 on the nested spheres problem, with the Bayes error marked.]

FIGURE 15.2. The results of 50 simulations from the “nested spheres” model in $\mathbb{R}^{10}$. The Bayes decision boundary is the surface of a sphere (additive). “RF-3” refers to a random forest with m = 3, and “GBM-6” a gradient boosted model with interaction order six; similarly for “RF-1” and “GBM-1.” The training sets were of size 2000, and the test sets 10,000.

Figure 15.1 shows the test-error progression on 2500 trees for the three methods. In this case there is some evidence that gradient boosting has started to overfit, although 10-fold cross-validation chose all 2500 trees.

¹ Sadly, Leo Breiman died in July, 2005.
² http://www.math.usu.edu/~adele/forests/


[Figure 15.3 here: test average absolute error (about 0.32 to 0.44) as a function of the number of trees (0 to 1000), for RF m=2, RF m=6, GBM depth=4 and GBM depth=6.]

FIGURE 15.3. Random forests compared to gradient boosting on the California housing data. The curves represent mean absolute error on the test data as a function of the number of trees in the models. Two random forests are shown, with m = 2 and m = 6. The two gradient boosted models use a shrinkage parameter ν = 0.05 in (10.41), and have interaction depths of 4 and 6. The boosted models outperform random forests.

Figure 15.2 shows the results of a simulation³ comparing random forests to gradient boosting on the nested spheres problem [Equation (10.2) in Chapter 10]. Boosting easily outperforms random forests here. Notice that smaller m is better here, although part of the reason could be that the true decision boundary is additive.

Figure 15.3 compares random forests to boosting (with shrinkage) in a regression problem, using the California housing data (Section 10.14.1). Two strong features that emerge are

• Random forests stabilize at about 200 trees, while at 1000 trees boosting continues to improve. Boosting is slowed down by the shrinkage, as well as the fact that the trees are much smaller.

• Boosting outperforms random forests here. At 1000 terms, the weaker boosting model (GBM depth 4) has a smaller error than the stronger random forest (RF m = 6); a Wilcoxon test on the mean differences in absolute errors has a p-value of 0.007. For larger m the random forests performed no better.

³ Details: The random forests were fit using the R package randomForest 4.5-11, with 500 trees. The gradient boosting models were fit using R package gbm 1.5, with shrinkage parameter set to 0.05, and 2000 trees.


[Figure 15.4 here: oob error and test misclassification error (roughly 0.045 to 0.075) as a function of the number of trees (0 to 2500).]

FIGURE 15.4. oob error computed on the spam training data, compared to the test error computed on the test set.


15.3 Details of Random Forests

We have glossed over the distinction between random forests for classification versus regression. When used for classification, a random forest obtains a class vote from each tree, and then classifies using majority vote (see Section 8.7 on bagging for a similar discussion). When used for regression, the predictions from each tree at a target point x are simply averaged, as in (15.2). In addition, the inventors make the following recommendations:

• For classification, the default value for m is ⌊√p⌋ and the minimum node size is one.

• For regression, the default value for m is ⌊p/3⌋ and the minimum node size is five.

In practice the best values for these parameters will depend on the problem, and they should be treated as tuning parameters. In Figure 15.3 the m = 6 performs much better than the default value ⌊8/3⌋ = 2.
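Since m should be treated as a tuning parameter, one simple approach is to compare the out-of-bag error (introduced in the next section) over a grid of values. The sketch below does this on hypothetical regression data; the grid, the simulated data and the number of trees are illustrative assumptions.

```r
# Sketch: treat m (mtry) as a tuning parameter and choose it by OOB error.
library(randomForest)

set.seed(1)
p <- 12; N <- 300
X <- matrix(rnorm(N * p), N, p)
y <- X[, 1] + X[, 2] + rnorm(N)

oob.mse <- sapply(1:p, function(m) {
  fit <- randomForest(x = X, y = y, mtry = m, ntree = 300)
  tail(fit$mse, 1)          # OOB mean-squared error after all trees
})
which.min(oob.mse)          # value of m with the smallest OOB error
```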

15.3.1 Out of Bag Samples

An important feature of random forests is its use of out-of-bag (oob) samples:


For each observation $z_i = (x_i, y_i)$, construct its random forest predictor by averaging only those trees corresponding to bootstrap samples in which $z_i$ did not appear.

An oob error estimate is almost identical to that obtained by N-fold cross-validation; see Exercise 15.2. Hence unlike many other nonlinear estimators, random forests can be fit in one sequence, with cross-validation being performed along the way. Once the oob error stabilizes, the training can be terminated.

Figure 15.4 shows the oob misclassification error for the spam data, compared to the test error. Although 2500 trees are averaged here, it appears from the plot that about 200 would be sufficient.
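In the randomForest package the oob error is tracked as trees are added, so a curve like Figure 15.4 can be read off the fitted object. The classification data below are simulated as a stand-in for the spam data; the sample sizes and split are assumptions of the sketch.

```r
# Sketch: OOB error as a function of the number of trees, next to a test error.
library(randomForest)

set.seed(1)
N <- 600; p <- 10
X <- matrix(rnorm(N * p), N, p)
y <- factor(ifelse(X[, 1] + X[, 2] + rnorm(N) > 0, "a", "b"))
train <- 1:400

fit <- randomForest(x = X[train, ], y = y[train],
                    xtest = X[-train, ], ytest = y[-train],
                    ntree = 500)

oob  <- fit$err.rate[, "OOB"]          # OOB error after 1, 2, ..., ntree trees
test <- fit$test$err.rate[, "Test"]    # test error on (xtest, ytest)
matplot(cbind(oob, test), type = "l", lty = 1,
        xlab = "Number of Trees", ylab = "Misclassification Error")
```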

15.3.2 Variable Importance

Variable importance plots can be constructed for random forests in exactly the same way as they were for gradient-boosted models (Section 10.13). At each split in each tree, the improvement in the split-criterion is the importance measure attributed to the splitting variable, and is accumulated over all the trees in the forest separately for each variable. The left plot of Figure 15.5 shows the variable importances computed in this way for the spam data; compare with the corresponding Figure 10.6 on page 354 for gradient boosting. Boosting ignores some variables completely, while the random forest does not. The candidate split-variable selection increases the chance that any single variable gets included in a random forest, while no such selection occurs with boosting.

Random forests also use the oob samples to construct a different variable-importance measure, apparently to measure the prediction strength of each variable. When the bth tree is grown, the oob samples are passed down the tree, and the prediction accuracy is recorded. Then the values for the jth variable are randomly permuted in the oob samples, and the accuracy is again computed. The decrease in accuracy as a result of this permuting is averaged over all trees, and is used as a measure of the importance of variable j in the random forest. These are expressed as a percent of the maximum in the right plot in Figure 15.5. Although the rankings of the two methods are similar, the importances in the right plot are more uniform over the variables. The randomization effectively voids the effect of a variable, much like setting a coefficient to zero in a linear model (Exercise 15.7). This does not measure the effect on prediction were this variable not available, because if the model was refitted without the variable, other variables could be used as surrogates.
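Both importance measures are available from the randomForest package when the forest is grown with importance = TRUE: type = 2 gives the split-criterion (Gini) measure and type = 1 the oob-permutation measure. The data below are a simulated stand-in, not the spam data.

```r
# Sketch: the two variable-importance measures of Figure 15.5.
library(randomForest)

set.seed(1)
N <- 500; p <- 8
X <- data.frame(matrix(rnorm(N * p), N, p))
y <- factor(ifelse(X[, 1] - X[, 2] + rnorm(N) > 0, "spam", "email"))

fit <- randomForest(X, y, importance = TRUE, ntree = 500)

importance(fit, type = 2)   # mean decrease in Gini (split-criterion) importance
importance(fit, type = 1)   # mean decrease in accuracy under OOB permutation
varImpPlot(fit)             # plots both measures side by side
```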


[Figure 15.5 here: two panels of variable importance (0 to 100) for the spam predictors; the left panel is labeled “Gini,” the right panel “Randomization.”]

FIGURE 15.5. Variable importance plots for a classification random forest grown on the spam data. The left plot bases the importance on the Gini splitting index, as in gradient boosting. The rankings compare well with the rankings produced by gradient boosting (Figure 10.6 on page 354). The right plot uses oob randomization to compute variable importances, and tends to spread the importances more uniformly.


[Figure 15.6 here: left panel, the proximity plot (Dimension 1 versus Dimension 2); right panel, the random forest classifier's decision boundary on the mixture data in (X1, X2). Six points are marked 1 through 6 in each panel.]

FIGURE 15.6. (Left): Proximity plot for a random forest classifier grown to the mixture data. (Right): Decision boundary and training data for random forest on mixture data. Six points have been identified in each plot.

15.3.3 Proximity Plots

One of the advertised outputs of a random forest is a proximity plot. Figure 15.6 shows a proximity plot for the mixture data defined in Section 2.3.3 in Chapter 2. In growing a random forest, an N × N proximity matrix is accumulated for the training data. For every tree, any pair of oob observations sharing a terminal node has their proximity increased by one. This proximity matrix is then represented in two dimensions using multidimensional scaling (Section 14.8). The idea is that even though the data may be high-dimensional, involving mixed variables, etc., the proximity plot gives an indication of which observations are effectively close together in the eyes of the random forest classifier.
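In randomForest, setting proximity = TRUE (with oob.prox = TRUE to count only oob pairs, as described above) accumulates the N × N proximity matrix, and MDSplot() performs the multidimensional scaling. The two-class data below are a hypothetical stand-in for the mixture data.

```r
# Sketch: proximity matrix and a proximity (MDS) plot.
library(randomForest)

set.seed(1)
N <- 200
X <- data.frame(x1 = rnorm(N), x2 = rnorm(N))
y <- factor(ifelse(X$x1 + X$x2 + rnorm(N, sd = 0.8) > 0, 1, 0))

fit <- randomForest(X, y, proximity = TRUE, oob.prox = TRUE, ntree = 500)

dim(fit$proximity)          # N x N matrix of co-occurrence frequencies
MDSplot(fit, y, k = 2)      # multidimensional scaling of 1 - proximity
```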

Proximity plots for random forests often look very similar, irrespective of the data, which casts doubt on their utility. They tend to have a star shape, one arm per class, which is more pronounced the better the classification performance.

Since the mixture data are two-dimensional, we can map points from the proximity plot to the original coordinates, and get a better understanding of what they represent. It seems that points in pure regions class-wise map to the extremities of the star, while points nearer the decision boundaries map nearer the center. This is not surprising when we consider the construction of the proximity matrices. Neighboring points in pure regions will often end up sharing a bucket, since when a terminal node is pure, it is no longer


split by a random forest tree-growing algorithm. On the other hand, pairs of points that are close but belong to different classes will sometimes share a terminal node, but not always.

15.3.4 Random Forests and Overfitting

When the number of variables is large, but the fraction of relevant variables small, random forests are likely to perform poorly with small m. At each split the chance can be small that the relevant variables will be selected. Figure 15.7 shows the results of a simulation that supports this claim. Details are given in the figure caption and Exercise 15.3. At the top of each pair we see the hypergeometric probability that a relevant variable will be selected at any split by a random forest tree (in this simulation, the relevant variables are all equal in stature). As this probability gets small, the gap between boosting and random forests increases. When the number of relevant variables increases, the performance of random forests is surprisingly robust to an increase in the number of noise variables. For example, with 6 relevant and 100 noise variables, the probability of a relevant variable being selected at any split is 0.46, assuming $m = \sqrt{6 + 100} \approx 10$. According to Figure 15.7, this does not hurt the performance of random forests compared with boosting. This robustness is largely due to the relative insensitivity of misclassification cost to the bias and variance of the probability estimates in each tree. We consider random forests for regression in the next section.
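The probabilities quoted above and printed in Figure 15.7 are hypergeometric: the chance that at least one of the k relevant variables appears among the m split candidates drawn from the p = k + noise variables. A small R check (the function name is ours, introduced for illustration):

```r
# Sketch: probability that at least one of k relevant variables is among the
# m split candidates, out of p = k + noise variables (hypergeometric).
prob.relevant <- function(k, noise, m = floor(sqrt(k + noise))) {
  1 - choose(noise, m) / choose(k + noise, m)
}

prob.relevant(k = 6, noise = 100, m = 10)   # about 0.46, as quoted in the text
prob.relevant(k = 2, noise = 100)           # about 0.19, the (2, 100) entry in Figure 15.7
```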

Another claim is that random forests “cannot overfit” the data. It is certainly true that increasing B does not cause the random forest sequence to overfit; like bagging, the random forest estimate (15.2) approximates the expectation

\[
\hat{f}_{\mathrm{rf}}(x) = \mathrm{E}_{\Theta}\, T(x;\Theta) = \lim_{B\to\infty} \hat{f}_{\mathrm{rf}}^{B}(x),
\tag{15.3}
\]

with an average over B realizations of $\Theta$. The distribution of $\Theta$ here is conditional on the training data. However, this limit can overfit the data; the average of fully grown trees can result in too rich a model, and incur unnecessary variance. Segal (2004) demonstrates small gains in performance by controlling the depths of the individual trees grown in random forests. Our experience is that using full-grown trees seldom costs much, and results in one less tuning parameter.

Figure 15.8 shows the modest effect of depth control in a simple regression example. Classifiers are less sensitive to variance, and this effect of overfitting is seldom seen with random-forest classification.
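In randomForest, tree depth is controlled indirectly through nodesize (and optionally maxnodes), so the effect in Figure 15.8 can be explored with a loop like the one below. The additive simulation here is a stand-in for the one used in the figure, with arbitrary sample sizes.

```r
# Sketch: effect of minimum node size (tree depth) on random forest regression.
library(randomForest)

set.seed(1)
p <- 12; N <- 300; Ntest <- 1000
gen <- function(n) {
  X <- matrix(runif(n * p), n, p)
  list(X = X, y = sin(2 * pi * X[, 1]) + X[, 2]^2 + rnorm(n))  # additive in 2 of 12
}
tr <- gen(N); te <- gen(Ntest)

sizes <- c(50, 30, 20, 10, 5)
mse <- sapply(sizes, function(ns) {
  fit <- randomForest(tr$X, tr$y, nodesize = ns, ntree = 300)
  mean((predict(fit, te$X) - te$y)^2)
})
setNames(mse, sizes)   # test MSE for shallow (nodesize 50) through deep (5) trees
```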


[Figure 15.7 here: boxplots of test misclassification error (about 0.10 to 0.30) for random forests and gradient boosting, at (relevant, noise) variable counts (2, 5), (2, 25), (2, 50), (2, 100) and (2, 150); the probabilities printed at the top of each pair are 0.52, 0.34, 0.25, 0.19 and 0.15, and the Bayes error is marked.]

FIGURE 15.7. A comparison of random forests and gradient boosting on problems with increasing numbers of noise variables. In each case the true decision boundary depends on two variables, and an increasing number of noise variables are included. Random forests uses its default value $m = \sqrt{p}$. At the top of each pair is the probability that one of the relevant variables is chosen at any split. The results are based on 50 simulations for each pair, with a training sample of 300, and a test sample of 500.

15.4 Analysis of Random Forests

In this section we analyze the mechanisms at play with the additional randomization employed by random forests. For this discussion we focus on regression and squared error loss, since this gets at the main points, and bias and variance are more complex with 0–1 loss (see Section 7.3.1). Furthermore, even in the case of a classification problem, we can consider the random-forest average as an estimate of the class posterior probabilities, for which bias and variance are appropriate descriptors.

15.4.1 Variance and the De-Correlation Effect

The limiting form ($B \to \infty$) of the random forest regression estimator is

\[
\hat{f}_{\mathrm{rf}}(x) = \mathrm{E}_{\Theta|Z}\, T(x;\Theta(Z)),
\tag{15.4}
\]

where we have made explicit the dependence on the training data Z. Here we consider estimation at a single target point x.


[Figure 15.8 here: mean squared test error (about 1.00 to 1.10) against minimum node size (50, 30, 20, 10, 5), i.e., from shallow to deep trees.]

FIGURE 15.8. The effect of tree size on the error in random forest regression. In this example, the true surface was additive in two of the 12 variables, plus additive unit-variance Gaussian noise. Tree depth is controlled here by the minimum node size; the smaller the minimum node size, the deeper the trees.

From (15.1) we see that

\[
\mathrm{Var}\,\hat{f}_{\mathrm{rf}}(x) = \rho(x)\,\sigma^2(x).
\tag{15.5}
\]

Here

• $\rho(x)$ is the sampling correlation between any pair of trees used in the averaging:

\[
\rho(x) = \mathrm{corr}[T(x;\Theta_1(Z)),\, T(x;\Theta_2(Z))],
\tag{15.6}
\]

where $\Theta_1(Z)$ and $\Theta_2(Z)$ are a randomly drawn pair of random forest trees grown to the randomly sampled Z;

• $\sigma^2(x)$ is the sampling variance of any single randomly drawn tree,

\[
\sigma^2(x) = \mathrm{Var}\, T(x;\Theta(Z)).
\tag{15.7}
\]

It is easy to confuse $\rho(x)$ with the average correlation between fitted trees in a given random-forest ensemble; that is, think of the fitted trees as N-vectors, and compute the average pairwise correlation between these vectors, conditioned on the data. This is not the case; this conditional correlation is not directly relevant in the averaging process, and the dependence on x in $\rho(x)$ warns us of the distinction. Rather, $\rho(x)$ is the theoretical correlation between a pair of random-forest trees evaluated at x, induced by repeatedly making training sample draws Z from the population, and then drawing a pair of random forest trees. In statistical jargon, this is the correlation induced by the sampling distribution of Z and $\Theta$.

More precisely, the variability averaged over in the calculations in (15.6) and (15.7) is both


• conditional on Z: due to the bootstrap sampling and feature sampling at each split, and

• a result of the sampling variability of Z itself.

In fact, the conditional covariance of a pair of tree fits at x is zero, because the bootstrap and feature sampling is i.i.d.; see Exercise 15.5.

[Figure 15.9 here: boxplots of the correlation between trees (about 0.00 to 0.08) for m = 1, 4, 7, 13, 19, 25, 31, 37, 43, 49.]

FIGURE 15.9. Correlations between pairs of trees drawn by a random-forest regression algorithm, as a function of m. The boxplots represent the correlations at 600 randomly chosen prediction points x.

The following demonstrations are based on a simulation model

\[
Y = \frac{1}{\sqrt{50}} \sum_{j=1}^{50} X_j + \varepsilon,
\tag{15.8}
\]

with all the $X_j$ and $\varepsilon$ iid Gaussian. We use 500 training sets of size 100, and a single set of test locations of size 600. Since regression trees are nonlinear in Z, the patterns we see below will differ somewhat depending on the structure of the model.

Figure 15.9 shows how the correlation (15.6) between pairs of trees decreases as m decreases: pairs of tree predictions at x for different training sets Z are likely to be less similar if they do not use the same splitting variables.
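Because (15.6) is a correlation over the sampling distribution of Z and $\Theta$, it can be estimated by drawing many training sets from model (15.8), growing one pair of trees on each, and correlating the paired predictions at a fixed x. The sketch below does this with fewer replications than the 500 training sets used for the figures, purely to keep the run time modest.

```r
# Sketch: Monte Carlo estimate of the between-tree correlation rho(x) in (15.6).
library(randomForest)

set.seed(1)
p <- 50; N <- 100; R <- 200                 # R training sets (the text uses 500)
xtest <- matrix(rnorm(p), 1, p)             # a single fixed prediction point x
gen <- function(n) {
  X <- matrix(rnorm(n * p), n, p)
  list(X = X, y = rowSums(X) / sqrt(p) + rnorm(n))   # model (15.8)
}

rho.hat <- function(m) {
  pred <- replicate(R, {
    Z  <- gen(N)
    t1 <- randomForest(Z$X, Z$y, mtry = m, ntree = 1, nodesize = 5)
    t2 <- randomForest(Z$X, Z$y, mtry = m, ntree = 1, nodesize = 5)
    c(predict(t1, xtest), predict(t2, xtest))
  })
  cor(pred[1, ], pred[2, ])
}

sapply(c(1, 7, 25, 50), rho.hat)   # the correlation shrinks as m shrinks
```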

In the left panel of Figure 15.10 we consider the variances of single tree predictors, $\mathrm{Var}\,T(x;\Theta(Z))$ (averaged over 600 prediction points x drawn randomly from our simulation model). This is the total variance, and can be


decomposed into two parts using standard conditional variance arguments (see Exercise 15.5):

\[
\begin{aligned}
\mathrm{Var}_{\Theta,Z}\,T(x;\Theta(Z)) &= \mathrm{Var}_{Z}\,\mathrm{E}_{\Theta|Z}\,T(x;\Theta(Z)) + \mathrm{E}_{Z}\,\mathrm{Var}_{\Theta|Z}\,T(x;\Theta(Z)) \\
\text{Total Variance} &= \mathrm{Var}_{Z}\,\hat{f}_{\mathrm{rf}}(x) + \text{within-}Z\ \text{Variance}.
\end{aligned}
\tag{15.9}
\]

The second term is the within-Z variance, a result of the randomization, which increases as m decreases. The first term is in fact the sampling variance of the random forest ensemble (shown in the right panel), which decreases as m decreases. The variance of the individual trees does not change appreciably over much of the range of m, hence in light of (15.5), the variance of the ensemble is dramatically lower than this tree variance.
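The two terms in (15.9) can be estimated from the same kind of simulation: for each training set Z, grow several single trees, record the mean and variance of their predictions at x, and then average over training sets. The sketch below is a rough Monte Carlo with small, arbitrary replication counts.

```r
# Sketch: Monte Carlo decomposition of the single-tree variance as in (15.9).
library(randomForest)

set.seed(1)
p <- 50; N <- 100; R <- 100; Btree <- 20
xtest <- matrix(rnorm(p), 1, p)
gen <- function(n) {
  X <- matrix(rnorm(n * p), n, p)
  list(X = X, y = rowSums(X) / sqrt(p) + rnorm(n))   # model (15.8)
}

m <- 7
stats <- replicate(R, {
  Z <- gen(N)
  preds <- replicate(Btree, predict(
    randomForest(Z$X, Z$y, mtry = m, ntree = 1, nodesize = 5), xtest))
  c(mean = mean(preds), var = var(preds))    # E[T | Z] and Var[T | Z] at x
})

c(VarZ.of.frf = var(stats["mean", ]),        # Var_Z E_{Theta|Z} T: ensemble variance
  EZ.withinZ  = mean(stats["var", ]),        # E_Z Var_{Theta|Z} T: within-Z variance
  Total       = var(stats["mean", ]) + mean(stats["var", ]))
```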

[Figure 15.10 here: left panel, variance of a single tree (about 1.80 to 1.95) against m (0 to 50), with “Within Z” and “Total” curves; right panel, mean squared error, squared bias and variance of the random forest ensemble against m, with the variance on a secondary axis (0.0 to 0.20).]

FIGURE 15.10. Simulation results. The left panel shows the average variance of a single random forest tree, as a function of m. “Within Z” refers to the average within-sample contribution to the variance, resulting from the bootstrap sampling and split-variable sampling (15.9). “Total” includes the sampling variability of Z. The horizontal line is the average variance of a single fully grown tree (without bootstrap sampling). The right panel shows the average mean-squared error, squared bias and variance of the ensemble, as a function of m. Note that the variance axis is on the right (same scale, different level). The horizontal line is the average squared-bias of a fully grown tree.

15.4.2 Bias

As in bagging, the bias of a random forest is the same as the bias of any of the individual sampled trees $T(x;\Theta(Z))$:


\[
\begin{aligned}
\mathrm{Bias}(x) &= \mu(x) - \mathrm{E}_{Z}\,\hat{f}_{\mathrm{rf}}(x) \\
&= \mu(x) - \mathrm{E}_{Z}\,\mathrm{E}_{\Theta|Z}\,T(x;\Theta(Z)).
\end{aligned}
\tag{15.10}
\]

This is also typically greater (in absolute terms) than the bias of an unpruned tree grown to Z, since the randomization and reduced sample space impose restrictions. Hence the improvements in prediction obtained by bagging or random forests are solely a result of variance reduction.

Any discussion of bias depends on the unknown true function. Figure 15.10 (right panel) shows the squared bias for our additive model simulation (estimated from the 500 realizations). Although for different models the shape and rate of the bias curves may differ, the general trend is that as m decreases, the bias increases. Shown in the figure is the mean-squared error, and we see a classical bias-variance trade-off in the choice of m. For all m the squared bias of the random forest is greater than that for a single tree (horizontal line).

These patterns suggest a similarity with ridge regression (Section 3.4.1). Ridge regression is useful (in linear models) when one has a large number of variables with similarly sized coefficients; ridge shrinks their coefficients toward zero, and those of strongly correlated variables toward each other. Although the size of the training sample might not permit all the variables to be in the model, this regularization via ridge stabilizes the model and allows all the variables to have their say (albeit diminished). Random forests with small m perform a similar averaging. Each of the relevant variables get their turn to be the primary split, and the ensemble averaging reduces the contribution of any individual variable. Since this simulation example (15.8) is based on a linear model in all the variables, ridge regression achieves a lower mean-squared error (about 0.45 with $\mathrm{df}(\lambda_{\mathrm{opt}}) \approx 29$).
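Since model (15.8) is linear in all 50 variables, ridge regression is a natural comparator. The sketch below uses glmnet with alpha = 0 for the ridge penalty; the text does not say which software or tuning was used, so this simplified setup (cross-validated lambda, one training set) is an assumption and will not reproduce the quoted value of 0.45.

```r
# Sketch: ridge regression on the linear simulation model (15.8), via glmnet.
library(glmnet)

set.seed(1)
p <- 50; N <- 100; Ntest <- 600
gen <- function(n) {
  X <- matrix(rnorm(n * p), n, p)
  list(X = X, mu = rowSums(X) / sqrt(p))     # true regression function
}
tr <- gen(N); te <- gen(Ntest)
y  <- tr$mu + rnorm(N)

cvfit  <- cv.glmnet(tr$X, y, alpha = 0)      # alpha = 0 gives the ridge penalty
mu.hat <- predict(cvfit, te$X, s = "lambda.min")
mean((mu.hat - te$mu)^2)     # error in estimating the true regression function
```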

15.4.3 Adaptive Nearest Neighbors

The random forest classifier has much in common with the k-nearest neighbor classifier (Section 13.3); in fact a weighted version thereof. Since each tree is grown to maximal size, for a particular $\Theta^*$, $T(x;\Theta^*(Z))$ is the response value for one of the training samples⁴. The tree-growing algorithm finds an “optimal” path to that observation, choosing the most informative predictors from those at its disposal. The averaging process assigns weights to these training responses, which ultimately vote for the prediction. Hence via the random-forest voting mechanism, those observations close to the target point get assigned weights (an equivalent kernel), which combine to form the classification decision.

Figure 15.11 demonstrates the similarity between the decision boundary of 3-nearest neighbors and random forests on the mixture data.

⁴ We gloss over the fact that pure nodes are not split further, and hence there can be more than one observation in a terminal node.
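A comparison in the spirit of Figure 15.11 can be sketched with simulated two-class, two-dimensional Gaussian data in place of the book's mixture data (which is not re-created here), using class::knn for the 3-nearest-neighbor rule; the data-generating model and sample sizes are assumptions of the sketch.

```r
# Sketch: random forest vs. 3-nearest neighbors on simulated 2-d two-class data.
library(randomForest)
library(class)     # for knn()

set.seed(1)
gen <- function(n) {
  y <- factor(rep(0:1, each = n / 2))
  X <- cbind(x1 = rnorm(n) + as.numeric(y == 1),   # class 1 shifted by one unit
             x2 = rnorm(n) + as.numeric(y == 1))
  list(X = X, y = y)
}
tr <- gen(200); te <- gen(2000)

rf <- randomForest(tr$X, tr$y, ntree = 500)
mean(predict(rf, te$X) != te$y)                    # random forest test error
mean(knn(tr$X, te$X, cl = tr$y, k = 3) != te$y)    # 3-NN test error
```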


[Figure 15.11 here: two panels showing the training data and decision boundaries on the mixture data. Left, “Random Forest Classifier”: training error 0.000, test error 0.238, Bayes error 0.210. Right, “3-Nearest Neighbors”: training error 0.130, test error 0.242, Bayes error 0.210.]

FIGURE 15.11. Random forests versus 3-NN on the mixture data. The axis-oriented nature of the individual trees in a random forest lead to decision regions with an axis-oriented flavor.

Bibliographic Notes

Random forests as described here were introduced by Breiman (2001), although many of the ideas had cropped up earlier in the literature in different forms. Notably Ho (1995) introduced the term “random forest,” and used a consensus of trees grown in random subspaces of the features. The idea of using stochastic perturbation and averaging to avoid overfitting was introduced by Kleinberg (1990), and later in Kleinberg (1996). Amit and Geman (1997) used randomized trees grown on image features for image classification problems. Breiman (1996a) introduced bagging, a precursor to his version of random forests. Dietterich (2000b) also proposed an improvement on bagging using additional randomization. His approach was to rank the top 20 candidate splits at each node, and then select from the list at random. He showed through simulations and real examples that this additional randomization improved over the performance of bagging. Friedman and Hall (2007) showed that sub-sampling (without replacement) is an effective alternative to bagging. They showed that growing and averaging trees on samples of size N/2 is approximately equivalent (in terms of bias/variance considerations) to bagging, while using smaller fractions of N reduces the variance even further (through decorrelation).

There are several free software implementations of random forests. In this chapter we used the randomForest package in R, maintained by Andy Liaw, available from the CRAN website. This allows both split-variable selection, as well as sub-sampling. Adele Cutler maintains a random forest website http://www.math.usu.edu/~adele/forests/ where (as of August 2008) the software written by Leo Breiman and Adele Cutler is freely


available. Their code, and the name “random forests”, is exclusively licensed to Salford Systems for commercial release. The Weka machine learning archive http://www.cs.waikato.ac.nz/ml/weka/ at Waikato University, New Zealand, offers a free Java implementation of random forests.

Exercises

Ex. 15.1 Derive the variance formula (15.1). This appears to fail if $\rho$ is negative; diagnose the problem in this case.

Ex. 15.2 Show that as the number of bootstrap samples B gets large, the oob error estimate for a random forest approaches its N-fold CV error estimate, and that in the limit, the identity is exact.

Ex. 15.3 Consider the simulation model used in Figure 15.7 (Mease and Wyner, 2008). Binary observations are generated with probabilities

\[
\Pr(Y = 1 \mid X) = q + (1 - 2q)\cdot \mathbf{1}\!\left[\sum_{j=1}^{J} X_j > J/2\right],
\tag{15.11}
\]

where $X \sim U[0,1]^p$, $0 \le q \le \frac{1}{2}$, and $J \le p$ is some predefined (even) number. Describe this probability surface, and give the Bayes error rate.

Ex. 15.4 Suppose $x_i$, $i = 1, \ldots, N$ are iid $(\mu, \sigma^2)$. Let $\bar{x}^*_1$ and $\bar{x}^*_2$ be two bootstrap realizations of the sample mean. Show that the sampling correlation $\mathrm{corr}(\bar{x}^*_1, \bar{x}^*_2) = \frac{n}{2n-1} \approx 50\%$. Along the way, derive $\mathrm{var}(\bar{x}^*_1)$ and the variance of the bagged mean $\bar{x}_{\mathrm{bag}}$. Here $\bar{x}$ is a linear statistic; bagging produces no reduction in variance for linear statistics.

Ex. 15.5 Show that the sampling correlation between a pair of random-forest trees at a point x is given by

\[
\rho(x) = \frac{\mathrm{Var}_{Z}\!\left[\mathrm{E}_{\Theta|Z}\,T(x;\Theta(Z))\right]}{\mathrm{Var}_{Z}\!\left[\mathrm{E}_{\Theta|Z}\,T(x;\Theta(Z))\right] + \mathrm{E}_{Z}\,\mathrm{Var}_{\Theta|Z}\!\left[T(x;\Theta(Z))\right]}.
\tag{15.12}
\]

The term in the numerator is $\mathrm{Var}_{Z}[\hat{f}_{\mathrm{rf}}(x)]$, and the second term in the denominator is the expected conditional variance due to the randomization in random forests.

Ex. 15.6 Fit a series of random-forest classifiers to the spam data, to explore the sensitivity to the parameter m. Plot both the oob error as well as the test error against a suitably chosen range of values for m.


Ex. 15.7 Suppose we fit a linear regression model to N observations with response $y_i$ and predictors $x_{i1}, \ldots, x_{ip}$. Assume that all variables are standardized to have mean zero and standard deviation one. Let RSS be the mean-squared residual on the training data, and $\hat{\beta}$ the estimated coefficient. Denote by $\mathrm{RSS}^*_j$ the mean-squared residual on the training data using the same $\hat{\beta}$, but with the N values for the jth variable randomly permuted before the predictions are calculated. Show that

\[
\mathrm{E}_{P}\!\left[\mathrm{RSS}^*_j - \mathrm{RSS}\right] = 2\hat{\beta}_j^2,
\tag{15.13}
\]

where $\mathrm{E}_{P}$ denotes expectation with respect to the permutation distribution. Argue that this is approximately true when the evaluations are done using an independent test set.