Understanding Random Forests: From Theory to Practice


Description

Slides of my PhD defense, held on October 9, 2014.

Transcript of Understanding Random Forests: From Theory to Practice

Page 1: Understanding Random Forests: From Theory to Practice

Understanding Random Forests: From Theory to Practice

Gilles Louppe

Université de Liège, Belgium

October 9, 2014

1 / 39

Page 2: Understanding Random Forests: From Theory to Practice

Motivation

2 / 39

Page 3: Understanding Random Forests: From Theory to Practice

Objective

From a set of measurements,

learn a model

to predict and understand a phenomenon.

3 / 39

Page 4: Understanding Random Forests: From Theory to Practice

Running example

From physicochemical properties (alcohol, acidity, sulphates, ...),

learn a model

to predict wine taste preferences (from 0 to 10).

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Modeling wine preferences by data mining from physicochemical properties, 2009.

4 / 39
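To follow along with this running example in code, a minimal loading sketch (not from the thesis) could look as follows. It assumes the UCI Wine Quality files winequality-red.csv and winequality-white.csv (semicolon-separated) are available locally, and combines red and white wines with an extra color column, as in the running example. The names X, y and feature_names defined here are reused in the later sketches.

import pandas as pd

# Assumption: the two UCI Wine Quality CSV files sit in the working directory.
red = pd.read_csv("winequality-red.csv", sep=";").assign(color=1)
white = pd.read_csv("winequality-white.csv", sep=";").assign(color=0)
wine = pd.concat([red, white], ignore_index=True)

# Physicochemical properties (+ color) as inputs, taste score (0-10) as output.
X = wine.drop(columns="quality").values
y = wine["quality"].values
feature_names = [c for c in wine.columns if c != "quality"]
print(X.shape, y.shape)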

Page 5: Understanding Random Forests: From Theory to Practice

Outline

1 Motivation

2 Growing decision trees and random forests
Review of state-of-the-art, minor contributions

3 Interpreting random forests
Major contributions (Theory)

4 Implementing and accelerating random forests
Major contributions (Practice)

5 Conclusions

5 / 39

Page 6: Understanding Random Forests: From Theory to Practice

Supervised learning

• The inputs are random variables X = X1, ..., Xp ;

• The output is a random variable Y .

• Data comes as a finite learning set

L = {(xi, yi) | i = 0, ..., N − 1},

where xi ∈ X = X1 × ... × Xp and yi ∈ Y are randomly drawn from P_{X,Y}.

E.g., (xi , yi ) = ((color = red, alcohol = 12, ...), score = 6)

• The goal is to find a model ϕ_L : X → Y minimizing

Err(ϕ_L) = E_{X,Y}{L(Y, ϕ_L(X))}.

6 / 39

Page 7: Understanding Random Forests: From Theory to Practice

Performance evaluation

Classification

• Symbolic output (e.g., Y = {yes, no})

• Zero-one loss

L(Y, ϕ_L(X)) = 1(Y ≠ ϕ_L(X))

Regression

• Numerical output (e.g., Y = R)

• Squared error loss

L(Y, ϕ_L(X)) = (Y − ϕ_L(X))²

7 / 39
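As a concrete illustration of these two losses, a small sketch (with made-up predictions, not thesis code) using scikit-learn's metrics:

import numpy as np
from sklearn.metrics import mean_squared_error, zero_one_loss

# Classification: zero-one loss = fraction of misclassified samples.
y_true_cls = np.array(["yes", "no", "yes", "yes"])
y_pred_cls = np.array(["yes", "yes", "yes", "no"])
print(zero_one_loss(y_true_cls, y_pred_cls))       # 2 errors out of 4 -> 0.5

# Regression: squared error loss, averaged over samples.
y_true_reg = np.array([6.0, 5.0, 7.0])
y_pred_reg = np.array([5.5, 5.0, 6.0])
print(mean_squared_error(y_true_reg, y_pred_reg))  # (0.25 + 0 + 1) / 3 ≈ 0.417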

Page 8: Understanding Random Forests: From Theory to Practice

Divide and conquer

[Figure: training samples plotted in the (X1, X2) input space, before any split.]

8 / 39

Page 9: Understanding Random Forests: From Theory to Practice

Divide and conquer

[Figure: the same samples, with a first split at X1 = 0.7 partitioning the input space.]

8 / 39

Page 10: Understanding Random Forests: From Theory to Practice

Divide and conquer

[Figure: a second split at X2 = 0.5 further partitions the (X1, X2) input space.]

8 / 39

Page 11: Understanding Random Forests: From Theory to Practice

Decision trees

[Figure: the partitioned (X1, X2) space alongside the corresponding decision tree. The root t1 tests X1 ≤ 0.7; one child, t2, further tests X2 ≤ 0.5 and leads to the leaves t4 and t5; the other child, t3, is a leaf. Each leaf stores p(Y = c | X = x).]

t ∈ ϕ : nodes of the tree ϕ
Xt : split variable at t
vt ∈ ℝ : split threshold at t
ϕ(x) = arg max_{c ∈ Y} p(Y = c | X = x)

9 / 39

Page 12: Understanding Random Forests: From Theory to Practice

Learning from data (CART)

function BuildDecisionTree(L)
    Create node t from the learning sample Lt = L
    if the stopping criterion is met for t then
        ŷt = some constant value
    else
        Find the split on Lt that maximizes impurity decrease:
            s* = arg max_{s ∈ Q} Δi(s, t)
        Partition Lt into L_tL ∪ L_tR according to s*
        tL = BuildDecisionTree(L_tL)
        tR = BuildDecisionTree(L_tR)
    end if
    return t
end function

10 / 39
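The pseudocode above can be made concrete with a minimal, unoptimized Python sketch of the greedy procedure for regression, using variance as the impurity i(t). This is an illustrative reimplementation, not the thesis code; the stopping parameters (max_depth, min_samples) are arbitrary choices.

import numpy as np

def build_decision_tree(X, y, max_depth=3, min_samples=5, depth=0):
    node = {"n": len(y), "value": float(np.mean(y))}         # constant prediction ŷt
    if depth >= max_depth or len(y) < min_samples or np.var(y) == 0.0:
        return node                                           # stopping criterion met
    best = None
    for j in range(X.shape[1]):                               # candidate splits Q
        for v in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= v
            p_l, p_r = left.mean(), 1.0 - left.mean()
            # impurity decrease Δi(s, t) = i(t) - p_L i(t_L) - p_R i(t_R)
            gain = np.var(y) - p_l * np.var(y[left]) - p_r * np.var(y[~left])
            if best is None or gain > best[0]:
                best = (gain, j, float(v), left)
    if best is None or best[0] <= 0.0:
        return node
    _, j, v, left = best
    node.update(
        feature=j, threshold=v,
        left=build_decision_tree(X[left], y[left], max_depth, min_samples, depth + 1),
        right=build_decision_tree(X[~left], y[~left], max_depth, min_samples, depth + 1),
    )
    return node

Called as build_decision_tree(X, y) on the wine data loaded earlier, it returns a nested dict representing the fitted tree.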

Page 13: Understanding Random Forests: From Theory to Practice

Back to our example

alcohol <= 10.625
|-- vol. acidity <= 0.237
|   |-- y = 5.915
|   |-- y = 5.382
|-- alcohol <= 11.741
|   |-- vol. acidity <= 0.442
|   |   |-- y = 6.131
|   |   |-- y = 5.557
|   |-- y = 6.516

11 / 39
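A tree like the one above can be reproduced (up to small differences in thresholds) with scikit-learn's CART implementation. This sketch assumes X, y and feature_names from the earlier loading sketch; the depth limit of 3 is a choice made to mimic the slide.

from sklearn.tree import DecisionTreeRegressor, export_text

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_names))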

Page 14: Understanding Random Forests: From Theory to Practice

Bias-variance decomposition

Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error at X = x is

E_L{Err(ϕ_L(x))} = noise(x) + bias²(x) + var(x)

where

noise(x) = Err(ϕ_B(x)),
bias²(x) = (ϕ_B(x) − E_L{ϕ_L(x)})²,
var(x) = E_L{(E_L{ϕ_L(x)} − ϕ_L(x))²}.

[Figure: distribution of the predictions ϕ_L(x) over learning sets L around the Bayes model ϕ_B(x), illustrating noise(x), bias²(x) and var(x).]

12 / 39
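The three terms can be estimated empirically by redrawing the learning set many times. The sketch below does so on synthetic one-dimensional data (not the wine data), where the Bayes model ϕ_B(x) and the noise level are known by construction; all names and constants are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
noise_std = 0.3

def draw_learning_set(n=300):
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=noise_std, size=n)
    return X, y

x_test = np.linspace(-3, 3, 200).reshape(-1, 1)
bayes = np.sin(x_test[:, 0])                       # ϕ_B(x), known by construction

# Train one fully grown tree per independently drawn learning set L.
preds = np.array([DecisionTreeRegressor().fit(*draw_learning_set())
                  .predict(x_test) for _ in range(200)])

bias2 = (bayes - preds.mean(axis=0)) ** 2          # bias²(x)
var = preds.var(axis=0)                            # var(x)
print("avg bias²:", bias2.mean().round(4),
      "avg var:", var.mean().round(4),
      "noise:", noise_std ** 2)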

Page 15: Understanding Random Forests: From Theory to Practice

Diagnosing the generalization error of a decision tree

• (Residual error : Lowest achievable error, independent of ϕL.)

• Bias : Decision trees usually have low bias.

• Variance : They often suffer from high variance.

• Solution : Combine the predictions of several randomized trees into a single model.

13 / 39

Page 16: Understanding Random Forests: From Theory to Practice

Random forests

[Figure: an ensemble of M trees ϕ1, ..., ϕM applied to the same input x; the per-tree estimates p_{ϕm}(Y = c | X = x) are aggregated into the ensemble estimate p_ψ(Y = c | X = x).]

Randomization
• Random Forests : bootstrap samples + random selection of K ≤ p split variables
• Extra-Trees : random selection of K ≤ p split variables + random selection of the threshold

14 / 39

Page 17: Understanding Random Forests: From Theory to Practice

Bias-variance decomposition (cont.)

Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error E_L{Err(ψ_{L,θ1,...,θM}(x))} at X = x of an ensemble of M randomized models ϕ_{L,θm} is

E_L{Err(ψ_{L,θ1,...,θM}(x))} = noise(x) + bias²(x) + var(x),

where

noise(x) = Err(ϕ_B(x)),
bias²(x) = (ϕ_B(x) − E_{L,θ}{ϕ_{L,θ}(x)})²,
var(x) = ρ(x) σ²_{L,θ}(x) + ((1 − ρ(x)) / M) σ²_{L,θ}(x),

and where ρ(x) is the Pearson correlation coefficient between the predictions of two randomized trees built on the same learning set.

15 / 39

Page 18: Understanding Random Forests: From Theory to Practice

Diagnosing the generalization error of random forests

• Bias : Identical to the bias of a single randomized tree.

• Variance : var(x) = ρ(x) σ²_{L,θ}(x) + ((1 − ρ(x)) / M) σ²_{L,θ}(x)

As M → ∞, var(x) → ρ(x) σ²_{L,θ}(x).
The stronger the randomization, ρ(x) → 0 and var(x) → 0.
The weaker the randomization, ρ(x) → 1 and var(x) → σ²_{L,θ}(x).

Bias-variance trade-off. Randomization increases bias but makes it possible to reduce the variance of the corresponding ensemble model. The crux of the problem is to find the right trade-off.

16 / 39

Page 19: Understanding Random Forests: From Theory to Practice

Back to our example

Method          Trees   MSE
CART            1       1.055
Random Forest   50      0.517
Extra-Trees     50      0.507

Combining several randomized trees indeed works better !

17 / 39
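The comparison on this slide can be approximated with scikit-learn. The sketch below assumes X, y from the earlier loading sketch and maps the slide's methods onto DecisionTreeRegressor, RandomForestRegressor (bootstrap samples plus random selection of max_features split variables) and ExtraTreesRegressor (random split variables and thresholds). The exact MSE values will differ from the slide depending on the evaluation protocol.

from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

models = {
    "CART (1 tree)": DecisionTreeRegressor(random_state=0),
    "Random Forest (50 trees)": RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0),
    "Extra-Trees (50 trees)": ExtraTreesRegressor(n_estimators=50, n_jobs=-1, random_state=0),
}
for name, model in models.items():
    mse = -cross_val_score(model, X, y, cv=10,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name:<26s} MSE = {mse:.3f}")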

Page 20: Understanding Random Forests: From Theory to Practice

Outline

1 Motivation

2 Growing decision trees and random forests

3 Interpreting random forests

4 Implementing and accelerating random forests

5 Conclusions

18 / 39

Page 21: Understanding Random Forests: From Theory to Practice

Variable importances

• Interpretability can be recovered through variable importances

• Two main importance measures :
The mean decrease of impurity (MDI) : summing total impurity reductions at all tree nodes where the variable appears (Breiman et al., 1984) ;
The mean decrease of accuracy (MDA) : measuring accuracy reduction on out-of-bag samples when the values of the variable are randomly permuted (Breiman, 2001).

• We focus here on MDI because :
It is faster to compute ;
It does not require bootstrap sampling ;
In practice, it correlates well with the MDA measure.

19 / 39
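Both measures are available in scikit-learn: feature_importances_ is the impurity-based (MDI) score, and permutation_importance gives an MDA-style score, computed here on a held-out split rather than on out-of-bag samples. The sketch assumes X, y and feature_names from the earlier loading sketch; the forest size and n_repeats are arbitrary choices.

from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)

mdi = forest.feature_importances_                  # mean decrease of impurity
mda = permutation_importance(forest, X_test, y_test, n_repeats=10,
                             random_state=0).importances_mean

for name, a, b in sorted(zip(feature_names, mdi, mda), key=lambda r: -r[1]):
    print(f"{name:>20s}  MDI={a:.3f}  MDA={b:.3f}")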

Page 22: Understanding Random Forests: From Theory to Practice

Mean decrease of impurity

[Figure: an ensemble of M trees ϕ1, ϕ2, ..., ϕM.]

Importance of variable Xj for an ensemble of M trees ϕm is :

Imp(Xj) = (1/M) Σ_{m=1}^{M} Σ_{t ∈ ϕm} 1(jt = j) [ p(t) Δi(t) ],

where jt denotes the variable used at node t, p(t) = Nt/N and Δi(t) is the impurity reduction at node t :

Δi(t) = i(t) − (N_{tL} / N_t) i(tL) − (N_{tR} / N_t) i(tR)

20 / 39

Page 23: Understanding Random Forests: From Theory to Practice

Back to our example

MDI scores as computed from a forest of 1000 fully developed trees on the Wine dataset (Random Forest, default parameters).

[Figure: horizontal bar chart of MDI scores (x-axis from 0.00 to 0.30) for : color, fixed acidity, citric acid, density, chlorides, pH, residual sugar, total sulfur dioxide, sulphates, free sulfur dioxide, volatile acidity, alcohol.]

21 / 39

Page 24: Understanding Random Forests: From Theory to Practice

What does it mean ?

• MDI works well, but it is not well understood theoretically ;

• We would like to better characterize it and derive its main properties from this characterization.

• Working assumptions :

All variables are discrete ;
Multi-way splits à la C4.5 (i.e., one branch per value) ;
Shannon entropy as impurity measure :

    i(t) = − Σ_c (N_{t,c} / N_t) log(N_{t,c} / N_t)

Totally randomized trees (RF with K = 1) ;
Asymptotic conditions : N → ∞, M → ∞.

22 / 39

Page 25: Understanding Random Forests: From Theory to Practice

Result 1 : Three-level decomposition (Louppe et al., 2013)

Theorem. Variable importances provide a three-level decomposition of the information jointly provided by all the input variables about the output, accounting for all interaction terms in a fair and exhaustive way.

i) The information jointly provided by all input variables about the output decomposes in terms of the MDI importance of each input variable :

    I(X1, ..., Xp ; Y) = Σ_{j=1}^{p} Imp(Xj)

ii)-iii) Each importance further decomposes along the degrees k of interaction with the other variables, and along all interaction terms B of a given degree k :

    Imp(Xj) = Σ_{k=0}^{p−1} [ 1 / (C_p^k (p − k)) ] Σ_{B ∈ P_k(V^{−j})} I(Xj ; Y | B)

E.g. : p = 3,
Imp(X1) = (1/3) I(X1; Y) + (1/6) (I(X1; Y | X2) + I(X1; Y | X3)) + (1/3) I(X1; Y | X2, X3)

23 / 39
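To make the combinatorial weights concrete, a tiny illustrative sketch computes 1 / (C_p^k (p − k)) for p = 3, recovering the coefficients 1/3, 1/6 and 1/3 used in the example above.

from math import comb

p = 3
# weight given to conditioning sets B of size k in Imp(Xj)
weights = {k: 1 / (comb(p, k) * (p - k)) for k in range(p)}
print(weights)   # expected: {0: 1/3, 1: 1/6, 2: 1/3}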

Page 26: Understanding Random Forests: From Theory to Practice

Illustration : 7-segment display (Breiman et al., 1984)

y   x1  x2  x3  x4  x5  x6  x7
0   1   1   1   0   1   1   1
1   0   0   1   0   0   1   0
2   1   0   1   1   1   0   1
3   1   0   1   1   0   1   1
4   0   1   1   1   0   1   0
5   1   1   0   1   0   1   1
6   1   1   0   1   1   1   1
7   1   0   1   0   0   1   0
8   1   1   1   1   1   1   1
9   1   1   1   1   0   1   1

24 / 39

Page 27: Understanding Random Forests: From Theory to Practice

Illustration : 7-segment display (Breiman et al., 1984)

Imp(Xj) = Σ_{k=0}^{p−1} [ 1 / (C_p^k (p − k)) ] Σ_{B ∈ P_k(V^{−j})} I(Xj ; Y | B)

Var   Imp
X1    0.412
X2    0.581
X3    0.531
X4    0.542
X5    0.656
X6    0.225
X7    0.372
Σ     3.321

[Figure: for each variable X1, ..., X7, the decomposition of its importance across interaction degrees k = 0, ..., 6.]

24 / 39

Page 28: Understanding Random Forests: From Theory to Practice

Result 2 : Irrelevant variables (Louppe et al., 2013)

Theorem. Variable importances depend only on the relevant variables.

Theorem. A variable Xj is irrelevant if and only if Imp(Xj) = 0.

⇒ The importance of a relevant variable is insensitive to the addition or the removal of irrelevant variables.

Definition (Kohavi & John, 1997). A variable X is irrelevant (to Y with respect to V) if, for all B ⊆ V, I(X; Y | B) = 0. A variable is relevant if it is not irrelevant.

25 / 39

Page 29: Understanding Random Forests: From Theory to Practice

Relaxing assumptions

When trees are not totally random...

• There can be relevant variables with zero importances (due to masking effects).

• The importance of relevant variables can be influenced by the number of irrelevant variables.

When the learning set is finite...

• Importances are biased towards variables of high cardinality.

• This effect can be minimized by collecting impurity terms measured from large enough samples only.

When splits are not multiway...

• i(t) does not actually measure the mutual information.

26 / 39

Page 30: Understanding Random Forests: From Theory to Practice

Back to our example

MDI scores as computed from a forest of 1000 fixed-depth trees on the Wine dataset (Extra-Trees, K = 1, max depth = 5).

[Figure: horizontal bar chart of MDI scores (x-axis from 0.00 to 0.30) for : pH, residual sugar, fixed acidity, sulphates, free sulfur dioxide, citric acid, chlorides, total sulfur dioxide, density, color, volatile acidity, alcohol.]

Taking into account (some of) the biases results in quite a different story !

27 / 39
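The bias-mitigated scores on this slide can be approximated with scikit-learn's ExtraTreesRegressor, where max_features=1 corresponds to K = 1 (totally randomized splits) and max_depth=5 matches the slide's fixed depth. X, y and feature_names are assumed from the earlier loading sketch; the resulting ranking will only approximate the slide's figure.

from sklearn.ensemble import ExtraTreesRegressor

et = ExtraTreesRegressor(n_estimators=1000, max_features=1, max_depth=5,
                         n_jobs=-1, random_state=0).fit(X, y)
for name, imp in sorted(zip(feature_names, et.feature_importances_),
                        key=lambda r: -r[1]):
    print(f"{name:>20s}  {imp:.3f}")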

Page 31: Understanding Random Forests: From Theory to Practice

Outline

1 Motivation

2 Growing decision trees and random forests

3 Interpreting random forests

4 Implementing and accelerating random forests

5 Conclusions

28 / 39

Page 32: Understanding Random Forests: From Theory to Practice

Implementation (Buitinck et al., 2013)

Scikit-Learn

• Open source machine learning library for Python

• Classical and well-established algorithms

• Emphasis on code quality and usability


A long team effort

Time for building a Random Forest (relative to version 0.10) :

Version        0.10   0.11   0.12   0.13   0.14   0.15
Relative time  1.00   0.99   0.98   0.33   0.11   0.04

29 / 39

Page 33: Understanding Random Forests: From Theory to Practice

Implementation overview

• Modular implementation, designed with a strict separation of concerns
Builders : for building and connecting nodes into a tree
Splitters : for finding a split
Criteria : for evaluating the goodness of a split
Tree : dedicated data structure

• Efficient algorithmic formulation [See Louppe, 2014]
Dedicated sorting procedure
Efficient evaluation of consecutive splits

• Close to the metal, carefully coded implementation
2300+ lines of Python, 3000+ lines of Cython, 1700+ lines of tests

# But we kept it stupid simple for users!
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

30 / 39

Page 34: Understanding Random Forests: From Theory to Practice

A winning strategy

The Scikit-Learn implementation proves to be one of the fastest among all libraries and programming languages.

Library (language)                  Fit time (s)
Scikit-Learn-RF (Python, Cython)         203.01
Scikit-Learn-ETs (Python, Cython)        211.53
OpenCV-RF (C++)                         4464.65
OpenCV-ETs (C++)                        3342.83
OK3-RF (C)                              1518.14
OK3-ETs (C)                             1711.94
Weka-RF (Java)                          1027.91
randomForest (R, Fortran)              13427.06
Orange-RF (Python)                     10941.72

31 / 39

Page 35: Understanding Random Forests: From Theory to Practice

Computational complexity (Louppe, 2014)

Average time complexity

CART            Θ(p N log² N)
Random Forest   Θ(M K Ñ log² Ñ)
Extra-Trees     Θ(M K N log N)

• N : number of samples in L

• p : number of input variables

• K : the number of variables randomly drawn at each node

• Ñ = 0.632 N : the expected number of unique samples in a bootstrap replicate of L.

32 / 39

Page 36: Understanding Random Forests: From Theory to Practice

Improving scalability through randomization

Motivation

• Randomization and averaging make it possible to improve accuracy by reducing variance.

• As a nice side-effect, the resulting algorithms are fast and embarrassingly parallel.

• Why not purposely exploit randomization to make the algorithm even more scalable (and at least as accurate) ?

Problem

• Let us assume a supervised learning problem of Ns samples defined over Nf features. Let us also assume T computing nodes, each with a memory capacity limited to Mmax, with Mmax ≪ Ns × Nf.

• How to best exploit the memory constraint to obtain the most accurate model, as quickly as possible ?

33 / 39

Page 37: Understanding Random Forests: From Theory to Practice

A straightforward solution : Random Patches (Louppe et al., 2012)

[Figure: a random patch drawn from the data (X, Y) : a random subset of the rows (examples) and of the columns (features).]

1. Draw a subsample r of ps·Ns random examples, with pf·Nf random features.

2. Build a base estimator on r.

3. Repeat 1-2 for a number T of estimators.

4. Aggregate the predictions by voting.

ps and pf are two meta-parameters that should be selected
• such that ps·Ns × pf·Nf ≤ Mmax
• to optimize accuracy

34 / 39
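In scikit-learn, the Random Patches scheme can be instantiated with BaggingClassifier by drawing both examples and features without replacement; max_samples and max_features play the roles of ps and pf. The dataset and parameter values below are illustrative, not those of the thesis experiments.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=2000, n_features=50, random_state=0)

# Each of the T estimators sees only ps*Ns examples and pf*Nf features,
# drawn without replacement: one "random patch" of the data per estimator.
rp = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,          # T
    max_samples=0.25,          # ps
    max_features=0.25,         # pf
    bootstrap=False,
    bootstrap_features=False,
    random_state=0,
)
print(cross_val_score(rp, X_demo, y_demo, cv=5).mean())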

Page 38: Understanding Random Forests: From Theory to Practice

Impact of memory constraint

[Figure: average accuracy as a function of the memory constraint (from 0.1 to 0.5) for RP-ET, RP-DT, ET and RF; accuracies range from roughly 0.80 to 0.94.]

35 / 39

Page 39: Understanding Random Forests: From Theory to Practice

Lessons learned from subsampling

• Training each estimator on the whole data is (often) useless. The size of the random patches can be reduced without (significant) loss in accuracy.

• As a result, both memory consumption and training time can be reduced, at low cost.

• With strong memory constraints, RP can exploit data better than the other methods.

• Sampling features is critical to improve accuracy. Sampling the examples only is often ineffective.

36 / 39

Page 40: Understanding Random Forests: From Theory to Practice

Outline

1 Motivation

2 Growing decision trees and random forests

3 Interpreting random forests

4 Implementing and accelerating random forests

5 Conclusions

37 / 39

Page 41: Understanding Random Forests: From Theory to Practice

Opening the black box

• Random forests constitute one of the most robust and effective machine learning algorithms for many problems.

• While simple in design and easy to use, random forests remain however
hard to analyze theoretically,
non-trivial to interpret,
difficult to implement properly.

• Through an in-depth re-assessment of the method, this dissertation has proposed original contributions on these issues.

38 / 39

Page 42: Understanding Random Forests: From Theory to Practice

Future works

Variable importances

• Theoretical characterization of variable importances in a finite setting.

• (Re-analysis of) empirical studies based on variable importances, in light of the results and conclusions of the thesis.

• Study of variable importances in boosting.

Subsampling

• Finer study of subsampling statistical mechanisms.

• Smart sampling.

39 / 39

Page 43: Understanding Random Forests: From Theory to Practice

Questions ?

40 / 39

Page 44: Understanding Random Forests: From Theory to Practice

Backup slides

41 / 39

Page 45: Understanding Random Forests: From Theory to Practice

Condorcet’s jury theorem

Let us consider a group of M voters.

If each voter has an independent probability p > 1/2 of voting for the correct decision, then adding more voters increases the probability that the majority decision is correct.

When M → ∞, the probability that the decision taken by the group is correct approaches 1.

42 / 39
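For an odd number of voters, the majority probability can be computed exactly from the binomial distribution. The sketch below is illustrative only; p = 0.55 is an arbitrary choice, and it shows how the probability grows with M.

from math import comb

def p_majority_correct(M, p):
    """Probability that the majority of M independent voters is correct (M odd)."""
    return sum(comb(M, k) * p ** k * (1 - p) ** (M - k)
               for k in range((M + 1) // 2, M + 1))

for M in (1, 11, 101, 1001):
    print(M, round(p_majority_correct(M, p=0.55), 4))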

Page 46: Understanding Random Forests: From Theory to Practice

Interpretation of ρ(x) (Louppe, 2014)

Theorem. ρ(x) = V_L{E_{θ|L}{ϕ_{L,θ}(x)}} / ( V_L{E_{θ|L}{ϕ_{L,θ}(x)}} + E_L{V_{θ|L}{ϕ_{L,θ}(x)}} )

In other words, it is the ratio between

• the variance due to the learning set and

• the total variance, accounting for random effects due to both the learning set and the random perturbations.

ρ(x) → 1 when variance is mostly due to the learning set ;
ρ(x) → 0 when variance is mostly due to the random perturbations ;
ρ(x) ≥ 0.

43 / 39