Bias-variance decomposition in Random Forests
Gilles Louppe @glouppe
Paris ML S02E04, December 8, 2014
1 / 14
Motivation
In supervised learning, combining the predictions of several randomized models often achieves better results than a single non-randomized model.
Why?
2 / 14
Supervised learning
• The inputs are random variables X = X_1, ..., X_p;
• The output is a random variable Y.
• Data comes as a finite learning set
L = {(x_i, y_i) | i = 0, ..., N − 1},
where x_i ∈ X = X_1 × ... × X_p and y_i ∈ Y are randomly drawn from P_{X,Y}.
• The goal is to find a model ϕ_L : X → Y minimizing
Err(ϕ_L) = E_{X,Y}{L(Y, ϕ_L(X))}.
3 / 14
Performance evaluation
Classification
• Symbolic output (e.g., Y = {yes, no})
• Zero-one loss
L(Y, ϕ_L(X)) = 1(Y ≠ ϕ_L(X))
Regression
• Numerical output (e.g., Y = ℝ)
• Squared error loss
L(Y, ϕ_L(X)) = (Y − ϕ_L(X))²
4 / 14
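The two losses above can be written directly in code. A minimal sketch using NumPy; the function names are my own, not from the talk:

```python
import numpy as np

# Zero-one loss for classification: 1 when the prediction differs from the truth.
def zero_one_loss(y_true, y_pred):
    return (np.asarray(y_true) != np.asarray(y_pred)).astype(float)

# Squared error loss for regression.
def squared_error_loss(y_true, y_pred):
    return (np.asarray(y_true) - np.asarray(y_pred)) ** 2

print(zero_one_loss(["yes", "no"], ["yes", "yes"]).tolist())  # [0.0, 1.0]
print(squared_error_loss([1.0, 2.0], [0.5, 2.0]).tolist())    # [0.25, 0.0]
```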
Decision trees
[Figure: a partition of the input space by the axis-aligned splits X_1 ≤ 0.7 and X_2 ≤ 0.5, and the corresponding tree with split nodes t_1, t_2 and leaf nodes t_3, t_4, t_5; an input x is routed to a leaf, which outputs p(Y = c | X = x).]
t ∈ ϕ : nodes of the tree ϕ
X_t : split variable at t
v_t ∈ ℝ : split threshold at t
ϕ(x) = arg max_{c ∈ Y} p(Y = c | X = x)
5 / 14
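A sketch of the slide's tree in code, assuming scikit-learn is available. The training data here is invented for illustration, so the learned thresholds will only approximate the 0.7 / 0.5 splits drawn on the slide:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 2)                                 # two input variables X1, X2
y = ((X[:, 0] <= 0.7) & (X[:, 1] <= 0.5)).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

proba = tree.predict_proba([[0.2, 0.3]])  # p(Y = c | X = x) at the leaf reached by x
pred = tree.classes_[np.argmax(proba)]    # phi(x) = arg max_c p(Y = c | X = x)
```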
Bias-variance decomposition in regression
Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error at X = x is
E_L{Err(ϕ_L(x))} = noise(x) + bias²(x) + var(x)
where
noise(x) = Err(ϕ_B(x)),
bias²(x) = (ϕ_B(x) − E_L{ϕ_L(x)})²,
var(x) = E_L{(E_L{ϕ_L(x)} − ϕ_L(x))²}.
6 / 14
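The three terms can be estimated by Monte Carlo, repeatedly redrawing the learning set L. A sketch on an assumed toy problem (y = sin(x) plus Gaussian noise, not from the talk), with scikit-learn's DecisionTreeRegressor standing in for ϕ_L:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x0, sigma = np.array([[1.0]]), 0.3
f = np.sin                                  # Bayes model phi_B for the squared loss

preds = []
for _ in range(300):                        # repeatedly redraw the learning set L
    X = rng.uniform(0, 5, size=(50, 1))
    y = f(X[:, 0]) + sigma * rng.randn(50)
    preds.append(DecisionTreeRegressor().fit(X, y).predict(x0)[0])
preds = np.array(preds)

noise = sigma ** 2                          # Err(phi_B(x)): irreducible error
bias2 = (f(x0[0, 0]) - preds.mean()) ** 2   # (phi_B(x) - E_L{phi_L(x)})^2
var = preds.var()                           # E_L{(E_L{phi_L(x)} - phi_L(x))^2}
```

As the next slide states, the bias term comes out small and the variance term dominates: a fully grown tree essentially reproduces the noise of the training point nearest to x.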
Bias-variance decomposition
7 / 14
Diagnosing the generalization error of a decision tree
• (Residual error: lowest achievable error, independent of ϕ_L.)
• Bias: decision trees usually have low bias.
• Variance: they often suffer from high variance.
• Solution: combine the predictions of several randomized trees into a single model.
8 / 14
Random forests
[Figure: M randomized trees ϕ_1, ..., ϕ_M each produce an estimate p_{ϕ_m}(Y = c | X = x) for the input x; these estimates are averaged (Σ) into the ensemble prediction p_ψ(Y = c | X = x).]
Randomization
• Bootstrap samples + random selection of K ≤ p split variables → Random Forests
• Random selection of K ≤ p split variables + random selection of the threshold → Extra-Trees
9 / 14
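Both randomization schemes are exposed in scikit-learn (assumed available here); K is the max_features parameter, and the data below is invented for illustration:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Random Forests: bootstrap samples + K = max_features randomly chosen split variables.
rf = RandomForestClassifier(n_estimators=100, max_features=2,
                            bootstrap=True, random_state=0).fit(X, y)

# Extra-Trees: random split variables + randomly drawn thresholds
# (no bootstrap by default in scikit-learn).
et = ExtraTreesClassifier(n_estimators=100, max_features=2,
                          random_state=0).fit(X, y)

x = [[0.9, 0.8, 0.5, 0.5, 0.5]]
p_rf = rf.predict_proba(x)[0, 1]  # averaged p_psi(Y = 1 | X = x)
p_et = et.predict_proba(x)[0, 1]
```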
Bias-variance decomposition (cont.)
Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error E_L{Err(ψ_{L,θ_1,...,θ_M}(x))} at X = x of an ensemble of M randomized models ϕ_{L,θ_m} is
E_L{Err(ψ_{L,θ_1,...,θ_M}(x))} = noise(x) + bias²(x) + var(x),
where
noise(x) = Err(ϕ_B(x)),
bias²(x) = (ϕ_B(x) − E_{L,θ}{ϕ_{L,θ}(x)})²,
var(x) = ρ(x) σ²_{L,θ}(x) + ((1 − ρ(x)) / M) σ²_{L,θ}(x),
and where ρ(x) is the Pearson correlation coefficient between the predictions of two randomized trees built on the same learning set.
10 / 14
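The variance term can be checked numerically. For jointly Gaussian "predictions" with common variance σ² and pairwise correlation ρ (an idealized stand-in for the M trees), the variance of their average matches ρσ² + ((1 − ρ)/M)σ²:

```python
import numpy as np

rho, s2, M = 0.4, 1.0, 25
# Equicorrelated covariance matrix: s2 on the diagonal, rho * s2 off-diagonal.
cov = s2 * (rho * np.ones((M, M)) + (1.0 - rho) * np.eye(M))

rng = np.random.RandomState(0)
draws = rng.multivariate_normal(np.zeros(M), cov, size=200_000)

empirical = draws.mean(axis=1).var()          # variance of the average of M draws
theoretical = rho * s2 + (1.0 - rho) / M * s2 # 0.4 + 0.6/25 = 0.424
```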
Interpretation of ρ(x) (Louppe, 2014)
Theorem. ρ(x) = V_L{E_{θ|L}{ϕ_{L,θ}(x)}} / (V_L{E_{θ|L}{ϕ_{L,θ}(x)}} + E_L{V_{θ|L}{ϕ_{L,θ}(x)}})
In other words, it is the ratio between
• the variance due to the learning set and
• the total variance, accounting for random effects due to both the learning set and the random perturbations.
ρ(x) → 1 when variance is mostly due to the learning set;
ρ(x) → 0 when variance is mostly due to the random perturbations;
ρ(x) > 0.
11 / 14
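ρ(x) can be estimated from this variance-ratio form with a nested Monte Carlo loop: redraw L in the outer loop and θ in the inner loop. A sketch on the same assumed sin + noise toy problem, taking θ to be bootstrap resampling (an assumption for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x0 = np.array([[1.0]])

means, vars_ = [], []
for _ in range(100):                       # outer loop: redraw the learning set L
    X = rng.uniform(0, 5, size=(100, 1))
    y = np.sin(X[:, 0]) + 0.3 * rng.randn(100)
    preds = []
    for _ in range(25):                    # inner loop: redraw theta given L
        idx = rng.randint(0, 100, 100)     # bootstrap sample of L
        t = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds.append(t.predict(x0)[0])
    means.append(np.mean(preds))           # E_{theta|L}{phi_{L,theta}(x)}
    vars_.append(np.var(preds))            # V_{theta|L}{phi_{L,theta}(x)}

# Variance due to L over total variance, as in the theorem above.
rho = np.var(means) / (np.var(means) + np.mean(vars_))
```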
Diagnosing the generalization error of random forests
• Bias: identical to the bias of a single randomized tree.
• Variance: var(x) = ρ(x) σ²_{L,θ}(x) + ((1 − ρ(x)) / M) σ²_{L,θ}(x)
As M → ∞, var(x) → ρ(x) σ²_{L,θ}(x).
The stronger the randomization, the more ρ(x) → 0 and var(x) → 0.
The weaker the randomization, the more ρ(x) → 1 and var(x) → σ²_{L,θ}(x).
Bias-variance trade-off. Randomization increases bias but makes it possible to reduce the variance of the corresponding ensemble model through averaging.
The crux of the problem is to find the right trade-off.
Tip: tune max_features in Random Forests.
12 / 14
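In scikit-learn, K is the max_features parameter. A minimal sweep over a few values (dataset and candidate values invented for illustration; the best K is problem dependent):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Mean cross-validated accuracy for each candidate K = max_features.
scores = {k: cross_val_score(
                 RandomForestClassifier(n_estimators=50, max_features=k,
                                        random_state=0),
                 X, y, cv=3).mean()
          for k in (1, 3, 10)}

best_k = max(scores, key=scores.get)
```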
Bias-variance decomposition in classification
Theorem. For the zero-one loss and binary classification, the expected generalization error E_L{Err(ϕ_L(x))} at X = x decomposes as follows:
E_L{Err(ϕ_L(x))} = P(ϕ_B(x) ≠ Y) + Φ((0.5 − E_L{p̂_L(Y = ϕ_B(x))}) / √(V_L{p̂_L(Y = ϕ_B(x))})) (2P(ϕ_B(x) = Y) − 1)
• For E_L{p̂_L(Y = ϕ_B(x))} > 0.5, V_L{p̂_L(Y = ϕ_B(x))} → 0 makes Φ → 0, and the expected generalization error tends to the error of the Bayes model.
• Conversely, for E_L{p̂_L(Y = ϕ_B(x))} < 0.5, V_L{p̂_L(Y = ϕ_B(x))} → 0 makes Φ → 1, and the error is maximal.
13 / 14
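Taking Φ to be the standard normal CDF, the decomposition can be evaluated for assumed values of the mean and variance of p̂_L, reproducing the first regime above (the values below are invented for illustration):

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_error(bayes_err, mean_p, var_p):
    # E_L{Err} = P(phi_B != Y)
    #          + Phi((0.5 - E_L{p_hat}) / sqrt(V_L{p_hat})) * (2 P(phi_B = Y) - 1),
    # with P(phi_B = Y) = 1 - bayes_err.
    z = (0.5 - mean_p) / math.sqrt(var_p)
    return bayes_err + normal_cdf(z) * (1.0 - 2.0 * bayes_err)

# mean_p > 0.5: shrinking the variance drives the error toward the Bayes error.
e_hi = expected_error(0.1, 0.7, 0.04)   # sd 0.2: extra error on top of 0.1
e_lo = expected_error(0.1, 0.7, 1e-6)   # sd ~ 0: essentially the Bayes error 0.1
```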
Questions?
14 / 14