Bias-variance decomposition in Random Forests
Gilles Louppe @glouppe
Paris ML S02E04, December 8, 2014
1 / 14
Motivation
In supervised learning, combining the predictions of several randomized models often achieves better results than a single non-randomized model.
Why?
2 / 14
Supervised learning
• The inputs are random variables X = X_1, ..., X_p;
• The output is a random variable Y.
• Data comes as a finite learning set
L = {(x_i, y_i) | i = 0, ..., N − 1},
where x_i ∈ X = X_1 × ... × X_p and y_i ∈ Y are randomly drawn from P_{X,Y}.
• The goal is to find a model ϕ_L : X → Y minimizing
Err(ϕ_L) = E_{X,Y}{L(Y, ϕ_L(X))}.
3 / 14
Performance evaluation
Classification
• Symbolic output (e.g., Y = {yes, no})
• Zero-one loss
L(Y, ϕ_L(X)) = 1(Y ≠ ϕ_L(X))
Regression
• Numerical output (e.g., Y = ℝ)
• Squared error loss
L(Y, ϕ_L(X)) = (Y − ϕ_L(X))²
4 / 14
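The two losses above can be written directly in code. A minimal sketch using NumPy; the function names are my own, not from the talk:

```python
import numpy as np

# Zero-one loss for classification: 1 when the prediction differs from the truth.
def zero_one_loss(y_true, y_pred):
    return (np.asarray(y_true) != np.asarray(y_pred)).astype(float)

# Squared error loss for regression.
def squared_error_loss(y_true, y_pred):
    return (np.asarray(y_true) - np.asarray(y_pred)) ** 2

print(zero_one_loss(["yes", "no"], ["yes", "yes"]).tolist())  # [0.0, 1.0]
print(squared_error_loss([1.0, 2.0], [0.5, 2.0]).tolist())    # [0.25, 0.0]
```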
Decision trees
[Figure: a partition of the input space by the axis-aligned splits X_1 ≤ 0.7 and X_2 ≤ 0.5, and the corresponding tree with split nodes t_1, t_2 and leaf nodes t_3, t_4, t_5; an input x is routed to a leaf, which outputs p(Y = c | X = x).]
t ∈ ϕ : nodes of the tree ϕ
X_t : split variable at t
v_t ∈ ℝ : split threshold at t
ϕ(x) = arg max_{c ∈ Y} p(Y = c | X = x)
5 / 14
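A sketch of the slide's tree in code, assuming scikit-learn is available. The training data here is invented for illustration, so the learned thresholds will only approximate the 0.7 / 0.5 splits drawn on the slide:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 2)                                 # two input variables X1, X2
y = ((X[:, 0] <= 0.7) & (X[:, 1] <= 0.5)).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

proba = tree.predict_proba([[0.2, 0.3]])  # p(Y = c | X = x) at the leaf reached by x
pred = tree.classes_[np.argmax(proba)]    # phi(x) = arg max_c p(Y = c | X = x)
```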
Bias-variance decomposition in regression
Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error at X = x is
E_L{Err(ϕ_L(x))} = noise(x) + bias²(x) + var(x)
where
noise(x) = Err(ϕ_B(x)),
bias²(x) = (ϕ_B(x) − E_L{ϕ_L(x)})²,
var(x) = E_L{(E_L{ϕ_L(x)} − ϕ_L(x))²}.
6 / 14
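The three terms can be estimated by Monte Carlo, repeatedly redrawing the learning set L. A sketch on an assumed toy problem (y = sin(x) plus Gaussian noise, not from the talk), with scikit-learn's DecisionTreeRegressor standing in for ϕ_L:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x0, sigma = np.array([[1.0]]), 0.3
f = np.sin                                  # Bayes model phi_B for the squared loss

preds = []
for _ in range(300):                        # repeatedly redraw the learning set L
    X = rng.uniform(0, 5, size=(50, 1))
    y = f(X[:, 0]) + sigma * rng.randn(50)
    preds.append(DecisionTreeRegressor().fit(X, y).predict(x0)[0])
preds = np.array(preds)

noise = sigma ** 2                          # Err(phi_B(x)): irreducible error
bias2 = (f(x0[0, 0]) - preds.mean()) ** 2   # (phi_B(x) - E_L{phi_L(x)})^2
var = preds.var()                           # E_L{(E_L{phi_L(x)} - phi_L(x))^2}
```

As the next slide states, the bias term comes out small and the variance term dominates: a fully grown tree essentially reproduces the noise of the training point nearest to x.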
Bias-variance decomposition
7 / 14
Diagnosing the generalization error of a decision tree
• (Residual error: lowest achievable error, independent of ϕ_L.)
• Bias: decision trees usually have low bias.
• Variance: they often suffer from high variance.
• Solution: combine the predictions of several randomized trees into a single model.
8 / 14
Random forests
[Figure: M randomized trees ϕ_1, ..., ϕ_M each produce an estimate p_{ϕ_m}(Y = c | X = x) for the input x; these estimates are averaged (Σ) into the ensemble prediction p_ψ(Y = c | X = x).]
Randomization
• Bootstrap samples + random selection of K ≤ p split variables → Random Forests
• Random selection of K ≤ p split variables + random selection of the threshold → Extra-Trees
9 / 14
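Both randomization schemes are exposed in scikit-learn (assumed available here); K is the max_features parameter, and the data below is invented for illustration:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Random Forests: bootstrap samples + K = max_features randomly chosen split variables.
rf = RandomForestClassifier(n_estimators=100, max_features=2,
                            bootstrap=True, random_state=0).fit(X, y)

# Extra-Trees: random split variables + randomly drawn thresholds
# (no bootstrap by default in scikit-learn).
et = ExtraTreesClassifier(n_estimators=100, max_features=2,
                          random_state=0).fit(X, y)

x = [[0.9, 0.8, 0.5, 0.5, 0.5]]
p_rf = rf.predict_proba(x)[0, 1]  # averaged p_psi(Y = 1 | X = x)
p_et = et.predict_proba(x)[0, 1]
```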
Bias-variance decomposition (cont.)
Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error E_L{Err(ψ_{L,θ_1,...,θ_M}(x))} at X = x of an ensemble of M randomized models ϕ_{L,θ_m} is
E_L{Err(ψ_{L,θ_1,...,θ_M}(x))} = noise(x) + bias²(x) + var(x),
where
noise(x) = Err(ϕ_B(x)),
bias²(x) = (ϕ_B(x) − E_{L,θ}{ϕ_{L,θ}(x)})²,
var(x) = ρ(x) σ²_{L,θ}(x) + ((1 − ρ(x)) / M) σ²_{L,θ}(x),
and where ρ(x) is the Pearson correlation coefficient between the predictions of two randomized trees built on the same learning set.
10 / 14
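The variance term can be checked numerically. For jointly Gaussian "predictions" with common variance σ² and pairwise correlation ρ (an idealized stand-in for the M trees), the variance of their average matches ρσ² + ((1 − ρ)/M)σ²:

```python
import numpy as np

rho, s2, M = 0.4, 1.0, 25
# Equicorrelated covariance matrix: s2 on the diagonal, rho * s2 off-diagonal.
cov = s2 * (rho * np.ones((M, M)) + (1.0 - rho) * np.eye(M))

rng = np.random.RandomState(0)
draws = rng.multivariate_normal(np.zeros(M), cov, size=200_000)

empirical = draws.mean(axis=1).var()          # variance of the average of M draws
theoretical = rho * s2 + (1.0 - rho) / M * s2 # 0.4 + 0.6/25 = 0.424
```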
Interpretation of ρ(x) (Louppe, 2014)
Theorem. ρ(x) = V_L{E_{θ|L}{ϕ_{L,θ}(x)}} / (V_L{E_{θ|L}{ϕ_{L,θ}(x)}} + E_L{V_{θ|L}{ϕ_{L,θ}(x)}})
In other words, it is the ratio between
• the variance due to the learning set and
• the total variance, accounting for random effects due to both the learning set and the random perturbations.
ρ(x) → 1 when variance is mostly due to the learning set;
ρ(x) → 0 when variance is mostly due to the random perturbations;
ρ(x) > 0.
11 / 14
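ρ(x) can be estimated from this variance-ratio form with a nested Monte Carlo loop: redraw L in the outer loop and θ in the inner loop. A sketch on the same assumed sin + noise toy problem, taking θ to be bootstrap resampling (an assumption for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x0 = np.array([[1.0]])

means, vars_ = [], []
for _ in range(100):                       # outer loop: redraw the learning set L
    X = rng.uniform(0, 5, size=(100, 1))
    y = np.sin(X[:, 0]) + 0.3 * rng.randn(100)
    preds = []
    for _ in range(25):                    # inner loop: redraw theta given L
        idx = rng.randint(0, 100, 100)     # bootstrap sample of L
        t = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds.append(t.predict(x0)[0])
    means.append(np.mean(preds))           # E_{theta|L}{phi_{L,theta}(x)}
    vars_.append(np.var(preds))            # V_{theta|L}{phi_{L,theta}(x)}

# Variance due to L over total variance, as in the theorem above.
rho = np.var(means) / (np.var(means) + np.mean(vars_))
```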
Diagnosing the generalization error of random forests
• Bias: identical to the bias of a single randomized tree.
• Variance: var(x) = ρ(x) σ²_{L,θ}(x) + ((1 − ρ(x)) / M) σ²_{L,θ}(x)
As M → ∞, var(x) → ρ(x) σ²_{L,θ}(x).
The stronger the randomization, the more ρ(x) → 0 and var(x) → 0.
The weaker the randomization, the more ρ(x) → 1 and var(x) → σ²_{L,θ}(x).
Bias-variance trade-off. Randomization increases bias but makes it possible to reduce the variance of the corresponding ensemble model through averaging.
The crux of the problem is to find the right trade-off.
Tip: tune max_features in Random Forests.
12 / 14
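In scikit-learn, K is the max_features parameter. A minimal sweep over a few values (dataset and candidate values invented for illustration; the best K is problem dependent):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Mean cross-validated accuracy for each candidate K = max_features.
scores = {k: cross_val_score(
                 RandomForestClassifier(n_estimators=50, max_features=k,
                                        random_state=0),
                 X, y, cv=3).mean()
          for k in (1, 3, 10)}

best_k = max(scores, key=scores.get)
```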
Bias-variance decomposition in classification
Theorem. For the zero-one loss and binary classification, the expected generalization error E_L{Err(ϕ_L(x))} at X = x decomposes as follows:
E_L{Err(ϕ_L(x))} = P(ϕ_B(x) ≠ Y) + Φ((0.5 − E_L{p̂_L(Y = ϕ_B(x))}) / √(V_L{p̂_L(Y = ϕ_B(x))})) (2P(ϕ_B(x) = Y) − 1)
• For E_L{p̂_L(Y = ϕ_B(x))} > 0.5, V_L{p̂_L(Y = ϕ_B(x))} → 0 makes Φ → 0, and the expected generalization error tends to the error of the Bayes model.
• Conversely, for E_L{p̂_L(Y = ϕ_B(x))} < 0.5, V_L{p̂_L(Y = ϕ_B(x))} → 0 makes Φ → 1, and the error is maximal.
13 / 14
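Taking Φ to be the standard normal CDF, the decomposition can be evaluated for assumed values of the mean and variance of p̂_L, reproducing the first regime above (the values below are invented for illustration):

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_error(bayes_err, mean_p, var_p):
    # E_L{Err} = P(phi_B != Y)
    #          + Phi((0.5 - E_L{p_hat}) / sqrt(V_L{p_hat})) * (2 P(phi_B = Y) - 1),
    # with P(phi_B = Y) = 1 - bayes_err.
    z = (0.5 - mean_p) / math.sqrt(var_p)
    return bayes_err + normal_cdf(z) * (1.0 - 2.0 * bayes_err)

# mean_p > 0.5: shrinking the variance drives the error toward the Bayes error.
e_hi = expected_error(0.1, 0.7, 0.04)   # sd 0.2: extra error on top of 0.1
e_lo = expected_error(0.1, 0.7, 1e-6)   # sd ~ 0: essentially the Bayes error 0.1
```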
Questions?
14 / 14