A walk in random forests (transcript of slides, irma.math.unistra.fr/~gardes/SEMINAIRE/scornet.pdf)
A walk in random forests
Erwan Scornet (LSTA, Institut Curie), supervised by Gérard Biau (LSTA) and Jean-Philippe Vert (Institut Curie)
Séminaire Statistiques - IRMA, Strasbourg, October 2015
Erwan Scornet Random forests
Background on random forests
Random forests are a class of algorithms used to solve regression and classification problems.
They are often used in applied fields since they handle high-dimensional settings.
They have good predictive power and can outperform state-of-the-art methods.
Background on random forests
But theoretical results are not yet entirely sufficient to explain their good accuracy.
1 Construction of random forests
2 Random forests and kernel methods
3 Consistency of Breiman forests
General framework of the presentation
Regression setting
We are given a training set D_n = {(X_1, Y_1), \ldots, (X_n, Y_n)}, where the pairs (X_i, Y_i) \in [0, 1]^d \times \mathbb{R} are i.i.d., distributed as (X, Y).
We assume that
Y = m(X) + \varepsilon,
where \varepsilon \sim \mathcal{N}(0, \sigma^2). We want to build an estimate of the regression function m using the random forest algorithm.
How to build a tree?
Breiman random forests are defined by
1 A splitting rule: minimize the square loss.
2 A stopping rule: leave exactly one point in each cell.
How to perform splits of Breiman’s forests?
For a cut direction j \in \{1, \ldots, d\} and a split position z \in [0, 1], the criterion takes the form

L_n(j, z) = \frac{1}{N_n(A)} \sum_{i=1}^{n} \Big( Y_i - \bar{Y}_{A_L} \mathbf{1}_{X_i^{(j)} < z} - \bar{Y}_{A_R} \mathbf{1}_{X_i^{(j)} \geq z} \Big)^2,

where
A_L = \{x \in A : x^{(j)} < z\} and A_R = \{x \in A : x^{(j)} \geq z\};
\bar{Y}_A is the average of the Y_i's belonging to A;
N_n(A) is the number of points in A.
How to perform splits of Breiman’s forests?
An example: j = 1 and z = 0.5.
[Figure: a cell A containing data points with responses Y_i (16.2, 14.8, 17.1, 5.8, 16.2, 7.1, 6.2, 5.7, 5.5), cut vertically at x^{(1)} = 0.5.]
How to perform splits of Breiman’s forests?
An example: j = 1 and z = 0.5.
[Figure: the same cell, with the averages \bar{Y}_{A_L} and \bar{Y}_{A_R} highlighted on each side of the split.]

L_n(1, 0.5) = \frac{1}{N_n(A)} \sum_{i=1}^{n} \Big( Y_i - \underbrace{\bar{Y}_{A_L}}_{\text{average on } A_L} \mathbf{1}_{X_i^{(1)} < 0.5} - \underbrace{\bar{Y}_{A_R}}_{\text{average on } A_R} \mathbf{1}_{X_i^{(1)} \geq 0.5} \Big)^2.
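The criterion above can be sketched in code. A minimal, self-contained illustration in pure Python (the example data and function names are hypothetical, not the author's implementation):

```python
def split_criterion(points, j, z):
    """L_n(j, z): mean squared residual over the cell after splitting
    along coordinate j at position z. points is a list of (x, y) pairs,
    with x a tuple of features in [0, 1]^d."""
    left = [y for x, y in points if x[j] < z]
    right = [y for x, y in points if x[j] >= z]
    if not left or not right:
        return float("inf")  # degenerate split: one side is empty
    mean_left = sum(left) / len(left)
    mean_right = sum(right) / len(right)
    sq = sum((y - mean_left) ** 2 for y in left)
    sq += sum((y - mean_right) ** 2 for y in right)
    return sq / len(points)

def best_split(points, directions):
    """Minimize L_n(j, z) over the given cut directions, trying the
    observed coordinates as candidate split positions."""
    best = None
    for j in directions:
        for x, _ in points:
            score = split_criterion(points, j, x[j])
            if best is None or score < best[0]:
                best = (score, j, x[j])
    return best
```

For instance, four points that separate cleanly along the first coordinate: `best_split([((0.1,), 5.0), ((0.2,), 5.2), ((0.8,), 16.0), ((0.9,), 16.4)], [0])` selects j = 0 with z = 0.8.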
Construction of random forests
Randomness in tree construction
Resample the data set via bootstrap;
At each node, preselect a subset of mtry variables eligible for splitting.
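These two randomization steps can be sketched as one draw of the random variable Θ. A minimal sketch (the names `draw_theta` and `eligible_directions` are illustrative, not from the slides):

```python
import random

def draw_theta(n, d, mtry, rng):
    """One realization of the randomness Theta: a bootstrap sample of the
    n observations, plus a sampler of mtry candidate split coordinates
    to be called afresh at each node."""
    bootstrap_idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement

    def eligible_directions():
        return rng.sample(range(d), mtry)  # mtry distinct coordinates

    return bootstrap_idx, eligible_directions
```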
Literature
Random forests were created by Breiman [2001].
Many extensions have been proposed to
solve ranking problems [Clémençon et al., 2013],
solve survival analysis problems [Ishwaran et al., 2008],
perform quantile estimation [Meinshausen, 2006],
and to improve computation time [Geurts et al., 2006].
Many theoretical results focus on simplified versions of random forests, whose construction is independent of the dataset [Biau et al., 2008, Ishwaran and Kogalur, 2010, Biau, 2012, Genuer, 2012, Zhu et al., 2012].
Asymptotic normality of random forests [Mentch and Hooker, 2014, Wager, 2014].
Random prediction or not?

Tree estimate:

m_n(x, \Theta) = \sum_{i=1}^{n} \frac{\mathbf{1}_{X_i \in A_n(x, \Theta)}}{N_n(x, \Theta)} Y_i,

where N_n(x, \Theta) is the number of points in the cell A_n(x, \Theta).

M-finite forest estimate:

m_{M,n}(x, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{m=1}^{M} m_n(x, \Theta_m).

Conditionally on D_n, the estimate m_{M,n} depends on \Theta_1, \ldots, \Theta_M. Letting M grow,

m_{M,n}(x, \Theta_1, \ldots, \Theta_M) \xrightarrow[M \to \infty]{} \underbrace{\mathbb{E}_{\Theta}[m_n(x, \Theta)]}_{m_{\infty,n}(x)}.
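The convergence of the finite forest to its Θ-expectation is plain Monte Carlo averaging. A toy sketch, assuming a purely illustrative randomized predictor (not an actual tree): averaging over more draws of Θ reduces the Θ-variance of the prediction while leaving its expectation unchanged.

```python
import random
import statistics

def tree_predict(x, rng):
    """Toy Theta-randomized predictor: signal 2x plus Theta-dependent noise."""
    return 2.0 * x + rng.gauss(0.0, 1.0)

def forest_predict(x, M, seed=0):
    """M-finite forest estimate: average tree_predict over M draws of Theta."""
    rng = random.Random(seed)
    return sum(tree_predict(x, rng) for _ in range(M)) / M
```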
1 Construction of random forests
2 Random forests and kernel methods
3 Consistency of Breiman forests
Erwan Scornet Random forests
Theoretical difficulties for studying random forests
The infinite random forest estimate takes the form

m_{\infty,n}(x) = \sum_{i=1}^{n} Y_i \, \mathbb{E}_{\Theta}\left[ \frac{\mathbf{1}_{X_i \in A_n(x, \Theta)}}{N_n(x, \Theta)} \right],

where N_n(x, \Theta) is the number of points in the cell A_n(x, \Theta).
Two different difficulties:
The number of points in each cell is unknown.
The tree dependency on the random variable \Theta is unknown.
Kernel based on Random Forests (KeRF)
[Figure: the partitions of three random trees of the forest over [0, 1]^2, each cell labelled with the responses Y_i falling in it (values such as 5.3, 5.5, 5.7, 5.8, 6.0, 6.2, 6.8, 7.1, 14.8, 15.1, 16.2, 17.1, 18).]

Infinite KeRF estimate:

\tilde{m}_{\infty,n}(x) = \frac{\sum_{i=1}^{n} Y_i K_k(x, X_i)}{\sum_{j=1}^{n} K_k(x, X_j)},

where K_k(x, X_i) = \mathbb{P}_{\Theta}[X_i \in A_n(x, \Theta)].
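The KeRF estimate is a plain weighted average once the kernel values are available; in practice K_k(x, X_i) can be approximated by the fraction of trees in which X_i falls in the same cell as x. A minimal sketch (the function name is illustrative):

```python
def kerf_predict(responses, kernel_weights):
    """KeRF estimate at a point x: responses[i] is Y_i and
    kernel_weights[i] approximates K_k(x, X_i)."""
    denom = sum(kernel_weights)
    if denom == 0.0:
        return 0.0  # x shares a cell with no observation
    num = sum(y * w for y, w in zip(responses, kernel_weights))
    return num / denom
```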
Breiman KeRF vs Breiman random forests
[Figure: comparison of Breiman KeRF and Breiman random forests on two simulated models. Left: n = 800, d = 50, Y = X_1^2 + \exp(-X_2^2). Right: n = 600, d = 100, Y = -\sin(2 X_1) + X_2^2 + X_3 - \exp(-X_4) + \mathcal{N}(0, 0.5).]
A simple model: the centred forest
[Figure: step-by-step construction of a centred tree in dimension 2 — at each node a split direction is chosen uniformly at random (probability p = 1/2 per coordinate) and the cell is cut at its midpoint.]
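A minimal sketch of the centred-tree construction pictured above, tracking only the cell that contains a given point x (an illustrative toy, not the author's code): at each level a coordinate is drawn uniformly (probability 1/d, i.e. 1/2 in the two-dimensional picture) and the cell is cut at its midpoint.

```python
import random

def centred_cell(x, k, d, rng):
    """Cell of a depth-k centred tree containing x in [0, 1]^d, as a list
    of per-coordinate (low, high) intervals."""
    cell = [(0.0, 1.0)] * d
    for _ in range(k):
        j = rng.randrange(d)          # split direction, uniform on {0, ..., d-1}
        low, high = cell[j]
        mid = (low + high) / 2.0      # split at the center of the cell
        cell[j] = (low, mid) if x[j] < mid else (mid, high)
    return cell
```

After k levels the cell containing x has volume 2^{-k}, whatever the data look like: the partition is independent of D_n, which is what makes this model tractable.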
Centred KeRF vs centred random forests
[Figure: comparison of centred KeRF and centred random forests on two simulated models. Left: n = 800, d = 50, Y = X_1^2 + \exp(-X_2^2). Right: n = 600, d = 100, Y = -\sin(2 X_1) + X_2^2 + X_3 - \exp(-X_4) + \mathcal{N}(0, 0.5).]
Uniform KeRF vs uniform random forests
[Figure: comparison of uniform KeRF and uniform random forests on two simulated models. Left: n = 800, d = 50, Y = X_1^2 + \exp(-X_2^2). Right: n = 600, d = 100, Y = -\sin(2 X_1) + X_2^2 + X_3 - \exp(-X_4) + \mathcal{N}(0, 0.5).]
Analyzing KeRF estimates
Infinite KeRF estimate: \tilde{m}_{\infty,n}(x) = \frac{\sum_{i=1}^{n} Y_i K_k(x, X_i)}{\sum_{j=1}^{n} K_k(x, X_j)}.
It is a local averaging estimate and thus easier to analyze.
One common assumption on kernel estimates is that K_k(x, z) = K\!\left(\frac{x - z}{k}\right), which is not the case here. Thus, standard methods for kernel estimates cannot be directly adapted to our case.
In general, K_k(x, X_i) has no closed form (due to the complexity of the partitioning), but it can be computed for centred/uniform random forests.
Centred forests
For all x, z \in [0, 1]^d,

K_k^{cc}(x, z) = \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}} \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{d} \right)^{k} \prod_{m=1}^{d} \mathbf{1}_{\lceil 2^{k_m} x_m \rceil = \lceil 2^{k_m} z_m \rceil}.

Representations of z \mapsto K_k^{cc}((0.5, 0.5), z) for k = 1, 2, 5.
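The closed form above can be evaluated directly by enumerating the compositions (k_1, …, k_d) of k. A small sketch (exponential in k and d, so illustration only; the helper names are hypothetical):

```python
from math import ceil, factorial

def kernel_cc(x, z, k):
    """Centred-forest kernel K_k^cc(x, z) on [0, 1]^d via the closed form:
    multinomial weight times a product of cell-matching indicators."""
    d = len(x)

    def compositions(m, remaining):
        # enumerate (k_m, ..., k_d) of nonnegative integers summing to `remaining`
        if m == d - 1:
            yield (remaining,)
            return
        for km in range(remaining + 1):
            for tail in compositions(m + 1, remaining - km):
                yield (km,) + tail

    total = 0.0
    for ks in compositions(0, k):
        if all(ceil(2 ** km * xm) == ceil(2 ** km * zm)
               for km, xm, zm in zip(ks, x, z)):
            weight = factorial(k)
            for km in ks:
                weight //= factorial(km)
            total += weight / d ** k
    return total
```

Sanity check: when x = z every indicator is 1, and the multinomial weights sum to 1, so K_k^{cc}(x, x) = 1.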
Uniform forests
For all z \in [0, 1]^d,

K_k^{uf}(0, z) = \sum_{\substack{k_1, \ldots, k_d \\ \sum_{j=1}^{d} k_j = k}} \frac{k!}{k_1! \cdots k_d!} \left( \frac{1}{d} \right)^{k} \prod_{m=1}^{d} z_m \sum_{j=k_m}^{\infty} \frac{(-\ln z_m)^j}{j!}.

Representations of z \mapsto K_k^{uf}(0, (z_1 - 0.5, z_2 - 0.5)) for k = 1, 2, 5.
Rate of consistency of KeRF
Centred KeRF
Assume that m is Lipschitz. Then, provided 2^k/n \to 0 and k \to \infty,

\mathbb{E}\left[ \tilde{m}^{cc}_{\infty,n}(X) - m(X) \right]^2 \leq C_1 n^{-1/(3 + d \log 2)} (\log n)^2.

Uniform KeRF
Assume that m is Lipschitz. Then, provided 2^k/n \to 0 and k \to \infty,

\mathbb{E}\left[ \tilde{m}^{uf}_{\infty,n}(X) - m(X) \right]^2 \leq C n^{-1/(3 + 1.5\, d \log 2)} (\log n)^2.

Minimax rate for Lipschitz functions: n^{-1/(1 + 0.5 d)}.
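The exponents in these bounds can be compared numerically with the minimax exponent. A small sketch (rate exponents α such that the error behaves like n^{-α}, up to log factors):

```python
from math import log

def exponent_cc(d):
    """Centred KeRF exponent: 1 / (3 + d log 2)."""
    return 1.0 / (3.0 + d * log(2.0))

def exponent_uf(d):
    """Uniform KeRF exponent: 1 / (3 + 1.5 d log 2)."""
    return 1.0 / (3.0 + 1.5 * d * log(2.0))

def exponent_minimax(d):
    """Minimax exponent for Lipschitz functions: 1 / (1 + 0.5 d)."""
    return 1.0 / (1.0 + 0.5 * d)
```

For every d the ordering is minimax > centred KeRF > uniform KeRF, i.e. both KeRF bounds are slower than the minimax rate.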
Summary of KeRF
Pros
KeRF and random forests are close in terms of accuracy.
KeRF estimates are more amenable to analysis, since they are kernel estimates.
The weight function K_k is related to the shape of the partitions.
Cons
Computing the infinite kernel K_k is time consuming.
Breiman KeRF is difficult to analyze, since its kernel K_k depends on the data set.
1 Construction of random forests
2 Random forests and kernel methods
3 Consistency of Breiman forests
Tree consistency
For a tree whose construction is independent of the data, if
1 diam(A_n(X)) \to 0, in probability;
2 N_n(A_n(X)) \to \infty, in probability;
then the tree is consistent, that is,

\lim_{n \to \infty} \mathbb{E}\left| m_n(X) - m(X) \right|^2 = 0.
Consistency of centred random forest
Estimation error [Biau, 2012]
Under proper assumptions on the regression model,

\mathbb{E}\left[ m^{cc}_{\infty,n}(X) - \tilde{m}^{cc}_{\infty,n}(X) \right]^2 \leq C \sigma^2 \frac{2^{k_n}}{n\, k_n^{1/2}}.

Approximation error [Biau, 2012]
Under proper assumptions on the regression model,

\mathbb{E}\left[ \tilde{m}^{cc}_{\infty,n}(X) - m(X) \right]^2 \leq 2 d L^2\, 2^{-0.75\, k_n/(d \log 2)} + \|m\|_\infty^2\, e^{-n/2^{k_n}}.

If the forest is fully grown, that is, if k_n = \lfloor \log_2 n \rfloor, these bounds become

Estimation error: \leq C \sigma^2 (\log_2 n)^{-1/2};
Approximation error: \leq 2 d L^2\, n^{-0.75/(d \log 2)} + \|m\|_\infty^2 \times e^{-n/2^{k_n}},

where the last factor e^{-n/2^{k_n}} stays of constant order, so the approximation error no longer vanishes.
Algorithm for Breiman random forests
Randomness for Breiman random forests
Data sampling: bootstrap.
At each cell, randomly select mtry coordinates among {1, \ldots, d}.
Choose the split by minimizing the CART-split criterion in the cell along the mtry selected coordinates.
Stop when each cell contains exactly one point.
Algorithm for Breiman random forests
Randomness for Breiman random forests
Data sampling: subsampling, that is, choosing a_n points among n, with a_n < n.
At each cell, randomly select mtry coordinates among {1, \ldots, d}.
Choose the split by minimizing the CART-split criterion in the cell along the mtry selected coordinates.
Stop when the number of cells is exactly t_n.
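The modified sampling and stopping rules can be sketched as follows, with split selection abstracted into a callback (all names are illustrative, not the author's code):

```python
import random

def subsample(n, a_n, rng):
    """Choose a_n distinct indices among n (subsampling, not bootstrap)."""
    return rng.sample(range(n), a_n)

def grow_partition(indices, t_n, split_cell, rng):
    """Split cells, largest first, until the partition has exactly t_n
    cells (or no cell can be split further)."""
    cells = [list(indices)]
    while len(cells) < t_n:
        cell = max(cells, key=len)
        if len(cell) < 2:
            break  # every remaining cell holds a single point
        left, right = split_cell(cell, rng)
        cells.remove(cell)
        cells.extend([left, right])
    return cells
```

The stopping rule is the only change to the tree shape: the number of cells is fixed at t_n instead of growing one cell per point.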
Assumption (H1)
Additive regression model:

Y = \sum_{i=1}^{d} m_i(X^{(i)}) + \varepsilon,

where
X is uniformly distributed on [0, 1]^d,
\varepsilon \sim \mathcal{N}(0, \sigma^2) with \varepsilon independent of X,
each model component m_i is continuous.
Consistency
Theorem [S. et al., 2014]
Assume that (H1) is satisfied. Then, provided a_n \to \infty and t_n (\log a_n)^9 / a_n \to 0, random forests are consistent, i.e.,

\lim_{n \to \infty} \mathbb{E}\left[ m_{\infty,n}(X) - m(X) \right]^2 = 0.

Remarks
First consistency result for Breiman’s original forests.
Consistency of CART.
Sparsity and random forests
Assume that

Y = \sum_{i=1}^{S} m_i(X^{(i)}) + \varepsilon,

for some S < d.
Denote by j_{1,n}(X), \ldots, j_{k,n}(X) the first k cut directions used to construct the cell containing X.
Proposition [S. et al., 2014]
Let k \in \mathbb{N}^* and \xi > 0. Under appropriate assumptions, with probability 1 - \xi, for all n large enough, we have, for all 1 \leq q \leq k,

j_{q,n}(X) \in \{1, \ldots, S\}.
G. Biau. Analysis of a random forests model. Journal of Machine Learning Research, 13:1063–1095, 2012.
G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9:2015–2033, 2008.
L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.
S. Clémençon, M. Depecker, and N. Vayatis. Ranking forests. Journal of Machine Learning Research, 14(1):39–73, 2013.
R. Genuer. Variance reduction in purely random forests. Journal of Nonparametric Statistics, 24:543–562, 2012.
P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63:3–42, 2006.
H. Ishwaran and U. Kogalur. Consistency of random survival forests. Statistics & Probability Letters, 80:1056–1064, 2010.
H. Ishwaran, U. Kogalur, E. Blackstone, and M. Lauer. Random survival forests. The Annals of Applied Statistics, 2(3):841–860, 2008.
N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983–999, 2006.
L. Mentch and G. Hooker. Ensemble trees and CLTs: Statistical inference for supervised learning. arXiv:1404.6473, 2014.
E. Scornet, G. Biau, and J.-P. Vert. Consistency of random forests. arXiv:1405.2881, 2014.
S. Wager. Asymptotic theory for random forests. arXiv:1405.0352, 2014.
R. Zhu, D. Zeng, and M.R. Kosorok. Reinforcement learning trees. 2012.
Thank you for your attention!