Page 1

Importance Sampling: An Alternative View of Ensemble Learning

Jerome H. Friedman and Bogdan Popescu, Stanford University

Page 2

PREDICTIVE LEARNING

Given data: {z_i}_1^N = {y_i, x_i}_1^N ~ q(z)

y = "output" or "response" attribute (variable)
x = {x_1, ..., x_n} = "inputs" or "predictors"

and loss function L(y, F):
estimate F*(x) = argmin_{F(x)} E_{q(z)} L(y, F(x))

Page 3

WHY?

F*(x) is the best predictor of y | x under L.

Examples:

Regression: y, F ∈ R
L(y, F) = |y − F|, (y − F)^2

Classification: y, F ∈ {c_1, ..., c_K}
L(y, F) = L_{y,F} (K × K matrix)

Page 4

F*(x) = "target" function (regression)
       = "concept" (classification)

Estimate: F(x) ← learning procedure({z_i}_1^N)

Here: procedure = "LEARNING ENSEMBLES"

Page 5

BASIC LINEAR MODEL

F(x) = ∫_P a(p) f(x; p) dp

f(x; p) = "base" learner (basis function)
parameters: p = (p_1, p_2, ...)
p ∈ P indexes a particular function of x from {f(x; p)}_{p∈P}
a(p) = coefficient of f(x; p)

Page 6

Examples:
f(x; p) = [1 + exp(−p^T x)]^{−1}   (neural nets)
        = multivariate splines      (MARS)
        = decision trees            (MART, RF)

Page 7

NUMERICAL QUADRATURE

∫_P I(p) dp ≈ sum_{m=1}^{M} w_m I(p_m)

here: I(p) = a(p) f(x; p)

Quadrature rule defined by:
{p_m}_1^M = evaluation points ∈ P
{w_m}_1^M = weights
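For intuition (not from the slides), a minimal Python sketch of such a rule: equally weighted midpoint evaluations approximating a one-dimensional integral; the integrand here is arbitrary.

```python
import numpy as np

# Toy illustration of a quadrature rule: evaluation points p_m and weights w_m
# chosen so that sum_m w_m I(p_m) approximates the integral of I over [0, 1].
# Midpoint rule with equal weights; any integrand I(p) could be used here.
I = lambda p: np.exp(-p**2)
M = 100
p = (np.arange(M) + 0.5) / M          # evaluation points in [0, 1]
w = np.full(M, 1.0 / M)               # weights
print(np.sum(w * I(p)))               # ~ 0.7468, the integral of exp(-p^2) on [0, 1]
```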

Page 8

F(x) ≈ sum_{m=1}^{M} w_m a(p_m) f(x; p_m)
     ≈ sum_{m=1}^{M} c_m f(x; p_m)

Averaging over x:
{c*_m}_1^M = linear regression of y on {f(x; p_m)}_1^M (pop.)

Problem: find good {p_m}_1^M.

Page 9

MONTE CARLO METHODS

r(p) = sampling pdf of p ∈ P

{p_m ~ r(p)}_1^M

Simple Monte Carlo: r(p) = constant
Usually not very good

Page 10

IMPORTANCE SAMPLING

Customize r(p) for each particular problem (F*(x))

r(p_m) = big ⇒ p_m important to high accuracy
when used with {p_m'}_{m' ≠ m}

Page 11

MONTE CARLO METHODS

(1) "Random" Monte Carlo:
ignore other points: p_m ~ r(p) iid

(2) "Quasi" Monte Carlo:
{p_m}_1^M = deterministic
account for other points
importance → groups of points

Page 12

RANDOM MONTE CARLO

(Lack of) importance J(p) depends only on p

One measure: "partial importance"
J(p) = E_{q(z)} L(y, f(x; p))

p* = argmin_p J(p) = best single point (M = 1) rule

f(x; p*) = optimal single base learner

Page 13

Usually not very good, especially if F*(x) ∉ {f(x; p)}_{p∈P}

BUT, often used: single logistic regression or tree

Note: J(p_m) ignores {p_m'}_{m' ≠ m}

Hope: better than r(p) = constant.

Page 14

PARTIAL IMPORTANCE SAMPLING

r(p) = g(J(p))

g(·) = monotone decreasing function
r(p*) = max ≈ center (location)
p ≠ p* ⇒ r(p) < r(p*)
d(p, p*) = J(p) − J(p*)

Page 15

Besides location, critical parameter for importance sampling:
Scale (width) of r(p):

σ = ∫_P d(p, p*) r(p) dp

Controlled by choice of g(·):
σ too large → r(p) = constant
σ too small → best single point rule p*

Page 16

[Figure]

Page 17

Questions:
(1) how to choose g(·) ~ σ
(2) sample from r(p) = g(J(p))

Page 18

TRICK

Perturbation sampling ⇒ repeatedly:
(1) randomly modify (perturb) the problem
(2) find the optimal f(x; p_m) for the perturbed problem:
    p_m = R_m { argmin_p E_{q(z)} L(y, f(x; p)) }

Control width σ of r(p) by the degree of perturbation.
Perturb: L(y, F), q(z), algorithm, hybrid.
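A schematic Python sketch of this loop; perturb_problem and fit_base_learner are hypothetical helpers, so this only fixes the structure of the recipe, not any particular perturbation.

```python
# Minimal sketch of perturbation sampling (illustrative, not the authors' code).
# Each iteration perturbs the problem, then fits the optimal base learner for it.
def perturbation_sampling(data, M, perturb_problem, fit_base_learner):
    ensemble = []
    for m in range(M):
        perturbed = perturb_problem(data)    # e.g. reweight cases, jitter the loss
        f_m = fit_base_learner(perturbed)    # argmin over p for the perturbed problem
        ensemble.append(f_m)
    return ensemble
```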

Page 19

EXAMPLES

Perturb loss function:
L_m(y, f) = L(y, f) + η · l_m(y, f)
l_m(y, f) = random function

or  L_m(y, f) = L(y, f + η · h_m(x))
h_m(x) = random function of x

p_m = argmin_p E_{q(z)} L_m(y, f(x; p))
Width σ of r(p) ~ value of η

Page 20

Perturb sampling distribution:
q_m(z) = [w_m(z)]^η · q(z)
w_m(z) = random function of z

p_m = argmin_p E_{q_m(z)} L(y, f(x; p))
Width σ of r(p) ~ value of η

Page 21

Perturb algorithm:
p_m = rand[argmin_p] E_{q(z)} L(y, f(x; p))

Control width σ of r(p) by degree:
repeated partial optimizations,
perturb partial solutions

Examples:
Dietterich - random trees
Breiman - random forests

Page 22

GOAL

Produce a good {p_m}_1^M so that

sum_{m=1}^{M} c*_m f(x; p_m) ≈ F*(x)

where {c*_m}_1^M = pop. linear regression (L) of y on {f(x; p_m)}_1^M

Note: both depend on knowing population q(z).

Page 23

FINITE DATA

{z_i}_1^N ~ q(z)

q̂(z) = sum_{i=1}^{N} (1/N) δ(z − z_i)

Apply perturbation sampling based on q̂(z):
Loss function / algorithm: q(z) → q̂(z)
width σ of r(p) controlled as before

Page 24

Sampling distribution: random reweighting

q_m(z) = sum_{i=1}^{N} w_{im} δ(z − z_i)

w_{im} ~ Pr(w) :  E w_{im} = 1/N
width σ of r(p) controlled by std(w_{im})

Fastest computation: w_{im} ∈ {0, 1/K}
⇒ draw K from N without replacement
σ ~ std(w) = (N/K − 1)^{1/2} / N
computation ~ K/N
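A small Python sketch (mine, not the talk's) of this K-of-N reweighting, checking the stated std(w) formula:

```python
import numpy as np

# Weights are 1/K for the K sampled cases and 0 otherwise, so E[w_i] = 1/N and
# std(w_i) = sqrt(N/K - 1) / N, which is what controls the width sigma of r(p).
def subsample_weights(N, K, rng):
    w = np.zeros(N)
    idx = rng.choice(N, size=K, replace=False)   # K cases drawn without replacement
    w[idx] = 1.0 / K
    return w

N, K = 10000, 500
w = subsample_weights(N, K, np.random.default_rng(0))
print(w.mean(), 1 / N)                           # empirical vs. theoretical mean
print(w.std(), np.sqrt(N / K - 1) / N)           # empirical vs. theoretical std
```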

Page 25

Quadrature Coefficients

Population: linear regression of y on {f(x; p_m)}_1^M :

{c*_m}_1^M = argmin_{c_m} E_{q(z)} L( y, sum_{m=1}^{M} c_m f(x; p_m) )

Page 26

Finite data: regularized linear regression

{ĉ_m}_1^M = argmin_{c_m} E_{q̂(z)} L( y, sum_{m=1}^{M} c_m f(x; p_m) ) + λ · sum_{m=1}^{M} |c_m − c_m^{(0)}|   (lasso)

Regularization ⇒ reduce variance
{c_m^{(0)}}_1^M = prior guess (usually = 0)
λ > 0 chosen by cross-validation
Fast algorithm: solutions for all λ (see HTF 2001, EHT 2002)
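A rough Python illustration of this post-processing step, using scikit-learn's LassoCV on a matrix of base-learner predictions; the toy data, subsampling scheme, and tree settings are assumptions, not the authors' setup.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LassoCV

# Sketch: build M base learners on perturbed (subsampled) data, then fit the
# coefficients {c_m} jointly with an L1 penalty, lambda chosen by cross-validation.
rng = np.random.default_rng(0)
N, n, M, K = 2000, 10, 50, 200
X = rng.normal(size=(N, n))
y = X[:, 0] * X[:, 1] + rng.normal(size=N)       # toy target, not from the talk

learners = []
for m in range(M):
    idx = rng.choice(N, size=K, replace=False)   # draw K of N without replacement
    t = DecisionTreeRegressor(max_depth=4, random_state=m).fit(X[idx], y[idx])
    learners.append(t)

# Matrix of base-learner predictions, one column per f(x; p_m)
F = np.column_stack([t.predict(X) for t in learners])
post = LassoCV(cv=5).fit(F, y)                   # regularized linear regression of y on the ensemble
coef = post.coef_                                # fitted {c_m}; many are shrunk to zero
```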

Page 27

Importance Sampled Learning Ensembles (ISLE)

Numerical integration:

F(x) = ∫_P a(p) f(x; p) dp ≈ sum_{m=1}^{M} c_m f(x; p_m)

{p_m}_1^M ~ r(p)  importance sampling
          ~ perturbation sampling on q̂(z)

{c_m}_1^M ← regularized linear regression of y on {f(x; p_m)}_1^M

Page 28

BAGGING (Breiman 1996)

Perturb data distribution q̂(z):

q_m(z) = bootstrap sample = sum_{i=1}^{N} w_{im} δ(z − z_i)

w_{im} ∈ {0, 1/N, 2/N, ..., 1} ~ multinomial(1/N)

p_m = argmin_p E_{q_m(z)} L(y, f(x; p))

F(x) = sum_{m=1}^{M} (1/M) f(x; p_m)   (average)
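For concreteness, a minimal Python sketch of bagging in this notation (illustrative, not Breiman's implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Each q_m is a bootstrap sample, each p_m the tree fit to it, F the plain average.
def bagging_fit(X, y, M=100, rng=np.random.default_rng(0)):
    N = len(y)
    trees = []
    for m in range(M):
        idx = rng.integers(0, N, size=N)          # bootstrap: N draws with replacement
        trees.append(DecisionTreeRegressor(random_state=m).fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    return np.mean([t.predict(X) for t in trees], axis=0)   # c_m fixed at 1/M
```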

Page 29

Width σ of r(p):
E(std(w_{im})) = (1 − 1/N)^{1/2} / N ≈ 1/N
Fixed ⇒ no control

No joint fitting of coefficients:
λ = ∞ and c_m^{(0)} = 1/M

Potential improvements:
Different σ (sampling strategy)
λ < ∞ ⇒ jointly fit coefficients to data

Page 30

RANDOM FORESTS (Breiman 1998)

f(x; p) = T(x) = largest possible decision tree

Hybrid sampling strategy:
(1) q_m(z) = bootstrap sample (bagging)
(2) random algorithm modification:
    select the variable for each split from among a randomly chosen subset
    Breiman: n_s = floor(log2 n + 1)
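The same recipe is available off the shelf; a hedged scikit-learn sketch with max_features set to this slide's n_s (other settings are library defaults):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Random forest = bootstrap sampling + random split-variable subsets.
# max_features mimics Breiman's n_s = floor(log2(n) + 1), here for n = 40 inputs.
n = 40
n_s = int(np.floor(np.log2(n) + 1))              # = 6 for n = 40
rf = RandomForestRegressor(n_estimators=500, max_features=n_s, bootstrap=True)
# rf.fit(X, y); rf.predict(X_new)                # averages the trees with weights 1/M
```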

Page 31

F(x) = sum_{m=1}^{M} (1/M) T_m(x)   (average)

As an ISLE: σ(RF) > σ(Bag), (↑ as n_s ↓)

Potential improvements: same as bagging
Different σ (sampling strategy)
λ < ∞ ⇒ jointly fit coefficients to data
(more later)

Page 32

SEQUENTIAL SAMPLING

Random Monte Carlo: {p_m ~ r(p)}_1^M iid

Quasi-Monte Carlo: {p_m}_1^M = deterministic

J({p_m}_1^M) = min_{α_m} E_{q(z)} L( y, sum_{m=1}^{M} α_m f(x; p_m) )

Joint regression of y on {f(x; p_m)}_1^M (pop.)

Page 33

Approximation: sequential sampling (forward stagewise)

J_m(p | {p_l}_1^{m−1}) = min_α E_{q(z)} L(y, α f(x; p) + h_m(x))

h_m(x) = sum_{l=1}^{m−1} α_l f(x; p_l),   α_l = solution for p_l

p_m = argmin_p J_m(p | {p_l}_1^{m−1})

Repeatedly modify the loss function:
similar to L_m(y, f) = L(y, f + η · h_m(x)),
but here η = 1 and h_m(x) = deterministic
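A minimal Python sketch of this forward stagewise loop for squared-error loss (my simplification; the talk's procedure is stated for general L):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# At each step fit a base learner to the current residuals y - h_m(x).
# Illustrative only; MART additionally shrinks each step by a factor eta << 1.
def stagewise_fit(X, y, M=100, eta=1.0, max_depth=4):
    h = np.zeros_like(y, dtype=float)            # h_m(x), the ensemble so far
    learners = []
    for m in range(M):
        t = DecisionTreeRegressor(max_depth=max_depth).fit(X, y - h)
        h += eta * t.predict(X)                  # alpha_m absorbed by the tree fit, scaled by eta
        learners.append(t)
    return learners
```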

Page 34

Connection to Boosting

AdaBoost (Freund & Schapire 1996):
L(y, f) = exp(−y · f),  y ∈ {−1, 1}

F(x) = sign( sum_{m=1}^{M} α_m f(x; p_m) )

{α_m}_1^M = sequential partial regression coefficients

Gradient Boosting (MART, Friedman 2001):
general y and L(y, f),  α_m = shrunk (η << 1)

F(x) = sum_{m=1}^{M} α_m f(x; p_m)

Page 35

Potential improvements (ISLE):

(1) F(x) = sum_{m=1}^{M} c_m f(x; p_m)
    {p_m}_1^M ~ sequential sampling on q̂(z)
    {c_m}_1^M ← regularized linear regression

(2) and/or hybrid with random q_m(z) (speed)
    (sample K from N without replacement)

Page 36

ISLE Paradigm

Wide variety of ISLE methods:
(1) base learner f(x; p)
(2) loss criterion L(y, f)
(3) perturbation method
(4) degree of perturbation: σ of r(p)
(5) iid vs. sequential
(6) hybrids

Examine several options.

Page 37

Monte Carlo Study

100 data sets: each N = 10000, n = 40

{y_il = F_l(x_i) + ε_il}_{i=1}^{10000},  l = 1, ..., 100

{F_l(x)}_1^{100} = different (random) target functions

x_i ~ N(0, I_40),  ε_il ~ N(0, Var_x(F_l(x)))  ⇒  signal/noise = 1/1
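A sketch of generating one such data set in Python; the particular target function below is a stand-in, since the study drew 100 random targets:

```python
import numpy as np

# One simulated data set along these lines (the target here is hypothetical).
rng = np.random.default_rng(0)
N, n = 10000, 40
X = rng.normal(size=(N, n))                           # x_i ~ N(0, I_40)
F_l = lambda X: X[:, 0] * X[:, 1] + np.sin(X[:, 2])   # stand-in target F_l(x)
signal = F_l(X)
noise_sd = signal.std()                               # Var(eps) = Var_x(F_l(x)) => signal/noise = 1/1
y = signal + rng.normal(scale=noise_sd, size=N)
```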

Page 38

Evaluation Criteria

Relative RMS error:
rmse(F_jl) = [1 − R²(F_l, F_jl)]^{1/2},  F_jl = method j's estimate of F_l

Comparative RMS error:
cmse = rmse(F_jl) / min_k { rmse(F_kl) }
(adjusts for problem difficulty)

j, k ∈ {respective methods}
10000 independent observations
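A small Python helper expressing these two criteria, assuming R² is the squared correlation between target and estimate (the study's exact definition may differ):

```python
import numpy as np

# Relative RMS error via 1 - R^2, and comparative RMS error across methods.
def relative_rmse(F_true, F_hat):
    r2 = np.corrcoef(F_true, F_hat)[0, 1] ** 2
    return np.sqrt(1.0 - r2)

def comparative_rmse(F_true, F_hats_by_method):
    rmses = {j: relative_rmse(F_true, F_hat) for j, F_hat in F_hats_by_method.items()}
    best = min(rmses.values())                 # the best method on this data set
    return {j: r / best for j, r in rmses.items()}
```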

Page 39

Properties of Fl(x)

(1) 30 "noise" variables
(2) wide variety of functions (difficulty)
(3) emphasize lower order interactions
(4) not in span of base learners

Decision Trees

Page 40

[Figure: Bagging Relative RMS Error, comparing Bag, Bag_P, Bag_.05, Bag_.05_P, Bag_6, Bag_6_P, Bag_6_.05_P; y-axis: relative RMS error]

Page 41

[Figure: Bagging Comparative RMS Error, comparing Bag, Bag_P, Bag_.05, Bag_.05_P, Bag_6, Bag_6_P, Bag_6_.05_P; y-axis: comparative RMS error]

Page 42

[Figure: Random Forests Comparative RMS Error, comparing RF, RF_P, RF_.05, RF_.05_P, RF_6, RF_6_P, RF_6_.05_P; y-axis: comparative RMS error]

Page 43

[Figure: Bag/RF Comparative RMS Error, comparing Bag, RF, Bag_6_.05_P, RF_6_.05_P; y-axis: comparative RMS error]

Page 44

[Figure: Sequential Sampling Comparative RMS Error, comparing Mart, Mart_P, Mart_10_P, Mart_.01_10_P, Mart_.01_20_P; y-axis: comparative RMS error]

Page 45

[Figure: Seq/Bag/RF Comparative RMS Error, comparing Mart, Mart_P, Seq_.01_.2, Bag, Bag_6_.05_P, RF, RF_6_.05_P; y-axis: comparative RMS error]

Page 46

SUMMARY

Theory: unify
(1) Bagging, (2) random forests, (3) Bayesian model averaging, (4) boosting
within a single paradigm ~ Monte Carlo integration

(1)-(3): iid Monte Carlo, p ~ r(p)
(1), (2): perturbation sampling; (3): MCMC
(4): quasi-Monte Carlo: approximate sequential sampling

Page 47

Practice:

{w_m}_1^M ← lasso linear regression

(1) improves accuracy of RF & bagging (faster)
(2) combined with aggressive subsampling & weaker base learners, improves speed:
    bagging & RF > 10^2, MART ~ 5
    allowing much bigger data sets.

Also, prediction is many times faster.

Page 48

FUTURE DIRECTIONS

(1) More thorough understanding (theory) → specific recommendations

Page 49

(2) Multiple learning ensembles (MISLES)

F(x) = sum_{k=1}^{K} ∫_{P_k} a_k(p_k) f_k(x; p_k) dp_k

{f_k(x; p_k)}_1^K = different (comp.) base learners

F(x) = sum_{k,m} c_km f_k(x; p_km)
{c_km} ← combined lasso regression

Example: f_1 = decision trees
         f_2 = {x_j}_1^n (no sampling)
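A rough Python sketch of this example: trees built on subsamples plus the raw inputs as a second base-learner family, combined by one lasso fit (all settings below are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LassoCV

# f_1 = decision trees on subsamples; f_2 = the raw inputs x_1..x_n (no sampling);
# one combined lasso regression fits all coefficients {c_km}. Illustrative only.
def misle_fit(X, y, M=50, K=200, rng=np.random.default_rng(0)):
    N = len(y)
    trees = []
    for m in range(M):
        idx = rng.choice(N, size=K, replace=False)
        trees.append(DecisionTreeRegressor(max_depth=4, random_state=m).fit(X[idx], y[idx]))
    tree_cols = np.column_stack([t.predict(X) for t in trees])   # f_1 family
    basis = np.hstack([tree_cols, X])                            # append f_2 = raw x_j columns
    lasso = LassoCV(cv=5).fit(basis, y)                          # combined lasso regression
    return trees, lasso
```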

Page 50

SLIDES
http://www-stat.stanford.edu/~jhf/isletalk.pdf

REFERENCES

HTF (2001): Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, Springer.

EHT (2002): Efron, Hastie, and Tibshirani, Least Angle Regression (Stanford preprint).
