Importance Sampling: An Alternative View of Ensemble Learning

Jerome H. Friedman
Bogdan Popescu
Stanford University

1

PREDICTIVE LEARNING

PREDICTIVE LEARNING

Given data: {z_i}_1^N = {y_i, x_i}_1^N ~ q(z)

y = "output" or "response" attribute (variable)
x = {x_1, ..., x_n} = "inputs" or "predictors"

and loss function L(y, F):
estimate F*(x) = argmin_{F(x)} E_{q(z)} L(y, F(x))

2

WHY?

F*(x) is the best predictor of y | x under L.

Examples:

Regression: y, F ∈ R
L(y, F) = |y − F|, (y − F)^2

Classification: y, F ∈ {c_1, ..., c_K}
L(y, F) = L_{y,F} (K × K matrix)
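The choice of loss determines what "best predictor" means. A minimal pure-Python sketch (my illustration, not from the slides): for a constant predictor, the squared-loss risk is minimized by the mean of y and the absolute-loss risk by the median.

```python
import random
import statistics

random.seed(0)
# Illustrative only: y drawn from a skewed distribution; the constant F
# minimizing average loss depends on the loss function L(y, F).
y = [random.expovariate(1.0) for _ in range(10001)]

def risk(F, loss):
    """Empirical analogue of E L(y, F) for a constant predictor F."""
    return sum(loss(yi, F) for yi in y) / len(y)

sq = lambda yi, F: (yi - F) ** 2   # squared loss -> mean is optimal
ab = lambda yi, F: abs(yi - F)     # absolute loss -> median is optimal

mean_y, med_y = statistics.mean(y), statistics.median(y)
assert risk(mean_y, sq) < risk(med_y, sq)
assert risk(med_y, ab) < risk(mean_y, ab)
```

For the skewed exponential sample the two optima differ visibly (mean ≈ 1, median ≈ 0.69), which is why the loss must be fixed before F*(x) is defined.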

3

F*(x) = "target" function (regression) / concept (classification)

Estimate: F̂(x) ← learning procedure({z_i}_1^N)

Here: procedure = "LEARNING ENSEMBLES"

4

BASIC LINEAR MODEL

F(x) = ∫_P a(p) f(x; p) dp

f(x; p) = "base" learner (basis function)
parameters: p = (p_1, p_2, ...)
p ∈ P indexes a particular function of x from {f(x; p)}_{p∈P}
a(p) = coefficient of f(x; p)

5

Examples:
f(x; p) = [1 + exp(−p^t x)]^{−1} (neural nets)
        = multivariate splines (MARS)
        = decision trees (MART, RF)

6

NUMERICAL QUADRATURE

∫_P I(p) dp ≈ ∑_{m=1}^M w_m I(p_m)

here: I(p) = a(p) f(x; p)

Quadrature rule defined by:
{p_m}_1^M = evaluation points ∈ P
{w_m}_1^M = weights
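For intuition, the template above can be instantiated with any classical rule. A sketch (my example, not the slides') using the composite trapezoid rule, where the evaluation points are an equispaced grid and the weights are h except at the endpoints:

```python
import math

# Quadrature rule {p_m, w_m}: approximate integral of I(p) over [0, pi]
# by the weighted sum sum_m w_m I(p_m) (composite trapezoid rule).
M = 1000
a, b = 0.0, math.pi
h = (b - a) / (M - 1)
pts = [a + h * m for m in range(M)]                      # evaluation points
wts = [h / 2 if m in (0, M - 1) else h for m in range(M)]  # weights

I = math.sin  # integrand I(p); the exact integral over [0, pi] is 2
approx = sum(w * I(p) for w, p in zip(wts, pts))
assert abs(approx - 2.0) < 1e-4
```

Ensemble learning will reuse exactly this structure, with f(x; p_m) in place of I(p_m) and fitted coefficients in place of fixed weights.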

7

F(x) ≈ ∑_{m=1}^M w_m a(p_m) f(x; p_m) ≈ ∑_{m=1}^M c_m f(x; p_m)

Averaging over x:
{c*_m}_1^M = linear regression of y on {f(x; p_m)}_1^M (pop.)

Problem: find good {p_m}_1^M.

8

MONTE CARLO METHODS

r(p) = sampling pdf of p ∈ P
{p_m ~ r(p)}_1^M

Simple Monte Carlo: r(p) = constant
Usually not very good

9

IMPORTANCE SAMPLING

Customize r(p) for each particular problem (F*(x)):
r(p_m) = big ⟹ p_m important to high accuracy
when used with {p_m'}_{m'≠m}
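The classical numerical-integration version of this idea, as a sketch (my toy integrand, not from the talk): sample p from a density r concentrated where the integrand is large, and reweight each draw by 1/r(p).

```python
import random

random.seed(1)
# Importance sampling: estimate integral of f over [0, 1] by sampling p
# from a density r(p) matched to f, weighting each draw by 1/r(p).
f = lambda p: p ** 4          # integrand; exact integral on [0, 1] is 1/5
M = 100_000

# Simple Monte Carlo: r(p) = 1 (uniform) -- ignores where f is large.
est_uniform = sum(f(random.random()) for _ in range(M)) / M

# Importance sampling with r(p) = 5 p^4, drawn via inverse CDF p = u^(1/5).
# This r is proportional to f, so the weighted draws have ~zero variance.
def weighted_draw():
    p = random.random() ** (1 / 5)
    return f(p) / (5 * p ** 4)
est_is = sum(weighted_draw() for _ in range(M)) / M

assert abs(est_is - 0.2) < 1e-9      # essentially exact
assert abs(est_uniform - 0.2) < 0.01  # correct, but noisier
```

The ensemble analogue replaces "where the integrand is large" with "which base learners matter for approximating F*(x)".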

10

MONTE CARLO METHODS

(1) "Random" Monte Carlo:
ignore other points: p_m ~ r(p) iid

(2) "Quasi" Monte Carlo:
{p_m}_1^M = deterministic
account for other points
importance → groups of points

11

RANDOM MONTE CARLO

(Lack of) importance J(p) depends only on p.

One measure: "partial importance"
J(p) = E_{q(z)} L(y, f(x; p))

p* = argmin_p J(p) = best single-point (M = 1) rule

f(x; p*) = optimal single base learner

12

Usually not very good, especially if F*(x) ∉ {f(x; p)}_{p∈P}

BUT, often used: single logistic regression or tree

Note: J(p_m) ignores {p_m'}_{m'≠m}

Hope: better than r(p) = constant.

13

PARTIAL IMPORTANCE SAMPLING

r(p) = g(J(p))
g(·) = monotone decreasing function
r(p*) = max ⇒ p* = center (location)
p ≠ p* ⟹ r(p) < r(p*)
d(p, p*) = J(p) − J(p*)

14

Besides location, the critical parameter for importance sampling is the scale (width) of r(p):

σ = ∫_P d(p, p*) r(p) dp

Controlled by choice of g(·):
σ too large → r(p) ≈ constant
σ too small → best single-point rule p*

15


16

Questions:
(1) how to choose g(·) ~ σ
(2) how to sample from r(p) = g(J(p))

17

TRICK

Perturbation sampling ⇒ repeatedly:
(1) randomly modify (perturb) the problem
(2) find optimal f(x; p_m) for the perturbed problem:

p_m = R_m{argmin_p E_{q(z)} L(y, f(x; p))}

Control width σ of r(p) by degree of perturbation.
Perturb: L(y, F), q(z), algorithm, hybrid.

18

EXAMPLES

Perturb loss function:
L_m(y, f) = L(y, f) + η · l_m(y, f)
l_m(y, f) = random function
or L_m(y, f) = L(y, f + η · h_m(x))
h_m(x) = random function of x

p_m = argmin_p E_{q(z)} L_m(y, f(x; p))
Width σ of r(p) ~ value of η

19

Perturb sampling distribution:
q_m(z) = [w_m(z)]^η q(z)
w_m(z) = random function of z

p_m = argmin_p E_{q_m(z)} L(y, f(x; p))
Width σ of r(p) ~ value of η

20

Perturb algorithm:
p_m = rand[argmin_p] E_{q(z)} L(y, f(x; p))
control width σ of r(p) by degree:
repeated partial optimizations,
perturbed partial solutions

Examples:
Dietterich – random trees
Breiman – random forests

21

GOAL

Produce a good {p_m}_1^M so that

∑_{m=1}^M c*_m f(x; p_m) ≈ F*(x)

where {c*_m}_1^M = pop. linear regression (L) of y on {f(x; p_m)}_1^M

Note: both depend on knowing the population q(z).

22

FINITE DATA

{z_i}_1^N ~ q(z)

q̂(z) = ∑_{i=1}^N (1/N) δ(z − z_i)

Apply perturbation sampling based on q̂(z):
Loss function / algorithm:
q(z) → q̂(z)
width σ of r(p) controlled as before

23

Sampling distribution: random reweighting
q_m(z) = ∑_{i=1}^N w_im δ(z − z_i)
w_im ~ Pr(w) : E w_im = 1/N
width σ of r(p) controlled by std(w_im)

Fastest computation: w_im ∈ {0, 1/K}
⇒ draw K from N without replacement
σ ~ std(w) = (N/K − 1)^{1/2}/N
computation ~ K/N
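A sketch of how subsample size controls width (illustrative, pure Python, not the paper's code): take the base learner to be a constant c, whose optimal value under squared loss is the mean of the perturbed sample. Drawing K of N without replacement, smaller K gives larger std(w) and hence a wider spread of the sampled parameters.

```python
import random
import statistics

random.seed(2)
# Base learner = constant c; its optimal value under squared loss is the
# mean of the (perturbed) data. Subsampling K of N without replacement:
# smaller K -> larger std(w) -> wider spread sigma of the sampled p_m.
y = [random.gauss(0.0, 1.0) for _ in range(2000)]

def sampled_params(K, M=300):
    """p_m = argmin of the perturbed risk = mean of a K-subsample."""
    return [statistics.mean(random.sample(y, K)) for _ in range(M)]

spread_mild = statistics.stdev(sampled_params(K=1000))   # K/N = 1/2
spread_strong = statistics.stdev(sampled_params(K=50))   # K/N = 1/40
assert spread_strong > 2 * spread_mild
```

The computational saving is the other half of the argument: each fit touches only K of the N observations.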

24

Quadrature Coefficients

Population: linear regression of y on {f(x; p_m)}_1^M:

{c*_m}_1^M = argmin_{c_m} E_{q(z)} L(y, ∑_{m=1}^M c_m f(x; p_m))

25

Finite data: regularized linear regression

{ĉ_m}_1^M = argmin_{c_m} E_{q̂(z)} L(y, ∑_{m=1}^M c_m f(x; p_m)) + λ · ∑_{m=1}^M |c_m − c_m^(0)| (lasso)

Regularization ⇒ reduced variance
{c_m^(0)}_1^M = prior guess (usually = 0)
λ > 0 chosen by cross-validation
Fast algorithm: solutions for all λ (see HTF 2001, EHT 2002)
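To see what the lasso penalty does to the coefficients, a minimal sketch (standard textbook fact, made-up numbers): when the predictors are orthonormal and c^(0)_m = 0, the lasso solution is simply the unregularized least-squares coefficients soft-thresholded by λ, so small coefficients are set exactly to zero.

```python
# Orthonormal-design special case: lasso coefficient = soft-threshold of
# the least-squares coefficient (prior guess c^(0)_m = 0). Values made up.
def soft_threshold(c, lam):
    """argmin_b (b - c)^2 / 2 + lam*|b| -> shrink c toward 0 by lam."""
    if c > lam:
        return c - lam
    if c < -lam:
        return c + lam
    return 0.0

c_ols = [2.5, -0.25, 0.75, -1.75]    # unregularized coefficients
lam = 1.0                            # regularization strength lambda
c_lasso = [soft_threshold(c, lam) for c in c_ols]
assert c_lasso == [1.5, 0.0, 0.0, -0.75]   # small coefficients -> exactly 0
```

This selection effect is why the post-processing step can both shrink and drop ensemble members.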

26

Importance Sampled Learning Ensembles (ISLE)

Numerical integration:

F(x) = ∫_P a(p) f(x; p) dp ≈ ∑_{m=1}^M c_m f(x; p_m)

{p_m}_1^M ~ r(p) importance sampling
          ~ perturbation sampling on q̂(z)

{c_m}_1^M ← regularized linear regression of y on {f(x; p_m)}_1^M

27

BAGGING (Breiman 1996)

Perturb data distribution q̂(z):
q_m(z) = bootstrap sample = ∑_{i=1}^N w_im δ(z − z_i)
w_im ∈ {0, 1/N, 2/N, ..., 1} ~ multinomial(1/N)

p_m = argmin_p E_{q_m(z)} L(y, f(x; p))

F(x) = ∑_{m=1}^M (1/M) f(x; p_m) (average)
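The bagging recipe above, as a runnable sketch (toy data and helper names are mine, not the paper's): each q_m(z) is a bootstrap resample, the base learner is a 1-D regression stump fit by search over split points, and F(x) is the plain average, with no joint fitting of coefficients.

```python
import random
import statistics

random.seed(3)
# Bagging sketch: bootstrap resamples + averaged 1-D regression stumps.
N = 200
X = [random.uniform(0, 1) for _ in range(N)]
Y = [(1.0 if x > 0.5 else 0.0) + random.gauss(0, 0.3) for x in X]

def fit_stump(xs, ys):
    """Least-squares regression stump: split point + two level means."""
    best = None
    for s in sorted(set(xs))[::4]:     # thinned grid of candidate splits
        lo = [y for x, y in zip(xs, ys) if x <= s]
        hi = [y for x, y in zip(xs, ys) if x > s]
        if not lo or not hi:
            continue
        ml, mh = statistics.mean(lo), statistics.mean(hi)
        sse = sum((y - ml) ** 2 for y in lo) + sum((y - mh) ** 2 for y in hi)
        if best is None or sse < best[0]:
            best = (sse, s, ml, mh)
    _, s, ml, mh = best
    return lambda x: ml if x <= s else mh

M = 20
stumps = []
for _ in range(M):
    idx = [random.randrange(N) for _ in range(N)]               # bootstrap
    stumps.append(fit_stump([X[i] for i in idx], [Y[i] for i in idx]))

F = lambda x: sum(t(x) for t in stumps) / M                     # average
assert F(0.9) > F(0.1)   # ensemble recovers the step in the target
```

In ISLE terms: the bootstrap is the perturbation, each stump's split is p_m, and the fixed weights 1/M are the coefficients bagging never refits.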

28

Width σ of r(p):
E(std(w_im)) = (1 − 1/N)^{1/2}/N ≈ 1/N
Fixed ⇒ no control

No joint fitting of coefficients:
λ = ∞ & c_m^(0) = 1/M

Potential improvements:
Different σ (sampling strategy)
λ < ∞ ⇒ jointly fit coefficients to data

29

RANDOM FORESTS (Breiman 1998)

f(x; p) = T(x) = largest possible decision tree

Hybrid sampling strategy:
(1) q_m(z) = bootstrap sample (bagging)
(2) random algorithm modification:
select the variable for each split from among a randomly chosen subset
Breiman: n_s = floor(log2(n) + 1)
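The algorithm perturbation in step (2) is tiny to state in code; a sketch (mine, not Breiman's implementation), using n = 40 to match the Monte Carlo study later in the talk:

```python
import math
import random

random.seed(4)
# Random-forest-style algorithm perturbation: each split searches only
# n_s of the n input variables, drawn at random per split; Breiman's
# default is n_s = floor(log2(n) + 1).
n = 40
n_s = math.floor(math.log2(n) + 1)
assert n_s == 6

candidate_vars = random.sample(range(n), n_s)   # per-split random subset
assert len(set(candidate_vars)) == n_s
assert all(0 <= j < n for j in candidate_vars)
```

Shrinking n_s strengthens the perturbation, which is exactly the "σ(RF) > σ(Bag), ↑ as n_s ↓" claim on the next slide.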

30

F(x) = ∑_{m=1}^M (1/M) T_m(x) (average)

As an ISLE: σ(RF) > σ(Bag), (↑ as n_s ↓)

Potential improvements: same as bagging
Different σ (sampling strategy)
λ < ∞ ⇒ jointly fit coefficients to data
(more later)

31

SEQUENTIAL SAMPLING

Random Monte Carlo: {p_m ~ r(p)}_1^M iid

Quasi-Monte Carlo: {p_m}_1^M = deterministic

J({p_m}_1^M) = min_{α_m} E_{q(z)} L(y, ∑_{m=1}^M α_m f(x; p_m))

Joint regression of y on {f(x; p_m)}_1^M (pop.)

32

Approximation: sequential sampling (forward stagewise)

J_m(p | {p_l}_1^{m−1}) = min_α E_{q(z)} L(y, α f(x; p) + h_m(x))

h_m(x) = ∑_{l=1}^{m−1} α_l f(x; p_l), α_l = solution for p_l

p_m = argmin_p J_m(p | {p_l}_1^{m−1})

Repeatedly modifies the loss function:
similar to L_m(y, f) = L(y, f + η · h_m(x)), but here η = 1 & h_m(x) = deterministic

33

Connection to Boosting

AdaBoost (Freund & Schapire 1996):
L(y, f) = exp(−y · f), y ∈ {−1, 1}

F(x) = sign(∑_{m=1}^M α_m f(x; p_m))

{α_m}_1^M = sequential partial regression coefficients

Gradient Boosting (MART – Friedman 2001):
general y & L(y, f), α_m = shrunk (η ≪ 1)

F(x) = ∑_{m=1}^M α_m f(x; p_m)
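The forward-stagewise recipe, as a runnable sketch for squared loss (toy data and helpers are mine): each stage fits a 1-D stump to the current residuals y − h_m(x) and adds it with MART-style shrinkage η ≪ 1.

```python
import random
import statistics

random.seed(5)
# Forward-stagewise / gradient-boosting sketch under squared loss.
N = 200
X = [random.uniform(0, 1) for _ in range(N)]
Y = [x * x + random.gauss(0, 0.05) for x in X]

def fit_stump(xs, ys):
    """Least-squares stump over a small grid of split points."""
    best = None
    for s in [i / 20 for i in range(1, 20)]:
        lo = [y for x, y in zip(xs, ys) if x <= s]
        hi = [y for x, y in zip(xs, ys) if x > s]
        if not lo or not hi:
            continue
        ml, mh = statistics.mean(lo), statistics.mean(hi)
        sse = sum((y - ml) ** 2 for y in lo) + sum((y - mh) ** 2 for y in hi)
        if best is None or sse < best[0]:
            best = (sse, s, ml, mh)
    _, s, ml, mh = best
    return lambda x: ml if x <= s else mh

eta, M = 0.1, 100
pred = [0.0] * N
for _ in range(M):                    # h_m grows one shrunk stump at a time
    resid = [y - p for y, p in zip(Y, pred)]
    stump = fit_stump(X, resid)       # fit to residuals of current model
    pred = [p + eta * stump(x) for p, x in zip(pred, X)]

mse_start = statistics.mean(y * y for y in Y)                 # F = 0 baseline
mse_end = statistics.mean((y - p) ** 2 for y, p in zip(Y, pred))
assert mse_end < mse_start / 4
```

Each stump's parameters play the role of p_m here; unlike the iid perturbation schemes, every p_m depends on all the previously chosen points through the residuals.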

34

Potential improvements (ISLE):

(1) F(x) = ∑_{m=1}^M c_m f(x; p_m)
{p_m}_1^M ~ sequential sampling on q̂(z)
{c_m}_1^M ~ regularized linear regression

(2) and/or hybrid with random q_m(z) (speed)
(sample K from N without replacement)

35

ISLE Paradigm

Wide variety of ISLE methods:
(1) base learner f(x; p)
(2) loss criterion L(y, f)
(3) perturbation method
(4) degree of perturbation: σ of r(p)
(5) iid vs. sequential
(6) hybrids

Examine several options.

36

Monte Carlo Study

100 data sets: each N = 10000, n = 40

{y_il = F_l(x_i) + ε_il}_{i=1}^{10000}, l = 1, ..., 100

{F_l(x)}_1^100 = different (random) target functions

x_i ~ N(0, I_40), ε_il ~ N(0, Var_x(F_l(x))) ⇒ signal/noise = 1/1

37

Evaluation Criteria

Relative RMS error:
rmse(F̂_jl) = [1 − R²(F_l, F̂_jl)]^{1/2}

Comparative RMS error:
cmse = rmse(F̂_jl)/min_k{rmse(F̂_kl)}
(adjusts for problem difficulty)

j, k ∈ {respective methods}
10000 independent test observations
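A sketch of the relative error measure (my reading of the criterion: R² taken as the squared correlation between target and prediction, which makes the measure scale-free):

```python
import math
import statistics

# rmse = sqrt(1 - R^2), with R^2 = squared correlation of target and
# prediction over the test points (assumed interpretation).
def rel_rmse(f_true, f_hat):
    mt, mh = statistics.mean(f_true), statistics.mean(f_hat)
    cov = sum((a - mt) * (b - mh) for a, b in zip(f_true, f_hat))
    var_t = sum((a - mt) ** 2 for a in f_true)
    var_h = sum((b - mh) ** 2 for b in f_hat)
    r2 = cov * cov / (var_t * var_h)
    return math.sqrt(max(0.0, 1.0 - r2))

F_true = [0.0, 1.0, 2.0, 3.0, 4.0]
assert rel_rmse(F_true, F_true) < 1e-9                        # perfect fit
assert rel_rmse(F_true, [2 * v + 1 for v in F_true]) < 1e-6   # affine-invariant
assert rel_rmse(F_true, [4.0, 0.0, 3.0, 1.0, 2.0]) > 0.5      # scrambled fit
```

Dividing by the best method's rmse (the comparative error) then normalizes away how hard each random target F_l happens to be.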

38

Properties of Fl(x)

(1) 30 "noise" variables
(2) wide variety of functions (difficulty)
(3) emphasize lower-order interactions
(4) not in span of base learners

Decision Trees

39

[Figure: Bagging Relative RMS Error — relative RMS error (0.3–0.9) for Bag, Bag_P, Bag_.05, Bag_.05_P, Bag_6, Bag_6_P, Bag_6_.05_P]

40

[Figure: Bagging Comparative RMS Error — comparative RMS error (1.0–2.0) for Bag, Bag_P, Bag_.05, Bag_.05_P, Bag_6, Bag_6_P, Bag_6_.05_P]

41

[Figure: Random Forests Comparative RMS Error — comparative RMS error (1.0–2.5) for RF, RF_P, RF_.05, RF_.05_P, RF_6, RF_6_P, RF_6_.05_P]

42

[Figure: Bag/RF Comparative RMS Error — comparative RMS error (1.0–1.6) for Bag, RF, Bag_6_05_P, RF_6_05_P]

43

[Figure: Sequential Sampling Comparative RMS Error — comparative RMS error (1.0–1.3) for Mart, Mart_P, Mart_10_P, Mart_.01_10_P, Mart_.01_20_P]

44

[Figure: Seq/Bag/RF Comparative RMS Error — comparative RMS error (1.0–2.2) for Mart, Mart_P, Seq_.01_.2, Bag, Bag_6_.05_p, RF, RF_6_.05_p]

45

SUMMARY

Theory – unify:
(1) bagging, (2) random forests, (3) Bayesian model averaging, (4) boosting

Single paradigm ~ Monte Carlo integration:
(1)–(3): iid Monte Carlo, p ~ r(p)
(1), (2): perturbation sampling; (3): MCMC
(4): quasi-Monte Carlo, approximate sequential sampling

46

Practice:
{c_m}_1^M ← lasso linear regression

(1) improves accuracy of RF and bagging (faster)
(2) combined with aggressive subsampling & weaker base learners, improves speed:
bagging & RF > 10², MART ~ 5

allowing much bigger data sets.
Also, prediction is many times faster.

47

FUTURE DIRECTIONS

(1) More thorough understanding (theory) → specific recommendations

48

(2) Multiple learning ensembles (MISLEs)

F(x) = ∑_{k=1}^K ∫_{P_k} a_k(p_k) f_k(x, p_k) dp_k

{f_k(x, p_k)}_1^K = different (comp.) base learners

F̂(x) = ∑ ĉ_km f_k(x, p_k)
{ĉ_km} ← combined lasso regression

Example: f_1 = decision trees
         f_2 = {x_j}_1^n (no sampling)

49

SLIDES
http://www-stat.stanford.edu/~jhf/isletalk.pdf

REFERENCES

HTF (2001): Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, Springer.

EHT (2002): Efron, Hastie & Tibshirani, Least Angle Regression (Stanford preprint).

50