Importance Sampling: An Alternative View of Ensemble Learning

Jerome H. Friedman
Bogdan Popescu
Stanford University

1

PREDICTIVE LEARNING

PREDICTIVE LEARNING

Given data: {z_i}_1^N = {y_i, x_i}_1^N ~ q(z)

y = "output" or "response" attribute (variable)
x = {x_1, ..., x_n} = "inputs" or "predictors"

and loss function L(y, F):
estimate F*(x) = argmin_{F(x)} E_{q(z)} L(y, F(x))

2

WHY?

F*(x) is the best predictor of y | x under L.

Examples:

Regression: y, F ∈ R
L(y, F) = |y − F|, (y − F)^2

Classification: y, F ∈ {c_1, ..., c_K}
L(y, F) = L_{y,F} (K × K matrix)
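The choice of loss determines what "best predictor" means. A minimal pure-Python sketch (my illustration, not from the slides): for a constant predictor, the squared-loss risk is minimized by the mean of y and the absolute-loss risk by the median.

```python
import random
import statistics

random.seed(0)
# Illustrative only: y drawn from a skewed distribution; the constant F
# minimizing average loss depends on the loss function L(y, F).
y = [random.expovariate(1.0) for _ in range(10001)]

def risk(F, loss):
    """Empirical analogue of E L(y, F) for a constant predictor F."""
    return sum(loss(yi, F) for yi in y) / len(y)

sq = lambda yi, F: (yi - F) ** 2   # squared loss -> mean is optimal
ab = lambda yi, F: abs(yi - F)     # absolute loss -> median is optimal

mean_y, med_y = statistics.mean(y), statistics.median(y)
assert risk(mean_y, sq) < risk(med_y, sq)
assert risk(med_y, ab) < risk(mean_y, ab)
```

For the skewed exponential sample the two optima differ visibly (mean ≈ 1, median ≈ 0.69), which is why the loss must be fixed before F*(x) is defined.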

3

F*(x) = "target" function (regression) / concept (classification)

Estimate: F̂(x) ← learning procedure({z_i}_1^N)

Here: procedure = "LEARNING ENSEMBLES"

4

BASIC LINEAR MODEL

F(x) = ∫_P a(p) f(x; p) dp

f(x; p) = "base" learner (basis function)
parameters: p = (p_1, p_2, ...)
p ∈ P indexes a particular function of x from {f(x; p)}_{p∈P}
a(p) = coefficient of f(x; p)

5

Examples:
f(x; p) = [1 + exp(−p^t x)]^{−1} (neural nets)
        = multivariate splines (MARS)
        = decision trees (MART, RF)

6

NUMERICAL QUADRATURE

∫_P I(p) dp ≈ ∑_{m=1}^M w_m I(p_m)

here: I(p) = a(p) f(x; p)

Quadrature rule defined by:
{p_m}_1^M = evaluation points ∈ P
{w_m}_1^M = weights
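For intuition, the template above can be instantiated with any classical rule. A sketch (my example, not the slides') using the composite trapezoid rule, where the evaluation points are an equispaced grid and the weights are h except at the endpoints:

```python
import math

# Quadrature rule {p_m, w_m}: approximate integral of I(p) over [0, pi]
# by the weighted sum sum_m w_m I(p_m) (composite trapezoid rule).
M = 1000
a, b = 0.0, math.pi
h = (b - a) / (M - 1)
pts = [a + h * m for m in range(M)]                      # evaluation points
wts = [h / 2 if m in (0, M - 1) else h for m in range(M)]  # weights

I = math.sin  # integrand I(p); the exact integral over [0, pi] is 2
approx = sum(w * I(p) for w, p in zip(wts, pts))
assert abs(approx - 2.0) < 1e-4
```

Ensemble learning will reuse exactly this structure, with f(x; p_m) in place of I(p_m) and fitted coefficients in place of fixed weights.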

7

F(x) ≈ ∑_{m=1}^M w_m a(p_m) f(x; p_m) ≈ ∑_{m=1}^M c_m f(x; p_m)

Averaging over x:
{c*_m}_1^M = linear regression of y on {f(x; p_m)}_1^M (pop.)

Problem: find good {p_m}_1^M.

8

MONTE CARLO METHODS

r(p) = sampling pdf of p ∈ P
{p_m ~ r(p)}_1^M

Simple Monte Carlo: r(p) = constant
Usually not very good

9

IMPORTANCE SAMPLING

Customize r(p) for each particular problem (F*(x)):
r(p_m) = big ⟹ p_m important to high accuracy
when used with {p_m'}_{m'≠m}
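The classical numerical-integration version of this idea, as a sketch (my toy integrand, not from the talk): sample p from a density r concentrated where the integrand is large, and reweight each draw by 1/r(p).

```python
import random

random.seed(1)
# Importance sampling: estimate integral of f over [0, 1] by sampling p
# from a density r(p) matched to f, weighting each draw by 1/r(p).
f = lambda p: p ** 4          # integrand; exact integral on [0, 1] is 1/5
M = 100_000

# Simple Monte Carlo: r(p) = 1 (uniform) -- ignores where f is large.
est_uniform = sum(f(random.random()) for _ in range(M)) / M

# Importance sampling with r(p) = 5 p^4, drawn via inverse CDF p = u^(1/5).
# This r is proportional to f, so the weighted draws have ~zero variance.
def weighted_draw():
    p = random.random() ** (1 / 5)
    return f(p) / (5 * p ** 4)
est_is = sum(weighted_draw() for _ in range(M)) / M

assert abs(est_is - 0.2) < 1e-9      # essentially exact
assert abs(est_uniform - 0.2) < 0.01  # correct, but noisier
```

The ensemble analogue replaces "where the integrand is large" with "which base learners matter for approximating F*(x)".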

10

MONTE CARLO METHODS

(1) "Random" Monte Carlo:
ignore other points: p_m ~ r(p) iid

(2) "Quasi" Monte Carlo:
{p_m}_1^M = deterministic
account for other points
importance → groups of points

11

RANDOM MONTE CARLO

(Lack of) importance J(p) depends only on p.

One measure: "partial importance"
J(p) = E_{q(z)} L(y, f(x; p))

p* = argmin_p J(p) = best single-point (M = 1) rule

f(x; p*) = optimal single base learner

12

Usually not very good, especially if F*(x) ∉ {f(x; p)}_{p∈P}

BUT, often used: single logistic regression or tree

Note: J(p_m) ignores {p_m'}_{m'≠m}

Hope: better than r(p) = constant.

13

PARTIAL IMPORTANCE SAMPLING

r(p) = g(J(p))
g(·) = monotone decreasing function
r(p*) = max ⇒ p* = center (location)
p ≠ p* ⟹ r(p) < r(p*)
d(p, p*) = J(p) − J(p*)

14

Besides location, the critical parameter for importance sampling is the scale (width) of r(p):

σ = ∫_P d(p, p*) r(p) dp

Controlled by choice of g(·):
σ too large → r(p) ≈ constant
σ too small → best single-point rule p*

15


16

Questions:
(1) how to choose g(·) ~ σ
(2) how to sample from r(p) = g(J(p))

17

TRICK

Perturbation sampling ⇒ repeatedly:
(1) randomly modify (perturb) the problem
(2) find optimal f(x; p_m) for the perturbed problem:

p_m = R_m{argmin_p E_{q(z)} L(y, f(x; p))}

Control width σ of r(p) by degree of perturbation.
Perturb: L(y, F), q(z), algorithm, hybrid.

18

EXAMPLES

Perturb loss function:
L_m(y, f) = L(y, f) + η · l_m(y, f)
l_m(y, f) = random function
or L_m(y, f) = L(y, f + η · h_m(x))
h_m(x) = random function of x

p_m = argmin_p E_{q(z)} L_m(y, f(x; p))
Width σ of r(p) ~ value of η

19

Perturb sampling distribution:
q_m(z) = [w_m(z)]^η q(z)
w_m(z) = random function of z

p_m = argmin_p E_{q_m(z)} L(y, f(x; p))
Width σ of r(p) ~ value of η

20

Perturb algorithm:
p_m = rand[argmin_p] E_{q(z)} L(y, f(x; p))
control width σ of r(p) by degree:
repeated partial optimizations,
perturbed partial solutions

Examples:
Dietterich – random trees
Breiman – random forests

21

GOAL

Produce a good {p_m}_1^M so that

∑_{m=1}^M c*_m f(x; p_m) ≈ F*(x)

where {c*_m}_1^M = pop. linear regression (L) of y on {f(x; p_m)}_1^M

Note: both depend on knowing the population q(z).

22

FINITE DATA

{z_i}_1^N ~ q(z)

q̂(z) = ∑_{i=1}^N (1/N) δ(z − z_i)

Apply perturbation sampling based on q̂(z):
Loss function / algorithm:
q(z) → q̂(z)
width σ of r(p) controlled as before

23

Sampling distribution: random reweighting
q_m(z) = ∑_{i=1}^N w_im δ(z − z_i)
w_im ~ Pr(w) : E w_im = 1/N
width σ of r(p) controlled by std(w_im)

Fastest computation: w_im ∈ {0, 1/K}
⇒ draw K from N without replacement
σ ~ std(w) = (N/K − 1)^{1/2}/N
computation ~ K/N
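A sketch of how subsample size controls width (illustrative, pure Python, not the paper's code): take the base learner to be a constant c, whose optimal value under squared loss is the mean of the perturbed sample. Drawing K of N without replacement, smaller K gives larger std(w) and hence a wider spread of the sampled parameters.

```python
import random
import statistics

random.seed(2)
# Base learner = constant c; its optimal value under squared loss is the
# mean of the (perturbed) data. Subsampling K of N without replacement:
# smaller K -> larger std(w) -> wider spread sigma of the sampled p_m.
y = [random.gauss(0.0, 1.0) for _ in range(2000)]

def sampled_params(K, M=300):
    """p_m = argmin of the perturbed risk = mean of a K-subsample."""
    return [statistics.mean(random.sample(y, K)) for _ in range(M)]

spread_mild = statistics.stdev(sampled_params(K=1000))   # K/N = 1/2
spread_strong = statistics.stdev(sampled_params(K=50))   # K/N = 1/40
assert spread_strong > 2 * spread_mild
```

The computational saving is the other half of the argument: each fit touches only K of the N observations.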

24

Quadrature Coefficients

Population: linear regression of y on {f(x; p_m)}_1^M:

{c*_m}_1^M = argmin_{c_m} E_{q(z)} L(y, ∑_{m=1}^M c_m f(x; p_m))

25

Finite data: regularized linear regression

{ĉ_m}_1^M = argmin_{c_m} E_{q̂(z)} L(y, ∑_{m=1}^M c_m f(x; p_m)) + λ · ∑_{m=1}^M |c_m − c_m^(0)| (lasso)

Regularization ⇒ reduced variance
{c_m^(0)}_1^M = prior guess (usually = 0)
λ > 0 chosen by cross-validation
Fast algorithm: solutions for all λ (see HTF 2001, EHT 2002)
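To see what the lasso penalty does to the coefficients, a minimal sketch (standard textbook fact, made-up numbers): when the predictors are orthonormal and c^(0)_m = 0, the lasso solution is simply the unregularized least-squares coefficients soft-thresholded by λ, so small coefficients are set exactly to zero.

```python
# Orthonormal-design special case: lasso coefficient = soft-threshold of
# the least-squares coefficient (prior guess c^(0)_m = 0). Values made up.
def soft_threshold(c, lam):
    """argmin_b (b - c)^2 / 2 + lam*|b| -> shrink c toward 0 by lam."""
    if c > lam:
        return c - lam
    if c < -lam:
        return c + lam
    return 0.0

c_ols = [2.5, -0.25, 0.75, -1.75]    # unregularized coefficients
lam = 1.0                            # regularization strength lambda
c_lasso = [soft_threshold(c, lam) for c in c_ols]
assert c_lasso == [1.5, 0.0, 0.0, -0.75]   # small coefficients -> exactly 0
```

This selection effect is why the post-processing step can both shrink and drop ensemble members.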

26

Importance Sampled Learning Ensembles (ISLE)

Numerical integration:

F(x) = ∫_P a(p) f(x; p) dp ≈ ∑_{m=1}^M c_m f(x; p_m)

{p_m}_1^M ~ r(p) importance sampling
          ~ perturbation sampling on q̂(z)

{c_m}_1^M ← regularized linear regression of y on {f(x; p_m)}_1^M

27

BAGGING (Breiman 1996)

Perturb data distribution q̂(z):
q_m(z) = bootstrap sample = ∑_{i=1}^N w_im δ(z − z_i)
w_im ∈ {0, 1/N, 2/N, ..., 1} ~ multinomial(1/N)

p_m = argmin_p E_{q_m(z)} L(y, f(x; p))

F(x) = ∑_{m=1}^M (1/M) f(x; p_m) (average)
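The bagging recipe above, as a runnable sketch (toy data and helper names are mine, not the paper's): each q_m(z) is a bootstrap resample, the base learner is a 1-D regression stump fit by search over split points, and F(x) is the plain average, with no joint fitting of coefficients.

```python
import random
import statistics

random.seed(3)
# Bagging sketch: bootstrap resamples + averaged 1-D regression stumps.
N = 200
X = [random.uniform(0, 1) for _ in range(N)]
Y = [(1.0 if x > 0.5 else 0.0) + random.gauss(0, 0.3) for x in X]

def fit_stump(xs, ys):
    """Least-squares regression stump: split point + two level means."""
    best = None
    for s in sorted(set(xs))[::4]:     # thinned grid of candidate splits
        lo = [y for x, y in zip(xs, ys) if x <= s]
        hi = [y for x, y in zip(xs, ys) if x > s]
        if not lo or not hi:
            continue
        ml, mh = statistics.mean(lo), statistics.mean(hi)
        sse = sum((y - ml) ** 2 for y in lo) + sum((y - mh) ** 2 for y in hi)
        if best is None or sse < best[0]:
            best = (sse, s, ml, mh)
    _, s, ml, mh = best
    return lambda x: ml if x <= s else mh

M = 20
stumps = []
for _ in range(M):
    idx = [random.randrange(N) for _ in range(N)]               # bootstrap
    stumps.append(fit_stump([X[i] for i in idx], [Y[i] for i in idx]))

F = lambda x: sum(t(x) for t in stumps) / M                     # average
assert F(0.9) > F(0.1)   # ensemble recovers the step in the target
```

In ISLE terms: the bootstrap is the perturbation, each stump's split is p_m, and the fixed weights 1/M are the coefficients bagging never refits.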

28

Width σ of r(p):
E(std(w_im)) = (1 − 1/N)^{1/2}/N ≈ 1/N
Fixed ⇒ no control

No joint fitting of coefficients:
λ = ∞ & c_m^(0) = 1/M

Potential improvements:
Different σ (sampling strategy)
λ < ∞ ⇒ jointly fit coefficients to data

29

RANDOM FORESTS (Breiman 1998)

f(x; p) = T(x) = largest possible decision tree

Hybrid sampling strategy:
(1) q_m(z) = bootstrap sample (bagging)
(2) random algorithm modification:
select the variable for each split from among a randomly chosen subset
Breiman: n_s = floor(log2(n) + 1)
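The algorithm perturbation in step (2) is tiny to state in code; a sketch (mine, not Breiman's implementation), using n = 40 to match the Monte Carlo study later in the talk:

```python
import math
import random

random.seed(4)
# Random-forest-style algorithm perturbation: each split searches only
# n_s of the n input variables, drawn at random per split; Breiman's
# default is n_s = floor(log2(n) + 1).
n = 40
n_s = math.floor(math.log2(n) + 1)
assert n_s == 6

candidate_vars = random.sample(range(n), n_s)   # per-split random subset
assert len(set(candidate_vars)) == n_s
assert all(0 <= j < n for j in candidate_vars)
```

Shrinking n_s strengthens the perturbation, which is exactly the "σ(RF) > σ(Bag), ↑ as n_s ↓" claim on the next slide.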

30

F(x) = ∑_{m=1}^M (1/M) T_m(x) (average)

As an ISLE: σ(RF) > σ(Bag), (↑ as n_s ↓)

Potential improvements: same as bagging
Different σ (sampling strategy)
λ < ∞ ⇒ jointly fit coefficients to data
(more later)

31

SEQUENTIAL SAMPLING

Random Monte Carlo: {p_m ~ r(p)}_1^M iid

Quasi-Monte Carlo: {p_m}_1^M = deterministic

J({p_m}_1^M) = min_{α_m} E_{q(z)} L(y, ∑_{m=1}^M α_m f(x; p_m))

Joint regression of y on {f(x; p_m)}_1^M (pop.)

32

Approximation: sequential sampling (forward stagewise)

J_m(p | {p_l}_1^{m−1}) = min_α E_{q(z)} L(y, α f(x; p) + h_m(x))

h_m(x) = ∑_{l=1}^{m−1} α_l f(x; p_l), α_l = solution for p_l

p_m = argmin_p J_m(p | {p_l}_1^{m−1})

Repeatedly modifies the loss function:
similar to L_m(y, f) = L(y, f + η · h_m(x)), but here η = 1 & h_m(x) = deterministic

33

Connection to Boosting

AdaBoost (Freund & Schapire 1996):
L(y, f) = exp(−y · f), y ∈ {−1, 1}

F(x) = sign(∑_{m=1}^M α_m f(x; p_m))

{α_m}_1^M = sequential partial regression coefficients

Gradient Boosting (MART – Friedman 2001):
general y & L(y, f), α_m = shrunk (η ≪ 1)

F(x) = ∑_{m=1}^M α_m f(x; p_m)
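The forward-stagewise recipe, as a runnable sketch for squared loss (toy data and helpers are mine): each stage fits a 1-D stump to the current residuals y − h_m(x) and adds it with MART-style shrinkage η ≪ 1.

```python
import random
import statistics

random.seed(5)
# Forward-stagewise / gradient-boosting sketch under squared loss.
N = 200
X = [random.uniform(0, 1) for _ in range(N)]
Y = [x * x + random.gauss(0, 0.05) for x in X]

def fit_stump(xs, ys):
    """Least-squares stump over a small grid of split points."""
    best = None
    for s in [i / 20 for i in range(1, 20)]:
        lo = [y for x, y in zip(xs, ys) if x <= s]
        hi = [y for x, y in zip(xs, ys) if x > s]
        if not lo or not hi:
            continue
        ml, mh = statistics.mean(lo), statistics.mean(hi)
        sse = sum((y - ml) ** 2 for y in lo) + sum((y - mh) ** 2 for y in hi)
        if best is None or sse < best[0]:
            best = (sse, s, ml, mh)
    _, s, ml, mh = best
    return lambda x: ml if x <= s else mh

eta, M = 0.1, 100
pred = [0.0] * N
for _ in range(M):                    # h_m grows one shrunk stump at a time
    resid = [y - p for y, p in zip(Y, pred)]
    stump = fit_stump(X, resid)       # fit to residuals of current model
    pred = [p + eta * stump(x) for p, x in zip(pred, X)]

mse_start = statistics.mean(y * y for y in Y)                 # F = 0 baseline
mse_end = statistics.mean((y - p) ** 2 for y, p in zip(Y, pred))
assert mse_end < mse_start / 4
```

Each stump's parameters play the role of p_m here; unlike the iid perturbation schemes, every p_m depends on all the previously chosen points through the residuals.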

34

Potential improvements (ISLE):

(1) F(x) = ∑_{m=1}^M c_m f(x; p_m)
{p_m}_1^M ~ sequential sampling on q̂(z)
{c_m}_1^M ~ regularized linear regression

(2) and/or hybrid with random q_m(z) (speed)
(sample K from N without replacement)

35

ISLE Paradigm

Wide variety of ISLE methods:
(1) base learner f(x; p)
(2) loss criterion L(y, f)
(3) perturbation method
(4) degree of perturbation: σ of r(p)
(5) iid vs. sequential
(6) hybrids

Examine several options.

36

Monte Carlo Study

100 data sets: each N = 10000, n = 40

{y_il = F_l(x_i) + ε_il}_{i=1}^{10000}, l = 1, ..., 100

{F_l(x)}_1^100 = different (random) target functions

x_i ~ N(0, I_40), ε_il ~ N(0, Var_x(F_l(x))) ⇒ signal/noise = 1/1

37

Evaluation Criteria

Relative RMS error:
rmse(F̂_jl) = [1 − R²(F_l, F̂_jl)]^{1/2}

Comparative RMS error:
cmse = rmse(F̂_jl)/min_k{rmse(F̂_kl)}
(adjusts for problem difficulty)

j, k ∈ {respective methods}
10000 independent test observations
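A sketch of the relative error measure (my reading of the criterion: R² taken as the squared correlation between target and prediction, which makes the measure scale-free):

```python
import math
import statistics

# rmse = sqrt(1 - R^2), with R^2 = squared correlation of target and
# prediction over the test points (assumed interpretation).
def rel_rmse(f_true, f_hat):
    mt, mh = statistics.mean(f_true), statistics.mean(f_hat)
    cov = sum((a - mt) * (b - mh) for a, b in zip(f_true, f_hat))
    var_t = sum((a - mt) ** 2 for a in f_true)
    var_h = sum((b - mh) ** 2 for b in f_hat)
    r2 = cov * cov / (var_t * var_h)
    return math.sqrt(max(0.0, 1.0 - r2))

F_true = [0.0, 1.0, 2.0, 3.0, 4.0]
assert rel_rmse(F_true, F_true) < 1e-9                        # perfect fit
assert rel_rmse(F_true, [2 * v + 1 for v in F_true]) < 1e-6   # affine-invariant
assert rel_rmse(F_true, [4.0, 0.0, 3.0, 1.0, 2.0]) > 0.5      # scrambled fit
```

Dividing by the best method's rmse (the comparative error) then normalizes away how hard each random target F_l happens to be.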

38

Properties of Fl(x)

(1) 30 "noise" variables
(2) wide variety of functions (difficulty)
(3) emphasize lower-order interactions
(4) not in span of base learners

Decision Trees

39

[Figure: Bagging Relative RMS Error — relative RMS error (0.3–0.9) for Bag, Bag_P, Bag_.05, Bag_.05_P, Bag_6, Bag_6_P, Bag_6_.05_P]

40

[Figure: Bagging Comparative RMS Error — comparative RMS error (1.0–2.0) for Bag, Bag_P, Bag_.05, Bag_.05_P, Bag_6, Bag_6_P, Bag_6_.05_P]

41

[Figure: Random Forests Comparative RMS Error — comparative RMS error (1.0–2.5) for RF, RF_P, RF_.05, RF_.05_P, RF_6, RF_6_P, RF_6_.05_P]

42

[Figure: Bag/RF Comparative RMS Error — comparative RMS error (1.0–1.6) for Bag, RF, Bag_6_05_P, RF_6_05_P]

43

[Figure: Sequential Sampling Comparative RMS Error — comparative RMS error (1.0–1.3) for Mart, Mart_P, Mart_10_P, Mart_.01_10_P, Mart_.01_20_P]

44

[Figure: Seq/Bag/RF Comparative RMS Error — comparative RMS error (1.0–2.2) for Mart, Mart_P, Seq_.01_.2, Bag, Bag_6_.05_p, RF, RF_6_.05_p]

45

SUMMARY

Theory – unify:
(1) bagging, (2) random forests, (3) Bayesian model averaging, (4) boosting

Single paradigm ~ Monte Carlo integration:
(1)–(3): iid Monte Carlo, p ~ r(p)
(1), (2): perturbation sampling; (3): MCMC
(4): quasi-Monte Carlo, approximate sequential sampling

46

Practice:
{c_m}_1^M ← lasso linear regression

(1) improves accuracy of RF and bagging (faster)
(2) combined with aggressive subsampling & weaker base learners, improves speed:
bagging & RF > 10², MART ~ 5

allowing much bigger data sets.
Also, prediction is many times faster.

47

FUTURE DIRECTIONS

(1) More thorough understanding (theory) → specific recommendations

48

(2) Multiple learning ensembles (MISLEs)

F(x) = ∑_{k=1}^K ∫_{P_k} a_k(p_k) f_k(x, p_k) dp_k

{f_k(x, p_k)}_1^K = different (comp.) base learners

F̂(x) = ∑ ĉ_km f_k(x, p_k)
{ĉ_km} ← combined lasso regression

Example: f_1 = decision trees
         f_2 = {x_j}_1^n (no sampling)

49

SLIDES
http://www-stat.stanford.edu/~jhf/isletalk.pdf

REFERENCES

HTF (2001): Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, Springer.

EHT (2002): Efron, Hastie & Tibshirani, Least Angle Regression (Stanford preprint).

50