Importance Sampling: An Alternative View of Ensemble Learning
Jerome H. Friedman and Bogdan Popescu, Stanford University
1
PREDICTIVE LEARNING
Given data: {z_i}_1^N = {y_i, x_i}_1^N ~ q(z)
y = "output" or "response" attribute (variable)
x = {x_1, ..., x_n} = "inputs" or "predictors"
Given a loss function L(y, F), estimate F*(x) = argmin_{F(x)} E_{q(z)} L(y, F(x))
2
WHY?
F*(x) is the best predictor of y | x under L.
Examples:
Regression: y, F ∈ R; L(y, F) = |y − F| or (y − F)^2
Classification: y, F ∈ {c_1, ..., c_K}; L(y, F) = L_{y,F} (a K × K matrix)
3
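The regression and classification losses above can be sketched in a few lines (a minimal illustration; the function names and the 0-1 choice of loss matrix are mine, not from the slides):

```python
# Minimal sketches of the loss functions above (names are illustrative).

def absolute_loss(y, F):
    return abs(y - F)

def squared_loss(y, F):
    return (y - F) ** 2

# Classification: loss given by a K x K matrix L[y][F] over class indices.
# Here, 0-1 loss for K = 3 classes (an arbitrary example choice).
K = 3
loss_matrix = [[0 if j == k else 1 for k in range(K)] for j in range(K)]

def classification_loss(y, F):
    return loss_matrix[y][F]
```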
F*(x) = "target" function (regression) or concept (classification)
Estimate: F̂(x) ← learning procedure({z_i}_1^N)
Here: procedure = "LEARNING ENSEMBLES"
4
BASIC LINEAR MODEL
F(x) = ∫_P a(p) f(x; p) dp
f(x; p) = "base" learner (basis function)
parameters: p = (p_1, p_2, ...)
p ∈ P indexes a particular function of x from {f(x; p)}_{p∈P}
a(p) = coefficient of f(x; p)
5
Examples:
f(x; p) = [1 + exp(−p^t x)]^{−1} (neural nets)
f(x; p) = multivariate splines (MARS)
f(x; p) = decision trees (MART, RF)
6
NUMERICAL QUADRATURE
∫_P I(p) dp ≈ Σ_{m=1}^M w_m I(p_m)
here: I(p) = a(p) f(x; p)
Quadrature rule defined by:
{p_m}_1^M = evaluation points ∈ P
{w_m}_1^M = weights
7
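The quadrature rule ∫_P I(p) dp ≈ Σ w_m I(p_m) can be illustrated with a midpoint rule on [0, 1] (the evaluation points and equal weights here are just one possible rule, my own choice for the example):

```python
# Quadrature as a weighted sum of integrand evaluations, mirroring
# the slide's  ∫ I(p) dp ≈ Σ w_m I(p_m).  Midpoint rule on [0, 1].

def quadrature(I, M=1000):
    points = [(m + 0.5) / M for m in range(M)]   # evaluation points p_m
    weights = [1.0 / M] * M                      # equal weights w_m
    return sum(w * I(p) for w, p in zip(weights, points))

# Example: ∫_0^1 p^2 dp = 1/3
estimate = quadrature(lambda p: p * p)
```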
F(x) ≈ Σ_{m=1}^M w_m a(p_m) f(x; p_m) ≈ Σ_{m=1}^M c_m f(x; p_m)
Averaging over x:
{c*_m}_1^M = linear regression of y on {f(x; p_m)}_1^M (population)
Problem: find a good {p_m}_1^M.
8
MONTE CARLO METHODS
r(p) = sampling pdf of p ∈ P
{p_m ~ r(p)}_1^M
Simple Monte Carlo: r(p) = constant. Usually not very good.
9
IMPORTANCE SAMPLING
Customize r(p) for each particular problem (F*(x)):
r(p_m) large ⇒ p_m important for high accuracy when used with {p_m′}_{m′≠m}
10
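The idea being carried over to ensembles here is classical importance sampling for integration. A minimal 1-D illustration (the integrand and the proposal density are my own choices for the example): estimating ∫_0^1 3p² dp = 1 with samples from r(p) = 2p, which concentrates effort where the integrand is large, gives a lower-variance estimator than uniform sampling:

```python
# Classic 1-D importance sampling: sample p from a density r(p) matched
# to the integrand, and average the weighted integrand I(p)/r(p).
import math, random

rng = random.Random(0)
M = 100000

# plain Monte Carlo: p ~ Uniform(0, 1), average 3p^2
plain = sum(3 * rng.random() ** 2 for _ in range(M)) / M

# importance sampling: p ~ r(p) = 2p (inverse-CDF sampling: p = sqrt(u)),
# averaging 3p^2 / r(p) = 1.5p, whose variance under r is much smaller
imp = 0.0
for _ in range(M):
    p = math.sqrt(rng.random())
    imp += 3 * p ** 2 / (2 * p)
imp /= M
```

Both estimates converge to 1, but the importance-sampled one does so with a fraction of the variance; the ensemble analogue is choosing r(p) over base-learner parameters rather than over a 1-D domain.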
MONTE CARLO METHODS
(1) "Random" Monte Carlo: ignore the other points; p_m ~ r(p) iid
(2) "Quasi" Monte Carlo: {p_m}_1^M deterministic; account for the other points; importance → groups of points
11
RANDOM MONTE CARLO
(Lack of) importance J(p) depends only on p.
One measure: "partial importance" J(p) = E_{q(z)} L(y, f(x; p))
p* = argmin_p J(p) = best single-point (M = 1) rule
f(x; p*) = optimal single base learner
12
Usually not very good, especially if F*(x) ∉ {f(x; p)}_{p∈P}.
BUT often used: a single logistic regression or tree.
Note: J(p_m) ignores {p_m′}_{m′≠m}.
Hope: better than r(p) = constant.
13
PARTIAL IMPORTANCE SAMPLING
r(p) = g(J(p))
g(·) = monotone decreasing function
r(p*) = max ≈ center (location)
p ≠ p* ⇒ r(p) < r(p*)
d(p, p*) = J(p) − J(p*)
14
Besides location, the critical parameter for importance sampling is the scale (width) of r(p):
σ = ∫_P d(p, p*) r(p) dp
Controlled by the choice of g(·):
σ too large → r(p) ≈ constant
σ too small → best single-point rule p*
15
16
Questions:
(1) how to choose g(·) ~ σ
(2) how to sample from r(p) = g(J(p))
17
TRICK
Perturbation sampling ⇒ repeatedly:
(1) randomly modify (perturb) the problem
(2) find the optimal f(x; p_m) for the perturbed problem:
p_m = R_m{ argmin_p E_{q(z)} L(y, f(x; p)) }
Control the width σ of r(p) by the degree of perturbation.
Perturb: L(y, F), q(z), the algorithm, or a hybrid.
18
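A minimal sketch of perturbation sampling (my own toy setup, not from the slides): the base learner is a constant model f(x; p) = p, whose squared-loss optimum over a perturbed data distribution q_m(z) (here, a random subsample) is just the subsample mean. The spread of the resulting {p_m} then directly shows the width σ of r(p) growing with the degree of perturbation:

```python
# Perturbation sampling with a constant base learner f(x; p) = p:
# each p_m = argmin_p E_{q_m}(y - p)^2 over a random size-K subsample,
# i.e. the subsample mean.  Smaller K = stronger perturbation = wider r(p).
import math, random

def perturbed_optima(y, K, M, rng):
    return [sum(rng.sample(y, K)) / K for _ in range(M)]

def spread(p):
    mu = sum(p) / len(p)
    return math.sqrt(sum((v - mu) ** 2 for v in p) / len(p))

rng = random.Random(1)
y = [rng.gauss(0, 1) for _ in range(1000)]
wide = perturbed_optima(y, K=5, M=200, rng=rng)      # strong perturbation
narrow = perturbed_optima(y, K=500, M=200, rng=rng)  # mild perturbation
```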
EXAMPLES
Perturb the loss function:
L_m(y, f) = L(y, f) + η · l_m(y, f), l_m(y, f) = random function
or L_m(y, f) = L(y, f + η · h_m(x)), h_m(x) = random function of x
p_m = argmin_p E_{q(z)} L_m(y, f(x; p))
Width σ of r(p) ~ value of η
19
Perturb the sampling distribution:
q_m(z) = [w_m(z)]^η q(z), w_m(z) = random function of z
p_m = argmin_p E_{q_m(z)} L(y, f(x; p))
Width σ of r(p) ~ value of η
20
Perturb the algorithm:
p_m = rand[argmin_p] E_{q(z)} L(y, f(x; p))
Control the width σ of r(p) by the degree of the repeated partial optimizations / perturbed partial solutions.
Examples:
Dietterich: randomized trees
Breiman: random forests
21
GOAL
Produce a good {p_m}_1^M so that
Σ_{m=1}^M c*_m f(x; p_m) ≈ F*(x)
where {c*_m}_1^M = population linear regression (under L) of y on {f(x; p_m)}_1^M.
Note: both depend on knowing the population q(z).
22
FINITE DATA
{z_i}_1^N ~ q(z)
q̂(z) = (1/N) Σ_{i=1}^N δ(z − z_i)
Apply perturbation sampling based on q̂(z).
Loss function / algorithm: q(z) → q̂(z); width σ of r(p) controlled as before.
23
Sampling distribution: random reweighting
q_m(z) = Σ_{i=1}^N w_im δ(z − z_i)
w_im ~ Pr(w) with E w_im = 1/N
Width σ of r(p) controlled by std(w_im).
Fastest computation: w_im ∈ {0, 1/K} ⇒ draw K from N without replacement
σ ~ std(w) = (N/K − 1)^{1/2}/N
computation ~ K/N
24
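The "draw K from N without replacement" scheme and its width formula std(w) = (N/K − 1)^{1/2}/N can be checked directly (the helper names are mine):

```python
# Random reweighting via subsampling: draw K of N without replacement,
# giving weights w_i in {0, 1/K} with E[w_i] = 1/N.  The spread of the
# weights, std(w) = sqrt(N/K - 1)/N, controls the width σ of r(p).
import math, random

def subsample_weights(N, K, rng=random):
    chosen = set(rng.sample(range(N), K))
    return [1.0 / K if i in chosen else 0.0 for i in range(N)]

def weight_std(w):
    N = len(w)
    mean = sum(w) / N
    return math.sqrt(sum((x - mean) ** 2 for x in w) / N)

N, K = 1000, 50
w = subsample_weights(N, K)
predicted = math.sqrt(N / K - 1) / N   # closed form from the slide
```

The match is exact, not approximate: exactly K weights equal 1/K and the rest are 0, so the empirical standard deviation equals the formula for any draw.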
Quadrature Coefficients
Population: linear regression of y on {f(x; p_m)}_1^M:
{c*_m}_1^M = argmin_{{c_m}} E_{q(z)} L(y, Σ_{m=1}^M c_m f(x; p_m))
25
Finite data: regularized linear regression
{ĉ_m}_1^M = argmin_{{c_m}} E_{q̂(z)} L(y, Σ_{m=1}^M c_m f(x; p_m)) + λ · Σ_{m=1}^M |c_m − c_m^(0)| (lasso)
Regularization ⇒ reduced variance
{c_m^(0)}_1^M = prior guess (usually 0)
λ > 0 chosen by cross-validation
Fast algorithm: solutions for all λ (see HTF 2001, EHT 2002)
26
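A minimal coordinate-descent sketch of this regularized regression, for squared-error loss with prior guess c_m^(0) = 0 (this illustrates the criterion only; it is not the fast all-λ path algorithm the slide cites from EHT 2002):

```python
# Lasso post-processing of base-learner predictions by cyclic coordinate
# descent: minimize (1/N) Σ_i (y_i - Σ_m c_m f_m(x_i))^2 / 2 + λ Σ_m |c_m|.
def soft_threshold(z, t):
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso(X, y, lam, n_iter=200):
    # X: list of M columns (each length N) = base-learner outputs f(x; p_m)
    N, M = len(y), len(X)
    c = [0.0] * M
    for _ in range(n_iter):
        for m in range(M):
            # residual excluding column m's current contribution
            r = [y[i] - sum(c[k] * X[k][i] for k in range(M) if k != m)
                 for i in range(N)]
            rho = sum(X[m][i] * r[i] for i in range(N)) / N
            norm = sum(v * v for v in X[m]) / N
            c[m] = soft_threshold(rho, lam) / norm
    return c
```

On an orthogonal two-column toy design the solution is reached in one sweep, with the lasso shrinking the active coefficient by exactly λ.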
Importance Sampled Learning Ensembles (ISLE)
Numerical integration:
F(x) = ∫_P a(p) f(x; p) dp ≈ Σ_{m=1}^M ĉ_m f(x; p̂_m)
{p̂_m}_1^M ~ r(p): importance sampling ~ perturbation sampling on q̂(z)
{ĉ_m}_1^M ← regularized linear regression of y on {f(x; p̂_m)}_1^M
27
BAGGING (Breiman 1996)
Perturb the data distribution q(z):
q_m(z) = bootstrap sample = Σ_{i=1}^N w_im δ(z − z_i)
w_im ∈ {0, 1/N, 2/N, ..., 1} ~ multinomial(1/N)
p_m = argmin_p E_{q_m(z)} L(y, f(x; p))
F(x) = Σ_{m=1}^M (1/M) f(x; p_m) (simple average)
28
Width σ of r(p): E(std(w_im)) = (1 − 1/N)^{1/2}/N ≈ 1/N
Fixed ⇒ no control.
No joint fitting of coefficients: λ = ∞ and c_m^(0) = 1/M.
Potential improvements:
different σ (sampling strategy)
λ < ∞ ⇒ jointly fit the coefficients to the data
29
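Bagging as described above can be sketched end to end (my own toy implementation, not the slides' code: the base learner is a one-split regression stump on a single input, and F(x) is the plain 1/M average over bootstrap fits):

```python
# Bagging sketch: each q_m(z) is a bootstrap sample, p_m is fit by
# minimizing squared loss on it, and F(x) averages the M fits equally.
import random

def fit_stump(x, y):
    # best single split minimizing squared error; leaves predict means
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for cut in range(1, len(x)):
        left = [y[order[i]] for i in range(cut)]
        right = [y[order[i]] for i in range(cut, len(x))]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - ml) ** 2 for v in left)
               + sum((v - mr) ** 2 for v in right))
        if best is None or sse < best[0]:
            thr = (x[order[cut - 1]] + x[order[cut]]) / 2
            best = (sse, thr, ml, mr)
    _, thr, ml, mr = best
    return lambda q: ml if q <= thr else mr

def bag(x, y, M=25, rng=random):
    stumps = []
    for _ in range(M):
        idx = [rng.randrange(len(x)) for _ in range(len(x))]  # bootstrap
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return lambda q: sum(s(q) for s in stumps) / M
```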
RANDOM FORESTS (Breiman 1998)
f(x; p) = T(x) = largest possible decision tree
Hybrid sampling strategy:
(1) q_m(z) = bootstrap sample (bagging)
(2) random algorithm modification: select the variable for each split from a randomly chosen subset of size n_s
Breiman: n_s = floor(log2(n) + 1)
30
F(x) = Σ_{m=1}^M (1/M) T_m(x) (average)
As an ISLE: σ(RF) > σ(Bag), increasing as n_s decreases.
Potential improvements: same as bagging:
different σ (sampling strategy)
λ < ∞ ⇒ jointly fit the coefficients to the data (more later)
31
SEQUENTIAL SAMPLING
Random Monte Carlo: {p_m ~ r(p)}_1^M iid
Quasi-Monte Carlo: {p_m}_1^M deterministic
J({p_m}_1^M) = min_{{α_m}} E_{q(z)} L(y, Σ_{m=1}^M α_m f(x; p_m))
Joint regression of y on {f(x; p_m)}_1^M (population)
32
Approximation: sequential sampling (forward stagewise)
J_m(p | {p_l}_1^{m−1}) = min_α E_{q(z)} L(y, α f(x; p) + h_m(x))
h_m(x) = Σ_{l=1}^{m−1} α_l f(x; p_l), α_l = solution for p_l
p_m = argmin_p J_m(p | {p_l}_1^{m−1})
Repeatedly modifies the loss function:
similar to L_m(y, f) = L(y, f + η · h_m(x)), but here η = 1 and h_m(x) is deterministic.
33
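A sketch of sequential (forward-stagewise) sampling for squared-error loss (my own minimal setup: the base learners are single input columns, and a small step η < 1 is used, mimicking gradient boosting's shrinkage; η = 1 gives the scheme exactly as stated above):

```python
# Forward-stagewise sequential sampling for squared-error loss: at each
# step choose the base learner best fitting the current residual and
# step a fraction eta of the way toward its least-squares coefficient.
def forward_stagewise(X, y, eta=0.1, steps=50):
    # X: list of M columns; base learners f_m(x) = m-th input column
    N, M = len(y), len(X)
    coef = [0.0] * M
    r = list(y)                      # current residual y - h_m(x)
    losses = []
    for _ in range(steps):
        best, best_alpha, best_sse = 0, 0.0, float("inf")
        for m in range(M):
            norm = sum(v * v for v in X[m])
            alpha = sum(X[m][i] * r[i] for i in range(N)) / norm
            sse = sum((r[i] - alpha * X[m][i]) ** 2 for i in range(N))
            if sse < best_sse:
                best, best_alpha, best_sse = m, alpha, sse
        coef[best] += eta * best_alpha
        for i in range(N):
            r[i] -= eta * best_alpha * X[best][i]
        losses.append(sum(v * v for v in r))
    return coef, losses
```

Because each step uses the optimal α for the chosen learner, the training loss is non-increasing, and with enough steps it is driven toward zero on this toy orthogonal design.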
Connection to Boosting
AdaBoost (Freund & Schapire 1996):
L(y, f) = exp(−y · f), y ∈ {−1, 1}
F(x) = sign(Σ_{m=1}^M α_m f(x; p_m))
{α_m}_1^M = sequential partial regression coefficients
Gradient Boosting (MART, Friedman 2001):
general y and L(y, f); α_m shrunk (η ≪ 1)
F(x) = Σ_{m=1}^M α_m f(x; p_m)
34
Potential improvements (ISLE):
(1) F(x) = Σ_{m=1}^M ĉ_m f(x; p̂_m)
{p̂_m}_1^M ~ sequential sampling on q̂(z)
{ĉ_m}_1^M ~ regularized linear regression
(2) and/or a hybrid with random q_m(z) (speed): sample K from N without replacement
35
ISLE Paradigm
Wide variety of ISLE methods, varying:
(1) base learner f(x; p)
(2) loss criterion L(y, f)
(3) perturbation method
(4) degree of perturbation: σ of r(p)
(5) iid vs. sequential sampling
(6) hybrids
Examine several options.
36
Monte Carlo Study
100 data sets: each N = 10000, n = 40
{y_il = F_l(x_i) + ε_il}_{i=1}^{10000}, l = 1, ..., 100
{F_l(x)}_1^{100} = different (random) target functions
x_i ~ N(0, I_40), ε_il ~ N(0, Var_x(F_l(x))) ⇒ signal/noise = 1/1
37
Evaluation Criteria
Relative RMS error: rmse(F̂_jl) = [1 − R²(F_l, F̂_jl)]^{1/2}
Comparative RMS error: cmse = rmse(F̂_jl)/min_k{rmse(F̂_kl)} (adjusts for problem difficulty)
j, k ∈ {respective methods}; 10000 independent observations
38
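The two evaluation criteria can be computed directly (a small sketch; the function names are mine):

```python
# Relative RMS error [1 - R^2]^{1/2} and comparative RMS error
# rmse_j / min_k rmse_k, as defined on the slide above.
import math

def r_squared(F_true, F_hat):
    mean = sum(F_true) / len(F_true)
    ss_res = sum((t - h) ** 2 for t, h in zip(F_true, F_hat))
    ss_tot = sum((t - mean) ** 2 for t in F_true)
    return 1.0 - ss_res / ss_tot

def rmse(F_true, F_hat):
    return math.sqrt(1.0 - r_squared(F_true, F_hat))

def cmse(F_true, F_hats):
    # comparative error of each method relative to the best method
    errs = [rmse(F_true, F_hat) for F_hat in F_hats]
    best = min(errs)
    return [e / best for e in errs]
```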
Properties of F_l(x):
(1) 30 "noise" variables
(2) wide variety of functions (difficulty)
(3) emphasize lower-order interactions
(4) not in the span of the base learners
Decision Trees
39
[Figure: Bagging Relative RMS Error. Relative RMS error (y-axis, 0.3–0.9) for Bag, Bag_P, Bag_.05, Bag_.05_P, Bag_6, Bag_6_P, Bag_6_.05_P.]
40
[Figure: Bagging Comparative RMS Error. Comparative RMS error (y-axis, 1.0–2.0) for Bag, Bag_P, Bag_.05, Bag_.05_P, Bag_6, Bag_6_P, Bag_6_.05_P.]
41
[Figure: Random Forests Comparative RMS Error. Comparative RMS error (y-axis, 1.0–2.5) for RF, RF_P, RF_.05, RF_.05_P, RF_6, RF_6_P, RF_6_.05_P.]
42
[Figure: Bag/RF Comparative RMS Error. Comparative RMS error (y-axis, 1.0–1.6) for Bag, RF, Bag_6_05_P, RF_6_05_P.]
43
[Figure: Sequential Sampling Comparative RMS Error. Comparative RMS error (y-axis, 1.0–1.3) for Mart, Mart_P, Mart_10_P, Mart_.01_10_P, Mart_.01_20_P.]
44
[Figure: Seq/Bag/RF Comparative RMS Error. Comparative RMS error (y-axis, 1.0–2.2) for Mart, Mart_P, Seq_.01_.2, Bag, Bag_6_.05_p, RF, RF_6_.05_p.]
45
SUMMARY
Theory: a single paradigm ~ Monte Carlo integration unifies
(1) bagging, (2) random forests, (3) Bayesian model averaging, (4) boosting
(1)–(3): iid Monte Carlo, p ~ r(p); (1), (2) via perturbation sampling, (3) via MCMC
(4): quasi-Monte Carlo (approximate sequential sampling)
46
Practice:
{ĉ_m}_1^M ← lasso linear regression
(1) improves accuracy of RF and bagging (and is faster)
(2) combined with aggressive subsampling and weaker base learners, improves speed: bagging & RF by > 10^2, MART by ~5, allowing much bigger data sets
Also, prediction is many times faster.
47
FUTURE DIRECTIONS
(1) More thorough understanding (theory) → specific recommendations
48
(2) Multiple learning ensembles (MISLEs)
F(x) = Σ_{k=1}^K ∫_{P_k} a_k(p_k) f_k(x; p_k) dp_k
{f_k(x; p_k)}_1^K = different (comp.) base learners
F̂(x) = Σ ĉ_km f_k(x; p_km)
{ĉ_km} ← combined lasso regression
Example: f_1 = decision trees, f_2 = {x_j}_1^n (no sampling)
49
SLIDES: http://www-stat.stanford.edu/~jhf/isletalk.pdf
REFERENCES
HTF (2001): Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, Springer.
EHT (2002): Efron, Hastie & Tibshirani, Least Angle Regression (Stanford preprint).
50