Ensemble methods
CS 446
Why ensembles?

Standard machine learning setup:
▶ We have some data.
▶ We train 10 predictors (3-NN, least squares, SVM, ResNet, ...).
▶ We output the best on a validation set.

Question: can we do better than the best?
What if we use an ensemble/aggregate/combination?

We'll consider two approaches: boosting and bagging.
Bagging
Bagging?

This first approach is based upon a simple idea:
▶ If the predictors have independent errors, a majority vote of their outputs should be good.

Let's first check this.
Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4).
Red: all classifiers wrong. Green: at least half the classifiers wrong.

[Figure: histograms of Binom(n, 0.4) for various n. Fraction red (all wrong): 0.16 for n = 2, 0.064 for n = 3, 0.0256 for n = 4, 0.01024 for n = 5, 0.004096 for n = 6, 0.0016384 for n = 7. Fraction green (at least half wrong): 0.366897 for n = 10, 0.244663 for n = 20, 0.175369 for n = 30, 0.129766 for n = 40, 0.0978074 for n = 50, 0.0746237 for n = 60.]

Green region is the error of the majority vote! 0.075 ≪ 0.4 !!!
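As a sanity check on these numbers, here is a minimal sketch (assuming Python with scipy available) that computes the "fraction green", i.e. Pr[Binom(n, 0.4) ≥ n/2], for the values of n shown above:

```python
# Sketch: probability that at least half of n independent classifiers err,
# when each errs independently with probability p = 0.4 (ties count as errors).
from math import ceil
from scipy.stats import binom

p = 0.4
for n in [10, 20, 30, 40, 50, 60]:
    err = binom.sf(ceil(n / 2) - 1, n, p)   # Pr[Binom(n, p) >= ceil(n/2)]
    print(f"n = {n:2d}: majority-vote error = {err:.6f}")
# n = 60 gives roughly 0.0746, matching the figure.
```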
Majority vote

[Figure: Binom(n, 0.4) histograms for n = 10, 20, ..., 60; the green region is the error of the majority vote.]

Suppose y_i ∈ {−1, +1}.

MAJ(y_1, ..., y_n) := +1 when ∑_i y_i ≥ 0, and −1 when ∑_i y_i < 0.

Error rate of the majority classifier (with individual error probability p):

Pr[Binom(n, p) ≥ n/2] = ∑_{i=⌈n/2⌉}^{n} C(n, i) p^i (1 − p)^{n−i} ≤ exp(−n(1/2 − p)²).
Bottom line

[Figure: Binom(60, 0.4) histogram; the green region, of mass ≈ 0.075, is the error of the majority vote.]

Error of the majority vote classifier goes down exponentially in n:

Pr[Binom(n, p) ≥ n/2] = ∑_{i=⌈n/2⌉}^{n} C(n, i) p^i (1 − p)^{n−i} ≤ exp(−n(1/2 − p)²).
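To see the exponential decay concretely, a short sketch (again assuming scipy) comparing the exact tail probability with the bound exp(−n(1/2 − p)²):

```python
# Sketch: exact majority-vote error vs. the exponential bound.
from math import ceil, exp
from scipy.stats import binom

p = 0.4
for n in [10, 20, 40, 80, 160]:
    exact = binom.sf(ceil(n / 2) - 1, n, p)
    bound = exp(-n * (0.5 - p) ** 2)
    print(f"n = {n:3d}: exact = {exact:.2e}, bound = {bound:.2e}")
```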
From independent errors to an algorithm

How to use independent errors in an algorithm?

1. For t = 1, 2, ..., T:
   1.1 Obtain IID data S_t := ((x_i^(t), y_i^(t)))_{i=1}^n,
   1.2 Train classifier f_t on S_t.
2. Output x ↦ MAJ(f_1(x), ..., f_T(x)).

▶ Good news: errors are independent! (Our exponential error estimate from before is valid.)
▶ Bad news: each classifier is trained on only a 1/T fraction of the data (why not just train a ResNet on all of it...).
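A minimal sketch of this scheme, assuming Python with numpy and scikit-learn, labels in {−1, +1}, and a decision tree as a stand-in base classifier:

```python
# Sketch: train T classifiers on T disjoint chunks of the data, predict by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_split_ensemble(X, y, T, seed=0):
    rng = np.random.default_rng(seed)
    chunks = np.array_split(rng.permutation(len(X)), T)
    return [DecisionTreeClassifier().fit(X[idx], y[idx]) for idx in chunks]

def majority_vote(models, X):
    votes = np.stack([m.predict(X) for m in models])   # shape (T, n_test), values in {-1, +1}
    return np.sign(votes.sum(axis=0) + 1e-9)            # break ties toward +1
```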
Bagging

Bagging = Bootstrap aggregating (Leo Breiman, 1994).

1. Obtain IID data S := ((x_i, y_i))_{i=1}^n.
2. For t = 1, 2, ..., T:
   2.1 Resample n points uniformly at random with replacement from S, obtaining "bootstrap sample" S_t.
   2.2 Train classifier f_t on S_t.
3. Output x ↦ MAJ(f_1(x), ..., f_T(x)).

▶ Good news: using most of the data for each f_t!
▶ Bad news: errors are no longer independent...?
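The only change from the previous sketch is how S_t is formed: each classifier now sees a bootstrap resample of the full data set rather than a disjoint chunk. A sketch under the same assumptions:

```python
# Sketch: bagging -- each classifier trains on n points drawn with replacement.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagged_ensemble(X, y, T, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models                                        # combine with majority_vote as before
```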
Sampling with replacement?

Question:
Take n samples uniformly at random with replacement from a population of size n. What is the probability that a given individual is not picked?

Answer:
(1 − 1/n)^n; for large n: lim_{n→∞} (1 − 1/n)^n = 1/e ≈ 0.3679.

Implications for bagging:
▶ Each bootstrap sample contains about 63% of the data set.
▶ The remaining 37% can be used to estimate the error rate of the classifier trained on that bootstrap sample.
▶ If we have three classifiers, some of their error estimates must share examples! Independence is violated!
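A quick empirical check of the 63%/37% split (a sketch assuming numpy):

```python
# Sketch: fraction of distinct points that appear in one bootstrap sample of size n.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
idx = rng.integers(0, n, size=n)
print(len(np.unique(idx)) / n)   # roughly 0.632, i.e. about 1 - 1/e of the data is picked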
Random Forests

Random Forests (Leo Breiman, 2001).

1. Obtain IID data S := ((x_i, y_i))_{i=1}^n.
2. For t = 1, 2, ..., T:
   2.1 Resample n points uniformly at random with replacement from S, obtaining "bootstrap sample" S_t.
   2.2 Train a decision tree f_t on S_t as follows: when greedily splitting tree nodes, consider only √d of the d possible features.
3. Output x ↦ MAJ(f_1(x), ..., f_T(x)).

▶ Heuristic news: maybe errors are more independent now?
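In practice this recipe is rarely coded from scratch; for instance, scikit-learn's RandomForestClassifier implements bootstrap resampling plus per-split feature subsampling. A usage sketch, with hyperparameter values chosen purely for illustration:

```python
# Sketch: random forest considering only sqrt(d) features at each split.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,      # T bagged trees
    max_features="sqrt",   # consider sqrt(d) candidate features per split
    bootstrap=True,        # each tree sees a bootstrap sample of the data
)
# forest.fit(X_train, y_train); forest.predict(X_test)
```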
Boosting — a quick look
Boosting overview

▶ We no longer assume classifiers have independent errors.
▶ We no longer output a simple majority: we reweight the classifiers via optimization.
▶ There is a rich theory with many interpretations.
Simplified boosting scheme

1. Start with data ((x_i, y_i))_{i=1}^n and classifiers (h_1, ..., h_T).
2. Find weights w ∈ R^T which approximately minimize

   (1/n) ∑_{i=1}^n ℓ(y_i ∑_{j=1}^T w_j h_j(x_i)) = (1/n) ∑_{i=1}^n ℓ(y_i w^T z_i),

   where z_i = (h_1(x_i), ..., h_T(x_i)) ∈ R^T.
   (We use classifiers to give us features.)
3. Predict with x ↦ ∑_{j=1}^T w_j h_j(x).

Remarks.
▶ If ℓ is convex, this is standard linear prediction: convex in w.
▶ In the classical setting: ℓ(r) = exp(−r), optimizer = coordinate descent, T = ∞.
▶ Most commonly, (h_1, ..., h_T) are decision stumps.
▶ Popular software implementation: xgboost.
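A minimal sketch of step 2 (assuming numpy): with the stump outputs collected into a feature matrix Z, minimizing the empirical exponential loss over w is an ordinary convex problem, handled here with plain gradient descent rather than the classical coordinate descent.

```python
# Sketch: given Z[i, j] = h_j(x_i) and labels y_i in {-1, +1},
# minimize (1/n) * sum_i exp(-y_i * w^T z_i) by gradient descent.
import numpy as np

def fit_weights(Z, y, lr=0.1, steps=2000):
    n, T = Z.shape
    w = np.zeros(T)
    for _ in range(steps):
        margins = y * (Z @ w)                        # y_i * w^T z_i
        q = -np.exp(-margins)                        # ell'(margin) for ell(r) = exp(-r)
        grad = (Z * (q * y)[:, None]).mean(axis=0)   # gradient of the average loss
        w -= lr * grad
    return w                                         # predict with np.sign(Z_new @ w)
```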
Decision stumps?

Classifying irises by sepal and petal measurements:
▶ X = R², Y = {1, 2, 3}
▶ x1 = ratio of sepal length to width
▶ x2 = ratio of petal length to width

[Figure: the iris data plotted with x1 (sepal length/width) on the horizontal axis and x2 (petal length/width) on the vertical axis.]

A decision stump makes a single split and stops: here, test "x1 > 1.7", with leaves y = 1 and y = 3 ... and stop there!
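For concreteness, a stump is just a thresholded single feature; a minimal sketch (assuming numpy, with hypothetical argument names):

```python
# Sketch: a decision stump -- one split on feature `dim` at threshold `thresh`.
import numpy as np

def stump_predict(X, dim, thresh, left_label, right_label):
    # e.g. dim=0, thresh=1.7, left_label=1, right_label=3 for the iris example above
    return np.where(X[:, dim] > thresh, right_label, left_label)
```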
Boosting decision stumps

Minimizing (1/n) ∑_{i=1}^n ℓ(y_i ∑_{j=1}^T w_j h_j(x_i)) over w ∈ R^T, where (h_1, ..., h_T) are decision stumps.

[Figure: a sequence of contour plots of the learned predictor on a 2-d dataset, shown over successive rounds of training, for three models of comparable size: boosted stumps (O(n) parameters), a 2-layer ReLU network (O(n) parameters), and a 3-layer ReLU network (O(n) parameters).]
Boosting — classical perspective
Coordinate descent?

The classical methods used coordinate descent:
▶ Find the maximum-magnitude coordinate of the gradient:

  argmax_j | d/dw_j ∑_{i=1}^n ℓ(∑_k w_k h_k(x_i) y_i) |
    = argmax_j | ∑_{i=1}^n ℓ′(∑_k w_k h_k(x_i) y_i) h_j(x_i) y_i |
    = argmax_j | ∑_{i=1}^n q_i h_j(x_i) y_i |,

  where we've defined q_i := ℓ′(∑_k w_k h_k(x_i) y_i).

▶ Iterate: w′ := w − η s e_j, where j is the maximum coordinate, s ∈ {−1, +1} is the sign of that gradient coordinate, and η is a step size.
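A sketch of one such coordinate-descent step (assuming numpy, with Z[i, j] = h_j(x_i) as before and ℓ(r) = exp(−r)):

```python
# Sketch: one coordinate-descent step on the exponential loss.
import numpy as np

def coordinate_step(Z, y, w, eta=0.1):
    q = -np.exp(-y * (Z @ w))        # q_i = ell'(sum_k w_k h_k(x_i) y_i)
    corr = Z.T @ (q * y)             # corr_j = sum_i q_i h_j(x_i) y_i
    j = np.argmax(np.abs(corr))      # maximum-magnitude gradient coordinate
    w = w.copy()
    w[j] -= eta * np.sign(corr[j])   # step against the sign of that coordinate
    return w
```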
Interpreting coordinate descent

Suppose h_j : R^d → {−1, +1}; then h_j(x) y = 2·1[h_j(x) = y] − 1, and each step solves

  argmax_j | ∑_{i=1}^n q_i h_j(x_i) y_i | = argmax_j | ∑_{i=1}^n q_i (1[h_j(x_i) = y_i] − 1/2) |.

We are solving a weighted zero-one loss minimization problem.

Remarks:
▶ The classical choice of coordinate descent is equivalent to solving a problem akin to weighted zero-one loss minimization.
▶ We can abstract away the finite set (h_1, ..., h_T) and allow an arbitrary set of predictors (e.g., all linear classifiers).
Classical boosting setup

There is a Weak Learning Oracle, and a corresponding γ-weak-learnability assumption:

  A set of points is γ-weak-learnable by a weak learning oracle if, for any weighting q, the oracle returns a predictor h so that E_q(h(X)Y) ≥ γ.

Interpretation: for any reweighting q, we get a predictor h which is at least γ-correlated with the target.

Remarks:
▶ The classical methods iteratively invoke the oracle with different weightings and then output a final aggregated predictor.
▶ The best-known method, AdaBoost, performs coordinate-descent updates (invoking the oracle) with a specific step size, and needs O((1/γ²) ln(1/ε)) iterations for accuracy ε > 0.
▶ The original description of AdaBoost is in terms of the sequence of weightings q_1, q_2, ..., and says nothing about coordinate descent.
▶ Adaptive Boosting: the method doesn't need to know γ, and adapts to varying γ_t := E_{q_t}(h_t(X)Y).
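A compact sketch of the classical loop in terms of the weightings q_1, q_2, ... (assuming numpy and a weak_learner(X, y, q) oracle returning a ±1-valued predictor; the step size α_t below is the standard AdaBoost choice):

```python
# Sketch: AdaBoost as a sequence of reweightings q_1, q_2, ...
import numpy as np

def adaboost(X, y, weak_learner, T):
    n = len(y)
    q = np.full(n, 1.0 / n)                  # initial (uniform) weighting
    hs, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, q)            # oracle call under weighting q
        pred = h(X)                          # predictions in {-1, +1}
        eps = q[pred != y].sum()             # weighted error under q
        alpha = 0.5 * np.log((1 - eps) / eps)
        q = q * np.exp(-alpha * y * pred)    # upweight mistakes, downweight correct points
        q /= q.sum()
        hs.append(h)
        alphas.append(alpha)
    return lambda Xnew: np.sign(sum(a * h(Xnew) for a, h in zip(alphas, hs)))
```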
Example: AdaBoost with decision stumps

(This example is from Schapire & Freund's book.)

Weak learning oracle (WLO): pick the best decision stump, meaning
  F := { x ↦ sign(x_i − b) : i ∈ {1, ..., d}, b ∈ R }.
(It is straightforward to handle weights in ERM.)

Remark:
▶ Only need to consider O(n) stumps (Why?).
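A sketch of this weak learning oracle (assuming numpy, and matching the weak_learner(X, y, q) interface of the AdaBoost sketch above): for each coordinate, only thresholds between consecutive distinct sample values can change the predictions, so O(n) stumps per coordinate suffice.

```python
# Sketch: weighted ERM over stumps s * sign(x[dim] - b), allowing sign flips.
import numpy as np

def best_stump(X, y, q):
    n, d = X.shape
    best_err, best_params = np.inf, None
    for dim in range(d):
        vals = np.unique(X[:, dim])
        # one threshold below all values, plus midpoints between consecutive values
        thresholds = np.concatenate(([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2))
        for b in thresholds:
            for s in (+1.0, -1.0):
                pred = s * np.sign(X[:, dim] - b)
                err = q[pred != y].sum()
                if err < best_err:
                    best_err, best_params = err, (dim, b, s)
    dim, b, s = best_params
    return lambda Z: s * np.sign(Z[:, dim] - b)
```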
Example: execution of AdaBoost

[Figure: three rounds of AdaBoost on a toy 2-d dataset of positive and negative points. Each round t shows the weighting D_t over the examples and the decision stump f_t chosen under it; examples misclassified in one round receive larger weight in the next.]
Example: final classifier from AdaBoost

[Figure: the three stumps f1, f2, f3 overlaid on the data, together with the combined decision regions.]

Final classifier: f(x) = sign(0.42 f1(x) + 0.65 f2(x) + 0.92 f3(x)).
(Zero training error rate!)
A typical run of boosting.

AdaBoost + C4.5 on the "letters" dataset.

[Figure 1.7 from Schapire & Freund: training and test percent error rates of boosting on an OCR dataset with C4.5 as the base learner, plotted against the number of rounds T (from 10 to 1000). The lower curve is AdaBoost's training error and the upper curve its test error; horizontal lines mark the test error of C4.5 alone and AdaBoost's final test error after 1000 rounds.]

(# nodes across all decision trees in f is > 2 × 10^6)

Training error rate is zero after just five rounds, but test error rate continues to decrease, even up to 1000 rounds!

(Figure 1.7 from the Schapire & Freund text.)
Boosting the margin.

Final classifier from AdaBoost:

  f(x) = sign(g(x)),   where g(x) := (∑_{t=1}^T α_t f_t(x)) / (∑_{t=1}^T |α_t|) ∈ [−1, +1].

Call y·g(x) ∈ [−1, +1] the margin achieved on example (x, y). (Note: ℓ1, not ℓ2, normalized.)

Margin theory [Schapire, Freund, Bartlett, and Lee, 1998]:
▶ Larger margins ⇒ better generalization, independent of T.
▶ AdaBoost tends to increase margins on training examples.

"letters" dataset:

                          T = 5    T = 100   T = 1000
  training error rate     0.0%     0.0%      0.0%
  test error rate         8.4%     3.3%      3.1%
  % margins ≤ 0.5         7.7%     0.0%      0.0%
  min. margin             0.14     0.52      0.55

▶ A similar phenomenon appears in deep networks trained with gradient descent.
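A sketch of the ℓ1-normalized margins used in this table (assuming numpy and the alphas/hs produced by an AdaBoost run such as the sketch above):

```python
# Sketch: l1-normalized margins y_i * g(x_i) of a boosted ensemble.
import numpy as np

def boosting_margins(X, y, hs, alphas):
    alphas = np.asarray(alphas)
    H = np.stack([h(X) for h in hs])          # shape (T, n), entries in {-1, +1}
    g = (alphas @ H) / np.abs(alphas).sum()   # g(x_i) in [-1, +1]
    return y * g                              # margin of each example
```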
Margin plots

Given ((x_i, y_i))_{i=1}^n and f, plot the unnormalized margin distribution

  f(x_i)_{y_i} − max_{y ≠ y_i} f(x_i)_y.

[Figure: empirical CDFs of these unnormalized margins over successive rounds of training, alongside the corresponding decision-boundary contour plots, for boosted stumps (O(n) parameters), a 2-layer ReLU network (O(n) parameters), and a 3-layer ReLU network (O(n) parameters). The margin distributions shift toward larger values as training proceeds.]
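A sketch of such a plot (assuming numpy and matplotlib, with scores[i, k] holding the score f(x_i)_k for class k and y holding integer class indices):

```python
# Sketch: empirical CDF of unnormalized margins f(x_i)_{y_i} - max_{y != y_i} f(x_i)_y.
import numpy as np
import matplotlib.pyplot as plt

def plot_margin_cdf(scores, y):
    n = len(y)
    true_scores = scores[np.arange(n), y]                 # f(x_i)_{y_i}
    others = scores.copy()
    others[np.arange(n), y] = -np.inf                     # mask out the true class
    margins = true_scores - others.max(axis=1)            # gap to the best other class
    plt.plot(np.sort(margins), np.arange(1, n + 1) / n)   # empirical CDF
    plt.xlabel("margin")
    plt.ylabel("fraction of examples")
    plt.show()
```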
Summary
Summary

▶ We can do better than the best single predictor.
▶ (Bagging.) If errors are independent, a majority vote works well.
▶ (Boosting.) If they are not independent, a reweighted majority works well; the weights can be found with convex optimization.