
Page 1: Low Bias Bagged Support Vector Machines

Giorgio Valentini
Dipartimento di Scienze dell'Informazione
Università degli Studi di Milano, Italy
[email protected]

Thomas G. Dietterich
Department of Computer Science
Oregon State University
Corvallis, Oregon 97331 USA
http://www.cs.orst.edu/~tgd

ICML 2003, 8/22/2003

Page 2: Two Questions

Can bagging help SVMs?
If so, how should SVMs be tuned to give the best bagged performance?

Page 3: The Answers

Can bagging help SVMs? Yes

If so, how should SVMs be tuned to give the best bagged performance? Tune to minimize the bias of each SVM

Page 4: SVMs

Soft margin classifier: minimizes the VC dimension (maximizes the margin) subject to soft separation of the training data
The dot product can be generalized using kernels K(xj, xi)
Set C and the kernel parameter using an internal validation set
Excellent control of the bias/variance tradeoff: is there any room for improvement?

minimize:   ||w||² + C Σi ξi
subject to: yi (w · xi + b) + ξi ≥ 1
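
The tuning step above (choosing C and the kernel parameter on an internal validation set) can be sketched roughly as follows; the use of scikit-learn's SVC, the parameter grids, and the 70/30 split are illustrative assumptions, not the setup from the talk.

    # Minimal sketch: pick C and the RBF width by error on an internal validation set.
    # The grids, the split, and scikit-learn itself are illustrative choices.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    def tune_svm(X, y, Cs=(0.1, 1, 10, 100), gammas=(0.001, 0.01, 0.1, 1)):
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
        best = None
        for C in Cs:
            for gamma in gammas:  # gamma plays the role of the Gaussian width parameter
                clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
                err = 1.0 - clf.score(X_val, y_val)   # validation error rate
                if best is None or err < best[0]:
                    best = (err, C, gamma)
        return best   # (validation error, best C, best gamma)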

Page 5: Bias/Variance Error Decomposition for Squared Loss

For regression problems, the loss is (ŷ − y)²

error² = bias² + variance + noise
ES[(ŷ − y)²] = (ES[ŷ] − f(x))² + ES[(ŷ − ES[ŷ])²] + E[(y − f(x))²]

Bias: systematic error at data point x, averaged over all training sets S of size N
Variance: variation around the average
Noise: errors in the observed labels of x

Page 6: Example: 20 points

y = x + 2 sin(1.5x) + N(0,0.2)

Page 7: Example: 50 fits (20 examples each)
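
To make the decomposition from Page 5 concrete with this example, here is a small simulation sketch: it draws 50 training sets of 20 points from y = x + 2 sin(1.5x) plus Gaussian noise, fits a regressor to each, and estimates bias², variance, and noise on a test grid. The base learner (a shallow decision tree), the x-range, and the reading of N(0, 0.2) as a standard deviation of 0.2 are assumptions for illustration.

    # Sketch: Monte Carlo estimate of the squared-loss bias/variance decomposition.
    # All constants below (learner, x-range, noise std) are illustrative.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    f = lambda x: x + 2 * np.sin(1.5 * x)           # true function f(x)
    noise_std = 0.2                                 # assumed std of the label noise
    x_test = np.linspace(0, 10, 50)

    n_sets, n_points = 50, 20                       # 50 fits, 20 examples each
    preds = np.empty((n_sets, x_test.size))
    for s in range(n_sets):
        x = rng.uniform(0, 10, n_points)
        y = f(x) + rng.normal(0, noise_std, n_points)
        model = DecisionTreeRegressor(max_depth=3).fit(x[:, None], y)
        preds[s] = model.predict(x_test[:, None])

    mean_pred = preds.mean(axis=0)                  # ES[ŷ]
    bias2 = (mean_pred - f(x_test)) ** 2            # (ES[ŷ] − f(x))²
    variance = preds.var(axis=0)                    # ES[(ŷ − ES[ŷ])²]
    noise = noise_std ** 2                          # E[(y − f(x))²]
    print(bias2.mean(), variance.mean(), noise)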

Page 8: Bias

Page 9: Variance

Page 10: Noise

Page 11: Variance Reduction and Bagging

Bagging attempts to simulate a large number of training sets and to compute the average prediction ym over those training sets

It then predicts ym

If the simulation is good enough, this eliminates all of the variance
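
A minimal sketch of bagging in the regression setting above, assuming a single available training set: train on bootstrap replicates and predict the average. The base learner and the 100 bootstrap rounds are illustrative assumptions.

    # Sketch of bagging for regression: average the predictions of models trained
    # on bootstrap replicates of one training set.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def bagged_predict(X, y, X_test, n_bags=100, seed=0):
        rng = np.random.default_rng(seed)
        preds = []
        for _ in range(n_bags):
            idx = rng.integers(0, len(y), len(y))    # bootstrap sample, with replacement
            model = DecisionTreeRegressor().fit(X[idx], y[idx])
            preds.append(model.predict(X_test))
        return np.mean(preds, axis=0)                # ym: the averaged prediction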

Page 12: Bias and Variance for 0/1 Loss (Domingos, 2000)

At each test point x, we have 100 estimates: ŷ1, …, ŷ100 ∈ {−1, +1}

Main prediction: ym = majority vote
Bias(x) = 0 if ym is correct and 1 otherwise
Variance(x) = probability that ŷ ≠ ym

Unbiased variance VU(x): variance when Bias = 0
Biased variance VB(x): variance when Bias = 1

Error rate(x) = Bias(x) + VU(x) − VB(x)
Noise is assumed to be zero
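
The bookkeeping on this slide can be written out directly; a minimal sketch for one test point, assuming binary labels in {−1, +1} and zero noise as stated above (the function and variable names are ours, not the authors'):

    # Sketch: Domingos-style terms at one test point, given predictions from many
    # classifiers (e.g., 100) and the true label. Noise is assumed to be zero.
    import numpy as np

    def domingos_terms(y_hats, y_true):
        y_hats = np.asarray(y_hats)
        y_m = 1 if (y_hats == 1).sum() >= (y_hats == -1).sum() else -1  # main prediction (majority vote)
        bias = 0 if y_m == y_true else 1
        variance = np.mean(y_hats != y_m)        # P(ŷ != ym)
        vu = variance if bias == 0 else 0.0      # unbiased variance VU(x)
        vb = variance if bias == 1 else 0.0      # biased variance VB(x)
        error = np.mean(y_hats != y_true)        # equals bias + vu - vb for binary labels
        return bias, vu, vb, error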

Page 13: Good Variance and Bad Variance

Error rate(x) = Bias(x) + VU(x) – VB(x)

VB(x) is “good” variance, but only when the bias is high

VU(x) is “bad” variance

Bagging will reduce both types of variance. This gives good results if Bias(x) is small.

Goal: Tune classifiers to have small bias and rely on bagging to reduce variance

Page 14: Lobag

Given:
  Training examples {(xi, yi)}, i = 1, …, N
  Learning algorithm with tuning parameters λ
  Parameter settings to try {λ1, λ2, …}
Do:
  Apply internal bagging to compute out-of-bag estimates of the bias of each parameter setting
  Let λ* be the setting that gives minimum bias
  Perform bagging using λ*
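
A rough sketch of this procedure for a binary problem with labels in {−1, +1} is shown below; the out-of-bag bias estimate, the parameter grid, the 30 internal bags, and the use of scikit-learn's SVC are illustrative assumptions, not the authors' implementation.

    # Sketch of Lobag: estimate the bias of each parameter setting from out-of-bag
    # main predictions, pick the minimum-bias setting, then bag with that setting.
    import numpy as np
    from sklearn.svm import SVC

    def oob_bias(X, y, params, n_bags=30, seed=0):
        rng = np.random.default_rng(seed)
        n = len(y)
        votes = np.zeros(n)                         # summed out-of-bag votes per example
        counts = np.zeros(n)                        # times each example was out-of-bag
        for _ in range(n_bags):
            idx = rng.integers(0, n, n)             # bootstrap replicate
            oob = np.setdiff1d(np.arange(n), idx)
            clf = SVC(**params).fit(X[idx], y[idx])
            votes[oob] += clf.predict(X[oob])
            counts[oob] += 1
        seen = counts > 0
        y_main = np.where(votes[seen] >= 0, 1, -1)  # out-of-bag main prediction
        return np.mean(y_main != y[seen])           # estimated bias

    def lobag(X, y, param_grid, n_bags=100, seed=0):
        best = min(param_grid, key=lambda p: oob_bias(X, y, p))   # minimum-bias setting
        rng = np.random.default_rng(seed)
        ensemble = [SVC(**best).fit(X[idx], y[idx])
                    for idx in (rng.integers(0, len(y), len(y)) for _ in range(n_bags))]
        return best, ensemble                        # predict by majority vote over the ensemble

    # Usage sketch (grid values are arbitrary):
    # grid = [{"kernel": "rbf", "C": C, "gamma": g} for C in (1, 10, 100) for g in (0.01, 0.1)]
    # best_params, models = lobag(X_train, y_train, grid)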

Page 15: Example: Letter2, RBF kernel, C = 100

[Plot annotations: minimum error, minimum bias]

Page 16: Experimental Study

Seven data sets: P2, waveform, grey-landsat, spam, musk, letter2 (letter recognition 'B' vs 'R'), letter2+noise (20% added noise)

Three kernels: dot product, RBF (σ = Gaussian width), polynomial (d = degree)

Training set: 100 examples
Final classifier is a bag of 100 SVMs trained with the chosen C and kernel parameter

Page 17: Results: Dot Product Kernel

Page 18: Results (2): Gaussian Kernel

Page 19: Results (3): Polynomial Kernel

Page 20: McNemar's Tests: Bagging versus Single SVM

[Bar chart: percentage of wins, ties, and losses for linear, polynomial, and RBF kernels]

Page 21: McNemar's Test: Lobag versus Single SVM

[Bar chart: percentage of wins, ties, and losses for linear, polynomial, and RBF kernels]

Page 22: McNemar's Test: Lobag versus Bagging

[Bar chart: percentage of wins, ties, and losses for linear, polynomial, and RBF kernels]

Page 23: Results: McNemar's Test (wins – ties – losses)

Kernel       Lobag vs Bagging   Lobag vs Single   Bagging vs Single
Linear       3 – 26 – 1         23 – 7 – 0        21 – 9 – 0
Polynomial   12 – 23 – 0        17 – 17 – 1       9 – 24 – 2
Gaussian     17 – 17 – 1        18 – 15 – 2       9 – 22 – 4
Total        32 – 66 – 2        58 – 39 – 3       39 – 55 – 6
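
For reference, a single win/tie/loss entry in a table like this comes from McNemar's test on paired predictions; a minimal sketch, assuming the common continuity-corrected chi-square form and a 0.05 significance level (the talk may have used a different variant):

    # Sketch: decide win/tie/loss between classifiers A and B with McNemar's test.
    # n01 = test examples A got right and B got wrong; n10 = the reverse.
    def mcnemar_outcome(n01, n10):
        if n01 + n10 == 0:
            return "tie"
        chi2 = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)   # continuity-corrected statistic
        if chi2 < 3.841:                                  # 0.05 critical value, 1 d.o.f.
            return "tie"
        return "win for A" if n01 > n10 else "win for B"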

Page 24: Discussion

For small training sets:
  Bagging can improve SVM error rates, especially for linear kernels
  Lobag is at least as good as bagging and often better

Consistent with previous experience:
  Bagging works better with unpruned trees
  Bagging works better with neural networks that are trained longer or with less weight decay

Page 25: Conclusions

Lobag is recommended for SVM problems with high variance (small training sets, high noise, many features)

Added cost:
  SVMs require internal validation to set C and the kernel parameter
  Lobag requires internal bagging to estimate the bias for each setting of C and the kernel parameter

Future research:
  Smart search for low-bias settings of C and the kernel parameter
  Experiments with larger training sets