Ch 7. Sparse Kernel Machines
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.

Summarized by S. Kim

Biointelligence Laboratory, Seoul National University

http://bi.snu.ac.kr/


Contents

Maximum Margin Classifiers
- Overlapping Class Distributions
- Relation to Logistic Regression
- Multiclass SVMs
- SVMs for Regression

Relevance Vector Machines
- RVM for Regression
- Analysis of Sparsity
- RVMs for Classification


Maximum Margin Classifiers

Problem setting
- Two-class classification using the linear model $y(\mathbf{x}) = \mathbf{w}^{\mathrm T}\phi(\mathbf{x}) + b$
- Assume that the training data set is linearly separable in feature space

Support vector machine approach
- The decision boundary is chosen to be the one for which the margin, the perpendicular distance to the closest training points (the support vectors), is maximized


Maximum Margin Solution

For all data points, $t_n y(\mathbf{x}_n) > 0$. The distance of a point to the decision surface is
$$\frac{t_n y(\mathbf{x}_n)}{\|\mathbf{w}\|} = \frac{t_n (\mathbf{w}^{\mathrm T}\phi(\mathbf{x}_n) + b)}{\|\mathbf{w}\|}.$$

The maximum margin solution:
$$\arg\max_{\mathbf{w},\,b} \left\{ \frac{1}{\|\mathbf{w}\|} \min_n \left[ t_n (\mathbf{w}^{\mathrm T}\phi(\mathbf{x}_n) + b) \right] \right\}$$

Rescaling $\mathbf{w}$ and $b$ so that the closest points satisfy $t_n(\mathbf{w}^{\mathrm T}\phi(\mathbf{x}_n) + b) = 1$, this becomes the equivalent quadratic program
$$\arg\min_{\mathbf{w},\,b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad t_n (\mathbf{w}^{\mathrm T}\phi(\mathbf{x}_n) + b) \ge 1, \quad n = 1,\dots,N.$$


Dual Representation

Introducing Lagrange multipliers $a_n \ge 0$:
$$L(\mathbf{w},b,\mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^N a_n \left\{ t_n (\mathbf{w}^{\mathrm T}\phi(\mathbf{x}_n) + b) - 1 \right\}$$
(See Appendix E for more details.)

At a minimum, the derivatives of L with respect to $\mathbf{w}$ and $b$ are zero, giving
$$\mathbf{w} = \sum_{n=1}^N a_n t_n \phi(\mathbf{x}_n), \qquad 0 = \sum_{n=1}^N a_n t_n.$$

Dual representation:
$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^N a_n - \frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N a_n a_m t_n t_m k(\mathbf{x}_n,\mathbf{x}_m)$$
subject to $a_n \ge 0$, $n = 1,\dots,N$, and $\sum_{n=1}^N a_n t_n = 0$, where $k(\mathbf{x},\mathbf{x}') = \phi(\mathbf{x})^{\mathrm T}\phi(\mathbf{x}')$.
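As an illustration (my addition, not from the slides), a minimal numpy sketch that builds a Gaussian kernel matrix and evaluates this dual objective for a given multiplier vector; the function names are mine:

import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2) for all pairs of rows of X and Y
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def dual_objective(a, t, K):
    # L~(a) = sum_n a_n - (1/2) sum_{n,m} a_n a_m t_n t_m k(x_n, x_m)
    at = a * t
    return a.sum() - 0.5 * at @ K @ at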


Classifying New Data

New data are classified with the sign of
$$y(\mathbf{x}) = \mathbf{w}^{\mathrm T}\phi(\mathbf{x}) + b = \sum_{n=1}^N a_n t_n k(\mathbf{x},\mathbf{x}_n) + b.$$
The multipliers $\{a_n\}$ are found by solving a quadratic programming problem, which is $O(N^3)$ in general.

The optimization is subject to the Karush-Kuhn-Tucker (KKT) conditions (Appendix E):
$$a_n \ge 0, \qquad t_n y(\mathbf{x}_n) - 1 \ge 0, \qquad a_n \left\{ t_n y(\mathbf{x}_n) - 1 \right\} = 0.$$

Hence for every point either $a_n = 0$ or $t_n y(\mathbf{x}_n) = 1$; the points with $a_n > 0$ are the support vectors.

With $S$ the set of support vectors and $N_S$ their number,
$$b = \frac{1}{N_S} \sum_{n \in S} \left( t_n - \sum_{m \in S} a_m t_m k(\mathbf{x}_n,\mathbf{x}_m) \right).$$
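Given a solved dual, classification needs only the support vectors. A minimal sketch (my own illustration; a is assumed to come from a QP solver, and K_train, K_new are precomputed kernel matrices):

import numpy as np

def svm_classify(a, t, K_train, K_new, tol=1e-8):
    # Support vectors: points with a_n > 0
    S = a > tol
    # b = (1/N_S) sum_{n in S} ( t_n - sum_{m in S} a_m t_m k(x_n, x_m) )
    b = np.mean(t[S] - K_train[np.ix_(S, S)] @ (a[S] * t[S]))
    # y(x) = sum_{n in S} a_n t_n k(x, x_n) + b; classify by the sign
    return np.sign(K_new[:, S] @ (a[S] * t[S]) + b), b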


Example of Separable Data Classification

[Figure 7.2: two-class synthetic data with contours of constant y(x) from an SVM with a Gaussian kernel; the support vectors are highlighted]


Overlapping Class Distributions

Allow some examples to be misclassified (a soft margin) by introducing slack variables
$$\xi_n \ge 0, \quad n = 1,\dots,N,$$
with $\xi_n = 0$ for points on or inside the correct margin boundary and $\xi_n = |t_n - y(\mathbf{x}_n)|$ otherwise, so that the constraints become
$$t_n y(\mathbf{x}_n) = t_n (\mathbf{w}^{\mathrm T}\phi(\mathbf{x}_n) + b) \ge 1 - \xi_n.$$


Soft Margin Solution

Minimize
$$C \sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|^2,$$
where $C > 0$ controls the trade-off between minimizing training errors and controlling model complexity.

Lagrangian:
$$L(\mathbf{w},b,\mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^N \xi_n - \sum_{n=1}^N a_n \left\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \right\} - \sum_{n=1}^N \mu_n \xi_n$$

KKT conditions:
$$a_n \ge 0, \qquad t_n y(\mathbf{x}_n) - 1 + \xi_n \ge 0, \qquad a_n \left\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \right\} = 0,$$
$$\mu_n \ge 0, \qquad \xi_n \ge 0, \qquad \mu_n \xi_n = 0.$$

Again, either $a_n = 0$ or $t_n y(\mathbf{x}_n) = 1 - \xi_n$; the points with $a_n > 0$ are the support vectors.


Dual Representation

Dual representation (identical in form to the separable case):
$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^N a_n - \frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N a_n a_m t_n t_m k(\mathbf{x}_n,\mathbf{x}_m)$$
subject to the box constraints $0 \le a_n \le C$, $n = 1,\dots,N$, and $\sum_{n=1}^N a_n t_n = 0$.

New data are classified, and b is obtained, exactly as for hard-margin classifiers.
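As a sketch of how this box-constrained dual can be solved (my own illustration with a general-purpose scipy solver; the specialized QP methods on a later slide scale much better):

import numpy as np
from scipy.optimize import minimize

def fit_soft_margin_dual(K, t, C):
    # Minimize the negative dual objective subject to 0 <= a_n <= C
    # and the equality constraint sum_n a_n t_n = 0.
    Q = (t[:, None] * t[None, :]) * K

    def neg_dual(a):
        return 0.5 * a @ Q @ a - a.sum()

    def grad(a):
        return Q @ a - np.ones_like(a)

    N = len(t)
    res = minimize(neg_dual, np.zeros(N), jac=grad, method='SLSQP',
                   bounds=[(0.0, C)] * N,
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ t}])
    return res.x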


Alternative Formulation

ν-SVM (Schölkopf et al., 2000):
$$\tilde{L}(\mathbf{a}) = -\frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N a_n a_m t_n t_m k(\mathbf{x}_n,\mathbf{x}_m)$$
subject to
$$0 \le a_n \le \frac{1}{N}, \qquad \sum_{n=1}^N a_n t_n = 0, \qquad \sum_{n=1}^N a_n \ge \nu.$$

The parameter ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.
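For reference (my addition, not from the slides), scikit-learn exposes this ν-parameterization directly; the toy data below are hypothetical:

import numpy as np
from sklearn.svm import NuSVC

# Toy two-class data; nu bounds the fraction of margin errors from above
# and the fraction of support vectors from below.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
t = np.array([-1] * 50 + [+1] * 50)

clf = NuSVC(nu=0.1, kernel='rbf', gamma=0.5).fit(X, t)
print(len(clf.support_) / len(X))  # fraction of support vectors, roughly >= nu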


Example of Nonseparable Data Classification (ν-SVM)


Solutions of the QP Problem

Chunking (Vapnik, 1982)
- Idea: the value of the Lagrangian is unchanged if we remove the rows and columns of the kernel matrix corresponding to Lagrange multipliers that have value zero

Protected conjugate gradients (Burges, 1998)

Decomposition methods (Osuna et al., 1996)

Sequential minimal optimization (Platt, 1999)


Relation to Logistic Regression (Section 4.3.2)

For data points on the correct side of the margin, $\xi_n = 0$; for the remaining points, $\xi_n = 1 - y_n t_n$.

The objective
$$C \sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|^2$$
can therefore be written as
$$\sum_{n=1}^N E_{SV}(y_n t_n) + \lambda \|\mathbf{w}\|^2, \qquad \lambda = (2C)^{-1},$$
where $E_{SV}(yt) = [1 - yt]_+$ is the hinge error function and $[\,\cdot\,]_+$ denotes the positive part.


Relation to Logistic Regression (Cont'd)

From maximum likelihood logistic regression, with targets $t \in \{-1, +1\}$:
$$p(t=1 \mid y) = \sigma(y), \qquad p(t=-1 \mid y) = 1 - \sigma(y) = \sigma(-y), \qquad \text{so} \qquad p(t \mid y) = \sigma(yt).$$

The corresponding error function (negative log likelihood) with a quadratic regularizer is
$$\sum_{n=1}^N E_{LR}(y_n t_n) + \lambda \|\mathbf{w}\|^2, \qquad E_{LR}(yt) = \ln\left(1 + \exp(-yt)\right).$$


Comparison of Error Functions

[Figure: plots of the hinge error function, the error function for logistic regression, the misclassification error, and the squared error, each as a function of yt]
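A short numpy sketch (my addition) of the four error functions as functions of $z = yt$; the logistic error is rescaled by $1/\ln 2$ so that it passes through the point (0, 1):

import numpy as np

def hinge_error(z):
    # E_SV(z) = [1 - z]_+
    return np.maximum(0.0, 1.0 - z)

def logistic_error(z):
    # E_LR(z) = ln(1 + exp(-z)), divided by ln 2
    return np.log1p(np.exp(-z)) / np.log(2.0)

def misclassification_error(z):
    # Ideal 0-1 loss
    return (z <= 0).astype(float)

def squared_error(z):
    return (z - 1.0) ** 2

z = np.linspace(-2, 2, 9)
for f in (hinge_error, logistic_error, misclassification_error, squared_error):
    print(f.__name__, np.round(f(z), 2))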


Multiclass SVMs

One-versus-the-rest: K separate SVMs
- Can lead to inconsistent results (Figure 4.2)
- Imbalanced training sets
- Variant: positive class +1, negative class -1/(K-1) (Lee et al., 2001)
- Variant: an objective function for training all K SVMs simultaneously (Weston and Watkins, 1999)

One-versus-one: K(K-1)/2 SVMs

Error-correcting output codes (Allwein et al., 2000)
- A generalization of the voting scheme of the one-versus-one approach


SVMs for Regression

Simple linear regression minimizes the regularized sum of squared errors
$$\frac{1}{2}\sum_{n=1}^N (y_n - t_n)^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2.$$

To obtain sparse solutions, replace the quadratic error function with the ε-insensitive error function
$$E_\epsilon\left(y(\mathbf{x}) - t\right) = \begin{cases} 0, & \text{if } |y(\mathbf{x}) - t| < \epsilon \\ |y(\mathbf{x}) - t| - \epsilon, & \text{otherwise.} \end{cases}$$
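In code, the ε-insensitive error is a one-liner (my illustration):

import numpy as np

def eps_insensitive(y, t, eps=0.1):
    # E_eps(y - t): zero inside the eps-tube, linear outside it
    return np.maximum(0.0, np.abs(y - t) - eps)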


SVMs for Regression (Cont'd)

Minimize
$$C \sum_{n=1}^N E_\epsilon\left(y(\mathbf{x}_n) - t_n\right) + \frac{1}{2}\|\mathbf{w}\|^2.$$

Introducing two slack variables per data point, this is equivalent to minimizing
$$C \sum_{n=1}^N (\xi_n + \hat\xi_n) + \frac{1}{2}\|\mathbf{w}\|^2,$$
where $\xi_n \ge 0$, $\hat\xi_n \ge 0$, and
$$t_n \le y(\mathbf{x}_n) + \epsilon + \xi_n, \qquad t_n \ge y(\mathbf{x}_n) - \epsilon - \hat\xi_n.$$


Dual Problem

Lagrangian (from which w, b, and the slack variables are then eliminated):
$$L = C\sum_{n=1}^N (\xi_n + \hat\xi_n) + \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^N (\mu_n \xi_n + \hat\mu_n \hat\xi_n) - \sum_{n=1}^N a_n (\epsilon + \xi_n + y_n - t_n) - \sum_{n=1}^N \hat a_n (\epsilon + \hat\xi_n - y_n + t_n)$$

Dual representation:
$$\tilde{L}(\mathbf{a},\hat{\mathbf{a}}) = -\frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N (a_n - \hat a_n)(a_m - \hat a_m) k(\mathbf{x}_n,\mathbf{x}_m) - \epsilon \sum_{n=1}^N (a_n + \hat a_n) + \sum_{n=1}^N (a_n - \hat a_n) t_n$$
subject to the box constraints $0 \le a_n \le C$ and $0 \le \hat a_n \le C$, together with $\sum_{n=1}^N (a_n - \hat a_n) = 0$ from the derivative with respect to b.


Predictions

From the derivatives of the Lagrangian,
$$\mathbf{w} = \sum_{n=1}^N (a_n - \hat a_n)\phi(\mathbf{x}_n),$$
so predictions are made with
$$y(\mathbf{x}) = \sum_{n=1}^N (a_n - \hat a_n) k(\mathbf{x},\mathbf{x}_n) + b.$$

KKT conditions:
$$a_n (\epsilon + \xi_n + y_n - t_n) = 0, \qquad \hat a_n (\epsilon + \hat\xi_n - y_n + t_n) = 0,$$
$$(C - a_n)\xi_n = 0, \qquad (C - \hat a_n)\hat\xi_n = 0.$$

For any point with $0 < a_n < C$ (so that $\xi_n = 0$),
$$b = t_n - \epsilon - \mathbf{w}^{\mathrm T}\phi(\mathbf{x}_n) = t_n - \epsilon - \sum_{m=1}^N (a_m - \hat a_m) k(\mathbf{x}_n,\mathbf{x}_m).$$
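A minimal sketch (my own illustration) of these prediction equations, assuming the dual variables a, a_hat come from a QP solver and the kernel matrices are precomputed:

import numpy as np

def svr_offset(a, a_hat, t, K_train, C, eps, tol=1e-8):
    # Use any point with 0 < a_n < C, for which xi_n = 0 and
    # b = t_n - eps - sum_m (a_m - a_hat_m) k(x_n, x_m).
    # Assumes at least one such point exists.
    n = np.flatnonzero((a > tol) & (a < C - tol))[0]
    return t[n] - eps - K_train[n] @ (a - a_hat)

def svr_predict(a, a_hat, K_new, b):
    # y(x) = sum_n (a_n - a_hat_n) k(x, x_n) + b
    return K_new @ (a - a_hat) + b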


Alternative Formulation

ν-SVM (Schölkopf et al., 2000):
$$\tilde{L}(\mathbf{a},\hat{\mathbf{a}}) = -\frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N (a_n - \hat a_n)(a_m - \hat a_m) k(\mathbf{x}_n,\mathbf{x}_m) + \sum_{n=1}^N (a_n - \hat a_n) t_n$$
subject to
$$0 \le a_n \le \frac{C}{N}, \qquad 0 \le \hat a_n \le \frac{C}{N}, \qquad \sum_{n=1}^N (a_n - \hat a_n) = 0, \qquad \sum_{n=1}^N (a_n + \hat a_n) \le \nu C.$$

Here ν bounds the fraction of points lying outside the ε-tube.
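For reference (my addition), the same formulation in scikit-learn; the sinusoidal toy data are hypothetical:

import numpy as np
from sklearn.svm import NuSVR

# Noisy sinusoidal data; nu controls the fraction of points outside the tube.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 50))[:, None]
t = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.1, 50)

reg = NuSVR(nu=0.2, C=1.0, kernel='rbf', gamma=10.0).fit(X, t)
print(reg.support_.size, "support vectors out of", len(X))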


Example of ν-SVM Regression


Relevance Vector MachinesRelevance Vector Machines

SVM Outputs are decisions rather than posterior probabilities The extension to K>2 classes is problematic There is a complexity parameter C Kernel functions are centered on training data points and

required to be positive definite

RVM Bayesian sparse kernel technique Much sparser models Faster performance on test data


RVM for Regression

The RVM is a linear model of the form studied in Chapter 3, with a modified prior:
$$p(t \mid \mathbf{x},\mathbf{w},\beta) = \mathcal{N}\left(t \mid y(\mathbf{x}), \beta^{-1}\right), \qquad y(\mathbf{x}) = \sum_{i=1}^M w_i \phi_i(\mathbf{x}) = \mathbf{w}^{\mathrm T}\boldsymbol\phi(\mathbf{x}).$$

Following the SVM, the basis functions are kernels centered on the training points:
$$y(\mathbf{x}) = \sum_{n=1}^N w_n k(\mathbf{x},\mathbf{x}_n) + b.$$

Likelihood and prior (one hyperparameter $\alpha_i$ per weight):
$$p(\mathbf{t} \mid \mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^N p(t_n \mid \mathbf{x}_n,\mathbf{w},\beta), \qquad p(\mathbf{w} \mid \boldsymbol\alpha) = \prod_{i=1}^M \mathcal{N}\left(w_i \mid 0, \alpha_i^{-1}\right).$$


RVM for Regression (Cont'd)

The posterior over the weights (from the result (3.49) for linear regression models) is
$$p(\mathbf{w} \mid \mathbf{t},\mathbf{X},\boldsymbol\alpha,\beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{m},\boldsymbol\Sigma), \qquad \mathbf{m} = \beta\boldsymbol\Sigma\boldsymbol\Phi^{\mathrm T}\mathbf{t}, \qquad \boldsymbol\Sigma = \left(\mathbf{A} + \beta\boldsymbol\Phi^{\mathrm T}\boldsymbol\Phi\right)^{-1},$$
where $\boldsymbol\Phi$ is the $N \times M$ design matrix with elements $\Phi_{ni} = \phi_i(\mathbf{x}_n)$ and $\mathbf{A} = \operatorname{diag}(\alpha_i)$.

α and β are determined by the evidence approximation (type-2 maximum likelihood, Section 3.5): maximize the log marginal likelihood
$$\ln p(\mathbf{t} \mid \mathbf{X},\boldsymbol\alpha,\beta) = \ln \mathcal{N}(\mathbf{t} \mid \mathbf{0},\mathbf{C}) = -\frac{1}{2}\left\{ N\ln(2\pi) + \ln|\mathbf{C}| + \mathbf{t}^{\mathrm T}\mathbf{C}^{-1}\mathbf{t} \right\},$$
where $\mathbf{t} = (t_1,\dots,t_N)^{\mathrm T}$ and $\mathbf{C} = \beta^{-1}\mathbf{I} + \boldsymbol\Phi\mathbf{A}^{-1}\boldsymbol\Phi^{\mathrm T}$, obtained from
$$p(\mathbf{t} \mid \mathbf{X},\boldsymbol\alpha,\beta) = \int p(\mathbf{t} \mid \mathbf{X},\mathbf{w},\beta)\, p(\mathbf{w} \mid \boldsymbol\alpha)\, d\mathbf{w}.$$
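A direct numpy transcription (my own sketch) of the posterior and the log marginal likelihood; Phi is the N x M design matrix:

import numpy as np

def rvm_posterior(Phi, t, alpha, beta):
    # Sigma = (A + beta * Phi^T Phi)^{-1},  m = beta * Sigma Phi^T t
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    return m, Sigma

def log_evidence(Phi, t, alpha, beta):
    # ln p(t|X,alpha,beta) = -1/2 { N ln(2 pi) + ln|C| + t^T C^{-1} t }
    # with C = beta^{-1} I + Phi A^{-1} Phi^T
    N = len(t)
    C = np.eye(N) / beta + (Phi / alpha) @ Phi.T
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))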


RVM for Regression (Cont'd)

Two approaches to maximizing the evidence:
- Setting the derivatives of the marginal likelihood to zero and re-estimating
$$\alpha_i^{new} = \frac{\gamma_i}{m_i^2}, \qquad \left(\beta^{new}\right)^{-1} = \frac{\|\mathbf{t} - \boldsymbol\Phi\mathbf{m}\|^2}{N - \sum_i \gamma_i}, \qquad \gamma_i = 1 - \alpha_i \Sigma_{ii}$$
- The EM algorithm (Section 9.3.4)

Predictive distribution (Section 3.3.2):
$$p(t \mid \mathbf{x},\mathbf{X},\mathbf{t},\boldsymbol\alpha^*,\beta^*) = \int p(t \mid \mathbf{x},\mathbf{w},\beta^*)\, p(\mathbf{w} \mid \mathbf{X},\mathbf{t},\boldsymbol\alpha^*,\beta^*)\, d\mathbf{w} = \mathcal{N}\left(t \mid \mathbf{m}^{\mathrm T}\boldsymbol\phi(\mathbf{x}), \sigma^2(\mathbf{x})\right),$$
where
$$\sigma^2(\mathbf{x}) = \left(\beta^*\right)^{-1} + \boldsymbol\phi(\mathbf{x})^{\mathrm T}\boldsymbol\Sigma\boldsymbol\phi(\mathbf{x}).$$
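Iterating these re-estimation equations gives a complete, if naive, RVM trainer; a minimal sketch (my addition; in practice, pruned basis functions, whose alpha_i grow without bound, would be removed for numerical stability):

import numpy as np

def rvm_fit(Phi, t, n_iter=100):
    # Alternate the posterior update with the evidence re-estimation:
    #   gamma_i = 1 - alpha_i * Sigma_ii
    #   alpha_i <- gamma_i / m_i^2
    #   1/beta  <- ||t - Phi m||^2 / (N - sum_i gamma_i)
    N, M = Phi.shape
    alpha, beta = np.ones(M), 1.0
    for _ in range(n_iter):
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
        m = beta * Sigma @ Phi.T @ t
        gamma = 1.0 - alpha * np.diag(Sigma)
        alpha = gamma / (m ** 2 + 1e-12)  # small constant guards m_i -> 0
        beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)
    return m, Sigma, alpha, beta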


Example of RVM Regression

- More compact than the SVM solution
- Parameters are determined automatically
- Requires more training time than the SVM

[Figures: RVM regression vs. ν-SVM regression on the same data set]


Mechanism for Sparsity

Consider the marginal likelihood for a model with a single basis vector $\boldsymbol\varphi$:
$$p(\mathbf{t} \mid \alpha,\beta) = \mathcal{N}(\mathbf{t} \mid \mathbf{0},\mathbf{C}), \qquad \mathbf{t} = (t_1,\dots,t_N)^{\mathrm T}, \qquad \mathbf{C} = \beta^{-1}\mathbf{I} + \alpha^{-1}\boldsymbol\varphi\boldsymbol\varphi^{\mathrm T}.$$

[Figure: the marginal likelihood with only isotropic noise (α = ∞) compared against a finite value of α]


Sparse Solution

Pull out the contribution from $\alpha_i$ in
$$\mathbf{C} = \beta^{-1}\mathbf{I} + \boldsymbol\Phi\mathbf{A}^{-1}\boldsymbol\Phi^{\mathrm T} = \beta^{-1}\mathbf{I} + \sum_{j \ne i} \alpha_j^{-1}\boldsymbol\varphi_j\boldsymbol\varphi_j^{\mathrm T} + \alpha_i^{-1}\boldsymbol\varphi_i\boldsymbol\varphi_i^{\mathrm T} = \mathbf{C}_{-i} + \alpha_i^{-1}\boldsymbol\varphi_i\boldsymbol\varphi_i^{\mathrm T},$$
where $\boldsymbol\varphi_i$ is the i-th column of $\boldsymbol\Phi$.

Using (C.7) and (C.15) in Appendix C,
$$\mathbf{C}^{-1} = \mathbf{C}_{-i}^{-1} - \frac{\mathbf{C}_{-i}^{-1}\boldsymbol\varphi_i\boldsymbol\varphi_i^{\mathrm T}\mathbf{C}_{-i}^{-1}}{\alpha_i + \boldsymbol\varphi_i^{\mathrm T}\mathbf{C}_{-i}^{-1}\boldsymbol\varphi_i}.$$


Sparse Solution (Cont'd)

The log marginal likelihood then splits as $L(\boldsymbol\alpha) = L(\boldsymbol\alpha_{-i}) + \lambda(\alpha_i)$, where
$$\lambda(\alpha_i) = \frac{1}{2}\left[ \ln\alpha_i - \ln(\alpha_i + s_i) + \frac{q_i^2}{\alpha_i + s_i} \right],$$
$$s_i = \boldsymbol\varphi_i^{\mathrm T}\mathbf{C}_{-i}^{-1}\boldsymbol\varphi_i, \qquad q_i = \boldsymbol\varphi_i^{\mathrm T}\mathbf{C}_{-i}^{-1}\mathbf{t}.$$

Sparsity $s_i$: measures the extent to which $\boldsymbol\varphi_i$ overlaps with the other basis vectors.
Quality $q_i$: represents a measure of the alignment of the basis vector $\boldsymbol\varphi_i$ with the error between $\mathbf{t}$ and the prediction $\mathbf{y}_{-i}$ of the model with $\boldsymbol\varphi_i$ removed.

Stationary points of the marginal likelihood with respect to $\alpha_i$: setting
$$\frac{d\lambda(\alpha_i)}{d\alpha_i} = \frac{\alpha_i^{-1} s_i^2 - (q_i^2 - s_i)}{2(\alpha_i + s_i)^2} = 0$$
gives $\alpha_i = \dfrac{s_i^2}{q_i^2 - s_i}$ if $q_i^2 > s_i$, and $\alpha_i = \infty$ otherwise.


Sequential Sparse Bayesian Learning Algorithm

1. Initialize β (if solving a regression problem).
2. Initialize using a single basis function $\boldsymbol\varphi_1$, with $\alpha_1 = s_1^2/(q_1^2 - s_1)$ and the remaining $\alpha_j = \infty$ ($j \ne 1$).
3. Evaluate $\boldsymbol\Sigma$ and $\mathbf{m}$, along with $q_i$ and $s_i$, for all basis functions.
4. Select a candidate basis function $\boldsymbol\varphi_i$.
5. If $q_i^2 > s_i$ and $\alpha_i < \infty$ ($\boldsymbol\varphi_i$ is already in the model), update $\alpha_i = s_i^2/(q_i^2 - s_i)$.
6. If $q_i^2 > s_i$ and $\alpha_i = \infty$, add $\boldsymbol\varphi_i$ to the model, and evaluate $\alpha_i$.
7. If $q_i^2 \le s_i$ and $\alpha_i < \infty$, remove $\boldsymbol\varphi_i$ from the model, and set $\alpha_i = \infty$.
8. Update β (regression case).
9. Go to 3 until converged.
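Steps 5-7 reduce to one decision rule per basis function; a minimal sketch (my illustration), assuming q and s have already been computed in step 3:

import numpy as np

def update_alpha(alpha, q, s, i):
    # Add / re-estimate / prune basis function i from its quality q_i
    # and sparsity s_i.
    if q[i] ** 2 > s[i]:
        alpha[i] = s[i] ** 2 / (q[i] ** 2 - s[i])  # add or re-estimate
    else:
        alpha[i] = np.inf                          # prune from the model
    return alpha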


RVM for Classification

Probabilistic linear classification model (Chapter 4)
$$y(\mathbf{x},\mathbf{w}) = \sigma\left(\mathbf{w}^{\mathrm T}\boldsymbol\phi(\mathbf{x})\right)$$
with the ARD prior
$$p(\mathbf{w} \mid \boldsymbol\alpha) = \prod_{i=1}^N \mathcal{N}\left(w_i \mid 0, \alpha_i^{-1}\right).$$

Training proceeds as follows:
- Initialize α
- Build a Gaussian approximation to the posterior distribution
- Obtain an approximation to the marginal likelihood
- Maximize the marginal likelihood (re-estimate α) until converged


RVM for Classification (Cont'd)

The mode of the posterior distribution is obtained by maximizing
$$\ln p(\mathbf{w} \mid \mathbf{t},\boldsymbol\alpha) = \ln\left\{ p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w} \mid \boldsymbol\alpha) \right\} - \ln p(\mathbf{t} \mid \boldsymbol\alpha) = \sum_{n=1}^N \left\{ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \right\} - \frac{1}{2}\mathbf{w}^{\mathrm T}\mathbf{A}\mathbf{w} + \text{const},$$
where $\mathbf{A} = \operatorname{diag}(\alpha_i)$, using iterative reweighted least squares (IRLS) from Section 4.3.3:
$$\nabla \ln p(\mathbf{w} \mid \mathbf{t},\boldsymbol\alpha) = \boldsymbol\Phi^{\mathrm T}(\mathbf{t} - \mathbf{y}) - \mathbf{A}\mathbf{w},$$
$$\nabla\nabla \ln p(\mathbf{w} \mid \mathbf{t},\boldsymbol\alpha) = -\left(\boldsymbol\Phi^{\mathrm T}\mathbf{B}\boldsymbol\Phi + \mathbf{A}\right),$$
where $\mathbf{B}$ is the $N \times N$ diagonal matrix with elements $b_{nn} = y_n(1 - y_n)$ and $\boldsymbol\Phi$ is the design matrix with $\Phi_{ni} = \phi_i(\mathbf{x}_n)$.

Resulting Gaussian approximation to the posterior distribution:
$$\mathbf{w}^* = \mathbf{A}^{-1}\boldsymbol\Phi^{\mathrm T}(\mathbf{t} - \mathbf{y}), \qquad \boldsymbol\Sigma = \left(\boldsymbol\Phi^{\mathrm T}\mathbf{B}\boldsymbol\Phi + \mathbf{A}\right)^{-1}.$$
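A minimal numpy sketch (my addition) of this IRLS mode-finding, with targets t_n in {0, 1}:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls_mode(Phi, t, alpha, n_iter=25):
    # Newton / IRLS iterations for the posterior mode w*:
    #   grad = Phi^T (t - y) - A w,  Hessian = -(Phi^T B Phi + A)
    A = np.diag(alpha)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        B = np.diag(y * (1.0 - y))
        grad = Phi.T @ (t - y) - A @ w
        H = Phi.T @ B @ Phi + A
        w = w + np.linalg.solve(H, grad)
    # Gaussian approximation: Sigma = (Phi^T B Phi + A)^{-1} at the mode
    y = sigmoid(Phi @ w)
    Sigma = np.linalg.inv(Phi.T @ np.diag(y * (1.0 - y)) @ Phi + A)
    return w, Sigma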


RVM for Classification (Cont'd)

Marginal likelihood using the Laplace approximation (Section 4.4):
$$p(\mathbf{t} \mid \boldsymbol\alpha) = \int p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w} \mid \boldsymbol\alpha)\, d\mathbf{w} \simeq p(\mathbf{t} \mid \mathbf{w}^*)\, p(\mathbf{w}^* \mid \boldsymbol\alpha)\, (2\pi)^{M/2} |\boldsymbol\Sigma|^{1/2}.$$

Setting the derivative of the marginal likelihood with respect to $\alpha_i$ equal to zero and rearranging gives
$$\alpha_i^{new} = \frac{\gamma_i}{(w_i^*)^2}, \qquad \gamma_i = 1 - \alpha_i \Sigma_{ii},$$
the same form as in the regression case.

If we define $\hat{\mathbf{t}} = \boldsymbol\Phi\mathbf{w}^* + \mathbf{B}^{-1}(\mathbf{t} - \mathbf{y})$, then
$$\ln p(\mathbf{t} \mid \boldsymbol\alpha) = -\frac{1}{2}\left\{ N\ln(2\pi) + \ln|\mathbf{C}| + \hat{\mathbf{t}}^{\mathrm T}\mathbf{C}^{-1}\hat{\mathbf{t}} \right\}, \qquad \mathbf{C} = \mathbf{B}^{-1} + \boldsymbol\Phi\mathbf{A}^{-1}\boldsymbol\Phi^{\mathrm T},$$
which has the same form as in the regression case, so the sparsity analysis and the sequential learning algorithm carry over.


Example of RVM Classification