Ch 7. Sparse Kernel Machines
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized by S. Kim
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
(C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Contents

Maximum Margin Classifiers
  - Overlapping Class Distributions
  - Relation to Logistic Regression
  - Multiclass SVMs
  - SVMs for Regression

Relevance Vector Machines
  - RVM for Regression
  - Analysis of Sparsity
  - RVMs for Classification
Maximum Margin Classifiers

Problem setting
  - Two-class classification using linear models of the form

        y(x) = w^T \phi(x) + b

  - Assume that the training data set is linearly separable in feature space.

Support vector machine approach
  - The decision boundary is chosen to be the one for which the margin (the perpendicular distance to the closest data points, the support vectors) is maximized.
Maximum Margin Solution

For all data points (linearly separable case),

    t_n y(x_n) > 0

The distance of a point x_n to the decision surface is

    t_n y(x_n) / \|w\| = t_n ( w^T \phi(x_n) + b ) / \|w\|

The maximum margin solution is found from

    \arg\max_{w,b} \{ (1/\|w\|) \min_n [ t_n ( w^T \phi(x_n) + b ) ] \}

which is equivalent to the quadratic program

    \arg\min_{w,b} (1/2) \|w\|^2   subject to   t_n ( w^T \phi(x_n) + b ) \ge 1,  n = 1, ..., N
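The distance formula above can be checked numerically. The sketch below assumes an identity feature map \phi(x) = x and a hand-picked w and b (both hypothetical); it computes t_n y(x_n)/\|w\| for two points, and the margin of the data set is the smallest such distance.

```python
import numpy as np

# Hypothetical 2D example with identity feature map phi(x) = x.
w = np.array([1.0, 1.0])   # assumed weight vector
b = -1.0                   # assumed bias

def y(x):
    # Linear model y(x) = w^T phi(x) + b.
    return w @ x + b

def signed_margin(x, t):
    # Distance of a correctly classified point to the decision surface:
    # t * y(x) / ||w||.
    return t * y(x) / np.linalg.norm(w)

points = [(np.array([2.0, 2.0]), +1), (np.array([0.0, 0.0]), -1)]
margins = [signed_margin(x, t) for x, t in points]
margin = min(margins)      # the margin is set by the closest point
```

Maximizing this quantity over w and b gives the maximum margin solution above.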
Dual Representation

Introducing Lagrange multipliers a_n \ge 0,

    L(w, b, a) = (1/2)\|w\|^2 - \sum_{n=1}^N a_n \{ t_n ( w^T \phi(x_n) + b ) - 1 \}

(see Appendix E for more details). At a minimum, the derivatives of L with respect to w and b equal zero:

    w = \sum_{n=1}^N a_n t_n \phi(x_n),        \sum_{n=1}^N a_n t_n = 0

Eliminating w and b gives the dual representation

    \tilde{L}(a) = \sum_{n=1}^N a_n - (1/2) \sum_{n=1}^N \sum_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

    subject to   a_n \ge 0, n = 1, ..., N,   and   \sum_{n=1}^N a_n t_n = 0
Classifying New Data

New data are classified using the sign of

    y(x) = w^T \phi(x) + b = \sum_{n=1}^N a_n t_n k(x, x_n) + b

The optimization subject to the constraints above is a quadratic programming problem, with O(N^3) cost in general.

Karush-Kuhn-Tucker (KKT) conditions (Appendix E):

    a_n \ge 0,      t_n y(x_n) - 1 \ge 0,      a_n \{ t_n y(x_n) - 1 \} = 0

Hence for every point either a_n = 0 or t_n y(x_n) = 1; the points with a_n > 0 are the support vectors. The bias is obtained from the support vector set S:

    b = (1/N_S) \sum_{n \in S} ( t_n - \sum_{m \in S} a_m t_m k(x_n, x_m) )
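As a rough illustration of the dual problem, the sketch below solves the hard-margin dual for a tiny separable data set by projected gradient ascent. To keep the projection trivial, the bias is absorbed into the kernel via k(x, x') = x^T x' + 1, which removes the equality constraint \sum_n a_n t_n = 0; this is a common simplification that differs slightly from the formulation above, where b is unregularized. The data set and learning rate are hypothetical.

```python
import numpy as np

# Tiny linearly separable toy set (hypothetical).
X = np.array([[2.0, 0.0], [3.0, 2.0], [-2.0, 0.0], [-3.0, -2.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])

K = X @ X.T + 1.0                     # linear kernel with the bias absorbed
Q = (t[:, None] * t[None, :]) * K     # Hessian of the dual objective

# Maximize L~(a) = sum_n a_n - (1/2) a^T Q a  subject to  a_n >= 0
a = np.zeros(len(t))
lr = 0.01
for _ in range(20000):
    a = np.maximum(a + lr * (1.0 - Q @ a), 0.0)   # step, then project onto a >= 0

w = (a * t) @ X                       # w = sum_n a_n t_n x_n
b = a @ t                             # bias recovered from the constant feature
support = np.flatnonzero(a > 1e-4)    # support vectors: a_n > 0

def predict(x):
    return np.sign(x @ w + b)
```

Only the two points closest to the boundary end up with a_n > 0, matching the KKT conditions: all other multipliers are driven to exactly zero.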
Example of Separable Data Classification

Figure 7.2
Overlapping Class Distributions

Allow some examples to be misclassified (a soft margin) by introducing slack variables

    \xi_n \ge 0,  n = 1, ..., N

with \xi_n = 0 for points on or inside the correct margin boundary and \xi_n = | t_n - y(x_n) | otherwise, so that the margin constraints become

    t_n y(x_n) = t_n ( w^T \phi(x_n) + b ) \ge 1 - \xi_n
Soft Margin Solution

Minimize

    C \sum_{n=1}^N \xi_n + (1/2)\|w\|^2

where C > 0 controls the trade-off between minimizing training errors and controlling model complexity. The Lagrangian is

    L(w, b, a) = (1/2)\|w\|^2 + C \sum_{n=1}^N \xi_n - \sum_{n=1}^N a_n \{ t_n y(x_n) - 1 + \xi_n \} - \sum_{n=1}^N \mu_n \xi_n

KKT conditions:

    a_n \ge 0,      t_n y(x_n) - 1 + \xi_n \ge 0,      a_n ( t_n y(x_n) - 1 + \xi_n ) = 0
    \mu_n \ge 0,    \xi_n \ge 0,                        \mu_n \xi_n = 0

Either a_n = 0 or t_n y(x_n) = 1 - \xi_n; the points with a_n > 0 are the support vectors.
Dual Representation

The dual representation is the same as in the separable case,

    \tilde{L}(a) = \sum_{n=1}^N a_n - (1/2) \sum_{n=1}^N \sum_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

but with box constraints

    0 \le a_n \le C,  n = 1, ..., N,   and   \sum_{n=1}^N a_n t_n = 0

Classifying new data and obtaining b proceed as for hard margin classifiers.
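The only change from the hard-margin sketch is the ceiling C on each multiplier, so the same projected-gradient approach applies with a box projection (bias again absorbed into the kernel, a hypothetical simplification; data, C, and learning rate are illustrative). A mislabeled outlier violates the margin, so by the KKT conditions its multiplier is pushed to the ceiling a_n = C.

```python
import numpy as np

# Four separable points plus one mislabeled outlier (index 4).
X = np.array([[2.0, 0.0], [3.0, 2.0], [-2.0, 0.0], [-3.0, -2.0], [-1.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
C = 0.5

K = X @ X.T + 1.0                     # bias absorbed into the kernel
Q = (t[:, None] * t[None, :]) * K

# Maximize the dual subject to the box constraints 0 <= a_n <= C
a = np.zeros(len(t))
lr = 0.005
for _ in range(50000):
    a = np.clip(a + lr * (1.0 - Q @ a), 0.0, C)

w = (a * t) @ X
b = a @ t
```

The outlier's slack \xi_n is positive, so \mu_n = 0 and its multiplier sits exactly at C, while the well-separated bulk points keep the boundary where the hard-margin problem put it.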
Alternative Formulation

ν-SVM (Schölkopf et al., 2000): maximize

    \tilde{L}(a) = -(1/2) \sum_{n=1}^N \sum_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

    subject to   0 \le a_n \le 1/N,   \sum_{n=1}^N a_n t_n = 0,   \sum_{n=1}^N a_n \ge \nu

The parameter ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.
Example of Nonseparable Data Classification (ν-SVM)
Solutions of the QP Problem

Chunking (Vapnik, 1982)
  - Idea: the value of the Lagrangian is unchanged if we remove the rows and columns of the kernel matrix corresponding to Lagrange multipliers that have value zero.

Protected conjugate gradients (Burges, 1998)

Decomposition methods (Osuna et al., 1996)

Sequential minimal optimization (SMO; Platt, 1999)
Relation to Logistic Regression (Section 4.3.2)

For data points on the correct side of the margin boundary, \xi_n = 0; for the remaining points, \xi_n = 1 - y_n t_n. The objective

    C \sum_{n=1}^N \xi_n + (1/2)\|w\|^2

can therefore be written as

    \sum_{n=1}^N E_{SV}( y_n t_n ) + \lambda \|w\|^2,   where   \lambda = (2C)^{-1}

and E_{SV}( y t ) = [ 1 - y t ]_+ is the hinge error function ( [\cdot]_+ denotes the positive part ).
Relation to Logistic Regression (Cont'd)

For maximum likelihood logistic regression with targets t \in \{-1, 1\},

    p( t = 1 | y ) = \sigma(y)
    p( t = -1 | y ) = 1 - \sigma(y) = \sigma(-y)
    p( t | y ) = \sigma( y t )

The corresponding error function with a quadratic regularizer is

    \sum_{n=1}^N E_{LR}( y_n t_n ) + \lambda \|w\|^2,   where   E_{LR}( y t ) = \ln( 1 + \exp(-y t) )
Comparison of Error Functions

The plot compares, as functions of y t: the hinge error function, the (rescaled) error function for logistic regression, the misclassification error, and the squared error.
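The four error functions being compared can be written down directly. In the sketch below, the logistic error is rescaled by 1/ln 2 so that it passes through the point (0, 1), as in the text's comparison.

```python
import numpy as np

def hinge(z):
    # E_SV(z) = [1 - z]_+ , with z = y t
    return np.maximum(0.0, 1.0 - z)

def logistic(z):
    # E_LR(z) = ln(1 + exp(-z)), rescaled by 1/ln 2 to pass through (0, 1)
    return np.log1p(np.exp(-z)) / np.log(2.0)

def misclassification(z):
    # 0-1 loss: error iff the point is on the wrong side (y t <= 0)
    return np.where(z <= 0, 1.0, 0.0)

def squared(z):
    # Squared error (z - 1)^2
    return (z - 1.0) ** 2
```

Hinge and rescaled logistic are both continuous surrogates for the misclassification error; the squared error, in contrast, penalizes points that are too far on the *correct* side.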
Multiclass SVMs

One-versus-the-rest: K separate SVMs
  - Can lead to inconsistent results (Figure 4.2)
  - Imbalanced training sets
  - Remedy: target values +1 for the positive class and -1/(K-1) for the negative class (Lee et al., 2001)
  - An objective function for training all SVMs simultaneously (Weston and Watkins, 1999)

One-versus-one: K(K-1)/2 SVMs

Error-correcting output codes (Allwein et al., 2000)
  - A generalization of the voting scheme of the one-versus-one approach
SVMs for Regression

Simple linear regression minimizes the regularized sum of squared errors

    (1/2) \sum_{n=1}^N ( y_n - t_n )^2 + (\lambda/2) \|w\|^2

To obtain sparse solutions, the quadratic error function is replaced by the ε-insensitive error function

    E_\epsilon( y(x) - t ) = 0                        if | y(x) - t | < \epsilon
    E_\epsilon( y(x) - t ) = | y(x) - t | - \epsilon  otherwise
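The two error functions are easy to state side by side; the sketch below uses an arbitrary illustrative tube width of 0.2.

```python
def eps_insensitive(r, eps=0.2):
    """E_eps(r) = 0 if |r| < eps, else |r| - eps, with r = y(x) - t.

    eps = 0.2 is an arbitrary illustrative tube width."""
    return max(0.0, abs(r) - eps)

def quadratic(r):
    # The quadratic error (1/2) r^2 that it replaces, for comparison.
    return 0.5 * r * r
```

Errors smaller than ε cost nothing at all, which is what makes the solution sparse: points inside the tube exert no force on the fit.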
SVMs for Regression (Cont'd)

Minimize

    C \sum_{n=1}^N E_\epsilon( y(x_n) - t_n ) + (1/2)\|w\|^2

Introducing slack variables \xi_n \ge 0 and \hat{\xi}_n \ge 0, this is equivalent to minimizing

    C \sum_{n=1}^N ( \xi_n + \hat{\xi}_n ) + (1/2)\|w\|^2

    where   t_n \le y(x_n) + \epsilon + \xi_n   and   t_n \ge y(x_n) - \epsilon - \hat{\xi}_n
Dual Problem

The Lagrangian is

    L = C \sum_{n=1}^N ( \xi_n + \hat{\xi}_n ) + (1/2)\|w\|^2 - \sum_{n=1}^N ( \mu_n \xi_n + \hat{\mu}_n \hat{\xi}_n )
        - \sum_{n=1}^N a_n ( \epsilon + \xi_n + y_n - t_n ) - \sum_{n=1}^N \hat{a}_n ( \epsilon + \hat{\xi}_n - y_n + t_n )

Eliminating the primal variables gives the dual

    \tilde{L}(a, \hat{a}) = -(1/2) \sum_{n=1}^N \sum_{m=1}^N ( a_n - \hat{a}_n )( a_m - \hat{a}_m ) k(x_n, x_m)
                            - \epsilon \sum_{n=1}^N ( a_n + \hat{a}_n ) + \sum_{n=1}^N ( a_n - \hat{a}_n ) t_n

    subject to   0 \le a_n \le C,   0 \le \hat{a}_n \le C,   \sum_{n=1}^N ( a_n - \hat{a}_n ) = 0
Predictions

From the derivatives of the Lagrangian,

    w = \sum_{n=1}^N ( a_n - \hat{a}_n ) \phi(x_n)

so predictions are made using

    y(x) = \sum_{n=1}^N ( a_n - \hat{a}_n ) k(x, x_n) + b

KKT conditions:

    a_n ( \epsilon + \xi_n + y_n - t_n ) = 0
    \hat{a}_n ( \epsilon + \hat{\xi}_n - y_n + t_n ) = 0
    ( C - a_n ) \xi_n = 0
    ( C - \hat{a}_n ) \hat{\xi}_n = 0

The bias is obtained from a point with 0 < a_n < C (so that \xi_n = 0 and \epsilon + y_n - t_n = 0):

    b = t_n - \epsilon - w^T \phi(x_n) = t_n - \epsilon - \sum_{m=1}^N ( a_m - \hat{a}_m ) k(x_n, x_m)
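The regression dual can also be sketched with projected gradient ascent on a toy 1-D problem. As in the classification sketches, the bias is absorbed into the kernel (k(x, x') = x x' + 1), removing the constraint \sum_n (a_n - \hat{a}_n) = 0; this is a hypothetical simplification, and the data, ε, C, and learning rate are illustrative.

```python
import numpy as np

# Toy 1-D data lying on the line t = x (hypothetical).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
t = x.copy()
eps, C = 0.1, 1.0

K = np.outer(x, x) + 1.0            # linear kernel with the bias absorbed

a = np.zeros_like(x)                # multipliers a_n   (active above the tube)
ah = np.zeros_like(x)               # multipliers a^_n  (active below the tube)
lr = 0.01
for _ in range(50000):
    g = K @ (a - ah)                # current predictions y(x_n)
    a = np.clip(a + lr * (t - eps - g), 0.0, C)
    ah = np.clip(ah + lr * (-t - eps + g), 0.0, C)

d = a - ah
w = d @ x                           # slope of y(x) = w x + b
b = d.sum()                         # bias from the constant kernel feature
residuals = w * x + b - t
```

Only the endpoints, which touch the tube boundary, keep nonzero multipliers; points strictly inside the ε-tube contribute nothing, exactly as the KKT conditions predict.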
Alternative Formulation

ν-SVM (Schölkopf et al., 2000): maximize

    \tilde{L}(a, \hat{a}) = -(1/2) \sum_{n=1}^N \sum_{m=1}^N ( a_n - \hat{a}_n )( a_m - \hat{a}_m ) k(x_n, x_m) + \sum_{n=1}^N ( a_n - \hat{a}_n ) t_n

    subject to   0 \le a_n \le C/N,   0 \le \hat{a}_n \le C/N,
                 \sum_{n=1}^N ( a_n - \hat{a}_n ) = 0,   \sum_{n=1}^N ( a_n + \hat{a}_n ) \le \nu C

Here ν bounds the fraction of points lying outside the tube.
Example of ν-SVM Regression
Relevance Vector Machines

Limitations of the SVM:
  - Outputs are decisions rather than posterior probabilities
  - The extension to K > 2 classes is problematic
  - There is a complexity parameter C to be tuned
  - Kernel functions are centered on training data points and required to be positive definite

The RVM is a Bayesian sparse kernel technique that yields much sparser models, and hence faster performance on test data.
RVM for Regression

The RVM is a linear model of the form considered in Chapter 3 with a modified prior:

    p( t | x, w, \beta ) = N( t | y(x), \beta^{-1} ),   where   y(x) = \sum_{i=1}^M w_i \phi_i(x) = w^T \phi(x)

For an SVM-like kernel model,

    y(x) = \sum_{n=1}^N w_n k(x, x_n) + b

The likelihood and the prior (with a separate hyperparameter \alpha_i for each weight) are

    p( t | X, w, \beta ) = \prod_{n=1}^N p( t_n | x_n, w, \beta )

    p( w | \alpha ) = \prod_{i=1}^M N( w_i | 0, \alpha_i^{-1} )
RVM for Regression (Cont'd)

α and β are determined using the evidence approximation (type-2 maximum likelihood) (Section 3.5). The posterior over the weights is

    p( w | t, X, \alpha, \beta ) = N( w | m, \Sigma )

    where   m = \beta \Sigma \Phi^T t,   \Sigma = ( A + \beta \Phi^T \Phi )^{-1}

with A = diag(\alpha_i) and \Phi the N x M design matrix with elements \Phi_{ni} = \phi_i(x_n).

From the result (3.49) for linear regression models, the marginal likelihood to maximize is

    p( t | X, \alpha, \beta ) = \int p( t | X, w, \beta ) p( w | \alpha ) dw = N( t | 0, C )

    \ln p( t | X, \alpha, \beta ) = -(1/2) \{ N \ln(2\pi) + \ln|C| + t^T C^{-1} t \}

    where   C = \beta^{-1} I + \Phi A^{-1} \Phi^T
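The two standard ways of writing this evidence can be checked against each other numerically: directly through C, and through the posterior quantities m and Σ (using the matrix determinant lemma and the Woodbury identity). All numbers below are arbitrary test values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 3
Phi = rng.normal(size=(N, M))          # design matrix (arbitrary features)
t = rng.normal(size=N)
alpha = np.array([0.5, 2.0, 10.0])     # assumed hyperparameters
beta = 4.0

A = np.diag(alpha)
# Posterior over w:  Sigma = (A + beta Phi^T Phi)^{-1},  m = beta Sigma Phi^T t
Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)
m = beta * Sigma @ Phi.T @ t

# Marginal likelihood directly:  t ~ N(0, C),  C = beta^{-1} I + Phi A^{-1} Phi^T
C = np.eye(N) / beta + Phi @ np.linalg.inv(A) @ Phi.T
_, logdetC = np.linalg.slogdet(C)
log_ev_direct = -0.5 * (N * np.log(2 * np.pi) + logdetC + t @ np.linalg.solve(C, t))

# Same quantity via the posterior (determinant lemma + Woodbury identity)
_, logdetS = np.linalg.slogdet(Sigma)
log_ev_post = 0.5 * (logdetS + np.sum(np.log(alpha)) + N * np.log(beta)
                     - beta * np.sum((t - Phi @ m) ** 2) - m @ A @ m
                     - N * np.log(2 * np.pi))
```

The second form is the one used in practice, since it avoids building and inverting the N x N matrix C explicitly.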
RVM for Regression (Cont'd)

Two approaches to maximizing the marginal likelihood:
  - Setting its derivatives to zero gives the re-estimation equations

        \alpha_i^{new} = \gamma_i / m_i^2,   where   \gamma_i = 1 - \alpha_i \Sigma_{ii}

        ( \beta^{new} )^{-1} = \| t - \Phi m \|^2 / ( N - \sum_i \gamma_i )

  - EM algorithm (Section 9.3.4)

Predictive distribution (cf. Section 3.3.2):

    p( t^* | x^*, X, t, \alpha^*, \beta^* ) = \int p( t^* | x^*, w, \beta^* ) p( w | X, t, \alpha^*, \beta^* ) dw
                                            = N( t^* | m^T \phi(x^*), \sigma^2(x^*) )

    where   \sigma^2(x^*) = ( \beta^* )^{-1} + \phi(x^*)^T \Sigma \phi(x^*)
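The re-estimation loop above can be sketched end to end on synthetic data. The data set, Gaussian kernel width, initial values, and numerical caps below are all illustrative assumptions; the point is that most \alpha_i are driven to very large values, pruning the corresponding basis functions and leaving a sparse set of relevance vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 30
x = np.linspace(-1.0, 1.0, N)
t = np.sin(3.0 * x) + 0.05 * rng.normal(size=N)   # synthetic targets

# Kernel design matrix with a bias column: y(x) = sum_n w_n k(x, x_n) + b
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1)   # Gaussian kernel, arbitrary width
Phi = np.hstack([np.ones((N, 1)), K])
M = Phi.shape[1]

alpha = np.ones(M)
beta = 100.0
for _ in range(200):
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    gamma = 1.0 - alpha * np.diag(Sigma)
    # alpha_i <- gamma_i / m_i^2, capped for numerical safety
    alpha = np.clip(gamma / np.maximum(m ** 2, 1e-12), 1e-6, 1e8)
    # (beta^new)^{-1} = ||t - Phi m||^2 / (N - sum_i gamma_i)
    beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)

relevance = alpha < 1e6   # basis functions that survive pruning
```

After convergence, weights with huge \alpha_i have posteriors sharply peaked at zero and can be deleted from the model entirely.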
Example of RVM Regression

The RVM gives a more compact model than the SVM, and its parameters are determined automatically, but it requires more training time than the SVM.

RVM regression vs. ν-SVM regression
Mechanism for Sparsity

Consider a model with a single basis function \varphi. The marginal likelihood is

    p( t | \alpha, \beta ) = N( t | 0, C ),   where   C = \beta^{-1} I + \alpha^{-1} \varphi \varphi^T

If the basis function is poorly aligned with the data, the evidence is maximized by only isotropic noise, \alpha = \infty; otherwise a finite value of \alpha is preferred.
Sparse Solution

Pull out the contribution from \alpha_i in

    C = \beta^{-1} I + \Phi A^{-1} \Phi^T
      = \beta^{-1} I + \sum_{j \ne i} \alpha_j^{-1} \varphi_j \varphi_j^T + \alpha_i^{-1} \varphi_i \varphi_i^T
      = C_{-i} + \alpha_i^{-1} \varphi_i \varphi_i^T

where \varphi_i is the i-th column of \Phi. Using (C.7) and (C.15) in Appendix C,

    C^{-1} = C_{-i}^{-1} - \frac{ C_{-i}^{-1} \varphi_i \varphi_i^T C_{-i}^{-1} }{ \alpha_i + \varphi_i^T C_{-i}^{-1} \varphi_i }
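This rank-one (Sherman-Morrison) identity can be verified numerically, along with the sparsity and quality quantities it is used to define on the next slide. All dimensions and values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 6, 4
beta = 2.0
alpha = rng.uniform(0.5, 3.0, size=M)
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

i = 1
phi_i = Phi[:, i]
# C with the contribution of basis function i removed
C_minus = np.eye(N) / beta + sum(np.outer(Phi[:, j], Phi[:, j]) / alpha[j]
                                 for j in range(M) if j != i)
C = C_minus + np.outer(phi_i, phi_i) / alpha[i]

Cm_inv = np.linalg.inv(C_minus)
# Sherman-Morrison / Woodbury form of C^{-1}
C_inv = Cm_inv - (Cm_inv @ np.outer(phi_i, phi_i) @ Cm_inv) \
        / (alpha[i] + phi_i @ Cm_inv @ phi_i)

# Sparsity and quality of basis function i (used on the next slide)
s_i = phi_i @ Cm_inv @ phi_i
q_i = phi_i @ Cm_inv @ t
```

Because only C_{-i} needs inverting, the contribution of each basis function can be analysed (and added or removed) cheaply.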
Sparse Solution (Cont'd)

The log marginal likelihood function then separates as

    L(\alpha) = L(\alpha_{-i}) + \lambda(\alpha_i)

    where   \lambda(\alpha_i) = (1/2) [ \ln \alpha_i - \ln( \alpha_i + s_i ) + q_i^2 / ( \alpha_i + s_i ) ]

    s_i = \varphi_i^T C_{-i}^{-1} \varphi_i,        q_i = \varphi_i^T C_{-i}^{-1} t

Sparsity s_i measures the extent to which \varphi_i overlaps with the other basis vectors; quality q_i represents a measure of the alignment of the basis vector \varphi_i with the error between t and the prediction y_{-i} of the model with \varphi_i removed.

Stationary points of the marginal likelihood with respect to \alpha_i:

    d\lambda / d\alpha_i = \frac{ \alpha_i^{-1} s_i^2 - ( q_i^2 - s_i ) }{ 2 ( \alpha_i + s_i )^2 } = 0

so if q_i^2 > s_i, then \alpha_i = s_i^2 / ( q_i^2 - s_i ); otherwise \alpha_i = \infty.
Sequential Sparse Bayesian Learning Algorithm

1. Initialize \beta.
2. Initialize the model with a single basis function \varphi_1, with \alpha_1 = s_1^2 / ( q_1^2 - s_1 ) and the remaining \alpha_j = \infty (j \ne 1).
3. Evaluate \Sigma and m, together with q_i and s_i, for all basis functions.
4. Select a candidate basis function \varphi_i.
5. If q_i^2 > s_i and \alpha_i < \infty (\varphi_i is already in the model), update \alpha_i.
6. If q_i^2 > s_i and \alpha_i = \infty, add \varphi_i to the model, and evaluate \alpha_i.
7. If q_i^2 \le s_i and \alpha_i < \infty, remove \varphi_i from the model, and set \alpha_i = \infty.
8. Update \beta (in the regression case).
9. Go to 3 until converged.
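Steps 5-7 above reduce to a small decision rule on \theta_i = q_i^2 - s_i. A sketch (the function name and return convention are hypothetical, not from the text):

```python
import math

def update_decision(q_i, s_i, alpha_i):
    """Steps 5-7 of the algorithm for one candidate basis function.

    Returns (action, new_alpha). theta_i = q_i^2 - s_i > 0 means the
    basis function should be (or stay) in the model."""
    theta = q_i ** 2 - s_i
    in_model = math.isfinite(alpha_i)
    if theta > 0:
        new_alpha = s_i ** 2 / theta      # alpha_i = s_i^2 / (q_i^2 - s_i)
        return ("update" if in_model else "add"), new_alpha
    return ("delete" if in_model else "skip"), math.inf
```

Because each decision needs only the scalars q_i and s_i, candidate basis functions can be scanned very cheaply between refits of \Sigma and m.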
RVM for Classification

Probabilistic linear classification model (Chapter 4)

    y(x, w) = \sigma( w^T \phi(x) )

with the ARD prior

    p( w | \alpha ) = \prod_{i=1}^N N( w_i | 0, \alpha_i^{-1} )

Training loop:
  - Initialize \alpha
  - Build a Gaussian approximation to the posterior distribution
  - Obtain an approximation to the marginal likelihood
  - Maximize the marginal likelihood (re-estimate \alpha); repeat until converged
RVM for Classification (Cont'd)

The mode of the posterior distribution is obtained by maximizing

    \ln p( w | t, \alpha ) = \ln \{ p( t | w ) p( w | \alpha ) \} - \ln p( t | \alpha )
                           = \sum_{n=1}^N \{ t_n \ln y_n + ( 1 - t_n ) \ln( 1 - y_n ) \} - (1/2) w^T A w + const

    where   A = diag(\alpha_i)

using iterative reweighted least squares (IRLS) from Section 4.3.3:

    \nabla \ln p( w | t, \alpha ) = \Phi^T ( t - y ) - A w
    \nabla\nabla \ln p( w | t, \alpha ) = -( \Phi^T B \Phi + A )

    where B is the N x N diagonal matrix with elements b_n = y_n ( 1 - y_n ), and \Phi is the design matrix with elements \Phi_{ni} = \phi_i(x_n).

The resulting Gaussian approximation to the posterior distribution has

    w^* = A^{-1} \Phi^T ( t - y ),        \Sigma = ( \Phi^T B \Phi + A )^{-1}
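The IRLS mode-finding step can be sketched for fixed hyperparameters. The 1-D data set, design matrix, and \alpha values below are illustrative; the Newton iteration uses exactly the gradient and Hessian above, and at the mode the condition w^* = A^{-1} \Phi^T ( t - y ) holds.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic 1-D two-class data (hypothetical).
rng = np.random.default_rng(3)
N = 40
x = np.concatenate([rng.normal(-1.5, 0.5, N // 2), rng.normal(1.5, 0.5, N // 2)])
t = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])

Phi = np.column_stack([np.ones(N), x])   # design matrix: bias + raw input
alpha = np.array([1.0, 1.0])             # fixed hyperparameters for illustration
A = np.diag(alpha)

w = np.zeros(2)
for _ in range(50):
    y = sigmoid(Phi @ w)
    grad = Phi.T @ (t - y) - A @ w                 # gradient of ln p(w | t, alpha)
    H = Phi.T @ np.diag(y * (1.0 - y)) @ Phi + A   # negative Hessian
    w = w + np.linalg.solve(H, grad)               # Newton / IRLS step

y = sigmoid(Phi @ w)
Sigma = np.linalg.inv(Phi.T @ np.diag(y * (1.0 - y)) @ Phi + A)  # posterior covariance
```

In the full RVM, this inner loop alternates with re-estimation of \alpha from the Laplace approximation described on the next slide.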
RVM for Classification (Cont'd)

Marginal likelihood using the Laplace approximation (Section 4.4):

    p( t | \alpha ) = \int p( t | w ) p( w | \alpha ) dw \simeq p( t | w^* ) p( w^* | \alpha ) (2\pi)^{M/2} |\Sigma|^{1/2}

Setting the derivative of the marginal likelihood with respect to \alpha_i equal to zero and rearranging gives

    \alpha_i^{new} = \gamma_i / ( w_i^* )^2,   where   \gamma_i = 1 - \alpha_i \Sigma_{ii}

the same form as in the regression case. If we define \hat{t} = \Phi w^* + B^{-1} ( t - y ), then

    \ln p( t | \alpha ) = -(1/2) \{ N \ln(2\pi) + \ln|C| + \hat{t}^T C^{-1} \hat{t} \}

    where   C = B^{-1} + \Phi A^{-1} \Phi^T
Example of RVM Classification