Ch 7. Sparse Kernel Machines
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized by S. Kim
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
(C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Contents

Maximum Margin Classifiers
  - Overlapping Class Distributions
  - Relation to Logistic Regression
  - Multiclass SVMs
  - SVMs for Regression

Relevance Vector Machines
  - RVM for Regression
  - Analysis of Sparsity
  - RVMs for Classification
Maximum Margin Classifiers

Problem setting
  - Two-class classification using linear models of the form

        y(x) = w^T \phi(x) + b

  - Assume that the training data set is linearly separable in feature space.

Support vector machine approach
  - The decision boundary is chosen to be the one for which the margin (the perpendicular distance to the closest data points, the support vectors) is maximized.
Maximum Margin Solution

For all data points (linearly separable case),

    t_n y(x_n) > 0

The distance of a point x_n to the decision surface is

    t_n y(x_n) / \|w\| = t_n ( w^T \phi(x_n) + b ) / \|w\|

The maximum margin solution is found from

    \arg\max_{w,b} \{ (1/\|w\|) \min_n [ t_n ( w^T \phi(x_n) + b ) ] \}

which is equivalent to the quadratic program

    \arg\min_{w,b} (1/2) \|w\|^2   subject to   t_n ( w^T \phi(x_n) + b ) \ge 1,  n = 1, ..., N
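The distance formula above can be checked numerically. The sketch below assumes an identity feature map \phi(x) = x and a hand-picked w and b (both hypothetical); it computes t_n y(x_n)/\|w\| for two points, and the margin of the data set is the smallest such distance.

```python
import numpy as np

# Hypothetical 2D example with identity feature map phi(x) = x.
w = np.array([1.0, 1.0])   # assumed weight vector
b = -1.0                   # assumed bias

def y(x):
    # Linear model y(x) = w^T phi(x) + b.
    return w @ x + b

def signed_margin(x, t):
    # Distance of a correctly classified point to the decision surface:
    # t * y(x) / ||w||.
    return t * y(x) / np.linalg.norm(w)

points = [(np.array([2.0, 2.0]), +1), (np.array([0.0, 0.0]), -1)]
margins = [signed_margin(x, t) for x, t in points]
margin = min(margins)      # the margin is set by the closest point
```

Maximizing this quantity over w and b gives the maximum margin solution above.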
Dual Representation

Introducing Lagrange multipliers a_n \ge 0,

    L(w, b, a) = (1/2)\|w\|^2 - \sum_{n=1}^N a_n \{ t_n ( w^T \phi(x_n) + b ) - 1 \}

(see Appendix E for more details). At a minimum, the derivatives of L with respect to w and b equal zero:

    w = \sum_{n=1}^N a_n t_n \phi(x_n),        \sum_{n=1}^N a_n t_n = 0

Eliminating w and b gives the dual representation

    \tilde{L}(a) = \sum_{n=1}^N a_n - (1/2) \sum_{n=1}^N \sum_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

    subject to   a_n \ge 0, n = 1, ..., N,   and   \sum_{n=1}^N a_n t_n = 0
Classifying New Data

New data are classified using the sign of

    y(x) = w^T \phi(x) + b = \sum_{n=1}^N a_n t_n k(x, x_n) + b

The optimization subject to the constraints above is a quadratic programming problem, with O(N^3) cost in general.

Karush-Kuhn-Tucker (KKT) conditions (Appendix E):

    a_n \ge 0,      t_n y(x_n) - 1 \ge 0,      a_n \{ t_n y(x_n) - 1 \} = 0

Hence for every point either a_n = 0 or t_n y(x_n) = 1; the points with a_n > 0 are the support vectors. The bias is obtained from the support vector set S:

    b = (1/N_S) \sum_{n \in S} ( t_n - \sum_{m \in S} a_m t_m k(x_n, x_m) )
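As a rough illustration of the dual problem, the sketch below solves the hard-margin dual for a tiny separable data set by projected gradient ascent. To keep the projection trivial, the bias is absorbed into the kernel via k(x, x') = x^T x' + 1, which removes the equality constraint \sum_n a_n t_n = 0; this is a common simplification that differs slightly from the formulation above, where b is unregularized. The data set and learning rate are hypothetical.

```python
import numpy as np

# Tiny linearly separable toy set (hypothetical).
X = np.array([[2.0, 0.0], [3.0, 2.0], [-2.0, 0.0], [-3.0, -2.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])

K = X @ X.T + 1.0                     # linear kernel with the bias absorbed
Q = (t[:, None] * t[None, :]) * K     # Hessian of the dual objective

# Maximize L~(a) = sum_n a_n - (1/2) a^T Q a  subject to  a_n >= 0
a = np.zeros(len(t))
lr = 0.01
for _ in range(20000):
    a = np.maximum(a + lr * (1.0 - Q @ a), 0.0)   # step, then project onto a >= 0

w = (a * t) @ X                       # w = sum_n a_n t_n x_n
b = a @ t                             # bias recovered from the constant feature
support = np.flatnonzero(a > 1e-4)    # support vectors: a_n > 0

def predict(x):
    return np.sign(x @ w + b)
```

Only the two points closest to the boundary end up with a_n > 0, matching the KKT conditions: all other multipliers are driven to exactly zero.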
Example of Separable Data Classification

Figure 7.2
Overlapping Class Distributions

Allow some examples to be misclassified (a soft margin) by introducing slack variables

    \xi_n \ge 0,  n = 1, ..., N

with \xi_n = 0 for points on or inside the correct margin boundary and \xi_n = | t_n - y(x_n) | otherwise, so that the margin constraints become

    t_n y(x_n) = t_n ( w^T \phi(x_n) + b ) \ge 1 - \xi_n
Soft Margin Solution

Minimize

    C \sum_{n=1}^N \xi_n + (1/2)\|w\|^2

where C > 0 controls the trade-off between minimizing training errors and controlling model complexity. The Lagrangian is

    L(w, b, a) = (1/2)\|w\|^2 + C \sum_{n=1}^N \xi_n - \sum_{n=1}^N a_n \{ t_n y(x_n) - 1 + \xi_n \} - \sum_{n=1}^N \mu_n \xi_n

KKT conditions:

    a_n \ge 0,      t_n y(x_n) - 1 + \xi_n \ge 0,      a_n ( t_n y(x_n) - 1 + \xi_n ) = 0
    \mu_n \ge 0,    \xi_n \ge 0,                        \mu_n \xi_n = 0

Either a_n = 0 or t_n y(x_n) = 1 - \xi_n; the points with a_n > 0 are the support vectors.
Dual Representation

The dual representation is the same as in the separable case,

    \tilde{L}(a) = \sum_{n=1}^N a_n - (1/2) \sum_{n=1}^N \sum_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

but with box constraints

    0 \le a_n \le C,  n = 1, ..., N,   and   \sum_{n=1}^N a_n t_n = 0

Classifying new data and obtaining b proceed as for hard margin classifiers.
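The only change from the hard-margin sketch is the ceiling C on each multiplier, so the same projected-gradient approach applies with a box projection (bias again absorbed into the kernel, a hypothetical simplification; data, C, and learning rate are illustrative). A mislabeled outlier violates the margin, so by the KKT conditions its multiplier is pushed to the ceiling a_n = C.

```python
import numpy as np

# Four separable points plus one mislabeled outlier (index 4).
X = np.array([[2.0, 0.0], [3.0, 2.0], [-2.0, 0.0], [-3.0, -2.0], [-1.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
C = 0.5

K = X @ X.T + 1.0                     # bias absorbed into the kernel
Q = (t[:, None] * t[None, :]) * K

# Maximize the dual subject to the box constraints 0 <= a_n <= C
a = np.zeros(len(t))
lr = 0.005
for _ in range(50000):
    a = np.clip(a + lr * (1.0 - Q @ a), 0.0, C)

w = (a * t) @ X
b = a @ t
```

The outlier's slack \xi_n is positive, so \mu_n = 0 and its multiplier sits exactly at C, while the well-separated bulk points keep the boundary where the hard-margin problem put it.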
Alternative Formulation

ν-SVM (Schölkopf et al., 2000): maximize

    \tilde{L}(a) = -(1/2) \sum_{n=1}^N \sum_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

    subject to   0 \le a_n \le 1/N,   \sum_{n=1}^N a_n t_n = 0,   \sum_{n=1}^N a_n \ge \nu

The parameter ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.
Example of Nonseparable Data Classification (ν-SVM)
Solutions of the QP Problem

Chunking (Vapnik, 1982)
  - Idea: the value of the Lagrangian is unchanged if we remove the rows and columns of the kernel matrix corresponding to Lagrange multipliers that have value zero.

Protected conjugate gradients (Burges, 1998)

Decomposition methods (Osuna et al., 1996)

Sequential minimal optimization (SMO; Platt, 1999)
Relation to Logistic Regression (Section 4.3.2)

For data points on the correct side of the margin boundary, \xi_n = 0; for the remaining points, \xi_n = 1 - y_n t_n. The objective

    C \sum_{n=1}^N \xi_n + (1/2)\|w\|^2

can therefore be written as

    \sum_{n=1}^N E_{SV}( y_n t_n ) + \lambda \|w\|^2,   where   \lambda = (2C)^{-1}

and E_{SV}( y t ) = [ 1 - y t ]_+ is the hinge error function ( [\cdot]_+ denotes the positive part ).
Relation to Logistic Regression (Cont'd)

For maximum likelihood logistic regression with targets t \in \{-1, 1\},

    p( t = 1 | y ) = \sigma(y)
    p( t = -1 | y ) = 1 - \sigma(y) = \sigma(-y)
    p( t | y ) = \sigma( y t )

The corresponding error function with a quadratic regularizer is

    \sum_{n=1}^N E_{LR}( y_n t_n ) + \lambda \|w\|^2,   where   E_{LR}( y t ) = \ln( 1 + \exp(-y t) )
Comparison of Error Functions

The plot compares, as functions of y t: the hinge error function, the (rescaled) error function for logistic regression, the misclassification error, and the squared error.
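The four error functions being compared can be written down directly. In the sketch below, the logistic error is rescaled by 1/ln 2 so that it passes through the point (0, 1), as in the text's comparison.

```python
import numpy as np

def hinge(z):
    # E_SV(z) = [1 - z]_+ , with z = y t
    return np.maximum(0.0, 1.0 - z)

def logistic(z):
    # E_LR(z) = ln(1 + exp(-z)), rescaled by 1/ln 2 to pass through (0, 1)
    return np.log1p(np.exp(-z)) / np.log(2.0)

def misclassification(z):
    # 0-1 loss: error iff the point is on the wrong side (y t <= 0)
    return np.where(z <= 0, 1.0, 0.0)

def squared(z):
    # Squared error (z - 1)^2
    return (z - 1.0) ** 2
```

Hinge and rescaled logistic are both continuous surrogates for the misclassification error; the squared error, in contrast, penalizes points that are too far on the *correct* side.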
Multiclass SVMs

One-versus-the-rest: K separate SVMs
  - Can lead to inconsistent results (Figure 4.2)
  - Imbalanced training sets
  - Remedy: target values +1 for the positive class and -1/(K-1) for the negative class (Lee et al., 2001)
  - An objective function for training all SVMs simultaneously (Weston and Watkins, 1999)

One-versus-one: K(K-1)/2 SVMs

Error-correcting output codes (Allwein et al., 2000)
  - A generalization of the voting scheme of the one-versus-one approach
SVMs for Regression

Simple linear regression minimizes the regularized sum of squared errors

    (1/2) \sum_{n=1}^N ( y_n - t_n )^2 + (\lambda/2) \|w\|^2

To obtain sparse solutions, the quadratic error function is replaced by the ε-insensitive error function

    E_\epsilon( y(x) - t ) = 0                        if | y(x) - t | < \epsilon
    E_\epsilon( y(x) - t ) = | y(x) - t | - \epsilon  otherwise
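The two error functions are easy to state side by side; the sketch below uses an arbitrary illustrative tube width of 0.2.

```python
def eps_insensitive(r, eps=0.2):
    """E_eps(r) = 0 if |r| < eps, else |r| - eps, with r = y(x) - t.

    eps = 0.2 is an arbitrary illustrative tube width."""
    return max(0.0, abs(r) - eps)

def quadratic(r):
    # The quadratic error (1/2) r^2 that it replaces, for comparison.
    return 0.5 * r * r
```

Errors smaller than ε cost nothing at all, which is what makes the solution sparse: points inside the tube exert no force on the fit.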
SVMs for Regression (Cont'd)

Minimize

    C \sum_{n=1}^N E_\epsilon( y(x_n) - t_n ) + (1/2)\|w\|^2

Introducing slack variables \xi_n \ge 0 and \hat{\xi}_n \ge 0, this is equivalent to minimizing

    C \sum_{n=1}^N ( \xi_n + \hat{\xi}_n ) + (1/2)\|w\|^2

    where   t_n \le y(x_n) + \epsilon + \xi_n   and   t_n \ge y(x_n) - \epsilon - \hat{\xi}_n
Dual Problem

The Lagrangian is

    L = C \sum_{n=1}^N ( \xi_n + \hat{\xi}_n ) + (1/2)\|w\|^2 - \sum_{n=1}^N ( \mu_n \xi_n + \hat{\mu}_n \hat{\xi}_n )
        - \sum_{n=1}^N a_n ( \epsilon + \xi_n + y_n - t_n ) - \sum_{n=1}^N \hat{a}_n ( \epsilon + \hat{\xi}_n - y_n + t_n )

Eliminating the primal variables gives the dual

    \tilde{L}(a, \hat{a}) = -(1/2) \sum_{n=1}^N \sum_{m=1}^N ( a_n - \hat{a}_n )( a_m - \hat{a}_m ) k(x_n, x_m)
                            - \epsilon \sum_{n=1}^N ( a_n + \hat{a}_n ) + \sum_{n=1}^N ( a_n - \hat{a}_n ) t_n

    subject to   0 \le a_n \le C,   0 \le \hat{a}_n \le C,   \sum_{n=1}^N ( a_n - \hat{a}_n ) = 0
Predictions

From the derivatives of the Lagrangian,

    w = \sum_{n=1}^N ( a_n - \hat{a}_n ) \phi(x_n)

so predictions are made using

    y(x) = \sum_{n=1}^N ( a_n - \hat{a}_n ) k(x, x_n) + b

KKT conditions:

    a_n ( \epsilon + \xi_n + y_n - t_n ) = 0
    \hat{a}_n ( \epsilon + \hat{\xi}_n - y_n + t_n ) = 0
    ( C - a_n ) \xi_n = 0
    ( C - \hat{a}_n ) \hat{\xi}_n = 0

The bias is obtained from a point with 0 < a_n < C (so that \xi_n = 0 and \epsilon + y_n - t_n = 0):

    b = t_n - \epsilon - w^T \phi(x_n) = t_n - \epsilon - \sum_{m=1}^N ( a_m - \hat{a}_m ) k(x_n, x_m)
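The regression dual can also be sketched with projected gradient ascent on a toy 1-D problem. As in the classification sketches, the bias is absorbed into the kernel (k(x, x') = x x' + 1), removing the constraint \sum_n (a_n - \hat{a}_n) = 0; this is a hypothetical simplification, and the data, ε, C, and learning rate are illustrative.

```python
import numpy as np

# Toy 1-D data lying on the line t = x (hypothetical).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
t = x.copy()
eps, C = 0.1, 1.0

K = np.outer(x, x) + 1.0            # linear kernel with the bias absorbed

a = np.zeros_like(x)                # multipliers a_n   (active above the tube)
ah = np.zeros_like(x)               # multipliers a^_n  (active below the tube)
lr = 0.01
for _ in range(50000):
    g = K @ (a - ah)                # current predictions y(x_n)
    a = np.clip(a + lr * (t - eps - g), 0.0, C)
    ah = np.clip(ah + lr * (-t - eps + g), 0.0, C)

d = a - ah
w = d @ x                           # slope of y(x) = w x + b
b = d.sum()                         # bias from the constant kernel feature
residuals = w * x + b - t
```

Only the endpoints, which touch the tube boundary, keep nonzero multipliers; points strictly inside the ε-tube contribute nothing, exactly as the KKT conditions predict.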
Alternative Formulation

ν-SVM (Schölkopf et al., 2000): maximize

    \tilde{L}(a, \hat{a}) = -(1/2) \sum_{n=1}^N \sum_{m=1}^N ( a_n - \hat{a}_n )( a_m - \hat{a}_m ) k(x_n, x_m) + \sum_{n=1}^N ( a_n - \hat{a}_n ) t_n

    subject to   0 \le a_n \le C/N,   0 \le \hat{a}_n \le C/N,
                 \sum_{n=1}^N ( a_n - \hat{a}_n ) = 0,   \sum_{n=1}^N ( a_n + \hat{a}_n ) \le \nu C

Here ν bounds the fraction of points lying outside the tube.
Example of ν-SVM Regression
Relevance Vector Machines

Limitations of the SVM:
  - Outputs are decisions rather than posterior probabilities
  - The extension to K > 2 classes is problematic
  - There is a complexity parameter C to be tuned
  - Kernel functions are centered on training data points and required to be positive definite

The RVM is a Bayesian sparse kernel technique that yields much sparser models, and hence faster performance on test data.
RVM for Regression

The RVM is a linear model of the form considered in Chapter 3 with a modified prior:

    p( t | x, w, \beta ) = N( t | y(x), \beta^{-1} ),   where   y(x) = \sum_{i=1}^M w_i \phi_i(x) = w^T \phi(x)

For an SVM-like kernel model,

    y(x) = \sum_{n=1}^N w_n k(x, x_n) + b

The likelihood and the prior (with a separate hyperparameter \alpha_i for each weight) are

    p( t | X, w, \beta ) = \prod_{n=1}^N p( t_n | x_n, w, \beta )

    p( w | \alpha ) = \prod_{i=1}^M N( w_i | 0, \alpha_i^{-1} )
RVM for Regression (Cont'd)

α and β are determined using the evidence approximation (type-2 maximum likelihood) (Section 3.5). The posterior over the weights is

    p( w | t, X, \alpha, \beta ) = N( w | m, \Sigma )

    where   m = \beta \Sigma \Phi^T t,   \Sigma = ( A + \beta \Phi^T \Phi )^{-1}

with A = diag(\alpha_i) and \Phi the N x M design matrix with elements \Phi_{ni} = \phi_i(x_n).

From the result (3.49) for linear regression models, the marginal likelihood to maximize is

    p( t | X, \alpha, \beta ) = \int p( t | X, w, \beta ) p( w | \alpha ) dw = N( t | 0, C )

    \ln p( t | X, \alpha, \beta ) = -(1/2) \{ N \ln(2\pi) + \ln|C| + t^T C^{-1} t \}

    where   C = \beta^{-1} I + \Phi A^{-1} \Phi^T
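The two standard ways of writing this evidence can be checked against each other numerically: directly through C, and through the posterior quantities m and Σ (using the matrix determinant lemma and the Woodbury identity). All numbers below are arbitrary test values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 3
Phi = rng.normal(size=(N, M))          # design matrix (arbitrary features)
t = rng.normal(size=N)
alpha = np.array([0.5, 2.0, 10.0])     # assumed hyperparameters
beta = 4.0

A = np.diag(alpha)
# Posterior over w:  Sigma = (A + beta Phi^T Phi)^{-1},  m = beta Sigma Phi^T t
Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)
m = beta * Sigma @ Phi.T @ t

# Marginal likelihood directly:  t ~ N(0, C),  C = beta^{-1} I + Phi A^{-1} Phi^T
C = np.eye(N) / beta + Phi @ np.linalg.inv(A) @ Phi.T
_, logdetC = np.linalg.slogdet(C)
log_ev_direct = -0.5 * (N * np.log(2 * np.pi) + logdetC + t @ np.linalg.solve(C, t))

# Same quantity via the posterior (determinant lemma + Woodbury identity)
_, logdetS = np.linalg.slogdet(Sigma)
log_ev_post = 0.5 * (logdetS + np.sum(np.log(alpha)) + N * np.log(beta)
                     - beta * np.sum((t - Phi @ m) ** 2) - m @ A @ m
                     - N * np.log(2 * np.pi))
```

The second form is the one used in practice, since it avoids building and inverting the N x N matrix C explicitly.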
RVM for Regression (Cont'd)

Two approaches to maximizing the marginal likelihood:
  - Setting its derivatives to zero gives the re-estimation equations

        \alpha_i^{new} = \gamma_i / m_i^2,   where   \gamma_i = 1 - \alpha_i \Sigma_{ii}

        ( \beta^{new} )^{-1} = \| t - \Phi m \|^2 / ( N - \sum_i \gamma_i )

  - EM algorithm (Section 9.3.4)

Predictive distribution (cf. Section 3.3.2):

    p( t^* | x^*, X, t, \alpha^*, \beta^* ) = \int p( t^* | x^*, w, \beta^* ) p( w | X, t, \alpha^*, \beta^* ) dw
                                            = N( t^* | m^T \phi(x^*), \sigma^2(x^*) )

    where   \sigma^2(x^*) = ( \beta^* )^{-1} + \phi(x^*)^T \Sigma \phi(x^*)
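The re-estimation loop above can be sketched end to end on synthetic data. The data set, Gaussian kernel width, initial values, and numerical caps below are all illustrative assumptions; the point is that most \alpha_i are driven to very large values, pruning the corresponding basis functions and leaving a sparse set of relevance vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 30
x = np.linspace(-1.0, 1.0, N)
t = np.sin(3.0 * x) + 0.05 * rng.normal(size=N)   # synthetic targets

# Kernel design matrix with a bias column: y(x) = sum_n w_n k(x, x_n) + b
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1)   # Gaussian kernel, arbitrary width
Phi = np.hstack([np.ones((N, 1)), K])
M = Phi.shape[1]

alpha = np.ones(M)
beta = 100.0
for _ in range(200):
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    gamma = 1.0 - alpha * np.diag(Sigma)
    # alpha_i <- gamma_i / m_i^2, capped for numerical safety
    alpha = np.clip(gamma / np.maximum(m ** 2, 1e-12), 1e-6, 1e8)
    # (beta^new)^{-1} = ||t - Phi m||^2 / (N - sum_i gamma_i)
    beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)

relevance = alpha < 1e6   # basis functions that survive pruning
```

After convergence, weights with huge \alpha_i have posteriors sharply peaked at zero and can be deleted from the model entirely.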
Example of RVM Regression

The RVM gives a more compact model than the SVM, and its parameters are determined automatically, but it requires more training time than the SVM.

RVM regression vs. ν-SVM regression
Mechanism for Sparsity

Consider a model with a single basis function \varphi. The marginal likelihood is

    p( t | \alpha, \beta ) = N( t | 0, C ),   where   C = \beta^{-1} I + \alpha^{-1} \varphi \varphi^T

If the basis function is poorly aligned with the data, the evidence is maximized by only isotropic noise, \alpha = \infty; otherwise a finite value of \alpha is preferred.
Sparse Solution

Pull out the contribution from \alpha_i in

    C = \beta^{-1} I + \Phi A^{-1} \Phi^T
      = \beta^{-1} I + \sum_{j \ne i} \alpha_j^{-1} \varphi_j \varphi_j^T + \alpha_i^{-1} \varphi_i \varphi_i^T
      = C_{-i} + \alpha_i^{-1} \varphi_i \varphi_i^T

where \varphi_i is the i-th column of \Phi. Using (C.7) and (C.15) in Appendix C,

    C^{-1} = C_{-i}^{-1} - \frac{ C_{-i}^{-1} \varphi_i \varphi_i^T C_{-i}^{-1} }{ \alpha_i + \varphi_i^T C_{-i}^{-1} \varphi_i }
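This rank-one (Sherman-Morrison) identity can be verified numerically, along with the sparsity and quality quantities it is used to define on the next slide. All dimensions and values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 6, 4
beta = 2.0
alpha = rng.uniform(0.5, 3.0, size=M)
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

i = 1
phi_i = Phi[:, i]
# C with the contribution of basis function i removed
C_minus = np.eye(N) / beta + sum(np.outer(Phi[:, j], Phi[:, j]) / alpha[j]
                                 for j in range(M) if j != i)
C = C_minus + np.outer(phi_i, phi_i) / alpha[i]

Cm_inv = np.linalg.inv(C_minus)
# Sherman-Morrison / Woodbury form of C^{-1}
C_inv = Cm_inv - (Cm_inv @ np.outer(phi_i, phi_i) @ Cm_inv) \
        / (alpha[i] + phi_i @ Cm_inv @ phi_i)

# Sparsity and quality of basis function i (used on the next slide)
s_i = phi_i @ Cm_inv @ phi_i
q_i = phi_i @ Cm_inv @ t
```

Because only C_{-i} needs inverting, the contribution of each basis function can be analysed (and added or removed) cheaply.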
Sparse Solution (Cont'd)

The log marginal likelihood function then separates as

    L(\alpha) = L(\alpha_{-i}) + \lambda(\alpha_i)

    where   \lambda(\alpha_i) = (1/2) [ \ln \alpha_i - \ln( \alpha_i + s_i ) + q_i^2 / ( \alpha_i + s_i ) ]

    s_i = \varphi_i^T C_{-i}^{-1} \varphi_i,        q_i = \varphi_i^T C_{-i}^{-1} t

Sparsity s_i measures the extent to which \varphi_i overlaps with the other basis vectors; quality q_i represents a measure of the alignment of the basis vector \varphi_i with the error between t and the prediction y_{-i} of the model with \varphi_i removed.

Stationary points of the marginal likelihood with respect to \alpha_i:

    d\lambda / d\alpha_i = \frac{ \alpha_i^{-1} s_i^2 - ( q_i^2 - s_i ) }{ 2 ( \alpha_i + s_i )^2 } = 0

so if q_i^2 > s_i, then \alpha_i = s_i^2 / ( q_i^2 - s_i ); otherwise \alpha_i = \infty.
Sequential Sparse Bayesian Learning Algorithm

1. Initialize \beta.
2. Initialize the model with a single basis function \varphi_1, with \alpha_1 = s_1^2 / ( q_1^2 - s_1 ) and the remaining \alpha_j = \infty (j \ne 1).
3. Evaluate \Sigma and m, together with q_i and s_i, for all basis functions.
4. Select a candidate basis function \varphi_i.
5. If q_i^2 > s_i and \alpha_i < \infty (\varphi_i is already in the model), update \alpha_i.
6. If q_i^2 > s_i and \alpha_i = \infty, add \varphi_i to the model, and evaluate \alpha_i.
7. If q_i^2 \le s_i and \alpha_i < \infty, remove \varphi_i from the model, and set \alpha_i = \infty.
8. Update \beta (in the regression case).
9. Go to 3 until converged.
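Steps 5-7 above reduce to a small decision rule on \theta_i = q_i^2 - s_i. A sketch (the function name and return convention are hypothetical, not from the text):

```python
import math

def update_decision(q_i, s_i, alpha_i):
    """Steps 5-7 of the algorithm for one candidate basis function.

    Returns (action, new_alpha). theta_i = q_i^2 - s_i > 0 means the
    basis function should be (or stay) in the model."""
    theta = q_i ** 2 - s_i
    in_model = math.isfinite(alpha_i)
    if theta > 0:
        new_alpha = s_i ** 2 / theta      # alpha_i = s_i^2 / (q_i^2 - s_i)
        return ("update" if in_model else "add"), new_alpha
    return ("delete" if in_model else "skip"), math.inf
```

Because each decision needs only the scalars q_i and s_i, candidate basis functions can be scanned very cheaply between refits of \Sigma and m.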
RVM for Classification

Probabilistic linear classification model (Chapter 4)

    y(x, w) = \sigma( w^T \phi(x) )

with the ARD prior

    p( w | \alpha ) = \prod_{i=1}^N N( w_i | 0, \alpha_i^{-1} )

Training loop:
  - Initialize \alpha
  - Build a Gaussian approximation to the posterior distribution
  - Obtain an approximation to the marginal likelihood
  - Maximize the marginal likelihood (re-estimate \alpha); repeat until converged
RVM for Classification (Cont'd)

The mode of the posterior distribution is obtained by maximizing

    \ln p( w | t, \alpha ) = \ln \{ p( t | w ) p( w | \alpha ) \} - \ln p( t | \alpha )
                           = \sum_{n=1}^N \{ t_n \ln y_n + ( 1 - t_n ) \ln( 1 - y_n ) \} - (1/2) w^T A w + const

    where   A = diag(\alpha_i)

using iterative reweighted least squares (IRLS) from Section 4.3.3:

    \nabla \ln p( w | t, \alpha ) = \Phi^T ( t - y ) - A w
    \nabla\nabla \ln p( w | t, \alpha ) = -( \Phi^T B \Phi + A )

    where B is the N x N diagonal matrix with elements b_n = y_n ( 1 - y_n ), and \Phi is the design matrix with elements \Phi_{ni} = \phi_i(x_n).

The resulting Gaussian approximation to the posterior distribution has

    w^* = A^{-1} \Phi^T ( t - y ),        \Sigma = ( \Phi^T B \Phi + A )^{-1}
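The IRLS mode-finding step can be sketched for fixed hyperparameters. The 1-D data set, design matrix, and \alpha values below are illustrative; the Newton iteration uses exactly the gradient and Hessian above, and at the mode the condition w^* = A^{-1} \Phi^T ( t - y ) holds.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic 1-D two-class data (hypothetical).
rng = np.random.default_rng(3)
N = 40
x = np.concatenate([rng.normal(-1.5, 0.5, N // 2), rng.normal(1.5, 0.5, N // 2)])
t = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])

Phi = np.column_stack([np.ones(N), x])   # design matrix: bias + raw input
alpha = np.array([1.0, 1.0])             # fixed hyperparameters for illustration
A = np.diag(alpha)

w = np.zeros(2)
for _ in range(50):
    y = sigmoid(Phi @ w)
    grad = Phi.T @ (t - y) - A @ w                 # gradient of ln p(w | t, alpha)
    H = Phi.T @ np.diag(y * (1.0 - y)) @ Phi + A   # negative Hessian
    w = w + np.linalg.solve(H, grad)               # Newton / IRLS step

y = sigmoid(Phi @ w)
Sigma = np.linalg.inv(Phi.T @ np.diag(y * (1.0 - y)) @ Phi + A)  # posterior covariance
```

In the full RVM, this inner loop alternates with re-estimation of \alpha from the Laplace approximation described on the next slide.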
RVM for Classification (Cont'd)

Marginal likelihood using the Laplace approximation (Section 4.4):

    p( t | \alpha ) = \int p( t | w ) p( w | \alpha ) dw \simeq p( t | w^* ) p( w^* | \alpha ) (2\pi)^{M/2} |\Sigma|^{1/2}

Setting the derivative of the marginal likelihood with respect to \alpha_i equal to zero and rearranging gives

    \alpha_i^{new} = \gamma_i / ( w_i^* )^2,   where   \gamma_i = 1 - \alpha_i \Sigma_{ii}

the same form as in the regression case. If we define \hat{t} = \Phi w^* + B^{-1} ( t - y ), then

    \ln p( t | \alpha ) = -(1/2) \{ N \ln(2\pi) + \ln|C| + \hat{t}^T C^{-1} \hat{t} \}

    where   C = B^{-1} + \Phi A^{-1} \Phi^T
Example of RVM Classification