Support Vector Machine - ECSE · 4/3/2011
Chapter 7
Support Vector Machine
Table of Contents
• Margin and support vectors
• SVM formulation
• Slack variables and hinge loss
• SVM for multiple classes
• SVM with Kernels
• Relevance Vector Machine
Support Vector Machine (SVM)
• Like LDA, traditional SVM is a linear and binary
classifier
• Unlike LSQ and Fisher criterion, SVM
approaches the 2-class classification problem
using the concept of margin and support
vectors.
Margin and Support Vectors

Margin is defined to be the smallest distance between the decision boundary and any of the samples. Support vectors are data points located on the margin line.
Support Vector Machine
• Pick the decision boundary with
the largest margin!
• Linear hyperplane defined by
“support” vectors
• Moving other points does not
affect the decision boundary
• Only need to store the support
vectors to predict labels of new
points
Two-class Classification with Linear Model

  y(x) = w^T x + b
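A minimal sketch of this decision rule (the weights, bias, and points below are hypothetical, chosen only to illustrate the sign test):

```python
import numpy as np

def predict(w, b, X):
    """Classify rows of X with the linear model y(x) = w^T x + b.

    Returns +1 or -1 per point, i.e. the sign of the decision value."""
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)

# Hypothetical 2-D example: decision boundary x1 + x2 - 3 = 0
w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[2.0, 2.0],    # y(x) = 1  -> class +1
              [0.5, 0.5]])   # y(x) = -2 -> class -1
print(predict(w, b, X))      # [ 1 -1]
```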
SVM Formulation

• Two-class classification with the linear model is

  y(x) = w^T x + b

• Given the target t_n ∈ {-1, +1}, the distance of a point x_n to the decision surface is given by

  t_n y(x_n) / ||w|| = t_n (w^T x_n + b) / ||w||

• SVM is to find the model parameters w, b by maximizing the margin, i.e.,

  w*, b* = arg max_{w,b} { min_n [ t_n (w^T x_n + b) / ||w|| ] }
Maximizing the Margin

  max_{w,b} γ = 2a / ||w||   s.t.  t_n (w^T x_n + b) ≥ a, ∀n

2 is added for mathematical convenience.
Support Vector Machines

Let a = 1:

  max_{w,b} γ = 2 / ||w||   s.t.  t_n (w^T x_n + b) ≥ 1, ∀n

This is equivalent to

  min_{w,b} (1/2) ||w||   s.t.  t_n (w^T x_n + b) ≥ 1, ∀n

and hence to

  min_{w,b} (1/2) ||w||² = (1/2) w^T w   s.t.  t_n (w^T x_n + b) ≥ 1, ∀n
Solving SVM

  min_{w,b} (1/2) ||w||² = (1/2) w^T w   s.t.  t_n (w^T x_n + b) ≥ 1, ∀n

This can now be solved by standard quadratic programming. Introducing the Lagrange multipliers a_n, we have

  L(w, b, a) = (1/2) w^T w − Σ_{n=1}^N a_n [t_n (w^T x_n + b) − 1]

Setting the derivatives of L with respect to w and b to zero yields

  w = Σ_{n=1}^N a_n t_n x_n,   Σ_{n=1}^N a_n t_n = 0,   b = t_n − w^T x_n (for any support vector x_n)

Only a few a_n are greater than 0, corresponding to the support vectors. N_SV is the number of support vectors.
Solving SVM (cont’d)

In the solution above, w = Σ_{n=1}^N a_n t_n x_n and b still depend on the unknown a_n. To solve for a_n, we substitute w and Σ_n a_n t_n = 0 into L(w, b, a), yielding

  L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m} a_n a_m t_n t_m x_n^T x_m
       = Σ_{n=1}^N a_n − (1/2) Σ_{n,m} a_n a_m t_n t_m k(x_n, x_m)

where k(x_n, x_m) = x_n^T x_m is the kernel function.

a_n can be solved by maximizing L(a), i.e.,

  a* = arg max_a L(a)   subject to  a_n ≥ 0  and  Σ_{n=1}^N a_n t_n = 0

QP solves the optimization to get the global optimum of a_n, then w and b.
Characteristics of the Solution

• Many of the α_n are zero
  – w is a linear combination of a small number of data points
  – This “sparse” representation can be viewed as data compression, as in the construction of the KNN classifier
• x_n with non-zero α_n are called support vectors (SV)
  – The decision boundary is determined only by the SVs
  – Let n (n = 1, ..., s) be the indices of the s support vectors. We can write

    w = Σ_{n=1}^s a_n t_n x_n,   b = t_n − w^T x_n

• For testing with a new data point z
  – Compute w^T z + b and classify z as class 1 if the sum is positive, and class 2 otherwise
  – Note: w need not be formed explicitly
A Geometrical Interpretation

(figure: two classes separated by the maximum-margin boundary; the support vectors carry non-zero multipliers α1 = 0.8, α6 = 1.4, α8 = 0.6, while all other points have α = 0)
The Quadratic Programming Problem

• Many approaches have been proposed
  – Loqo, cplex, etc. (see http://www.numerical.rl.ac.uk/qp/qp.html)
• Most are “interior-point” methods
  – Start with an initial solution that can violate the constraints
  – Improve this solution by optimizing the objective function and/or reducing the amount of constraint violation
• For SVM, sequential minimal optimization (SMO) seems to be the most popular
  – A QP with two variables is trivial to solve
  – Each iteration of SMO picks a pair of (α_i, α_j) and solves the QP with these two variables; repeat until convergence
  – In practice, we can just regard the QP solver as a “black box” without bothering about how it works
Data is still not linearly separable – Soft Margin

  min_{w,b,ξ} (1/2) w^T w + C Σ_n ξ_n   s.t.  t_n (w^T x_n + b) ≥ 1 − ξ_n, ∀n
The Soft Margin method will choose a hyperplane that splits the
examples as cleanly as possible, while still maximizing the distance
to the nearest cleanly split examples.
Slack Variables – Hinge Loss

  ξ_n = [1 − t_n (w^T x_n + b)]_+

  min_{w,b} (1/2) w^T w + C Σ_n ξ_n   s.t.  t_n (w^T x_n + b) ≥ 1 − ξ_n, ∀n

(figure: the hinge loss plotted as a function of t_n (w^T x_n + b))
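A minimal NumPy sketch of the slack computation ξ_n = [1 − t_n(w^T x_n + b)]_+ (the decision values below are hypothetical):

```python
import numpy as np

def hinge_slack(t, scores):
    """Slack xi_n = max(0, 1 - t_n * y(x_n)) for targets t in {-1, +1}
    and decision values y(x_n) = w^T x_n + b."""
    return np.maximum(0.0, 1.0 - t * scores)

t = np.array([1.0, 1.0, -1.0])
scores = np.array([2.0,    # correct, outside the margin   -> xi = 0
                   0.3,    # correct but inside the margin -> xi = 0.7
                   0.5])   # misclassified                 -> xi = 1.5
print(hinge_slack(t, scores))  # [0.  0.7 1.5]
```

Points classified correctly and outside the margin incur no penalty; only margin violations contribute to the C Σ ξ_n term.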
Soft Margin SVM

  min (1/2) ||w||² + C Σ_i ξ_i
  s.t.  t_i (w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
The Lagrangian for Soft Margin SVM

  L(w, b, ξ, a, r) = (1/2) ||w||² + C Σ_i ξ_i − Σ_i a_i {t_i (w^T x_i + b) − 1 + ξ_i} − Σ_i r_i ξ_i
Take the Derivative

  ∂L/∂w = ∂L/∂b = ∂L/∂ξ = 0

which gives

  w = Σ_i a_i t_i x_i,   Σ_i a_i t_i = 0,   a_i = C − r_i
KKT Condition

• The solution satisfies the KKT conditions

  a_i {t_i (w^T x_i + b) − 1 + ξ_i} = 0,   r_i ξ_i = 0
Plug Back into the Original Problem

  min_a (1/2) Σ_{i,j} a_i a_j t_i t_j x_i^T x_j − Σ_i a_i
  s.t.  0 ≤ a_i ≤ C ∀i,   Σ_i a_i t_i = 0
Multiple-Class SVM
• One possibility is to use N two-way discriminant functions: one-versus-the-rest
  – Each function discriminates one class from the rest. This produces N binary classifiers.
• Another possibility is to use N(N-1)/2 two-way discriminant functions: one-versus-one
  – Each function discriminates between two particular classes.
• Single Multi-class SVM
One-versus-the-rest
One-versus-one
• Another approach is to train N(N−1)/2 different 2-class SVMs on all possible pairs of classes, and then to classify test points according to which class has the highest number of ‘votes’
• This can lead to ambiguities in the resulting classification
• It requires significantly more training time for large N
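The pairwise-voting scheme can be sketched in a few lines; the pairwise deciders here are hypothetical stand-ins for trained two-class SVMs:

```python
from itertools import combinations

def one_vs_one_predict(z, classes, pairwise_decide):
    """Vote among the N(N-1)/2 pairwise classifiers.

    pairwise_decide(i, j, z) returns the winning class (i or j) for point z;
    the class with the most votes overall wins."""
    votes = {c: 0 for c in classes}
    for i, j in combinations(classes, 2):
        votes[pairwise_decide(i, j, z)] += 1
    return max(votes, key=votes.get)

# Toy stand-in for trained 2-class SVMs: the pairwise rule picks whichever
# class "center" (k + 0.5 on the real line) is nearer to z.
decide = lambda i, j, z: i if abs(z - (i + 0.5)) <= abs(z - (j + 0.5)) else j
print(one_vs_one_predict(1.2, [0, 1, 2], decide))  # 1
```

Ties in the vote count are exactly the ambiguity mentioned above; real implementations break them with the decision values.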
Single Multi-class SVM
Multi-class SVM
• Although the application of SVMs to multiclass classification problems remains an open issue, in practice the one-versus-the-rest approach is the most widely used in spite of its ad-hoc formulation and its practical limitations.
SVM with Kernels for
Non-linear Classification
• The original optimal hyperplane was a linear classifier, which cannot be directly applied to classes that are not linearly separable
• Kernel trick was introduced to create nonlinear SVM classifiers
• This allows the algorithm to fit the maximum-margin hyperplane in a high dimensional transformed feature space, where the classes are linearly separable.
Extension to Non-linear Decision Boundary

• Key idea: transform x_i to a higher-dimensional space to “make life easier”
  – Input space: the space where the points x_i are located
  – Feature space: the space of φ(x_i) after transformation
• Why transform?
  – A linear operation in the feature space is equivalent to a non-linear operation in the input space
  – Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1x2 makes the problem linearly separable
The Kernel Trick

• Recall the SVM optimization problem

  max_a L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m} a_n a_m t_n t_m x_n^T x_m
  subject to  C ≥ a_n ≥ 0  and  Σ_{n=1}^N a_n t_n = 0

• The data points only appear as inner products
• As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly. The computational expense of increased dimensionality is avoided.
• Many common geometric operations (angles, distances) can be expressed by inner products
• Define the kernel function K by K(x, y) = φ(x)^T φ(y)
4/3/2011 CSE 802. Prepared by Martin Law 38
Map from input 2D to 3D feature space
An Example for φ(.) and K(.,.)

• Suppose φ(.) is given as follows:

  φ([x1, x2]) = (1, √2 x1, √2 x2, x1², x2², √2 x1 x2)

• An inner product in the feature space is

  φ(x)^T φ(y) = (1 + x1 y1 + x2 y2)²

• So, if we define the kernel function as K(x, y) = (1 + x1 y1 + x2 y2)², there is no need to carry out φ(.) explicitly
• This use of the kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick
Examples of Kernel Functions

• Polynomial kernel with degree d: K(x, y) = (x^T y + 1)^d
• Radial basis function kernel with width σ: K(x, y) = exp(−||x − y||² / (2σ²))
  – Closely related to radial basis function neural networks
  – The feature space is infinite-dimensional
• Sigmoid with parameter κ and θ: K(x, y) = tanh(κ x^T y + θ)
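These kernels can be written directly in NumPy (a short sketch; the parameter names d, sigma, kappa, theta follow the slide):

```python
import numpy as np

def poly_kernel(x, y, d=2):
    """Polynomial kernel of degree d: (x^T y + 1)^d."""
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis function kernel with width sigma."""
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """Sigmoid kernel: tanh(kappa * x^T y + theta)."""
    return np.tanh(kappa * np.dot(x, y) + theta)

x, y = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(poly_kernel(x, y))   # (1 + 1)^2 = 4.0
print(rbf_kernel(x, x))    # 1.0 (distance zero)
```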
Modification Due to Kernel Function

• Change all inner products to kernel functions
• For training,

Original:

  max_a L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m} a_n a_m t_n t_m x_n^T x_m
  subject to  C ≥ a_n ≥ 0  and  Σ_{n=1}^N a_n t_n = 0

With kernel function:

  max_a L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m} a_n a_m t_n t_m k(x_n, x_m)
  subject to  C ≥ a_n ≥ 0  and  Σ_{n=1}^N a_n t_n = 0
Modification Due to Kernel Function

• For testing, the new data z is classified as class 1 if f ≥ 0, and as class -1 if f < 0

Original:

  f(z) = w^T z + b,   with  w = Σ_{n=1}^s a_n t_n x_n,  b = t_n − w^T x_n

With kernel function:

  f(z) = Σ_{n∈SV} a_n t_n K(x_n, z) + b,   b = (1/N_SV) Σ_{n∈SV} [ t_n − Σ_{m∈SV} a_m t_m K(x_m, x_n) ]
Example

• Suppose we have five 1-D data points
  – x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2 ⇒ y1=1, y2=1, y3=-1, y4=-1, y5=1
• We use the polynomial kernel of degree 2
  – K(x, y) = (xy + 1)²
  – C is set to 100
• We first find α_i (i = 1, ..., 5) by solving the QP problem
Example

• By using a QP solver, we get
  α1 = 0, α2 = 2.5, α3 = 0, α4 = 7.333, α5 = 4.833
• Note that the constraints are indeed satisfied
• The support vectors are {x2=2, x4=5, x5=6}
• The discriminant function is

  f(z) = 2.5 (2z+1)² − 7.333 (5z+1)² + 4.833 (6z+1)² + b = 0.6667 z² − 5.333 z + b

• b is recovered by solving f(2)=1, or f(5)=-1, or f(6)=1, as x2 and x5 lie on the margin line t=+1 and x4 lies on the margin line t=-1
• All three give b=9
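The worked example can be checked numerically. The sketch below plugs the quoted multipliers into f(z) = Σ α_n y_n K(x_n, z) + b and recovers b = 9 from f(2) = 1 (the fractions 22/3 and 29/6 are the exact forms of the quoted 7.333 and 4.833):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
alpha = np.array([0.0, 2.5, 0.0, 22.0 / 3.0, 29.0 / 6.0])  # QP solution

K = lambda a, b: (a * b + 1.0) ** 2       # degree-2 polynomial kernel

def f(z, b):
    """Discriminant f(z) = sum_n alpha_n y_n K(x_n, z) + b."""
    return np.sum(alpha * y * K(x, z)) + b

# Recover b from the support vector x2 = 2 on the t = +1 margin: f(2) = 1
b = 1.0 - np.sum(alpha * y * K(x, 2.0))
print(round(b, 4))          # 9.0
print(round(f(5.0, b), 4))  # -1.0 (x4 = 5 lies on the t = -1 margin)
```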
Example

(figure: the discriminant function f(z) plotted over the data points 1, 2, 4, 5, 6; points 1, 2, and 6 fall in class 1, points 4 and 5 in class 2)
SVM Learning via Quadratic Programming

Rewrite the above expressions as a standard quadratic programming problem. The dual

  max_a L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m} a_n a_m t_n t_m k(x_n, x_m)
  subject to  C ≥ a_n ≥ 0  and  Σ_{n=1}^N a_n t_n = 0

becomes

  min_a (1/2) a^T H a − e^T a
  subject to  t^T a = 0  and  0 ≤ a_n ≤ C

where C is the soft-margin weight parameter, a = [a1, a2, ..., aN]^t, t = [t1, t2, ..., tN]^t, e_{N×1} = [1, 1, ..., 1]^t, and H is an N×N matrix whose elements are H_{m,n} = t_m t_n k(x_m, x_n). The constrained quadratic minimization problem above can be solved by the Matlab function

  quadprog(H, −e, [], [], t^T, 0, [0, 0, ..., 0]_{N×1}, [100, 100, ..., 100]_{N×1})

In this command, the first two parameters, H and −e, are the objective function coefficients; the next two parameters are the inequality constraints, left blank [], [] in this example; the next two parameters, t^T and 0, are the equality constraint; and the last two parameters, [0, 0, ..., 0] and [100, 100, ..., 100], are the lower and upper bounds of the variables.
MATLAB Demo

The Matlab code:

  x = [1 2 4 5 6];
  y = [1 1 -1 -1 1];
  H = zeros(5,5);
  for i = 1:5
      for j = 1:5
          H(i,j) = y(i)*y(j)*(x(i)*x(j)+1)^2;
      end
  end
  e = ones(5,1);
  lb = zeros(5,1);
  ub = 100*ones(5,1);
  Astar = quadprog(H,-e,[],[],y,0,lb,ub);

The output result:

  Astar = [0; 2.5000; 0; 7.3333; 4.8333]
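For readers without Matlab, a rough Python equivalent is sketched below, assuming SciPy is available and substituting `scipy.optimize.minimize` (SLSQP) for `quadprog`:

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])

# H(i,j) = y_i y_j (x_i x_j + 1)^2, the degree-2 polynomial kernel matrix
H = np.outer(y, y) * (np.outer(x, x) + 1.0) ** 2
e = np.ones(5)

# min (1/2) a^T H a - e^T a   s.t.  y^T a = 0,  0 <= a <= 100
obj = lambda a: 0.5 * a @ H @ a - e @ a
res = minimize(obj, np.zeros(5), method="SLSQP",
               jac=lambda a: H @ a - e,
               bounds=[(0.0, 100.0)] * 5,
               constraints={"type": "eq", "fun": lambda a: y @ a})
print(np.round(res.x, 4))  # approx [0, 2.5, 0, 7.3333, 4.8333]
```

A general-purpose NLP solver stands in for a dedicated QP routine here; on a convex problem this small both reach the same global optimum.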
Choosing the Kernel Function

• Probably the most tricky part of using SVM.
• The kernel function is important because it creates the kernel matrix, which summarizes all the data
• Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, ...)
• There is even research to estimate the kernel matrix from available information
  – In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try
• Note that SVM with an RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions automatically chosen for SVM
SVM Parameter selection
• The effectiveness of SVM depends on the selection of kernel, the kernel's parameters, and soft margin parameter C.
• Typically, each combination of parameter choices is checked using cross validation, and the parameters with best cross-validation accuracy are picked.
• The final model, which is used for testing and for classifying new data, is then trained on the whole training set using the selected parameters.
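An illustrative sketch of this selection loop, assuming scikit-learn is available (the toy data and grid values are hypothetical, not recommendations):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # linearly separable toy labels

# Every (kernel parameter, C) combination is scored by cross-validation
grid = {"C": [0.1, 1, 10, 100],
        "gamma": [0.01, 0.1, 1],   # RBF width parameter
        "kernel": ["rbf"]}
search = GridSearchCV(SVC(), grid, cv=5)
search.fit(X, t)   # refits the final model on all data with the best params
print(search.best_params_)
```

With `refit=True` (the default), `search` already holds the final model trained on the whole training set using the selected parameters, as described above.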
Software

• A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
• Some implementations (such as LIBSVM) can handle multi-class classification
• SVMLight is among the earliest implementations of SVM
• Several Matlab toolboxes for SVM are also available
Summary: Steps for Classification

• Select the kernel function to use
• Select the parameters of the kernel function and the value of C
  – You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters
• Execute the training algorithm and obtain the α_i
• Unseen data can be classified using the α_i and the support vectors
Strengths and Weaknesses of SVM

• Strengths
  – Training is relatively easy
  – No local optima, unlike in neural networks
  – It scales relatively well to high-dimensional data
  – The tradeoff between classifier complexity and error can be controlled explicitly
  – Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors
• Weaknesses
  – Need to choose a “good” kernel function.
Other Types of Kernel Methods

• A lesson learnt in SVM: a linear algorithm in the feature space is equivalent to a non-linear algorithm in the input space
• Standard linear algorithms can be generalized to their non-linear versions by going to the feature space
  – Kernel principal component analysis, kernel independent component analysis, kernel canonical correlation analysis, kernel k-means, and 1-class SVM are some examples
Conclusion

• SVM is a useful alternative to neural networks
• Two key concepts of SVM: maximize the margin and the kernel trick
• Many SVM implementations are available on the web for you to try on your data set!
Resources

• http://www.kernel-machines.org/
• http://www.support-vector.net/
• http://www.support-vector.net/icml-tutorial.pdf
• http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
• http://www.clopinet.com/isabelle/Projects/SVM/applist.html
Relevance Vector Machine (RVM)
• RVM for regression
• RVM for classification
Motivations
SVM has the following problems:
RVM for Regression

• Traditional Regression:

  y(x, w) = Σ_{i=1}^N w_i φ_i(x)

• SVM Regression:

  y(x, w) = Σ_{i=1}^N w_i k(x, x_i)

  The number of weights is equal to the number of training samples N, i.e., one w_i for each sample.

RVM for Regression (cont’d)

α_i is the hyperparameter, one α_i for each w_i. The target value t for input x follows a Gaussian distribution:

  t ~ N(y(x, w), β⁻¹)
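The "one weight per training sample" model y(x, w) = Σ_i w_i k(x, x_i) amounts to a linear model over an N-column kernel design matrix. A small sketch (the RBF kernel choice and the weight values are illustrative assumptions):

```python
import numpy as np

def design_matrix(X, centers, sigma=1.0):
    """Phi[n, i] = k(X[n], centers[i]) with an RBF kernel: one basis
    function (and hence one weight w_i) per training sample."""
    d2 = (X[:, None] - centers[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = np.array([0.0, 1.0, 2.0, 3.0])     # 1-D training inputs
Phi = design_matrix(X, X)              # N x N: one column per sample
w = np.array([0.5, 0.0, 0.0, 0.0])     # hypothetical weight vector

y_model = Phi @ w                      # y(x_n, w) = sum_i w_i k(x_n, x_i)
print(Phi.shape)                       # (4, 4)
```

RVM's sparsity comes from the per-weight hyperparameters α_i driving most w_i to zero during training, so only a few columns of Phi remain active.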
Bayesian RVM for Regression
Bayesian Approach (cont’d)
RVM for Classification