Support Vector Machine - ECSE · 4/3/2011
Chapter 7
Support Vector Machine
Table of Contents
• Margin and support vectors
• SVM formulation
• Slack variables and hinge loss
• SVM for multiple classes
• SVM with Kernels
• Relevance Vector Machine
Support Vector Machine (SVM)
• Like LDA, traditional SVM is a linear and binary
classifier
• Unlike LSQ and Fisher criterion, SVM
approaches the 2-class classification problem
using the concept of margin and support
vectors.
Margin and Support Vectors

Margin is defined to be the smallest distance between the decision boundary and any of the samples. Support vectors are data points located on the margin line.
Support Vector Machine
• Pick the decision boundary with
the largest margin!
• Linear hyperplane defined by
“support” vectors
• Moving other points does not
affect the decision boundary
• Only need to store the support
vectors to predict labels of new
points
Two-class Classification with Linear Model

  y(x) = w^T x + b
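A minimal sketch of this decision rule (the weights, bias, and points below are hypothetical, chosen only to illustrate the sign test):

```python
import numpy as np

def predict(w, b, X):
    """Classify rows of X with the linear model y(x) = w^T x + b.

    Returns +1 or -1 per point, i.e. the sign of the decision value."""
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)

# Hypothetical 2-D example: decision boundary x1 + x2 - 3 = 0
w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[2.0, 2.0],    # y(x) = 1  -> class +1
              [0.5, 0.5]])   # y(x) = -2 -> class -1
print(predict(w, b, X))      # [ 1 -1]
```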
SVM Formulation

• Two-class classification with the linear model is

  y(x) = w^T x + b

• Given the target t_n ∈ {-1, +1}, the distance of a point x_n to the decision surface is given by

  t_n y(x_n) / ||w|| = t_n (w^T x_n + b) / ||w||

• SVM is to find the model parameters w, b by maximizing the margin, i.e.,

  w*, b* = arg max_{w,b} { min_n [ t_n (w^T x_n + b) / ||w|| ] }
Maximizing the Margin

  max_{w,b} γ = 2a / ||w||   s.t.  t_n (w^T x_n + b) ≥ a, ∀n

2 is added for mathematical convenience.
Support Vector Machines

Let a = 1:

  max_{w,b} γ = 2 / ||w||   s.t.  t_n (w^T x_n + b) ≥ 1, ∀n

This is equivalent to

  min_{w,b} (1/2) ||w||   s.t.  t_n (w^T x_n + b) ≥ 1, ∀n

and hence to

  min_{w,b} (1/2) ||w||² = (1/2) w^T w   s.t.  t_n (w^T x_n + b) ≥ 1, ∀n
Solving SVM

  min_{w,b} (1/2) ||w||² = (1/2) w^T w   s.t.  t_n (w^T x_n + b) ≥ 1, ∀n

This can now be solved by standard quadratic programming. Introducing the Lagrange multipliers a_n, we have

  L(w, b, a) = (1/2) w^T w − Σ_{n=1}^N a_n [t_n (w^T x_n + b) − 1]

Setting the derivatives of L with respect to w and b to zero yields

  w = Σ_{n=1}^N a_n t_n x_n,   Σ_{n=1}^N a_n t_n = 0,   b = t_n − w^T x_n (for any support vector x_n)

Only a few a_n are greater than 0, corresponding to the support vectors. N_SV is the number of support vectors.
Solving SVM (cont’d)

In the solution above, w = Σ_{n=1}^N a_n t_n x_n and b still depend on the unknown a_n. To solve for a_n, we substitute w and Σ_n a_n t_n = 0 into L(w, b, a), yielding

  L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m} a_n a_m t_n t_m x_n^T x_m
       = Σ_{n=1}^N a_n − (1/2) Σ_{n,m} a_n a_m t_n t_m k(x_n, x_m)

where k(x_n, x_m) = x_n^T x_m is the kernel function.

a_n can be solved by maximizing L(a), i.e.,

  a* = arg max_a L(a)   subject to  a_n ≥ 0  and  Σ_{n=1}^N a_n t_n = 0

QP solves the optimization to get the global optimum of a_n, then w and b.
Characteristics of the Solution

• Many of the α_n are zero
  – w is a linear combination of a small number of data points
  – This “sparse” representation can be viewed as data compression, as in the construction of the KNN classifier
• x_n with non-zero α_n are called support vectors (SV)
  – The decision boundary is determined only by the SVs
  – Let n (n = 1, ..., s) be the indices of the s support vectors. We can write

    w = Σ_{n=1}^s a_n t_n x_n,   b = t_n − w^T x_n

• For testing with a new data point z
  – Compute w^T z + b and classify z as class 1 if the sum is positive, and class 2 otherwise
  – Note: w need not be formed explicitly
A Geometrical Interpretation

(figure: two classes separated by the maximum-margin boundary; the support vectors carry non-zero multipliers α1 = 0.8, α6 = 1.4, α8 = 0.6, while all other points have α = 0)
The Quadratic Programming Problem

• Many approaches have been proposed
  – Loqo, cplex, etc. (see http://www.numerical.rl.ac.uk/qp/qp.html)
• Most are “interior-point” methods
  – Start with an initial solution that can violate the constraints
  – Improve this solution by optimizing the objective function and/or reducing the amount of constraint violation
• For SVM, sequential minimal optimization (SMO) seems to be the most popular
  – A QP with two variables is trivial to solve
  – Each iteration of SMO picks a pair of (α_i, α_j) and solves the QP with these two variables; repeat until convergence
  – In practice, we can just regard the QP solver as a “black box” without bothering about how it works
Data is still not linearly separable – Soft Margin

  min_{w,b,ξ} (1/2) w^T w + C Σ_n ξ_n   s.t.  t_n (w^T x_n + b) ≥ 1 − ξ_n, ∀n
The Soft Margin method will choose a hyperplane that splits the
examples as cleanly as possible, while still maximizing the distance
to the nearest cleanly split examples.
Slack Variables – Hinge Loss

  ξ_n = [1 − t_n (w^T x_n + b)]_+

  min_{w,b} (1/2) w^T w + C Σ_n ξ_n   s.t.  t_n (w^T x_n + b) ≥ 1 − ξ_n, ∀n

(figure: the hinge loss plotted as a function of t_n (w^T x_n + b))
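A minimal NumPy sketch of the slack computation ξ_n = [1 − t_n(w^T x_n + b)]_+ (the decision values below are hypothetical):

```python
import numpy as np

def hinge_slack(t, scores):
    """Slack xi_n = max(0, 1 - t_n * y(x_n)) for targets t in {-1, +1}
    and decision values y(x_n) = w^T x_n + b."""
    return np.maximum(0.0, 1.0 - t * scores)

t = np.array([1.0, 1.0, -1.0])
scores = np.array([2.0,    # correct, outside the margin   -> xi = 0
                   0.3,    # correct but inside the margin -> xi = 0.7
                   0.5])   # misclassified                 -> xi = 1.5
print(hinge_slack(t, scores))  # [0.  0.7 1.5]
```

Points classified correctly and outside the margin incur no penalty; only margin violations contribute to the C Σ ξ_n term.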
Soft Margin SVM

  min (1/2) ||w||² + C Σ_i ξ_i
  s.t.  t_i (w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
The Lagrangian for Soft Margin SVM

  L(w, b, ξ, a, r) = (1/2) ||w||² + C Σ_i ξ_i − Σ_i a_i {t_i (w^T x_i + b) − 1 + ξ_i} − Σ_i r_i ξ_i
Take the Derivative

  ∂L/∂w = ∂L/∂b = ∂L/∂ξ = 0

which gives

  w = Σ_i a_i t_i x_i,   Σ_i a_i t_i = 0,   a_i = C − r_i
KKT Condition

• The solution satisfies the KKT conditions

  a_i {t_i (w^T x_i + b) − 1 + ξ_i} = 0,   r_i ξ_i = 0
Plug Back into the Original Problem

  min_a (1/2) Σ_{i,j} a_i a_j t_i t_j x_i^T x_j − Σ_i a_i
  s.t.  0 ≤ a_i ≤ C ∀i,   Σ_i a_i t_i = 0
Multiple-Class SVM
• One possibility is to use N two-way discriminant functions: one-versus-the-rest
  – Each function discriminates one class from the rest. This produces N binary classifiers.
• Another possibility is to use N(N-1)/2 two-way discriminant functions: one-versus-one
  – Each function discriminates between two particular classes.
• Single Multi-class SVM
One-versus-the-rest
One-versus-one
• Another approach is to train N(N−1)/2 different 2-class SVMs on all possible pairs of classes, and then to classify test points according to which class has the highest number of ‘votes’
• This can lead to ambiguities in the resulting classification
• It requires significantly more training time for large N
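The pairwise-voting scheme can be sketched in a few lines; the pairwise deciders here are hypothetical stand-ins for trained two-class SVMs:

```python
from itertools import combinations

def one_vs_one_predict(z, classes, pairwise_decide):
    """Vote among the N(N-1)/2 pairwise classifiers.

    pairwise_decide(i, j, z) returns the winning class (i or j) for point z;
    the class with the most votes overall wins."""
    votes = {c: 0 for c in classes}
    for i, j in combinations(classes, 2):
        votes[pairwise_decide(i, j, z)] += 1
    return max(votes, key=votes.get)

# Toy stand-in for trained 2-class SVMs: the pairwise rule picks whichever
# class "center" (k + 0.5 on the real line) is nearer to z.
decide = lambda i, j, z: i if abs(z - (i + 0.5)) <= abs(z - (j + 0.5)) else j
print(one_vs_one_predict(1.2, [0, 1, 2], decide))  # 1
```

Ties in the vote count are exactly the ambiguity mentioned above; real implementations break them with the decision values.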
Single Multi-class SVM
Multi-class SVM
• Although the application of SVMs to multiclass classification problems remains an open issue, in practice the one-versus-the-rest approach is the most widely used in spite of its ad-hoc formulation and its practical limitations.
SVM with Kernels for
Non-linear Classification
• The original optimal hyperplane was a linear classifier, which cannot be directly applied to classes that are not linearly separable
• Kernel trick was introduced to create nonlinear SVM classifiers
• This allows the algorithm to fit the maximum-margin hyperplane in a high dimensional transformed feature space, where the classes are linearly separable.
Extension to Non-linear Decision Boundary

• Key idea: transform x_i to a higher-dimensional space to “make life easier”
  – Input space: the space where the points x_i are located
  – Feature space: the space of φ(x_i) after transformation
• Why transform?
  – A linear operation in the feature space is equivalent to a non-linear operation in the input space
  – Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1x2 makes the problem linearly separable
The Kernel Trick

• Recall the SVM optimization problem

  max_a L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m} a_n a_m t_n t_m x_n^T x_m
  subject to  C ≥ a_n ≥ 0  and  Σ_{n=1}^N a_n t_n = 0

• The data points only appear as inner products
• As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly. The computational expense of increased dimensionality is avoided.
• Many common geometric operations (angles, distances) can be expressed by inner products
• Define the kernel function K by K(x, y) = φ(x)^T φ(y)
4/3/2011 CSE 802. Prepared by Martin Law 38
Map from input 2D to 3D feature space
An Example for φ(.) and K(.,.)

• Suppose φ(.) is given as follows:

  φ([x1, x2]) = (1, √2 x1, √2 x2, x1², x2², √2 x1 x2)

• An inner product in the feature space is

  φ(x)^T φ(y) = (1 + x1 y1 + x2 y2)²

• So, if we define the kernel function as K(x, y) = (1 + x1 y1 + x2 y2)², there is no need to carry out φ(.) explicitly
• This use of the kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick
Examples of Kernel Functions

• Polynomial kernel with degree d: K(x, y) = (x^T y + 1)^d
• Radial basis function kernel with width σ: K(x, y) = exp(−||x − y||² / (2σ²))
  – Closely related to radial basis function neural networks
  – The feature space is infinite-dimensional
• Sigmoid with parameter κ and θ: K(x, y) = tanh(κ x^T y + θ)
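These kernels can be written directly in NumPy (a short sketch; the parameter names d, sigma, kappa, theta follow the slide):

```python
import numpy as np

def poly_kernel(x, y, d=2):
    """Polynomial kernel of degree d: (x^T y + 1)^d."""
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis function kernel with width sigma."""
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """Sigmoid kernel: tanh(kappa * x^T y + theta)."""
    return np.tanh(kappa * np.dot(x, y) + theta)

x, y = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(poly_kernel(x, y))   # (1 + 1)^2 = 4.0
print(rbf_kernel(x, x))    # 1.0 (distance zero)
```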
Modification Due to Kernel Function

• Change all inner products to kernel functions
• For training,

Original:

  max_a L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m} a_n a_m t_n t_m x_n^T x_m
  subject to  C ≥ a_n ≥ 0  and  Σ_{n=1}^N a_n t_n = 0

With kernel function:

  max_a L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m} a_n a_m t_n t_m k(x_n, x_m)
  subject to  C ≥ a_n ≥ 0  and  Σ_{n=1}^N a_n t_n = 0
Modification Due to Kernel Function

• For testing, the new data z is classified as class 1 if f ≥ 0, and as class -1 if f < 0

Original:

  f(z) = w^T z + b,   with  w = Σ_{n=1}^s a_n t_n x_n,  b = t_n − w^T x_n

With kernel function:

  f(z) = Σ_{n∈SV} a_n t_n K(x_n, z) + b,   b = (1/N_SV) Σ_{n∈SV} [ t_n − Σ_{m∈SV} a_m t_m K(x_m, x_n) ]
Example

• Suppose we have five 1-D data points
  – x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2 ⇒ y1=1, y2=1, y3=-1, y4=-1, y5=1
• We use the polynomial kernel of degree 2
  – K(x, y) = (xy + 1)²
  – C is set to 100
• We first find α_i (i = 1, ..., 5) by solving the QP problem
Example

• By using a QP solver, we get
  α1 = 0, α2 = 2.5, α3 = 0, α4 = 7.333, α5 = 4.833
• Note that the constraints are indeed satisfied
• The support vectors are {x2=2, x4=5, x5=6}
• The discriminant function is

  f(z) = 2.5 (2z+1)² − 7.333 (5z+1)² + 4.833 (6z+1)² + b = 0.6667 z² − 5.333 z + b

• b is recovered by solving f(2)=1, or f(5)=-1, or f(6)=1, as x2 and x5 lie on the margin line t=+1 and x4 lies on the margin line t=-1
• All three give b=9
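The worked example can be checked numerically. The sketch below plugs the quoted multipliers into f(z) = Σ α_n y_n K(x_n, z) + b and recovers b = 9 from f(2) = 1 (the fractions 22/3 and 29/6 are the exact forms of the quoted 7.333 and 4.833):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
alpha = np.array([0.0, 2.5, 0.0, 22.0 / 3.0, 29.0 / 6.0])  # QP solution

K = lambda a, b: (a * b + 1.0) ** 2       # degree-2 polynomial kernel

def f(z, b):
    """Discriminant f(z) = sum_n alpha_n y_n K(x_n, z) + b."""
    return np.sum(alpha * y * K(x, z)) + b

# Recover b from the support vector x2 = 2 on the t = +1 margin: f(2) = 1
b = 1.0 - np.sum(alpha * y * K(x, 2.0))
print(round(b, 4))          # 9.0
print(round(f(5.0, b), 4))  # -1.0 (x4 = 5 lies on the t = -1 margin)
```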
Example

(figure: the discriminant function f(z) plotted over the data points 1, 2, 4, 5, 6; points 1, 2, and 6 fall in class 1, points 4 and 5 in class 2)
SVM Learning via Quadratic Programming

Rewrite the above expressions as a standard quadratic programming problem. The dual

  max_a L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n,m} a_n a_m t_n t_m k(x_n, x_m)
  subject to  C ≥ a_n ≥ 0  and  Σ_{n=1}^N a_n t_n = 0

becomes

  min_a (1/2) a^T H a − e^T a
  subject to  t^T a = 0  and  0 ≤ a_n ≤ C

where C is the soft-margin weight parameter, a = [a1, a2, ..., aN]^t, t = [t1, t2, ..., tN]^t, e_{N×1} = [1, 1, ..., 1]^t, and H is an N×N matrix whose elements are H_{m,n} = t_m t_n k(x_m, x_n). The constrained quadratic minimization problem above can be solved by the Matlab function

  quadprog(H, −e, [], [], t^T, 0, [0, 0, ..., 0]_{N×1}, [100, 100, ..., 100]_{N×1})

In this command, the first two parameters, H and −e, are the objective function coefficients; the next two parameters are the inequality constraints, left blank [], [] in this example; the next two parameters, t^T and 0, are the equality constraint; and the last two parameters, [0, 0, ..., 0] and [100, 100, ..., 100], are the lower and upper bounds of the variables.
MATLAB Demo

The Matlab code:

  x = [1 2 4 5 6];
  y = [1 1 -1 -1 1];
  H = zeros(5,5);
  for i = 1:5
      for j = 1:5
          H(i,j) = y(i)*y(j)*(x(i)*x(j)+1)^2;
      end
  end
  e = ones(5,1);
  lb = zeros(5,1);
  ub = 100*ones(5,1);
  Astar = quadprog(H,-e,[],[],y,0,lb,ub);

The output result:

  Astar = [0; 2.5000; 0; 7.3333; 4.8333]
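For readers without Matlab, a rough Python equivalent is sketched below, assuming SciPy is available and substituting `scipy.optimize.minimize` (SLSQP) for `quadprog`:

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])

# H(i,j) = y_i y_j (x_i x_j + 1)^2, the degree-2 polynomial kernel matrix
H = np.outer(y, y) * (np.outer(x, x) + 1.0) ** 2
e = np.ones(5)

# min (1/2) a^T H a - e^T a   s.t.  y^T a = 0,  0 <= a <= 100
obj = lambda a: 0.5 * a @ H @ a - e @ a
res = minimize(obj, np.zeros(5), method="SLSQP",
               jac=lambda a: H @ a - e,
               bounds=[(0.0, 100.0)] * 5,
               constraints={"type": "eq", "fun": lambda a: y @ a})
print(np.round(res.x, 4))  # approx [0, 2.5, 0, 7.3333, 4.8333]
```

A general-purpose NLP solver stands in for a dedicated QP routine here; on a convex problem this small both reach the same global optimum.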
Choosing the Kernel Function

• Probably the most tricky part of using SVM.
• The kernel function is important because it creates the kernel matrix, which summarizes all the data
• Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, ...)
• There is even research to estimate the kernel matrix from available information
  – In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try
• Note that SVM with an RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions automatically chosen for SVM
SVM Parameter selection
• The effectiveness of SVM depends on the selection of kernel, the kernel's parameters, and soft margin parameter C.
• Typically, each combination of parameter choices is checked using cross validation, and the parameters with best cross-validation accuracy are picked.
• The final model, which is used for testing and for classifying new data, is then trained on the whole training set using the selected parameters.
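An illustrative sketch of this selection loop, assuming scikit-learn is available (the toy data and grid values are hypothetical, not recommendations):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # linearly separable toy labels

# Every (kernel parameter, C) combination is scored by cross-validation
grid = {"C": [0.1, 1, 10, 100],
        "gamma": [0.01, 0.1, 1],   # RBF width parameter
        "kernel": ["rbf"]}
search = GridSearchCV(SVC(), grid, cv=5)
search.fit(X, t)   # refits the final model on all data with the best params
print(search.best_params_)
```

With `refit=True` (the default), `search` already holds the final model trained on the whole training set using the selected parameters, as described above.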
Software

• A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
• Some implementations (such as LIBSVM) can handle multi-class classification
• SVMLight is among the earliest implementations of SVM
• Several Matlab toolboxes for SVM are also available
Summary: Steps for Classification

• Select the kernel function to use
• Select the parameters of the kernel function and the value of C
  – You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters
• Execute the training algorithm and obtain the α_i
• Unseen data can be classified using the α_i and the support vectors
Strengths and Weaknesses of SVM

• Strengths
  – Training is relatively easy
  – No local optima, unlike in neural networks
  – It scales relatively well to high-dimensional data
  – The tradeoff between classifier complexity and error can be controlled explicitly
  – Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors
• Weaknesses
  – Need to choose a “good” kernel function.
Other Types of Kernel Methods

• A lesson learnt in SVM: a linear algorithm in the feature space is equivalent to a non-linear algorithm in the input space
• Standard linear algorithms can be generalized to their non-linear versions by going to the feature space
  – Kernel principal component analysis, kernel independent component analysis, kernel canonical correlation analysis, kernel k-means, and 1-class SVM are some examples
Conclusion

• SVM is a useful alternative to neural networks
• Two key concepts of SVM: maximize the margin and the kernel trick
• Many SVM implementations are available on the web for you to try on your data set!
Resources

• http://www.kernel-machines.org/
• http://www.support-vector.net/
• http://www.support-vector.net/icml-tutorial.pdf
• http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
• http://www.clopinet.com/isabelle/Projects/SVM/applist.html
Relevance Vector Machine (RVM)
• RVM for regression
• RVM for classification
Motivations
SVM has the following problems:
RVM for Regression

• Traditional Regression:

  y(x, w) = Σ_{i=1}^N w_i φ_i(x)

• SVM Regression:

  y(x, w) = Σ_{i=1}^N w_i k(x, x_i)

  The number of weights is equal to the number of training samples N, i.e., one w_i for each sample.

RVM for Regression (cont’d)

α_i is the hyperparameter, one α_i for each w_i. The target value t for input x follows a Gaussian distribution:

  t ~ N(y(x, w), β⁻¹)
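The "one weight per training sample" model y(x, w) = Σ_i w_i k(x, x_i) amounts to a linear model over an N-column kernel design matrix. A small sketch (the RBF kernel choice and the weight values are illustrative assumptions):

```python
import numpy as np

def design_matrix(X, centers, sigma=1.0):
    """Phi[n, i] = k(X[n], centers[i]) with an RBF kernel: one basis
    function (and hence one weight w_i) per training sample."""
    d2 = (X[:, None] - centers[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = np.array([0.0, 1.0, 2.0, 3.0])     # 1-D training inputs
Phi = design_matrix(X, X)              # N x N: one column per sample
w = np.array([0.5, 0.0, 0.0, 0.0])     # hypothetical weight vector

y_model = Phi @ w                      # y(x_n, w) = sum_i w_i k(x_n, x_i)
print(Phi.shape)                       # (4, 4)
```

RVM's sparsity comes from the per-weight hyperparameters α_i driving most w_i to zero during training, so only a few columns of Phi remain active.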
Bayesian RVM for Regression
Bayesian Approach (cont’d)
RVM for Classification