Support Vector Machines for Classification


Description

In machine learning, support vector machines (SVMs, also support vector networks[1]) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier.

Transcript of Support Vector Machines for Classification

Support Vector Machines
Prakash B. Pimpale
CDAC Mumbai

(C) CDAC Mumbai Workshop on Machine Learning

Outline
o Introduction
o Towards SVM
o Basic Concept
o Implementations
o Issues
o Conclusion & References

Introduction:
o SVMs: supervised learning methods for classification and regression
o Basis: Vapnik-Chervonenkis theory
o First practical implementations: early nineties
o Satisfying from a theoretical point of view
o Can lead to high performance in practical applications
o Currently considered one of the most efficient families of algorithms in machine learning

Towards SVM
A: I found a really good function describing the training examples using an ANN, but it couldn't classify test examples that efficiently. What could be the problem?
B: It didn't generalize well!
A: What should I do now?
B: Try SVM!
A: Why?
B: SVM 1) generalizes well.
And what's more…
2) It is computationally efficient (just a convex optimization problem).
3) It is robust in high dimensions too (no overfitting).
A: Why is it so?
B: So many questions…!

o Vapnik & Chervonenkis' Statistical Learning Theory result: it relates the ability to learn a rule for classifying training data to the ability of the resulting rule to classify unseen examples (generalization).

o Let f ∈ F be a rule.
o Empirical risk of f, R_emp(f): a measure of the quality of classification on the training data.
o Best performance: R_emp(f) = 0
o Worst performance: R_emp(f) = 1

What about the generalization?
o Risk of classifier f: the probability that rule f makes a mistake on a new randomly generated sample:

R(f) = P(f(x) ≠ y)

o Best generalization: R(f) = 0
o Worst generalization: R(f) = 1
o Often, a small empirical risk implies a small risk.

Is the problem solved? …….. NO!
o Is the risk of the rule f̂ selected by Empirical Risk Minimization (ERM) near that of the ideal rule f*?
o No, not in the case of overfitting.
o An important result of Statistical Learning Theory:

E[R(f̂)] ≤ R(f*) + C sqrt(V(F) / N)

where V(F) is the VC dimension of class F, N is the number of observations for training, and C is a universal constant.

What it says:
o The risk of the rule selected by ERM is not far from the risk of the ideal rule if:
1) N is large enough, and
2) the VC dimension of F is small enough.

[VC dimension? In short: the larger the class F, the larger its VC dimension. (Sorry, Vapnik sir!)]

Structural Risk Minimization (SRM)
o Consider a nested family of subclasses of F:

F_0 ⊂ F_1 ⊂ … ⊂ F_n ⊂ …  such that  V(F_0) ≤ V(F_1) ≤ … ≤ V(F_n) ≤ …

o Find the minimum empirical risk for each subclass and its VC dimension.
o Select the subclass with the minimum bound on the risk (i.e. the sum of the VC-dimension term and the empirical risk).

SRM graphically: [plot of the bound E[R(f̂)] ≤ R(f*) + C sqrt(V(F)/N) — empirical risk decreases and the VC-dimension term grows as the subclass gets larger; the minimum of their sum gives the best subclass.]

A: What does it have to do with SVM…?
B: SVM is an approximate implementation of SRM!
A: How?
B: Just in a simple way for now. Let's import a result: maximizing the distance of the decision boundary from the training points minimizes the VC dimension, resulting in good generalization!
A: So from now on our target is maximizing the distance between the decision boundary and the training points!
B: Yeah, right!
A: OK, I am convinced that SVM will generalize well, but can you please explain what the concept of SVM is and how to implement it? Are there any packages available?
B: Yeah, don't worry, there are many implementations available; just use them for your application. The next part of the presentation will give a basic idea about SVM, so stay with me!

Basic Concept of SVM:
o Which line will classify the unseen data well?
o The dotted line! It's the line with maximum margin!

Cont…
[Figure: the maximum-margin hyperplane W^T X + b = 0 with margin boundaries W^T X + b = +1 and W^T X + b = -1; the training points lying on the boundaries are the support vectors.]

Some definitions:
o Functional margin w.r.t.
1) an individual example: γ̂^(i) = Y^(i) (W^T X^(i) + b)
2) the example set S = {(X^(i), Y^(i)); i = 1, …, m}: γ̂ = min_{i=1,…,m} γ̂^(i)

o Geometric margin w.r.t.
1) an individual example: γ^(i) = Y^(i) ( (W/||W||)^T X^(i) + b/||W|| )
2) the example set S: γ = min_{i=1,…,m} γ^(i)
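The two margin definitions above are easy to check numerically. A minimal sketch with numpy, using a hypothetical hyperplane (W, b) and two labelled points (all values here are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical hyperplane (W, b) and two labelled examples
W = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 1.0],    # label +1
              [0.0, -1.0]])  # label -1
y = np.array([1.0, -1.0])

# Functional margins: gamma_hat_i = y_i * (W . x_i + b)
func_margins = y * (X @ W + b)

# Geometric margins: functional margins scaled by 1/||W||
geom_margins = func_margins / np.linalg.norm(W)

# Margins of the set S are the minima over all examples
gamma_hat = func_margins.min()
gamma = geom_margins.min()
```

Note that scaling (W, b) by any positive constant changes the functional margin but leaves the geometric margin unchanged, which is why the geometric margin is the quantity SVM maximizes.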

Problem Formulation:
[Figure: separating hyperplane W^T X + b = 0 with the margin boundaries W^T X + b = +1 and W^T X + b = -1.]

Cont..
o Distance of a point (u, v) from the line Ax + By + C = 0 is |Au + Bv + C| / ||n||, where ||n|| is the norm of the normal vector n = (A, B).
o Distance of the hyperplane from the origin = |b| / ||W||
o Distance of the plane W^T X + b = -1 (point A) from the origin = |b + 1| / ||W||
o Distance of the plane W^T X + b = +1 (point B) from the origin = |b - 1| / ||W||
o Distance between points A and B (the margin) = 2 / ||W||
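The distance formulas above can be verified with a small numpy sketch; the hyperplane below is a hypothetical example chosen so that ||W|| = 5:

```python
import numpy as np

# Hypothetical hyperplane W . x + b = 0 in 2-D, with ||W|| = 5
W = np.array([3.0, 4.0])
b = 5.0

def distance(point):
    # |W . point + b| / ||W||: distance of a point from the hyperplane
    return abs(W @ point + b) / np.linalg.norm(W)

origin_dist = distance(np.array([0.0, 0.0]))  # = |b| / ||W|| = 1.0
margin = 2.0 / np.linalg.norm(W)              # gap between W.x + b = +1 and -1
```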

Cont…
We have a data set d = {X^(i), Y^(i)}, i = 1, …, m, with X^(i) ∈ R^n and Y^(i) ∈ R^1,

and a separating hyperplane

W^T X + b = 0
s.t.
W^T X^(i) + b > 0 if Y^(i) = +1
W^T X^(i) + b < 0 if Y^(i) = -1

Cont…
o Suppose the training data also satisfy the following constraints:

W^T X^(i) + b ≥ +1 for Y^(i) = +1
W^T X^(i) + b ≤ -1 for Y^(i) = -1

Combining these into one:

Y^(i) (W^T X^(i) + b) ≥ 1 for all i

o Our objective is to find the hyperplane (W, b) with maximal separation between it and the closest data points while satisfying the above constraints.

THE PROBLEM:

max_{W,b} 2 / ||W||

such that

Y^(i) (W^T X^(i) + b) ≥ 1 for all i

Also we know ||W|| = sqrt(W^T W).

Cont..
So the problem can be written as:

min_{W,b} (1/2) W^T W

such that

Y^(i) (W^T X^(i) + b) ≥ 1 for all i

It is just a convex quadratic optimization problem!

Notice: W^T W = ||W||^2
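Since this is just a quadratic program, any generic constrained solver can handle it. A minimal sketch with scipy's SLSQP solver on a hypothetical toy data set (the data, and the use of scipy rather than a dedicated SVM package, are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical example)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):                  # p = [w1, w2, b]
    w = p[:2]
    return 0.5 * w @ w             # (1/2) W^T W

# One inequality constraint per example: y_i (W^T x_i + b) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda p, i=i: y[i] * (p[:2] @ X[i] + p[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
margin = 2.0 / np.linalg.norm(w)   # maximal margin achieved
```

For this symmetric data the solver recovers w = (0.25, 0.25) and b = 0, and the margin 2/||w|| equals the distance between the two closest opposite-class points.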

DUAL
o Solving the dual of our problem will let us apply SVM to nonlinearly separable data efficiently.
o It can be shown that

min primal = max_{α ≥ 0} min_{W,b} L(W, b, α)

o Primal problem:

min_{W,b} (1/2) W^T W

such that

Y^(i) (W^T X^(i) + b) ≥ 1 for all i

Constructing the Lagrangian
o Lagrangian for our problem:

L(W, b, α) = (1/2) ||W||^2 - Σ_{i=1}^{m} α_i [ Y^(i) (W^T X^(i) + b) - 1 ]

where α_i is a Lagrange multiplier and α_i ≥ 0.

o Now minimize it w.r.t. W and b: we set the derivatives of the Lagrangian w.r.t. W and b to zero.

Cont…
o Setting the derivative w.r.t. W to zero gives:

W = Σ_{i=1}^{m} α_i Y^(i) X^(i)

o Setting the derivative w.r.t. b to zero gives:

Σ_{i=1}^{m} α_i Y^(i) = 0

Cont…
o Plugging these results into the Lagrangian gives

L(W, b, α) = Σ_{i=1}^{m} α_i - (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j (X^(i))^T X^(j)

o Call it D(α):

D(α) = Σ_{i=1}^{m} α_i - (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j (X^(i))^T X^(j)

o This is the result of our minimization w.r.t. W and b.

So The DUAL:
o Now the dual becomes:

max_α D(α) = Σ_{i=1}^{m} α_i - (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j ⟨X^(i), X^(j)⟩
s.t. α_i ≥ 0, i = 1, …, m
     Σ_{i=1}^{m} α_i Y^(i) = 0

o Solving this optimization problem gives us the α_i.
o Also, the Karush-Kuhn-Tucker (KKT) conditions are satisfied at this solution, i.e.

α_i [ Y^(i) (W^T X^(i) + b) - 1 ] = 0 for i = 1, …, m

Values of W and b:
o W can be found using

W = Σ_{i=1}^{m} α_i Y^(i) X^(i)

o b can be found using:

b* = - ( max_{i: Y^(i) = -1} W*^T X^(i) + min_{i: Y^(i) = +1} W*^T X^(i) ) / 2
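The dual above is also a small quadratic program, so the same generic solver works; W and b are then recovered from the formulas just given. A sketch with scipy on the same hypothetical toy data used earlier (an illustrative assumption, not the slides' own code):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical example)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# G_ij = Y_i Y_j <X_i, X_j>
G = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(a):                   # maximizing D(a) == minimizing -D(a)
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, np.zeros(m),
               bounds=[(0.0, None)] * m,                            # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum alpha_i Y_i = 0
alpha = res.x

# W = sum_i alpha_i Y_i X_i; examples with alpha_i > 0 are the support vectors
w = (alpha * y) @ X
b = -(max(X[y == -1.0] @ w) + min(X[y == 1.0] @ w)) / 2.0
```

Only the two points closest to the boundary end up with nonzero α_i, illustrating why they are called support vectors.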

−=

What if data is nonlinearly separable?
o The maximal margin hyperplane can classify only linearly separable data.
o What if the data is not linearly separable?
o Map your data to a space where it is linearly separable (a higher-dimensional space) and use the maximal margin hyperplane there!

Taking it to a higher dimension works! Ex. XOR

Doing it in higher dimensional space
o Let Φ : X → F be a nonlinear mapping from the input space X (the original space) to a feature space F (higher dimensional).
o Then our inner (dot) product ⟨X^(i), X^(j)⟩ in the higher-dimensional space is ⟨φ(X^(i)), φ(X^(j))⟩.
o Now, the problem becomes:

max_α D(α) = Σ_{i=1}^{m} α_i - (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j ⟨φ(X^(i)), φ(X^(j))⟩
s.t. α_i ≥ 0, i = 1, …, m
     Σ_{i=1}^{m} α_i Y^(i) = 0

Kernel function:
o There exists a way to compute the inner product in feature space as a function of the original input points: the kernel function!
o Kernel function:

K(x, z) = ⟨φ(x), φ(z)⟩

o We need not know φ to compute K(x, z).

An example:
For n = 3, the feature mapping φ is given as:

φ(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3)^T

Let x, z ∈ R^n and K(x, z) = (x^T z)^2. Then:

K(x, z) = (Σ_{i=1}^{n} x_i z_i)(Σ_{j=1}^{n} x_j z_j)
        = Σ_{i=1}^{n} Σ_{j=1}^{n} x_i x_j z_i z_j
        = Σ_{i,j=1}^{n} (x_i x_j)(z_i z_j)
        = φ(x)^T φ(z)

So K(x, z) = ⟨φ(x), φ(z)⟩.

example cont…
o Here, for x = (1, 2)^T and z = (3, 4)^T (using the n = 2 mapping φ(x) = (x1x1, x1x2, x2x1, x2x2)^T):

K(x, z) = (x^T z)^2 = (1·3 + 2·4)^2 = 11^2 = 121

φ(x) = (1, 2, 2, 4)^T,  φ(z) = (9, 12, 12, 16)^T

φ(x)^T φ(z) = 9 + 24 + 24 + 64 = 121 = K(x, z)
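The worked example can be checked mechanically: computing K in the original space and the dot product in the explicit feature space must agree. A short numpy sketch:

```python
import numpy as np

def phi(x):
    # Explicit feature map for K(x, z) = (x^T z)^2:
    # all pairwise products x_i * x_j, flattened into one vector
    return np.outer(x, x).ravel()

def K(x, z):
    # Kernel computed directly in the original (low-dimensional) space
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

lhs = K(x, z)          # 2-D computation: (1*3 + 2*4)^2 = 121
rhs = phi(x) @ phi(z)  # 4-D feature-space computation, same value
```

The point of the kernel trick is that `K` costs O(n) here while the explicit map costs O(n^2), and for some kernels (like the RBF) the feature space is infinite-dimensional, so the explicit map is not computable at all.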

So our SVM for the non-linearly separable data:
o Optimization problem:

max_α D(α) = Σ_{i=1}^{m} α_i - (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j K(X^(i), X^(j))
s.t. α_i ≥ 0, i = 1, …, m
     Σ_{i=1}^{m} α_i Y^(i) = 0

o Decision function:

F(X) = Sign( Σ_{i=1}^{m} α_i Y^(i) K(X^(i), X) + b )
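Putting the kernelized dual and the decision function together, the XOR example mentioned earlier can be solved end to end. A sketch assuming scipy as a generic QP solver and a degree-2 polynomial kernel (both illustrative choices, not prescribed by the slides):

```python
import numpy as np
from scipy.optimize import minimize

# XOR-like data: not linearly separable in the input space
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

def K(u, v):                         # polynomial kernel of degree 2
    return (u @ v + 1.0) ** 2

Kmat = np.array([[K(X[i], X[j]) for j in range(m)] for i in range(m)])
G = (y[:, None] * y[None, :]) * Kmat

# Solve the kernelized dual: max D(alpha), alpha_i >= 0, sum alpha_i Y_i = 0
res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(),
               np.zeros(m),
               bounds=[(0.0, None)] * m,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

# b from a support vector k (alpha_k > 0): b = Y_k - sum_i alpha_i Y_i K(X_i, X_k)
k = int(np.argmax(alpha))
b = y[k] - (alpha * y) @ Kmat[:, k]

def F(x_new):                        # decision function F(X)
    return np.sign((alpha * y) @ np.array([K(X[i], x_new) for i in range(m)]) + b)

preds = np.array([F(X[i]) for i in range(m)])
```

All four training points are classified correctly even though no line in the original 2-D space separates them, since the degree-2 kernel implicitly supplies the x1·x2 feature that determines the XOR label.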

Some commonly used kernel functions:

o Linear: K(X, Y) = X^T Y

o Polynomial of degree d: K(X, Y) = (X^T Y + 1)^d

o Gaussian Radial Basis Function (RBF): K(X, Y) = exp( -||X - Y||^2 / (2σ^2) )

o Tanh kernel: K(X, Y) = tanh( ρ(X^T Y) - δ )
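For concreteness, the four kernels above translate directly into numpy; the parameter defaults (d, σ, ρ, δ) below are arbitrary illustrative choices:

```python
import numpy as np

def linear(x, z):
    # K(X, Y) = X^T Y
    return x @ z

def polynomial(x, z, d=2):
    # K(X, Y) = (X^T Y + 1)^d
    return (x @ z + 1.0) ** d

def rbf(x, z, sigma=1.0):
    # K(X, Y) = exp(-||X - Y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def tanh_kernel(x, z, rho=1.0, delta=0.0):
    # K(X, Y) = tanh(rho (X^T Y) - delta)
    return np.tanh(rho * (x @ z) - delta)

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
```

Note the RBF kernel always satisfies K(x, x) = 1, while the tanh kernel is not positive semi-definite for all parameter choices, so it is only a valid kernel for some (ρ, δ).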

Implementations:
Some ready-to-use SVM implementations:
1) LIBSVM: a library for SVMs by Chih-Chung Chang and Chih-Jen Lin (at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
2) SVMlight: an implementation in C by Thorsten Joachims (at: http://svmlight.joachims.org/)
3) Weka: a data mining software in Java by the University of Waikato (at: http://www.cs.waikato.ac.nz/ml/weka/)
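Beyond the three packages listed, scikit-learn (an addition here, not in the slides' list) exposes LIBSVM through its `SVC` class, so using a ready-made implementation can be as short as:

```python
# Minimal sketch using scikit-learn, whose SVC class wraps LIBSVM internally.
import numpy as np
from sklearn.svm import SVC

# The XOR pattern from the earlier slides
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# Degree-2 polynomial kernel; a very large C approximates the hard margin
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1e6)
clf.fit(X, y)
preds = clf.predict(X)
```

The fitted model's `support_vectors_` and `dual_coef_` attributes correspond to the X^(i) with α_i > 0 and the products α_i Y^(i) from the dual derived earlier.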

Issues:
o Selecting a suitable kernel: most of the time it is trial and error.
o Multiclass classification: one decision function for each class (one class vs. the other l - 1) and then finding the one with the maximum value; i.e. if X belongs to class 1, then for this and the other (l - 1) classes the values of the decision functions should be:

F_1(X) ≥ +1
F_2(X) ≤ -1
.
.
F_l(X) ≤ -1
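The one-vs-rest rule above reduces to an argmax over the l decision values. A tiny numpy sketch with hypothetical decision values (illustrative numbers, not from any trained model):

```python
import numpy as np

# Hypothetical one-vs-rest decision values F_k(X) for l = 3 classes, 2 samples
scores = np.array([[ 1.3, -0.8, -1.1],   # sample 1: F_1 is largest
                   [-1.0, -0.2,  0.7]])  # sample 2: F_3 is largest

# Predicted class = index of the maximum decision value per sample
predicted_class = scores.argmax(axis=1)
```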

Cont….
o Sensitive to noise: mislabeled data can badly affect performance.
o Good performance for applications such as:
1) computational biology and medical applications (protein and cancer classification problems)
2) image classification
3) hand-written character recognition
and many others…
o Use SVM for high-dimensional, linearly separable data (its strength); for nonlinearly separable data, performance depends on the choice of kernel.

Conclusion:
Support Vector Machines provide a very simple method for linear classification. But performance, in the case of nonlinearly separable data, largely depends on the choice of kernel!

References:
o Nello Cristianini and John Shawe-Taylor (2000). An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.
o Christopher J.C. Burges (1998). A tutorial on Support Vector Machines for pattern recognition. In Usama Fayyad, editor, Data Mining and Knowledge Discovery, 2, 121-167. Kluwer Academic Publishers, Boston.
o Andrew Ng (2007). CS229 Lecture Notes. Stanford Engineering Everywhere, Stanford University.
o Support Vector Machines <http://www.svms.org> (Accessed 10.11.2008)
o Wikipedia
o Kernel-Machines.org <http://www.kernel-machines.org> (Accessed 10.11.2008)

Thank You!


[email protected] ; [email protected]