Support Vector Machines for Classification


Description

In machine learning, support vector machines (SVMs, also support vector networks[1]) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier.

Transcript of Support Vector Machines for Classification

Support Vector Machines
Prakash B. Pimpale
CDAC Mumbai

(C) CDAC Mumbai Workshop on Machine Learning

Outline
o Introduction
o Towards SVM
o Basic Concept
o Implementations
o Issues
o Conclusion & References

Introduction:
o SVMs: supervised learning methods for classification and regression
o Basis: Vapnik-Chervonenkis theory
o First practical implementations: early nineties
o Satisfying from a theoretical point of view
o Can lead to high performance in practical applications
o Currently considered one of the most efficient families of algorithms in machine learning

Towards SVM
A: I found a really good function describing the training examples using an ANN, but it couldn't classify test examples that efficiently. What could be the problem?
B: It didn't generalize well!
A: What should I do now?
B: Try SVM!
A: Why?
B: SVM 1) generalizes well.
And what's more…
2) It is computationally efficient (just a convex optimization problem).
3) It is robust in high dimensions too (no overfitting).
A: Why is it so?
B: So many questions…!

o Vapnik & Chervonenkis' Statistical Learning Theory result: it relates the ability to learn a rule for classifying training data to the ability of the resulting rule to classify unseen examples (generalization).

o Let f ∈ F be a rule.
o Empirical risk of f, R_emp(f): a measure of the quality of classification on the training data.
o Best performance: R_emp(f) = 0
o Worst performance: R_emp(f) = 1

What about the generalization?
o Risk of classifier f: the probability that rule f makes a mistake on a new randomly generated sample:

R(f) = P(f(x) ≠ y)

o Best generalization: R(f) = 0
o Worst generalization: R(f) = 1
o Often, a small empirical risk implies a small risk.

Is the problem solved? …….. NO!
o Is the risk of the rule f̂ selected by Empirical Risk Minimization (ERM) near that of the ideal rule f*?
o No, not in the case of overfitting.
o An important result of Statistical Learning Theory:

E[R(f̂)] ≤ R(f*) + C sqrt(V(F) / N)

where V(F) is the VC dimension of class F, N is the number of observations for training, and C is a universal constant.

What it says:
o The risk of the rule selected by ERM is not far from the risk of the ideal rule if:
1) N is large enough, and
2) the VC dimension of F is small enough.

[VC dimension? In short: the larger the class F, the larger its VC dimension. (Sorry, Vapnik sir!)]

Structural Risk Minimization (SRM)
o Consider a nested family of subclasses of F:

F_0 ⊂ F_1 ⊂ … ⊂ F_n ⊂ …  such that  V(F_0) ≤ V(F_1) ≤ … ≤ V(F_n) ≤ …

o Find the minimum empirical risk for each subclass and its VC dimension.
o Select the subclass with the minimum bound on the risk (i.e. the sum of the VC-dimension term and the empirical risk).

SRM graphically: [plot of the bound E[R(f̂)] ≤ R(f*) + C sqrt(V(F)/N) — empirical risk decreases and the VC-dimension term grows as the subclass gets larger; the minimum of their sum gives the best subclass.]

A: What does it have to do with SVM…?
B: SVM is an approximate implementation of SRM!
A: How?
B: Just in a simple way for now. Let's import a result: maximizing the distance of the decision boundary from the training points minimizes the VC dimension, resulting in good generalization!
A: So from now on our target is maximizing the distance between the decision boundary and the training points!
B: Yeah, right!
A: OK, I am convinced that SVM will generalize well, but can you please explain what the concept of SVM is and how to implement it? Are there any packages available?
B: Yeah, don't worry, there are many implementations available; just use them for your application. The next part of the presentation will give a basic idea about SVM, so stay with me!

Basic Concept of SVM:
o Which line will classify the unseen data well?
o The dotted line! It's the line with maximum margin!

Cont…
[Figure: the maximum-margin hyperplane W^T X + b = 0 with margin boundaries W^T X + b = +1 and W^T X + b = -1; the training points lying on the boundaries are the support vectors.]

Some definitions:
o Functional margin w.r.t.
1) an individual example: γ̂^(i) = Y^(i) (W^T X^(i) + b)
2) the example set S = {(X^(i), Y^(i)); i = 1, …, m}: γ̂ = min_{i=1,…,m} γ̂^(i)

o Geometric margin w.r.t.
1) an individual example: γ^(i) = Y^(i) ( (W/||W||)^T X^(i) + b/||W|| )
2) the example set S: γ = min_{i=1,…,m} γ^(i)
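The two margin definitions above are easy to check numerically. A minimal sketch with numpy, using a hypothetical hyperplane (W, b) and two labelled points (all values here are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical hyperplane (W, b) and two labelled examples
W = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 1.0],    # label +1
              [0.0, -1.0]])  # label -1
y = np.array([1.0, -1.0])

# Functional margins: gamma_hat_i = y_i * (W . x_i + b)
func_margins = y * (X @ W + b)

# Geometric margins: functional margins scaled by 1/||W||
geom_margins = func_margins / np.linalg.norm(W)

# Margins of the set S are the minima over all examples
gamma_hat = func_margins.min()
gamma = geom_margins.min()
```

Note that scaling (W, b) by any positive constant changes the functional margin but leaves the geometric margin unchanged, which is why the geometric margin is the quantity SVM maximizes.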

Problem Formulation:
[Figure: separating hyperplane W^T X + b = 0 with the margin boundaries W^T X + b = +1 and W^T X + b = -1.]

Cont..
o Distance of a point (u, v) from the line Ax + By + C = 0 is |Au + Bv + C| / ||n||, where ||n|| is the norm of the normal vector n = (A, B).
o Distance of the hyperplane from the origin = |b| / ||W||
o Distance of the plane W^T X + b = -1 (point A) from the origin = |b + 1| / ||W||
o Distance of the plane W^T X + b = +1 (point B) from the origin = |b - 1| / ||W||
o Distance between points A and B (the margin) = 2 / ||W||
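The distance formulas above can be verified with a small numpy sketch; the hyperplane below is a hypothetical example chosen so that ||W|| = 5:

```python
import numpy as np

# Hypothetical hyperplane W . x + b = 0 in 2-D, with ||W|| = 5
W = np.array([3.0, 4.0])
b = 5.0

def distance(point):
    # |W . point + b| / ||W||: distance of a point from the hyperplane
    return abs(W @ point + b) / np.linalg.norm(W)

origin_dist = distance(np.array([0.0, 0.0]))  # = |b| / ||W|| = 1.0
margin = 2.0 / np.linalg.norm(W)              # gap between W.x + b = +1 and -1
```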

Cont…
We have a data set d = {X^(i), Y^(i)}, i = 1, …, m, with X^(i) ∈ R^n and Y^(i) ∈ R^1,

and a separating hyperplane

W^T X + b = 0
s.t.
W^T X^(i) + b > 0 if Y^(i) = +1
W^T X^(i) + b < 0 if Y^(i) = -1

Cont…
o Suppose the training data also satisfy the following constraints:

W^T X^(i) + b ≥ +1 for Y^(i) = +1
W^T X^(i) + b ≤ -1 for Y^(i) = -1

Combining these into one:

Y^(i) (W^T X^(i) + b) ≥ 1 for all i

o Our objective is to find the hyperplane (W, b) with maximal separation between it and the closest data points while satisfying the above constraints.

THE PROBLEM:

max_{W,b} 2 / ||W||

such that

Y^(i) (W^T X^(i) + b) ≥ 1 for all i

Also we know ||W|| = sqrt(W^T W).

Cont..
So the problem can be written as:

min_{W,b} (1/2) W^T W

such that

Y^(i) (W^T X^(i) + b) ≥ 1 for all i

It is just a convex quadratic optimization problem!

Notice: W^T W = ||W||^2
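Since this is just a quadratic program, any generic constrained solver can handle it. A minimal sketch with scipy's SLSQP solver on a hypothetical toy data set (the data, and the use of scipy rather than a dedicated SVM package, are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical example)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):                  # p = [w1, w2, b]
    w = p[:2]
    return 0.5 * w @ w             # (1/2) W^T W

# One inequality constraint per example: y_i (W^T x_i + b) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda p, i=i: y[i] * (p[:2] @ X[i] + p[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
margin = 2.0 / np.linalg.norm(w)   # maximal margin achieved
```

For this symmetric data the solver recovers w = (0.25, 0.25) and b = 0, and the margin 2/||w|| equals the distance between the two closest opposite-class points.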

DUAL
o Solving the dual of our problem will let us apply SVM to nonlinearly separable data efficiently.
o It can be shown that

min primal = max_{α ≥ 0} min_{W,b} L(W, b, α)

o Primal problem:

min_{W,b} (1/2) W^T W

such that

Y^(i) (W^T X^(i) + b) ≥ 1 for all i

Constructing the Lagrangian
o Lagrangian for our problem:

L(W, b, α) = (1/2) ||W||^2 - Σ_{i=1}^{m} α_i [ Y^(i) (W^T X^(i) + b) - 1 ]

where α_i is a Lagrange multiplier and α_i ≥ 0.

o Now minimize it w.r.t. W and b: we set the derivatives of the Lagrangian w.r.t. W and b to zero.

Cont…
o Setting the derivative w.r.t. W to zero gives:

W = Σ_{i=1}^{m} α_i Y^(i) X^(i)

o Setting the derivative w.r.t. b to zero gives:

Σ_{i=1}^{m} α_i Y^(i) = 0

Cont…
o Plugging these results into the Lagrangian gives

L(W, b, α) = Σ_{i=1}^{m} α_i - (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j (X^(i))^T X^(j)

o Call it D(α):

D(α) = Σ_{i=1}^{m} α_i - (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j (X^(i))^T X^(j)

o This is the result of our minimization w.r.t. W and b.

So The DUAL:
o Now the dual becomes:

max_α D(α) = Σ_{i=1}^{m} α_i - (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j ⟨X^(i), X^(j)⟩
s.t. α_i ≥ 0, i = 1, …, m
     Σ_{i=1}^{m} α_i Y^(i) = 0

o Solving this optimization problem gives us the α_i.
o Also, the Karush-Kuhn-Tucker (KKT) conditions are satisfied at this solution, i.e.

α_i [ Y^(i) (W^T X^(i) + b) - 1 ] = 0 for i = 1, …, m

Values of W and b:
o W can be found using

W = Σ_{i=1}^{m} α_i Y^(i) X^(i)

o b can be found using:

b* = - ( max_{i: Y^(i) = -1} W*^T X^(i) + min_{i: Y^(i) = +1} W*^T X^(i) ) / 2
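The dual above is also a small quadratic program, so the same generic solver works; W and b are then recovered from the formulas just given. A sketch with scipy on the same hypothetical toy data used earlier (an illustrative assumption, not the slides' own code):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical example)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# G_ij = Y_i Y_j <X_i, X_j>
G = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(a):                   # maximizing D(a) == minimizing -D(a)
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, np.zeros(m),
               bounds=[(0.0, None)] * m,                            # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum alpha_i Y_i = 0
alpha = res.x

# W = sum_i alpha_i Y_i X_i; examples with alpha_i > 0 are the support vectors
w = (alpha * y) @ X
b = -(max(X[y == -1.0] @ w) + min(X[y == 1.0] @ w)) / 2.0
```

Only the two points closest to the boundary end up with nonzero α_i, illustrating why they are called support vectors.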

−=

What if data is nonlinearly separable?
o The maximal margin hyperplane can classify only linearly separable data.
o What if the data is not linearly separable?
o Map your data to a space where it is linearly separable (a higher-dimensional space) and use the maximal margin hyperplane there!

Taking it to a higher dimension works! Ex. XOR

Doing it in higher dimensional space
o Let Φ : X → F be a nonlinear mapping from the input space X (the original space) to a feature space F (higher dimensional).
o Then our inner (dot) product ⟨X^(i), X^(j)⟩ in the higher-dimensional space is ⟨φ(X^(i)), φ(X^(j))⟩.
o Now, the problem becomes:

max_α D(α) = Σ_{i=1}^{m} α_i - (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j ⟨φ(X^(i)), φ(X^(j))⟩
s.t. α_i ≥ 0, i = 1, …, m
     Σ_{i=1}^{m} α_i Y^(i) = 0

Kernel function:
o There exists a way to compute the inner product in feature space as a function of the original input points: the kernel function!
o Kernel function:

K(x, z) = ⟨φ(x), φ(z)⟩

o We need not know φ to compute K(x, z).

An example:
For n = 3, the feature mapping φ is given as:

φ(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3)^T

Let x, z ∈ R^n and K(x, z) = (x^T z)^2. Then:

K(x, z) = (Σ_{i=1}^{n} x_i z_i)(Σ_{j=1}^{n} x_j z_j)
        = Σ_{i=1}^{n} Σ_{j=1}^{n} x_i x_j z_i z_j
        = Σ_{i,j=1}^{n} (x_i x_j)(z_i z_j)
        = φ(x)^T φ(z)

So K(x, z) = ⟨φ(x), φ(z)⟩.

example cont…
o Here, for x = (1, 2)^T and z = (3, 4)^T (using the n = 2 mapping φ(x) = (x1x1, x1x2, x2x1, x2x2)^T):

K(x, z) = (x^T z)^2 = (1·3 + 2·4)^2 = 11^2 = 121

φ(x) = (1, 2, 2, 4)^T,  φ(z) = (9, 12, 12, 16)^T

φ(x)^T φ(z) = 9 + 24 + 24 + 64 = 121 = K(x, z)
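The worked example can be checked mechanically: computing K in the original space and the dot product in the explicit feature space must agree. A short numpy sketch:

```python
import numpy as np

def phi(x):
    # Explicit feature map for K(x, z) = (x^T z)^2:
    # all pairwise products x_i * x_j, flattened into one vector
    return np.outer(x, x).ravel()

def K(x, z):
    # Kernel computed directly in the original (low-dimensional) space
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

lhs = K(x, z)          # 2-D computation: (1*3 + 2*4)^2 = 121
rhs = phi(x) @ phi(z)  # 4-D feature-space computation, same value
```

The point of the kernel trick is that `K` costs O(n) here while the explicit map costs O(n^2), and for some kernels (like the RBF) the feature space is infinite-dimensional, so the explicit map is not computable at all.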

So our SVM for the non-linearly separable data:
o Optimization problem:

max_α D(α) = Σ_{i=1}^{m} α_i - (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j K(X^(i), X^(j))
s.t. α_i ≥ 0, i = 1, …, m
     Σ_{i=1}^{m} α_i Y^(i) = 0

o Decision function:

F(X) = Sign( Σ_{i=1}^{m} α_i Y^(i) K(X^(i), X) + b )
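Putting the kernelized dual and the decision function together, the XOR example mentioned earlier can be solved end to end. A sketch assuming scipy as a generic QP solver and a degree-2 polynomial kernel (both illustrative choices, not prescribed by the slides):

```python
import numpy as np
from scipy.optimize import minimize

# XOR-like data: not linearly separable in the input space
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

def K(u, v):                         # polynomial kernel of degree 2
    return (u @ v + 1.0) ** 2

Kmat = np.array([[K(X[i], X[j]) for j in range(m)] for i in range(m)])
G = (y[:, None] * y[None, :]) * Kmat

# Solve the kernelized dual: max D(alpha), alpha_i >= 0, sum alpha_i Y_i = 0
res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(),
               np.zeros(m),
               bounds=[(0.0, None)] * m,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

# b from a support vector k (alpha_k > 0): b = Y_k - sum_i alpha_i Y_i K(X_i, X_k)
k = int(np.argmax(alpha))
b = y[k] - (alpha * y) @ Kmat[:, k]

def F(x_new):                        # decision function F(X)
    return np.sign((alpha * y) @ np.array([K(X[i], x_new) for i in range(m)]) + b)

preds = np.array([F(X[i]) for i in range(m)])
```

All four training points are classified correctly even though no line in the original 2-D space separates them, since the degree-2 kernel implicitly supplies the x1·x2 feature that determines the XOR label.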

Some commonly used kernel functions:

o Linear: K(X, Y) = X^T Y

o Polynomial of degree d: K(X, Y) = (X^T Y + 1)^d

o Gaussian Radial Basis Function (RBF): K(X, Y) = exp( -||X - Y||^2 / (2σ^2) )

o Tanh kernel: K(X, Y) = tanh( ρ(X^T Y) - δ )
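For concreteness, the four kernels above translate directly into numpy; the parameter defaults (d, σ, ρ, δ) below are arbitrary illustrative choices:

```python
import numpy as np

def linear(x, z):
    # K(X, Y) = X^T Y
    return x @ z

def polynomial(x, z, d=2):
    # K(X, Y) = (X^T Y + 1)^d
    return (x @ z + 1.0) ** d

def rbf(x, z, sigma=1.0):
    # K(X, Y) = exp(-||X - Y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def tanh_kernel(x, z, rho=1.0, delta=0.0):
    # K(X, Y) = tanh(rho (X^T Y) - delta)
    return np.tanh(rho * (x @ z) - delta)

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
```

Note the RBF kernel always satisfies K(x, x) = 1, while the tanh kernel is not positive semi-definite for all parameter choices, so it is only a valid kernel for some (ρ, δ).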

Implementations:
Some ready-to-use SVM implementations:
1) LIBSVM: a library for SVMs by Chih-Chung Chang and Chih-Jen Lin (at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
2) SVMlight: an implementation in C by Thorsten Joachims (at: http://svmlight.joachims.org/)
3) Weka: a data mining software in Java by the University of Waikato (at: http://www.cs.waikato.ac.nz/ml/weka/)
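Beyond the three packages listed, scikit-learn (an addition here, not in the slides' list) exposes LIBSVM through its `SVC` class, so using a ready-made implementation can be as short as:

```python
# Minimal sketch using scikit-learn, whose SVC class wraps LIBSVM internally.
import numpy as np
from sklearn.svm import SVC

# The XOR pattern from the earlier slides
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# Degree-2 polynomial kernel; a very large C approximates the hard margin
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1e6)
clf.fit(X, y)
preds = clf.predict(X)
```

The fitted model's `support_vectors_` and `dual_coef_` attributes correspond to the X^(i) with α_i > 0 and the products α_i Y^(i) from the dual derived earlier.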

Issues:
o Selecting a suitable kernel: most of the time it is trial and error.
o Multiclass classification: one decision function for each class (one class vs. the other l - 1) and then finding the one with the maximum value; i.e. if X belongs to class 1, then for this and the other (l - 1) classes the values of the decision functions should be:

F_1(X) ≥ +1
F_2(X) ≤ -1
.
.
F_l(X) ≤ -1
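The one-vs-rest rule above reduces to an argmax over the l decision values. A tiny numpy sketch with hypothetical decision values (illustrative numbers, not from any trained model):

```python
import numpy as np

# Hypothetical one-vs-rest decision values F_k(X) for l = 3 classes, 2 samples
scores = np.array([[ 1.3, -0.8, -1.1],   # sample 1: F_1 is largest
                   [-1.0, -0.2,  0.7]])  # sample 2: F_3 is largest

# Predicted class = index of the maximum decision value per sample
predicted_class = scores.argmax(axis=1)
```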

Cont….
o Sensitive to noise: mislabeled data can badly affect performance.
o Good performance for applications such as:
1) computational biology and medical applications (protein and cancer classification problems)
2) image classification
3) hand-written character recognition
and many others…
o Use SVM for high-dimensional, linearly separable data (its strength); for nonlinearly separable data, performance depends on the choice of kernel.

Conclusion:
Support Vector Machines provide a very simple method for linear classification. But performance, in the case of nonlinearly separable data, largely depends on the choice of kernel!

References:
o Nello Cristianini and John Shawe-Taylor (2000). An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.
o Christopher J.C. Burges (1998). A tutorial on Support Vector Machines for pattern recognition. In Usama Fayyad, editor, Data Mining and Knowledge Discovery, 2, 121-167. Kluwer Academic Publishers, Boston.
o Andrew Ng (2007). CS229 Lecture Notes. Stanford Engineering Everywhere, Stanford University.
o Support Vector Machines <http://www.svms.org> (Accessed 10.11.2008)
o Wikipedia
o Kernel-Machines.org <http://www.kernel-machines.org> (Accessed 10.11.2008)

Thank You!


[email protected] ; [email protected]