Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... ·...

85
Support Vector and Kernel Machines Nello Cristianini BIOwulf Technologies [email protected] http://www.support-vector.net/tutorial.html ICML 2001

Transcript of Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... ·...

Page 1: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

Support Vector and Kernel Machines

Nello CristianiniBIOwulf [email protected]://www.support-vector.net/tutorial.html

ICML 2001

Page 2: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

A Little History

z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik. Greatly developed ever since.

z Initially popularized in the NIPS community, now an important and active field of all Machine Learning research.

z Special issues of Machine Learning Journal, and Journal of Machine Learning Research.

z Kernel Machines: large class of learning algorithms, SVMs a particular instance.

Page 3: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

A Little History

z Annual workshop at NIPSz Centralized website: www.kernel-machines.orgz Textbook (2000): see www.support-vector.netz Now: a large and diverse community: from machine

learning, optimization, statistics, neural networks, functional analysis, etc. etc

z Successful applications in many fields (bioinformatics, text, handwriting recognition, etc)

z Fast expanding field, EVERYBODY WELCOME ! -

Page 4: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Preliminaries

z Task of this class of algorithms: detect and exploit complex patterns in data (eg: by clustering, classifying, ranking, cleaning, etc. the data)

z Typical problems: how to represent complex patterns; and how to exclude spurious (unstable) patterns (= overfitting)

z The first is a computational problem; the second a statistical problem.

Page 5: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Very Informal Reasoning

z The class of kernel methods implicitly defines the class of possible patterns by introducing a notion of similarity between data

z Example: similarity between documentsz By lengthz By topicz By language …

z Choice of similarity Î Choice of relevant features

Page 6: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

More formal reasoning

z Kernel methods exploit information about the inner products between data items

z Many standard algorithms can be rewritten so that they only require inner products between data (inputs)

z Kernel functions = inner products in some feature space (potentially very complex)

z If kernel given, no need to specify what features of the data are being used

Page 7: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Jus t in case …

z Inner product between vectors

z Hyperplane:

x z xzi i

i

, =∑x

x

o

o

o x

x o

o o

x

x

x

w

b

w x b, + = 0

Page 8: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Overview of the Tutorial

z Introduce basic concepts with extended example of Kernel Perceptron

z Derive Support Vector Machines z Other kernel based algorithmsz Properties and Limitations of Kernelsz On Kernel Alignmentz On Optimizing Kernel Alignment

Page 9: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Parts I and II: overview

z Linear Learning Machines (LLM)z Kernel Induced Feature Spacesz Generalization Theoryz Optimization Theoryz Support Vector Machines (SVM)

Page 10: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Modularity

z Any kernel-based learning algorithm composed of two modules:

– A general purpose learning machine– A problem specific kernel function

z Any K-B algorithm can be fitted with any kernelz Kernels themselves can be constructed in a modular

wayz Great for software engineering (and for analysis)

IMPORTANTCONCEPT

Page 11: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

1-Linear Learning Machines

z Simplest case: classification. Decision function is a hyperplane in input space

z The Perceptron Algorithm (Rosenblatt, 57)

z Useful to analyze the Perceptron algorithm, before looking at SVMs and Kernel Methods in general

Page 12: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Basic Notation

z Input spacez Output spacez Hypothesisz Real-valued:z Training Setz Test errorz Dot product

x X

y Y

h H

f X

S x y x y

x z

i i

∈∈ = − +∈

→=

{ , }

: R

{( , ),..., ( , ),....}

,

1 1

1 1

ε

Page 13: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Perceptron

z Linear Separation of the input space

x

x

o

o

o x

x o

o o

x

x

x

w

b

f x w x b

h x sign f x

( ) ,

( ) ( ( ))

= +=

Page 14: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Perceptron Algorithm

Update rule (ignoring threshold):

z if then x

x

o

o

o x

x o

o o

x

x

x

w

b

y w xi k i( , ) ≤ 0

w w y x

k k

k k i i+ ← +← +

1

1

η

Page 15: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Observations

z Solution is a linear combination of training points

z Only used informative points (mistake driven)z The coefficient of a point in combination

reflects its ‘difficulty’

w y xi i i

i

=

≥∑ α

α 0

Page 16: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Observations - 2

z Mistake bound:

z coefficients are non-negativez possible to rewrite the algorithm using this alternative

representation

MR≤

���

���γ

2x

x

o

o

o x

x o

o o

x

x

x

g

Page 17: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Dual Representation

The decision function can be re-written as follows:

f x w x b y x x bi i i( ) , ,= + = +∑α

w y xi i i= ∑α

IMPORTANTCONCEPT

Page 18: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Dual Representation

z And also the update rule can be rewritten as follows:

z if then

z Note: in dual representation, data appears only inside dot products

y y x x bi j j j iα ,∑ + ≤3 8 0 α α ηi i← +

Page 19: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Duality: First Property of SVMs

z DUALITY is the first feature of Support Vector Machines

z SVMs are Linear Learning Machines represented in a dual fashion

z Data appear only within dot products (in decision function and in training algorithm)

f x w x b y x x bi i i( ) , ,= + = +∑α

Page 20: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Limitations of LLMs

Linear classifiers cannot deal with

z Non-linearly separable dataz Noisy data

z + this formulation only deals with vectorial data

Page 21: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Non-Linear Classifiers

z One solution: creating a net of simple linear classifiers (neurons): a Neural Network(problems: local minima; many parameters; heuristics needed to train; etc)

z Other solution: map data into a richer feature space including non-linear features, then use a linear classifier

Page 22: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Learning in the Feature Space

z Map data into a feature space where they are linearly separable x x→ φ( )

x

x

x

x

o o

o

o

f ( o ) f (x)

f (x)

f ( o )

f ( o )

f ( o ) f (x)

f (x)

f

X F

Page 23: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Problems with Feature Space

z Working in high dimensional feature spaces solves the problem of expressing complex functions

BUT:z There is a computational problem (working with

very large vectors)z And a generalization theory problem (curse of

dimensionality)

Page 24: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Implicit Mapping to Feature Space

We will introduce Kernels:

z Solve the computational problem of working with many dimensions

z Can make it possible to use infinite dimensions – efficiently in time / space

z Other advantages, both practical and conceptual

Page 25: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Kernel-Induced Feature Spaces

z In the dual representation, the data points only appear inside dot products:

z The dimensionality of space F not necessarily important. May not even know the map

f x y x x bi i i( ) ( , ( ))= +∑α φ φ

φ

Page 26: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Kernels

z A function that returns the value of the dot product between the images of the two arguments

z Given a function K, it is possible to verify that it is a kernel

K x x x x( , ) ( ), ( )1 2 1 2= φ φ

IMPORTANTCONCEPT

Page 27: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Kernels

z One can use LLMs in a feature space by simply rewriting it in dual representation and replacing dot products with kernels:

x x K x x x x1 2 1 2 1 2, ( , ) ( ), ( )← = φ φ

Page 28: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

The Kernel Matrix

z (aka the Gram matrix):

K(m,m)…K(m,3)K(m,2)K(m,1)

……………

K(2,m)…K(2,3)K(2,2)K(2,1)

K(1,m)…K(1,3)K(1,2)K(1,1)

IMPORTANTCONCEPT

K=

Page 29: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

The Kernel Matrix

z The central structure in kernel machinesz Information ‘bottleneck’: contains all necessary

information for the learning algorithm z Fuses information about the data AND the

kernelz Many interesting properties:

Page 30: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Mercer’s Theo rem

z The kernel matrix is Symmetric Positive Definite

z Any symmetric positive definite matrix can be regarded as a kernel matrix, that is as an inner product matrix in some space

Page 31: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

More Formally: Mercer’s Theo rem

z Every (semi) positive definite, symmetric function is a kernel: i.e. there exists a mapping

such that it is possible to write:

Pos. Def.

φ

K x x x x( , ) ( ), ( )1 2 1 2= φ φK x z f x f z d x d z

f L

( , ) ( ) ( ) ≥

∀ ∈I 0

2

Page 32: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Mercer’s Theo rem

z Eigenvalues expansion of Mercer’s Kernels:

z That is: the eigenfunctions act as features !

K x x x xi

i

i i( , ) ( ) ( )1 2 1 2= ∑λφ φ

Page 33: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Examples of Kernels

z Simple examples of kernels are:

K x z x z

K x z e

d

x z

( , ) ,

( , ) /

=

= − − 2 2σ

Page 34: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Example: Polynomial Kernels

x x x

z z z

x z x z x z

x z x z x z x z

x x x x z z z z

x z

==

= + =

= + + =

= =

=

( , );

( , );

, ( )

( , , ),( , , )

( ), ( )

1 2

1 2

21 1 2 2

2

12

12

22

22

1 1 2 2

12

22

1 2 12

22

1 2

2

2 2

φ φ

Page 35: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Example: Polynomial Kernels

Page 36: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Example: the two spirals

z Separated by a hyperplane in feature space (gaussian kernels)

Page 37: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Making Kernels

z The set of kernels is closed under some operations. If K, K’ are kernels, then:

z K+K’ is a kernelz cK is a kernel, if c>0z aK+bK’ is a kernel, for a,b >0z Etc etc etc…… z can make complex kernels from simple ones:

modularity !

IMPORTANTCONCEPT

Page 38: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Second Property of SVMs:

SVMs are Linear Learning Machines, that z Use a dual representation ANDz Operate in a kernel induced feature space(that is: is a linear function in the feature space implicitely

defined by K)

f x y x x bi i i( ) ( , ( ))= +∑α φ φ

Page 39: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Kernels over General Structures

z Haussler, Watkins, etc: kernels over sets, over sequences, over trees, etc.

z Applied in text categorization, bioinformatics, etc

Page 40: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

A bad ke rnel …

z … would be a kernel whose kernel matrix is mostly diagonal: all points orthogonal to each other, no clusters, no structure …

1…000

……………

1

0…010

0…001

Page 41: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

No Free Kernel

z If mapping in a space with too many irrelevant features, kernel matrix becomes diagonal

z Need some prior knowledge of target so choose a good kernel

IMPORTANTCONCEPT

Page 42: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Other Kernel-based algorithms

z Note: other algorithms can use kernels, not justLLMs (e.g. clustering; PCA; etc). Dual representation often possible (in optimizationproblems, by Representer’s theorem).

Page 43: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

%5($.

Page 44: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

The Generalization Problem

z The curse of dimensionality: easy to overfit in high dimensional spaces(=regularities could be found in the training set that are accidental, that is that would not be found again in a test set)

z The SVM problem is ill posed (finding one hyperplanethat separates the data: many such hyperplanes exist)

z Need principled way to choose the best possiblehyperplane

NEW TOPIC

Page 45: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

The Generalization Problem

z Many methods exist to choose a good hyperplane (inductive principles)

z Bayes, statistical learning theory / pac, MDL, …

z Each can be used, we will focus on a simple case motivated by statistical learning theory (will give the basic SVM)

Page 46: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Statistical (Computational) Learning Theory

z Generalization bounds on the risk of overfitting (in a p.a.c. setting: assumption of I.I.d. data; etc)

z Standard bounds from VC theory give upper and lower bound proportional to VC dimension

z VC dimension of LLMs proportional to dimension of space (can be huge)

Page 47: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Assumptions and Definitions

z distribution D over input space Xz train and test points drawn randomly (I.I.d.) from D z training error of h: fraction of points in S misclassifed

by hz test error of h: probability under D to misclassify a point

x z VC dimension: size of largest subset of X shattered by

H (every dichotomy implemented)

Page 48: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

VC Bounds

=

m

VCO~ε

VC = (number of dimensions of X) +1

Typically VC >> m, so not useful

Does not tell us which hyperplane to choose

Page 49: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Margin Based Bounds

f

xfy

m

RO

ii

i

)(min

)/(~ 2

=

=

γ

γε

Note: also compression bounds exist; and online bounds.

Page 50: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Margin Based Bounds

z (The worst case bound still holds, but if lucky (margin is large)) the other bound can be applied and better generalization can be achieved:

z Best hyperplane: the maximal margin onez Margin is large is kernel chosen well

=

m

RO

2)/(~ γε

IMPORTANTCONCEPT

Page 51: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Maximal Margin Classifier

z Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space

z Third feature of SVMs: maximize the marginz SVMs control capacity by increasing the

margin, not by reducing the number of degrees of freedom (dimension free capacity control).

Page 52: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Two kinds of margin

z Functional and geometric margin:

x

x

o

o

o x

x o

o o

x

x

x

g

funct y f x

geomy f x

f

i i

i i

=

=

min ( )

min( )

Page 53: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Two kinds of margin

Page 54: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Max Margin = Minimal Norm

z If we fix the functional margin to 1, the geometric margin equal 1/||w||

z Hence, maximize the margin by minimizing the norm

Page 55: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Max Margin = Minimal Norm

Distance betweenThe two convex hulls

w x b

w x b

w x x

w

wx x

w

,

,

, ( )

,( )

+

+ −

+ −

+ = +

+ = −

− =

− =

1

1

2

2

x

x

o

o

o x

x o

o o

x

x

x

g

Page 56: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

The primal problem

z Minimize:subject to:

w w

y w x bi i

,

, + ≥4 9 1

IMPORTANTSTEP

Page 57: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Optimization Theory

z The problem of finding the maximal marginhyperplane: constrained optimization(quadratic programming)

z Use Lagrange theory (or Kuhn-Tucker Theory)z Lagrangian:

1

21

0

w w y w x bi i i, ,− + −∑ �!

"$#

α

α

1 6

Page 58: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

From Primal to Dual

L w w w y w x bi i i

i

( ) , ,= − + −∑ �!

"$#

1

21

0

α

α

1 6

Differentiate and substitute:∂∂∂∂

L

bL

w

=

=

0

0

Page 59: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

The Dual Problem

z Maximize:

z Subject to:

z The duality again ! Can use kernels !

W y y x x

y

i

i

ii j

j i j i j

i

i ii

( ) ,,

α α αα

αα

= −

≥∑ =

∑ ∑1

2

0

0

IMPORTANTSTEP

Page 60: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Convexity

z This is a Quadratic Optimization problem: convex, no local minima (second effect of Mercer’s conditions)

z Solvable in polynomial time …z (convexity is another fundamental property of

SVMs)

IMPORTANTCONCEPT

Page 61: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Kuhn-Tucker Theorem

Properties of the solution:z Duality: can use kernelsz KKT conditions:

z Sparseness: only the points nearest to the hyperplane(margin = 1) have positive weight

z They are called support vectors

w y xi i i= ∑α

αi i iy w x b

i

, + − =

1 6 1 0

Page 62: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

KKT Conditions Imply Sparseness

x

x

o

o

o x

x o

o o

x

x

x

g

Sparseness: another fundamental property of SVMs

Page 63: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Properties of SVMs - Summary

9 Duality9 Kernels9 Margin9 Convexity9 Sparseness

Page 64: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Dealing with noise

( )2

2

1

γξ

ε ∑ 2+≤

R

m

y w x bi i i, + ≥ −4 9 1 ξ

In the case of non-separable data in feature space, the margin distribution can be optimized

Page 65: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

The Soft-Margin Classifier

1

2

1

2

1

2

w w C

w w C

y w x b

i

i

i

i

i i i

,

,

,

+

+

+ ≥ −

ξ

ξ

ξ4 9

Minimize:

Or:

Subject to:

Page 66: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Slack Variables

( )2

2

1

γξ

ε ∑ 2+≤

R

m

y w x bi i i, + ≥ −4 9 1 ξ

Page 67: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Soft Margin-Dual Lagrangian

z Box constraints

z Diagonal

W y y x x

C

y

i

i

ii j

j i j i j

i

i ii

( ) ,,

α α αα

αα

= −

≤ ≤∑ =

∑ ∑1

2

0

0

α αα α α

αα

i

i

ii j

j i j i j j j

i

i i

y y x xC

yi

∑ ∑ ∑− −

≤∑ ≥

1

2

1

2

0

0

,,

Page 68: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

The regression case

z For regression, all the above properties are retained, introducing epsilon-insensitive loss:

L

y i - < w , x i >+b

e

0

x

Page 69: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Regression: the ε-tube

Page 70: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Implementation Techniques

z Maximizing a quadratic function, subject to a linear equality constraint (and inequalities as well)

W y y K x x

y

i

i

ii j

j i j i j

i

i ii

( ) ( , ),

α α αα

αα

= −

≥∑ =

∑ ∑1

2

0

0

Page 71: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Simple Approximation

z Initially complex QP pachages were used.z Stochastic Gradient Ascent (sequentially

update 1 weight at the time) gives excellent approximation in most cases

α α αi ii i

i i i i jK x x

y y K x x← + −���

���∑1

1( , )

( , )

Page 72: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Full Solution: S.M.O.

z SMO: update two weights simultaneouslyz Realizes gradient descent without leaving the

linear constraint (J. Platt).

z Online versions exist (Li-Long; Gentile)

Page 73: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Other “ kernelized” Algorithms

z Adatron, nearest neighbour, fisher discriminant, bayes classifier, ridge regression, etc. etc

z Much work in past years into designing kernel based algorithms

z Now: more work on designing good kernels (for any algorithm)

Page 74: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

On Combining Kernels

z When is it advantageous to combine kernels ?z Too many features leads to overfitting also in

kernel methodsz Kernel combination needs to be based on

principlesz Alignment

Page 75: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Kernel Alignment

z Notion of similarity between kernels:Alignment (= similarity between Gram matrices)

A K KK K

K K K K( , )

,

, ,1 2

1 2

1 1 2 2=

IMPORTANTCONCEPT

Page 76: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Many interpretations

z As measure of clustering in dataz As Correlation coefficient between ‘oracles’

z Basic idea: the ‘ultimate’ kernel should be YY’, that is should be given by the labels vector (after all: target is the only relevant feature !)

Page 77: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

The ideal kernel

1…1-1-1

……………

11-1-1

-1…-111

-1…-111

YY’=

Page 78: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Combining Kernels

z Alignment in increased by combining kernels that are aligned to the target and not aligned to each other.

z

A K YYK YY

K K YY YY( , ' )

, '

, ' , '1

1

1 1

=

Page 79: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Spectral Machines

z Can (approximately) maximize the alignment of a set of labels to a given kernel

z By solving this problem:

z Approximated by principal eigenvector (thresholded) (see courant-hilbert theorem)

yyKy

yy

yi

=

∈ − +

argmax'

{ , }1 1

Page 80: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Courant-Hilbert theorem

z A: symmetric and positive definite,z Principal Eigenvalue / Eigenvector

characterized by:

λ = max'v

vAv

vv

Page 81: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Optimizing Kernel Alignment

z One can either adapt the kernel to the labels or vice versa

z In the first case: model selection methodz Second case: clustering / transduction method

Page 82: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Applications of SVMs

z Bioinformaticsz Machine Visionz Text Categorizationz Handwritten Character Recognitionz Time series analysis

Page 83: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Text Kernels

z Joachims (bag of words)z Latent semantic kernels (icml2001)z String matching kernelsz …z See KerMIT project …

Page 84: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Bioinformatics

z Gene Expressionz Protein sequencesz Phylogenetic Informationz Promotersz …

Page 85: Support Vector and Kernel Machines - University of Rochesterstefanko/Teaching/09CS446/... · 2009-01-08 · A Little History z SVMs introduced in COLT-92 by Boser, Guyon, Vapnik.

www.support-vector.net

Conclusions:

z Much more than just a replacement for neural networks. -

z General and rich class of pattern recognition methods

z %RR RQ 690V��ZZZ�VXSSRUW�YHFWRU�QHWz Kernel machines website

www.kernel-machines.orgz www.NeuroCOLT.org