Transcript of "Regression and Classification using Kernel Methods"
Barnabás Póczos, University of Alberta, Oct 15, 2009 (74 slides)

Page 1

Regression and Classification using Kernel Methods

Barnabás Póczos

University of Alberta

Oct 15, 2009

Page 2

Roadmap I

• Primal Perceptron ⇒ Dual Perceptron ⇒ Kernels
  (it doesn't want a large margin...)

• Primal hard SVM ⇒ Dual hard SVM ⇒ Kernels
  (it wants a large margin, but assumes the data are linearly separable in the feature space)

• Primal soft SVM ⇒ Dual soft SVM ⇒ Kernels
  (it wants a large margin, and doesn't assume that the data are linearly separable in the feature space)

Classification

Page 3

Roadmap II

• Ridge Regression ⇒ Dual form ⇒ Kernels

• Primal SVM for Regression ⇒ Dual SVM ⇒ Kernels

• Logistic Regression ⇒ Kernels

• Bayesian Ridge Regression in feature space (Gaussian Processes) ⇒ Dual form ⇒ Kernels

Regression

Page 4

The Perceptron

Page 5

The primal algorithm in the feature space

Picture is taken from R. Herbrich

Page 6

The Dual Perceptron

Picture is taken from R. Herbrich
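The algorithm itself is not reproduced in this transcript. As a rough sketch (not the slide's exact pseudocode), a kernelized dual perceptron keeps one coefficient per training point and updates it only on mistakes; the function and variable names below are illustrative:

    # Dual (kernel) perceptron sketch: w = sum_i alpha_i * y_i * phi(x_i) is never
    # formed explicitly; only kernel evaluations k(x_i, x) are needed.
    import numpy as np

    def train_dual_perceptron(K, y, epochs=10):
        # K: m x m Gram matrix, K[i, j] = k(x_i, x_j); y: labels in {-1, +1}.
        m = len(y)
        alpha, b = np.zeros(m), 0.0
        for _ in range(epochs):
            for i in range(m):
                if y[i] * (np.sum(alpha * y * K[:, i]) + b) <= 0:  # mistake on x_i
                    alpha[i] += 1.0
                    b += y[i]
        return alpha, b

    # Prediction for a new x: sign(sum_i alpha_i * y_i * k(x_i, x) + b).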

Page 7

The SVM

Page 8

Linear Classifiers

f(x,w,b) = sign(w ∙ x + b)

How to classify this data?

Each of these seems fine…

... which is best?

(Figure: x → f(x, w, b) → y_est, with data points labeled +1 and –1 and several candidate separating lines.)

taken from Andrew W. Moore

Page 9

Classifier Margin

f(x,w,b) = sign(w ∙ x + b)

The margin of a linear classifier = the width by which the boundary could be increased before hitting a datapoint


taken from Andrew W. Moore

Page 10

Maximum Margin

f(x,w,b) = sign(w ∙ x + b)

Support Vectors are datapoints that “touch” the margin

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

• the simplest kind of SVM, called a Linear SVM (LSVM)


taken from Andrew W. Moore

Page 11

taken from Andrew W. Moore

Page 12

Specifying a Line and a Margin

• Plus-plane = { x : w ∙ x + b = +1 }
• Minus-plane = { x : w ∙ x + b = –1 }

(Figure: the classifier boundary w ∙ x + b = 0 lies between the plus-plane and the minus-plane; the "Predict Class = +1" zone is beyond the plus-plane and the "Predict Class = –1" zone beyond the minus-plane.)

Classify as:
  +1                  if  w ∙ x + b ≥ +1
  –1                  if  w ∙ x + b ≤ –1
  Universe explodes   if  –1 < w ∙ x + b < 1

taken from Andrew W. Moore

Page 13

Computing the margin width

Given …
• w ∙ x⁺ + b = +1
• w ∙ x⁻ + b = –1
• x⁺ = x⁻ + λ w   (x⁻ is the closest minus-plane point to x⁺)
• |x⁺ – x⁻| = M   (the margin width)

Substituting x⁺ = x⁻ + λ w into w ∙ x⁺ + b = +1:

  w ∙ x⁻ + b + λ (w ∙ w) = 1  ⟹  –1 + λ (w ∙ w) = 1  ⟹  λ = 2 / (w ∙ w)

M = Margin Width = |x⁺ – x⁻| = |λ w| = λ |w| = λ √(w ∙ w) = 2 √(w ∙ w) / (w ∙ w) = 2 / √(w ∙ w)

Yay! Just maximize 2 / √(w ∙ w) … ≡ minimize w ∙ w

Wait… OMG, I forgot the data! taken from Andrew W. Moore

Page 14

The Primal Hard SVM

This is a QP problem ⇒ the solution is expressible in dual form

The computational effort is O(n²)
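The optimization problem itself is not reproduced in this transcript; for reference, the standard primal hard-margin SVM it refers to, for training data (xᵢ, yᵢ) with yᵢ ∈ {–1, +1} and feature map φ, is

    \min_{w,\,b}\ \tfrac{1}{2}\|w\|^2
    \quad \text{subject to} \quad
    y_i \bigl( \langle w, \phi(x_i) \rangle + b \bigr) \ge 1, \qquad i = 1, \dots, m.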

Page 15

Quadratic Programming – in general

Find

    argmin over w of   c + dᵀw + ½ wᵀK w          (quadratic criterion)

subject to n additional linear inequality constraints

    a11 w1 + a12 w2 + … + a1m wm ≤ b1
    a21 w1 + a22 w2 + … + a2m wm ≤ b2
     ⋮
    an1 w1 + an2 w2 + … + anm wm ≤ bn

and to e additional linear equality constraints

    a(n+1)1 w1 + a(n+1)2 w2 + … + a(n+1)m wm = b(n+1)
    a(n+2)1 w1 + a(n+2)2 w2 + … + a(n+2)m wm = b(n+2)
     ⋮
    a(n+e)1 w1 + a(n+e)2 w2 + … + a(n+e)m wm = b(n+e)

Note: w ∙ x = wᵀx

taken from Andrew W. Moore
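For illustration (this is not from the slides), such a QP can be handed to an off-the-shelf solver. A minimal sketch using the cvxopt package, assuming it is installed; the matrices below encode a toy criterion and toy constraints only:

    # cvxopt minimizes (1/2) w'Pw + q'w  subject to  Gw <= h  and  Aw = b,
    # which matches the slide's criterion with P = K, q = d (the constant c is irrelevant).
    import numpy as np
    from cvxopt import matrix, solvers

    P = matrix(np.array([[2.0, 0.0], [0.0, 2.0]]))    # quadratic term K (must be PSD)
    q = matrix(np.array([-1.0, -1.0]))                # linear term d
    G = matrix(np.array([[-1.0, 0.0], [0.0, -1.0]]))  # inequalities: -w <= 0, i.e. w >= 0
    h = matrix(np.array([0.0, 0.0]))
    A = matrix(np.array([[1.0, 1.0]]))                # equality: w1 + w2 = 1
    b = matrix(np.array([1.0]))

    solution = solvers.qp(P, q, G, h, A, b)
    print(np.array(solution['x']).ravel())            # optimizer of the toy QP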

Page 16

The Dual Hard SVM

Quadratic Programming; the computational effort is O(m²)

Lemma
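The lemma's dual problem is not reproduced in this transcript; the standard dual hard-margin SVM, with multipliers αᵢ and kernel k(xᵢ, xⱼ) = ⟨φ(xᵢ), φ(xⱼ)⟩, is

    \max_{\alpha}\ \sum_{i=1}^{m} \alpha_i
      - \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j\, y_i y_j\, k(x_i, x_j)
    \quad \text{s.t.} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{m} \alpha_i y_i = 0,

with w = Σᵢ αᵢ yᵢ φ(xᵢ) and decision rule f(x) = sign(Σᵢ αᵢ yᵢ k(xᵢ, x) + b). The QP has m variables, one per training point, which is why the effort scales with m rather than with the feature-space dimension.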

Page 17

Proof

Primal Lagrangian:

The solution:
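The formulas are not reproduced in this transcript; the standard derivation uses the Lagrangian

    L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2
      - \sum_{i=1}^{m} \alpha_i \Bigl( y_i \bigl( \langle w, \phi(x_i) \rangle + b \bigr) - 1 \Bigr),
    \qquad \alpha_i \ge 0,

and setting ∂L/∂w = 0 and ∂L/∂b = 0 gives w = Σᵢ αᵢ yᵢ φ(xᵢ) and Σᵢ αᵢ yᵢ = 0, which are substituted back into L to obtain the dual above.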

Page 18

Proof cont.

Page 19

The Problem with Hard SVM

It assumes the samples are linearly separable...

Solutions:
• Use a kernel with large expressive power (e.g. an RBF with small σ) ⇒ the training samples become linearly separable in the feature space ⇒ Hard SVM can be applied ⇒ overfitting...
• Soft margin SVM instead of Hard SVM
• Slack variables...

Page 20

Hard SVM

The Hard SVM problem can be rewritten in terms of the ℓ0-1 loss, which is 1 for a misclassification and 0 for a correct classification.

Page 21

From Hard to Soft constraints

Instead of using hard constraints (the points are linearly separable), we can try to solve the soft version of the problem,

where the loss is again 1 for a misclassification and 0 for a correct classification.

Page 22

Problems with the ℓ0-1 loss

It is not convex in y f(x) ⇒ it is not convex in w either... and we like only convex functions...

Let us approximate it with convex functions!

• Piecewise linear approximation (hinge loss, ℓlin)

• Quadratic approximation (ℓquad); both are written out below
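For reference, the usual forms of these two approximations to the 0–1 loss (which is 1 when y f(x) < 0 and 0 otherwise) are

    \ell_{\mathrm{lin}}\bigl( y f(x) \bigr) = \max\bigl( 0,\ 1 - y f(x) \bigr),
    \qquad
    \ell_{\mathrm{quad}}\bigl( y f(x) \bigr) = \max\bigl( 0,\ 1 - y f(x) \bigr)^{2},

and both upper-bound ℓ0-1.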

Page 23

Approximation of the Heaviside step function

Picture is taken from R. Herbrich

Page 24

The hinge loss approximation of ℓ0-1

Where,

Page 25

The Slack Variables

(Figure: data points around the planes wx+b = –1, wx+b = 0, and wx+b = +1, with margin width M = 2 / √(w ∙ w); points violating their margin plane are marked with slack variables ξ₂, ξ₇, ξ₁₁.)

taken from Andrew W. Moore

Page 26

The Primal Soft SVM problem

Using the ℓlin (hinge) loss approximation:

Or we can use this form, too...
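Neither form is reproduced in this transcript; the standard slack-variable statement of the primal soft SVM is

    \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i
    \quad \text{s.t.} \quad
    y_i \bigl( \langle w, \phi(x_i) \rangle + b \bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0,

which is equivalent to the unconstrained hinge-loss form: minimize over (w, b) the quantity ½‖w‖² + C Σᵢ max(0, 1 – yᵢ(⟨w, φ(xᵢ)⟩ + b)).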

Page 27

The Dual Soft SVM (using hinge loss)
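The dual problem is not reproduced in this transcript; the standard dual of the hinge-loss soft SVM differs from the hard-margin dual only by a box constraint on the multipliers:

    \max_{\alpha}\ \sum_{i=1}^{m} \alpha_i
      - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, k(x_i, x_j)
    \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{m} \alpha_i y_i = 0 .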

Page 28

Proofs

where

Page 29

Proofs

Page 30

Classifying using SVMs with Kernels

• Choose a kernel function
• Solve the dual problem (see the sketch below)
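As an illustration of these two steps (not part of the slides), a minimal sketch with scikit-learn, assuming it is installed; the RBF kernel, its parameters, and the toy data are arbitrary choices:

    # Soft-margin SVM classification with a kernel: pick k(.,.), let the library
    # solve the dual QP, then predict with the resulting support-vector expansion.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # not linearly separable

    clf = SVC(kernel="rbf", C=1.0, gamma=1.0)  # C: soft-margin penalty, gamma: RBF width
    clf.fit(X, y)

    print("number of support vectors:", clf.n_support_.sum())
    print("training accuracy:", clf.score(X, y))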

Page 31

A few results

Page 32

Kernels and Linear Classifiers

We map the data into a feature space and use linear classifiers there.

Page 33

The Decision Surface

Picture is taken from R. Herbrich

For general feature maps it can be very difficult to solve these...

Page 34

Steve Gunn's svm toolbox: Results, Iris 2vs13, Linear kernel

Page 35

Results, Iris 1vs23, 2nd order kernel

Page 36

Results, Iris 1vs23, 2nd order kernel

Page 37

Results, Iris 1vs23, 13th order kernel

Page 38

Results, Iris 1vs23, RBF kernel

Page 39

Results, Iris 1vs23, RBF kernel

Page 40

Results, Chessboard, Poly kernel

Page 41

Results, Chessboard, Poly kernel

Page 42

Results, Chessboard, Poly kernel

Page 43

Results, Chessboard, Poly kernel

Page 44

Results, Chessboard, poly kernel

Page 45

Results, Chessboard, RBF kernel

Page 46

Regression

Page 47

Ridge Regression

Primal:

Dual for a given λ: ...after some calculations...

This can be solved in closed form:

Linear regression:
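The formulas are not reproduced in this transcript; the standard calculation outlined here, with design matrix X (rows are samples), targets y, and ridge parameter λ, is

    \text{Primal: } \min_{w}\ \|y - Xw\|^2 + \lambda \|w\|^2
    \;\Rightarrow\;
    w = (X^\top X + \lambda I)^{-1} X^\top y .

After some calculations the same predictor can be written in dual form, w = Xᵀα with α = (XXᵀ + λI)⁻¹ y, so that only inner products xᵢᵀxⱼ appear; replacing them by a kernel k gives f(x) = Σᵢ αᵢ k(xᵢ, x) with α = (K + λI)⁻¹ y.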

Page 48

Kernel Ridge Regression Algorithm
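The algorithm box is not reproduced in this transcript; a minimal numpy sketch of kernel ridge regression under the dual formulation above (the RBF kernel and the values of λ and σ are illustrative choices):

    # Kernel ridge regression: alpha = (K + lambda*I)^{-1} y,  f(x) = sum_i alpha_i k(x_i, x)
    import numpy as np

    def rbf_kernel(A, B, sigma=1.0):
        # Gram matrix of the Gaussian (RBF) kernel between rows of A and rows of B.
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    def fit_kernel_ridge(X, y, lam=0.1, sigma=1.0):
        K = rbf_kernel(X, X, sigma)
        return np.linalg.solve(K + lam * np.eye(len(X)), y)      # alpha

    def predict(X_train, alpha, X_new, sigma=1.0):
        return rbf_kernel(X_new, X_train, sigma) @ alpha

    # Toy usage on the sinc function used in the result slides below.
    sigma = 0.5
    X = np.linspace(-3, 3, 60)[:, None]
    y = np.sinc(X).ravel()                                       # sin(pi x)/(pi x)
    alpha = fit_kernel_ridge(X, y, lam=1e-3, sigma=sigma)
    print(np.abs(predict(X, alpha, X, sigma=sigma) - y).max())   # small training error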

Page 49

SVM Regression

• Typical loss function (ridge regression):

    C Σₙ (yₙ – tₙ)² + ½ ‖w‖²     … quadratic penalty whenever yₙ ≠ tₙ

• To be sparse… don't worry if "close enough"

• ε-insensitive loss function:

    Eε(y, t) = 0            if |y – t| < ε
               |y – t| – ε  otherwise

• No penalty if inside the ε-tube; minimize instead

    C Σₙ Eε(yₙ, tₙ) + ½ ‖w‖²

taken from Andrew W. Moore

Page 50

SVM Regression

• No penalty if yₙ – ε ≤ tₙ ≤ yₙ + ε
• Slack variables: {ξₙ⁺, ξₙ⁻}
  • tₙ ≤ yₙ + ε + ξₙ⁺
  • tₙ ≥ yₙ – ε – ξₙ⁻

• Error function (replacing C Σₙ Eε(yₙ, tₙ) + ½ ‖w‖² from the previous slide):

    C Σₙ (ξₙ⁺ + ξₙ⁻) + ½ ‖w‖²

• … use Lagrange multipliers {aₙ⁺, aₙ⁻, μₙ⁺, μₙ⁻} ≥ 0:

    min L(...) = C Σₙ (ξₙ⁺ + ξₙ⁻) + ½ ‖w‖²
                 – Σₙ (μₙ⁺ ξₙ⁺ + μₙ⁻ ξₙ⁻)
                 – Σₙ aₙ⁺ (ε + ξₙ⁺ + yₙ – tₙ)
                 – Σₙ aₙ⁻ (ε + ξₙ⁻ – yₙ + tₙ)

taken from Andrew W. Moore

Page 51

SVM Regression, con’t

• Set derivatives to 0, solve for {ξₙ⁺, ξₙ⁻, μₙ⁺, μₙ⁻} … starting from

    L(...) = C Σₙ (ξₙ⁺ + ξₙ⁻) + ½ ‖w‖²
             – Σₙ (μₙ⁺ ξₙ⁺ + μₙ⁻ ξₙ⁻)
             – Σₙ aₙ⁺ (ε + ξₙ⁺ + yₙ – tₙ)
             – Σₙ aₙ⁻ (ε + ξₙ⁻ – yₙ + tₙ)

  this leaves the dual problem

    min over {aₙ⁺, aₙ⁻} of
    L̃(a⁺, a⁻) = ½ Σₙ Σₘ (aₙ⁺ – aₙ⁻)(aₘ⁺ – aₘ⁻) k(xₙ, xₘ)
                 + ε Σₙ (aₙ⁺ + aₙ⁻) – Σₙ (aₙ⁺ – aₙ⁻) tₙ

    s.t. 0 ≤ aₙ⁺ ≤ C, 0 ≤ aₙ⁻ ≤ C

• Prediction for a new x:

    y(x) = Σₙ (aₙ⁺ – aₙ⁻) k(xₙ, x)

Quadratic Program!!

taken from Andrew W. Moore

Page 52

SVM Regression, con’t

    y(x) = Σₙ (aₙ⁺ – aₙ⁻) k(xₙ, x)

• Can ignore xₙ unless either aₙ⁺ > 0 or aₙ⁻ > 0

• aₙ⁺ > 0 only if tₙ = yₙ + ε + ξₙ⁺, i.e. if the point is on the upper boundary of the ε-tube (ξₙ⁺ = 0) or above it (ξₙ⁺ > 0)

• aₙ⁻ > 0 only if tₙ = yₙ – ε – ξₙ⁻, i.e. if the point is on the lower boundary of the ε-tube (ξₙ⁻ = 0) or below it (ξₙ⁻ > 0)

taken from Andrew W. Moore
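For illustration (not part of the slides), the whole ε-insensitive regression pipeline in a few lines of scikit-learn, assuming it is installed; kernel, C, ε, and the toy data are arbitrary choices:

    # epsilon-insensitive SVM regression: the library solves the dual QP above and
    # keeps only the points with a_n+ > 0 or a_n- > 0 as support vectors.
    import numpy as np
    from sklearn.svm import SVR

    X = np.linspace(-3, 3, 200)[:, None]
    t = np.sinc(X).ravel() + 0.05 * np.random.default_rng(0).normal(size=200)

    reg = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=2.0)
    reg.fit(X, t)

    print("support vectors:", len(reg.support_))        # typically far fewer than 200
    print("max |error| on training set:", np.abs(reg.predict(X) - t).max())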

Page 53

Kernels in Logistic Regression

• Define weights in terms of support vectors:

• Derive a simple gradient descent rule on αᵢ (a sketch is given below)

taken from Andrew W. Moore
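The update rule itself is not reproduced in this transcript. Under the usual parameterization w = Σᵢ αᵢ φ(xᵢ), so that f(x) = Σᵢ αᵢ k(xᵢ, x), a minimal gradient-descent sketch for the logistic loss (all names, the learning rate, and the step count are illustrative):

    # Kernel logistic regression by gradient descent on alpha (labels y in {0, 1}).
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_kernel_logreg(K, y, lr=0.1, steps=500):
        # K is the m x m kernel Gram matrix, K[i, j] = k(x_i, x_j).
        alpha = np.zeros(K.shape[0])
        for _ in range(steps):
            p = sigmoid(K @ alpha)   # predicted probabilities
            grad = K @ (p - y)       # gradient of the negative log-likelihood in alpha
            alpha -= lr * grad
        return alpha

    # Prediction for a new x: sigmoid(sum_i alpha_i * k(x_i, x)).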

Page 54

A few results

Page 55

Sinc=sin(π x)/ (π x), RBF kernel

Page 56

Sinc=sin(π x)/ (π x), RBF kernel

Page 57

Sinc=sin(π x)/ (π x), RBF kernel

Page 58

Sinc=sin(π x)/ (π x), RBF kernel

Page 59

Sinc=sin(π x)/ (π x), RBF kernel

Page 60

Sinc=sin(π x)/ (π x), RBF kernel

Page 61

Sinc=sin(π x)/ (π x), poly kernel

Page 62

Sinc=sin(π x)/ (π x), poly kernel

Page 63

Sinc=sin(π x)/ (π x), poly kernel

Page 64

Sinc=sin(π x)/ (π x), poly kernel

Page 65

Sinc=sin(π x)/ (π x), poly kernel

Page 66

Sinc=sin(π x)/ (π x), poly kernel

Page 67

Summary

Page 68

Key SVM Ideas

• Maximize the margin between + and – examples
  • connects to PAC theory
• Sparse: only the support vectors contribute to the solution
• Penalize errors in the non-separable case
• Kernels map examples into a new, usually nonlinear space
  • Implicitly do dot products in this new space (in the "dual" form of the SVM program)
• Kernels are separate from SVMs… but they combine very nicely with SVMs

taken from Andrew W. Moore

Page 69

Summary I: Advantages

• Systematic implementation through quadratic programming
  • very efficient implementations exist
• Excellent data-dependent generalization bounds
• Regularization built into the cost function
• Statistical performance is independent of the dimension of the feature space

• Theoretically related to widely studied fields of regularization theory and sparse approximation

• Fully adaptive procedures available for determining hyper-parameters

taken from Andrew W. Moore

Page 70

Summary II: Drawbacks

• Treatment of the non-separable case is somewhat heuristic
• Number of support vectors may depend strongly on the kernel type and the hyper-parameters
• Systematic choice of kernels is difficult (prior information)
  • … some ideas exist

• Optimization may require clever heuristics for large problems

taken from Andrew W. Moore

Page 71

What You Should Know

• Definition of a maximum margin classifier
• Sparse version: (Linear) SVMs
• What QP can do for you (but you don't need to know how it does it)
• How Maximum Margin = a QP problem
• How to deal with noisy (non-separable) data
  • Slack variables
• How to permit "non-linear boundaries"
  • Kernel trick
• How SVM kernel functions permit us to pretend we're working with a zillion features

taken from Andrew W. Moore

Page 72

Thanks for the Attention!

Page 73

Attic

Page 74

Difference between SVMs and Logistic Regression

                                           SVMs           Logistic Regression
Loss function                              Hinge loss     Log-loss
High dimensional features with kernels     Yes!           No
Solution sparse                            Often yes!     Almost always no!
Semantics of output                        "Margin"       Real probabilities