
Regression and Classification using Kernel Methods

Barnabás Póczos

University of Alberta

Oct 15, 2009


Roadmap I

• Primal Perceptron ⇒ Dual Perceptron ⇒ Kernels (it doesn't want a large margin...)

• Primal hard SVM ⇒ Dual hard SVM ⇒ Kernels (it wants a large margin, but assumes the data is linearly separable in the feature space)

• Primal soft SVM ⇒ Dual soft SVM ⇒ Kernels (it wants a large margin, and doesn't assume that the data is linearly separable in the feature space)

Classification


Roadmap II

• Ridge Regression ⇒ Dual form ⇒ Kernels

• Primal SVM for Regression ⇒ Dual SVM ⇒ Kernels

• Logistic Regression ⇒ Kernels

• Bayesian Ridge Regression in feature space (Gaussian Processes) ⇒ Dual form ⇒ Kernels

Regression

The Perceptron


The primal algorithm in the feature space

Picture is taken from R. Herbrich
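A minimal runnable sketch of the primal perceptron in a feature space; the explicit quadratic feature map `phi` below is a hypothetical choice for illustration, not the one on the slide:

```python
import numpy as np

def phi(x):
    # Hypothetical explicit feature map: the original features plus all pairwise products.
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, np.outer(x, x).ravel()])

def primal_perceptron(X, y, epochs=100):
    """Primal perceptron in feature space; y is an array of labels in {-1, +1}."""
    Phi = np.array([phi(x) for x in X])
    w = np.zeros(Phi.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for f, t in zip(Phi, y):
            if t * (w @ f + b) <= 0:   # misclassified (or exactly on the boundary)
                w += t * f             # move the separating hyperplane toward the point
                b += t
                mistakes += 1
        if mistakes == 0:              # all training points classified correctly
            break
    return w, b
```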


The Dual Perceptron

Picture is taken from R. Herbrich
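A minimal sketch of the dual (kernel) perceptron: the weight vector is never formed explicitly, only the per-point mistake counts αᵢ, and the feature space enters solely through a kernel. The RBF kernel and its bandwidth below are illustrative choices:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # Gaussian RBF kernel; sigma is an illustrative bandwidth.
    d = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-(d @ d) / (2.0 * sigma ** 2))

def dual_perceptron(X, y, k=rbf_kernel, epochs=100):
    """Dual perceptron: w = sum_i alpha_i * y_i * phi(x_i), accessed only through k."""
    y = np.asarray(y, dtype=float)
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)   # alpha_i counts the mistakes made on training point i
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            f_i = np.sum(alpha * y * K[:, i]) + b
            if y[i] * f_i <= 0:
                alpha[i] += 1
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b
```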

The SVM


Linear Classifiers

f(x,w,b) = sign(w ∙ x + b)

How to classify this data?

Each of these seems fine…

... which is best?

(Figure: a 2-D dataset with +1 and –1 points and several candidate separating lines f(x, w, b).)

taken from Andrew W. Moore


Classifier Margin

f(x,w,b) = sign(w ∙ x + b)

The margin of a linear classifier is the width by which the boundary could be increased before hitting a datapoint.


taken from Andrew W. Moore


Maximum Margin

f(x,w,b) = sign(w ∙ x + b)

Support Vectors are datapoints that “touch” the margin

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

• This is the simplest kind of SVM, called a Linear SVM (LSVM).

taken from Andrew W. Moore



Specifying a Line and a Margin

• Plus-plane = { x : w ∙ x + b = +1 }
• Minus-plane = { x : w ∙ x + b = –1 }

(Figure: the plus-plane w ∙ x + b = +1, the classifier boundary w ∙ x + b = 0, and the minus-plane w ∙ x + b = –1, with the "Predict Class = +1" zone above and the "Predict Class = –1" zone below.)

Classify as:
• +1 if w ∙ x + b ≥ 1
• –1 if w ∙ x + b ≤ –1
• Universe explodes if –1 < w ∙ x + b < 1

taken from Andrew W. Moore


Computing the margin width

Given:
• w ∙ x+ + b = +1
• w ∙ x- + b = –1
• x+ = x- + λ w
• |x+ – x-| = M (the margin width)

(Figure: the "Predict Class = +1" and "Predict Class = –1" zones bounded by the planes w ∙ x + b = 1, 0, –1, with x- on the minus-plane and x+ the closest point on the plus-plane.)

Subtracting the two plane equations gives w ∙ (x+ – x-) = 2, i.e. λ w ∙ w = 2, so λ = 2 / (w ∙ w).

M = |x+ – x-| = |λ w| = λ |w| = λ √(w ∙ w) = 2 √(w ∙ w) / (w ∙ w) = 2 / √(w ∙ w)

Yay! Just maximize 2 / √(w ∙ w) ≡ minimize w ∙ w

Wait… OMG, I forgot the data!

taken from Andrew W. Moore


The Primal Hard SVM

This is a QP problem ⇒ the solution is expressible in dual form

The computational effort is O(n²)
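In standard form (using m for the number of training points, a notational assumption), the primal hard-margin SVM is:

```latex
\min_{w,\,b}\ \frac{1}{2}\, w \cdot w
\qquad \text{subject to} \qquad
y_i \,(w \cdot x_i + b) \ge 1, \quad i = 1, \dots, m .
```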


Quadratic Programming – in general

Find

argmin_w  c + dᵀw + wᵀKw / 2        (quadratic criterion)

subject to n additional linear inequality constraints

a_11 w_1 + a_12 w_2 + … + a_1m w_m ≤ b_1
a_21 w_1 + a_22 w_2 + … + a_2m w_m ≤ b_2
⋮
a_n1 w_1 + a_n2 w_2 + … + a_nm w_m ≤ b_n

and to e additional linear equality constraints

a_(n+1)1 w_1 + a_(n+1)2 w_2 + … + a_(n+1)m w_m = b_(n+1)
a_(n+2)1 w_1 + a_(n+2)2 w_2 + … + a_(n+2)m w_m = b_(n+2)
⋮
a_(n+e)1 w_1 + a_(n+e)2 w_2 + … + a_(n+e)m w_m = b_(n+e)

Note: w ∙ x = wᵀx

taken from Andrew W. Moore
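A minimal sketch of handing such a problem to an off-the-shelf QP solver, assuming the cvxopt package (which minimizes ½ wᵀPw + qᵀw subject to Gw ≤ h and Aw = b, so P plays the role of K and q the role of d); the numbers form an arbitrary toy instance:

```python
import numpy as np
from cvxopt import matrix, solvers

# Quadratic criterion: minimize 1/2 w^T K w + d^T w  (the constant c does not affect the argmin).
K = np.eye(2)                          # must be positive semidefinite
d = np.array([-2.0, -2.0])

# Inequality constraints G w <= h:  w1 + w2 <= 1,  -w1 <= 0,  -w2 <= 0.
G = np.array([[ 1.0,  1.0],
              [-1.0,  0.0],
              [ 0.0, -1.0]])
h = np.array([1.0, 0.0, 0.0])

solvers.options['show_progress'] = False
sol = solvers.qp(matrix(K), matrix(d), matrix(G), matrix(h))
print(np.array(sol['x']).ravel())      # approximately [0.5, 0.5]
```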


The Dual Hard SVM

Quadratic Programming; the computational effort is O(m²)
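In standard form, the dual hard-margin SVM over the m training points is:

```latex
\max_{\alpha \ge 0}\ \sum_{i=1}^{m} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m}
    \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)
\qquad \text{subject to} \qquad
\sum_{i=1}^{m} \alpha_i y_i = 0 .
```

In the kernelized version the inner products x_i ∙ x_j are replaced by k(x_i, x_j).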

Lemma


Proof. Primal Lagrangian:
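In standard notation, the primal Lagrangian of the hard-margin problem is:

```latex
L(w, b, \alpha) = \frac{1}{2}\, w \cdot w
  - \sum_{i=1}^{m} \alpha_i \bigl[\, y_i (w \cdot x_i + b) - 1 \,\bigr],
  \qquad \alpha_i \ge 0 .
```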

The solution:
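Setting the derivatives with respect to w and b to zero gives:

```latex
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0 .
```

Substituting these back into L yields the dual problem above.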


Proof cont.


The Problem with Hard SVM

It assumes the samples are linearly separable...

Solutions:
• Use a kernel with large expressive power (e.g. RBF with small σ) ⇒ the training samples become linearly separable in the feature space ⇒ Hard SVM can be applied ⇒ overfitting...
• Soft margin SVM instead of Hard SVM
• Slack variables...


Hard SVM

The Hard SVM problem can be rewritten using a penalty that is nonzero for a misclassification and zero for a correct classification (the l₀₋₁ loss).


From Hard to Soft constraints

Instead of using hard constraints (which require the points to be linearly separable), we can try to solve the soft version of the problem, where the penalty is again nonzero for a misclassification and zero for a correct classification.


Problems with the l₀₋₁ loss

It is not convex in y f(x) ⇒ it is not convex in w either... and we only like convex functions...

Let us approximate it with convex functions (written out below)!

• Piecewise linear approximation (hinge loss, l_lin)

• Quadratic approximation (l_quad)
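A standard way to write the three losses (the exact convention at the decision boundary may differ from the slide):

```latex
\ell_{0\text{-}1}\bigl(y f(x)\bigr) = \mathbf{1}\bigl[\, y f(x) \le 0 \,\bigr],
\qquad
\ell_{\mathrm{lin}}\bigl(y f(x)\bigr) = \max\bigl(0,\, 1 - y f(x)\bigr),
\qquad
\ell_{\mathrm{quad}}\bigl(y f(x)\bigr) = \max\bigl(0,\, 1 - y f(x)\bigr)^{2} .
```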


Approximation of the Heaviside step function

Picture is taken from R. Herbrich


The hinge loss approximation of l₀₋₁,

where l_lin(y f(x)) = max(0, 1 − y f(x)).


The Slack Variables

(Figure: the planes w ∙ x + b = 1, 0, –1 with margin M = 2 / √(w ∙ w); points on the wrong side of their margin receive slack variables, e.g. ξ₂, ξ₇, ξ₁₁.)

taken from Andrew W. Moore


The Primal Soft SVM problem

Using the l_lin (hinge) loss approximation, or, equivalently, the constrained form with slack variables (both written out below).
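In standard form, the two equivalent formulations are (the placement of the constant C is a common convention and may differ from the slide):

```latex
\min_{w,\,b}\ \frac{1}{2}\, w \cdot w
  + C \sum_{n=1}^{m} \max\bigl(0,\; 1 - y_n (w \cdot x_n + b)\bigr)
```

and, introducing slack variables ξₙ:

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\, w \cdot w + C \sum_{n=1}^{m} \xi_n
\qquad \text{subject to} \qquad
y_n (w \cdot x_n + b) \ge 1 - \xi_n, \quad \xi_n \ge 0 .
```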


The Dual Soft SVM (using hinge loss)
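In standard form, the dual of the soft-margin (hinge loss) SVM is:

```latex
\max_{\alpha}\ \sum_{n=1}^{m} \alpha_n
  - \frac{1}{2} \sum_{n=1}^{m} \sum_{l=1}^{m}
    \alpha_n \alpha_l \, y_n y_l \, k(x_n, x_l)
\qquad \text{subject to} \qquad
0 \le \alpha_n \le C, \quad \sum_{n=1}^{m} \alpha_n y_n = 0 .
```

The only change from the hard-margin dual is the upper bound C on each αₙ.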


Proofs

where


Proofs


Classifying using SVMs with Kernels

• Choose a kernel function
• Solve the dual problem
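A minimal sketch of these two steps with scikit-learn's SVC, which solves the dual soft-margin problem internally; the RBF kernel, the class split and the values of C and gamma are illustrative choices, not the settings used for the slides:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Turn the Iris data into a binary problem (class 0 vs. classes 1 and 2).
X, y = load_iris(return_X_y=True)
y = np.where(y == 0, -1, +1)

# Step 1: choose a kernel (here RBF).  Step 2: solve the dual soft-margin SVM.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)

print("number of support vectors:", clf.n_support_.sum())
print("training accuracy:", clf.score(X, y))
```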

A few results


Kernels and Linear Classifiers

We will use linear classifiers in this feature space.


The Decision Surface

Picture is taken from R. Herbrich

For general feature maps it can be very difficult to solve these...


Steve Gunn’s svm toolboxResults, Iris 2vs13, Linear kernel

35

Results, Iris 1vs23, 2nd order kernel

36

Results, Iris 1vs23, 2nd order kernel

37

Results, Iris 1vs23, 13th order kernel

38

Results, Iris 1vs23, RBF kernel

39

Results, Iris 1vs23, RBF kernel

40

Results, Chessboard, Poly kernel

41

Results, Chessboard, Poly kernel

42

Results, Chessboard, Poly kernel

43

Results, Chessboard, Poly kernel

44

Results, Chessboard, poly kernel

45

Results, Chessboard, RBF kernel

Regression


Ridge Regression

Primal:

Dual for a given λ: ...after some calculations...

This can be solved in closed form:

Linear regression:
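For reference, in standard notation (data matrix X with the xᵢ as rows, targets t, m training points, and K = XXᵀ with entries xᵢ ∙ xⱼ), the primal and dual solutions can be written as:

```latex
w = (X^{\top} X + \lambda I)^{-1} X^{\top} t
  = X^{\top} (K + \lambda I)^{-1} t
  = \sum_{i=1}^{m} \alpha_i x_i,
\qquad
\alpha = (K + \lambda I)^{-1} t,
\qquad
f(x) = \sum_{i=1}^{m} \alpha_i \, (x_i \cdot x) .
```

In the kernel version, Kᵢⱼ = k(xᵢ, xⱼ) and the prediction uses k(xᵢ, x) in place of xᵢ ∙ x.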


Kernel Ridge Regression Algorithm
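A minimal runnable sketch of the algorithm, assuming an RBF kernel; sigma and lam are illustrative hyperparameters, and the sinc data mirrors the result slides later on:

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    # Pairwise Gaussian RBF kernel values between the rows of A and the rows of B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_ridge_fit(X, t, lam=0.1, sigma=1.0):
    """Solve (K + lam*I) alpha = t; the dual coefficients alpha define the regressor."""
    K = rbf_kernel_matrix(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), t)

def kernel_ridge_predict(X_train, alpha, X_new, sigma=1.0):
    # f(x) = sum_i alpha_i * k(x_i, x)
    return rbf_kernel_matrix(X_new, X_train, sigma) @ alpha

# Example: noisy samples of sinc(x) = sin(pi x) / (pi x).
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
t = np.sinc(X).ravel() + 0.1 * rng.standard_normal(50)
alpha = kernel_ridge_fit(X, t, lam=0.1, sigma=0.5)
pred = kernel_ridge_predict(X, alpha, X, sigma=0.5)
print("training RMSE:", np.sqrt(np.mean((pred - t) ** 2)))
```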


SVM Regression

• Typical loss function (ridge regression):

  C Σₙ (yₙ − tₙ)² + ½ ‖w‖²   … a quadratic penalty whenever yₙ ≠ tₙ

• To be sparse… don't worry if "close enough": use the ε-insensitive loss

  E_ε(y, t) = 0 if |y − t| < ε,  and  |y − t| − ε otherwise

• No penalty if the prediction lies inside the ε-tube; the objective becomes

  C Σₙ E_ε(yₙ, tₙ) + ½ ‖w‖²

taken from Andrew W. Moore


SVM Regression

• No penalty if yₙ − ε ≤ tₙ ≤ yₙ + ε
• Slack variables {ξₙ⁺, ξₙ⁻}:
  • tₙ ≤ yₙ + ε + ξₙ⁺
  • tₙ ≥ yₙ − ε − ξₙ⁻
• Error function: C Σₙ E_ε(yₙ, tₙ) + ½ ‖w‖², with E_ε(y, t) = 0 if |y − t| < ε and |y − t| − ε otherwise; with the slack variables it becomes

  C Σₙ (ξₙ⁺ + ξₙ⁻) + ½ ‖w‖²

• … use Lagrange multipliers {aₙ⁺, aₙ⁻, μₙ⁺, μₙ⁻} ≥ 0 and minimize

  L(…) = C Σₙ (ξₙ⁺ + ξₙ⁻) + ½ ‖w‖² − Σₙ (μₙ⁺ ξₙ⁺ + μₙ⁻ ξₙ⁻) − Σₙ aₙ⁺ (ε + ξₙ⁺ + yₙ − tₙ) − Σₙ aₙ⁻ (ε + ξₙ⁻ − yₙ + tₙ)

taken from Andrew W. Moore


SVM Regression, con’t

• Set the derivatives to 0 and solve for {ξₙ⁺, ξₙ⁻, μₙ⁺, μₙ⁻} … eliminating them leaves a dual problem in {aₙ⁺, aₙ⁻}:

  maximize  L̃(a⁺, a⁻) = −½ Σₙ Σₘ (aₙ⁺ − aₙ⁻)(aₘ⁺ − aₘ⁻) k(xₙ, xₘ) − ε Σₙ (aₙ⁺ + aₙ⁻) + Σₙ (aₙ⁺ − aₙ⁻) tₙ

  s.t. 0 ≤ aₙ⁺ ≤ C, 0 ≤ aₙ⁻ ≤ C

• Prediction for a new x:

  y(x) = Σₙ (aₙ⁺ − aₙ⁻) k(xₙ, x)

Quadratic Program!!

taken from Andrew W. Moore


SVM Regression, con’t

y(x) = Σₙ (aₙ⁺ − aₙ⁻) k(xₙ, x)

• We can ignore xₙ unless either aₙ⁺ > 0 or aₙ⁻ > 0

• aₙ⁺ > 0 only if tₙ = yₙ + ε + ξₙ⁺, i.e. if the point lies on the upper boundary of the ε-tube (ξₙ⁺ = 0) or above it (ξₙ⁺ > 0)

• aₙ⁻ > 0 only if tₙ = yₙ − ε − ξₙ⁻, i.e. if the point lies on the lower boundary of the ε-tube (ξₙ⁻ = 0) or below it (ξₙ⁻ > 0)

taken from Andrew W. Moore
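A minimal sketch of ε-insensitive SVM regression with scikit-learn's SVR, which solves this dual internally; the RBF kernel and the values of C, epsilon and gamma are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy samples of the sinc function used in the result slides below.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
t = np.sinc(X).ravel() + 0.1 * rng.standard_normal(100)   # np.sinc(x) = sin(pi x)/(pi x)

# epsilon is the half-width of the tube: points inside it incur no penalty
# and therefore do not become support vectors.
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5)
reg.fit(X, t)

print("number of support vectors:", len(reg.support_))
print("prediction at x = 0:", reg.predict([[0.0]])[0])     # sinc(0) = 1
```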


Kernels in Logistic Regression

• Define the weights in terms of the support vectors: w = Σᵢ αᵢ φ(xᵢ)

• Derive a simple gradient descent rule on the αᵢ

taken from Andrew W. Moore
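A minimal sketch of this idea, assuming an RBF kernel and plain full-batch gradient descent on α; the kernel width, regularization strength, learning rate and step count are illustrative choices:

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_logistic_regression(X, y, sigma=1.0, lam=1e-3, lr=0.1, steps=5000):
    """Gradient descent on alpha for f(x) = sum_j alpha_j k(x_j, x); y in {-1, +1}."""
    K = rbf_kernel_matrix(X, X, sigma)
    alpha = np.zeros(len(X))
    for _ in range(steps):
        f = K @ alpha                          # current scores on the training points
        p = 1.0 / (1.0 + np.exp(y * f))        # = sigmoid(-y * f)
        grad = K @ (-y * p + lam * alpha)      # grad of logistic loss + (lam/2) alpha^T K alpha
        alpha -= lr * grad
    return alpha

def predict(X_train, alpha, X_new, sigma=1.0):
    return np.sign(rbf_kernel_matrix(X_new, X_train, sigma) @ alpha)

# Tiny demo on an XOR-like problem that is not linearly separable in the input space.
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([-1., -1., +1., +1.])
alpha = kernel_logistic_regression(X, y, sigma=0.7)
print(predict(X, alpha, X, sigma=0.7))         # should print [-1. -1.  1.  1.]
```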

A few results

Results: regression of sinc(x) = sin(πx) / (πx)

• RBF kernel (several hyperparameter settings)
• polynomial kernel (several hyperparameter settings)

Summary


Key SVM Ideas

• Maximize the margin between + and – examples
  • connects to PAC theory
• Sparse: only the support vectors contribute to the solution
• Penalize errors in the non-separable case
• Kernels map examples into a new, usually nonlinear space
  • Implicitly do dot products in this new space (in the "dual" form of the SVM program)
• Kernels are separate from SVMs… but they combine very nicely with SVMs

taken from Andrew W. Moore


Summary I

Advantages
• Systematic implementation through quadratic programming ⇒ very efficient implementations
• Excellent data-dependent generalization bounds
• Regularization built into the cost function
• Statistical performance is independent of the dimension of the feature space
• Theoretically related to the widely studied fields of regularization theory and sparse approximation
• Fully adaptive procedures available for determining the hyper-parameters

taken from Andrew W. Moore


Summary II

Drawbacks
• Treatment of the non-separable case is somewhat heuristic
• The number of support vectors may depend strongly on the kernel type and the hyper-parameters
• Systematic choice of kernels is difficult (prior information)
  • … some ideas exist
• Optimization may require clever heuristics for large problems

taken from Andrew W. Moore


What You Should Know

• Definition of a maximum margin classifier
• Sparse version: (Linear) SVMs
• What QP can do for you (but you don't need to know how it does it)
• How Maximum Margin = a QP problem
• How to deal with noisy (non-separable) data
  • Slack variables
• How to permit "non-linear boundaries"
  • Kernel trick
• How SVM kernel functions permit us to pretend we're working with a zillion features

taken from Andrew W. Moore


Thanks for your attention!


Attic


Difference between SVMs and Logistic Regression

• Loss function: hinge loss (SVMs) vs. log-loss (Logistic Regression)
• High dimensional features with kernels: yes (SVMs) vs. no (Logistic Regression)
• Solution sparse: often yes (SVMs) vs. almost always no (Logistic Regression)
• Semantics of output: a "margin" (SVMs) vs. real probabilities (Logistic Regression)