
Regression and Classification using Kernel Methods

Barnabás Póczos

University of Alberta

Oct 15, 2009


Roadmap I

• Primal Perceptron ⇒ Dual Perceptron ⇒ Kernels (it doesn't want a large margin...)

• Primal hard SVM ⇒ Dual hard SVM ⇒ Kernels (it wants a large margin, but assumes the data is linearly separable in the feature space)

• Primal soft SVM ⇒ Dual soft SVM ⇒ Kernels (it wants a large margin, and doesn't assume that the data is linearly separable in the feature space)

Classification


Roadmap II

• Ridge Regression ⇒ Dual form ⇒ Kernels

• Primal SVM for Regression ⇒ Dual SVM ⇒ Kernels

• Logistic Regression ⇒ Kernels

• Bayesian Ridge Regression in feature space (Gaussian Processes) ⇒ Dual form ⇒ Kernels

Regression

The Perceptron


The primal algorithm in the feature space

Picture is taken from R. Herbrich
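A minimal runnable sketch of the primal perceptron in a feature space; the explicit quadratic feature map `phi` below is a hypothetical choice for illustration, not the one on the slide:

```python
import numpy as np

def phi(x):
    # Hypothetical explicit feature map: the original features plus all pairwise products.
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, np.outer(x, x).ravel()])

def primal_perceptron(X, y, epochs=100):
    """Primal perceptron in feature space; y is an array of labels in {-1, +1}."""
    Phi = np.array([phi(x) for x in X])
    w = np.zeros(Phi.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for f, t in zip(Phi, y):
            if t * (w @ f + b) <= 0:   # misclassified (or exactly on the boundary)
                w += t * f             # move the separating hyperplane toward the point
                b += t
                mistakes += 1
        if mistakes == 0:              # all training points classified correctly
            break
    return w, b
```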


The Dual Perceptron

Picture is taken from R. Herbrich
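A minimal sketch of the dual (kernel) perceptron: the weight vector is never formed explicitly, only the per-point mistake counts αᵢ, and the feature space enters solely through a kernel. The RBF kernel and its bandwidth below are illustrative choices:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # Gaussian RBF kernel; sigma is an illustrative bandwidth.
    d = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-(d @ d) / (2.0 * sigma ** 2))

def dual_perceptron(X, y, k=rbf_kernel, epochs=100):
    """Dual perceptron: w = sum_i alpha_i * y_i * phi(x_i), accessed only through k."""
    y = np.asarray(y, dtype=float)
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)   # alpha_i counts the mistakes made on training point i
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            f_i = np.sum(alpha * y * K[:, i]) + b
            if y[i] * f_i <= 0:
                alpha[i] += 1
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b
```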

The SVM


Linear Classifiers

f(x,w,b) = sign(w ∙ x + b)

How to classify this data?

Each of these seems fine…

... which is best?

(Figure: a 2-D dataset with +1 and –1 points and several candidate separating lines f(x, w, b).)

taken from Andrew W. Moore


Classifier Margin

f(x,w,b) = sign(w ∙ x + b)

The margin of a linear classifier is the width by which the boundary could be increased before hitting a datapoint.


taken from Andrew W. Moore


Maximum Margin

f(x,w,b) = sign(w ∙ x + b)

Support Vectors are datapoints that “touch” the margin

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

• This is the simplest kind of SVM, called a Linear SVM (LSVM).

taken from Andrew W. Moore



Specifying a Line and a Margin

• Plus-plane = { x : w ∙ x + b = +1 }
• Minus-plane = { x : w ∙ x + b = –1 }

(Figure: the plus-plane w ∙ x + b = +1, the classifier boundary w ∙ x + b = 0, and the minus-plane w ∙ x + b = –1, with the "Predict Class = +1" zone above and the "Predict Class = –1" zone below.)

Classify as:
• +1 if w ∙ x + b ≥ 1
• –1 if w ∙ x + b ≤ –1
• Universe explodes if –1 < w ∙ x + b < 1

taken from Andrew W. Moore


Computing the margin width

Given:
• w ∙ x+ + b = +1
• w ∙ x- + b = –1
• x+ = x- + λ w
• |x+ – x-| = M (the margin width)

(Figure: the "Predict Class = +1" and "Predict Class = –1" zones bounded by the planes w ∙ x + b = 1, 0, –1, with x- on the minus-plane and x+ the closest point on the plus-plane.)

Subtracting the two plane equations gives w ∙ (x+ – x-) = 2, i.e. λ w ∙ w = 2, so λ = 2 / (w ∙ w).

M = |x+ – x-| = |λ w| = λ |w| = λ √(w ∙ w) = 2 √(w ∙ w) / (w ∙ w) = 2 / √(w ∙ w)

Yay! Just maximize 2 / √(w ∙ w) ≡ minimize w ∙ w

Wait… OMG, I forgot the data!

taken from Andrew W. Moore


The Primal Hard SVM

This is a QP problem ⇒ the solution is expressible in dual form

The computational effort is O(n²)
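In standard form (using m for the number of training points, a notational assumption), the primal hard-margin SVM is:

```latex
\min_{w,\,b}\ \frac{1}{2}\, w \cdot w
\qquad \text{subject to} \qquad
y_i \,(w \cdot x_i + b) \ge 1, \quad i = 1, \dots, m .
```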


Quadratic Programming – in general

Find

argmin_w  c + dᵀw + wᵀKw / 2        (quadratic criterion)

subject to n additional linear inequality constraints

a_11 w_1 + a_12 w_2 + … + a_1m w_m ≤ b_1
a_21 w_1 + a_22 w_2 + … + a_2m w_m ≤ b_2
⋮
a_n1 w_1 + a_n2 w_2 + … + a_nm w_m ≤ b_n

and to e additional linear equality constraints

a_(n+1)1 w_1 + a_(n+1)2 w_2 + … + a_(n+1)m w_m = b_(n+1)
a_(n+2)1 w_1 + a_(n+2)2 w_2 + … + a_(n+2)m w_m = b_(n+2)
⋮
a_(n+e)1 w_1 + a_(n+e)2 w_2 + … + a_(n+e)m w_m = b_(n+e)

Note: w ∙ x = wᵀx

taken from Andrew W. Moore
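A minimal sketch of handing such a problem to an off-the-shelf QP solver, assuming the cvxopt package (which minimizes ½ wᵀPw + qᵀw subject to Gw ≤ h and Aw = b, so P plays the role of K and q the role of d); the numbers form an arbitrary toy instance:

```python
import numpy as np
from cvxopt import matrix, solvers

# Quadratic criterion: minimize 1/2 w^T K w + d^T w  (the constant c does not affect the argmin).
K = np.eye(2)                          # must be positive semidefinite
d = np.array([-2.0, -2.0])

# Inequality constraints G w <= h:  w1 + w2 <= 1,  -w1 <= 0,  -w2 <= 0.
G = np.array([[ 1.0,  1.0],
              [-1.0,  0.0],
              [ 0.0, -1.0]])
h = np.array([1.0, 0.0, 0.0])

solvers.options['show_progress'] = False
sol = solvers.qp(matrix(K), matrix(d), matrix(G), matrix(h))
print(np.array(sol['x']).ravel())      # approximately [0.5, 0.5]
```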


The Dual Hard SVM

Quadratic Programming; the computational effort is O(m²)
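In standard form, the dual hard-margin SVM over the m training points is:

```latex
\max_{\alpha \ge 0}\ \sum_{i=1}^{m} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m}
    \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)
\qquad \text{subject to} \qquad
\sum_{i=1}^{m} \alpha_i y_i = 0 .
```

In the kernelized version the inner products x_i ∙ x_j are replaced by k(x_i, x_j).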

Lemma


Proof. Primal Lagrangian:
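In standard notation, the primal Lagrangian of the hard-margin problem is:

```latex
L(w, b, \alpha) = \frac{1}{2}\, w \cdot w
  - \sum_{i=1}^{m} \alpha_i \bigl[\, y_i (w \cdot x_i + b) - 1 \,\bigr],
  \qquad \alpha_i \ge 0 .
```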

The solution:
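Setting the derivatives with respect to w and b to zero gives:

```latex
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0 .
```

Substituting these back into L yields the dual problem above.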


Proof cont.


The Problem with Hard SVM

It assumes the samples are linearly separable...

Solutions:
• Use a kernel with large expressive power (e.g. RBF with small σ) ⇒ the training samples become linearly separable in the feature space ⇒ Hard SVM can be applied ⇒ overfitting...
• Soft margin SVM instead of Hard SVM
• Slack variables...


Hard SVM

The Hard SVM problem can be rewritten using a penalty that is nonzero for a misclassification and zero for a correct classification (the l₀₋₁ loss).


From Hard to Soft constraints

Instead of using hard constraints (which require the points to be linearly separable), we can try to solve the soft version of the problem, where the penalty is again nonzero for a misclassification and zero for a correct classification.


Problems with the l₀₋₁ loss

It is not convex in y f(x) ⇒ it is not convex in w either... and we only like convex functions...

Let us approximate it with convex functions (written out below)!

• Piecewise linear approximation (hinge loss, l_lin)

• Quadratic approximation (l_quad)
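A standard way to write the three losses (the exact convention at the decision boundary may differ from the slide):

```latex
\ell_{0\text{-}1}\bigl(y f(x)\bigr) = \mathbf{1}\bigl[\, y f(x) \le 0 \,\bigr],
\qquad
\ell_{\mathrm{lin}}\bigl(y f(x)\bigr) = \max\bigl(0,\, 1 - y f(x)\bigr),
\qquad
\ell_{\mathrm{quad}}\bigl(y f(x)\bigr) = \max\bigl(0,\, 1 - y f(x)\bigr)^{2} .
```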


Approximation of the Heaviside step function

Picture is taken from R. Herbrich


The hinge loss approximation of l₀₋₁,

where l_lin(y f(x)) = max(0, 1 − y f(x)).


The Slack Variables

(Figure: the planes w ∙ x + b = 1, 0, –1 with margin M = 2 / √(w ∙ w); points on the wrong side of their margin receive slack variables, e.g. ξ₂, ξ₇, ξ₁₁.)

taken from Andrew W. Moore


The Primal Soft SVM problem

Using the l_lin (hinge) loss approximation, or, equivalently, the constrained form with slack variables (both written out below).
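In standard form, the two equivalent formulations are (the placement of the constant C is a common convention and may differ from the slide):

```latex
\min_{w,\,b}\ \frac{1}{2}\, w \cdot w
  + C \sum_{n=1}^{m} \max\bigl(0,\; 1 - y_n (w \cdot x_n + b)\bigr)
```

and, introducing slack variables ξₙ:

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\, w \cdot w + C \sum_{n=1}^{m} \xi_n
\qquad \text{subject to} \qquad
y_n (w \cdot x_n + b) \ge 1 - \xi_n, \quad \xi_n \ge 0 .
```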


The Dual Soft SVM (using hinge loss)
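In standard form, the dual of the soft-margin (hinge loss) SVM is:

```latex
\max_{\alpha}\ \sum_{n=1}^{m} \alpha_n
  - \frac{1}{2} \sum_{n=1}^{m} \sum_{l=1}^{m}
    \alpha_n \alpha_l \, y_n y_l \, k(x_n, x_l)
\qquad \text{subject to} \qquad
0 \le \alpha_n \le C, \quad \sum_{n=1}^{m} \alpha_n y_n = 0 .
```

The only change from the hard-margin dual is the upper bound C on each αₙ.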


Proofs

where


Proofs


Classifying using SVMs with Kernels

• Choose a kernel function
• Solve the dual problem
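A minimal sketch of these two steps with scikit-learn's SVC, which solves the dual soft-margin problem internally; the RBF kernel, the class split and the values of C and gamma are illustrative choices, not the settings used for the slides:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Turn the Iris data into a binary problem (class 0 vs. classes 1 and 2).
X, y = load_iris(return_X_y=True)
y = np.where(y == 0, -1, +1)

# Step 1: choose a kernel (here RBF).  Step 2: solve the dual soft-margin SVM.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)

print("number of support vectors:", clf.n_support_.sum())
print("training accuracy:", clf.score(X, y))
```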

A few results


Kernels and Linear Classifiers

We will use linear classifiers in this feature space.


The Decision Surface

Picture is taken from R. Herbrich

For general feature maps it can be very difficult to solve these...


Steve Gunn’s svm toolboxResults, Iris 2vs13, Linear kernel

35

Results, Iris 1vs23, 2nd order kernel

36

Results, Iris 1vs23, 2nd order kernel

37

Results, Iris 1vs23, 13th order kernel

38

Results, Iris 1vs23, RBF kernel

39

Results, Iris 1vs23, RBF kernel

40

Results, Chessboard, Poly kernel

41

Results, Chessboard, Poly kernel

42

Results, Chessboard, Poly kernel

43

Results, Chessboard, Poly kernel

44

Results, Chessboard, poly kernel

45

Results, Chessboard, RBF kernel

Regression


Ridge Regression

Primal:

Dual for a given λ: ...after some calculations...

This can be solved in closed form:

Linear regression:
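For reference, in standard notation (data matrix X with the xᵢ as rows, targets t, m training points, and K = XXᵀ with entries xᵢ ∙ xⱼ), the primal and dual solutions can be written as:

```latex
w = (X^{\top} X + \lambda I)^{-1} X^{\top} t
  = X^{\top} (K + \lambda I)^{-1} t
  = \sum_{i=1}^{m} \alpha_i x_i,
\qquad
\alpha = (K + \lambda I)^{-1} t,
\qquad
f(x) = \sum_{i=1}^{m} \alpha_i \, (x_i \cdot x) .
```

In the kernel version, Kᵢⱼ = k(xᵢ, xⱼ) and the prediction uses k(xᵢ, x) in place of xᵢ ∙ x.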


Kernel Ridge Regression Algorithm
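A minimal runnable sketch of the algorithm, assuming an RBF kernel; sigma and lam are illustrative hyperparameters, and the sinc data mirrors the result slides later on:

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    # Pairwise Gaussian RBF kernel values between the rows of A and the rows of B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_ridge_fit(X, t, lam=0.1, sigma=1.0):
    """Solve (K + lam*I) alpha = t; the dual coefficients alpha define the regressor."""
    K = rbf_kernel_matrix(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), t)

def kernel_ridge_predict(X_train, alpha, X_new, sigma=1.0):
    # f(x) = sum_i alpha_i * k(x_i, x)
    return rbf_kernel_matrix(X_new, X_train, sigma) @ alpha

# Example: noisy samples of sinc(x) = sin(pi x) / (pi x).
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
t = np.sinc(X).ravel() + 0.1 * rng.standard_normal(50)
alpha = kernel_ridge_fit(X, t, lam=0.1, sigma=0.5)
pred = kernel_ridge_predict(X, alpha, X, sigma=0.5)
print("training RMSE:", np.sqrt(np.mean((pred - t) ** 2)))
```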


SVM Regression

• Typical loss function (ridge regression):

  C Σₙ (yₙ − tₙ)² + ½ ‖w‖²   … a quadratic penalty whenever yₙ ≠ tₙ

• To be sparse… don't worry if "close enough": use the ε-insensitive loss

  E_ε(y, t) = 0 if |y − t| < ε,  and  |y − t| − ε otherwise

• No penalty if the prediction lies inside the ε-tube; the objective becomes

  C Σₙ E_ε(yₙ, tₙ) + ½ ‖w‖²

taken from Andrew W. Moore


SVM Regression

• No penalty if yₙ − ε ≤ tₙ ≤ yₙ + ε
• Slack variables {ξₙ⁺, ξₙ⁻}:
  • tₙ ≤ yₙ + ε + ξₙ⁺
  • tₙ ≥ yₙ − ε − ξₙ⁻
• Error function: C Σₙ E_ε(yₙ, tₙ) + ½ ‖w‖², with E_ε(y, t) = 0 if |y − t| < ε and |y − t| − ε otherwise; with the slack variables it becomes

  C Σₙ (ξₙ⁺ + ξₙ⁻) + ½ ‖w‖²

• … use Lagrange multipliers {aₙ⁺, aₙ⁻, μₙ⁺, μₙ⁻} ≥ 0 and minimize

  L(…) = C Σₙ (ξₙ⁺ + ξₙ⁻) + ½ ‖w‖² − Σₙ (μₙ⁺ ξₙ⁺ + μₙ⁻ ξₙ⁻) − Σₙ aₙ⁺ (ε + ξₙ⁺ + yₙ − tₙ) − Σₙ aₙ⁻ (ε + ξₙ⁻ − yₙ + tₙ)

taken from Andrew W. Moore


SVM Regression, con’t

• Set the derivatives to 0 and solve for {ξₙ⁺, ξₙ⁻, μₙ⁺, μₙ⁻} … eliminating them leaves a dual problem in {aₙ⁺, aₙ⁻}:

  maximize  L̃(a⁺, a⁻) = −½ Σₙ Σₘ (aₙ⁺ − aₙ⁻)(aₘ⁺ − aₘ⁻) k(xₙ, xₘ) − ε Σₙ (aₙ⁺ + aₙ⁻) + Σₙ (aₙ⁺ − aₙ⁻) tₙ

  s.t. 0 ≤ aₙ⁺ ≤ C, 0 ≤ aₙ⁻ ≤ C

• Prediction for a new x:

  y(x) = Σₙ (aₙ⁺ − aₙ⁻) k(xₙ, x)

Quadratic Program!!

taken from Andrew W. Moore


SVM Regression, con’t

y(x) = Σₙ (aₙ⁺ − aₙ⁻) k(xₙ, x)

• We can ignore xₙ unless either aₙ⁺ > 0 or aₙ⁻ > 0

• aₙ⁺ > 0 only if tₙ = yₙ + ε + ξₙ⁺, i.e. if the point lies on the upper boundary of the ε-tube (ξₙ⁺ = 0) or above it (ξₙ⁺ > 0)

• aₙ⁻ > 0 only if tₙ = yₙ − ε − ξₙ⁻, i.e. if the point lies on the lower boundary of the ε-tube (ξₙ⁻ = 0) or below it (ξₙ⁻ > 0)

taken from Andrew W. Moore
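A minimal sketch of ε-insensitive SVM regression with scikit-learn's SVR, which solves this dual internally; the RBF kernel and the values of C, epsilon and gamma are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy samples of the sinc function used in the result slides below.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
t = np.sinc(X).ravel() + 0.1 * rng.standard_normal(100)   # np.sinc(x) = sin(pi x)/(pi x)

# epsilon is the half-width of the tube: points inside it incur no penalty
# and therefore do not become support vectors.
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5)
reg.fit(X, t)

print("number of support vectors:", len(reg.support_))
print("prediction at x = 0:", reg.predict([[0.0]])[0])     # sinc(0) = 1
```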


Kernels in Logistic Regression

• Define the weights in terms of the support vectors: w = Σᵢ αᵢ φ(xᵢ)

• Derive a simple gradient descent rule on the αᵢ

taken from Andrew W. Moore
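A minimal sketch of this idea, assuming an RBF kernel and plain full-batch gradient descent on α; the kernel width, regularization strength, learning rate and step count are illustrative choices:

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_logistic_regression(X, y, sigma=1.0, lam=1e-3, lr=0.1, steps=5000):
    """Gradient descent on alpha for f(x) = sum_j alpha_j k(x_j, x); y in {-1, +1}."""
    K = rbf_kernel_matrix(X, X, sigma)
    alpha = np.zeros(len(X))
    for _ in range(steps):
        f = K @ alpha                          # current scores on the training points
        p = 1.0 / (1.0 + np.exp(y * f))        # = sigmoid(-y * f)
        grad = K @ (-y * p + lam * alpha)      # grad of logistic loss + (lam/2) alpha^T K alpha
        alpha -= lr * grad
    return alpha

def predict(X_train, alpha, X_new, sigma=1.0):
    return np.sign(rbf_kernel_matrix(X_new, X_train, sigma) @ alpha)

# Tiny demo on an XOR-like problem that is not linearly separable in the input space.
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([-1., -1., +1., +1.])
alpha = kernel_logistic_regression(X, y, sigma=0.7)
print(predict(X, alpha, X, sigma=0.7))         # should print [-1. -1.  1.  1.]
```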

A few results

Results: regression of sinc(x) = sin(πx) / (πx)

• RBF kernel (several hyperparameter settings)
• polynomial kernel (several hyperparameter settings)

Summary


Key SVM Ideas

• Maximize the margin between + and – examples
  • connects to PAC theory
• Sparse: only the support vectors contribute to the solution
• Penalize errors in the non-separable case
• Kernels map examples into a new, usually nonlinear space
  • Implicitly do dot products in this new space (in the "dual" form of the SVM program)
• Kernels are separate from SVMs… but they combine very nicely with SVMs

taken from Andrew W. Moore


Summary I

Advantages
• Systematic implementation through quadratic programming ⇒ very efficient implementations
• Excellent data-dependent generalization bounds
• Regularization built into the cost function
• Statistical performance is independent of the dimension of the feature space
• Theoretically related to the widely studied fields of regularization theory and sparse approximation
• Fully adaptive procedures available for determining the hyper-parameters

taken from Andrew W. Moore


Summary II

Drawbacks
• Treatment of the non-separable case is somewhat heuristic
• The number of support vectors may depend strongly on the kernel type and the hyper-parameters
• Systematic choice of kernels is difficult (prior information)
  • … some ideas exist
• Optimization may require clever heuristics for large problems

taken from Andrew W. Moore


What You Should Know

• Definition of a maximum margin classifier
• Sparse version: (Linear) SVMs
• What QP can do for you (but you don't need to know how it does it)
• How Maximum Margin = a QP problem
• How to deal with noisy (non-separable) data
  • Slack variables
• How to permit "non-linear boundaries"
  • Kernel trick
• How SVM kernel functions permit us to pretend we're working with a zillion features

taken from Andrew W. Moore


Thanks for your attention!


Attic


Difference between SVMs and Logistic Regression

• Loss function: hinge loss (SVMs) vs. log-loss (Logistic Regression)
• High dimensional features with kernels: yes (SVMs) vs. no (Logistic Regression)
• Solution sparse: often yes (SVMs) vs. almost always no (Logistic Regression)
• Semantics of output: a "margin" (SVMs) vs. real probabilities (Logistic Regression)