
Transcript of Slides Lecture6

Page 1: Slides Lecture6

KERNELS

Page 2: Slides Lecture6

MOTIVATION: KERNELS

Idea

I The SVM uses the scalar product 〈x, xi〉 as a measure of similarity between x and xi, and of distance to the hyperplane.

I Since the scalar product is linear, the SVM is a linear method.

I By using a nonlinear function instead, we can make the classifier nonlinear.

More precisely

I The scalar product can be regarded as a two-argument function 〈 . , . 〉 : Rd × Rd → R.

I We will replace this function with a function k : Rd × Rd → R and substitute k(x, x′) for every occurrence of 〈x, x′〉 in the SVM formulae.

I Under certain conditions on k, all optimization/classification results for the SVM still hold. Functions that satisfy these conditions are called kernel functions.


Page 3: Slides Lecture6

THE MOST POPULAR KERNEL

RBF Kernel

kRBF(x, x′) := exp( −‖x − x′‖₂² / (2σ²) )   for some σ ∈ R⁺

is called an RBF kernel (RBF = radial basis function). The parameter σ is called the bandwidth. Other names for kRBF: Gaussian kernel, squared-exponential kernel.

If we fix x′, the function kRBF( . , x′) is (up to scaling) a spherical Gaussian density on Rd, with mean x′ and standard deviation σ.
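As a concrete illustration (a sketch of my own, not part of the slide), the kernel can be written in a few lines of NumPy; the function name and the example points are arbitrary:

    import numpy as np

    def k_rbf(x, x_prime, sigma=1.0):
        # k_RBF(x, x') = exp( -||x - x'||^2 / (2 * sigma^2) )
        diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
        return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

    print(k_rbf([0.0, 0.0], [0.0, 0.0]))             # 1.0 for identical points
    print(k_rbf([0.0, 0.0], [1.0, 1.0], sigma=0.5))  # decays quickly with distance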

[Figure: two-dimensional, circularly symmetric Gaussian windows and the resulting Parzen-window density estimates for bandwidths h = 1, 0.5, 0.2 (reproduced from a textbook chapter on nonparametric techniques, Figs. 4.3–4.4); further panels show the kernel's function surface (d = 2) and its contours (d = 2) over axes x1, x2.]


Page 4: Slides Lecture6

SVM WITH RBF KERNEL

f(x) = sign( ∑_{i=1}^n yi αi* kRBF(xi, x) )

I Circled points are support vectors. The two contour lines running through support vectors are the nonlinear counterparts of the convex hulls.

I The thick black line is the classifier.

I Think of a Gaussian-shaped function kRBF( . , x′) centered at each support vector x′. These functions add up to a function surface over R2.

I The lines in the image are contour lines of this surface. The classifier runs along the bottom of the "valley" between the two classes.

I Smoothness of the contours is controlled by σ.
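A minimal sketch (my own illustration, not from the slide) of how this decision function could be evaluated once the support vectors xi, labels yi and dual coefficients αi* are available; all variable names below are assumptions for illustration:

    import numpy as np

    def k_rbf(x, x_prime, sigma=1.0):
        d = x - x_prime
        return np.exp(-np.dot(d, d) / (2.0 * sigma**2))

    def decision(x, support_vectors, labels, alphas, sigma=1.0):
        # f(x) = sign( sum_i y_i * alpha_i* * k_RBF(x_i, x) ), as on the slide
        s = sum(y * a * k_rbf(xi, x, sigma)
                for xi, y, a in zip(support_vectors, labels, alphas))
        return np.sign(s)

    # Toy example with one support vector per class:
    sv = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
    y, alpha = [+1, -1], [1.0, 1.0]
    print(decision(np.array([0.2, 0.1]), sv, y, alpha))   # +1: closer to the positive support vector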


Page 5: Slides Lecture6

CHOOSING A KERNEL

Theory
To define a kernel:

I We have to define a function of two arguments and prove that it is a kernel.

I This is done by checking a set of necessary and sufficient conditions known as “Mercer’s theorem”.

Practice
The data analyst does not define a kernel, but tries some well-known standard kernels until one seems to work. Most common choices:

I The RBF kernel.

I The "linear kernel" kSP(x, x′) = 〈x, x′〉, i.e. the standard, linear SVM.

Once a kernel is chosen

I The classifier can be trained by solving the optimization problem using standard software (a sketch with an off-the-shelf package follows below).

I SVM software packages include implementations of all common kernels.
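For example, with scikit-learn (one of many SVM packages; the synthetic dataset below is purely illustrative), switching between the linear kernel and the RBF kernel is a one-argument change:

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)      # RBF kernel
    linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)           # standard linear SVM

    print("RBF training accuracy:   ", rbf_svm.score(X, y))
    print("linear training accuracy:", linear_svm.score(X, y))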


Page 6: Slides Lecture6

WHICH FUNCTIONS WORK AS KERNELS?

Formal definition
A function k : Rd × Rd → R is called a kernel on Rd if there is some function φ : Rd → F into some space F with scalar product 〈 . , . 〉F such that

k(x, x′) = 〈 φ(x), φ(x′) 〉F   for all x, x′ ∈ Rd .

In other words

I k is a kernel if it can be interpreted as a scalar product on some other space.

I If we substitute k(x, x′) for 〈x, x′〉 in all SVM equations, we implicitly train a linear SVM on the space F.

I That is why the SVM still works: It still uses scalar products, just on another space.

The mapping φ

I To make this work, φ has to transform the data into data on which a linear SVM works.

I This is usually achieved by choosing F as a higher-dimensional space than Rd.

Page 7: Slides Lecture6

MAPPING INTO HIGHER DIMENSIONS

Example
How can a map into higher dimensions make the class boundary (more) linear? Consider

φ : R2 → R3   where   φ(x1, x2) := ( x1² , 2 x1 x2 , x2² )

[Figure: Nonlinear Transformation in Kernel Space (source: Machine Learning I, Joachim M. Buhmann, slide 137/196).]
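A small numerical sketch (my own, not from the slides) of why this map helps: points inside and outside a circle in R2 are not linearly separable, but since the first and third coordinates of φ(x) add up to x1² + x2², a plane in R3 separates the two classes exactly:

    import numpy as np

    def phi(x1, x2):
        # the map from the slide: (x1, x2) -> (x1^2, 2*x1*x2, x2^2)
        return np.array([x1**2, 2 * x1 * x2, x2**2])

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(500, 2))
    y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.0, 1, -1)        # inside vs. outside the unit circle

    Z = np.array([phi(a, b) for a, b in X])
    pred = np.where(Z[:, 0] + Z[:, 2] < 1.0, 1, -1)           # plane z1 + z3 = 1 in R^3
    print("linearly separable in R^3:", bool(np.all(pred == y)))   # True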


Page 8: Slides Lecture6

MAPPING INTO HIGHER DIMENSIONS

Problem
In the previous example: we have to know what the data looks like to choose φ!

Solution

I Choose a high dimension h for F.

I Choose the components φi of φ(x) = (φ1(x), . . . , φh(x)) as different nonlinear mappings.

I If two points differ in Rd, some of the nonlinear mappings will amplify the differences.

The RBF kernel is an extreme case

I The function kRBF can be shown to be a kernel, however:

I F is infinite-dimensional for this kernel.


Page 9: Slides Lecture6

DETERMINING WHETHER k IS A KERNEL

Mercer’s theorem
A mathematical result called Mercer’s theorem states that, if the function k is positive, i.e.

∫_{Rd×Rd} k(x, x′) f(x) f(x′) dx dx′ ≥ 0

for all functions f, then it can be written as

k(x, x′) = ∑_{j=1}^∞ λj φj(x) φj(x′) .

The φj are functions Rd → R and λj ≥ 0. This means the (possibly infinite) vector φ(x) = ( √λ1 φ1(x), √λ2 φ2(x), . . . ) is a feature map.

Kernel arithmetic
Various functions of kernels are again kernels: If k1 and k2 are kernels, then e.g.

k1 + k2 ,   k1 · k2 ,   const. · k1

are again kernels.
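These statements can be checked numerically on a finite sample (evidence only, not a proof; the sample and kernels below are arbitrary choices): the Gram matrix Kij = k(xi, xj) of a kernel is positive semidefinite, and so are the elementwise sum and product of two such Gram matrices:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 2))

    def gram(kernel, X):
        return np.array([[kernel(a, b) for b in X] for a in X])

    k_lin = lambda a, b: a @ b                                    # linear kernel <x, x'>
    k_rbf = lambda a, b: np.exp(-np.sum((a - b)**2) / 2.0)        # RBF kernel, sigma = 1

    for name, K in [("linear", gram(k_lin, X)),
                    ("RBF", gram(k_rbf, X)),
                    ("k1 + k2", gram(k_lin, X) + gram(k_rbf, X)),
                    ("k1 * k2", gram(k_lin, X) * gram(k_rbf, X))]:
        print(name, "smallest eigenvalue:", round(np.linalg.eigvalsh(K).min(), 8))   # >= 0 up to rounding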


Page 10: Slides Lecture6

THE KERNEL TRICK

Kernels in general

I Many linear machine learning and statistics algorithms can be "kernelized".

I The only conditions are:

1. The algorithm uses a scalar product.
2. In all relevant equations, the data (and all other elements of Rd) appear only inside a scalar product.

I This approach to making algorithms non-linear is known as the "kernel trick".
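To illustrate the trick on a method other than the SVM, here is a sketch of kernel ridge regression (my choice of example; the bandwidth, regularization weight and data are arbitrary): the training data enters only through the Gram matrix, and a new point only through k(xi, x), so any kernel can be plugged in.

    import numpy as np

    def fit_kernel_ridge(X, y, kernel, lam=1e-2):
        # dual ridge regression: alpha = (K + lam*I)^(-1) y, with K_ij = k(x_i, x_j)
        K = np.array([[kernel(a, b) for b in X] for a in X])
        return np.linalg.solve(K + lam * np.eye(len(X)), y)

    def predict(x, X, alpha, kernel):
        # the new point appears only inside the kernel
        return sum(a * kernel(xi, x) for a, xi in zip(alpha, X))

    rbf = lambda a, b: np.exp(-np.sum((a - b)**2) / (2 * 0.5**2))   # RBF kernel, sigma = 0.5

    X = np.linspace(0, 2 * np.pi, 40).reshape(-1, 1)
    y = np.sin(X).ravel()
    alpha = fit_kernel_ridge(X, y, rbf)
    print(round(predict(np.array([np.pi / 2]), X, alpha, rbf), 3))  # close to sin(pi/2) = 1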


Page 11: Slides Lecture6

SUMMARY: KERNEL SVM

Optimization problem

min_{vH, c}   ‖vH‖F² + γ ∑_{i=1}^n ξi²

s.t.   yi( k(vH, xi) − c ) ≥ 1 − ξi   and   ξi ≥ 0

Note: vH now lives in F.

Dual optimization problem

max_{α ∈ Rn}   W(α) := ∑_{i=1}^n αi − (1/2) ∑_{i,j=1}^n αi αj yi yj ( k(xi, xj) + (1/γ) I{i = j} )

s.t.   ∑_{i=1}^n yi αi = 0   and   αi ≥ 0

Classifier

f(x) = sgn( ∑_{i=1}^n yi αi* k(xi, x) − c )
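A sketch of the dual objective exactly as written above, with the (1/γ)·I{i = j} term folded into the kernel matrix; the kernel matrix and α below are placeholder values (in practice α comes from a QP solver):

    import numpy as np

    def dual_objective(alpha, y, K, gamma):
        # W(alpha) = sum_i alpha_i - 1/2 * sum_{i,j} alpha_i alpha_j y_i y_j (K_ij + (1/gamma) * 1{i=j})
        K_mod = K + np.eye(len(y)) / gamma
        return alpha.sum() - 0.5 * (alpha * y) @ K_mod @ (alpha * y)

    K = np.eye(4)                                    # toy kernel matrix
    y = np.array([1.0, 1.0, -1.0, -1.0])
    alpha = np.array([0.5, 0.3, 0.4, 0.4])           # satisfies sum_i y_i alpha_i = 0 and alpha_i >= 0
    print(dual_objective(alpha, y, K, gamma=1.0))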


Page 12: Slides Lecture6

SUMMARY: SVMS

Basic SVM

I Linear classifier for linearly separable data.

I The position of the affine hyperplane is determined by maximizing the margin.

I Maximizing the margin is a convex optimization problem.

Full-fledged SVM

Ingredient        Purpose
Maximum margin    Good generalization properties
Slack variables   Overlapping classes; robustness against outliers
Kernel            Nonlinear decision boundary

Use in practice

I Software packages (e.g. libsvm, SVMlight)

I Choose a kernel function (e.g. RBF)

I Cross-validate margin parameter γ and kernel parameters (e.g. bandwidth)
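A possible cross-validation setup with scikit-learn's grid search (note the naming: scikit-learn's C plays the role of the margin parameter γ on these slides, and its gamma is an inverse bandwidth, roughly 1/(2σ²); the parameter grids are arbitrary illustrative values):

    from sklearn.datasets import make_moons
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

    param_grid = {
        "C": [0.1, 1, 10, 100],        # margin/slack trade-off (gamma on the slides)
        "gamma": [0.01, 0.1, 1, 10],   # inverse RBF bandwidth
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)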


Page 13: Slides Lecture6

HISTORY

I Ca. 1957: Perceptron (Rosenblatt)

I 1970s: Vapnik and Chervonenkis develop learning theory

I 1986: Neural network renaissance (backpropagation algorithm by Rumelhart, Hinton, Williams)

I 1992: SVM (Boser, Guyon, Vapnik)

I 1997: Boosting (Freund and Schapire)
