Transcript of 10-701/15-781 Recitation: Kernels (KernelTheory.pdf)

Page 1:

10-701/15-781 Recitation : Kernels

Manojit Nandi

February 27, 2014

Page 2:

Outline

Mathematical Theory
  Banach Spaces and Hilbert Spaces
  Kernels
  Commonly Used Kernels

Kernel Theory
  One Weird Kernel Trick
  Representer Theorem

Do You Want to Build a ~~Snowman~~ Kernel? It Doesn't Have to Be a ~~Snowman~~ Kernel.
  Operations that Preserve/Create Kernels

Page 3:

10-701/15-781 Kernel Theory Recitation

Current Research in Kernel Theory
  Flaxman Test
  Fastfood

References

Manojit Nandi | Carnegie Mellon University — Machine Learning Department 3/36

Page 4:

Page 5:

IMPORTANT: The kernels used in Support Vector Machines are different from the kernels used in kernel regression.

To avoid confusion, I'll refer to the Support Vector Machine kernels as Mercer Kernels and the kernel regression kernels as Smoothing Kernels.

Page 6:

Recall: A square matrix A ∈ R^{n×n} is positive semi-definite if for all vectors u ∈ R^n, u^T A u ≥ 0.

For a kernel function K : X × X → R defined over a set X with |X| = n, the Gram matrix G is the n × n matrix with entries G_ij = K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩.
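The slides contain no code, but this definition is easy to check numerically. Below is a minimal pure-Python sketch (all names hypothetical, not from the slides) that builds the Gram matrix of a linear kernel and spot-checks symmetry and positive semi-definiteness:

```python
import random

def linear_kernel(x, z):
    # Inner product <x, z> for vectors in R^d
    return sum(xi * zi for xi, zi in zip(x, z))

def gram_matrix(kernel, points):
    # G_ij = K(x_i, x_j)
    return [[kernel(xi, xj) for xj in points] for xi in points]

def quad_form(G, u):
    # u^T G u
    n = len(u)
    return sum(u[i] * G[i][j] * u[j] for i in range(n) for j in range(n))

points = [(1.0, 2.0), (0.0, -1.0), (3.0, 0.5)]
G = gram_matrix(linear_kernel, points)

# Symmetry: G_ij == G_ji
assert all(G[i][j] == G[j][i] for i in range(3) for j in range(3))

# PSD: u^T G u >= 0 for many random u (a spot check, not a proof)
random.seed(0)
for _ in range(1000):
    u = [random.uniform(-1, 1) for _ in range(3)]
    assert quad_form(G, u) >= -1e-12
```

Note the spot check only samples random u; it cannot prove positive semi-definiteness, but it will catch most invalid kernels.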

Page 7:

Mathematical Theory

Page 8:

Banach Space and Hilbert Spaces

A Banach Space is a complete vector space V equipped with a norm ‖·‖.

Examples:

• ℓ_p^m spaces: R^m equipped with the norm ‖x‖ = (∑_{i=1}^m |x_i|^p)^{1/p}

• Function spaces: ‖f‖ := (∫_X |f(x)|^p dx)^{1/p}
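As a quick illustration of the ℓ_p norm formula above, here is a small pure-Python sketch (hypothetical, not part of the original slides):

```python
def lp_norm(x, p):
    # ||x|| = (sum_i |x_i|^p)^(1/p)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3.0, 4.0]
assert lp_norm(x, 2) == 5.0   # Euclidean (p=2) norm of a 3-4-5 triangle
assert lp_norm(x, 1) == 7.0   # Manhattan (p=1) norm
```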

Page 9:

A Hilbert Space is a complete vector space V equipped with an inner product ⟨·, ·⟩. The inner product induces a norm (‖x‖ = √⟨x, x⟩), so Hilbert Spaces are a special case of Banach Spaces.

Example: Euclidean space with the standard inner product: for x, y ∈ R^n, ⟨x, y⟩ = ∑_{i=1}^n x_i y_i = x^T y.

Because kernel functions are inner products in some higher-dimensional space, each kernel function corresponds to some Hilbert Space H.

Page 10:

Kernels

A Mercer Kernel is a function K defined over a set X such that K : X × X → R. The Mercer Kernel represents an inner product in some higher-dimensional space.

A Mercer Kernel is symmetric and positive semi-definite. By Mercer's theorem, any symmetric positive semi-definite kernel represents the inner product in some higher-dimensional Hilbert Space H.

Symmetric and positive semi-definite ⇔ kernel function ⇔ K(x, x′) = ⟨φ(x), φ(x′)⟩ for some φ(·).

Manojit Nandi | Carnegie Mellon University — Machine Learning Department 10/36

Page 11: 10-701/15-781Recitation: Kernelsaarti/Class/10701_Spring14/KernelTheory.pdfJust because it is a valid kernel does not mean it is a good kernel ManojitNandi | CarnegieMellonUniversity—MachineLearningDepartment28/36

10-701/15-781 Kernel Theory Recitation

Why Symmetric?
Inner products are symmetric by definition, so if the kernel function represents an inner product in some Hilbert Space, the kernel function must be symmetric as well:
K(x, z) = ⟨φ(x), φ(z)⟩ = ⟨φ(z), φ(x)⟩ = K(z, x).

Page 12:

Why Positive Semi-definite?
To be positive semi-definite, the Gram matrix G ∈ R^{n×n} must satisfy u^T G u ≥ 0 for all u ∈ R^n. Using functional analysis, one can show that u^T G u corresponds to ⟨h_u, h_u⟩ for some h_u in some Hilbert Space H. Because ⟨h_u, h_u⟩ ≥ 0 by the definition of an inner product, and ⟨h_u, h_u⟩ = u^T G u, it follows that u^T G u ≥ 0.

Page 13:

Why are Kernels useful?
Let x = (x₁, x₂) and z = (z₁, z₂). Then

K(x, z) = ⟨x, z⟩^2 = (x₁z₁ + x₂z₂)^2 = x₁²z₁² + 2x₁z₁x₂z₂ + x₂²z₂²
= ⟨(x₁², √2·x₁x₂, x₂²), (z₁², √2·z₁z₂, z₂²)⟩ = ⟨φ(x), φ(z)⟩,

where φ(x) = (x₁², √2·x₁x₂, x₂²).

What if x = (x₁, x₂, x₃, x₄), z = (z₁, z₂, z₃, z₄), and K(x, z) = ⟨x, z⟩^50? In this case, φ(x) and φ(z) are 23426-dimensional* vectors.

*: From combinatorics, 50 unlabeled balls into 4 labeled boxes = C(53, 3).

Page 14:

We can either:

1. Calculate ⟨x, z⟩ (the inner product of two four-dimensional vectors) and raise this value (a scalar) to the 50th power.

2. OR map x and z to φ(x) and φ(z), and calculate the inner product of two 23426-dimensional vectors.

Which is faster?
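The first option, clearly. As a sanity check of the d = 2 identity above and of the C(53, 3) count, here is a small Python sketch (names hypothetical, not from the slides):

```python
import math

def poly_kernel(x, z, d):
    # K(x, z) = <x, z>^d, computed without any feature map
    return sum(xi * zi for xi, zi in zip(x, z)) ** d

def phi(x):
    # Explicit feature map for <x, z>^2 with x in R^2
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
lhs = poly_kernel(x, z, 2)
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))
assert abs(lhs - rhs) < 1e-9   # kernel value equals the feature-space inner product

# Feature-space dimension for <x, z>^50 with x in R^4:
# 50 unlabeled balls into 4 labeled boxes = C(53, 3)
assert math.comb(53, 3) == 23426
```

Option 1 stays O(n) in the input dimension no matter how large d gets, while the feature map grows combinatorially.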

Page 15:

Commonly Used Kernels

Some commonly used Kernels are:

• Linear Kernel: K(x, z) = ⟨x, z⟩

• Polynomial Kernel: K(x, z) = ⟨x, z⟩^d, where d is the degree of the polynomial

• Gaussian RBF: K(x, z) = exp(−‖x − z‖² / (2σ²)) = exp(−γ‖x − z‖²)

• Sigmoid Kernel: K(x, z) = tanh(αx^T z + β)
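These four kernels can be written directly as functions. A minimal Python sketch (names and parameter defaults hypothetical):

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def linear(x, z):
    # K(x, z) = <x, z>
    return dot(x, z)

def polynomial(x, z, d=2):
    # K(x, z) = <x, z>^d
    return dot(x, z) ** d

def gaussian_rbf(x, z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def sigmoid(x, z, alpha=1.0, beta=0.0):
    # K(x, z) = tanh(alpha * x^T z + beta)
    return math.tanh(alpha * dot(x, z) + beta)

x, z = (1.0, 0.0), (0.0, 1.0)
assert linear(x, z) == 0.0
assert gaussian_rbf(x, x) == 1.0   # K(x, x) = exp(0) = 1
```

Note the two RBF parameterizations are related by γ = 1/(2σ²).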

Page 16:

Other Kernels:

1. Laplacian Kernel

2. ANOVA Kernel

3. Circular Kernel

4. Spherical Kernel

5. Wave Kernel

6. Power Kernel

7. Log Kernel

8. B-Spline Kernel

9. Bessel Kernel

10. Cauchy Kernel

11. χ2 Kernel

12. Wavelet Kernel

13. Bayesian Kernel

14. Histogram Kernel

As you can see, there are a lot of kernels.

Page 17:

The Gaussian RBF Kernel is infinite-dimensional
In class, Barnabas mentioned that the Gaussian RBF kernel corresponds to an infinite-dimensional vector space. From the Moore-Aronszajn theorem, we know there is a unique Hilbert space of functions for which the Gaussian RBF is a reproducing kernel.

One can show that this Hilbert space has an infinite basis, so the Gaussian RBF kernel corresponds to an infinite-dimensional vector space. A YouTube video of Abu-Mostafa explaining this in more detail is included in the references at the end of this presentation.

Page 18:

Kernel Theory

Page 19:

One Weird Kernel Trick
The kernel trick is one of the reasons kernel methods are so popular in machine learning. With the kernel trick, we can substitute any dot product with a kernel function: any linear dot product in the formulation of a machine learning algorithm can be replaced with a non-linear kernel function. The non-linear kernel function may let us find separation or structure in the data that the linear dot product could not.

Kernel Trick: ⟨x_i, x_j⟩ can be replaced with K(x_i, x_j) for any valid kernel function K. Furthermore, we can swap any kernel function with any other kernel function.
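As one concrete instance of the trick, kernel ridge regression replaces the matrix of dot products ⟨x_i, x_j⟩ with a kernel Gram matrix. A self-contained sketch (all names hypothetical, not from the slides); note the prediction has exactly the form f(x) = ∑_i α_i K(x_i, x):

```python
import math

def rbf(x, z, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def solve(A, b):
    # Gaussian elimination with partial pivoting (small systems only)
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def kernel_ridge_fit(X, y, kernel, lam=0.1):
    # alpha = (G + lam * I)^(-1) y, where G_ij = K(x_i, x_j)
    n = len(X)
    G = [[kernel(X[i], X[j]) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(G, y)

def predict(X, alpha, kernel, x_new):
    # f(x) = sum_i alpha_i K(x_i, x)  -- the Representer Theorem form
    return sum(a * kernel(xi, x_new) for a, xi in zip(alpha, X))

X = [(0.0,), (1.0,), (2.0,), (3.0,)]
y = [0.0, 1.0, 4.0, 9.0]   # samples of y = x^2, a non-linear target
alpha = kernel_ridge_fit(X, y, rbf, lam=1e-3)
```

With a small ridge term, the fit nearly interpolates the training data even though the target is non-linear, which no linear-dot-product model in one dimension could do.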

Page 20:

Some examples of algorithms that take advantage of the kernel trick:

• Support Vector Machines (Poster Child for Kernel Methods).

• Kernel Principal Component Analysis

• Kernel Independent Component Analysis

• Kernel K-Means algorithm

• Kernel Gaussian Process Regression

• Kernel Deep Learning

Page 21:

Representer Theorem
Theorem. Let X be a non-empty set and k a positive-definite real-valued kernel on X × X with corresponding reproducing kernel Hilbert space H_k. Given a training sample [(x₁, y₁), ..., (xₙ, yₙ)] ∈ X × R, a strictly monotonically increasing real-valued function g : [0, ∞) → R, and an arbitrary empirical loss function E : (X × R²)ⁿ → R ∪ {∞}, any f* satisfying

f* = argmin_{f∈H_k} { E[(x₁, y₁, f(x₁)), ..., (xₙ, yₙ, f(xₙ))] + g(‖f‖) }

can be represented in the form

f*(·) = ∑_{i=1}^n α_i k(·, x_i),

where α_i ∈ R for all i.

Page 22:

Example: Let H_k be some RKHS function space with kernel k(·, ·). Let [(x₁, y₁), ..., (xₙ, yₙ)] be a training input-output set; our task is to find the f* that minimizes the following regularized empirical loss:

f̂ = argmin_{f∈H_k} (∏_{i=1}^n |f(x_i)|^6) ∑_{i=1}^n [ |sin(‖x_i‖y_i − f(x_i))|^25 + y_i|f(x_i)|^249 ] + exp(‖f‖_F)

Page 23:

f̂ = argmin_{f∈H_k} (∏_{i=1}^n |f(x_i)|^6) ∑_{i=1}^n [ |sin(‖x_i‖y_i − f(x_i))|^25 + y_i|f(x_i)|^249 ] + exp(‖f‖_F)

exp(·) is a strictly monotonically increasing function, so exp(‖f‖_F) fits the g(‖f‖) part.

(∏_{i=1}^n |f(x_i)|^6) ∑_{i=1}^n [ |sin(‖x_i‖y_i − f(x_i))|^25 + y_i|f(x_i)|^249 ] is some arbitrary empirical loss function of the (x_i, y_i, f(x_i)). So this optimization can be expressed as

f̂ = argmin_{f∈H_k} { E[(x₁, y₁, f(x₁)), ..., (xₙ, yₙ, f(xₙ))] + g(‖f‖) }

Therefore, by the Representer Theorem, f*(·) = ∑_{i=1}^n α_i K(x_i, ·).

Page 24:

Do You Want to Build a ~~Snowman~~ Kernel? It Doesn't Have to Be a ~~Snowman~~ Kernel.

Page 25:

Operations that Preserve/Create Kernels
Let k₁, ..., k_m be valid kernel functions defined over some set X with |X| = n, and let α₁, ..., α_m be non-negative coefficients. The following are valid kernels:

• K(x, z) = ∑_{i=1}^m α_i k_i(x, z) (closed under non-negative linear combination)

• K(x, z) = ∏_{i=1}^m k_i(x, z) (closed under multiplication)

• K(x, z) = k₁(f(x), f(z)) for any function f : X → X

• K(x, z) = g(x)g(z) for any function g : X → R

• K(x, z) = x^T A^T A z for any matrix A ∈ R^{m×n}
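A numeric spot check of three of these closure rules on a two-point set, using the fact that a symmetric 2×2 matrix is PSD iff its trace and determinant are non-negative (a hypothetical sketch, not from the slides):

```python
import math

def is_psd_2x2(G, tol=1e-12):
    # A symmetric 2x2 matrix is PSD iff trace >= 0 and det >= 0
    trace = G[0][0] + G[1][1]
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    return trace >= -tol and det >= -tol

def gram(kernel, pts):
    return [[kernel(a, b) for b in pts] for a in pts]

k1 = lambda x, z: sum(a * b for a, b in zip(x, z))                  # linear
k2 = lambda x, z: math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)))  # RBF

pts = [(1.0, 2.0), (-0.5, 3.0)]

ksum  = lambda x, z: 2.0 * k1(x, z) + 0.5 * k2(x, z)  # non-negative combination
kprod = lambda x, z: k1(x, z) * k2(x, z)              # product of kernels
g     = lambda x: x[0] - x[1]
kg    = lambda x, z: g(x) * g(z)                      # K(x, z) = g(x)g(z)

for k in (ksum, kprod, kg):
    assert is_psd_2x2(gram(k, pts))
```

Again, a two-point check is evidence rather than proof; the proofs on the following slides are what actually establish validity.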

Page 26:

Proof: K(x, z) = ∑_{i=1}^m α_i k_i(x, z) is a valid kernel

Symmetry: K(z, x) = ∑_{i=1}^m α_i k_i(z, x). Because each k_i is a kernel function, it is symmetric, so k_i(z, x) = k_i(x, z). Therefore,

K(z, x) = ∑_{i=1}^m α_i k_i(z, x) = ∑_{i=1}^m α_i k_i(x, z) = K(x, z).

Positive semi-definiteness: Let u ∈ R^n be arbitrary. The Gram matrix of K, denoted G, has entries G_{jl} = K(x_j, x_l) = ∑_{i=1}^m α_i k_i(x_j, x_l), so G = α₁G₁ + ... + α_mG_m. Now

u^T G u = u^T (α₁G₁ + ... + α_mG_m) u = α₁u^T G₁ u + ... + α_m u^T G_m u = ∑_{i=1}^m α_i u^T G_i u.

Each u^T G_i u ≥ 0 and each α_i ≥ 0, so α_i u^T G_i u ≥ 0 and the sum is non-negative.

Page 27:

Proof: K(x, z) = x^T A^T A z for any matrix A ∈ R^{m×n} is a valid kernel.

For this proof, we show that K(x, z) is an inner product on some Hilbert space. Let φ(x) = Ax. Then ⟨φ(x), φ(z)⟩ = φ(x)^T φ(z) = (Ax)^T (Az) = x^T A^T A z = K(x, z).

Therefore, K(x, z) is an inner product on some Hilbert space.
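The proof amounts to taking φ(x) = Ax. A small numeric check of the identity (hypothetical names, not from the slides):

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

A = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0]]   # A in R^{2x3}
x = [1.0, 0.0, 2.0]
z = [0.0, 1.0, 1.0]

# K(x, z) computed as <phi(x), phi(z)> with phi(x) = Ax ...
lhs = dot(matvec(A, x), matvec(A, z))

# ... and directly as x^T (A^T A) z
AtA = [[dot([A[r][i] for r in range(2)], [A[r][j] for r in range(2)])
        for j in range(3)] for i in range(3)]
rhs = dot(x, matvec(AtA, z))

assert abs(lhs - rhs) < 1e-12
```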

Page 28:

Just because it is a valid kernel does not mean it is a good kernel

Page 29:

Page 30:

Some Exercises
Prove the following are not valid kernels:

1. K(x, z) = k₁(x, z) − k₂(x, z) for valid kernels k₁, k₂

2. K(x, z) = exp(γ‖x − z‖²) for some γ > 0
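For exercise 2, a hint in code: on just two distinct points, the Gram matrix of K(x, z) = exp(γ‖x − z‖²) has off-diagonal entries larger than its diagonal, so its determinant is negative and it cannot be PSD (a hypothetical sketch, not from the slides):

```python
import math

def bad_kernel(x, z, gamma=1.0):
    # Note the +gamma: this is the *invalid* kernel from exercise 2
    return math.exp(gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

x1, x2 = (0.0,), (2.0,)
G = [[bad_kernel(x1, x1), bad_kernel(x1, x2)],
     [bad_kernel(x2, x1), bad_kernel(x2, x2)]]

# det(G) = 1 - exp(2 * gamma * ||x1 - x2||^2) < 0,
# so G has a negative eigenvalue and is not PSD
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
assert det < 0
```

Turning this counterexample into a proof is the exercise.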

Page 31:

Current Research in Kernel Theory

Page 32:

Flaxman Test

Correlates of homicide: New space/time interaction tests for spatiotemporal point processes

Goal: Develop a statistical test for the strength of interaction between time and space for different types of homicides.

Result: Using reproducing kernel Hilbert spaces, one can project the space-time data into a higher-dimensional space and test for a significant interaction of a kernelized distance of space and time using the Hilbert-Schmidt Independence Criterion.

Page 33:

Fastfood
Fastfood - Approximating Kernel Expansions in Loglinear Time

Problem: Kernel methods do not scale well to large datasets, especially at prediction time, because the algorithm has to compute the kernel distance between the new vector and all of the data in the training set.

Solution: Using advanced mathematical black magic, dense random Gaussian matrices can be approximated using Hadamard matrices and diagonal Gaussian matrices:

V = (1/(σ√d)) S H G Π H B

Page 34:

where Π ∈ {0, 1}^{d×d} is a permutation matrix, H is a Hadamard matrix, S is a random scaling matrix with elements along the diagonal, B has random ±1 entries along the diagonal, and G has random Gaussian entries.

This algorithm performs 100x faster than the previous leading algorithm (RandomKitchen Sinks) and uses 1000x less memory.
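Part of that speedup comes from never materializing H: multiplying by a Hadamard matrix can be done in O(d log d) time with the fast Walsh-Hadamard transform. A pure-Python sketch of that one ingredient (hypothetical names; this is not the Fastfood implementation):

```python
def fwht(v):
    # Fast Walsh-Hadamard transform, O(d log d); len(v) must be a power of 2
    v = list(v)
    h = 1
    d = len(v)
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v

def hadamard(d):
    # Dense Sylvester Hadamard matrix: O(d^2) memory, which Fastfood avoids
    if d == 1:
        return [[1]]
    H = hadamard(d // 2)
    return [row + row for row in H] + [row + [-x for x in row] for row in H]

v = [1.0, -2.0, 0.5, 3.0]
dense = [sum(h * x for h, x in zip(row, v)) for row in hadamard(4)]
assert fwht(v) == dense   # same result, without storing H
```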

Page 35:

References

Representer Theorem and Kernel Methods
Predicting Structured Data by Alex Smola
Kernel Machines
String and Tree Kernels
Why Gaussian Kernel is Infinite

Page 36:

Questions?