Introduction to Kernel Methods
ML Workshop, ISI Kolkata

Chiranjib Bhattacharyya
Machine Learning Lab, Dept of CSA, IISc
chiru@csa.iisc.ernet.in
http://drona.csa.iisc.ernet.in/~chiru

19th Oct, 2012
Introduction
- Kernel methods make Machine Learning more applicable.
- Kernels are similarity measures.
- Kernels can help integrate different sources of data.
Agenda
1 Kernel Trick: SVM and Non-linear Classification
2 Definition of Kernel Functions
3 Kernels and Hilbert Spaces: RKHS, the Representer Theorem, etc.
PART 1: KERNEL TRICK
Binary classification
Classifier
$f : \mathcal{X} \to \{-1, 1\}$, $\quad f(x) = \mathrm{sign}(w^\top x + b)$

Data: $D = \{ (x_i, y_i) \mid i = 1, \dots, m \}$, $\ x_i \in \mathcal{X}$, $\ y_i \in \{-1, 1\}$

Find $f$ from $D$.
Review of C-SVM
$$\min_{w,b}\; C \sum_{i=1}^{m} \max\big(1 - y_i(w^\top x_i + b),\, 0\big) + \frac{1}{2}\|w\|^2$$

C-SVM dual formulation

$$\begin{aligned}
\underset{\alpha}{\text{maximize}}\quad & -\frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j + \sum_{i=1}^{m}\alpha_i \\
\text{subject to}\quad & 0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0
\end{aligned}$$

At optimality $w = \sum_{i=1}^{m} \alpha_i y_i x_i$, so

$$f(x) = \mathrm{sign}\Big(\sum_{i=1}^{m} \alpha_i y_i\, x_i^\top x + b\Big)$$
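The optimality condition $w = \sum_{i=1}^{m} \alpha_i y_i x_i$ can be checked numerically. A minimal sketch, assuming scikit-learn (whose SVC stores $\alpha_i y_i$ for the support vectors in dual_coef_; all other $\alpha_i$ are zero at optimality):

```python
# A minimal sketch (assuming scikit-learn): verify w = sum_i alpha_i y_i x_i
# for a linear C-SVM fitted on separable toy data.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only.
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))   # True: w is a weighted sum of support vectors
```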
C-SVM in feature spaces
Let us work with a feature map $\Phi(x)$.

$$\begin{aligned}
\underset{\alpha}{\text{maximize}}\quad & -\frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, \Phi(x_i)^\top \Phi(x_j) + \sum_{i=1}^{m}\alpha_i \\
\text{subject to}\quad & 0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0
\end{aligned}$$

and our classifier is

$$f(x) = \mathrm{sign}\Big(\sum_{i=1}^{m} \alpha_i y_i\, \Phi(x_i)^\top \Phi(x) + b\Big)$$

Let the dot product between any pair of examples, computed in the feature space, be denoted by

$$K(x, z) = \Phi(x)^\top \Phi(z)$$
C-SVM in feature spaces
Substituting $K$ for every feature-space dot product:

$$\begin{aligned}
\underset{\alpha}{\text{maximize}}\quad & -\frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, K(x_i, x_j) + \sum_{i=1}^{m}\alpha_i \\
\text{subject to}\quad & 0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0
\end{aligned}$$

and our classifier is

$$f(x) = \mathrm{sign}\Big(\sum_{i=1}^{m} \alpha_i y_i\, K(x_i, x) + b\Big)$$

Both the dual and the classifier access the data only through $K(x, z) = \Phi(x)^\top \Phi(z)$.
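Since the data enter only through the Gram matrix, the SVM can be trained without ever materializing $\Phi$. A minimal sketch, assuming scikit-learn (its precomputed-kernel mode takes the train Gram matrix at fit time and the test-vs-train kernel values at prediction time):

```python
# A minimal sketch (assuming scikit-learn): solve the C-SVM dual from a
# precomputed Gram matrix K alone, never forming Phi explicitly.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)
X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(rbf_kernel(X_tr, X_tr, gamma=1.0), y_tr)       # train on K(x_i, x_j)

# Prediction needs only K(x, x_i) between test and training points.
acc = clf.score(rbf_kernel(X_te, X_tr, gamma=1.0), y_te)
print(f"test accuracy: {acc:.2f}")
```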
An example
Let $x \in \mathbb{R}^2$ and $\Phi(x) = [x_1^2 \;\; x_2^2 \;\; \sqrt{2}\, x_1 x_2]^\top$. Then

$$K(x, z) = \Phi(x)^\top \Phi(z) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = \langle x, z \rangle^2$$

In general, $K(x, z) = (x^\top z)^r$ is a dot product in a $\binom{d+r-1}{r}$-dimensional feature space for $x, z \in \mathbb{R}^d$.

If $d = 256$ and $r = 4$, the feature space has dimension $\binom{259}{4} = 183{,}181{,}376$.

However, if we know $K$, one can still solve the SVM formulation without explicitly evaluating $\Phi$.
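A minimal numeric check of the $d = 2$, $r = 2$ example above, assuming numpy:

```python
# A minimal sketch: the explicit 3-d feature map Phi reproduces the
# polynomial kernel <x, z>^2 evaluated directly in the 2-d input space.
import numpy as np

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(z)        # dot product in feature space
rhs = (x @ z) ** 2           # kernel in input space
print(np.isclose(lhs, rhs))  # True
```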
Kernel function
$K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel function if

- $K(x, z) = K(z, x)$ (symmetric)
- $K$ is positive semidefinite, i.e. for all $n$ and all $x_1, \dots, x_n \in \mathcal{X}$, the matrix $K_{ij} = K(x_i, x_j)$ is psd.

Recall that a symmetric matrix $K \in \mathbb{R}^{d \times d}$ is psd if $u^\top K u \ge 0$ for all $u \in \mathbb{R}^d$.
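A minimal sketch of the definition in practice, assuming numpy: form the Gram matrix of the Gaussian kernel (shown to be valid shortly) on arbitrary points and check both conditions:

```python
# A minimal sketch: check symmetry and positive semidefiniteness of a
# Gram matrix K_ij = K(x_i, x_j) built from the Gaussian kernel.
import numpy as np

def gaussian_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # 50 arbitrary points in R^3
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])

print(np.allclose(K, K.T))                      # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)    # all eigenvalues >= 0 (up to round-off)
```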
Examples of Kernel functions

$K(x, z) = \Phi(x)^\top \Phi(z)$, where $\Phi : \mathcal{E} \to \mathbb{R}^d$, is always a kernel function:

- $K$ is symmetric, i.e. $K(x, z) = K(z, x)$.
- Positive semidefinite: let $D = \{x_1, x_2, \dots, x_n\}$ be a set of $n$ arbitrarily chosen elements of $\mathcal{E}$, and define $K_{ij} = \Phi(x_i)^\top \Phi(x_j)$. For any $u \in \mathbb{R}^n$ it is straightforward to see that

$$u^\top K u = \|\Phi(D)\, u\|_2^2 \ge 0, \qquad \Phi(D) = [\Phi(x_1), \dots, \Phi(x_n)]$$
Examples of Kernel functions
- Linear: $K(x, z) = x^\top z$, with $\Phi(x) = x$
- Polynomial: $K(x, z) = (x^\top z)^r$, with $\Phi_{t_1 t_2 \dots t_d}(x) = \sqrt{\dfrac{r!}{t_1!\, t_2! \cdots t_d!}}\; x_1^{t_1} x_2^{t_2} \cdots x_d^{t_d}$, $\quad \sum_{i=1}^{d} t_i = r$
- Gaussian: $K(x, z) = e^{-\gamma \|x - z\|^2}$
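All three are available off the shelf. A minimal sketch, assuming scikit-learn's pairwise kernel helpers, checking that each Gram matrix is psd:

```python
# A minimal sketch (assuming scikit-learn): evaluate the three kernels
# above on the same data and confirm each Gram matrix is psd.
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))

K_lin = linear_kernel(X)                                        # x^T z
K_poly = polynomial_kernel(X, degree=3, gamma=1.0, coef0=0.0)   # (x^T z)^3
K_rbf = rbf_kernel(X, gamma=0.5)                                # exp(-gamma ||x - z||^2)

for K in (K_lin, K_poly, K_rbf):
    print(np.linalg.eigvalsh(K).min() >= -1e-10)                # True for all three
```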
Kernel Construction
$K(x, y) = \Phi(x)^\top \Phi(y)$ is a kernel for any feature map $\Phi$. Let $K_1$ and $K_2$ be two valid kernels. Then the following are also valid kernels:

- Product: $K(u, v) = K_1(u, v)\, K_2(u, v)$
- Conic combination: $K = \alpha K_1 + \beta K_2$, $\quad \alpha, \beta \ge 0$
- Normalization: $\tilde{K}(x, y) = \dfrac{K(x, y)}{\sqrt{K(x, x)}\, \sqrt{K(y, y)}}$

These rules build up the Gaussian kernel step by step:

- $K(x, y) = x^\top y$ is a kernel
- $K(x, y) = (x^\top y)^i$ is a kernel (repeated products)
- $K(x, y) = \lim_{N \to \infty} \sum_{i=0}^{N} \dfrac{(x^\top y)^i}{i!} = e^{x^\top y}$ is a kernel (a limit of conic combinations)
- Normalizing $e^{x^\top y}$ gives $K(x, y) = e^{-\frac{1}{2}\|x - y\|^2}$, as the sketch below verifies numerically
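A minimal numeric check of that last step, assuming numpy:

```python
# A minimal sketch: normalizing K(x, y) = exp(x^T y) yields the Gaussian
# kernel exp(-||x - y||^2 / 2).
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)

K = lambda a, b: np.exp(a @ b)            # exp kernel, the limit above
normalized = K(x, y) / (np.sqrt(K(x, x)) * np.sqrt(K(y, y)))
gaussian = np.exp(-0.5 * np.sum((x - y) ** 2))
print(np.isclose(normalized, gaussian))   # True
```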
Kernel function and feature map
A theorem due to Mercer guarantees a feature map for symmetric, psd kernel functions. Loosely stated: for a symmetric function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, there exists an expansion $K(x, z) = \Phi(x)^\top \Phi(z)$ iff

$$\int_{\mathcal{X}} \int_{\mathcal{X}} g(x)\, g(z)\, K(x, z)\, dx\, dz \ge 0$$

for all square-integrable functions $g$.
PART 2: Kernels and Hilbert spaces
What is a Dot Product (aka Inner Product)?

Let $\mathcal{X}$ be a vector space. A dot product $\langle \cdot, \cdot \rangle$ on $\mathcal{X}$ satisfies:

- Symmetry: $\langle u, v \rangle = \langle v, u \rangle$ for $u, v \in \mathcal{X}$
- Bilinearity: $\langle \alpha u + \beta v, w \rangle = \alpha \langle u, w \rangle + \beta \langle v, w \rangle$ for $u, v, w \in \mathcal{X}$
- Positive definiteness: $\langle u, u \rangle \ge 0$ for $u \in \mathcal{X}$, with $\langle u, u \rangle = 0$ iff $u = 0$

Norm

$$\|x\| = \sqrt{\langle x, x \rangle}, \qquad \|x\| = 0 \implies x = 0$$
Examples of Dot products
- $\mathcal{X} = \mathbb{R}^n$, $\langle u, v \rangle = u^\top v$
- $\mathcal{X} = \mathbb{R}^n$, $\langle u, v \rangle = \sum_{i=1}^{n} \lambda_i u_i v_i$ with $\lambda_i > 0$
- $\mathcal{X} = L_2(X) = \big\{ f : \int_X f(x)^2\, dx < \infty \big\}$, with $\langle f, g \rangle = \int_X f(x)\, g(x)\, dx$ for $f, g \in \mathcal{X}$
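A minimal sketch of the last two examples, assuming numpy (the weights, grid, and test functions are illustrative choices):

```python
# A minimal sketch: the weighted dot product on R^4, and a grid average
# approximating the L2 inner product on [0, 1].
import numpy as np

rng = np.random.default_rng(0)
lam = rng.uniform(0.1, 2.0, size=4)               # lambda_i > 0
u, v = rng.normal(size=4), rng.normal(size=4)
print(np.isclose(u @ (lam * v), v @ (lam * u)))   # symmetry
print(u @ (lam * u) >= 0)                         # positivity

xs = np.linspace(0.0, 1.0, 10_001)                # uniform grid on [0, 1]
f, g = np.sin(2 * np.pi * xs), np.cos(2 * np.pi * xs)
print(np.mean(f * g))                             # ~0: sin and cos are orthogonal in L2
```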
Cauchy-Schwarz Inequality

Let $\mathcal{X}$ be an inner product space. Then

$$|\langle x, y \rangle| \le \|x\|\, \|y\| \quad \forall\, x, y \in \mathcal{X}$$

and equality holds iff $x = \alpha y$ for some scalar $\alpha$.

Proof: For all $\alpha \in \mathbb{R}$, $\|x - \alpha y\|^2 \ge 0$, i.e.

$$\|x\|^2 - 2\alpha \langle x, y \rangle + \alpha^2 \|y\|^2 \ge 0 \quad \forall\, \alpha$$

The case $y = 0$ is trivial; otherwise set $\alpha = \frac{\langle x, y \rangle}{\|y\|^2}$, which yields $\|x\|^2 - \frac{\langle x, y \rangle^2}{\|y\|^2} \ge 0$, and the inequality follows by taking square roots. The claim about equality follows from the definition of the norm: equality forces $\|x - \alpha y\| = 0$, i.e. $x = \alpha y$.
Hilbert Space: Basic facts
Defn: An inner product space $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ is a Hilbert space if it is separable and complete. We will denote the norm by $\| \cdot \|_{\mathcal{H}}$. The orthogonal complement of a subspace $M \subset \mathcal{H}$ is defined as

$$M^{\perp} = \{ z \mid \langle x, z \rangle_{\mathcal{H}} = 0 \ \forall x \in M \}$$
Hilbert Space Projection Theorem

Let $M$ be a closed subspace of a Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$. For every $x \in \mathcal{H}$ the following hold:

- There exists a unique $\Pi_M(x) \in M$ such that $\Pi_M(x) = \operatorname{argmin}_{z \in M} \|x - z\|_{\mathcal{H}}$.
- $x - \Pi_M(x) \in M^{\perp}$, i.e. $\langle z,\, x - \Pi_M(x) \rangle_{\mathcal{H}} = 0$ for all $z \in M$.
- $\|x\|_{\mathcal{H}}^2 = \|\Pi_M(x)\|_{\mathcal{H}}^2 + \|y\|_{\mathcal{H}}^2$, where $x = \Pi_M(x) + y$ with $y \in M^{\perp}$.

A concrete finite-dimensional instance is sketched below.
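A minimal sketch, assuming numpy, with $\mathcal{H} = \mathbb{R}^5$ and $M$ the column span of a matrix $A$ (an illustrative choice); least squares computes $\Pi_M(x)$:

```python
# A minimal sketch of the projection theorem in R^5: project x onto
# M = span of A's columns and check the two stated properties.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))                   # M = column span of A
x = rng.normal(size=5)

coeffs, *_ = np.linalg.lstsq(A, x, rcond=None)
proj = A @ coeffs                             # Pi_M(x), the closest point in M
resid = x - proj                              # y = x - Pi_M(x)

print(np.allclose(A.T @ resid, 0))                        # resid lies in M_perp
print(np.isclose(x @ x, proj @ proj + resid @ resid))     # Pythagoras
```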
Reproducing Kernel Hilbert Space (RKHS)

Let $K$ be any kernel function. Consider the following set:

$$\mathcal{H} = \Big\{ f \,\Big|\, f(\cdot) = \sum_{i=1}^{m} \alpha_i K(\cdot, x_i), \ x_i \in \mathcal{X}, \ m \in \mathbb{N} \Big\}$$

Dot product

For any $f, g \in \mathcal{H}$,

$$f(\cdot) = \sum_{i=1}^{m_1} \alpha_i K(\cdot, x_i), \qquad g(\cdot) = \sum_{j=1}^{m_2} \beta_j K(\cdot, z_j)$$

$$\langle f, g \rangle_{\mathcal{H}} = \sum_{i=1}^{m_1} \sum_{j=1}^{m_2} \alpha_i \beta_j K(x_i, z_j)$$

Is it a dot product?
Reproducing Kernel Hilbert Space (RKHS)

- As $K$ is symmetric, $\langle f, g \rangle_{\mathcal{H}} = \langle g, f \rangle_{\mathcal{H}}$.
- Positive semidefinite:

$$\langle f, f \rangle_{\mathcal{H}} = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j K(x_i, x_j)$$

Recall that the Gram matrix $K_{ij} = K(x_i, x_j)$ is psd whenever $K$ is a kernel function, so $\langle f, f \rangle_{\mathcal{H}} = \alpha^\top K \alpha \ge 0$.

Reproducing property

For any $f \in \mathcal{H}$,

$$f(x) = \sum_{i=1}^{m} \alpha_i K(x, x_i) = \Big\langle \sum_{i=1}^{m} \alpha_i K(\cdot, x_i),\, K(\cdot, x) \Big\rangle = \langle f(\cdot), K(\cdot, x) \rangle$$

Applying the Cauchy-Schwarz inequality, $|f(x)| \le \sqrt{\langle f, f \rangle_{\mathcal{H}}}\, \sqrt{K(x, x)}$, so $|f(x)| = 0$ for every $x$ whenever $\langle f, f \rangle_{\mathcal{H}} = 0$. A numeric check of these properties follows below.
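A minimal sketch, assuming numpy and the Gaussian kernel: build $f = \sum_i \alpha_i K(\cdot, x_i)$, compute $\langle f, f \rangle_{\mathcal{H}} = \alpha^\top K \alpha$, and check the Cauchy-Schwarz bound:

```python
# A minimal sketch: the reproducing property gives f(x) = <f, K(., x)>,
# and Cauchy-Schwarz bounds |f(x)| by sqrt(<f, f>_H) * sqrt(K(x, x)).
import numpy as np

def K(a, b):
    return np.exp(-0.5 * np.sum((a - b) ** 2))   # Gaussian kernel

rng = np.random.default_rng(0)
Xs = rng.normal(size=(10, 2))                    # centers x_1, ..., x_m
alpha = rng.normal(size=10)

G = np.array([[K(a, b) for b in Xs] for a in Xs])
f_norm_sq = alpha @ G @ alpha                    # <f, f>_H = alpha^T K alpha

x = rng.normal(size=2)
f_x = sum(a_i * K(x, x_i) for a_i, x_i in zip(alpha, Xs))   # f(x)
print(abs(f_x) <= np.sqrt(f_norm_sq) * np.sqrt(K(x, x)))    # True
```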
Representer theorem
Let $K$ be a valid kernel defined on $\mathcal{X}$ and let $\mathcal{H}$ be the corresponding RKHS. Let $\Omega$ be an increasing function. The optimization problem

$$\min_{g \in \mathcal{H}}\; G(g) = \sum_{i=1}^{m} l(g(x_i), y_i) + \Omega(\|g\|_{\mathcal{H}}^2)$$

is solved by some $g^* = \sum_{i=1}^{m} \alpha_i K(\cdot, x_i)$.

Proof: Let $M = \big\{ \sum_{i=1}^{m} \alpha_i K(\cdot, x_i) \mid \alpha \in \mathbb{R}^m \big\}$. Clearly $M$ is a subspace of $\mathcal{H}$. Take any $g \in \mathcal{H}$ and write $g = g_M + g_{\perp}$ with $g_M = \Pi_M(g)$ and $g_{\perp} \in M^{\perp}$. By the reproducing property,

$$g(x_i) = \langle g, K(\cdot, x_i) \rangle = \langle g_M + g_{\perp}, K(\cdot, x_i) \rangle = \langle g_M, K(\cdot, x_i) \rangle + \langle g_{\perp}, K(\cdot, x_i) \rangle = \langle g_M, K(\cdot, x_i) \rangle = g_M(x_i)$$

so the loss terms are unchanged. As $\Omega$ is an increasing function and $\|g\|_{\mathcal{H}}^2 = \|g_M\|_{\mathcal{H}}^2 + \|g_{\perp}\|_{\mathcal{H}}^2$, we get $\Omega(\|g\|_{\mathcal{H}}^2) \ge \Omega(\|g_M\|_{\mathcal{H}}^2)$, hence $G(g_M) \le G(g)$ and a minimizer can always be taken in $M$.
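A minimal sketch of the theorem in action, assuming numpy: for squared loss and $\Omega(t) = \lambda t$ (kernel ridge regression, a standard instance not spelled out in the slides), the coefficients have the closed form $\alpha = (K + \lambda I)^{-1} y$:

```python
# A minimal sketch: kernel ridge regression, whose minimizer has the
# form g* = sum_i alpha_i K(., x_i) promised by the representer theorem.
import numpy as np

def K(a, b):
    return np.exp(-0.5 * np.sum((a - b) ** 2))   # Gaussian kernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

lam = 0.1
G = np.array([[K(a, b) for b in X] for a in X])
alpha = np.linalg.solve(G + lam * np.eye(30), y)             # (K + lambda I)^-1 y

g_star = lambda x: sum(a_i * K(x, x_i) for a_i, x_i in zip(alpha, X))
print(g_star(np.array([0.5])), np.sin(0.5))                  # fit vs. ground truth
```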
References

- B. Schölkopf, K. Tsuda, J.-P. Vert (eds.), Kernel Methods in Computational Biology, 2004.
- J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, 2004.
- B. Schölkopf and A. Smola, Learning with Kernels, 2002.