Kernel Methods
Transcript of Kernel Methods
![Page 1: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/1.jpg)
Kernel Methods
Barnabás Póczos, University of Alberta
Oct 1, 2009
![Page 2: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/2.jpg)
2
Outline
• Quick Introduction
• Feature space
• Perceptron in the feature space
• Kernels
• Mercer's theorem
  • Finite domain
  • Arbitrary domain
• Kernel families
• Constructing new kernels from kernels
• Constructing feature maps from kernels
• Reproducing Kernel Hilbert Spaces (RKHS)
• The Representer Theorem
![Page 3: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/3.jpg)
3
Ralf Herbrich: Learning Kernel Classifiers Chapter 2
![Page 4: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/4.jpg)
Quick Overview
![Page 5: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/5.jpg)
5
Hard 1-dimensional Dataset
(Figure: the 1-D data around x = 0, with the positive and negative "planes" marked.)
taken from Andrew W. Moore, CMU + Nello Cristianini, Ron Meir, Ron Parr
• If the data set is not linearly separable, then by adding new features (mapping the data to a larger feature space) the data might become linearly separable
• m points in general position in an (m−1)-dimensional space are always linearly separable by a hyperplane ⇒ it is good to map the data to high-dimensional spaces
(For example, 4 points in 3-D)
![Page 6: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/6.jpg)
6
Hard 1-dimensional Dataset
Make up a new feature! Sort of… computed from original feature(s):
$z_k = (x_k, x_k^2)$
Separable! MAGIC!
Now drop this “augmented” data into our linear SVM.
taken from Andrew W. Moore, CMU + Nello Cristianini, Ron Meir, Ron Parr
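To make the "magic" concrete, here is a minimal numpy sketch (the data points are made up; the slide's exact dataset is only in the image): mapping each 1-D point to $z_k = (x_k, x_k^2)$ turns a non-separable labeling into a linearly separable one.

```python
import numpy as np

# Hypothetical version of the hard 1-D dataset: negatives near the origin,
# positives farther out; no threshold on x alone separates them.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

# The new feature: z_k = (x_k, x_k^2)
Z = np.column_stack([x, x ** 2])

# In the (x, x^2) plane the line x^2 = 1.5, i.e. w = (0, 1), b = -1.5,
# separates the classes, which no single threshold in 1-D can do.
w, b = np.array([0.0, 1.0]), -1.5
assert np.all(np.sign(Z @ w + b) == y)
print("separable in the feature space")
```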
![Page 7: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/7.jpg)
7
Feature mapping
• m points in general position in an (m−1)-dimensional space are always linearly separable by a hyperplane ⇒ it is good to map the data to high-dimensional spaces
• Given m training points, is it always enough to map the data into a feature space of dimension m−1?
• No... We have to think about the test data as well! Even if we don't know how many test points we will have...
• We might want to map our data to a huge (∞) dimensional feature space
• Overfitting? Generalization error? ... We don't care for now...
![Page 8: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/8.jpg)
8
Feature mapping, but how???
![Page 9: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/9.jpg)
9
Observation
Several algorithms use only the inner products of the features, not the feature values themselves!
E.g. Perceptron, SVM, Gaussian Processes...
![Page 10: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/10.jpg)
10
The Perceptron
![Page 11: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/11.jpg)
11
SVM (dual form):

Maximize
$$\sum_{k=1}^{R} \alpha_k \;-\; \frac{1}{2} \sum_{k=1}^{R} \sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl}
\qquad \text{where} \quad Q_{kl} = y_k y_l \,(\mathbf{x}_k \cdot \mathbf{x}_l)$$

Subject to these constraints:
$$0 \le \alpha_k \le C \quad \forall k, \qquad \sum_{k=1}^{R} \alpha_k y_k = 0$$
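As a sketch of how the Q matrix of this objective is assembled (the helper name is mine, not the slide's): replacing the plain dot product by a kernel k is exactly where the feature space will enter later.

```python
import numpy as np

# Build Q_kl = y_k * y_l * <x_k, x_l> for the dual SVM objective above.
# Passing a kernel k instead of the dot product moves the classifier into
# a feature space without ever computing the features explicitly.
def build_Q(X, y, k=lambda a, b: a @ b):
    R = len(X)
    return np.array([[y[i] * y[j] * k(X[i], X[j]) for j in range(R)]
                     for i in range(R)])
```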
![Page 12: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/12.jpg)
12
Inner products
So we need the inner product between the two mapped feature vectors.
Looks ugly, and needs lots of computation...
Can't we just say: let $k(x, x') = \langle \phi(x), \phi(x') \rangle$?
![Page 13: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/13.jpg)
13
Finite example
![Page 14: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/14.jpg)
14
Finite example
Lemma:
Proof:
![Page 15: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/15.jpg)
15
Finite example
Choose 7 2D points
Choose a kernel k
(Figure: the 7 points, labeled 1–7.)

G =
  1.0000  0.8131  0.9254  0.9369  0.9630  0.8987  0.9683
  0.8131  1.0000  0.8745  0.9312  0.9102  0.9837  0.9264
  0.9254  0.8745  1.0000  0.8806  0.9851  0.9286  0.9440
  0.9369  0.9312  0.8806  1.0000  0.9457  0.9714  0.9857
  0.9630  0.9102  0.9851  0.9457  1.0000  0.9653  0.9862
  0.8987  0.9837  0.9286  0.9714  0.9653  1.0000  0.9779
  0.9683  0.9264  0.9440  0.9857  0.9862  0.9779  1.0000
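The 7 points and the kernel behind this Gram matrix are only shown in the slide image, so here is a hypothetical reconstruction of the computation with made-up points and a Gaussian kernel; any such choice produces a symmetric PSD matrix like G above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(7, 2))            # 7 made-up 2-D points

def k(a, b, sigma=2.0):                           # a Gaussian kernel, say
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

G = np.array([[k(xi, xj) for xj in X] for xi in X])
assert np.allclose(G, G.T)                        # symmetric
assert np.linalg.eigvalsh(G).min() >= -1e-12      # positive semidefinite
```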
![Page 16: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/16.jpg)
16
[U,D] = svd(G),  U D Uᵀ = G,  U Uᵀ = I

U =
  -0.3709   0.5499   0.3392   0.6302   0.0992  -0.1844  -0.0633
  -0.3670  -0.6596  -0.1679   0.5164   0.1935   0.2972   0.0985
  -0.3727   0.3007  -0.6704  -0.2199   0.4635  -0.1529   0.1862
  -0.3792  -0.1411   0.5603  -0.4709   0.4938   0.1029  -0.2148
  -0.3851   0.2036  -0.2248  -0.1177  -0.4363   0.5162  -0.5377
  -0.3834  -0.3259  -0.0477  -0.0971  -0.3677  -0.7421  -0.2217
  -0.3870   0.0673   0.2016  -0.2071  -0.4104   0.1628   0.7531

D =
   6.6315        0        0        0        0        0        0
        0   0.2331        0        0        0        0        0
        0        0   0.1272        0        0        0        0
        0        0        0   0.0066        0        0        0
        0        0        0        0   0.0016        0        0
        0        0        0        0        0   0.0000        0
        0        0        0        0        0        0   0.0000
![Page 17: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/17.jpg)
17
Mapped points = sqrt(D) * Uᵀ

Mapped points =
  -0.9551  -0.9451  -0.9597  -0.9765  -0.9917  -0.9872  -0.9966
   0.2655  -0.3184   0.1452  -0.0681   0.0983  -0.1573   0.0325
   0.1210  -0.0599  -0.2391   0.1998  -0.0802  -0.0170   0.0719
   0.0511   0.0419  -0.0178  -0.0382  -0.0095  -0.0079  -0.0168
   0.0040   0.0077   0.0185   0.0197  -0.0174  -0.0146  -0.0163
  -0.0011   0.0018  -0.0009   0.0006   0.0032  -0.0045   0.0010
  -0.0002   0.0004   0.0007  -0.0008  -0.0020  -0.0008   0.0028

(one column per data point)
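Continuing the sketch above, the same construction in numpy: the columns of sqrt(D)·Uᵀ are explicit feature vectors whose plain inner products reproduce G.

```python
import numpy as np

# G = U D U^T (G is symmetric PSD, so the SVD is an eigendecomposition);
# mapped = sqrt(D) U^T gives one feature vector per column, and their
# ordinary inner products recover the kernel values exactly.
U, d, _ = np.linalg.svd(G)
mapped = np.sqrt(np.diag(d)) @ U.T
assert np.allclose(mapped.T @ mapped, G)
```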
![Page 18: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/18.jpg)
18
Roadmap I
We need feature maps:
• Implicit (kernel functions)
• Explicit (feature maps)
Several algorithms need only the inner products of the features!
It is much easier to use implicit feature maps (kernels).
But is a given function a kernel function???
![Page 19: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/19.jpg)
19
Roadmap II
Is it a kernel function???
• Finite domain: SVD, eigenvectors, eigenvalues ⇒ positive semidefinite matrices ⇒ finite-dimensional feature space.
• Arbitrary domain: Mercer's theorem, eigenfunctions, eigenvalues ⇒ positive semidefinite integral operators ⇒ infinite-dimensional feature space (l₂).
We have to think about the test data as well...
If the kernel is positive semidefinite ⇒ feature map construction.
![Page 20: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/20.jpg)
20
Mercer’s theorem
(*)
(The starred formula, in the slide image, decomposes the 2-variable function k into products of 1-variable functions.)
![Page 21: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/21.jpg)
21
Mercer’s theorem
...
![Page 22: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/22.jpg)
22
Roadmap III
We want to know which functions are kernels.
• How to make new kernels from old kernels?
• The polynomial kernel
We will show another way using RKHS:
Inner product = ???
![Page 23: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/23.jpg)
Ready for the details? ;)
![Page 24: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/24.jpg)
24
Hard 1-dimensional Dataset
What would SVMs do with this data?
Not a big surprise
(Figure: the 1-D data around x = 0, with the positive and negative "planes" marked.)
Doesn’t look like slack variables will save us this time…
taken from Andrew W. Moore
![Page 25: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/25.jpg)
25
Hard 1-dimensional Dataset
Make up a new feature! Sort of… computed from original feature(s):
$z_k = (x_k, x_k^2)$
New features are sometimes called basis functions.
Separable! MAGIC!
Now drop this "augmented" data into our linear SVM.
taken from Andrew W. Moore
![Page 26: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/26.jpg)
26
Hard 2-dimensional Dataset
(Figure: two classes, O and X, not linearly separable in 2-D.)
Let us map this point to the 3rd dimension...
![Page 27: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/27.jpg)
27
Kernels and Linear Classifiers
We will use linear classifiers in this feature space.
![Page 28: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/28.jpg)
28
Picture is taken from R. Herbrich
![Page 29: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/29.jpg)
29
Picture is taken from R. Herbrich
![Page 30: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/30.jpg)
30
Kernels and Linear Classifiers
Feature functions
![Page 31: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/31.jpg)
31
Back to the Perceptron Example
![Page 32: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/32.jpg)
32
The Perceptron
• The primal algorithm in the feature space
![Page 33: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/33.jpg)
33
The primal algorithm in the feature space
Picture is taken from R. Herbrich
![Page 34: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/34.jpg)
34
The Perceptron
![Page 35: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/35.jpg)
35
The Perceptron
The Dual Algorithm in the feature space
![Page 36: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/36.jpg)
36
The Dual Algorithm in the feature space
Picture is taken from R. Herbrich
![Page 37: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/37.jpg)
37
The Dual Algorithm in the feature space
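A minimal sketch of the dual perceptron (my own minimal variant, without a bias term; the slide's exact pseudocode is in the image): the weight vector w = Σᵢ αᵢ yᵢ φ(xᵢ) is never formed, and predictions use inner products, i.e. kernel evaluations, only.

```python
import numpy as np

def dual_perceptron(X, y, k, epochs=10):
    m = len(X)
    K = np.array([[k(a, b) for b in X] for a in X])     # Gram matrix
    alpha = np.zeros(m)                                 # one count per example
    for _ in range(epochs):
        for i in range(m):
            # f(x_i) = sum_j alpha_j y_j k(x_j, x_i): inner products only
            if y[i] * ((alpha * y) @ K[:, i]) <= 0:
                alpha[i] += 1.0                         # mistake-driven update
    return alpha
```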
![Page 38: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/38.jpg)
38
Kernels
Definition: (kernel)
![Page 39: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/39.jpg)
39
Kernels
Definition: (Gram matrix, kernel matrix)
Definition: (Feature space, kernel space)
![Page 40: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/40.jpg)
40
Kernel technique
Lemma:
The Gram matrix is symmetric, PSD matrix.
Proof:
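The proof itself is in the slide image; the standard one-line argument runs as follows (a reconstruction, using the feature-map definition of k):

```latex
% For any c in R^m:
c^\top G c
  = \sum_{i,j=1}^{m} c_i c_j \, k(x_i, x_j)
  = \sum_{i,j=1}^{m} c_i c_j \, \langle \phi(x_i), \phi(x_j) \rangle
  = \Bigl\lVert \sum_{i=1}^{m} c_i \, \phi(x_i) \Bigr\rVert^2 \;\ge\; 0 .
```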
Definition:
![Page 41: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/41.jpg)
41
Kernel technique
Key idea:
![Page 42: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/42.jpg)
42
Kernel technique
![Page 43: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/43.jpg)
43
Finite example
![Page 44: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/44.jpg)
44
Finite example
Lemma:
Proof:
![Page 45: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/45.jpg)
45
Kernel technique, Finite example
We have seen:
Lemma:
These conditions are necessary
![Page 46: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/46.jpg)
46
Kernel technique, Finite example
Proof: ... (wrong in Herbrich's book...)
![Page 47: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/47.jpg)
47
Kernel technique, Finite example
Summary:
How to generalize this to general sets???
![Page 48: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/48.jpg)
48
Integral operators, eigenfunctions
Definition: Integral operator with kernel k(.,.)
Remark:
![Page 49: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/49.jpg)
49
From Vector domain to Functions
• Observe that each vector v = (v[1], v[2], ..., v[n]) is a mapping from the integers {1, 2, ..., n} to ℝ
• We can generalize this easily to an INFINITE domain:
w = (w[1], w[2], ..., w[n], ...) where w is a mapping from {1, 2, ...} to ℝ
![Page 50: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/50.jpg)
50
From Vector domain to Functions
From the integers we can further extend to:
• ℝ or ℝᵐ
• Strings
• Graphs
• Sets
• Whatever
• …
![Page 51: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/51.jpg)
51
Lp and lp spaces
Picture is taken from R. Herbrich
![Page 52: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/52.jpg)
52
Lp and lp spaces
Picture is taken from R. Herbrich
![Page 53: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/53.jpg)
53
L2 and l2 special cases
Picture is taken from R. Herbrich
![Page 54: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/54.jpg)
54
Kernels
Definition: inner product, Hilbert spaces
![Page 55: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/55.jpg)
55
Integral operators, eigenfunctions
Definition: Eigenvalue, Eigenfunction
![Page 56: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/56.jpg)
56
Positive (semi) definite operators
Definition: Positive Definite Operator
![Page 57: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/57.jpg)
57
Mercer’s theorem
(*)
(Again, the starred formula decomposes the 2-variable function k into a sum of products of 1-variable functions, the eigenfunctions of the integral operator.)
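The starred statement lives in the slide image; in its standard textbook form (a reconstruction, with λᵢ and ψᵢ the eigenvalues and eigenfunctions of the integral operator defined above), Mercer's theorem reads:

```latex
% Mercer's theorem, standard form (reconstruction of the slide's (*)):
k(x, x') = \sum_{i=1}^{\infty} \lambda_i \, \psi_i(x) \, \psi_i(x'),
\qquad
\phi(x) = \bigl( \sqrt{\lambda_1}\,\psi_1(x), \sqrt{\lambda_2}\,\psi_2(x), \dots \bigr) \in \ell_2,
\qquad
k(x, x') = \langle \phi(x), \phi(x') \rangle .
```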
![Page 58: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/58.jpg)
58
Mercer’s theorem
...
![Page 59: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/59.jpg)
59
A nicer characterization
Theorem: nicer kernel characterization
![Page 60: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/60.jpg)
60
Kernel Families
• Kernels have the intuitive meaning of a similarity measure between objects.
• So far we have seen two ways of making a linear classifier nonlinear in the input space:
1. (explicit) Choosing a mapping φ ⇒ Mercer kernel k
2. (implicit) Choosing a Mercer kernel k ⇒ Mercer map φ
![Page 61: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/61.jpg)
61
Designing new kernels from kernels
are also kernels.
Picture is taken from R. Herbrich
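The exact closure rules are listed in the slide image; as a quick numerical spot check (not a proof, and with my own made-up sample points), sums, nonnegative scalings, and products of kernels all yield PSD Gram matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))

k1 = lambda a, b: (a @ b + 1.0) ** 2                 # polynomial kernel
k2 = lambda a, b: np.exp(-np.sum((a - b) ** 2))      # Gaussian kernel

def min_eig(k):
    G = np.array([[k(a, b) for b in X] for a in X])
    return np.linalg.eigvalsh(G).min()

for k in (lambda a, b: k1(a, b) + k2(a, b),          # k1 + k2
          lambda a, b: 3.0 * k1(a, b),               # c * k1, with c >= 0
          lambda a, b: k1(a, b) * k2(a, b)):         # k1 * k2
    assert min_eig(k) >= -1e-9
```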
![Page 62: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/62.jpg)
62
Designing new kernels from kernels
Picture is taken from R. Herbrich
![Page 63: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/63.jpg)
63
Designing new kernels from kernels
![Page 64: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/64.jpg)
64
Kernels on inner product spaces
Note:
![Page 65: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/65.jpg)
65
Picture is taken from R. Herbrich
![Page 66: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/66.jpg)
66
Common Kernels
• Polynomials of degree d
• Polynomials of degree up to d
• Sigmoid
• Gaussian kernels
Equivalent to a φ(x) of infinite dimensionality!
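The slide's exact parametrizations are in the image; in their usual textbook forms (the parameter names d, c, κ, θ, σ are mine) these kernels look like:

```python
import numpy as np

def poly_degree_d(a, b, d=3):                # polynomials of degree d
    return (a @ b) ** d

def poly_up_to_d(a, b, d=3, c=1.0):          # polynomials of degree up to d
    return (a @ b + c) ** d

def sigmoid_kernel(a, b, kappa=1.0, theta=0.0):
    # caution: the tanh "kernel" is not PSD for all (kappa, theta)
    return np.tanh(kappa * (a @ b) + theta)

def gaussian_kernel(a, b, sigma=1.0):        # RBF: infinite-dim feature map
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))
```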
![Page 67: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/67.jpg)
67
The RBF kernel
Note:
Proof:
![Page 68: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/68.jpg)
68
The RBF kernel
Note:
Note:
Proof:
![Page 69: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/69.jpg)
69
The Polynomial kernel
![Page 70: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/70.jpg)
70
Reminder: Hard 1-dimensional Dataset
Make up a new feature!
Sort of… computed from original feature(s):
$z_k = (x_k, x_k^2)$
New features are sometimes called basis functions.
Separable! MAGIC!
Now drop this "augmented" data into our linear SVM.
taken from Andrew W. Moore
![Page 71: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/71.jpg)
71
… New Features from Old …
• Here: mapped ℝ¹ to ℝ² by φ: x ↦ [x, x²]
• Found "extra dimensions" ⇒ linearly separable!
• In general:
• Start with vector x ∈ ℝᴺ
• Want to add in x₁², x₂², …
• Probably want other terms, e.g. x₂x₇, …
• Which ones to include? Why not ALL OF THEM???
![Page 72: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/72.jpg)
72
Special Case
• x = (x₁, x₂, x₃) ↦ φ(x) = (1, x₁, x₂, x₃, x₁², x₂², x₃², x₁x₂, x₁x₃, x₂x₃)
• ℝ³ → ℝ¹⁰: N = 3, n = 10
In general, the dimension of the quadratic map:
$$n \;=\; 1 + N + N + \frac{N(N-1)}{2} \;=\; \frac{(N+1)(N+2)}{2} \;\approx\; \frac{N^2}{2}$$
(for N = 3 this gives 10, matching the example above)
taken from Andrew W. Moore
![Page 73: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/73.jpg)
73
Quadratic Basis Functions

Let
$$\Phi(\mathbf{x}) \;=\; \Bigl(\underbrace{1}_{\text{constant term}},\;
\underbrace{\sqrt{2}x_1,\,\ldots,\,\sqrt{2}x_N}_{\text{linear terms}},\;
\underbrace{x_1^2,\,\ldots,\,x_N^2}_{\text{pure quadratic terms}},\;
\underbrace{\sqrt{2}x_1x_2,\,\sqrt{2}x_1x_3,\,\ldots,\,\sqrt{2}x_{N-1}x_N}_{\text{quadratic cross-terms}}\Bigr)$$

What about those $\sqrt{2}$'s?? … stay tuned
taken from Andrew W. Moore
![Page 74: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/74.jpg)
74
Quadratic Dot Products

$$\Phi(\mathbf{a}) \cdot \Phi(\mathbf{b}) \;=\; 1 \;+\; 2\sum_{i=1}^{N} a_i b_i \;+\; \sum_{i=1}^{N} a_i^2 b_i^2 \;+\; 2\sum_{i=1}^{N}\sum_{j=i+1}^{N} a_i a_j b_i b_j$$

taken from Andrew W. Moore
![Page 75: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/75.jpg)
75
Quadra
tic
Dot
Pro
duct
s
)(),( ba
N
i
N
ijjiji
N
iii
N
iii bbaababa
1 11
2
1
2)(21
Now consider another fn of a and b
(ab1)2
(ab)2 2ab1
121
2
1
N
iii
N
iii baba
1211 1
N
iii
N
i
N
jjjii bababa
122)(11 11
2
N
iii
N
i
N
ijjjii
N
iii babababa
They’re the same!
And this is only O(N) to compute… not O(N2)taken from Andrew W. Moore
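A quick numerical confirmation of the identity (using the √2-scaled Φ from the Quadratic Basis Functions slide; the test vectors are made up):

```python
import numpy as np

def Phi(x):
    n = len(x)
    cross = [np.sqrt(2.0) * x[i] * x[j]
             for i in range(n) for j in range(i + 1, n)]
    return np.concatenate([[1.0], np.sqrt(2.0) * x, x ** 2, cross])

rng = np.random.default_rng(2)
a, b = rng.normal(size=5), rng.normal(size=5)

# O(N^2)-dimensional features vs. the O(N) shortcut: same number.
assert np.isclose(Phi(a) @ Phi(b), (a @ b + 1.0) ** 2)
```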
![Page 76: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/76.jpg)
76
Higher Order Polynomials

$$Q_{kl} = y_k y_l \,\bigl(\Phi(\mathbf{x}_k) \cdot \Phi(\mathbf{x}_l)\bigr)$$

| Polynomial | Φ(x) | Cost to build Q_kl matrix: traditional | Cost if N=100 dim inputs | Φ(a)·Φ(b) | Cost to build Q_kl matrix: sneaky | Cost if N=100 dim inputs |
|---|---|---|---|---|---|---|
| Quadratic | All N²/2 terms up to degree 2 | N²R²/4 | 2,500 R² | (a·b+1)² | NR²/2 | 50 R² |
| Cubic | All N³/6 terms up to degree 3 | N³R²/12 | 83,000 R² | (a·b+1)³ | NR²/2 | 50 R² |
| Quartic | All N⁴/24 terms up to degree 4 | N⁴R²/48 | 1,960,000 R² | (a·b+1)⁴ | NR²/2 | 50 R² |

(R is the number of training points)
taken from Andrew W. Moore
![Page 77: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/77.jpg)
77
The Polynomial kernel, General case
We are going to map these to a larger space
We want to show that this k is a kernel function
![Page 78: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/78.jpg)
78
The Polynomial kernel, General case
P factors
We are going to map these to a larger space
![Page 79: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/79.jpg)
79
The Polynomial kernel, General case
We already know:
We want to get k in this form:
![Page 80: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/80.jpg)
80
The Polynomial kernel
For example
We already know:
![Page 81: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/81.jpg)
81
The Polynomial kernel
![Page 82: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/82.jpg)
82
The Polynomial kernel
⇒ k is really a kernel!
![Page 83: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/83.jpg)
83
Reproducing Kernel Hilbert Spaces
![Page 84: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/84.jpg)
84
RKHS, Motivation
Now, we show another way using RKHS
What objective do we want to optimize?
1.,
2.,
![Page 85: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/85.jpg)
85
RKHS, Motivation
1st term: empirical loss; 2nd term: regularization
3., How can we minimize the objective over functions???
• Be PARAMETRIC!!!...
(nope, we do not like that...)
• Use RKHS, and suddenly the problem will be finite dimensional optimization only (yummy...)
The Representer theorem will help us here
![Page 86: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/86.jpg)
86
Reproducing Kernel Hilbert Spaces
Now, we show another way using RKHS
Completing (closing) a pre-Hilbert space ⇒ Hilbert space
![Page 87: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/87.jpg)
87
Reproducing Kernel Hilbert Spaces
The inner product:
(*)
![Page 88: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/88.jpg)
88
Reproducing Kernel Hilbert Spaces
Note:
Proof:
(*)
![Page 89: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/89.jpg)
89
Reproducing Kernel Hilbert Spaces
Lemma:
• Pre-Hilbert space: Like the Euclidean space with rational scalars only
• Hilbert space: Like the Euclidean space with real scalars
Proof:
![Page 90: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/90.jpg)
90
Reproducing Kernel Hilbert Spaces
Lemma: (Reproducing property)
Lemma: The constructed features match k
Huhh...
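The lemma's formulas are in the slide image; the standard statement of the reproducing property (a reconstruction) is:

```latex
% Reproducing property, and why phi(x) = k(., x) matches k:
\langle f, \, k(\cdot, x) \rangle_{\mathcal H} = f(x)
\quad \forall f \in \mathcal H
\qquad \Longrightarrow \qquad
\langle k(\cdot, x), \, k(\cdot, x') \rangle_{\mathcal H} = k(x, x') .
```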
![Page 91: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/91.jpg)
91
Reproducing Kernel Hilbert Spaces
Proof of property 4.,:
rep. property
CBS (Cauchy–Bunyakovsky–Schwarz). For CBS we don't need 4., we need only that ⟨0, 0⟩ = 0!
![Page 92: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/92.jpg)
92
Methods to Construct Feature Spaces
We now have two methods to construct feature maps from kernels.
Well, these feature spaces are all isomorphic with each other...
![Page 93: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/93.jpg)
93
The Representer Theorem
In the perceptron problem we could use the dual algorithm, because we had this representation:
![Page 94: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/94.jpg)
94
The Representer Theorem
Theorem:
1st term: empirical loss; 2nd term: regularization
![Page 95: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/95.jpg)
95
The Representer Theorem
Proof of Representer Theorem:
Message: optimizing over general function classes is difficult, but in an RKHS it becomes only a finite (m-dimensional) problem!
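For instance, for regularized squared loss the theorem reduces the search over the whole RKHS to the m coefficients α below (a minimal kernel ridge regression sketch; the function names are mine):

```python
import numpy as np

# Representer theorem in action: the RKHS minimizer of
#   sum_i (f(x_i) - y_i)^2 + lam * ||f||^2
# has the form f(x) = sum_i alpha_i k(x_i, x), so only the
# m-dimensional linear system (K + lam * I) alpha = y must be solved.
def kernel_ridge_fit(X, y, k, lam=0.1):
    K = np.array([[k(a, b) for b in X] for a in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, k, x):
    return sum(a_i * k(x_i, x) for a_i, x_i in zip(alpha, X_train))
```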
![Page 96: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/96.jpg)
96
Proof of the Representer Theorem
1st term: empirical loss; 2nd term: regularization
![Page 97: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/97.jpg)
97
1st term: empirical loss; 2nd term: regularization
Proof of the Representer Theorem
![Page 98: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/98.jpg)
98
Later will come
• Supervised Learning
  • SVM using kernels
  • Gaussian Processes
    • Regression
    • Classification
    • Heteroscedastic case
• Unsupervised Learning
  • Kernel Principal Component Analysis
  • Kernel Independent Component Analysis
  • Kernel Mutual Information
  • Kernel Generalized Variance
  • Kernel Canonical Correlation Analysis
![Page 99: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/99.jpg)
99
If we still have time…
• Automatic Relevance Machines
• Bayes Point Machines
• Kernels on other objects
  • Kernels on graphs
  • Kernels on strings
• Fisher kernels
• ANOVA kernels
• Learning kernels
![Page 100: Kernel Methods](https://reader036.fdocuments.us/reader036/viewer/2022062309/56814400550346895db09556/html5/thumbnails/100.jpg)
100
Thanks for the Attention!