Support Vector Machines
Reading: Ben-Hur and Weston, “A User’s Guide to Support Vector Machines” (linked from class web page)
Notation
• Assume a binary classification problem: $h(x) \in \{-1, +1\}$
• Instances are represented by vectors $x \in \mathbb{R}^n$.
• Training examples: $x_k = (x_1, x_2, \ldots, x_n)$
• Training set:
$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, where $(x_k, y_k) \in \mathbb{R}^n \times \{+1, -1\}$
Questions on notation...
Here, assume positive and negative instances are to be separated by the hyperplane $w \cdot x + b = 0$, where $b$ is the bias.
Equation of line: $w_1 x_1 + w_2 x_2 + b = 0$
[Figure: positive and negative points in the $(x_1, x_2)$ plane, separated by a line]
• Intuition: The best hyperplane for future generalization will “maximally” separate the examples (assuming new examples will come from the same distribution)
Definition of Margin
The margin is defined as the distance from the separating hyperplane to the nearest training example.
Note: Sometimes the margin is defined as twice this distance.
Definition of Margin
Vapnik (1979) showed that the hyperplane maximizing the margin of S will have minimal VC dimension in the set of all consistent hyperplanes, and will thus be optimal.
This is an optimization problem!
Support vectors:
Training examples that lie on the margin.
[Figure: http://nlp.stanford.edu/IR-book/html/htmledition/img1260.png]
The support vectors satisfy $w \cdot x + b = \pm a$ for some $a > 0$.
Without changing the problem, we can rescale our data to set $a = 1$.
Length of the margin
$w$ is perpendicular to the decision boundary.
• The length of the margin is $\frac{1}{\lVert w \rVert}$.
• So to maximize the margin, we need to minimize $\lVert w \rVert$.
Minimizing $\lVert w \rVert$
Find $w$ and $b$ by doing the following minimization:
Minimize $\frac{1}{2}\lVert w \rVert^2$ subject to
$w \cdot x_k + b \ge +1$ for $y_k = +1$
$w \cdot x_k + b \le -1$ for $y_k = -1$
This is a quadratic optimization problem.
Shorthand for the constraints: $y_k (w \cdot x_k + b) \ge 1$ for all $k$.
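To see what this optimization looks like concretely, here is a minimal sketch (ours, not from the slides) that solves the hard-margin problem on a tiny made-up dataset with a general-purpose solver; the dataset and variable names are illustrative assumptions.

```python
# Minimal sketch (ours): solve the hard-margin primal
#   minimize (1/2)||w||^2  subject to  y_k (w . x_k + b) >= 1
# with scipy's SLSQP solver. The toy dataset is illustrative.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):
    w = p[:2]                     # p = (w1, w2, b)
    return 0.5 * np.dot(w, w)

constraints = [
    # one inequality  y_k (w . x_k + b) - 1 >= 0  per training example
    {"type": "ineq", "fun": lambda p, k=k: y[k] * (X[k] @ p[:2] + p[2]) - 1.0}
    for k in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, " b =", b, " margin =", 1.0 / np.linalg.norm(w))
```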
Dual Representation
• It turns out that $w$ can be expressed as a linear combination of the training examples:
$w = \sum_{k=1}^{m} \alpha_k x_k$
where the $\alpha_k$ are nonzero only for the support vectors (and carry the sign of the class label $y_k$).
• The results of the SVM training algorithm (involving solving a quadratic programming problem) are the $\alpha_k$ and the bias $b$.
SVM Classification
• Classify new example $x$ as follows:
$h(x) = \operatorname{sgn}\left( \sum_{k=1}^{m} \alpha_k (x_k \cdot x) + b \right)$
where $|S| = m$. ($S$ is the training set.)
• Note that the dot product is a kind of "similarity measure".
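Once the $\alpha_k$ and $b$ are known, the classification rule above is only a few lines of code. A minimal sketch (ours), using the signed-$\alpha$ convention of these slides:

```python
# Minimal sketch of the rule h(x) = sgn(sum_k alpha_k (x_k . x) + b).
# Here each alpha_k is signed (it already folds in the label y_k),
# matching the convention in these slides' example output.
import numpy as np

def svm_classify(x, support_vectors, alphas, b):
    """Return the predicted class (+1 or -1) for instance x."""
    score = sum(a * np.dot(sv, x) for sv, a in zip(support_vectors, alphas))
    return 1 if score + b >= 0 else -1
```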
Example
[Figure: the six training points plotted in the $(x_1, x_2)$ plane, axes from $-2$ to $2$]

Input to SVM optimizer:

    x1   x2   class
     1    1     1
     1    2     1
     2    1     1
    -1    0    -1
     0   -1    -1
    -1   -1    -1

Output from SVM optimizer:

    Support vector     α
    (-1, 0)         -.208
    (1, 1)           .416
    (0, -1)         -.208

    b = -.376

Weight vector (compute this):
$w = \sum_{k \in S} \alpha_k x_k = -.208\,(-1, 0) + .416\,(1, 1) - .208\,(0, -1) = (.624, .624)$

Separation line (sketch this):
$.624\,x_1 + .624\,x_2 - .376 = 0$, i.e., $x_2 = -x_1 + .603$
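A quick numeric check (ours) that the optimizer output above reproduces the weight vector and separation line:

```python
# Check (ours): the reported signed alphas and bias give the weight vector
# w = sum_k alpha_k x_k and the separating line computed above.
import numpy as np

support_vectors = np.array([[-1.0, 0.0], [1.0, 1.0], [0.0, -1.0]])
alphas = np.array([-0.208, 0.416, -0.208])   # signed coefficients
b = -0.376

w = alphas @ support_vectors                 # -> [0.624, 0.624]
print("w =", w)
print("line: x1 + x2 =", -b / w[0])          # -> approximately 0.603
```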
Example
[Figure: the six training points and the separating line]
Classifying a new point $x$:
$h(x) = \operatorname{sgn}\left( \sum_{k \in S} \alpha_k (x_k \cdot x) + b \right)$
Compute this.
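For instance, classifying the hypothetical new point $(2, 2)$ (our choice of test point, not the slides'):

```python
# Classify the hypothetical point (2, 2) -- our choice of test point --
# using the example's support vectors, signed alphas, and bias.
import numpy as np

support_vectors = np.array([[-1.0, 0.0], [1.0, 1.0], [0.0, -1.0]])
alphas = np.array([-0.208, 0.416, -0.208])
b = -0.376

x_new = np.array([2.0, 2.0])
score = sum(a * np.dot(sv, x_new) for a, sv in zip(alphas, support_vectors)) + b
print(score, "->", +1 if score >= 0 else -1)   # 2.12 -> +1
```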
svm_light Demo
In-Class Exercises
Non-linearly separable training examples
• What if the training examples are not linearly separable?
• Use an old trick: find a function that maps points to a higher-dimensional space ("feature space") in which they are linearly separable, and do the classification in that higher-dimensional space.
Need to find a function $\Phi$ that will perform such a mapping:
$\Phi: \mathbb{R}^n \rightarrow F$
Then we can find a hyperplane in the higher-dimensional feature space $F$, and do the classification using that hyperplane in the higher-dimensional space.
[Figure: points in the $(x_1, x_2)$ plane that are not linearly separable become separable when mapped to $(x_1, x_2, x_3)$]
Exercise
Find $\Phi: \mathbb{R}^2 \rightarrow F$ to create a 3-dimensional feature space in which XOR is linearly separable.
I.e., find $\Phi(x_1, x_2) = (x_1, x_2, x_3)$ so that the points $(0, 0), (0, 1), (1, 0), (1, 1)$ become separable.
[Figure: the four XOR points in the $(x_1, x_2)$ plane, and empty $(x_1, x_2, x_3)$ axes for the sketch]
• Problem:
– Recall that classification of instance x is expressed in terms of dot products of x and support vectors.
– The quadratic programming problem of finding the support vectors and coefficients also depends only on dot products between training examples.
– So if each $x_i$ is replaced by $\Phi(x_i)$ in these procedures, we will have to calculate $\Phi(x_i)$ for each $i$, as well as a lot of dot products $\Phi(x_i) \cdot \Phi(x_j)$.
– In general, if the feature space $F$ is high-dimensional, $\Phi(x_i) \cdot \Phi(x_j)$ will be expensive to compute.
• Second trick:
– Suppose that there were some magic function,
$k(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$
such that $k$ is cheap to compute even though $\Phi(x_i) \cdot \Phi(x_j)$ is expensive to compute.
– Then we wouldn’t need to compute the dot product directly; we’d just need to compute k during both the training and testing phases.
– The good news is: such k functions exist! They are called kernel functions, and come from the theory of integral operators.
Example: Polynomial kernel
Suppose $x = (x_1, x_2)$ and $z = (z_1, z_2)$. Then
$k(x, z) = (x \cdot z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = \Phi(x) \cdot \Phi(z)$,
where $\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$.
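A small sanity check (ours) that the kernel value really equals the dot product under the explicit feature map:

```python
# Sanity check (ours): the degree-2 polynomial kernel (x . z)^2 equals the
# dot product of the explicit feature maps Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ z) ** 2)        # kernel value: (1*3 + 2*(-1))^2 = 1.0
print(phi(x) @ phi(z))     # same value via the feature map: 1.0
```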
Example
Does the degree-2 polynomial kernel make XOR linearly separable?
[Figure: the four XOR points $(0, 0), (0, 1), (1, 0), (1, 1)$ in the $(x_1, x_2)$ plane]
Recap of SVM algorithm
Given training set
$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, where $(x_i, y_i) \in \mathbb{R}^n \times \{+1, -1\}$:
1. Choose a map $\Phi: \mathbb{R}^n \rightarrow F$, which maps $x_i$ to a higher-dimensional feature space. (Solves the problem that the data might not be linearly separable in the original space.)
2. Choose a cheap-to-compute kernel function
$k(x, z) = \Phi(x) \cdot \Phi(z)$
(Solves the problem that in high-dimensional spaces, dot products are very expensive to compute.)
3. Apply a quadratic programming procedure (using the kernel function $k$) to find a hyperplane $(w, b)$,
$w = \sum_{i} \alpha_i \Phi(x_i)$,
where the $\Phi(x_i)$'s are support vectors, the $\alpha_i$'s are coefficients, and $b$ is a bias, such that $(w, b)$ is the hyperplane maximizing the margin of the training set in feature space $F$.
4. Now, given a new instance $x$, find the classification of $x$ by computing
$h(x) = \operatorname{sgn}\left( \sum_{i} \alpha_i\, k(x_i, x) + b \right)$
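Putting the recap together, here is a minimal end-to-end sketch (ours, using scikit-learn rather than the svm_light demo from class) on the earlier example's data:

```python
# End-to-end sketch (ours) of the recap using scikit-learn; the course demo
# uses svm_light instead. SVC solves the QP; the kernel stands in for Phi.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [1, 2], [2, 1], [-1, 0], [0, -1], [-1, -1]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="poly", degree=2, C=1e6)   # large C approximates a hard margin
clf.fit(X, y)

print(clf.support_vectors_)                 # support vectors
print(clf.dual_coef_, clf.intercept_)       # signed alphas and bias b
print(clf.predict([[2, 2]]))                # classify a new instance
```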
More on Kernels
• So far we've seen kernels that map instances in $\mathbb{R}^n$ to instances in $\mathbb{R}^z$, where $z > n$.
• One way to create a kernel: figure out an appropriate feature space $\Phi(x)$, and find a kernel function $k$ which defines the inner product on that space.
• More practically, we usually don't know the appropriate feature space $\Phi(x)$.
• What people do in practice is either:
1. Use one of the “classic” kernels (e.g., polynomial),
or
2. Define their own function that is appropriate for their task, and show that it qualifies as a kernel.
How to define your own kernel
• Given training data $(x_1, x_2, \ldots, x_m)$
• The algorithm for SVM learning uses the kernel matrix (also called the Gram matrix):
$K_{ij} = k(x_i, x_j), \quad i, j = 1, \ldots, m$
• We can choose some function k, and compute the kernel matrix K using the training data.
• We just have to guarantee that our kernel defines an inner product on some feature space.
• Not as hard as it sounds.
What counts as a kernel?
• Mercer's Theorem: If the kernel matrix $K$ is symmetric and positive semidefinite, it defines a kernel; that is, it defines an inner product in some feature space.
• We don't even have to know what that feature space is! It can have a huge number of dimensions.
Structural Kernels
• In domains such as natural language processing and bioinformatics, the similarities we want to capture are often structural (e.g., parse trees, formal grammars).
• An important area of kernel methods is defining structural kernels that capture this similarity (e.g., sequence alignment kernels, tree kernels, graph kernels, etc.)
From www.cs.pitt.edu/~tomas/cs3750/kernels.ppt:
• Design criteria - we want kernels to be
– valid – Satisfy Mercer condition of positive semidefiniteness
– good – embody the “true similarity” between objects
– appropriate – generalize well
– efficient – the computation of k(x,x’) is feasible
Example: Watson used tree kernels and SVMs to classify question types for Jeopardy! questions.
(From Moschitti et al., 2011)