Lecture 10: Support Vector Machines
Machine Learning, Queens College
Today
• Support Vector Machines
• Note: we’ll rely on some math from Optimality Theory that we won’t derive.
Maximum Margin
• Linear Classifiers can lead to many equally valid choices for the decision boundary
Are these really “equally valid”?
Max Margin
• How can we pick which is best?
• Maximize the size of the margin.
Small Margin
Large Margin
Support Vectors
• Support vectors are those input points (vectors) closest to the decision boundary.
  1. They are vectors.
  2. They “support” the decision hyperplane.
Support Vectors
• Define this as a decision problem
• The decision hyperplane: wᵀx + b = 0
• No fancy math, just the equation of a hyperplane.
Support Vectors
• Aside: Why do some classifiers use targets in {0, 1} and others in {−1, +1}?
  – Simplicity of the math and interpretation.
  – For probability density function estimation, {0, 1} has a clear correlate.
  – For classification, a decision boundary of 0 is more easily interpretable than 0.5.
Support Vectors
• Define this as a decision problem
• The decision hyperplane: wᵀx + b = 0
• Decision function: f(x) = sign(wᵀx + b)
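As a concrete sketch (the variable values here are illustrative assumptions, not from the slides), the decision function can be written as:

```python
import numpy as np

def decision(w, b, x):
    """Decision function for a linear classifier: sign(w.x + b).

    Points with w.x + b = 0 lie exactly on the decision hyperplane;
    the sign tells us which side of the hyperplane x falls on.
    """
    return np.sign(np.dot(w, x) + b)

# Illustrative hyperplane x1 + x2 - 3 = 0 in 2D
w = np.array([1.0, 1.0])
b = -3.0
print(decision(w, b, np.array([3.0, 3.0])))   # positive side: 1.0
print(decision(w, b, np.array([0.0, 0.0])))   # negative side: -1.0
```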
Support Vectors
• Define this as a decision problem
• The decision hyperplane: wᵀx + b = 0
• Margin hyperplanes: wᵀx + b = 1 and wᵀx + b = −1
Support Vectors
• The decision hyperplane: wᵀx + b = 0
• Scale invariance: for any κ > 0, κwᵀx + κb = 0 defines the same hyperplane.
This scaling does not change the decision hyperplane or the support vector hyperplanes, but it will eliminate a variable from the optimization.
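In symbols (a standard way to state this convention, filled in where the slide showed an equation):

```latex
% For any \kappa > 0, (\kappa w, \kappa b) defines the same hyperplane:
\kappa w^{\top}x + \kappa b = 0 \iff w^{\top}x + b = 0 .
% We exploit this freedom to fix the scale so the closest points satisfy
\min_i \, |w^{\top}x_i + b| = 1,
% which places the support vectors on the hyperplanes w^{\top}x + b = \pm 1.
```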
What are we optimizing?
• We will represent the size of the margin in terms of w.
• This will allow us to simultaneously
  – identify a decision boundary
  – maximize the margin
How do we represent the size of the margin in terms of w?
1. There must be at least one point that lies on each support hyperplane.
Proof outline: If not, we could define a larger margin support hyperplane that does touch the nearest point(s).
2. Thus, for a point x1 on the positive margin hyperplane: wᵀx1 + b = 1
3. And, for a point x2 on the negative margin hyperplane: wᵀx2 + b = −1
• The vector w is perpendicular to the decision hyperplane.
  – If the dot product of two vectors equals zero, the two vectors are perpendicular.
• The margin is the projection of x1 – x2 onto w, the normal of the hyperplane.
Aside: Vector Projection
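The slide's figure is missing; the standard projection formula it illustrates is:

```latex
% Projection of a vector u onto a vector w:
\operatorname{proj}_{w}(u) = \frac{u^{\top}w}{\|w\|^{2}}\, w,
\qquad
\text{with (signed) length } \frac{u^{\top}w}{\|w\|} .
```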
• Projection of x1 – x2 onto the unit normal w/||w||: wᵀ(x1 – x2)/||w||
• Size of the margin: since wᵀx1 + b = 1 and wᵀx2 + b = −1, we get wᵀ(x1 – x2) = 2, so the margin is 2/||w||.
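A quick numerical check (the hyperplane and the two margin points are illustrative assumptions):

```python
import numpy as np

# Illustrative hyperplane: w.x + b = 0 with w = (1, 0), b = -2,
# so the margin hyperplanes are x1 = 3 and x1 = 1.
w = np.array([1.0, 0.0])
b = -2.0
x1 = np.array([3.0, 1.0])   # on the +1 margin hyperplane: w.x1 + b = 1
x2 = np.array([1.0, 1.0])   # on the -1 margin hyperplane: w.x2 + b = -1

# Projection of (x1 - x2) onto the unit normal w/||w||
margin = np.dot(w, x1 - x2) / np.linalg.norm(w)
print(margin)                       # 2.0
print(2.0 / np.linalg.norm(w))      # 2/||w|| gives the same value
```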
Maximizing the margin
• Goal: maximize the margin, 2/||w||; equivalently, minimize ||w||²/2.
• Constraint: linear separability of the data by the decision boundary, yi(wᵀxi + b) ≥ 1 for all i.
Max Margin Loss Function
• For constrained optimization, use Lagrange multipliers.
• Optimize the “Primal”
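The primal Lagrangian the slide refers to (reconstructed from the standard hard-margin formulation):

```latex
% Primal Lagrangian for the hard-margin SVM:
L_{p}(w, b, \alpha) = \frac{1}{2}\|w\|^{2}
  - \sum_{i} \alpha_{i}\left[ y_{i}(w^{\top}x_{i} + b) - 1 \right],
\qquad \alpha_{i} \ge 0 .
% Minimize over w and b; maximize over the Lagrange multipliers \alpha_i.
```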
• Partial with respect to b: ∂L/∂b = −Σi αi yi = 0, so Σi αi yi = 0.
• Partial with respect to w: ∂L/∂w = w − Σi αi yi xi = 0, so w = Σi αi yi xi.
• Now we have to find the αi: substitute w = Σi αi yi xi back into the loss function.
Max Margin Loss Function
• Construct the “dual”
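The dual obtained by that substitution (reconstructed from the standard derivation):

```latex
% Dual of the hard-margin SVM (maximize over \alpha):
W(\alpha) = \sum_{i} \alpha_{i}
  - \frac{1}{2} \sum_{i}\sum_{j} \alpha_{i}\alpha_{j}\, y_{i}y_{j}\, x_{i}^{\top}x_{j}
\quad \text{subject to} \quad
\alpha_{i} \ge 0, \;\; \sum_{i} \alpha_{i} y_{i} = 0 .
```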
Dual formulation of the error
• Solve this quadratic program to identify the Lagrange multipliers, and thus the weights.

There exist (rather) fast approaches to quadratic programming in C, C++, Python, Java, and R.
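As an illustration (a toy projected-gradient ascent with a penalty for the equality constraint, not a production QP solver; the data set and step sizes are assumptions chosen for this sketch), the dual can be solved numerically for a tiny separable problem:

```python
import numpy as np

# Toy linearly separable data: class +1 right of x1 = 2, class -1 left of it.
X = np.array([[3.0, 1.0], [3.0, -1.0], [1.0, 1.0], [1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Dual objective: W(a) = sum(a) - 0.5 * a' Q a, with Q_ij = y_i y_j x_i.x_j
Q = (y[:, None] * y[None, :]) * (X @ X.T)

# Projected gradient ascent, with a quadratic penalty pushing
# sum_i a_i y_i toward 0 and clipping to keep a_i >= 0.
alpha = np.zeros(len(y))
lr, rho = 0.01, 10.0
for _ in range(20000):
    grad = 1.0 - Q @ alpha - rho * (alpha @ y) * y
    alpha = np.maximum(alpha + lr * grad, 0.0)

# Recover the weights and bias from the support vectors (alpha_i > 0).
w = (alpha * y) @ X
sv = alpha > 1e-4
b = np.mean(y[sv] - X[sv] @ w)
print(np.sign(X @ w + b))   # should match y
```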
Quadratic Programming
• If Q is positive semi-definite, then f(x) is convex.
• If f(x) is convex, then any local minimum is a global minimum.
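The generic quadratic program this refers to (standard minimization form, filled in where the slide showed an equation):

```latex
% Generic quadratic program:
\min_{x} \; f(x) = \frac{1}{2}\, x^{\top} Q\, x + c^{\top} x
\quad \text{subject to} \quad A x \le d .
% If Q is positive semi-definite, f is convex and the QP has no
% spurious local optima.
```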
Support Vector Expansion
• When αi is non-zero, xi is a support vector.
• When αi is zero, xi is not a support vector.
• New decision function: f(x) = sign(Σi αi yi xiᵀx + b), independent of the dimension of x!
Karush-Kuhn-Tucker Conditions
• In constrained optimization, at the optimal solution:
  – constraint × Lagrange multiplier = 0, i.e. αi [yi(wᵀxi + b) − 1] = 0 for every i

Only points on the margin hyperplanes contribute to the solution!
Visualization of Support Vectors
Interpretability of SVM parameters
• What else can we tell from the alphas?
  – If αi is large, the associated data point is quite important: it is either an outlier, or a point critical to defining the boundary.
• But this only gives us the best solution for linearly separable data sets…
Basis of Kernel Methods
• The decision process doesn’t depend on the dimensionality of the data.
• We can map the data to a higher-dimensional space.
• Note: data points only appear within a dot product.
• The error is based on the dot product of data points, not the data points themselves.
Basis of Kernel Methods
• Since data points only appear within a dot product, we can map to another space through the replacement xiᵀxj → φ(xi)ᵀφ(xj).
• The error is based on the dot product of data points, not the data points themselves.
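For instance (an illustrative check, not from the slides), a degree-2 polynomial kernel k(x, z) = (xᵀz)² equals an explicit dot product in a mapped feature space:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed directly in the input space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z))          # (1*3 + 2*(-1))^2 = 1.0
print(np.dot(phi(x), phi(z)))     # same value via the explicit mapping
```

The kernel never constructs φ(x) explicitly, which is what makes high- (or infinite-) dimensional feature spaces tractable.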
Learning Theory bases of SVMs
• Theoretical bounds on testing error:
  – The upper bound doesn’t depend on the dimensionality of the space.
  – The bound is minimized by maximizing the margin, γ, associated with the decision boundary.
Why we like SVMs
• They work
  – Good generalization.
• Easily interpreted.
  – The decision boundary is based on the data, in the form of the support vectors.
  – Not so in multilayer perceptron networks.
• Principled bounds on testing error from Learning Theory (VC dimension).
SVM vs. MLP
• SVMs have many fewer parameters
  – SVM: maybe just a kernel parameter
  – MLP: number and arrangement of nodes, plus the learning rate η
• SVM: convex optimization task
  – MLP: the likelihood is non-convex, so there are local minima
Soft margin classification
• There can be outliers on the wrong side of the decision boundary, or points leading to a small margin.
• Solution: introduce a penalty term into the constraint function.
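The standard soft-margin primal this describes (reconstructed where the slide showed an equation):

```latex
% Soft-margin primal: slack variables \xi_i allow margin violations at cost C.
\min_{w, b, \xi} \; \frac{1}{2}\|w\|^{2} + C \sum_{i} \xi_{i}
\quad \text{subject to} \quad
y_{i}(w^{\top}x_{i} + b) \ge 1 - \xi_{i}, \;\; \xi_{i} \ge 0 .
```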
Soft Max Dual
Still Quadratic Programming!
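The soft-margin dual (reconstructed from the standard derivation; it differs from the hard-margin dual only in the box constraint on the multipliers):

```latex
\max_{\alpha} \; \sum_{i}\alpha_{i}
  - \frac{1}{2}\sum_{i}\sum_{j}\alpha_{i}\alpha_{j}\, y_{i}y_{j}\, x_{i}^{\top}x_{j}
\quad \text{subject to} \quad
0 \le \alpha_{i} \le C, \;\; \sum_{i}\alpha_{i}y_{i} = 0 .
```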
Soft margin example
• Points are allowed within the margin, but a cost is introduced.
• Hinge loss
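The hinge loss can be sketched as follows (the example scores are illustrative assumptions):

```python
import numpy as np

def hinge_loss(y, score):
    """Hinge loss max(0, 1 - y * f(x)): zero for points at or beyond
    their margin hyperplane, growing linearly with the violation."""
    return np.maximum(0.0, 1.0 - y * score)

print(hinge_loss(1.0, 2.0))    # 0.0  (correct, outside the margin)
print(hinge_loss(1.0, 0.5))    # 0.5  (correct side, but inside the margin)
print(hinge_loss(1.0, -1.0))   # 2.0  (misclassified)
```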
Probabilities from SVMs
• Support vector machines are discriminant functions.
  – Discriminant functions: f(x) = c
  – Discriminative models: f(x) = argmax_c p(c|x)
  – Generative models: f(x) = argmax_c p(x|c)p(c)/p(x)
• No (principled) probabilities from SVMs.
• SVMs are not based on probability distribution functions of class instances.
Efficiency of SVMs
• Not especially fast.
• Training: O(n³)
  – quadratic programming efficiency
• Evaluation: O(n)
  – need to evaluate against each support vector (potentially n)
Next Time
• Perceptrons