Support Vector Machines
Computational Linguistics: Jordan Boyd-Graber, University of Maryland, September 5, 2018
Slides adapted from Tom Mitchell, Eric Xing, and Lauren Hannah
Computational Linguistics: Jordan Boyd-Graber | UMD Support Vector Machines | 1 / 22
Roadmap
Classification: machines labeling data for us
Previously: naïve Bayes and logistic regression
This time: SVMs
(another) example of a linear classifier
Good classification accuracy
Good theoretical properties
Thinking Geometrically
Suppose you have two classes: vacations and sports
Suppose you have four documents
Sports
Doc1: {ball, ball, ball, travel}
Doc2: {ball, ball}
Vacations
Doc3: {travel, ball, travel}
Doc4: {travel}
What does this look like in vector space?
Put the documents in vector space
[Figure: the four documents plotted in vector space, with Ball counts on the x-axis (0–3) and Travel counts on the y-axis (0–3)]
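As a minimal sketch (plain Python, with names of my choosing), the four toy documents can be turned into count vectors whose components are the Ball and Travel axes:

```python
from collections import Counter

# Toy documents from the slides, as bags of words.
docs = {
    "Doc1": ["ball", "ball", "ball", "travel"],
    "Doc2": ["ball", "ball"],
    "Doc3": ["travel", "ball", "travel"],
    "Doc4": ["travel"],
}

# One component per term: (ball count, travel count).
vectors = {name: (Counter(words)["ball"], Counter(words)["travel"])
           for name, words in docs.items()}

print(vectors)  # Doc1 -> (3, 1), Doc2 -> (2, 0), Doc3 -> (1, 2), Doc4 -> (0, 1)
```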
Vector space representation of documents
Each document is a vector, one component for each term.
Terms are axes.
High dimensionality: 10,000s of dimensions and more
How can we do classification in this space?
Vector space classification
As before, the training set is a set of documents, each labeled with its class.
In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.
Premise 1: Documents in the same class form a contiguous region.
Premise 2: Documents from different classes don’t overlap.
We define lines, surfaces, hypersurfaces to divide regions.
Classes in the vector space

[Figure: documents from three classes (China, Kenya, UK) plotted in vector space, with an unlabeled document marked ?]

Should the document ? be assigned to China, UK or Kenya?
Find separators between the classes.
Based on these separators: ? should be assigned to China.
How do we find separators that do a good job at classifying new documents like ?? Main topic of today.
Linear Classifiers

Linear classifiers

Definition: A linear classifier computes a linear combination or weighted sum ∑i βi xi of the feature values.
Classification decision: ∑i βi xi > β0? (β0, the bias or threshold, is a parameter.)
We call this the separator or decision boundary.
We find the separator based on the training set.
Methods for finding the separator: logistic regression, naïve Bayes, linear SVM
Assumption: The classes are linearly separable.
Before, we just talked about equations. What's the geometric intuition?
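The decision rule above can be sketched in a few lines of Python; the weights here are hypothetical, chosen only to illustrate the ∑i βi xi > β0 test on the toy document vectors:

```python
def linear_classify(beta, x, beta0):
    """Return True if the weighted sum sum_i beta_i * x_i exceeds the threshold beta0."""
    score = sum(b * xi for b, xi in zip(beta, x))
    return score > beta0

# Hypothetical weights: "ball" votes for sports, "travel" votes against.
beta = [1.0, -1.0]                          # weights for (ball, travel)
print(linear_classify(beta, [3, 1], 0.0))   # Doc1 -> True (sports side)
print(linear_classify(beta, [0, 1], 0.0))   # Doc4 -> False (vacation side)
```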
Linear Classifiers

A linear classifier in 1D

A linear classifier in 1D is a point x described by the equation β1 x1 = β0
x = β0/β1
Points (x1) with β1 x1 ≥ β0 are in the class c.
Points (x1) with β1 x1 < β0 are in the complement class c̄.
Linear Classifiers

A linear classifier in 2D

A linear classifier in 2D is a line described by the equation β1 x1 + β2 x2 = β0
Example for a 2D linear classifier
Points (x1, x2) with β1 x1 + β2 x2 ≥ β0 are in the class c.
Points (x1, x2) with β1 x1 + β2 x2 < β0 are in the complement class c̄.
Linear Classifiers

A linear classifier in 3D

A linear classifier in 3D is a plane described by the equation β1 x1 + β2 x2 + β3 x3 = β0
Example for a 3D linear classifier
Points (x1, x2, x3) with β1 x1 + β2 x2 + β3 x3 ≥ β0 are in the class c.
Points (x1, x2, x3) with β1 x1 + β2 x2 + β3 x3 < β0 are in the complement class c̄.
Linear Classifiers

Naive Bayes and Logistic Regression as linear classifiers

Multinomial Naive Bayes is a linear classifier (in log space) defined by:

∑_{i=1}^{M} βi xi = β0

where βi = log[P̂(ti|c)/P̂(ti|c̄)], xi = number of occurrences of ti in d, and β0 = −log[P̂(c)/P̂(c̄)]. Here, the index i, 1 ≤ i ≤ M, refers to terms of the vocabulary. Logistic regression is the same (we only put the score into the logistic function to turn it into a probability).

Takeaway

Naïve Bayes, logistic regression and SVM are all linear methods. They choose their hyperplanes based on different objectives: joint likelihood (NB), conditional likelihood (LR), and the margin (SVM).
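A minimal sketch of the Naive Bayes weights defined above, using made-up smoothed probability estimates for two terms; a positive score puts the document in class c:

```python
import math

# Hypothetical smoothed estimates for two terms under classes c and c-bar.
p_t_c    = {"ball": 0.7, "travel": 0.3}   # P(t | c)
p_t_cbar = {"ball": 0.2, "travel": 0.8}   # P(t | c-bar)
p_c, p_cbar = 0.5, 0.5                    # uniform class prior

# beta_i = log[P(t_i|c) / P(t_i|c-bar)], beta_0 = -log[P(c) / P(c-bar)]
beta = {t: math.log(p_t_c[t] / p_t_cbar[t]) for t in p_t_c}
beta0 = -math.log(p_c / p_cbar)           # 0 here, since the prior is uniform

# Score a document d with counts {ball: 2, travel: 1}.
x = {"ball": 2, "travel": 1}
score = sum(beta[t] * x[t] for t in x)
print(score > beta0)                      # True: d lands on the c side
```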
Linear Classifiers
Which hyperplane?
Linear Classifiers
Which hyperplane?
For linearly separable training sets: there are infinitely many separating hyperplanes.
They all separate the training set perfectly . . .
. . . but they behave differently on test data.
Error rates on new data are low for some, high for others.
How do we find a low-error separator?
Support Vector Machines
Support vector machines
Machine-learning research in the last two decades has improved classifier effectiveness.
New generation of state-of-the-art classifiers: support vector machines(SVMs), boosted decision trees, regularized logistic regression, neuralnetworks, and random forests
Applications to IR problems, particularly text classification
SVMs: A kind of large-margin classifier
Vector-space-based machine-learning method aiming to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise)
Support Vector Machines

Support Vector Machines

[Figure: 2-class training data with the maximum-margin decision hyperplane, the support vectors, and the maximized margin]

2-class training data
decision boundary → linear separator
criterion: being maximally far away from any data point → determines classifier margin
linear separator position defined by support vectors
Support Vector Machines

Why maximize the margin?

[Figure: support vectors, the maximized margin, and the maximum-margin decision hyperplane]

Points near the decision surface → uncertain classification decisions (50% either way).
A classifier with a large margin makes no low-certainty classification decisions.
Gives a classification safety margin w.r.t. slight errors in measurement or document variation.
Support Vector Machines
Why maximize the margin?
SVM classifier: large margin around the decision boundary
Compared to other decision hyperplanes: placing a fat separator between the classes yields a unique solution
Decreased memory capacity
Increased ability to correctly generalize to test data
Formulation

Equation

Equation of a hyperplane

~w · xi + b = 0 (1)

Distance of a point to the hyperplane

|~w · xi + b| / ||~w|| (2)

The margin ρ is given by

ρ ≡ min(x,y)∈S |~w · xi + b| / ||~w|| = 1/||~w|| (3)

This is because for any point on the marginal hyperplane, ~w · x + b = ±1.
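Equation (2) can be checked with a small helper; the hyperplane and point below are arbitrary examples of my choosing:

```python
import math

def distance_to_hyperplane(w, b, x):
    """|w.x + b| / ||w||: Euclidean distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

# Hyperplane x1 + x2 - 1 = 0, i.e. w = (1, 1), b = -1.
print(distance_to_hyperplane([1.0, 1.0], -1.0, [1.0, 1.0]))  # 1/sqrt(2)
print(distance_to_hyperplane([1.0, 1.0], -1.0, [0.5, 0.5]))  # 0.0 (on the plane)
```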
Formulation
Optimization Problem
We want to find a weight vector ~w and bias b that optimize

min~w,b (1/2) ||~w||² (4)

subject to yi(~w · xi + b) ≥ 1, ∀i ∈ [1, m].
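A quick numeric check of the constraints on a toy 1D training set; the w and b below are a hand-picked candidate for this data, not the output of a solver:

```python
# Toy 1D training set: pairs (x_i, y_i) with labels y_i in {-1, +1}.
data = [(-2.0, -1), (-1.0, -1), (1.0, +1), (3.0, +1)]

w, b = 1.0, 0.0  # hand-picked candidate separator for this data

# Every point must satisfy the constraint y_i (w * x_i + b) >= 1.
assert all(y * (w * x + b) >= 1 for x, y in data)

# With the constraint tight at the closest points, the margin is 1/||w||.
margin = 1.0 / abs(w)
print(margin)  # 1.0
```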
Formulation

Choosing what kind of classifier to use

When building a text classifier, first question: how much training data is there currently available?

None? Hand-write rules or use active learning
Very little? Naïve Bayes
A fair amount? SVM
A huge amount? Doesn't matter, use whatever works
Formulation
SVM extensions: What’s next
Finding solutions
Slack variables: not a perfect line
Kernels: different geometries
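As a taste of the kernel extension mentioned above, here is a sketch of the RBF (Gaussian) kernel, which scores similarity as if the points lived in a richer implicit feature space; gamma is a free parameter chosen here for illustration:

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2): similarity in an implicit feature space."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))   # identical points -> 1.0
print(rbf_kernel([0.0, 0.0], [10.0, 0.0]))  # distant points -> near 0
```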