Support Vector Machines Alessandro Moschitti
MACHINE LEARNING
Alessandro Moschitti
Department of information and communication technology University of Trento
Email: [email protected]
Support Vector Machines
Summary
  Support Vector Machines
  Hard-margin SVMs
  Soft-margin SVMs
Communications
  No lecture tomorrow (nor on Dec. 8)
  ML Exams: 12 January 2011 at 9:00; 26 January 2011 at 9:00
  Exercise in Lab A201 (Polo scientifico e tecnologico): Wednesday 15 and 22 December, 2011, 8:30-10:30
Which hyperplane should we choose?
Classifier with a Maximum Margin
(Figure: a separating hyperplane in the Var1-Var2 plane with the margin marked on both sides.)
IDEA 1: Select the hyperplane with maximum margin
Support Vector
(Figure: Var1-Var2 plane; the support vectors are the points lying on the margin boundaries.)
Support Vector Machine Classifiers
(Figure: Var1-Var2 plane with the hyperplanes $\vec{w}\cdot\vec{x}+b=-k$, $\vec{w}\cdot\vec{x}+b=k$ and $\vec{w}\cdot\vec{x}+b=0$.)

The margin is equal to $\frac{2k}{\|\vec{w}\|}$
Support Vector Machines
(Figure: as above, with the hyperplanes $\vec{w}\cdot\vec{x}+b=\pm k$ and $\vec{w}\cdot\vec{x}+b=0$.)

The margin is equal to $\frac{2k}{\|\vec{w}\|}$, so we need to solve:

$$\max \frac{2k}{\|\vec{w}\|} \quad \text{subject to} \quad \begin{cases} \vec{w}\cdot\vec{x}+b \ge +k, & \text{if } \vec{x} \text{ is positive} \\ \vec{w}\cdot\vec{x}+b \le -k, & \text{if } \vec{x} \text{ is negative} \end{cases}$$
Support Vector Machines
(Figure: as above, with the hyperplanes $\vec{w}\cdot\vec{x}+b=-1$, $\vec{w}\cdot\vec{x}+b=1$ and $\vec{w}\cdot\vec{x}+b=0$.)

There is a rescaling of $\vec{w}$ and $b$ for which $k = 1$. The problem transforms into:

$$\max \frac{2}{\|\vec{w}\|} \quad \text{subject to} \quad \begin{cases} \vec{w}\cdot\vec{x}+b \ge +1, & \text{if } \vec{x} \text{ is positive} \\ \vec{w}\cdot\vec{x}+b \le -1, & \text{if } \vec{x} \text{ is negative} \end{cases}$$
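A small numerical illustration (not from the slides; the values of $\vec{w}$, $b$ and $\vec{x}$ are made up):

```python
# Computing the margin and a point's signed distance (w, b are made up).
import numpy as np

w = np.array([1.0, 1.0])   # hypothetical gradient
b = -1.0                   # hypothetical offset

# Under the canonical scaling (k = 1) the margin is 2 / ||w||.
print("margin =", 2.0 / np.linalg.norm(w))

# Signed distance of a point from the hyperplane w.x + b = 0.
x = np.array([2.0, 0.5])
print("distance =", (w @ x + b) / np.linalg.norm(w))
```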
Final Formulation
$$\max \frac{2}{\|\vec{w}\|} \;\; \text{s.t.} \;\; \begin{cases} \vec{w}\cdot\vec{x}_i + b \ge +1, & y_i = +1 \\ \vec{w}\cdot\vec{x}_i + b \le -1, & y_i = -1 \end{cases} \;\Rightarrow\; \max \frac{2}{\|\vec{w}\|} \;\; \text{s.t.} \;\; y_i(\vec{w}\cdot\vec{x}_i + b) \ge 1$$

$$\Rightarrow\; \min \frac{\|\vec{w}\|}{2} \;\; \text{s.t.} \;\; y_i(\vec{w}\cdot\vec{x}_i + b) \ge 1 \;\Rightarrow\; \min \frac{\|\vec{w}\|^2}{2} \;\; \text{s.t.} \;\; y_i(\vec{w}\cdot\vec{x}_i + b) \ge 1$$
Optimization Problem
  Optimal Hyperplane:
  Minimize $\tau(\vec{w}) = \frac{1}{2}\|\vec{w}\|^2$
  Subject to $y_i(\vec{w}\cdot\vec{x}_i + b) \ge 1, \;\; i = 1, \dots, m$
  The dual problem is simpler
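A minimal sketch of solving this problem in practice, assuming scikit-learn is available: the hard margin is approximated with a very large C, since the library only implements the soft-margin formulation; the toy data is hypothetical.

```python
# Approximating the hard-margin SVM with a very large C
# (scikit-learn only implements the soft-margin formulation).
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e10)   # C -> infinity ~ hard margin
clf.fit(X, y)

print("w  =", clf.coef_[0])          # gradient of the hyperplane
print("b  =", clf.intercept_[0])     # offset
print("SVs:\n", clf.support_vectors_)
```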
Lagrangian Definition
The Lagrangian of the primal problem, with multipliers $\alpha_i \ge 0$, is:

$$L(\vec{w}, b, \vec{\alpha}) = \frac{1}{2}\|\vec{w}\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i(\vec{w}\cdot\vec{x}_i + b) - 1 \right]$$
Dual Optimization Problem
Dual Transformation
  To solve the dual problem we need to evaluate: $\max_{\vec{\alpha}} \min_{\vec{w},b} L(\vec{w}, b, \vec{\alpha})$
  Given the Lagrangian associated with our problem,
  let us set the derivative with respect to $\vec{w}$ to 0:

$$\frac{\partial L}{\partial \vec{w}} = \vec{w} - \sum_{i=1}^{m} \alpha_i y_i \vec{x}_i = 0 \;\Rightarrow\; \vec{w} = \sum_{i=1}^{m} \alpha_i y_i \vec{x}_i$$
Dual Transformation (cont’d)
  and with respect to b:

$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{m} \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0$$

  Then we substitute them into the Lagrangian
Final Dual Problem

$$\max_{\vec{\alpha}} \;\sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \,(\vec{x}_i \cdot \vec{x}_j) \quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{m} \alpha_i y_i = 0$$
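A sketch of solving this dual numerically, assuming the cvxopt QP solver is available (the toy data is hypothetical; dedicated SVM software uses specialized solvers instead):

```python
# Solving the hard-margin dual with the cvxopt QP solver.  cvxopt minimizes
#   1/2 a'P a + q'a  s.t.  Ga <= h, Aa = b,
# so we set P_ij = y_i y_j (x_i . x_j) and q = -1 to maximize the dual.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
m = len(y)

P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(m))
G = matrix(-np.eye(m))          # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(m))
A = matrix(y.reshape(1, -1))    # sum_i alpha_i y_i = 0
b = matrix(0.0)

alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
print("alpha =", alpha)
```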
Kuhn-Tucker Theorem
  Necessary and sufficient conditions for optimality
Properties coming from constraints
  Lagrange constraints:

$$\sum_{i=1}^{m} \alpha_i y_i = 0 \qquad \vec{w} = \sum_{i=1}^{m} \alpha_i y_i \vec{x}_i$$

  Karush-Kuhn-Tucker constraints:

$$\alpha_i \left[ y_i (\vec{x}_i \cdot \vec{w} + b) - 1 \right] = 0, \;\; i = 1, \dots, m$$

  Support vectors have non-null $\alpha_i$
  To evaluate b, we can apply the KKT condition to any support vector $\vec{x}_i$: $b = y_i - \vec{w}\cdot\vec{x}_i$ (see the sketch below)
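A small sketch (continuing the cvxopt example above) of recovering the primal solution from the dual one through these properties; `tol` is a hypothetical numerical threshold:

```python
# Recovering w and b from the dual solution via the KKT conditions.
import numpy as np

def primal_from_dual(alpha, X, y, tol=1e-6):
    w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
    sv = alpha > tol                         # support vectors: non-null alpha_i
    b = float(np.mean(y[sv] - X[sv] @ w))    # b = y_i - w . x_i, averaged over SVs
    return w, b

# usage: w, b = primal_from_dual(alpha, X, y)
```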
Warning!
  In the graphical examples we always consider normalized hyperplanes (hyperplanes whose gradient $\vec{w}$ has norm 1)
  In this case b is exactly the distance of the hyperplane from the origin
  So if we have a non-normalized equation, we may have

$$\vec{x} \cdot \vec{w}' + b = 0 \quad \text{with} \quad \vec{x} = (x, y) \;\text{ and }\; \vec{w}' = (1, 1)$$

  and b is not the distance
Warning!
  Let us consider a normalized gradient $\vec{w} = (1/\sqrt{2}, 1/\sqrt{2})$:

$$(x, y) \cdot (1/\sqrt{2}, 1/\sqrt{2}) + b = 0 \;\Rightarrow\; \frac{x}{\sqrt{2}} + \frac{y}{\sqrt{2}} = -b \;\Rightarrow\; y = -x - b\sqrt{2}$$

  Now we see that -b is exactly the distance.
  For x = 0, the intersection with the y-axis is at $-b\sqrt{2}$; this distance projected onto $\vec{w}$ is -b.
Soft Margin SVMs
(Figure: Var1-Var2 plane with the hyperplanes $\vec{w}\cdot\vec{x}+b=\pm 1$ and $\vec{w}\cdot\vec{x}+b=0$; some points fall inside the margin or on the wrong side.)

  Slack variables $\xi_i$ are added
  Some errors are allowed, but they must penalize the objective function
Soft Margin SVMs
(Figure: as above, with the slacks $\xi_i$ measuring the margin violations.)

  The new constraints are:

$$y_i(\vec{w}\cdot\vec{x}_i + b) \ge 1 - \xi_i \quad \text{and} \quad \xi_i \ge 0 \quad \forall \vec{x}_i$$

  The objective function penalizes the incorrectly classified examples:

$$\min \frac{1}{2}\|\vec{w}\|^2 + C \sum_i \xi_i$$

  C is the trade-off between the margin and the error (see the sketch below)
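A minimal sketch with scikit-learn (hypothetical toy data): after fitting, the slacks can be read off as $\xi_i = \max(0, 1 - y_i(\vec{w}\cdot\vec{x}_i + b))$:

```python
# Fitting a soft-margin linear SVM and reading off the slacks
# xi_i = max(0, 1 - y_i (w . x_i + b)); the toy data is hypothetical.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 1, rng.randn(20, 2) + 1])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
print("total slack   :", slack.sum())
print("margin errors :", int((slack > 0).sum()))
```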
Dual formulation
  By deriving with respect to $\vec{w}$, $\vec{\xi}$ and $b$
Partial Derivatives

$$\frac{\partial L}{\partial \vec{w}} = \vec{w} - \sum_{i=1}^{m} \alpha_i y_i \vec{x}_i = 0 \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{m} \alpha_i y_i = 0 \qquad \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0$$

(where the $\beta_i$ are the multipliers of the constraints $\xi_i \ge 0$)
Substitution in the objective function
  Use of the Kronecker delta $\delta_{ij}$
Final dual optimization problem

$$\max_{\vec{\alpha}} \;\sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \,(\vec{x}_i \cdot \vec{x}_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{m} \alpha_i y_i = 0$$
Soft Margin Support Vector Machines
  The algorithm tries to keep the $\xi_i$ low while maximizing the margin
  NB: the number of errors is not directly minimized (that would be an NP-complete problem); what is minimized is the sum of the distances from the hyperplane:

$$\min \frac{1}{2}\|\vec{w}\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i(\vec{w}\cdot\vec{x}_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \;\; \forall \vec{x}_i$$

  If C→∞, the solution tends to the one of the hard-margin algorithm
  Attention: if C = 0 we get $\|\vec{w}\| = 0$, since the constraints reduce to $y_i b \ge 1 - \xi_i \;\forall \vec{x}_i$
  As C increases, the number of errors decreases; when C tends to infinity, the number of errors must be 0, i.e., we recover the hard-margin formulation (see the sketch below)
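A small sketch of this behavior on hypothetical toy data, sweeping C:

```python
# As C grows, training errors decrease and the solution approaches the
# hard-margin one (hypothetical toy data).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 1, rng.randn(20, 2) + 1])
y = np.array([-1] * 20 + [1] * 20)

for C in (0.01, 0.1, 1.0, 10.0, 100.0, 1e4):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    errors = int((clf.predict(X) != y).sum())
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>8}: training errors={errors}, margin={margin:.3f}")
```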
Robustness of Soft vs. Hard Margin SVMs
(Figure: side-by-side comparison in the Var1-Var2 plane of a Soft Margin SVM, which handles an outlier through a slack $\xi_i$, and a Hard Margin SVM, both with hyperplane $\vec{w}\cdot\vec{x}+b=0$.)
Soft vs Hard Margin SVMs
  Soft-Margin always has a solution
  Soft-Margin is more robust to outliers
  Hard-Margin does not require parameters
Parameters
  C: trade-off parameter
  J: cost factor ($C^{+} = J \cdot C^{-}$), weighing errors on positive examples differently from errors on negative ones:

$$\min \frac{1}{2}\|\vec{w}\|^2 + C \sum_i \xi_i = \min \frac{1}{2}\|\vec{w}\|^2 + C^{+} \sum_{i \in +} \xi_i + C^{-} \sum_{i \in -} \xi_i = \min \frac{1}{2}\|\vec{w}\|^2 + C \Big( J \sum_{i \in +} \xi_i + \sum_{i \in -} \xi_i \Big)$$
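A sketch of emulating the cost factor in scikit-learn through per-class weights, which multiply C class-wise (the value of J is hypothetical); SVM-light exposes the same idea as its -j option:

```python
# Emulating the cost factor J with per-class weights:
# class_weight multiplies C class-wise, so C+ = J * C-.
from sklearn.svm import SVC

J = 4.0  # hypothetical: errors on positives cost 4x more than on negatives
clf = SVC(kernel="linear", C=1.0, class_weight={1: J, -1: 1.0})
# usage: clf.fit(X, y) with labels in {-1, +1}
```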
Theoretical Justification
Definition of Training Set error
  Training data: $(\vec{x}_1, y_1), \dots, (\vec{x}_m, y_m) \in \mathbb{R}^N \times \{\pm 1\}$ and a classifier $f: \mathbb{R}^N \to \{\pm 1\}$
  Empirical risk (error):

$$R_{emp}[f] = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left| f(\vec{x}_i) - y_i \right|$$

  Risk (error):

$$R[f] = \int \frac{1}{2} \left| f(\vec{x}) - y \right| \, dP(\vec{x}, y)$$
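A one-function sketch of the empirical risk (names are hypothetical):

```python
# Empirical risk for labels in {-1, +1}: each term |f(x_i) - y_i| / 2
# is 0 or 1, so R_emp is simply the fraction of training errors.
import numpy as np

def empirical_risk(f, X, y):
    predictions = np.array([f(x) for x in X])   # f returns -1 or +1
    return float(np.mean(np.abs(predictions - y) / 2.0))

# usage: empirical_risk(lambda x: int(np.sign(w @ x + b)), X, y)
```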
Error Characterization (part 1)
  From PAC-learning Theory (Vapnik):
$$R(\alpha) \le R_{emp}(\alpha) + \varphi\Big(\frac{d}{m}, \frac{\log \delta}{m}\Big), \qquad \varphi\Big(\frac{d}{m}, \frac{\log \delta}{m}\Big) = \sqrt{\frac{d\,\big(\log\frac{2m}{d} + 1\big) - \log\frac{\delta}{4}}{m}}$$

where d is the VC-dimension, m is the number of examples, δ bounds the probability that the bound fails, and α denotes the parameters of the classifier.
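A small sketch evaluating the confidence term $\varphi$; the values of d, m and δ below are hypothetical:

```python
# Evaluating the VC confidence term phi for given d, m, delta.
import numpy as np

def vc_confidence(d, m, delta):
    """phi = sqrt((d * (log(2m/d) + 1) - log(delta / 4)) / m)."""
    return np.sqrt((d * (np.log(2.0 * m / d) + 1.0) - np.log(delta / 4.0)) / m)

print(vc_confidence(d=10, m=1000, delta=0.05))
```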
There are many versions of this bound, for different settings.
Error Characterization (part 2)
Ranking, Regression and Multiclassification
The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002]
  The aim is to classify instance pairs as correctly ranked or incorrectly ranked
  This turns an ordinal regression problem back into a binary classification problem
  We want a ranking function f such that
xi > xj iff f(xi) > f(xj)
  … or at least one that tries to do this with minimal error
  Suppose that f is a linear function: $f(\vec{x}_i) = \vec{w} \cdot \vec{x}_i$
The Ranking SVM
  Ranking Model: $f(\vec{x}_i)$
The Ranking SVM
  Then (combining the two equations on the last slide):

$$\vec{x}_i > \vec{x}_j \;\text{ iff }\; \vec{w}\cdot\vec{x}_i - \vec{w}\cdot\vec{x}_j > 0 \;\text{ iff }\; \vec{w}\cdot(\vec{x}_i - \vec{x}_j) > 0$$

  Let us then create a new instance space from such pairs: $\vec{z}_k = \vec{x}_i - \vec{x}_k$, with label $y_k = +1$ if $\vec{x}_i \ge \vec{x}_k$ and $y_k = -1$ otherwise (see the sketch below)
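A sketch of this pairwise transformation (function and variable names are hypothetical):

```python
# Building the pairwise training set z_k = x_i - x_k for the Ranking SVM.
import numpy as np

def pairwise_transform(X, ranks):
    """z_k = x_i - x_k, labeled +1 if x_i outranks x_k, -1 otherwise."""
    Z, y = [], []
    for i in range(len(X)):
        for k in range(len(X)):
            if i == k or ranks[i] == ranks[k]:
                continue                      # ties give no pair
            Z.append(X[i] - X[k])
            y.append(+1 if ranks[i] > ranks[k] else -1)
    return np.array(Z), np.array(y)

# A standard binary SVM trained on (Z, y) then gives the ranking model w.
```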
Support Vector Ranking
  Given two examples $\vec{x}_i$ and $\vec{x}_j$, we build one training example from the pair $(\vec{x}_i, \vec{x}_j)$
Support Vector Regression (SVR)
(Figure: the ε-insensitive tube of half-width ε around the regression function f(x); the constraints keep the points within the tube.)
Support Vector Regression (SVR)
(Figure: as above; points outside the ±ε tube receive slacks ξ and ξ*.)

  Minimise:

$$\frac{1}{2}\vec{w}^T\vec{w} + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)$$

  Constraints (the standard ε-insensitive ones): $y_i - \vec{w}\cdot\vec{x}_i - b \le \varepsilon + \xi_i$, $\;\vec{w}\cdot\vec{x}_i + b - y_i \le \varepsilon + \xi_i^*$, $\;\xi_i, \xi_i^* \ge 0$
Support Vector Regression
  $y_i$ is not -1 or 1 anymore; now it is a real value
  ε is the tolerance on our function value
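A minimal ε-SVR sketch with scikit-learn (the toy data is hypothetical):

```python
# epsilon-SVR: epsilon is the tube half-width (the tolerance),
# C the slack penalty, as in the formulation above.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.randn(40)      # targets are real values

reg = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(reg.predict([[2.5]]))
```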
From Binary to Multiclass classifiers
  Three different approaches:
  ONE-vs-ALL (OVA)
  Given the example sets {E1, E2, E3, …} for the categories {C1, C2, C3, …}, the binary classifiers {b1, b2, b3, …} are built
  For b1, E1 is the set of positives and E2∪E3∪… is the set of negatives, and so on
  For testing: given a classification instance x, the category is the one associated with the maximum margin among all binary classifiers (see the sketch below)
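A hand-rolled OVA sketch with scikit-learn binary classifiers (function names are hypothetical):

```python
# ONE-vs-ALL: one binary classifier per category, max margin wins.
import numpy as np
from sklearn.svm import LinearSVC

def train_ova(X, y, classes):
    # Each classifier: C_i as positives vs. the union of the others.
    return {c: LinearSVC().fit(X, np.where(y == c, 1, -1)) for c in classes}

def predict_ova(models, x):
    # The category with the maximum margin (decision value) wins.
    scores = {c: m.decision_function([x])[0] for c, m in models.items()}
    return max(scores, key=scores.get)

# usage: models = train_ova(X, y, np.unique(y)); predict_ova(models, X[0])
```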
From Binary to Multiclass classifiers
  ALL-vs-ALL (AVA)
  Given the examples {E1, E2, E3, …} for the categories {C1, C2, C3, …},
  build the binary classifiers {b1_2, b1_3, …, b1_n, b2_3, b2_4, …, b2_n, …, bn-1_n}
  by learning on E1 (positives) and E2 (negatives), on E1 (positives) and E3 (negatives), and so on…
  For testing: given an example x, the votes of all classifiers are collected,
  where bE1E2 = 1 means a vote for C1 and bE1E2 = -1 means a vote for C2
  Select the category that gets the most votes (see the sketch below)
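A hand-rolled AVA sketch (function names are hypothetical); note that scikit-learn's SVC already applies one-vs-one voting internally:

```python
# ALL-vs-ALL: one binary SVM per category pair, majority vote at test time.
import numpy as np
from itertools import combinations
from collections import Counter
from sklearn.svm import LinearSVC

def train_ava(X, y, classes):
    models = {}
    for c1, c2 in combinations(classes, 2):
        mask = (y == c1) | (y == c2)           # only E_c1 and E_c2
        models[(c1, c2)] = LinearSVC().fit(
            X[mask], np.where(y[mask] == c1, 1, -1))
    return models

def predict_ava(models, x):
    votes = Counter()
    for (c1, c2), m in models.items():
        votes[c1 if m.predict([x])[0] == 1 else c2] += 1
    return votes.most_common(1)[0][0]          # category with the most votes
```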
From Binary to Multiclass classifiers
  Error Correcting Output Codes (ECOC)
  The training set is partitioned according to binary sequences (codes) associated with category sets
  For example, 10101 indicates that the examples of C1, C3 and C5 are used to train the C10101 classifier
  The data of the other categories, i.e. C2 and C4, will be negative examples
  In testing: the code-classifiers are used to decode the original class, e.g. C10101 = 1 and C11010 = 1 indicates that the instance belongs to C1, the only class consistent with both codes (see the sketch below)
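ECOC is also available off the shelf; a sketch with scikit-learn's OutputCodeClassifier, where the code_size value is an arbitrary choice:

```python
# ECOC via scikit-learn's built-in wrapper; code_size controls how many
# code bits are used per class.
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import LinearSVC

ecoc = OutputCodeClassifier(LinearSVC(), code_size=1.5, random_state=0)
# usage: ecoc.fit(X_train, y_train); ecoc.predict(X_test)
```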
SVM-light: an implementation of SVMs
  Implements the soft-margin SVM
  Contains the procedures for solving the optimization problem
  Binary classifier
  Examples and documentation on the web site:
http://www.joachims.org/
(http://svmlight.joachims.org/)
References
  A Tutorial on Support Vector Machines for Pattern Recognition, C. Burges (downloadable article)
  The Vapnik-Chervonenkis Dimension and the Learning Capability of Neural Nets (downloadable presentation)
  Computational Learning Theory, S. A. Goldman, Washington University, St. Louis, Missouri (downloadable article)
  An Introduction to Support Vector Machines (and other kernel-based learning methods), N. Cristianini and J. Shawe-Taylor, Cambridge University Press (check our library)
  The Nature of Statistical Learning Theory, V. N. Vapnik, Springer Verlag, December 1999 (check our library)
Exercise
  1. The equations of SVMs for Classification, Ranking and Regression (you can get them from my slides).
  2. The perceptron algorithm for Classification, Ranking and Regression (derive the last two from what you wrote in point 1).
  3. The same as point (2), using kernels (open this section with the kernel definition).