Support Vector Machines
Transcript of Support Vector Machines
Lecturer: Yishay Mansour
Itay Kirshenbaum
Lecture Overview
In this lecture we present, in detail, one of the most theoretically well-motivated and practically effective classification algorithms in modern machine learning: Support Vector Machines (SVMs).
Lecture Overview – Cont.
We begin by building the intuition behind SVMs, continue by defining the SVM as an optimization problem, and discuss how to solve it efficiently. We conclude with an analysis of the error rate of SVMs using two techniques: Leave-One-Out and VC-dimension.
Introduction
The Support Vector Machine is a supervised learning algorithm used to learn a hyperplane that solves the binary classification problem, one of the most extensively studied problems in machine learning.
Binary Classification Problem
Input space: $X \subseteq \mathbb{R}^n$
Output space: $Y = \{-1, +1\}$
Training data: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, drawn i.i.d. from a distribution $D$
Goal: select a hypothesis $h \in H$ that best predicts other points drawn i.i.d. from $D$
Binary Classification – Cont.
Consider the problem of predicting the success of a new drug based on a patient's height and weight. $m$ ill people are selected and treated, generating $m$ two-dimensional vectors (height and weight). Each point is assigned $+1$ to indicate successful treatment, or $-1$ otherwise. This can be used as training data.
Binary Classification – Cont.
There are infinitely many ways to classify the data. By Occam's razor, simple classification rules provide better results, so we use a linear classifier, or hyperplane. Our class of linear classifiers is
$$H = \{\,x \mapsto \mathrm{sign}(w \cdot x + b) \mid w \in \mathbb{R}^n,\ b \in \mathbb{R}\,\},$$
where $h \in H$ maps $x \in X$ to $1$ if $w \cdot x + b \ge 0$.
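As a quick sketch of this hypothesis class (the weights and points below are illustrative, not from the lecture), a classifier just thresholds $w \cdot x + b$:

```python
# A linear classifier from the class H: x -> sign(w . x + b).

def classify(w, b, x):
    """Return +1 if w . x + b >= 0, else -1."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation >= 0 else -1

w, b = [1.0, -1.0], 0.5
print(classify(w, b, [2.0, 0.0]))   # -> 1 (positive side of the hyperplane)
print(classify(w, b, [0.0, 3.0]))   # -> -1 (negative side)
```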
Choosing a Good Hyperplane – Intuition
Consider two cases of positive classification:
$$w \cdot x + b = 0.1 \qquad w \cdot x + b = 100$$
We are more confident in the decision made by the latter than in the former, so we choose a hyperplane with maximal margin.
Good Hyperplane – Cont.
Definition: the functional margin of a linear classifier $(w, b)$ with respect to $S$ is
$$\hat\gamma_S = \min_{i \in \{1,\ldots,m\}} \hat\gamma_i, \quad \text{with } \hat\gamma_i = y_i (w \cdot x_i + b),$$
where $\hat\gamma_i$ measures the classification of $x_i$ according to $(w, b)$.
Maximal Margin
$(w, b)$ can be scaled to increase the functional margin: $\mathrm{sign}(w \cdot x + b) = \mathrm{sign}(5w \cdot x + 5b)$ for all $x$, yet the functional margin of $(5w, 5b)$ is 5 times greater than that of $(w, b)$. We cope by adding an additional constraint: $\|w\| = 1$.
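The scaling issue is easy to check numerically; a minimal sketch with arbitrary toy values:

```python
# sign(w . x + b) is invariant to scaling (w, b) by c > 0, but the
# functional margin y * (w . x + b) grows by the same factor c.

def sign(v):
    return 1 if v >= 0 else -1

w, b, x, y = [2.0, 1.0], -1.0, [1.5, 0.0], 1
results = []
for c in (1, 5):
    activation = c * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    results.append((sign(activation), y * activation))
print(results)  # -> [(1, 2.0), (1, 10.0)]: same sign, margin scaled by 5
```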
Maximal Margin – Cont.
Geometric margin: consider the geometric distance between the hyperplane and the closest points.
Geometric Margin
Definition: the geometric margin of $(w, b)$ with respect to $S$ is
$$\gamma_S = \min_{i \in \{1,\ldots,m\}} \gamma_i, \quad \text{with } \gamma_i = y_i\left(\frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|}\right).$$
Relation to the functional margin: $\gamma_i = \hat\gamma_i / \|w\|$; both are equal when $\|w\| = 1$.
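A small sketch computing both margins over a hypothetical training set (all values illustrative):

```python
import math

# Functional margin: gamma_hat_i = y_i (w . x_i + b), minimized over S.
# Geometric margin: gamma_i = gamma_hat_i / ||w||, minimized over S.

def margins(w, b, S):
    norm = math.sqrt(sum(wi * wi for wi in w))
    hats = [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) for x, y in S]
    return min(hats), min(h / norm for h in hats)

S = [([2.0, 0.0], 1), ([0.0, 2.0], 1), ([-2.0, -2.0], -1)]
func, geom = margins([3.0, 4.0], 0.0, S)   # ||w|| = 5
print(func, geom)                          # -> 6.0 1.2
```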
The Algorithm
We saw two definitions of the margin and the intuition behind seeking a margin-maximizing hyperplane. Goal: write an optimization program that finds such a hyperplane. We always look for the $(w, b)$ maximizing the margin.
The Algorithm – Take 1
First try:
$$\max \gamma \quad \text{s.t. } y_i(w \cdot x_i + b) \ge \gamma,\ i = 1,\ldots,m, \quad \|w\| = 1$$
Idea: maximize $\gamma$. For each sample the functional margin is at least $\gamma$, and the functional and geometric margins are the same since $\|w\| = 1$, so $\gamma$ is the largest possible geometric margin with respect to the training set.
The Algorithm – Take 2
The first try can't be solved by any off-the-shelf optimization software: the constraint $\|w\| = 1$ is non-linear, and in fact even non-convex. How can we discard the constraint? Use the geometric margin!
$$\max \frac{\hat\gamma}{\|w\|} \quad \text{s.t. } y_i(w \cdot x_i + b) \ge \hat\gamma,\ i = 1,\ldots,m$$
The Algorithm – Take 3
We now have a non-convex objective function – the problem remains. Remember that we can scale $(w, b)$ as we wish, so force the functional margin to be 1. The objective function becomes
$$\max \frac{1}{\|w\|},$$
which is the same as
$$\min \frac{1}{2}\|w\|^2.$$
The factor of $\frac{1}{2}$ and the power of 2 do not change the program – they make things easier.
The Algorithm – Final Version
The final program:
$$\min \frac{1}{2}\|w\|^2 \quad \text{s.t. } y_i(w \cdot x_i + b) \ge 1,\ i = 1,\ldots,m$$
The objective is convex (quadratic) and all constraints are linear, so we can solve it efficiently using standard quadratic programming (QP) software.
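This is not QP software, just a coarse grid search on a hypothetical 1-d data set, but it makes the program concrete:

```python
# Minimize (1/2) w^2 subject to y_i (w * x_i + b) >= 1 over a grid.
# For x = -1 (y = -1) and x = +1 (y = +1) the optimum is w = 1, b = 0.

S = [(-1.0, -1), (1.0, 1)]
grid = [i / 10 for i in range(-30, 31)]
best = None
for w in grid:
    for b in grid:
        feasible = all(y * (w * x + b) >= 1 for x, y in S)
        if feasible and (best is None or 0.5 * w * w < 0.5 * best[0] ** 2):
            best = (w, b)
print(best)  # -> (1.0, 0.0), giving maximal geometric margin 1/|w| = 1
```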
Convex Optimization
We want to solve the optimization problem more efficiently than with generic QP. Solution – use convex optimization techniques.
Convex Optimization – Cont.
Definition: a function $f$ is convex if for all $x, y \in X$ and $\lambda \in [0, 1]$:
$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$$
Theorem: let $f$ be a differentiable convex function. Then for all $x, y \in X$:
$$f(y) \ge f(x) + \nabla f(x) \cdot (y - x)$$
Convex Optimization Problem
Let $f, g_i : X \to \mathbb{R}$, $i = 1,\ldots,m$, be convex functions. We look for a value $x \in X$ that minimizes $f(x)$ under the constraints $g_i(x) \le 0$, $i = 1,\ldots,m$:
$$\min_{x \in X} f(x) \quad \text{s.t. } g_i(x) \le 0,\ i = 1,\ldots,m$$
Lagrange Multipliers
Used to find a maximum or a minimum of a function subject to constraints; we use them to solve our optimization problem.
Definition: the Lagrangian of a function $f$ subject to the constraints $g_i$, $i = 1,\ldots,m$, is
$$L(x, \alpha) = f(x) + \sum_{i=1}^m \alpha_i g_i(x), \qquad x \in X,\ \alpha_i \ge 0,$$
where the $\alpha_i$ are called the Lagrange multipliers.
Primal Program
Plan: use the Lagrangian to write a program called the Primal Program, which is equal to $f(x)$ if all the constraints are met, and $\infty$ otherwise.
Definition – Primal Program:
$$\theta_P(x) = \max_{\alpha \ge 0} L(x, \alpha)$$
Primal Program – Cont.
The constraints are of the form $g_i(x) \le 0$. If they are met, $\sum_{i=1}^m \alpha_i g_i(x)$ is maximized when all the $\alpha_i$ are 0, so the summation is 0 and $\theta_P(x) = f(x)$. Otherwise, $\sum_{i=1}^m \alpha_i g_i(x)$ is maximized for $\alpha_i \to \infty$, so $\theta_P(x) = \infty$.
Primal Program – Cont.
Our convex optimization problem is now:
$$\min_{x \in X} \theta_P(x) = \min_{x \in X} \max_{\alpha \ge 0} L(x, \alpha)$$
Define $p^* = \min_{x \in X} \theta_P(x)$ as the value of the primal program.
Dual Program
We define the Dual Program as:
$$\theta_D(\alpha) = \min_{x \in X} L(x, \alpha)$$
We'll look at
$$\max_{\alpha \ge 0} \theta_D(\alpha) = \max_{\alpha \ge 0} \min_{x \in X} L(x, \alpha),$$
which is the same as our primal program except that the order of min / max is different. Define the value of our Dual Program as
$$d^* = \max_{\alpha \ge 0} \min_{x \in X} L(x, \alpha)$$
Dual Program – Cont. We want to show
If we find a solution to one problem, we find the solution to the second problem
Start with “max min” is always less then “min
max”
Now on to
** pd
** pd
*00
* ),(maxmin),(minmax pxLxLd aXxXxa
** dp
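Weak duality can be illustrated with a finite toy "payoff table" standing in for $L(x, \alpha)$ (the values are arbitrary):

```python
# For any table L[x][a]: max over a of (min over x) never exceeds
# min over x of (max over a).

L = [[1, 0],
     [0, 1]]

d = max(min(L[x][a] for x in range(2)) for a in range(2))  # "max min"
p = min(max(L[x][a] for a in range(2)) for x in range(2))  # "min max"
print(d, p)  # -> 0 1, so d <= p, and here the inequality is strict
```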
Dual Program – Cont.
Claim: if there exist $x^*$ and $\alpha^* \ge 0$ which are a saddle point, i.e. $x^*$ is feasible and for all $x \in X$ and $\alpha \ge 0$
$$L(x^*, \alpha) \le L(x^*, \alpha^*) \le L(x, \alpha^*),$$
then $p^* = d^*$ and $x^*$ is a solution to $\theta_P$.
Proof:
$$p^* = \inf_{x \in X}\, \sup_{\alpha \ge 0} L(x, \alpha) \le \sup_{\alpha \ge 0} L(x^*, \alpha) = L(x^*, \alpha^*) = \inf_{x \in X} L(x, \alpha^*) \le \sup_{\alpha \ge 0}\, \inf_{x \in X} L(x, \alpha) = d^*$$
Together with $d^* \le p^*$, we conclude $d^* = p^*$.
Karush-Kuhn-Tucker (KKT) Conditions
The KKT conditions give a characterization of an optimal solution to a convex problem.
Theorem: assume that $f$ and $g_i$, $i = 1,\ldots,m$, are differentiable and convex. Then $x^*$ is a solution to the optimization problem iff there exists $\alpha^* \ge 0$ such that:
1. $\nabla_x L(x^*, \alpha^*) = \nabla_x f(x^*) + \sum_{i=1}^m \alpha_i^* \nabla_x g_i(x^*) = 0$
2. $\nabla_{\alpha_i} L(x^*, \alpha^*) = g_i(x^*) \le 0$
3. $\alpha_i^* g_i(x^*) = 0$
KKT Conditions – Cont.
Proof: for every feasible $x$:
$$f(x) \ge f(x^*) + \nabla_x f(x^*) \cdot (x - x^*) = f(x^*) - \sum_{i=1}^m \alpha_i^* \nabla_x g_i(x^*) \cdot (x - x^*) \ge f(x^*) - \sum_{i=1}^m \alpha_i^* \left[g_i(x) - g_i(x^*)\right] = f(x^*) - \sum_{i=1}^m \alpha_i^* g_i(x) \ge f(x^*)$$
The other direction holds as well.
KKT Conditions – Cont.
Example: consider the following optimization problem:
$$\min \frac{1}{2}x^2 \quad \text{s.t. } x \ge 2$$
We have $f(x) = \frac{1}{2}x^2$ and $g_1(x) = 2 - x$. The Lagrangian will be
$$L(x, \alpha) = \frac{1}{2}x^2 + \alpha(2 - x)$$
Setting $\frac{\partial}{\partial x} L(x, \alpha) = x - \alpha = 0$ gives $x^* = \alpha^*$. Since $\alpha^*(2 - x^*) = 0$ and $x^* = \alpha^* > 0$, we get $x^* = \alpha^* = 2$, with optimal value $f(x^*) = \frac{1}{2} \cdot 2^2 = 2$.
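A minimal numeric check of the three KKT conditions at this solution:

```python
# Check KKT at x* = alpha* = 2 for: min x^2/2  s.t.  g(x) = 2 - x <= 0.

x_star, alpha_star = 2.0, 2.0

grad_x = x_star - alpha_star   # condition 1: d/dx [x^2/2 + a(2 - x)] = 0
g = 2.0 - x_star               # condition 2: g(x*) <= 0
slack = alpha_star * g         # condition 3: complementary slackness

print(grad_x, g, slack)        # -> 0.0 0.0 0.0
print(0.5 * x_star ** 2)       # optimal value f(x*) -> 2.0
```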
Optimal Margin Classifier
Back to SVM. Rewrite our optimization program as
$$\min \frac{1}{2}\|w\|^2 \quad \text{s.t. } y_i(w \cdot x_i + b) \ge 1,\ i = 1,\ldots,m,$$
i.e. with constraints
$$g_i(w, b) = 1 - y_i(w \cdot x_i + b) \le 0.$$
Following the KKT conditions, $\alpha_i > 0$ only for points in the training set with a margin of exactly 1. These are the support vectors of the training set.
Optimal Margin – Cont.
[Figure: optimal margin classifier and its support vectors]
Optimal Margin – Cont.
Construct the Lagrangian:
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i \left[y_i(w \cdot x_i + b) - 1\right]$$
To find the dual form $L_D(w, b, \alpha)$, first minimize over $w$ and $b$ by setting the derivatives to zero:
$$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \ \Rightarrow\ w^* = \sum_{i=1}^m \alpha_i y_i x_i$$
Optimal Margin – Cont.
Take the derivative with respect to $b$:
$$\frac{\partial}{\partial b} L(w, b, \alpha) = -\sum_{i=1}^m \alpha_i y_i = 0$$
Use $w^*$ in the Lagrangian:
$$L(w^*, b, \alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - b \sum_{i=1}^m \alpha_i y_i$$
We saw the last term is zero, so
$$L(w^*, b, \alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) = W(\alpha)$$
Optimal Margin – Cont.
The dual optimization problem:
$$\max_\alpha W(\alpha) \quad \text{s.t. } \alpha_i \ge 0,\ i = 1,\ldots,m, \quad \sum_{i=1}^m \alpha_i y_i = 0$$
The KKT conditions hold, so we can solve by finding the $\alpha^*$ that maximizes $W(\alpha)$. Assuming we have $\alpha^*$, the solution to the primal problem is
$$w^* = \sum_{i=1}^m \alpha_i^* y_i x_i$$
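On a hypothetical two-point 1-d training set the dual can be maximized by a coarse search, and $w^*$ recovered from $\alpha^*$ exactly as above:

```python
# Dual for the toy set x1 = +1 (y1 = +1), x2 = -1 (y2 = -1).
# The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a,
# so W(a) = 2a - 2a^2, maximized at a = 1/2.

X = [1.0, -1.0]
Y = [1, -1]

def W(a):
    alphas = [a, a]   # satisfies sum_i alpha_i y_i = 0
    quad = sum(alphas[i] * alphas[j] * Y[i] * Y[j] * X[i] * X[j]
               for i in range(2) for j in range(2))
    return sum(alphas) - 0.5 * quad

a_star = max((i / 100 for i in range(101)), key=W)
w_star = sum(a_star * Y[i] * X[i] for i in range(2))
print(a_star, w_star)   # -> 0.5 1.0
```

With a support vector such as $x_1$, $b^* = y_1 - w^* x_1 = 0$ then follows.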
Optimal Margin – Cont.
We still need to find $b^*$. Assume $x_i$ is a support vector; then:
$$y_i(w^* \cdot x_i + b^*) = 1 \ \Rightarrow\ w^* \cdot x_i + b^* = \frac{1}{y_i} = y_i \ \Rightarrow\ b^* = y_i - w^* \cdot x_i$$
Error Analysis Using Leave-One-Out
The Leave-One-Out (LOO) method: remove one point at a time from the training set, calculate an SVM for the remaining points, and test the result on the removed point.
Definition: the indicator function $I(\exp)$ is 1 if $\exp$ is true, otherwise 0. The LOO error estimate is
$$\hat{R}_{LOO} = \frac{1}{m} \sum_{i=1}^m I\left(h_{S \setminus \{x_i\}}(x_i) \ne y_i\right)$$
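A sketch of the LOO loop; a 1-nearest-neighbor rule stands in for the per-fold SVM training so the example stays self-contained (data and stand-in classifier are illustrative):

```python
# Leave-One-Out: hold out each point, train on the rest, test on it.

def train_1nn(S):
    def h(x):
        return min(S, key=lambda p: abs(p[0] - x))[1]
    return h

def loo_error(S):
    mistakes = 0
    for i in range(len(S)):
        h = train_1nn(S[:i] + S[i + 1:])
        x, y = S[i]
        mistakes += 1 if h(x) != y else 0   # the indicator I(h(x) != y)
    return mistakes / len(S)

S = [(-3.0, -1), (-2.0, -1), (2.0, 1), (3.0, 1), (0.5, -1)]
print(loo_error(S))   # -> 0.2: only the point at 0.5 is misclassified
```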
LOO Error Analysis – Cont.
Expected error:
$$E_{S \sim D^m}[\hat{R}_{LOO}] = \frac{1}{m} \sum_{i=1}^m E_{S \sim D^m}\left[I\left(h_{S \setminus \{x_i\}}(x_i) \ne y_i\right)\right] = E_{S' \sim D^{m-1},\,(x, y) \sim D}\left[I\left(h_{S'}(x) \ne y\right)\right] = E_{S' \sim D^{m-1}}[error(h_{S'})]$$
It follows that the expected error of LOO for a training set of size $m$ equals the expected error of a hypothesis trained on a set of size $m - 1$.
LOO Error Analysis – Cont.
Theorem:
$$E_{S \sim D^m}[error(h_S)] \le E_{S \sim D^{m+1}}\left[\frac{N_{SV}(S)}{m + 1}\right],$$
where $N_{SV}(S)$ is the number of support vectors in $S$.
Proof: if $h_{S \setminus \{x_i\}}$ classifies the point $x_i$ incorrectly, then $x_i$ must be a support vector of $h_S$. Hence:
$$\hat{R}_{LOO} \le \frac{N_{SV}(S)}{m}$$
Applying this to sets of size $m + 1$ and taking expectations, together with the previous slide's identity, gives the theorem.
Generalization Bounds Using VC-dimension
Theorem: let $S = \{x_1, \ldots, x_d\} \subseteq X$ with $\|x_i\| \le R$ for all $i$, and let $d$ be the VC-dimension of the set of hyperplanes
$$\{\,x \mapsto \mathrm{sign}(w \cdot x) : \|w\| = 1,\ \min_i |w \cdot x_i| \ge \gamma\,\}.$$
Then
$$d \le \frac{R^2}{\gamma^2}.$$
Proof: assume that the set $\{x_1, \ldots, x_d\}$ is shattered. So for every $y \in \{-1, 1\}^d$ there is a $w$ with $\|w\| = 1$ such that $y_i(w \cdot x_i) \ge \gamma$, $i = 1,\ldots,d$. Summing over $d$:
$$\gamma d \le w \cdot \sum_{i=1}^d y_i x_i \le \left\| \sum_{i=1}^d y_i x_i \right\|$$
Generalization Bounds Using VC-dimension – Cont.
Proof – Cont.: averaging over the $y$'s with uniform distribution:
$$\gamma d \le E_y\left[\left\| \sum_{i=1}^d y_i x_i \right\|\right]$$
Since $E_y[y_i y_j] = 0$ when $i \ne j$ and $E_y[y_i y_j] = 1$ when $i = j$, we can conclude that:
$$E_y\left[\left\| \sum_{i=1}^d y_i x_i \right\|^2\right] = \sum_{i,j} (x_i \cdot x_j)\, E_y[y_i y_j] = \sum_{i=1}^d \|x_i\|^2 \le dR^2$$
Therefore
$$\gamma^2 d^2 \le E_y\left[\left\| \sum_i y_i x_i \right\|\right]^2 \le E_y\left[\left\| \sum_i y_i x_i \right\|^2\right] \le dR^2,$$
and so
$$d \le \frac{R^2}{\gamma^2}.$$