Support vector machines (SVMs), Lecture 5 (lecture5.pptx). Author: David Sontag. Created: 2/10/2016 3:48:02 AM.
Support vector machines (SVMs) Lecture 5
David Sontag New York University
Soft margin SVM

w·x + b = +1
w·x + b = −1
w·x + b = 0

Slack penalty C > 0:
• C = ∞: minimizes an upper bound on the 0-1 loss
• C ≈ 0: points with ξi = 0 have a big margin
• Select C using cross-validation

[Figure: separating hyperplane with "slack variables" ξ1, ξ2, ξ3, ξ4 marking margin violations]

Support vectors: data points for which the constraints are binding.
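The slack variables can be read off directly from the constraints: ξi = max(0, 1 − yi(w·xi + b)), so ξi = 0 exactly when the margin constraint holds. A minimal NumPy sketch; the data points and hyperplane here are made up for illustration:

```python
import numpy as np

# Hypothetical 2-D data with labels in {+1, -1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [0.5, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# A hypothetical hyperplane w.x + b = 0
w = np.array([1.0, 1.0])
b = -1.0

# Functional margins y_i (w.x_i + b) and the resulting slacks
margins = y * (X @ w + b)
xi = np.maximum(0.0, 1.0 - margins)  # xi_i > 0 iff the point violates the margin

print(xi)  # only the last point falls inside the margin
```

Points with ξi > 0 (and points exactly on the margin) are the support vectors: their constraints are binding.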
Soft margin SVM

QP form:

  min over w, b, ξ of (1/2)‖w‖² + C Σi ξi   subject to yi(w·xi + b) ≥ 1 − ξi, ξi ≥ 0

More "natural" form:

  min over w, b of (λ/2)‖w‖² + (1/m) Σi max{0, 1 − yi(w·xi + b)}
                  (regularization term)      (empirical loss)

Equivalent if λ = 1/(Cm).
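The equivalence can be checked numerically. At the QP optimum the slacks satisfy ξi = max{0, 1 − yi(w·xi + b)}, and with λ = 1/(Cm) the QP objective is exactly Cm times the regularized objective, so the two share minimizers. A quick sketch with made-up data (the specific numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
X = rng.normal(size=(m, 3))
y = rng.choice([-1.0, 1.0], size=m)
w = rng.normal(size=3)
b = 0.3
C = 2.0
lam = 1.0 / (C * m)  # the equivalence condition

# Optimal slacks are just the hinge losses
hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))

f_qp = 0.5 * (w @ w) + C * hinge.sum()        # QP form
f_reg = (lam / 2) * (w @ w) + hinge.mean()    # "natural" regularized form

# Agree up to the constant positive factor C*m, hence same minimizers
print(np.isclose(f_qp, C * m * f_reg))
```

Scaling an objective by a positive constant never changes its argmin, which is why the two formulations are interchangeable.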
Subgradient (for non-differentiable functions)
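Recall the definition: g is a subgradient of f at a point x0 if f(u) ≥ f(x0) + g·(u − x0) for all u, i.e., the linear function through (x0, f(x0)) with slope g lies below f everywhere. A small numeric check of this inequality for the 1-D hinge f(x) = max{0, 1 − x} at its kink x0 = 1, where every slope in [−1, 0] is a valid subgradient (illustration only):

```python
import numpy as np

def f(x):
    # 1-D hinge loss, non-differentiable at x = 1
    return max(0.0, 1.0 - x)

x0 = 1.0
us = np.linspace(-3, 3, 61)

# Every g in [-1, 0] satisfies the subgradient inequality everywhere
for g in (-1.0, -0.5, 0.0):
    ok = all(f(u) >= f(x0) + g * (u - x0) - 1e-12 for u in us)
    print(g, ok)

# g = -2 is NOT a subgradient: its linear lower bound overshoots f for u < 1
g_bad = -2.0
print(any(f(u) < f(x0) + g_bad * (u - x0) for u in us))
```

The same picture holds for the SVM hinge loss at the margin boundary yi w·xi = 1: any convex combination of the two one-sided slopes is a valid subgradient.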
(Sub)gradient descent of the SVM objective

Step size:
The Pegasos Algorithm

Pegasos Algorithm (from homework):
  Initialize: w1 = 0, t = 0
  For iter = 1, 2, …, 20
    For j = 1, 2, …, |data|
      t = t + 1
      ηt = 1/(tλ)
      If yj(wt · xj) < 1:  wt+1 = (1 − ηtλ) wt + ηt yj xj
      Else:                wt+1 = (1 − ηtλ) wt
  Output: wt+1

General framework:
  Initialize: w1 = 0, t = 0
  While not converged
    t = t + 1
    Choose a stepsize ηt
    Choose a direction pt
    Go! (take the step)
    Test for convergence
  Output: wt+1
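The pseudocode above translates almost line-for-line into NumPy. A sketch of the homework algorithm on a synthetic linearly separable problem; the dataset and the value of λ are made up for illustration:

```python
import numpy as np

def pegasos(X, y, lam, n_epochs=20):
    """Pegasos: sweep the data n_epochs times, one example per update."""
    m, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for j in range(m):              # deterministic order, as in the homework
            t += 1
            eta = 1.0 / (t * lam)       # stepsize starts at 1/lam, decays as 1/t
            if y[j] * (w @ X[j]) < 1:
                w = (1 - eta * lam) * w + eta * y[j] * X[j]
            else:
                w = (1 - eta * lam) * w
    return w

# Synthetic separable data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+2.0, size=(50, 2)),
               rng.normal(loc=-2.0, size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

w = pegasos(X, y, lam=0.1)
accuracy = np.mean(np.sign(X @ w) == y)
print(accuracy)
```

Note this sketch omits the bias term b, matching the homework formulation on the slide.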
The Pegasos Algorithm (cont.)

The same update, written in subgradient form:
  If yj(wt · xj) < 1:  wt+1 = wt − ηt(λwt − yj xj)
  Else:                wt+1 = wt − ηtλwt
Convergence choice: a fixed number of iterations, T = 20·|data| (i.e., 20 full passes over the data).
Stepsize choice: initialize with 1/λ and decay as 1/t, i.e., ηt = 1/(tλ).
Direction choice: a stochastic approximation to the subgradient.
Subgradient calculation

Objective:
  (λ/2)‖w‖² + (1/m) Σi max{0, 1 − yi w·xi}

Stochastic approximation, for a randomly chosen data point i:
  (λ/2)‖w‖² + max{0, 1 − yi w·xi}

(In the assignment the choice of i is not random; this makes it easier to debug and to compare between students.)
(Sub)gradient of the stochastic approximation:
  λw + ∂/∂w max{0, 1 − yi w·xi}
The second term is a subgradient of the hinge loss:
  ∂/∂w max{0, 1 − yi w·xi} =
    −yi xi                 if yi w·xi < 1
    0                      if yi w·xi > 1
    either 0 or −yi xi     if yi w·xi = 1

[Figure: hinge loss max{0, 1 − yi w·xi} plotted against yi w·xi, with the kink at 1]
Putting it together, the stochastic (sub)gradient is:
  λw − yi xi   if yi w·xi < 1
  λw           else
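The case analysis can be sanity-checked against a numerical gradient at any point away from the kink yi w·xi = 1, where the stochastic objective is differentiable. A small sketch with illustrative numbers:

```python
import numpy as np

def stoch_subgrad(w, x_i, y_i, lam):
    """Subgradient of (lam/2)||w||^2 + max{0, 1 - y_i w.x_i}."""
    if y_i * (w @ x_i) < 1:
        return lam * w - y_i * x_i
    return lam * w

def objective_i(w, x_i, y_i, lam):
    return (lam / 2) * (w @ w) + max(0.0, 1.0 - y_i * (w @ x_i))

w = np.array([0.5, -1.0])
x_i = np.array([1.0, 2.0])
y_i = -1.0
lam = 0.1
# Here y_i * w.x_i = 1.5 > 1: the smooth branch, subgradient = lam * w
g = stoch_subgrad(w, x_i, y_i, lam)

# Central finite differences, coordinate by coordinate
eps = 1e-6
num = np.array([
    (objective_i(w + eps * e, x_i, y_i, lam) -
     objective_i(w - eps * e, x_i, y_i, lam)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(g, num, atol=1e-5))
```

Flipping the label to y_i = +1 puts the same point on the hinge-active branch, where the subgradient picks up the extra −yi xi term.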
Go: update wt+1 = wt − ηt pt.
Why is this algorithm interesting?

• Simple to implement, state-of-the-art results.
  – Notice the similarity to the perceptron algorithm! Algorithmic differences: Pegasos updates whenever the margin is insufficient (not only on mistakes), scales the weight vector, and has a learning rate.
• Since it is based on stochastic gradient descent, its running-time guarantees are probabilistic.
• Highlights interesting tradeoffs between running time and data.
Much faster than previous methods

• 3 datasets (provided by Joachims):
  – Reuters CCAT (800K examples, 47K features)
  – Physics ArXiv (62K examples, 100K features)
  – Covertype (581K examples, 54 features)

Training time (in seconds):

| Dataset       | Pegasos | SVM-Perf | SVM-Light |
|---------------|---------|----------|-----------|
| Reuters       | 2       | 77       | 20,075    |
| Covertype     | 6       | 85       | 25,514    |
| Astro-Physics | 2       | 5        | 80        |
Approximate algorithms: error decomposition

• Approximation error: the best error achievable by a large-margin predictor; the error of the population minimizer
    w0 = argmin E[f(w)] = argmin λ‖w‖² + E_{x,y}[loss(⟨w, x⟩; y)]
• Estimation error: the extra error due to replacing E[loss] with the empirical loss; w* = argmin fn(w)
• Optimization error: the extra error due to optimizing only to within finite precision

[Figure: prediction error decomposed into approximation, estimation, and optimization error: err(w0), err(w*), err(w)]

From the ICML'08 presentation. [Shalev-Shwartz, Srebro '08]

Note: w0 is redefined in this context; it does not refer to the initial weight vector.
Pegasos guarantees

After T = O(1/(λδε)) updates:
  err(wT) < err(w0) + ε,  with probability 1 − δ.

[Shalev-Shwartz, Srebro '08]
The running time does NOT depend on:
  – the number of training examples!
It DOES depend on:
  – the dimensionality d (why?)
  – the approximation parameters ε and δ
  – the difficulty of the problem, λ

[Shalev-Shwartz, Srebro '08]
But how is that possible? The double-edged sword

• When the data set size increases:
  – the estimation error decreases
  – we can afford more optimization error, i.e., optimize to within lesser accuracy ⇒ fewer iterations
  – but handling more data is expensive, e.g., the runtime of each iteration increases
• Stochastic gradient descent, e.g., Pegasos (Primal Estimated sub-GrAdient SOlver for SVM) [Shalev-Shwartz, Singer, Srebro, ICML'07]:
  – fixed runtime per iteration
  – runtime to reach a fixed accuracy does not increase with n

[Figure: prediction error err(w0), err(w*), err(w) versus data set size n]

As the dataset grows, our approximations can be worse and still achieve the same error!

[Shalev-Shwartz, Srebro '08]