Transcript of Machine Learning Basics Lecture 5: SVM II

Page 1:

Machine Learning Basics Lecture 5: SVM II

Princeton University COS 495

Instructor: Yingyu Liang

Page 2:

Review: SVM objective

Page 3:

SVM: objective

• Let $y_i \in \{+1, -1\}$ and $f_{w,b}(x) = w^T x + b$. Margin:

$$\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$$

• Support Vector Machine:

$$\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$$
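One step of reasoning connects this to the quadratic program on the next slide: the ratio is invariant to rescaling $(w, b)$, so we can fix the scale by requiring $\min_i y_i(w^T x_i + b) = 1$; maximizing the margin $\gamma = 1/\|w\|$ is then equivalent to

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1,\ \forall i.$$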

Page 4:

SVM: optimization

• Optimization (Quadratic Programming):

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1,\ \forall i$$

• Solved by Lagrange multiplier method:

$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$$

where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers
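A minimal numerical sketch of this quadratic program (not from the slides; it assumes the cvxpy package and a made-up separable toy data set):

import numpy as np
import cvxpy as cp

# Toy linearly separable data; labels y_i in {+1, -1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min_{w,b} 1/2 ||w||^2   s.t.   y_i (w^T x_i + b) >= 1 for all i
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print("w* =", w.value, " b* =", b.value)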

Page 5:

Lagrange multiplier

Page 6:

Lagrangian

• Consider optimization problem:

$$\min_w f(w) \quad \text{s.t.} \quad h_i(w) = 0,\ \forall\, 1 \le i \le l$$

• Lagrangian:

$$\mathcal{L}(w, \boldsymbol{\beta}) = f(w) + \sum_i \beta_i h_i(w)$$

where the $\beta_i$'s are called Lagrange multipliers

Page 7:

Lagrangian

• Consider optimization problem:

$$\min_w f(w) \quad \text{s.t.} \quad h_i(w) = 0,\ \forall\, 1 \le i \le l$$

• Solved by setting derivatives of the Lagrangian to 0:

$$\frac{\partial \mathcal{L}}{\partial w_i} = 0; \quad \frac{\partial \mathcal{L}}{\partial \beta_i} = 0$$
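For concreteness, a small worked example of the method: minimize $f(w) = w_1^2 + w_2^2$ subject to $h(w) = w_1 + w_2 - 1 = 0$. Then

$$\mathcal{L}(w, \beta) = w_1^2 + w_2^2 + \beta (w_1 + w_2 - 1),$$

$$\frac{\partial \mathcal{L}}{\partial w_1} = 2w_1 + \beta = 0, \quad \frac{\partial \mathcal{L}}{\partial w_2} = 2w_2 + \beta = 0, \quad \frac{\partial \mathcal{L}}{\partial \beta} = w_1 + w_2 - 1 = 0,$$

giving $w_1 = w_2 = 1/2$ and $\beta = -1$.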

Page 8:

Generalized Lagrangian

• Consider optimization problem:

$$\min_w f(w) \quad \text{s.t.} \quad g_i(w) \le 0,\ \forall\, 1 \le i \le k; \quad h_j(w) = 0,\ \forall\, 1 \le j \le l$$

• Generalized Lagrangian:

$$\mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(w) + \sum_i \alpha_i g_i(w) + \sum_j \beta_j h_j(w)$$

where the $\alpha_i$'s and $\beta_j$'s are called Lagrange multipliers

Page 9:

Generalized Lagrangian

• Consider the quantity:

$$\theta_P(w) := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$

• Why?

$$\theta_P(w) = \begin{cases} f(w), & \text{if } w \text{ satisfies all the constraints} \\ +\infty, & \text{if } w \text{ does not satisfy the constraints} \end{cases}$$

• So minimizing $f(w)$ is the same as minimizing $\theta_P(w)$:

$$\min_w f(w) = \min_w \theta_P(w) = \min_w \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
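To see why: if $w$ violates some constraint (some $g_i(w) > 0$ or $h_j(w) \ne 0$), the inner max can send the corresponding $\alpha_i \to +\infty$ or $\beta_j \to \pm\infty$, so $\theta_P(w) = +\infty$. If $w$ satisfies all constraints, then every $\alpha_i g_i(w) \le 0$ and every $\beta_j h_j(w) = 0$, so the max is attained at $\alpha_i = 0$ and $\theta_P(w) = f(w)$.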

Page 10:

Lagrange duality

• The primal problem:

$$p^* := \min_w f(w) = \min_w \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$

• The dual problem:

$$d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \min_w \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$

• Always true (weak duality): $d^* \le p^*$
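This is the max-min inequality: for any fixed $\tilde{\boldsymbol{\alpha}}, \tilde{\boldsymbol{\beta}}$ with $\tilde{\alpha}_i \ge 0$ and any fixed $\tilde{w}$,

$$\min_w \mathcal{L}(w, \tilde{\boldsymbol{\alpha}}, \tilde{\boldsymbol{\beta}}) \ \le\ \mathcal{L}(\tilde{w}, \tilde{\boldsymbol{\alpha}}, \tilde{\boldsymbol{\beta}}) \ \le\ \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(\tilde{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}),$$

and taking the max over $(\tilde{\boldsymbol{\alpha}}, \tilde{\boldsymbol{\beta}})$ on the left and the min over $\tilde{w}$ on the right gives $d^* \le p^*$.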

Page 11:

Lagrange duality

• The primal problem:

$$p^* := \min_w f(w) = \min_w \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$

• The dual problem:

$$d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \min_w \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$

• Interesting case: when do we have $d^* = p^*$?

Page 12:

Lagrange duality

• Theorem: under proper conditions, there exist $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ such that

$$d^* = \mathcal{L}(w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = p^*$$

Moreover, $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{\partial \mathcal{L}}{\partial w_i} = 0, \quad \alpha_i g_i(w) = 0, \quad g_i(w) \le 0, \quad h_j(w) = 0, \quad \alpha_i \ge 0$$

Page 13:

Lagrange duality

• Theorem: under proper conditions, there exist $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ such that

$$d^* = \mathcal{L}(w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = p^*$$

Moreover, $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{\partial \mathcal{L}}{\partial w_i} = 0, \quad \underbrace{\alpha_i g_i(w) = 0}_{\text{dual complementarity}}, \quad g_i(w) \le 0, \quad h_j(w) = 0, \quad \alpha_i \ge 0$$

Page 14:

Lagrange duality

• Theorem: under proper conditions, there exist $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ such that

$$d^* = \mathcal{L}(w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = p^*$$

• Moreover, $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{\partial \mathcal{L}}{\partial w_i} = 0, \quad \alpha_i g_i(w) = 0, \quad \underbrace{g_i(w) \le 0, \quad h_j(w) = 0}_{\text{primal constraints}}, \quad \underbrace{\alpha_i \ge 0}_{\text{dual constraints}}$$

Page 15:

Lagrange duality

• What are the proper conditions?

• A set of conditions (Slater's conditions):
  • $f$, $g_i$ convex, $h_j$ affine
  • There exists $w$ satisfying all $g_i(w) < 0$

• There exist other sets of conditions
  • Search "Karush–Kuhn–Tucker conditions" on Wikipedia

Page 16:

SVM: optimization

Page 17:

SVM: optimization

• Optimization (Quadratic Programming):

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1,\ \forall i$$

• Generalized Lagrangian:

$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$$

where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers

Page 18:

SVM: optimization

• KKT conditions:

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \ \Rightarrow\ w = \sum_i \alpha_i y_i x_i \qquad (1)$$

$$\frac{\partial \mathcal{L}}{\partial b} = 0 \ \Rightarrow\ 0 = \sum_i \alpha_i y_i \qquad (2)$$

• Plug into $\mathcal{L}$:

$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \qquad (3)$$

combined with $0 = \sum_i \alpha_i y_i$, $\alpha_i \ge 0$
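Filling in the algebra behind (3): with $w = \sum_i \alpha_i y_i x_i$,

$$\frac{1}{2}\|w\|^2 = \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^T x_j, \qquad \sum_i \alpha_i y_i\, w^T x_i = \sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^T x_j,$$

and $b\sum_i \alpha_i y_i = 0$ by (2), so

$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i\left[y_i(w^T x_i + b) - 1\right] = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^T x_j.$$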

Page 19:

SVM: optimization

• Reduces to the dual problem:

$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$

$$\sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0$$

• Since $w = \sum_i \alpha_i y_i x_i$, we have $w^T x + b = \sum_i \alpha_i y_i\, x_i^T x + b$

Only depends on inner products
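The same toy data used above can be pushed through this dual form; a minimal sketch (again assuming cvxpy, not from the slides) that also recovers $w$ and $b$ from the multipliers:

import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

alpha = cp.Variable(n)

# max_alpha  sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
#   s.t.     sum_i alpha_i y_i = 0,   alpha_i >= 0
# Note: the double sum equals || sum_i alpha_i y_i x_i ||^2
dual_obj = cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y))
cp.Problem(cp.Maximize(dual_obj), [alpha @ y == 0, alpha >= 0]).solve()

a = alpha.value
w = (a * y) @ X                 # w = sum_i alpha_i y_i x_i, KKT condition (1)
sv = int(np.argmax(a))          # a point with alpha_i > 0 is a support vector
b = y[sv] - w @ X[sv]           # support vectors satisfy y_i (w^T x_i + b) = 1
print("alpha =", np.round(a, 3), " w =", w, " b =", b)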

Page 20:

Kernel methods

Page 21:

Features

[Figure: extract features from an image $x$, e.g. a color histogram over the Red, Green, and Blue channels, giving the feature vector $\phi(x)$]

Page 22:

Features

Page 23:

Features

• A proper feature mapping can turn a non-linear problem into a linear one

• Using SVM on the feature space $\{\phi(x_i)\}$: only need $\phi(x_i)^T \phi(x_j)$

• Conclusion: no need to design $\phi(\cdot)$; only need to design

$$k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$
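A short sketch of what "only need the inner products" means in practice (assumes scikit-learn; poly_kernel is an ad-hoc helper implementing a degree-2 polynomial kernel):

import numpy as np
from sklearn.svm import SVC

def poly_kernel(A, B, c=1.0, d=2):
    # k(x, x') = (x^T x' + c)^d for all pairs of rows of A and B
    return (A @ B.T + c) ** d

# XOR-like labels: not linearly separable in x, but separable in phi(x)
X_train = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y_train = np.array([1, 1, -1, -1])
X_test = np.array([[0.9, 0.1]])

clf = SVC(kernel="precomputed", C=1e6)           # large C approximates hard margin
clf.fit(poly_kernel(X_train, X_train), y_train)  # pass the Gram matrix, not phi(x)
print(clf.predict(poly_kernel(X_test, X_train))) # test time also needs only k(x, x_i)

With kernel="precomputed", fit and predict consume only matrices of kernel values; the feature map $\phi(x)$ is never formed explicitly.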

Page 24:

Polynomial kernels

• Fix a degree $d$ and a constant $c$:

$$k(x, x') = (x^T x' + c)^d$$

• What is $\phi(x)$?

• Expand the expression to get $\phi(x)$
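For example, with $d = 2$ and $x \in \mathbb{R}^2$, expanding gives

$$(x^T x' + c)^2 = \phi(x)^T \phi(x'), \qquad \phi(x) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2c}\,x_1,\; \sqrt{2c}\,x_2,\; c\right),$$

a 6-dimensional feature map (this is the map drawn on the last slide).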

Page 25:

Polynomial kernels

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

Page 26:

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

Page 27:

Gaussian kernels

• Fix a bandwidth $\sigma$:

$$k(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$$

• Also called radial basis function (RBF) kernels

• What is $\phi(x)$? Consider the un-normalized version

$$k'(x, x') = \exp\left(x^T x' / \sigma^2\right)$$

• Power series expansion:

$$k'(x, x') = \sum_{i=0}^{+\infty} \frac{(x^T x')^i}{\sigma^{2i}\, i!}$$
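A quick numerical check of this expansion (illustration only; the points and bandwidth are arbitrary):

import numpy as np
from math import factorial

x = np.array([0.3, -0.2])
xp = np.array([0.5, 0.1])
sigma = 0.8

u = float(x @ xp)
exact = np.exp(u / sigma**2)
truncated = sum(u**i / (sigma**(2 * i) * factorial(i)) for i in range(20))
print(exact, truncated)   # truncating at 20 terms already matches exp(x^T x' / sigma^2)

Each term $(x^T x')^i$ is itself a polynomial kernel of degree $i$, so the corresponding $\phi$ stacks scaled polynomial feature maps of every degree and is infinite-dimensional.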

Page 28:

Mercer's condition for kernels

• Theorem: $k(x, x')$ has an expansion

$$k(x, x') = \sum_{i=1}^{+\infty} a_i \phi_i(x) \phi_i(x')$$

if and only if for any function $c(x)$,

$$\int\int c(x)\, c(x')\, k(x, x')\, dx\, dx' \ge 0$$

(Omitting some technical conditions on $k$ and $c$)
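A finite-sample analogue of this condition (a sketch, not the theorem itself): on any finite set of points, the Gram matrix $K_{ij} = k(x_i, x_j)$ of a valid kernel must be positive semidefinite, i.e. $c^T K c \ge 0$ for every vector $c$. That is easy to spot-check numerically (gaussian_kernel is an ad-hoc helper):

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

X = np.random.randn(50, 3)
K = gaussian_kernel(X, X)
print(np.linalg.eigvalsh(K).min())   # nonnegative up to rounding error: K is PSD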

Page 29:

Constructing new kernels

• Kernels are closed under positive scaling, sum, product, pointwise limit, and composition with a power series $\sum_{i=0}^{+\infty} a_i\, k(x, x')^i$ (with $a_i \ge 0$)

• Example: if $k_1(x, x')$ and $k_2(x, x')$ are kernels, then so is

$$k(x, x') = 2k_1(x, x') + 3k_2(x, x')$$

• Example: if $k_1(x, x')$ is a kernel, then so is

$$k(x, x') = \exp\left(k_1(x, x')\right)$$
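Both example constructions can be spot-checked on random data (illustration only):

import numpy as np

X = np.random.randn(40, 3)
k1 = X @ X.T                  # Gram matrix of the linear kernel
k2 = (X @ X.T + 1.0) ** 2     # Gram matrix of a degree-2 polynomial kernel

print(np.linalg.eigvalsh(2 * k1 + 3 * k2).min())  # nonnegative up to rounding error
print(np.linalg.eigvalsh(np.exp(k1)).min())       # exp(k1): power series with a_i = 1/i! >= 0

Here np.exp is applied entrywise to the Gram matrix, matching $k(x, x') = \exp(k_1(x, x'))$.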

Page 30:

Kernels vs. Neural networks

Page 31:

Features

[Figure: extract features $\phi(x)$ from an image $x$ (e.g. a color histogram over the Red, Green, and Blue channels), then build the hypothesis $y = w^T \phi(x)$]

Page 32:

Features: part of the model

$y = w^T \phi(x)$: build hypothesis

Linear model

Nonlinear model

Page 33:

Polynomial kernels

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

Page 34:

Polynomial kernel SVM as two layer neural network

[Figure: a two-layer network view of the degree-2 polynomial kernel SVM. Inputs: $x_1, x_2$. First-layer (feature) units: $x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ c$. Output: $y = \mathrm{sign}(w^T \phi(x) + b)$]

The first layer is fixed. If the first layer is also learned, it becomes a two-layer neural network.
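A numerical check that this fixed first layer really computes $\phi(x)$ for the degree-2 polynomial kernel, i.e. $\phi(x)^T \phi(x') = (x^T x' + c)^2$ (phi is an ad-hoc helper; illustration only):

import numpy as np

def phi(x, c=1.0):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x = np.array([0.7, -1.2])
xp = np.array([2.0, 0.5])
c = 1.0

print(phi(x, c) @ phi(xp, c))   # explicit feature map (the fixed first layer)
print((x @ xp + c) ** 2)        # direct kernel evaluation; the two numbers match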