
Machine Learning Basics Lecture 5: SVM II

Princeton University COS 495

Instructor: Yingyu Liang

Review: SVM objective

SVM: objective

• Let $y_i \in \{+1, -1\}$ and $f_{w,b}(x) = w^T x + b$. Margin (numerical sketch below):

$$\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$$

• Support Vector Machine:

$$\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$$
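The margin formula can be read off directly in code. A minimal NumPy sketch; the data `X`, labels `y`, and parameters `w`, `b` below are hypothetical placeholders, not from the slides:

```python
import numpy as np

# Hypothetical toy data: rows of X are points x_i, y holds labels in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

# Hypothetical separating hyperplane parameters.
w = np.array([1.0, 1.0])
b = 0.0

# Margin: gamma = min_i y_i * (w^T x_i + b) / ||w||
gamma = np.min(y * (X @ w + b)) / np.linalg.norm(w)
print(gamma)
```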

SVM: optimization

• Optimization (Quadratic Programming; solver sketch below):

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \ \forall i$$

• Solved by the Lagrange multiplier method:

$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$$

where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers
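For concreteness, a minimal sketch of solving this quadratic program with the cvxpy library; the toy data `X`, `y` is hypothetical and this is an illustration, not part of the original slides:

```python
import numpy as np
import cvxpy as cp

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n, d = X.shape

# Variables of the primal QP: weight vector w and bias b.
w = cp.Variable(d)
b = cp.Variable()

# min 1/2 ||w||^2   s.t.   y_i (w^T x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```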

Lagrange multiplier

Lagrangian

• Consider the optimization problem:

$$\min_w f(w) \quad \text{s.t.} \quad h_i(w) = 0, \ \forall\, 1 \le i \le l$$

• Lagrangian:

$$\mathcal{L}(w, \boldsymbol{\beta}) = f(w) + \sum_i \beta_i h_i(w)$$

where the $\beta_i$'s are called Lagrange multipliers

Lagrangian

• Consider the optimization problem:

$$\min_w f(w) \quad \text{s.t.} \quad h_i(w) = 0, \ \forall\, 1 \le i \le l$$

• Solved by setting the derivatives of the Lagrangian to 0 (worked example below):

$$\frac{\partial \mathcal{L}}{\partial w_i} = 0; \quad \frac{\partial \mathcal{L}}{\partial \beta_i} = 0$$
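A small worked example (not from the original slides): minimize $f(w) = w_1^2 + w_2^2$ subject to $h(w) = w_1 + w_2 - 1 = 0$.

$$\mathcal{L}(w, \beta) = w_1^2 + w_2^2 + \beta (w_1 + w_2 - 1)$$

$$\frac{\partial \mathcal{L}}{\partial w_1} = 2w_1 + \beta = 0, \quad \frac{\partial \mathcal{L}}{\partial w_2} = 2w_2 + \beta = 0, \quad \frac{\partial \mathcal{L}}{\partial \beta} = w_1 + w_2 - 1 = 0$$

$$\Rightarrow \ w_1 = w_2 = \tfrac{1}{2}, \quad \beta = -1$$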

Generalized Lagrangian

• Consider the optimization problem:

$$\min_w f(w) \quad \text{s.t.} \quad g_i(w) \le 0, \ \forall\, 1 \le i \le k; \qquad h_j(w) = 0, \ \forall\, 1 \le j \le l$$

• Generalized Lagrangian:

$$\mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(w) + \sum_i \alpha_i g_i(w) + \sum_j \beta_j h_j(w)$$

where the $\alpha_i$'s and $\beta_j$'s are called Lagrange multipliers

Generalized Lagrangian

• Consider the quantity:

$$\theta_P(w) := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$

• Why?

$$\theta_P(w) = \begin{cases} f(w), & \text{if } w \text{ satisfies all the constraints} \\ +\infty, & \text{if } w \text{ does not satisfy the constraints} \end{cases}$$

(If some $g_i(w) > 0$, letting $\alpha_i \to +\infty$ drives $\mathcal{L}$ to $+\infty$; similarly, if some $h_j(w) \ne 0$, the $\beta_j$ term is unbounded. If all constraints hold, the maximum is attained with $\alpha_i g_i(w) = 0$ and the $h_j$ terms vanish, leaving $f(w)$.)

• So minimizing $f(w)$ is the same as minimizing $\theta_P(w)$ (small example below):

$$\min_w f(w) = \min_w \theta_P(w) = \min_w \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
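A one-dimensional illustration (not from the original slides): take $f(w) = w^2$ with the single constraint $g(w) = 1 - w \le 0$. Then

$$\theta_P(w) = \max_{\alpha \ge 0} \left[ w^2 + \alpha (1 - w) \right] = \begin{cases} w^2, & \text{if } w \ge 1 \\ +\infty, & \text{if } w < 1 \end{cases}$$

so $\min_w \theta_P(w)$ recovers the constrained minimum $f(1) = 1$.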

Lagrange duality

• The primal problem:

$$p^* := \min_w f(w) = \min_w \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$

• The dual problem:

$$d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \min_w \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$

• Always true: $d^* \le p^*$

• Interesting case: when do we have $d^* = p^*$?

Lagrange duality

• Theorem: under proper conditions, there exist $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ such that

$$d^* = \mathcal{L}(w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = p^*$$

• Moreover, $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{\partial \mathcal{L}}{\partial w_i} = 0 \ \text{(stationarity)}, \qquad \alpha_i g_i(w) = 0 \ \text{(dual complementarity)},$$

$$g_i(w) \le 0, \ h_j(w) = 0 \ \text{(primal constraints)}, \qquad \alpha_i \ge 0 \ \text{(dual constraints)}$$

Lagrange duality

• What are the proper conditions?

• One set of conditions (the Slater conditions):

• $f$, $g_i$ convex; $h_j$ affine

• There exists $w$ with $g_i(w) < 0$ for all $i$

• Other sets of conditions exist; see the Karush-Kuhn-Tucker conditions article on Wikipedia

SVM: optimization

SVM: optimization

• Optimization (Quadratic Programming):

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \ \forall i$$

• Generalized Lagrangian:

$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$$

where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers

SVM: optimization

• KKT conditions:

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \ \Rightarrow \ w = \sum_i \alpha_i y_i x_i \quad (1)$$

$$\frac{\partial \mathcal{L}}{\partial b} = 0 \ \Rightarrow \ 0 = \sum_i \alpha_i y_i \quad (2)$$

• Plug into $\mathcal{L}$:

$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad (3)$$

combined with $0 = \sum_i \alpha_i y_i$, $\alpha_i \ge 0$

SVM: optimization

• Reduces to the dual problem: maximize over $\boldsymbol{\alpha}$

$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$$

subject to

$$\sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0$$

• Since $w = \sum_i \alpha_i y_i x_i$, we have $w^T x + b = \sum_i \alpha_i y_i x_i^T x + b$

• The solution and the prediction only depend on inner products (see the sketch below)
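A minimal sketch of solving this dual with the cvxpy library and recovering $w$ and $b$; the toy data is hypothetical and this is an illustration, not part of the original slides:

```python
import numpy as np
import cvxpy as cp

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n = len(y)

# Rows of Z are y_i * x_i, so sum_ij a_i a_j y_i y_j x_i^T x_j = ||Z^T a||^2.
Z = X * y[:, None]

alpha = cp.Variable(n)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Z.T @ alpha))
constraints = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = Z.T @ a                # w = sum_i alpha_i y_i x_i
sv = np.argmax(a)          # index of a support vector (alpha_i > 0)
b = y[sv] - X[sv] @ w      # from y_sv (w^T x_sv + b) = 1 with y_sv in {+1, -1}
print(w, b)
```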

Kernel methods

Features

[Figure: an input image is mapped to a color histogram over the red, green, and blue channels, illustrating feature extraction $x \mapsto \phi(x)$]


Features

• A proper feature mapping can make a non-linear problem linear

• Using SVM on the feature space $\{\phi(x_i)\}$: only need $\phi(x_i)^T \phi(x_j)$

• Conclusion: no need to design $\phi(\cdot)$; only need to design the kernel (see the sketch below)

$$k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$
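To make this concrete, a minimal sketch using scikit-learn's `SVC`, which accepts a callable kernel $k$ so that $\phi$ never has to be written down explicitly; the data and the particular kernel here are hypothetical illustrations, not from the slides:

```python
import numpy as np
from sklearn.svm import SVC

def poly_kernel(A, B, c=1.0, d=2):
    """k(x, x') = (x^T x' + c)^d for all pairs of rows of A and B."""
    return (A @ B.T + c) ** d

# Hypothetical toy data that is not linearly separable in the original space.
X = np.array([[0.0, 0.0], [0.1, -0.1], [1.0, 1.0], [-1.0, -1.0]])
y = np.array([-1, -1, +1, +1])

# SVC only ever evaluates the kernel, never an explicit feature map phi.
clf = SVC(kernel=poly_kernel).fit(X, y)
print(clf.predict(X))
```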

Polynomial kernels

• Fix a degree $d$ and a constant $c$:

$$k(x, x') = (x^T x' + c)^d$$

• What is $\phi(x)$?

• Expand the expression to obtain $\phi(x)$ (worked out for $d = 2$ below)
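For example, with $d = 2$ and $x, x' \in \mathbb{R}^2$:

$$(x^T x' + c)^2 = x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_2 x_1' x_2' + 2c\, x_1 x_1' + 2c\, x_2 x_2' + c^2 = \phi(x)^T \phi(x')$$

with

$$\phi(x) = \left( x_1^2, \ x_2^2, \ \sqrt{2}\, x_1 x_2, \ \sqrt{2c}\, x_1, \ \sqrt{2c}\, x_2, \ c \right)$$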

Polynomial kernels

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

Gaussian kernels

• Fix a bandwidth $\sigma$:

$$k(x, x') = \exp\left( - \|x - x'\|^2 / 2\sigma^2 \right)$$

• Also called the radial basis function (RBF) kernel

• What is $\phi(x)$? Consider the un-normalized version

$$k'(x, x') = \exp\left( x^T x' / \sigma^2 \right)$$

• Power series expansion (numerical check below):

$$k'(x, x') = \sum_{i=0}^{+\infty} \frac{(x^T x')^i}{\sigma^{2i} \, i!}$$
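A quick numerical check of this expansion, truncating the series at a finite order; the vectors and bandwidth are hypothetical:

```python
import numpy as np
from math import factorial

# Hypothetical vectors and bandwidth.
x = np.array([0.3, -0.5])
xp = np.array([0.2, 0.4])
sigma = 1.0

# Un-normalized Gaussian kernel k'(x, x') = exp(x^T x' / sigma^2).
exact = np.exp(x @ xp / sigma**2)

# Truncated power series: sum_{i=0}^{N} (x^T x')^i / (sigma^{2i} i!).
N = 20
approx = sum((x @ xp) ** i / (sigma ** (2 * i) * factorial(i)) for i in range(N + 1))

print(exact, approx)  # the two values agree to high precision
```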

Mercer's condition for kernels

• Theorem: $k(x, x')$ has an expansion

$$k(x, x') = \sum_{i=1}^{+\infty} a_i \phi_i(x) \phi_i(x')$$

if and only if for any function $c(x)$,

$$\int \int c(x)\, c(x')\, k(x, x')\, dx\, dx' \ge 0$$

(Omitting some technical conditions on $k$ and $c$; a finite-sample check follows below)
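The finite-sample analogue of this condition is that every Gram matrix $K_{ij} = k(x_i, x_j)$ is positive semidefinite. A minimal sketch checking this for the Gaussian kernel on random points; an illustration, not part of the original slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # hypothetical random points
sigma = 1.0

# Gram matrix of the Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

# All eigenvalues should be >= 0 (up to numerical error): c^T K c >= 0 for all c.
print(np.min(np.linalg.eigvalsh(K)))
```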

Constructing new kernels

• Kernels are closed under positive scaling, sum, product, pointwise limit, and composition with a power series $\sum_{i=0}^{+\infty} a_i k^i(x, x')$ with coefficients $a_i \ge 0$ (numerical check of the two examples below)

• Example: if $k_1(x, x')$ and $k_2(x, x')$ are kernels, then so is

$$k(x, x') = 2 k_1(x, x') + 3 k_2(x, x')$$

• Example: if $k_1(x, x')$ is a kernel, then so is

$$k(x, x') = \exp(k_1(x, x'))$$
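A small sketch verifying the two examples at the Gram-matrix level: the combined matrices should remain positive semidefinite. The data is hypothetical, and the linear and degree-2 polynomial base kernels are chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))   # hypothetical points

# Two base kernels: linear and degree-2 polynomial.
K1 = X @ X.T
K2 = (X @ X.T + 1.0) ** 2

# k = 2 k1 + 3 k2, and k = exp(k1) applied elementwise.
K_sum = 2 * K1 + 3 * K2
K_exp = np.exp(K1)

# Both Gram matrices should be positive semidefinite (up to numerical error).
print(np.min(np.linalg.eigvalsh(K_sum)), np.min(np.linalg.eigvalsh(K_exp)))
```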

Kernels vs. Neural networks

[Figure: kernel approach — features are extracted by a fixed map $x \mapsto \phi(x)$ (e.g., a color histogram over the red, green, and blue channels), and the hypothesis $y = w^T \phi(x)$ is built on top: a linear model over fixed features]

[Figure: neural network approach — the features are part of the model; the hypothesis $y = w^T \phi(x)$ is built with $\phi$ learned as well, giving a nonlinear model]


Polynomial kernel SVM as two layer neural network

[Figure: a two-layer network diagram — inputs $x_1, x_2$ feed a fixed first layer computing $\phi(x) = (x_1^2, \ x_2^2, \ \sqrt{2}\, x_1 x_2, \ \sqrt{2c}\, x_1, \ \sqrt{2c}\, x_2, \ c)$, and the output unit computes $y = \mathrm{sign}(w^T \phi(x) + b)$]

The first layer is fixed (a numerical check of this feature map follows below). If the first layer is also learned, this becomes a two-layer neural network.
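A final sketch checking that this explicit feature map reproduces the degree-2 polynomial kernel, so the "fixed first layer" view and the kernel view agree. The data is hypothetical, and the helper `phi` simply follows the diagram above:

```python
import numpy as np

def phi(X, c=1.0):
    """Explicit degree-2 polynomial feature map for 2-dimensional inputs."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     np.full_like(x1, c)], axis=1)

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 2))    # hypothetical 2-D points
c = 1.0

K_kernel = (X @ X.T + c) ** 2          # k(x, x') = (x^T x' + c)^2
K_explicit = phi(X, c) @ phi(X, c).T   # phi(x)^T phi(x')

print(np.allclose(K_kernel, K_explicit))   # True: the two Gram matrices match
```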