Machine Learning Basics Lecture 5: SVM II
Princeton University COS 495
Instructor: Yingyu Liang
Review: SVM objective
SVM: objective
⢠Let ðŠð â +1,â1 , ðð€,ð ð¥ = ð€ðð¥ + ð. Margin:
ðŸ = minð
ðŠððð€,ð ð¥ð| ð€ |
⢠Support Vector Machine:
maxð€,ð
ðŸ = maxð€,ð
minð
ðŠððð€,ð ð¥ð| ð€ |
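A minimal NumPy sketch of the margin definition above; the toy data, $w$, and $b$ are illustrative, not from the lecture:

```python
import numpy as np

# Toy linearly separable data (illustrative values only).
X = np.array([[2.0, 2.0], [1.5, 3.0], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([+1, +1, -1, -1])

# A candidate linear classifier f_{w,b}(x) = w^T x + b.
w = np.array([1.0, 1.0])
b = 0.0

# Margin: gamma = min_i y_i * f_{w,b}(x_i) / ||w||.
scores = X @ w + b
gamma = np.min(y * scores) / np.linalg.norm(w)
print(gamma)
```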
SVM: optimization
⢠Optimization (Quadratic Programming):
minð€,ð
1
2ð€
2
ðŠð ð€ðð¥ð + ð ⥠1, âð
⢠Solved by Lagrange multiplier method:
â ð€, ð, ð¶ =1
2ð€
2
â
ð
ðŒð[ðŠð ð€ðð¥ð + ð â 1]
where ð¶ is the Lagrange multiplier
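A minimal numerical sketch of this QP, assuming the cvxpy package; the toy data are illustrative, not from the lecture:

```python
import numpy as np
import cvxpy as cp  # generic convex/QP solver, assumed available

# Illustrative toy data (same as in the margin sketch).
X = np.array([[2.0, 2.0], [1.5, 3.0], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

# Hard-margin SVM: minimize 1/2 ||w||^2  s.t.  y_i (w^T x_i + b) >= 1 for all i.
w = cp.Variable(X.shape[1])
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```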
Lagrange multiplier
Lagrangian
⢠Consider optimization problem:minð€
ð(ð€)
âð ð€ = 0, â1 †ð †ð
⢠Lagrangian:
â ð€, ð· = ð ð€ +
ð
ðœðâð(ð€)
where ðœðâs are called Lagrange multipliers
Lagrangian
⢠Consider optimization problem:minð€
ð(ð€)
âð ð€ = 0, â1 †ð †ð
⢠Solved by setting derivatives of Lagrangian to 0
ðâ
ðð€ð= 0;
ðâ
ððœð= 0
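A tiny worked example of this recipe (illustrative, not from the lecture): minimize $f(w) = w_1^2 + w_2^2$ subject to $h(w) = w_1 + w_2 - 1 = 0$. The Lagrangian is
$$\mathcal{L}(w, \beta) = w_1^2 + w_2^2 + \beta (w_1 + w_2 - 1).$$
Setting derivatives to zero gives $2 w_1 + \beta = 0$, $2 w_2 + \beta = 0$, $w_1 + w_2 - 1 = 0$, hence $w_1 = w_2 = 1/2$ and $\beta = -1$.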
Generalized Lagrangian
⢠Consider optimization problem:minð€
ð(ð€)
ðð ð€ †0, â1 †ð †ð
âð ð€ = 0, â1 †ð †ð
⢠Generalized Lagrangian:
â ð€, ð¶, ð· = ð ð€ +
ð
ðŒððð(ð€) +
ð
ðœðâð(ð€)
where ðŒð , ðœðâs are called Lagrange multipliers
Generalized Lagrangian
⢠Consider the quantity:
ðð ð€ â maxð¶,ð·:ðŒðâ¥0
â ð€, ð¶, ð·
⢠Why?
ðð ð€ = áð ð€ , if ð€ satisfies all the constraints+â, if ð€ does not satisfy the constraints
⢠So minimizing ð ð€ is the same as minimizing ðð ð€
minð€
ð ð€ = minð€
ðð ð€ = minð€
maxð¶,ð·:ðŒðâ¥0
â ð€,ð¶, ð·
Lagrange duality
⢠The primal problem
ðâ â minð€
ð ð€ = minð€
maxð¶,ð·:ðŒðâ¥0
â ð€, ð¶, ð·
⢠The dual problem
ðâ â maxð¶,ð·:ðŒðâ¥0
minð€
â ð€, ð¶, ð·
⢠Always true:ðâ †ðâ
Lagrange duality
⢠The primal problem
ðâ â minð€
ð ð€ = minð€
maxð¶,ð·:ðŒðâ¥0
â ð€, ð¶, ð·
⢠The dual problem
ðâ â maxð¶,ð·:ðŒðâ¥0
minð€
â ð€, ð¶, ð·
⢠Interesting case: when do we have ðâ = ðâ?
Lagrange duality
⢠Theorem: under proper conditions, there exists ð€â, ð¶â, ð·â such that
ðâ = â ð€â, ð¶â, ð·â = ðâ
Moreover, ð€â, ð¶â, ð·â satisfy Karush-Kuhn-Tucker (KKT) conditions:ðâ
ðð€ð= 0, ðŒððð ð€ = 0
ðð ð€ †0, âð ð€ = 0, ðŒð ⥠0
Lagrange duality
⢠Theorem: under proper conditions, there exists ð€â, ð¶â, ð·â such that
ðâ = â ð€â, ð¶â, ð·â = ðâ
Moreover, ð€â, ð¶â, ð·â satisfy Karush-Kuhn-Tucker (KKT) conditions:ðâ
ðð€ð= 0, ðŒððð ð€ = 0
ðð ð€ †0, âð ð€ = 0, ðŒð ⥠0
dual complementarity
Lagrange duality
⢠Theorem: under proper conditions, there exists ð€â, ð¶â, ð·â such that
ðâ = â ð€â, ð¶â, ð·â = ðâ
⢠Moreover, ð€â, ð¶â, ð·â satisfy Karush-Kuhn-Tucker (KKT) conditions:ðâ
ðð€ð= 0, ðŒððð ð€ = 0
ðð ð€ †0, âð ð€ = 0, ðŒð ⥠0
dual constraintsprimal constraints
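A tiny worked example (illustrative, not from the slides) where the KKT conditions and strong duality can be checked by hand: minimize $f(w) = w^2$ subject to $g(w) = 1 - w \le 0$. The generalized Lagrangian is
$$\mathcal{L}(w, \alpha) = w^2 + \alpha (1 - w), \qquad \frac{\partial \mathcal{L}}{\partial w} = 2w - \alpha = 0 \ \Rightarrow\ w = \alpha/2.$$
The dual is $\max_{\alpha \ge 0} \left(\alpha - \alpha^2/4\right)$, attained at $\alpha^* = 2$, giving $w^* = 1$. All KKT conditions hold: $g(w^*) = 0$, $\alpha^* \ge 0$, $\alpha^* g(w^*) = 0$, and indeed $p^* = d^* = 1$.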
Lagrange duality
⢠What are the proper conditions?
⢠A set of conditions (Slater conditions):⢠ð, ðð convex, âð affine
⢠Exists ð€ satisfying all ðð ð€ < 0
⢠There exist other sets of conditions⢠Search KarushâKuhnâTucker conditions on Wikipedia
SVM: optimization
⢠Optimization (Quadratic Programming):
minð€,ð
1
2ð€
2
ðŠð ð€ðð¥ð + ð ⥠1, âð
⢠Generalized Lagrangian:
â ð€, ð, ð¶ =1
2ð€
2
â
ð
ðŒð[ðŠð ð€ðð¥ð + ð â 1]
where ð¶ is the Lagrange multiplier
SVM: optimization
⢠KKT conditions:ðâ
ðð€= 0, ð€ = Ïð ðŒððŠðð¥ð (1)
ðâ
ðð= 0, 0 = Ïð ðŒððŠð (2)
⢠Plug into â:
â ð€, ð, ð¶ = Ïð ðŒð â1
2Ïðð ðŒððŒððŠððŠðð¥ð
ðð¥ð (3)
combined with 0 = Ïð ðŒððŠð , ðŒð ⥠0
SVM: optimization
⢠Reduces to dual problem:
â ð€, ð, ð¶ =
ð
ðŒð â1
2
ðð
ðŒððŒððŠððŠðð¥ððð¥ð
ð
ðŒððŠð = 0, ðŒð ⥠0
⢠Since ð€ = Ïð ðŒððŠðð¥ð, we have ð€ðð¥ + ð = Ïð ðŒððŠðð¥ððð¥ + ð
Only depend on inner products
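To make the "inner products only" point concrete, here is a sketch assuming scikit-learn; the toy data, the query point, and the large C (to approximate the hard margin) are illustrative choices. `dual_coef_` stores $\alpha_i y_i$ for the support vectors, so the decision value is a sum of inner products:

```python
import numpy as np
from sklearn.svm import SVC  # assumed available

X = np.array([[2.0, 2.0], [1.5, 3.0], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)

# Decision value w^T x + b written purely with inner products x_i^T x:
x_new = np.array([0.5, 1.0])
manual = clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + clf.intercept_
print(manual, clf.decision_function([x_new]))  # the two values agree
```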
Kernel methods
Features
[Figure: extracting features from an input $x$, e.g. a color histogram over the red, green, and blue channels, to obtain a feature vector $\phi(x)$]
Features
⢠Proper feature mapping can make non-linear to linear
⢠Using SVM on the feature space {ð ð¥ð }: only need ð ð¥ððð(ð¥ð)
⢠Conclusion: no need to design ð â , only need to design
ð ð¥ð , ð¥ð = ð ð¥ððð(ð¥ð)
Polynomial kernels
⢠Fix degree ð and constant ð:ð ð¥, ð¥â² = ð¥ðð¥â² + ð ð
⢠What are ð(ð¥)?
⢠Expand the expression to get ð(ð¥)
Polynomial kernels
Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar
Gaussian kernels
⢠Fix bandwidth ð:
ð ð¥, ð¥â² = exp(â ð¥ â ð¥â²2/2ð2)
⢠Also called radial basis function (RBF) kernels
⢠What are ð(ð¥)? Consider the un-normalized versionðâ² ð¥, ð¥â² = exp(ð¥ðð¥â²/ð2)
⢠Power series expansion:
ðâ² ð¥, ð¥â² =
ð
+âð¥ðð¥â² ð
ððð!
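A quick numerical check of the power series expansion, assuming NumPy; the bandwidth and points are illustrative:

```python
import numpy as np
from math import factorial

sigma = 1.5
x, xp = np.array([0.3, -0.2]), np.array([0.1, 0.4])

z = x @ xp / sigma**2
exact = np.exp(z)                                         # un-normalized kernel k'(x, x')
truncated = sum(z**i / factorial(i) for i in range(10))   # first 10 terms of the series
print(exact, truncated)  # nearly identical; the full expansion is infinite-dimensional
```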
Mercer's condition for kernels
⢠Theorem: ð ð¥, ð¥â² has expansion
ð ð¥, ð¥â² =
ð
+â
ðððð ð¥ ðð(ð¥â²)
if and only if for any function ð(ð¥),
â« â« ð ð¥ ð ð¥â² ð ð¥, ð¥â² ðð¥ðð¥â² ⥠0
(Omit some math conditions for ð and ð)
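A finite-sample analogue of Mercer's condition that is easy to check numerically (an illustration, not the theorem itself): on any finite set of points, the Gram matrix $K_{ij} = k(x_i, x_j)$ must be positive semidefinite. A sketch assuming NumPy, with random illustrative data:

```python
import numpy as np

def rbf(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

# Gram matrix on arbitrary points; its eigenvalues should all be >= 0.
X = np.random.randn(20, 3)
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])
print(np.min(np.linalg.eigvalsh(K)))  # nonnegative up to numerical error
```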
Constructing new kernels
⢠Kernels are closed under positive scaling, sum, product, pointwise limit, and composition with a power series Ïð
+âðððð(ð¥, ð¥â²)
⢠Example: ð1 ð¥, ð¥â² , ð2 ð¥, ð¥â² are kernels, then also is
ð ð¥, ð¥â² = 2ð1 ð¥, ð¥â² + 3ð2 ð¥, ð¥â²
⢠Example: ð1 ð¥, ð¥â² is kernel, then also is
ð ð¥, ð¥â² = exp(ð1 ð¥, ð¥â² )
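A sketch of the two examples above, assuming NumPy; the base kernels and data are illustrative. The minimum eigenvalue of each Gram matrix stays nonnegative, as the closure rules predict:

```python
import numpy as np

def k1(x, xp):  # linear kernel
    return x @ xp

def k2(x, xp, sigma=1.0):  # RBF kernel
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def k_sum(x, xp):           # 2*k1 + 3*k2
    return 2 * k1(x, xp) + 3 * k2(x, xp)

def k_exp(x, xp):           # exp(k1)
    return np.exp(k1(x, xp))

X = np.random.randn(15, 2)
for k in (k_sum, k_exp):
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    print(np.min(np.linalg.eigvalsh(K)))  # >= 0 up to numerical error
```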
Kernels vs. neural networks
[Figure: an input $x$ goes through feature extraction (e.g. a color histogram over red, green, blue) to give $\phi(x)$, and the hypothesis is built as $y = w^T \phi(x)$; the diagram contrasts a linear model with a nonlinear model and notes that the features are part of the model]
Polynomial kernels
Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar
Polynomial kernel SVM as a two-layer neural network
[Figure: a degree-2 polynomial-kernel SVM drawn as a two-layer network. Inputs $x_1, x_2$ feed a fixed first layer computing the features $x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2,\ \sqrt{2c}\, x_1,\ \sqrt{2c}\, x_2,\ c$, i.e. $\phi(x)$; the output is $y = \mathrm{sign}(w^T \phi(x) + b)$.]
The first layer is fixed. If the first layer is also learned, the model becomes a two-layer neural network.
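A sketch of this correspondence, assuming scikit-learn; the XOR-style data, $C = 10$, and $c = 1$ are illustrative choices. A polynomial-kernel SVM and a linear SVM trained on the explicit fixed "first layer" $\phi(x)$ give the same decision function:

```python
import numpy as np
from sklearn.svm import SVC  # assumed available

def phi(X, c=1.0):
    """Fixed 'first layer': explicit degree-2 polynomial features for 2-d inputs."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     np.full_like(x1, c)], axis=1)

# XOR-style data: not linearly separable in the original space.
X = np.array([[1., 1.], [-1., -1.], [1., -1.], [-1., 1.]])
y = np.array([+1, +1, -1, -1])

kernel_svm = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0, C=10.0).fit(X, y)
two_layer = SVC(kernel='linear', C=10.0).fit(phi(X), y)

X_test = np.array([[0.8, 0.9], [0.7, -0.6]])
print(kernel_svm.decision_function(X_test))
print(two_layer.decision_function(phi(X_test)))  # matches up to solver tolerance
```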