Convex Optimization
Prof. Nati Srebro
Lecture 18: Course Summary
About the Course
• Methods for solving convex optimization problems, based on oracle access, with guarantees based on their properties
• Understanding different optimization methods
  • Understanding their derivation
  • When are they appropriate
  • Guarantees (a few proofs, not a core component)
• Working and reasoning about opt problems
  • Standard forms: LP, QP, SDP, etc.
  • Optimality conditions
  • Duality
  • Using optimality conditions and duality to reason about x*
Oracles
• 0th, 1st, 2nd, etc. order oracle for function f: x ↦ f(x), ∇f(x), ∇²f(x)
• Separation oracle for constraint 𝒳:
  x ↦ is x ∈ 𝒳? If not, return g s.t. ∀x′ ∈ 𝒳: ⟨g, x − x′⟩ < 0
• Projection oracle for constraint 𝒳 (w.r.t. ‖·‖₂):
  y ↦ argmin_{x∈𝒳} ‖x − y‖₂²
• Linear optimization oracle:
  v ↦ argmin_{x∈𝒳} ⟨v, x⟩
• Prox/projection oracle for function h (w.r.t. ‖·‖₂):
  (y, τ) ↦ argmin_x h(x) + τ‖x − y‖₂²
• Stochastic oracle: x ↦ g s.t. E[g] = ∇f(x)
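As a concrete illustration, the oracle interfaces above can be sketched in code for one specific instance. The following is a hypothetical sketch (the function names are mine, not from the course), using f(x) = ½‖x‖² and the Euclidean unit ball as the constraint set:

```python
import math
import random

# Hypothetical oracle sketches for f(x) = 1/2 * ||x||^2 and the Euclidean
# unit ball X = {x : ||x||_2 <= 1}; all names are illustrative only.

def first_order_oracle(x):
    """1st order oracle: x -> (f(x), grad f(x)); here grad f(x) = x."""
    return 0.5 * sum(xi * xi for xi in x), list(x)

def projection_oracle(y):
    """Projection oracle: y -> argmin_{x in X} ||x - y||_2^2."""
    norm = math.sqrt(sum(yi * yi for yi in y))
    return list(y) if norm <= 1.0 else [yi / norm for yi in y]

def linear_optimization_oracle(v):
    """Linear optimization oracle: v -> argmin_{x in X} <v, x> = -v/||v||."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [-vi / norm for vi in v]

def stochastic_oracle(x):
    """Stochastic oracle: x -> g with E[g] = grad f(x) = x.

    Samples one coordinate and rescales by n, so the estimate is unbiased.
    """
    n = len(x)
    i = random.randrange(n)
    g = [0.0] * n
    g[i] = n * x[i]
    return g
```

For this set, a separation oracle would simply report membership when ‖x‖₂ ≤ 1 and otherwise return the direction x/‖x‖₂ as a separating hyperplane.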
Analysis Assumptions
• f is L-Lipschitz (w.r.t. ‖·‖): |f(x) − f(y)| ≤ L‖x − y‖
• f is M-smooth (w.r.t. ‖·‖):
  f(x + Δx) ≤ f(x) + ⟨∇f(x), Δx⟩ + (M/2)‖Δx‖²
• f is λ-strongly convex (w.r.t. ‖·‖):
  f(x) + ⟨∇f(x), Δx⟩ + (λ/2)‖Δx‖² ≤ f(x + Δx)
• f is self-concordant: ∀x, ∀directions v: |f_v‴(x)| ≤ 2·f_v″(x)^{3/2}
  (f_v = restriction of f to the line through x in direction v)
• f is quadratic (i.e. f‴ = 0)
• ‖x*‖ is small (or domain is bounded)
• Initial sub-optimality ε₀ = f(x⁽⁰⁾) − f* is small
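The smoothness and strong-convexity inequalities above can be checked numerically on a simple quadratic. A minimal sketch (mine, not from the lecture) using f(x) = x₁² + (5/2)x₂², whose Hessian diag(2, 5) gives λ = 2 and M = 5:

```python
import random

# f(x) = x1^2 + (5/2)*x2^2 has Hessian diag(2, 5): lam = 2, M = 5.
# For every x and dx, f(x + dx) must sit between the strong-convexity
# lower bound and the smoothness upper bound.

def f(x):
    return x[0] ** 2 + 2.5 * x[1] ** 2

def grad(x):
    return [2.0 * x[0], 5.0 * x[1]]

def bounds(x, dx, lam=2.0, M=5.0):
    """Return (lower bound, actual value, upper bound) for f(x + dx)."""
    inner = sum(g * d for g, d in zip(grad(x), dx))
    sq = sum(d * d for d in dx)
    lower = f(x) + inner + 0.5 * lam * sq   # strong convexity
    upper = f(x) + inner + 0.5 * M * sq     # smoothness
    actual = f([a + b for a, b in zip(x, dx)])
    return lower, actual, upper

random.seed(0)
all_hold = True
for _ in range(100):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    dx = [random.uniform(-1, 1), random.uniform(-1, 1)]
    low, act, up = bounds(x, dx)
    all_hold = all_hold and low <= act + 1e-12 and act <= up + 1e-12
```

For a quadratic both inequalities are tight up to the Hessian eigenvalues, which is why λ and M are exactly the extreme eigenvalues here.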
Overall runtime
= (runtime of each iteration) × (number of required iterations)
runtime of each iteration = (oracle runtime) + (other operations)
Contrast to Non-Oracle Methods
• Symbolically solving ∇f(x) = 0
• Gaussian elimination
• Simplex method for LPs
Advantages of Oracle Approach
• Handle generic functions (not only of a specific parametric form)
• Makes it clear what functions can be handled and what we need to be able to compute about them
• Complexity in terms of "oracle accesses"
• Can obtain lower bounds on the number of oracle accesses required
Unconstrained Optimization with 1st Order Oracle:
Dimension Independent Guarantees

#iter         λI ≼ ∇²f ≼ MI        ∇²f ≼ MI          ‖∇f‖ ≤ L         ‖∇f‖ ≤ L, λI ≼ ∇²f
GD            (M/λ)·log(1/ε)       M‖x*‖²/ε          L²‖x*‖²/ε²       L²/(λε)
A-GD          √(M/λ)·log(1/ε)      √(M‖x*‖²/ε)       —                —
Lower bound   Ω(√(M/λ)·log(1/ε))   Ω(√(M‖x*‖²/ε))    Ω(L²‖x*‖²/ε²)    Ω(L²/(λε))

Theorem: for any method that uses only 1st order oracle access, there exists an L-Lipschitz function with ‖x*‖ ≤ 1 s.t. at least Ω(L²/ε²) oracle accesses are required for the method to find an ε-suboptimal solution.

Yuri Nesterov
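The (M/λ)·log(1/ε) entry for GD in the table above can be seen on a toy quadratic. A minimal sketch (hypothetical, not from the lecture) of gradient descent with step size 1/M, where the suboptimality contracts by a factor (1 − λ/M) per step:

```python
# Gradient descent on f(x) = x1^2 + 10*x2^2 (lam = 2, M = 20, kappa = 10).
# With step size 1/M the suboptimality contracts geometrically, i.e. linear
# convergence: the (M/lam)*log(1/eps) regime of the table.

def f(x):
    return x[0] ** 2 + 10.0 * x[1] ** 2

def grad(x):
    return [2.0 * x[0], 20.0 * x[1]]

def gradient_descent(x0, step, iters):
    x = list(x0)
    history = [f(x)]                 # suboptimality f(x_k) - f*, since f* = 0
    for _ in range(iters):
        x = [xi - step * gi for xi, gi in zip(x, grad(x))]
        history.append(f(x))
    return x, history

x_final, history = gradient_descent([1.0, 1.0], step=1.0 / 20.0, iters=200)
```

After 200 iterations the suboptimality is far below 1e-8, and it decreases monotonically, consistent with linear (geometric) convergence.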
Unconstrained Optimization with 1st Order Oracle:
Low Dimensional Guarantees

                 #iter            Runtime per iter   Overall runtime
Center of Mass   n·log(1/ε)       ∞                  ∞
Ellipsoid        n²·log(1/ε)      n²                 n⁴·log(1/ε)
Lower bound      Ω(n·log(1/ε))
Overall runtime
= (runtime of each iteration) × (number of required iterations)
Unconstrained Optimization - Practical Methods

           Memory   Per iter        #iter (λI ≼ ∇²f ≼ MI)           #iter (quadratic)
GD         O(n)     ∇f + O(n)       (M/λ)·log(1/ε)                  (M/λ)·log(1/ε)
A-GD       O(n)     ∇f + O(n)       √(M/λ)·log(1/ε)                 √(M/λ)·log(1/ε)
Momentum   O(n)     ∇f + O(n)
Conj-GD    O(n)     ∇f + O(n)                                       √(M/λ)·log(1/ε)
L-BFGS     O(mn)    ∇f + O(mn)
BFGS       O(n²)    ∇f + O(n²)                                      √(M/λ)·log(1/ε)
Newton     O(n²)    ∇²f + O(n³)     (f(x⁰) − f*)/γ + log log(1/ε)   1
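The "log log(1/ε)" quadratic-convergence phase of Newton's method in the last row can be illustrated in one dimension. A hypothetical sketch (not from the lecture) on f(x) = x² − log(x), with minimizer x* = 1/√2:

```python
import math

# Newton's method on the 1-D function f(x) = x^2 - log(x), x > 0,
# with minimizer x* = 1/sqrt(2). Near x* the error roughly squares at each
# step: the "quadratic convergence" phase behind the log log(1/eps) term.

def fprime(x):
    return 2.0 * x - 1.0 / x

def fsecond(x):
    return 2.0 + 1.0 / (x * x)

def newton(x0, iters):
    x = x0
    errors = []
    for _ in range(iters):
        x -= fprime(x) / fsecond(x)       # Newton step: x - f'(x)/f''(x)
        errors.append(abs(x - 1.0 / math.sqrt(2.0)))
    return x, errors

x_newton, errors = newton(0.5, 6)
```

Six iterations already reach machine precision; each recorded error is roughly the square of the previous one, which is exactly why only log log(1/ε) iterations are needed once in this phase.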
Constrained Optimization: min_{x∈𝒳} f(x)
• Projected Gradient Descent
  • Oracle: y ↦ argmin_{x∈𝒳} ‖x − y‖₂²
  • O(n) + projection per iteration
  • Similar iteration complexity to Gradient Descent: poly dependence on M or on L
• Conditional Gradient Descent
  • Oracle: v ↦ argmin_{x∈𝒳} ⟨v, x⟩
  • O(n) + linear optimization per iteration
  • O(M/ε) iterations, but no acceleration and no O(log(1/ε)) rate
• Ellipsoid Method
  • Separation oracle (can handle much more generic 𝒳)
  • O(n²) + separation oracle per iteration
  • O(n²·log(1/ε)) iterations; total runtime O(n⁴·log(1/ε))
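The projection and linear-optimization oracles above can be exercised on a toy problem. A hypothetical sketch (not from the lecture) of projected gradient descent and conditional gradient (Frank-Wolfe) minimizing f(x) = ½‖x − c‖² over the Euclidean unit ball, where c = (3, 4) lies outside the ball, so x* = c/‖c‖ = (0.6, 0.8):

```python
import math

# Minimize f(x) = 1/2*||x - c||^2 over the unit ball with c = (3, 4)
# outside the ball; the constrained optimum is x* = c/||c|| = (0.6, 0.8).

C = [3.0, 4.0]

def grad(x):
    return [xi - ci for xi, ci in zip(x, C)]

def project(y):
    """Projection oracle for the unit ball."""
    n = math.sqrt(sum(t * t for t in y))
    return list(y) if n <= 1.0 else [t / n for t in y]

def lin_opt(v):
    """Linear optimization oracle: argmin_{||s|| <= 1} <v, s> = -v/||v||."""
    n = math.sqrt(sum(t * t for t in v))
    return [-t / n for t in v]

def projected_gd(x0, step=0.5, iters=100):
    x = list(x0)
    for _ in range(iters):
        x = project([xi - step * gi for xi, gi in zip(x, grad(x))])
    return x

def frank_wolfe(x0, iters=200):
    x = list(x0)
    for k in range(iters):
        s = lin_opt(grad(x))              # move toward an extreme point
        gamma = 2.0 / (k + 2.0)           # standard step-size schedule
        x = [(1 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    return x

x_pgd = projected_gd([0.0, 0.0])
x_fw = frank_wolfe([0.0, 0.0])
```

Both methods use only the stated oracle plus O(n) vector arithmetic per iteration, matching the per-iteration costs listed above.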
Constrained Optimization
• Interior Point Methods
  • Only 1st and 2nd order oracle access to f₀, fᵢ
  • Overall #Newton iterations: O(√m·(log(1/ε) + log log(1/δ)))
  • Overall runtime: ≈ √m·(mn + n³ + m ∇²-evaluations)·log(1/ε)
  • Can also handle matrix inequalities (SDPs)
  • Other cones: need a self-concordant barrier
  • "Standard method" for LPs, QPs, SDPs, etc.

min_{x∈ℝⁿ} f₀(x)  s.t. fᵢ(x) ≤ 0,  Ax = b
min_{x∈ℝⁿ} f₀(x)  s.t. Fᵢ(x) ≼ 0,  Ax = b
min_{x∈ℝⁿ} f₀(x)  s.t. −fᵢ(x) ∈ Kᵢ,  Ax = b
Barrier Methods
• Log barrier φₜ(u) = −(1/t)·log(−u):
  min_{x∈ℝⁿ} f₀(x) + Σᵢ₌₁ᵐ φₜ(fᵢ(x))  s.t. Ax = b
• Analysis and Guarantees
• "Central Path"
• Relaxation of KKT conditions
[Figure: the log-barrier function vs. the ideal 0/∞ penalty for the constraints fᵢ(x) ≤ 0, Ax = b, near the optimum x*]
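The central path can be traced explicitly on a toy problem. A hypothetical sketch (not from the lecture) for min x s.t. −x ≤ 0: the barrier problem min_x x − (1/t)·log(x) has minimizer x(t) = 1/t, which slides toward the optimum x* = 0 as t grows:

```python
# Toy barrier problem: min x s.t. -x <= 0, i.e. f0(x) = x, f1(x) = -x.
# Barrier objective: x - (1/t)*log(x); its minimizer is x(t) = 1/t.

def central_path_point(t, x0=1e-3, iters=50):
    """Minimize x - (1/t)*log(x) over x > 0 by 1-D Newton steps.

    Starting with 0 < x0 < 2/t keeps the pure Newton iterate in the domain.
    """
    x = x0
    for _ in range(iters):
        g = 1.0 - 1.0 / (t * x)      # first derivative
        h = 1.0 / (t * x * x)        # second derivative (always > 0)
        x = x - g / h
    return x

# Points on the central path for increasing t: x(t) -> x* = 0.
path = [(t, central_path_point(t)) for t in (1.0, 10.0, 100.0)]
```

The barrier multiplier λ(t) = 1/(t·x(t)) then satisfies the relaxed complementarity λ(t)·f₁(x(t)) = −1/t, the "= −1/t" relaxation of the KKT conditions.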
Optimality Conditions (assuming Strong Duality)

x* optimal for (P) and (λ*, ν*) optimal for (D) iff the KKT Conditions hold:
• fᵢ(x*) ≤ 0   ∀i = 1…m
• hᵢ(x*) = 0   ∀i = 1…p
• λᵢ* ≥ 0      ∀i = 1…m
• ∇ₓL(x*, λ*, ν*) = 0
• λᵢ*·fᵢ(x*) = 0   ∀i = 1…m

William Karush, Harold Kuhn, Albert Tucker
Optimality Conditions for the Problem with Log Barrier

x* optimal for (Pₜ) and λ* optimal for (Dₜ) iff the relaxed KKT Conditions hold:
• fᵢ(x*) ≤ 0   ∀i = 1…m
• hᵢ(x*) = 0   ∀i = 1…p
• λᵢ* ≥ 0      ∀i = 1…m
• ∇ₓL(x*, λ*, ν*) = 0
• λᵢ*·fᵢ(x*) = −1/t   ∀i = 1…m
[Timeline]
1939  Leonid Kantorovich – Formulation of LP
1947  George Dantzig – Simplex
1951  Herbert Robbins – SGD (w/ Monro); KKT
1956  Marguerite Frank & Philip Wolfe – Conditional GD
1972  Ellipsoid Method
1978  Arkadi Nemirovski – Non-smooth SGD + analysis
1979  Leonid Khachiyan – Ellipsoid is poly time for LP
1984  Narendra Karmarkar – IP for LP
1994  Yuri Nesterov & Arkadi Nemirovski – General IP
From Interior Point Methods Back to First Order Methods
• Interior Point Methods (and before them Ellipsoid):
  • poly(n)·log(1/ε)
  • LP in time polynomial in size of input (number of bits in coefficients)
  • But in large scale applications, O(n³·⁵) is too high
• First order (and stochastic) methods:
  • Better dependence on n
  • poly(1/ε) (or poly(1/λ))

Overall runtime
= (runtime of each iteration) × (number of required iterations)
runtime of each iteration = (oracle runtime) + (other operations)
Example: ℓ1 regression  f(x) = Σᵢ₌₁ᵐ |⟨aᵢ, x⟩ − bᵢ|
• Option 1: Cast as constrained optimization
  min_{x∈ℝⁿ, t∈ℝᵐ} Σᵢ tᵢ  s.t. −tᵢ ≤ ⟨aᵢ, x⟩ − bᵢ ≤ tᵢ
  • Runtime: O(√m·(mn + n³)·log(1/ε))
• Option 2: Gradient Descent
  • m‖a‖²‖x*‖²/ε² iterations
  • O(mn) per iteration → overall O(mn·1/ε²)
• Option 3: Stochastic Gradient Descent
  • m‖a‖²‖x*‖²/ε² iterations
  • O(n) per iteration → overall O(n·1/ε²)
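Options 2 and 3 can be sketched on a tiny synthetic instance. The following is a hypothetical sketch (not from the lecture); the data are planted so that x* = (1, −2) makes every residual, and hence f*, zero:

```python
import random

# Planted l1-regression instance: residuals vanish at x* = (1, -2), so f* = 0.
random.seed(1)
XSTAR = [1.0, -2.0]
A = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(50)]
B = [a[0] * XSTAR[0] + a[1] * XSTAR[1] for a in A]
DATA = list(zip(A, B))
m = len(A)

def sign(u):
    return (u > 0) - (u < 0)

def objective(x):
    return sum(abs(a[0] * x[0] + a[1] * x[1] - b) for a, b in DATA)

def gd_best(iters=2000, c=0.05):
    """Option 2: full subgradient descent, steps c/sqrt(k); best value found."""
    x = [0.0, 0.0]
    best = objective(x)
    for k in range(1, iters + 1):
        g = [0.0, 0.0]
        for a, b in DATA:                       # full subgradient: O(mn)/iter
            s = sign(a[0] * x[0] + a[1] * x[1] - b)
            g[0] += s * a[0]
            g[1] += s * a[1]
        eta = c / k ** 0.5
        x = [x[0] - eta * g[0], x[1] - eta * g[1]]
        best = min(best, objective(x))          # monitoring only
    return best

def sgd_best(iters=20000, c=0.03):
    """Option 3: stochastic subgradient from one sampled term, O(n)/iter."""
    x = [0.0, 0.0]
    best = objective(x)
    for k in range(1, iters + 1):
        a, b = DATA[random.randrange(m)]
        s = sign(a[0] * x[0] + a[1] * x[1] - b)
        eta = c / k ** 0.5
        # m * s * a_i is an unbiased estimate of a full subgradient
        x = [x[0] - eta * m * s * a[0], x[1] - eta * m * s * a[1]]
        best = min(best, objective(x))          # monitoring only
    return best

f0 = objective([0.0, 0.0])
gd_val = gd_best()
sgd_val = sgd_best()
```

The step sizes c are tuned by hand for this instance; both methods only guarantee the slow 1/√k decay of the table, so the assertions below check for substantial (not exact) progress.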
Remember!
• The gradient is in the dual space: don't take the mapping for granted!
[Diagram: primal space ℬ and dual space ℬ*, with ∇f mapping into ℬ* and ∇Ψ, ∇Ψ⁻¹ mapping between the two spaces]
Some things we didn't cover
• Technology for unconstrained optimization
  • Different quasi-Newton and conjugate gradient methods
  • Different line searches
• Technology for Interior Point Primal-Dual methods
• Numerical issues
• Recent advances in first order and stochastic methods
  • Mirror Descent, different geometries, adaptive geometries
  • Decomposable functions, partial linearization and acceleration
• Faster methods for problems with specific forms or properties
  • Message passing for solving LPs
  • Flows and network problems
• Distributed optimization
Current Trends
• Small scale problems:
  • Super-efficient and reliable solvers for use in real-time
• Very large scale problems:
  • Stochastic first order methods
  • Linear time, one-pass or few-pass
• Relationship to Machine Learning
Other Courses
• Winter 2016: Non-Linear Optimization (Mihai Anitescu, Stats)
• Winter 2016: Networks I: Introduction to Modeling and Analysis (Ozan Candogan, Booth)
• 2016/17: Computational and Statistical Learning Theory (TTIC)
Final
• Tuesday 10:30am
• Straightforward questions
• Allowed: anything hand-written by you (not photo-copied)
• In depth: up to and including Interior Point Methods
  • Unconstrained Optimization
  • Duality
  • Optimality Conditions
  • Interior Point Methods
  • Phase I Methods and Feasibility Problems
• Superficially (a few multiple choice or similar):
  • Other first order and stochastic methods
  • Lower Bounds
  • Center of Mass / Ellipsoid
  • Other approaches