Convex Optimization
Prof. Nati Srebro
Lecture 18: Course Summary

Transcript of ttic.uchicago.edu/~nati/Teaching/TTIC31070/2015/Lecture18.pdf (26 pages)

Page 1

Convex Optimization

Prof. Nati Srebro

Lecture 18: Course Summary

Page 2

About the Course

• Methods for solving convex optimization problems, based on oracle access, with guarantees based on their properties
• Understanding different optimization methods
  • Understanding their derivation
  • When they are appropriate
  • Guarantees (a few proofs, not a core component)
• Working with and reasoning about optimization problems
  • Standard forms: LP, QP, SDP, etc.
  • Optimality conditions
  • Duality
  • Using optimality conditions and duality to reason about $x^*$

Page 3

Oracles

• 0th, 1st, 2nd, etc. order oracle for a function $f$: $x \mapsto f(x),\ \nabla f(x),\ \nabla^2 f(x)$
• Separation oracle for a constraint set $\mathcal{X}$: $x \mapsto$ either "$x \in \mathcal{X}$", or $g$ s.t. $\mathcal{X} \subset \{x' \mid \langle g, x - x' \rangle < 0\}$
• Projection oracle for a constraint set $\mathcal{X}$ (w.r.t. $\|\cdot\|_2$): $y \mapsto \arg\min_{x \in \mathcal{X}} \|x - y\|_2^2$
• Linear optimization oracle: $v \mapsto \arg\min_{x \in \mathcal{X}} \langle v, x \rangle$
• Prox/projection oracle for a function $h$ (w.r.t. $\|\cdot\|_2$): $(y, \lambda) \mapsto \arg\min_x\ h(x) + \lambda \|x - y\|_2^2$
• Stochastic oracle: $x \mapsto g$ s.t. $\mathbb{E}[g] = \nabla f(x)$
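As an illustration (not from the slides), here is a minimal Python sketch of two of these oracles for a least-squares objective $f(x) = \frac{1}{2m}\|Ax - b\|_2^2$: a Euclidean projection oracle onto an $\ell_2$ ball and a stochastic oracle that returns an unbiased gradient estimate from a single random row. The radius `r` and the data `A, b` are hypothetical.

```python
import numpy as np

def project_l2_ball(y, r=1.0):
    """Projection oracle for X = {x : ||x||_2 <= r}: argmin_{x in X} ||x - y||_2^2."""
    norm = np.linalg.norm(y)
    return y if norm <= r else (r / norm) * y

def stochastic_grad(A, b, x, rng):
    """Stochastic oracle: unbiased estimate of grad f(x), f(x) = (1/2m)||Ax - b||^2."""
    i = rng.integers(A.shape[0])          # sample one row uniformly
    return A[i] * (A[i] @ x - b[i])       # E[g] = (1/m) A^T (Ax - b) = grad f(x)

# tiny usage example with synthetic data
rng = np.random.default_rng(0)
A, b = rng.standard_normal((100, 5)), rng.standard_normal(100)
x = project_l2_ball(rng.standard_normal(5), r=1.0)
g = stochastic_grad(A, b, x, rng)
```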

Page 4

Analysis Assumptions

• $f$ is $L$-Lipschitz (w.r.t. $\|\cdot\|$): $|f(x) - f(y)| \le L \|x - y\|$
• $f$ is $M$-smooth (w.r.t. $\|\cdot\|$): $f(x + \Delta x) \le f(x) + \langle \nabla f(x), \Delta x \rangle + \frac{M}{2}\|\Delta x\|^2$
• $f$ is $\mu$-strongly convex (w.r.t. $\|\cdot\|$): $f(x) + \langle \nabla f(x), \Delta x \rangle + \frac{\mu}{2}\|\Delta x\|^2 \le f(x + \Delta x)$
• $f$ is self-concordant: for all $x$ and all directions $v$, $|f_v'''(x)| \le 2\, f_v''(x)^{3/2}$
• $f$ is quadratic (i.e. $f''' = 0$)
• $\|x^*\|$ is small (or the domain is bounded)
• Initial sub-optimality $\epsilon_0 = f(x^{(0)}) - p^*$ is small
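To make the constants concrete, here is a small sketch (an illustration, not from the lecture) for a quadratic $f(x) = \frac{1}{2}x^\top A x - b^\top x$, where $\nabla^2 f = A$, so $M = \lambda_{\max}(A)$ and $\mu = \lambda_{\min}(A)$; it also numerically spot-checks the smoothness and strong convexity inequalities. The matrix `A` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = B.T @ B + np.eye(5)             # symmetric positive definite Hessian
b = rng.standard_normal(5)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

eigs = np.linalg.eigvalsh(A)
mu, M = eigs[0], eigs[-1]           # strong convexity and smoothness constants

x, d = rng.standard_normal(5), rng.standard_normal(5)
# smoothness: f(x + d) <= f(x) + <grad f(x), d> + (M/2)||d||^2
assert f(x + d) <= f(x) + grad(x) @ d + 0.5 * M * d @ d + 1e-9
# strong convexity: f(x) + <grad f(x), d> + (mu/2)||d||^2 <= f(x + d)
assert f(x) + grad(x) @ d + 0.5 * mu * d @ d <= f(x + d) + 1e-9
print(f"kappa = M/mu = {M/mu:.2f}")
```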

Page 5

Overall runtime

Overall runtime = ( runtime of each iteration ) × ( number of required iterations )

Runtime of each iteration ≈ ( oracle runtime ) + ( other operations )

Page 6

Contrast to Non-Oracle Methods

• Symbolically solving $\nabla f(x) = 0$
• Gaussian elimination
• Simplex method for LPs

Page 7

Advantages of Oracle Approach

• Handle generic functions (not only of a specific parametric form)
• Makes it clear what functions can be handled and what we need to be able to compute about them
• Complexity in terms of "oracle accesses"
• Can obtain lower bounds on the number of oracle accesses required

Page 8

Unconstrained Optimization with 1st Order Oracle: Dimension-Independent Guarantees

Number of iterations needed to reach an $\epsilon$-suboptimal point, under different assumptions on $f$:

• $\mu I \preceq \nabla^2 f \preceq M I$ (smooth and strongly convex): GD: $\frac{M}{\mu}\log(1/\epsilon)$; A-GD: $\sqrt{M/\mu}\,\log(1/\epsilon)$; lower bound: $\Omega\big(\sqrt{M/\mu}\,\log(1/\epsilon)\big)$
• $\nabla^2 f \preceq M I$ (smooth): GD: $\frac{M\|x^*\|^2}{\epsilon}$; A-GD: $\sqrt{\frac{M\|x^*\|^2}{\epsilon}}$; lower bound: $\Omega\Big(\sqrt{\frac{M\|x^*\|^2}{\epsilon}}\Big)$
• $\|\nabla f\| \le L$ (Lipschitz): GD: $\frac{L^2\|x^*\|^2}{\epsilon^2}$; lower bound: $\Omega\big(\frac{L^2\|x^*\|^2}{\epsilon^2}\big)$
• $\|\nabla f\| \le L$ and $\mu I \preceq \nabla^2 f$ (Lipschitz and strongly convex): GD: $\frac{L^2}{\mu\epsilon}$; lower bound: $\Omega\big(\frac{L^2}{\mu\epsilon}\big)$

Theorem: for any method that uses only 1st order oracle access, there exists an $L$-Lipschitz function with $\|x^*\| \le 1$ such that at least $\Omega\big(\frac{L^2}{\epsilon^2}\big)$ oracle accesses are required for the method to find an $\epsilon$-suboptimal solution.

[pictured: Yuri Nesterov]
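As a hedged illustration of the smooth, strongly convex column (not from the slides): on a badly conditioned quadratic, plain GD with step size $1/M$ needs on the order of $\frac{M}{\mu}\log(1/\epsilon)$ iterations, while accelerated gradient improves this to roughly $\sqrt{M/\mu}\,\log(1/\epsilon)$. The sketch below uses the common convex-case momentum schedule and a hypothetical test problem; it simply shows the gap after a fixed budget of iterations.

```python
import numpy as np

def gd(grad, x0, M, iters):
    """Plain gradient descent with the standard 1/M step size."""
    x = x0.copy()
    for _ in range(iters):
        x -= grad(x) / M
    return x

def agd(grad, x0, M, iters):
    """Nesterov's accelerated gradient (smooth convex version)."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_next = y - grad(y) / M
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_next + ((t - 1) / t_next) * (x_next - x)   # momentum step
        x, t = x_next, t_next
    return x

# ill-conditioned quadratic f(x) = 1/2 x^T A x (hypothetical test problem)
A = np.diag(np.linspace(1e-3, 1.0, 50))
grad = lambda x: A @ x
f = lambda x: 0.5 * x @ A @ x
x0, M = np.ones(50), 1.0
print("GD  suboptimality:", f(gd(grad, x0, M, 500)))
print("AGD suboptimality:", f(agd(grad, x0, M, 500)))
```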

Page 9

Unconstrained Optimization with 1st Order Oracle: Low-Dimensional Guarantees

Method: #iter; runtime per iteration; overall runtime
• Center of Mass: $n \log(1/\epsilon)$; *; *
• Ellipsoid: $n^2 \log(1/\epsilon)$; $n^2$; $n^4 \log(1/\epsilon)$
• Lower bound: $\Omega(n \log(1/\epsilon))$
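For concreteness (an illustrative sketch, not from the slides), one iteration of the ellipsoid method keeps an ellipsoid $E = \{x : (x-c)^\top P^{-1}(x-c) \le 1\}$ and shrinks it with a (sub)gradient cut through the center; the $O(n^2)$ per-iteration cost is the rank-one update of $P$. The test problem below is hypothetical.

```python
import numpy as np

def ellipsoid_step(c, P, g):
    """One central-cut ellipsoid update: keep the half-ellipsoid where <g, x - c> <= 0."""
    n = len(c)
    g_tilde = g / np.sqrt(g @ P @ g)              # normalize the cut direction
    Pg = P @ g_tilde
    c_new = c - Pg / (n + 1)                      # move the center away from the cut
    P_new = (n * n / (n * n - 1.0)) * (P - (2.0 / (n + 1)) * np.outer(Pg, Pg))
    return c_new, P_new

# minimize f(x) = ||x - x_star||^2 starting from a ball of radius R (hypothetical example)
n, R = 4, 10.0
x_star = np.array([1.0, -2.0, 0.5, 3.0])
c, P = np.zeros(n), (R ** 2) * np.eye(n)
for _ in range(200):
    g = 2 * (c - x_star)                          # gradient at the current center
    if np.linalg.norm(g) < 1e-9:
        break
    c, P = ellipsoid_step(c, P, g)
print("center after 200 steps:", np.round(c, 3))
```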

Page 10

Overall runtime

Overall runtime = ( runtime of each iteration ) × ( number of required iterations )

Page 11

Unconstrained Optimization: Practical Methods

Method: memory; cost per iteration; #iter ($\mu I \preceq \nabla^2 f \preceq M I$); #iter (quadratic)
• GD: $O(n)$; $\nabla f$ + $O(n)$; $\kappa \log(1/\epsilon)$; $\kappa \log(1/\epsilon)$
• A-GD: $O(n)$; $\nabla f$ + $O(n)$; $\sqrt{\kappa}\,\log(1/\epsilon)$; $\sqrt{\kappa}\,\log(1/\epsilon)$
• Momentum: $O(n)$; $\nabla f$ + $O(n)$
• Conj-GD: $O(n)$; $\nabla f$ + $O(n)$; $\sqrt{\kappa}\,\log(1/\epsilon)$ on quadratics
• L-BFGS: $O(rn)$; $\nabla f$ + $O(rn)$
• BFGS: $O(n^2)$; $\nabla f$ + $O(n^2)$; $\kappa\log(1/\epsilon)$
• Newton: $O(n^2)$; $\nabla^2 f$ + $O(n^3)$; $\frac{f(x^{(0)}) - p^*}{\gamma} + \log\log(1/\epsilon)$; $1$
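To illustrate the memory/runtime trade-off of the quasi-Newton rows (an illustration, not from the slides), SciPy's L-BFGS implementation exposes the history size $r$ through the `maxcor` option, storing $O(rn)$ numbers instead of the $O(n^2)$ of full BFGS. The sketch assumes SciPy is available and uses a hypothetical least-squares objective.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
A, b = rng.standard_normal((200, 50)), rng.standard_normal(200)

def f(x):
    r = A @ x - b
    return 0.5 * r @ r

def grad(x):
    return A.T @ (A @ x - b)

# L-BFGS with memory r = 10
res = minimize(f, np.zeros(50), jac=grad, method="L-BFGS-B",
               options={"maxcor": 10, "maxiter": 500})
print(res.fun, res.nit)
```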

Page 12

Constrained Optimization: $\min_{x \in \mathcal{X}} f(x)$

• Projected Gradient Descent
  • Oracle: $y \mapsto \arg\min_{x \in \mathcal{X}} \|x - y\|_2^2$
  • $O(n)$ + one projection per iteration
  • Similar iteration complexity to gradient descent: poly dependence on $1/\epsilon$ or on $\kappa$
• Conditional Gradient Descent
  • Oracle: $v \mapsto \arg\min_{x \in \mathcal{X}} \langle v, x \rangle$
  • $O(n)$ + one linear optimization per iteration
  • $O(M/\epsilon)$ iterations, but no acceleration, no $O(\kappa \log(1/\epsilon))$
• Ellipsoid Method
  • Separation oracle (can handle much more generic $\mathcal{X}$)
  • $O(n^2)$ + one separation oracle call per iteration
  • $O(n^2 \log(1/\epsilon))$ iterations; total runtime $O(n^4 \log(1/\epsilon))$

Page 13

Constrained Optimization

• Interior Point Methods
  • Only 1st and 2nd order oracle access to $f_0, f_i$
  • Overall #Newton iterations: $O\big(\sqrt{m}\,(\log(1/\epsilon) + \log\log(1/\delta))\big)$
  • Overall runtime: $\approx O\big(\sqrt{m}\,\big((n+p)^3 + m\ \nabla^2\ \text{evals}\big)\log(1/\epsilon)\big)$
  • Can also handle matrix inequalities (SDPs)
  • Other cones: need a self-concordant barrier
  • The "standard method" for LPs, QPs, SDPs, etc.

Problem forms handled (increasingly general):
• $\min_{x \in \mathbb{R}^n} f_0(x)$ s.t. $f_i(x) \le 0$, $Ax = b$
• $\min_{x \in \mathbb{R}^n} f_0(x)$ s.t. $f_i(x) \preceq 0$, $Ax = b$ (matrix inequalities)
• $\min_{x \in \mathbb{R}^n} f_0(x)$ s.t. $-f_i(x) \in K_i$, $Ax = b$ (generalized conic inequalities)

Page 14

Barrier Methods

• Log barrier $I_t(u) = -\frac{1}{t}\log(-u)$: a smooth barrier-function approximation of the ideal penalty function ($0$ for $u \le 0$, $\infty$ for $u > 0$)
• Analysis and guarantees
• "Central path"
• Relaxation of the KKT conditions

Barrier problem: $\min_{x \in \mathbb{R}^n}\ f_0(x) + \sum_{i=1}^{m} I\big(f_i(x)\big)$ s.t. $Ax = b$, approximating the original constrained problem $f_i(x) \le 0$, $Ax = b$ with optimum $x^*$.

[figure: barrier function vs. ideal penalty function]
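Below is a minimal sketch (illustrative, not the lecture's implementation) of the basic barrier method for an inequality-constrained LP $\min c^\top x$ s.t. $Ax \le b$ with no equality constraints: for an increasing sequence of $t$ it minimizes $t\,c^\top x - \sum_i \log(b_i - a_i^\top x)$ with damped Newton steps, tracing the central path. The problem data are hypothetical.

```python
import numpy as np

def newton_barrier(c, A, b, x, t, inner_iters=50):
    """Minimize t*c^T x - sum_i log(b_i - a_i^T x) by damped Newton steps."""
    for _ in range(inner_iters):
        s = b - A @ x                                  # slacks, must stay positive
        g = t * c + A.T @ (1.0 / s)                    # gradient of the barrier objective
        H = A.T @ np.diag(1.0 / s**2) @ A              # Hessian of the barrier objective
        dx = np.linalg.solve(H, -g)
        step = 1.0
        while np.min(b - A @ (x + step * dx)) <= 0:    # damping: stay strictly feasible
            step *= 0.5
        x = x + 0.5 * step * dx
    return x

def barrier_method(c, A, b, x0, t0=1.0, mu=10.0, outer_iters=6):
    """Follow the central path: increase t by a factor mu each outer iteration."""
    x, t = x0.copy(), t0
    for _ in range(outer_iters):
        x = newton_barrier(c, A, b, x, t)
        t *= mu
    return x

# hypothetical strictly feasible LP instance (x = 0 is strictly feasible, box keeps it bounded)
rng = np.random.default_rng(5)
A = np.vstack([rng.standard_normal((20, 5)), np.eye(5), -np.eye(5)])
b = np.concatenate([np.ones(20), np.ones(10)])
c = rng.standard_normal(5)
x = barrier_method(c, A, b, np.zeros(5))
print("approx optimum:", np.round(x, 3))
```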

Page 15

Optimality Conditions (assuming Strong Duality)

$x^*$ is optimal for (P) and $(\lambda^*, \nu^*)$ is optimal for (D) if and only if the KKT conditions hold:

• $f_i(x^*) \le 0$ for all $i = 1, \ldots, m$ (primal feasibility)
• $h_j(x^*) = 0$ for all $j = 1, \ldots, p$ (primal feasibility)
• $\lambda_i^* \ge 0$ for all $i = 1, \ldots, m$ (dual feasibility)
• $\nabla_x L(x^*, \lambda^*, \nu^*) = 0$ (stationarity)
• $\lambda_i^* f_i(x^*) = 0$ for all $i = 1, \ldots, m$ (complementary slackness)

[pictured: William Karush, Harold Kuhn, Albert Tucker]
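As a quick worked example (illustrative, not from the slides): for $\min_x x^2$ s.t. $x \ge 1$, write the constraint as $f_1(x) = 1 - x \le 0$ and the Lagrangian $L(x, \lambda) = x^2 + \lambda(1 - x)$. Stationarity gives $2x - \lambda = 0$, so $\lambda = 2x$. Complementary slackness $\lambda(1 - x) = 0$ then rules out an inactive constraint (it would force $\lambda = 0$, hence $x = 0$, violating $x \ge 1$), so the constraint is active: $x^* = 1$ and $\lambda^* = 2 \ge 0$. All conditions hold, certifying $x^* = 1$.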

Page 16

Optimality Conditions for the Problem with Log Barrier

$x^*$ is optimal for $(P_t)$ and $\nu^*$ is optimal for $(D_t)$ if and only if the relaxed KKT conditions hold:

• $f_i(x^*) \le 0$ for all $i = 1, \ldots, m$
• $h_j(x^*) = 0$ for all $j = 1, \ldots, p$
• $\lambda_i^* \ge 0$ for all $i = 1, \ldots, m$
• $\nabla_x L(x^*, \lambda^*, \nu^*) = 0$
• $\lambda_i^* f_i(x^*) = -1/t$ for all $i = 1, \ldots, m$ (complementary slackness relaxed: the product is $-1/t$ instead of $0$)

Page 17

Timeline (reassembled from the slide's figure):

• 1939: Leonid Kantorovich, formulation of LP
• 1947: George Dantzig, the Simplex method
• KKT conditions (William Karush 1939; Harold Kuhn and Albert Tucker 1951)
• 1951: Herbert Robbins, SGD (w/ Monro)
• 1956: Marguerite Frank and Philip Wolfe, Conditional GD
• 1972: the Ellipsoid Method
• 1978: non-smooth SGD + analysis
• 1979: Leonid Khachiyan, the Ellipsoid Method is polynomial time for LP
• 1984: Narendra Karmarkar, interior point methods for LP
• 1994: Yuri Nesterov and Arkadi Nemirovski, general interior point methods

Page 18

From Interior Point Methods Back to First Order Methods

• Interior Point Methods (and, before them, Ellipsoid):
  • $O(\mathrm{poly}(n)\,\log(1/\epsilon))$
  • LP in time polynomial in the size of the input (number of bits in the coefficients)
  • But in large-scale applications, $O(n^{3.5})$ is too high
• First order (and stochastic) methods:
  • Better dependence on $n$
  • $\mathrm{poly}(1/\epsilon)$ (or $\mathrm{poly}(\kappa)$)

Page 19

Overall runtime

Overall runtime = ( runtime of each iteration ) × ( number of required iterations )

Runtime of each iteration ≈ ( oracle runtime ) + ( other operations )

Page 20

Example: $\ell_1$ Regression

$f(x) = \sum_{i=1}^{m} \big|\langle a_i, x \rangle - b_i\big|$

• Option 1: cast as constrained optimization
  $\min_{x \in \mathbb{R}^n,\ t \in \mathbb{R}^m} \sum_i t_i$ s.t. $-t_i \le \langle a_i, x \rangle - b_i \le t_i$
  • Runtime: $O\big(\sqrt{m}\,(n + m)^3 \log(1/\epsilon)\big)$
• Option 2: Gradient Descent
  • $O\big(\|a\|^2 \|x^*\|^2 / \epsilon^2\big)$ iterations
  • $O(nm)$ per iteration; overall $O\big(nm \cdot \frac{1}{\epsilon^2}\big)$
• Option 3: Stochastic Gradient Descent
  • $O\big(\|a\|^2 \|x^*\|^2 / \epsilon^2\big)$ iterations
  • $O(n)$ per iteration; overall $O\big(n \cdot \frac{1}{\epsilon^2}\big)$
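A minimal sketch of Option 3 (illustrative, with hypothetical data): a subgradient of $|\langle a_i, x\rangle - b_i|$ is $\mathrm{sign}(\langle a_i, x\rangle - b_i)\,a_i$, so each stochastic step touches one row and costs $O(n)$.

```python
import numpy as np

def sgd_l1_regression(A, b, steps, eta):
    """Stochastic subgradient descent on f(x) = sum_i |<a_i, x> - b_i| (returns averaged iterate)."""
    rng = np.random.default_rng(0)
    m, n = A.shape
    x, x_avg = np.zeros(n), np.zeros(n)
    for t in range(steps):
        i = rng.integers(m)                        # one random term: O(n) work per step
        g = np.sign(A[i] @ x - b[i]) * A[i]        # subgradient of |<a_i, x> - b_i|
        x -= eta * g
        x_avg += (x - x_avg) / (t + 1)             # running average of iterates
    return x_avg

# hypothetical data
rng = np.random.default_rng(6)
A = rng.standard_normal((1000, 20))
b = A @ rng.standard_normal(20) + 0.1 * rng.standard_normal(1000)
x_hat = sgd_l1_regression(A, b, steps=20000, eta=0.01)
print("objective:", np.abs(A @ x_hat - b).sum())
```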

Page 21

Remember!

• The gradient lives in the dual space: don't take the mapping between the spaces for granted!

$x \in \mathbb{R}^n \;\overset{\nabla f}{\longmapsto}\; \nabla f(x) \in (\mathbb{R}^n)^*$, with $\nabla \Psi$ carrying points from the primal space to the dual space and $\nabla \Psi^{-1}$ carrying them back.
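To make the diagram concrete (an illustration, not from the slides): a mirror-descent-style update maps $x$ to the dual space with $\nabla\Psi$, takes the gradient step there, and maps back with $(\nabla\Psi)^{-1}$. With $\Psi(x) = \frac{1}{2}\|x\|_2^2$ both maps are the identity and we recover plain gradient descent; with the (unnormalized) entropy $\Psi(x) = \sum_i x_i \log x_i$ the update becomes multiplicative.

```python
import numpy as np

def mirror_step(x, g, eta, grad_psi, grad_psi_inv):
    """x_next = (grad Psi)^{-1}( grad Psi(x) - eta * g ): the gradient step is taken in the dual space."""
    return grad_psi_inv(grad_psi(x) - eta * g)

# Psi(x) = 1/2 ||x||^2: grad Psi is the identity, so this is ordinary gradient descent
euclidean = dict(grad_psi=lambda x: x, grad_psi_inv=lambda th: th)

# Psi(x) = sum_i x_i log x_i (x > 0): grad Psi(x) = 1 + log x, inverse theta -> exp(theta - 1),
# giving the multiplicative update x_i <- x_i * exp(-eta * g_i)
entropy = dict(grad_psi=lambda x: 1 + np.log(x), grad_psi_inv=lambda th: np.exp(th - 1))

x = np.array([0.25, 0.25, 0.5])
g = np.array([1.0, -1.0, 0.0])                     # a hypothetical gradient
print(mirror_step(x, g, 0.1, **euclidean))
print(mirror_step(x, g, 0.1, **entropy))
```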

Page 22

Some things we didn't cover

• Technology for unconstrained optimization
  • Different quasi-Newton and conjugate gradient methods
  • Different line searches
• Technology for interior point primal-dual methods
• Numerical issues
• Recent advances in first order and stochastic methods
  • Mirror descent, different geometries, adaptive geometries
  • Decomposable functions, partial linearization and acceleration
• Faster methods for problems with specific forms or properties
  • Message passing for solving LPs
  • Flows and network problems
• Distributed optimization

Page 23

Current Trends

• Small scale problems:
  • Super-efficient and reliable solvers for use in real time
• Very large scale problems:
  • Stochastic first order methods
  • Linear time, one-pass or few-pass
• Relationship to Machine Learning

Page 24

Other Courses

• Winter 2016: Non-Linear Optimization (Mihai Anitescu, Stats)
• Winter 2016: Networks I: Introduction to Modeling and Analysis (Ozan Candogan, Booth)
• 2016/17: Computational and Statistical Learning Theory (TTIC)

Page 25

About the Course

• Methods for solving convex optimization problems, based on oracle access, with guarantees based on their properties
• Understanding different optimization methods
  • Understanding their derivation
  • When they are appropriate
  • Guarantees (a few proofs, not a core component)
• Working with and reasoning about optimization problems
  • Standard forms: LP, QP, SDP, etc.
  • Optimality conditions
  • Duality
  • Using optimality conditions and duality to reason about $x^*$

Page 26

Final

• Tuesday 10:30am
• Straightforward questions
• Allowed: anything hand-written by you (not photocopied)
• In depth: up to and including Interior Point Methods
  • Unconstrained Optimization
  • Duality
  • Optimality Conditions
  • Interior Point Methods
  • Phase I Methods and Feasibility Problems
• Superficially (a few multiple choice or similar):
  • Other first order and stochastic methods
  • Lower Bounds
  • Center of Mass / Ellipsoid
  • Other approaches