A Brief Look at Optimization
CSC 412/2506 Tutorial
David Madras
January 18, 2018
Slides adapted from last year’s version

Overview
• Introduction
• Classes of optimization problems
• Linear programming
• Steepest (gradient) descent
• Newton’s method
• Quasi-Newton methods
• Conjugate gradients
• Stochastic gradient descent

What is optimization?
• Typical setup (in machine learning, life):
– Formulate a problem
– Design a solution (usually a model)
– Use some quantitative measure to determine how good the solution is.
• E.g., classification:
– Create a system to classify images
– Model is some simple classifier, like logistic regression
– Quantitative measure is classification error (lower is better in this case)
• The natural question to ask is: can we find a solution with a better score?
• Question: what could we change in the classification setup to lower the classification error (what are the free variables)?

Formal definition
• $\min_{\theta} f(\theta)$ subject to $c(\theta)$
• f(θ): some arbitrary function
• c(θ): some arbitrary constraints
• Minimizing f(θ) is equivalent to maximizing -f(θ), so we can just talk about minimization and be OK.

Types of optimization problems
• Depending on f, c, and the domain of θ we get many problems with many different characteristics.
• General optimization of arbitrary functions with arbitrary constraints is extremely hard.
• Most techniques exploit structure in the problem to find a solution more efficiently.

Types of optimization
• Simple enough problems have a closed-form solution:
– f(x) = x²
– Linear regression
• If f and c are linear functions then we can use linear programming (solvable in polynomial time).
• If f and c are convex then we can use convex optimization techniques (most of machine learning uses these).
• If f and c are non-convex we usually pretend the problem is convex and find a sub-optimal, but hopefully good enough, solution (e.g., deep learning).
• In the worst case there are global optimization techniques (operations research is very good at these).
• There are yet more techniques when the domain of θ is discrete.
• This list is far from exhaustive.

Types of optimization
• Takeaway: Think hard about your problem, find the simplest category that it fits into, use the tools from that branch of optimization.
• Sometimes you can solve a hard problem with a special-purpose algorithm, but most times we favor a black-box approach because it’s simple and usually works.

Really naïve optimization algorithm
• Suppose θ is a D-dimensional vector of parameters where each dimension is bounded above and below.
• For each dimension i, pick some set of values to try.
• Try all combinations of values for each dimension, record f for each one.
• Pick the combination that minimizes f.

Really naïve optimization algorithm
• This is called grid search. It works really well in low dimensions when you can afford to evaluate f many times.
• Less appealing when f is expensive or in high dimensions.
• You may have already done this when searching for a good L2 penalty value.
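A minimal sketch of grid search as described above, assuming a toy objective. The names `grid_search` and `bowl` and the candidate values are hypothetical, chosen only for illustration.

```python
import itertools

def grid_search(f, grids):
    """Evaluate f at every combination of per-dimension values; return the best.

    grids holds one sequence of candidate values per dimension.
    """
    best_theta, best_value = None, float("inf")
    for theta in itertools.product(*grids):
        value = f(theta)
        if value < best_value:
            best_theta, best_value = theta, value
    return best_theta, best_value

# Hypothetical objective: a 2-D quadratic bowl with its minimum at (1, -2).
def bowl(theta):
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

# Five candidate values per dimension, so 5 ** 2 = 25 evaluations of f.
grids = [[-2.0, -1.0, 0.0, 1.0, 2.0]] * 2
best_theta, best_value = grid_search(bowl, grids)
```

With K values per dimension the loop evaluates f K^D times, which is why the approach becomes unattractive as D grows.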

Convex functions
Use the line test.

Convex functions
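The line test can be written as an inequality. This is the standard textbook definition of convexity, given here as a reminder and not recovered from the slides: the chord between any two points on the graph lies on or above the function.

```latex
f\big(\lambda x + (1 - \lambda) y\big) \le \lambda f(x) + (1 - \lambda) f(y)
\quad \text{for all } x, y \text{ and all } \lambda \in [0, 1]
```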

Convex optimization
• We’ve talked about 1D functions, but the definition still applies to higher dimensions.
• Why do we care about convex functions?
• In a convex function, any local minimum is automatically a global minimum.
• This means we can apply fairly naïve techniques to find the nearest local minimum and still guarantee that we’ve found the best solution!

Steepest (gradient) descent
• Cauchy (1847)

Aside: Taylor series
• A Taylor series is a polynomial series that converges to a function f (for well-behaved, i.e. analytic, f near the expansion point).
• The Taylor series expansion of f around a point a, evaluated at a + x, is:
$f(a + x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} x^n$
• Truncating this series gives a polynomial approximation to a function.

Blue: exponential function; Red: Taylor series approximation
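To make the truncation idea concrete, here is a small sketch that approximates the exponential function by its Taylor series around 0 at increasing orders. The helper name `taylor_exp` and the chosen orders are illustrative assumptions.

```python
import math

def taylor_exp(x, order):
    """Truncated Taylor series of exp around 0: sum of x**n / n! for n = 0..order."""
    return sum(x ** n / math.factorial(n) for n in range(order + 1))

x = 1.0
# Absolute error of the truncated series at x = 1 for a few orders.
errors = [abs(math.exp(x) - taylor_exp(x, order)) for order in (1, 2, 4, 8)]
```

The error shrinks quickly as more terms are kept, which is what the red curves in the figure illustrate.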

Multivariate Taylor Series
• The first-order Taylor series expansion of a function f around a point θ, for a displacement d, is:
$f(\theta + d) \approx f(\theta) + \nabla f(\theta)^\top d$

Steepest descent derivation
• Suppose we are at θ and we want to pick a direction d (with norm 1) such that f(θ + ηd) is as small as possible for some step size η. This is equivalent to maximizing f(θ) - f(θ + ηd).
• Using a linear approximation:
$f(\theta) - f(\theta + \eta d) \approx f(\theta) - \big(f(\theta) + \eta \nabla f(\theta)^\top d\big) = -\eta \nabla f(\theta)^\top d$
• This approximation gets better as η gets smaller, since as we zoom in on a differentiable function it will look more and more linear.

Steepest descent derivation
• We need to find the value for d that maximizes $-\eta \nabla f(\theta)^\top d$ subject to $\|d\| = 1$.
• Using the definition of the cosine of the angle φ between two vectors:
$-\eta \nabla f(\theta)^\top d = -\eta \|\nabla f(\theta)\| \|d\| \cos\varphi = -\eta \|\nabla f(\theta)\| \cos\varphi$
• This is maximized when $\cos\varphi = -1$, i.e. when $d = -\nabla f(\theta) / \|\nabla f(\theta)\|$: the steepest descent direction is the negative gradient.

How to choose the step size?
• At iteration t the update is $\theta_{t+1} = \theta_t - \eta_t \nabla f(\theta_t)$.
• General idea: vary η_t until we find the minimum along the line $\theta_t - \eta_t \nabla f(\theta_t)$.
• This is a 1D optimization problem.
• In the worst case we can just make η_t very small, but then we need to take a lot more steps.
• General strategy: start with a big η_t and progressively make it smaller by, e.g., halving it until the function decreases.

When have we converged?
• When $\nabla f(\theta) = 0$.
• If the function is convex then we have reached a global minimum.
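The step-size strategy and the convergence test can be combined into a short sketch of steepest descent. This assumes a toy convex quadratic; the function names, the starting step `eta0`, and the tolerance are illustrative choices, and "gradient equal to zero" is replaced by "gradient norm below a small tolerance".

```python
import math

def gradient_descent(f, grad, theta, eta0=1.0, tol=1e-8, max_iters=10_000):
    """Steepest descent with a halving search for the step size."""
    theta = list(theta)
    for _ in range(max_iters):
        g = grad(theta)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break  # converged: the gradient is numerically zero
        eta = eta0
        # Start with a big step and halve it until the function decreases.
        while eta > 1e-20:
            candidate = [ti - eta * gi for ti, gi in zip(theta, g)]
            if f(candidate) < f(theta):
                break
            eta /= 2.0
        else:
            break  # no step size produced a decrease; give up
        theta = candidate
    return theta

# Hypothetical objective: a convex quadratic with its minimum at (3, -1).
def f(theta):
    return (theta[0] - 3.0) ** 2 + 10.0 * (theta[1] + 1.0) ** 2

def grad(theta):
    return [2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)]

theta_star = gradient_descent(f, grad, [0.0, 0.0])
```

Accepting any decrease keeps the sketch short; practical line searches usually require a sufficient decrease (for example the Armijo condition) instead.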

The problem with gradient descent
source: http://trond.hjorteland.com/thesis/img208.gif

Newton’s method
• To speed up convergence, we can use a more accurate approximation.
• Second-order Taylor expansion:
$f(\theta + d) \approx f(\theta) + \nabla f(\theta)^\top d + \tfrac{1}{2} d^\top H d$
• H is the Hessian matrix containing second derivatives.
![Page 24: A Brief Look at Optimization - Department of Computer ... · A Brief Look at Optimization CSC 412/2506 Tutorial David Madras January 18, 2018 Slides adapted from last year’s version.](https://reader030.fdocuments.us/reader030/viewer/2022041101/5edade5a09ac2c67fa687058/html5/thumbnails/24.jpg)
Newton’smethod
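The slide's equations did not survive extraction; the following is the standard derivation of the Newton step, assuming the Hessian H is positive definite. Setting the gradient of the quadratic model with respect to d to zero gives the update.

```latex
\nabla_d \Big[ f(\theta) + \nabla f(\theta)^\top d + \tfrac{1}{2} d^\top H d \Big]
  = \nabla f(\theta) + H d = 0
\;\Longrightarrow\;
d = -H^{-1} \nabla f(\theta),
\qquad
\theta_{t+1} = \theta_t - H^{-1} \nabla f(\theta_t)
```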
![Page 25: A Brief Look at Optimization - Department of Computer ... · A Brief Look at Optimization CSC 412/2506 Tutorial David Madras January 18, 2018 Slides adapted from last year’s version.](https://reader030.fdocuments.us/reader030/viewer/2022041101/5edade5a09ac2c67fa687058/html5/thumbnails/25.jpg)
Whatisitdoing?
• Ateachstep,Newton’smethodapproximatesthefunctionwithaquadraticbowl,thengoestotheminimumofthisbowl.
• Fortwiceormoredifferentiableconvexfunctions,thisisusuallymuchfasterthansteepestdescent(provably).
• Con:computingHessianrequiresO(D2)timeandstorage.InvertingtheHessianisevenmoreexpensive(uptoO(D3)).Thisisproblematicinhighdimensions.
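A small sketch of Newton's method in two dimensions, assuming the gradient and Hessian are available in closed form. The helper names and the toy quadratic are illustrative; the 2x2 linear solve uses Cramer's rule only to keep the example dependency-free.

```python
def solve_2x2(H, g):
    """Solve H d = g for a 2x2 matrix H using Cramer's rule."""
    (a, b), (c, d) = H
    det = a * d - b * c
    return [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]

def newton(grad, hess, theta, tol=1e-10, max_iters=50):
    """Newton's method: jump to the minimum of the local quadratic model."""
    steps = 0
    for _ in range(max_iters):
        g = grad(theta)
        if max(abs(gi) for gi in g) < tol:
            break
        step = solve_2x2(hess(theta), g)  # solves H * step = gradient
        theta = [ti - si for ti, si in zip(theta, step)]
        steps += 1
    return theta, steps

# Hypothetical objective: f(x, y) = (x - 3)^2 + 10 (y + 1)^2.
def grad(theta):
    return [2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)]

def hess(theta):
    return [[2.0, 0.0], [0.0, 20.0]]

theta_star, steps = newton(grad, hess, [0.0, 0.0])
```

Because the objective is itself a quadratic bowl, the quadratic model is exact and a single Newton step lands on the minimum.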

Quasi-Newton methods
• Computation involving the Hessian is expensive.
• Modern approaches use computationally cheaper approximations to the Hessian or its inverse.
• Deriving these is beyond the scope of this tutorial, but we’ll outline some of the key ideas.
• These are implemented in many good software packages in many languages and can be treated as black-box solvers, but it’s good to know where they come from so that you know when to use them.

BFGS
• Maintain a running estimate of the Hessian, B_t.
• At each iteration, set B_{t+1} = B_t + U_t + V_t, where U and V are rank-1 matrices (these are derived specifically for the algorithm).
• The advantage of using a low-rank update to improve the Hessian estimate is that B can be cheaply inverted at each iteration.
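Why a low-rank update keeps inversion cheap can be illustrated with the Sherman-Morrison identity, which updates a known inverse after a rank-1 change in O(D²) time. This is the general identity, not the BFGS update itself, and the matrices below are made-up examples.

```python
def mat_vec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def sherman_morrison(A_inv, u, v):
    """Return the inverse of (A + u v^T), given A_inv = A^{-1}.

    Uses (A + u v^T)^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u).
    """
    Au = mat_vec(A_inv, u)                  # A^{-1} u
    vA = mat_vec(list(zip(*A_inv)), v)      # v^T A^{-1}, computed as A^{-T} v
    denom = 1.0 + sum(vi * ai for vi, ai in zip(v, Au))
    n = len(u)
    return [[A_inv[i][j] - Au[i] * vA[j] / denom for j in range(n)]
            for i in range(n)]

# A = diag(2, 4), so its inverse is diag(0.5, 0.25); u = v = (1, 2).
A_inv = [[0.5, 0.0], [0.0, 0.25]]
B_inv = sherman_morrison(A_inv, [1.0, 2.0], [1.0, 2.0])
# B = A + u v^T = [[3, 2], [2, 8]], whose inverse is [[0.4, -0.1], [-0.1, 0.15]].
```

Applying the identity twice handles a sum of two rank-1 terms, which avoids the O(D³) cost of inverting from scratch.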

Limited memory BFGS
• BFGS progressively updates B and so one can think of B_t as a sum of rank-1 matrices from steps 1 to t. We could instead store these updates and recompute B_t at each iteration (although this would involve a lot of redundant work).
• L-BFGS only stores the most recent updates, therefore the approximation itself is always low rank and only a limited amount of memory needs to be used (linear in D).
• L-BFGS works extremely well in practice.
• L-BFGS-B extends L-BFGS to handle bound constraints on the variables.

Conjugate gradients
• Steepest descent often picks a direction it’s travelled in before (this results in the wiggly behavior).
• Conjugate gradients make sure we don’t travel in the same direction again.
• The derivation for quadratics is more involved than we have time for.
• The derivation for general convex functions is fairly hacky, but reduces to the quadratic version when the function is indeed quadratic.
• Takeaway: conjugate gradient works better than steepest descent, almost as good as L-BFGS. It also has a much cheaper per-iteration cost (still linear, but better constants).
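For the quadratic case, the method can be sketched in a few lines. This is the standard linear conjugate gradient algorithm, assuming a symmetric positive definite matrix A; the small system below is a made-up example.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def mat_vec(A, v):
    return [dot(row, v) for row in A]

def conjugate_gradient(A, b, x, tol=1e-12):
    """Minimize 0.5 * x^T A x - b^T x for symmetric positive definite A."""
    r = [bi - ai for bi, ai in zip(b, mat_vec(A, x))]  # residual = negative gradient
    d = list(r)                                        # first direction: steepest descent
    iterations = 0
    for _ in range(len(b)):            # at most D steps in exact arithmetic
        rr = dot(r, r)
        if rr < tol:                   # squared residual norm is small enough
            break
        Ad = mat_vec(A, d)
        alpha = rr / dot(d, Ad)        # exact minimizer along direction d
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        beta = dot(r, r) / rr
        d = [ri + beta * di for ri, di in zip(r, d)]   # new direction, conjugate to the old ones
        iterations += 1
    return x, iterations

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_star, iterations = conjugate_gradient(A, b, [0.0, 0.0])
```

On this 2-D problem the minimizer (1/11, 7/11) is reached in at most two iterations, since no direction is revisited.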

Stochastic Gradient Descent
• Recall that we can write the log-likelihood of a distribution as:
$\mathcal{L}(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta)$

Stochastic gradient descent
• Any iteration of a gradient descent (or quasi-Newton) method requires that we sum over the entire dataset to compute the gradient.
• SGD idea: at each iteration, sub-sample a small amount of data (even just 1 point can work) and use that to estimate the gradient.
• Each update is noisy, but very fast!
• This is the basis of optimizing ML algorithms with huge datasets (e.g., recent deep learning).
• Computing gradients using the full dataset is called batch learning, using subsets of data is called mini-batch learning.
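A minimal sketch of mini-batch SGD on a toy problem: fitting the mean of a unit-variance Gaussian, whose negative log-likelihood per point is 0.5 * (θ - x)² up to a constant. The dataset, step size, and batch size are illustrative assumptions.

```python
import random

def sgd_mean(data, batch_size=10, eta=0.05, iters=2000, seed=0):
    """Minimize the average of 0.5 * (theta - x)^2 over the data with mini-batch SGD."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(iters):
        batch = rng.sample(data, batch_size)                # sub-sample a few points
        grad = sum(theta - x for x in batch) / batch_size   # noisy gradient estimate
        theta -= eta * grad
    return theta

# Toy dataset; the full-batch minimizer is the sample mean, 3.0.
data = [1.0, 2.0, 3.0, 4.0, 5.0] * 20
theta_hat = sgd_mean(data)
```

Each step touches 10 of the 100 points, so it costs a tenth of a full-batch gradient; with a constant step size the estimate hovers near 3.0 without settling exactly on it.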

Stochastic gradient descent
• Suppose we made a copy of each point, y = x, so that we now have twice as much data. The log-likelihood is now:
$\mathcal{L}(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) + \sum_{i=1}^{N} \log p(y_i \mid \theta) = 2 \sum_{i=1}^{N} \log p(x_i \mid \theta)$
• In other words, the optimal parameters don’t change, but we have to do twice as much work to compute the log-likelihood and its gradient!
• The reason SGD works is because similar data yields similar gradients, so if there is enough redundancy in the data, the noise from subsampling won’t be so bad.

Stochastic gradient descent
• In the stochastic setting, line searches break down and so do estimates of the Hessian, so stochastic quasi-Newton methods are very difficult to get right.
• So how do we choose an appropriate step size?
• Robbins and Monro (1951): pick a sequence of η_t such that:
$\sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty$
• Satisfied by $\eta_t = 1/t$ (as one example).
• Balances “making progress” with averaging out noise.
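One way to see how a decaying step size averages out noise: for the toy loss 0.5 * (θ - x)², single-sample SGD with step size 1/t reproduces the running mean of the samples exactly. The sketch below assumes that loss and a made-up sample list.

```python
def sgd_running_mean(samples):
    """SGD on 0.5 * (theta - x)^2, one sample per step, with step size eta_t = 1 / t."""
    theta = 0.0
    for t, x in enumerate(samples, start=1):
        eta = 1.0 / t
        theta -= eta * (theta - x)   # equals (1 - 1/t) * theta + x / t
    return theta

samples = [2.0, 4.0, 9.0, 1.0]
theta_final = sgd_running_mean(samples)   # matches the mean of the samples
```

Early steps are large and make fast progress; later steps are small, so each new noisy sample only nudges the estimate.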

Final words on SGD
• SGD is very easy to implement compared to other methods, but the step sizes need to be tuned to different problems, whereas batch learning typically “just works”.
• Tip 1: divide the log-likelihood estimate by the size of your mini-batches. This makes the learning rate invariant to mini-batch size.
• Tip 2: subsample without replacement so that you visit each point on each pass through the dataset (this is known as an epoch).
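Tip 2 can be sketched as a generator that shuffles the dataset once per epoch and then slices it into mini-batches. The function name `minibatches` and the toy data are hypothetical.

```python
import random

def minibatches(data, batch_size, rng):
    """Yield mini-batches that together cover every point exactly once (one epoch)."""
    order = list(range(len(data)))
    rng.shuffle(order)  # a random order gives sampling without replacement
    for start in range(0, len(order), batch_size):
        yield [data[i] for i in order[start:start + batch_size]]

rng = random.Random(0)
data = list(range(10))
epoch = list(minibatches(data, batch_size=4, rng=rng))
```

For Tip 1, dividing each batch's gradient sum by `len(batch)` also handles the final batch, which may be smaller than the others.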

Useful References
• Linear programming:
– Linear Programming: Foundations and Extensions (http://www.princeton.edu/~rvdb/LPbook/)
• Convex optimization:
– http://web.stanford.edu/class/ee364a/index.html
– http://stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
• LP solver:
– Gurobi: http://www.gurobi.com/
• Stats (Python):
– SciPy stats: http://docs.scipy.org/doc/scipy-0.14.0/reference/stats.html
• Optimization (Python):
– SciPy optimize: http://docs.scipy.org/doc/scipy/reference/optimize.html
• Optimization (Matlab):
– minFunc: http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html
• General ML:
– Scikit-Learn: http://scikit-learn.org/stable/