Stochastic Gradient Descent
Transcript of alinlab.kaist.ac.kr/resource/Lec2_SGD.pdf (2020-04-25)
Algorithmic Intelligence Lab
EE807: Recent Advances in Deep Learning, Lecture 2
Stochastic Gradient Descent
Slides made by Insu Han and Jongheon Jeong, KAIST EE
Table of Contents

1. Introduction
   • Empirical risk minimization (ERM)
2. Gradient Descent Methods
   • Gradient descent (GD)
   • Stochastic gradient descent (SGD)
3. Momentum and Adaptive Learning Rate Methods
   • Momentum methods
   • Learning rate scheduling
   • Adaptive learning rate methods (AdaGrad, RMSProp, Adam)
4. Changing Batch Size
   • Increasing the batch size without learning rate decay
5. Summary
Empirical Risk Minimization (ERM)

• Given a training set {(x_i, y_i)}_{i=1}^n.
• A prediction function f_θ parameterized by θ.
• Empirical risk minimization: find a parameter θ that minimizes the average loss over the training set (written out below), where the per-example loss ℓ is, e.g., MSE or cross-entropy.
• For example, the prediction function can be a deep neural network.
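Written out, the ERM objective takes the standard form below (the symbols θ, f_θ, and ℓ are assumed notation, since the slide's equation image is not in this transcript):

```latex
\min_{\theta}\; L(\theta) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_\theta(x_i),\, y_i\big),
\qquad \text{where } \{(x_i, y_i)\}_{i=1}^{n} \text{ is the training set.}
```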
Next, how to solve ERM?
Gradient Descent (GD)

• Gradient descent (GD) updates the parameters iteratively by taking a gradient step: θ_{t+1} = θ_t − η ∇L(θ_t), where θ denotes the parameters, η the learning rate, and L the loss function.
• (+) Converges to the global (local) minimum for convex (non-convex) problems.
• (−) Not efficient in computation time and memory for huge n.
  • For example, the ImageNet dataset has n = 1,281,167 training images.
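As a concrete illustration, a minimal full-batch GD sketch (not code from the slides; the quadratic example and function names are made up for demonstration):

```python
import numpy as np

def gradient_descent(theta, grad_L, lr=0.1, num_steps=100):
    """Full-batch GD: each step uses the gradient of the loss over ALL n examples."""
    for _ in range(num_steps):
        theta = theta - lr * grad_L(theta)  # theta_{t+1} = theta_t - eta * grad L(theta_t)
    return theta

# Example: minimize L(theta) = ||theta||^2 / 2, whose gradient is simply theta.
theta = gradient_descent(np.array([3.0, -2.0]), grad_L=lambda th: th)
print(theta)  # approaches [0, 0]
```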
Next, efficient GD
1.2M 256×256 RGB images ≈ 236 GB of memory
Stochastic Gradient Descent (SGD)

• Stochastic gradient descent (SGD) uses a small random sample (minibatch) of the training set to approximate the GD update.
• In practice, minibatch sizes can be 32/64/128.
• Main practical challenges and current solutions:
  1. SGD can be too noisy and might be unstable → momentum
  2. It is hard to find a good learning rate → adaptive learning rate

*source: https://lovesnowbest.site/2018/02/16/Improving-Deep-Neural-Networks-Assignment-2/
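A minimal minibatch SGD sketch (not from the slides; the least-squares example and per-example gradient are assumptions for illustration):

```python
import numpy as np

def sgd(theta, grad_example, X, y, lr=0.01, batch_size=64, num_epochs=10):
    """Minibatch SGD: each update uses the average gradient over a small random batch."""
    n = len(X)
    for _ in range(num_epochs):
        perm = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            g = np.mean([grad_example(theta, X[i], y[i]) for i in idx], axis=0)
            theta = theta - lr * g
    return theta

# Example: least-squares regression; grad of (x·theta - y)^2 / 2 w.r.t. theta is (x·theta - y) x.
X = np.random.randn(1000, 5)
y = X @ np.arange(5.0)
theta = sgd(np.zeros(5), lambda th, x, yi: (x @ th - yi) * x, X, y)
```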
Next, momentum
Momentum Methods

1. Momentum gradient descent
• Add a decaying sum of previous gradients (momentum) to the update, where μ is the momentum preservation ratio.
• Equivalent to a weighted sum in which each previous update is scaled by the fraction μ per step.
• (+) Momentum reduces the oscillation and accelerates convergence.
[Figure: SGD vs. SGD + momentum trajectories; momentum adds friction to the vertical fluctuation and acceleration to the left, toward the minimum.]
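A minimal sketch of one momentum update (assuming the common convention v ← μv − η·grad, θ ← θ + v; not code from the slides):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, mu=0.9):
    """SGD + momentum: v accumulates a decaying sum of past gradients (mu = preservation ratio)."""
    v = mu * v - lr * grad   # decay the previous velocity, then add the current gradient step
    theta = theta + v        # move along the accumulated velocity
    return theta, v
```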
Momentum Methods: Nesterov's Momentum

1. Momentum gradient descent
• Add a decaying sum of previous gradients (momentum), with momentum preservation ratio μ.
• (−) Momentum can fail to converge even for simple convex optimization problems.
• Nesterov's accelerated gradient (NAG) [Nesterov'1983] instead uses a "look-ahead" gradient, i.e., the gradient evaluated at the approximate future position.
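In equations, a standard formulation of NAG with look-ahead point θ_t + μv_t (the exact notation on the slide is not in this transcript):

```latex
v_{t+1} = \mu\, v_t \;-\; \eta\, \nabla L\big(\theta_t + \mu\, v_t\big),
\qquad
\theta_{t+1} = \theta_t + v_{t+1}
```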
Momentum Methods: Nesterov's Momentum

1. Momentum gradient descent
• Add a decaying sum of previous gradients (momentum), with momentum preservation ratio μ.
• Nesterov's accelerated gradient (NAG) [Nesterov'1983] uses the gradient at the approximate future position, i.e., the "look-ahead" point.

[Figure: trajectories of SGD, SGD + momentum, and NAG.]

Quiz: fill in the pseudocode of Nesterov's accelerated gradient.
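One possible completion of the quiz (a hedged sketch, assuming the look-ahead formulation above; not necessarily the answer intended on the slide):

```python
import numpy as np

def nag_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    """Nesterov's accelerated gradient: evaluate the gradient at the look-ahead point."""
    lookahead = theta + mu * v              # approximate future position
    v = mu * v - lr * grad_fn(lookahead)    # gradient taken at the look-ahead point
    theta = theta + v
    return theta, v
```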
Adaptive Learning Rate Methods

2. Learning rate scheduling
• The learning rate is critical for minimizing the loss!
  • Too high → may jump over narrow valleys and can diverge.
  • Too low → may fall into a poor local minimum and converges slowly.

*source: http://cs231n.github.io/neural-networks-3/

Next, learning rate scheduling
Adaptive Learning Rate Methods: Learning Rate Annealing

2. Learning rate scheduling: decay methods
• A naive choice is a constant learning rate.
• Common learning rate schedules include time-based, step, and exponential decay; step decay is the most popular in practice.
• "Step decay" decreases the learning rate by a factor every few epochs.
  • Typically, the initial learning rate is set to 0.01 and drops by half every 10 epochs.

[Figure: learning rate curves under step decay and exponential decay, and the resulting accuracy.]

*source: https://towardsdatascience.com/
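The three schedules can be written as follows (assumed standard forms with initial rate η₀, decay constant k, step factor γ, and step size s; the slide's exact formulas are not in this transcript):

```latex
\eta_t =
\begin{cases}
\eta_0 \,/\, (1 + k\,t) & \text{time-based decay} \\[2pt]
\eta_0 \, e^{-k t} & \text{exponential decay} \\[2pt]
\eta_0 \, \gamma^{\lfloor t/s \rfloor} & \text{step decay (e.g., } \gamma = 0.5,\ s = 10 \text{ epochs)}
\end{cases}
```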
Adaptive Learning Rate Methods: Learning Rate Annealing

2. Learning rate scheduling: cycling method
• [Smith'2015] proposed a cyclical (triangular) learning rate.
• Why a "cycling" learning rate?
  • Sometimes, increasing the learning rate is helpful to escape saddle points.
• It can be combined with exponential decay or periodic decay.

[Figure: cyclical (triangular) learning rate schedule and its decayed variants.]

*source: https://github.com/bckenstler/CLR
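A sketch of the triangular policy (following the formulation in the CLR reference above; the parameter values are illustrative):

```python
import math

def triangular_lr(iteration, base_lr=1e-3, max_lr=6e-3, step_size=2000):
    """Cyclical (triangular) learning rate: rises from base_lr to max_lr and back
    once every 2 * step_size iterations."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)
```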
Adaptive Learning Rate Methods: Learning Rate Annealing

2. Learning rate scheduling: cycling method
• [Loshchilov'2017] uses cosine cycling and restarts to the maximum learning rate at each cycle (see the form below).
• Why "cosine"?
  • It decays slowly during the first half of a cycle and drops quickly during the rest.
• (+) It can climb down and up the loss surface, and thus can traverse several local minima.
• (+) It is the same as restarting at good points with the initial learning rate.

*source: Loshchilov et al., SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017
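Within one cycle, the schedule follows the cosine form below (as in the SGDR paper cited above; T_cur is the number of epochs since the last restart and T_i the current cycle length):

```latex
\eta_t \;=\; \eta_{\min} \;+\; \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})
\left(1 + \cos\!\left(\frac{T_{cur}}{T_i}\,\pi\right)\right)
```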
Adaptive Learning Rate Methods: Learning Rate Annealing

2. Learning rate scheduling: cycling method
• [Loshchilov'2017] also proposed warm restarts* for the cyclical learning rate.
• (+) It helps to escape saddle points, since optimization is more likely to get stuck in early iterations.

*Warm restart: restart more frequently in early iterations.

[Figure: comparison of step decay, cycling with no restart, and cycling with restart.]

But there is no perfect learning rate schedule! It depends on the specific task.

*source: Loshchilov et al., SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017

Next, adaptive learning rate
Adaptive Learning Rate Methods: AdaGrad, RMSProp

3. Adaptively changing the learning rate (AdaGrad, RMSProp)
• AdaGrad [Duchi'11] downscales the learning rate by the magnitude of previous gradients, dividing by the (square root of the) sum of all previous squared gradients.
  • (−) The learning rate strictly decreases and becomes too small after many iterations.
• RMSProp [Tieleman'12] uses a moving average of squared gradients instead, controlled by a preservation ratio (decay factor); see the sketches below.
• Other variants also exist, e.g., Adadelta [Zeiler'2012].
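Minimal sketches of the two updates (standard formulations; the hyperparameter values and the epsilon constant are illustrative):

```python
import numpy as np

def adagrad_step(theta, cache, grad, lr=0.01, eps=1e-8):
    """AdaGrad: divide by the root of the sum of ALL previous squared gradients (never decays)."""
    cache = cache + grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

def rmsprop_step(theta, cache, grad, lr=0.001, decay=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients, so the scale can also shrink."""
    cache = decay * cache + (1 - decay) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache
```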
Adaptive Learning Rate Methods

• Visualization of the algorithms:

[Animations: optimization behavior at a saddle point and near a local optimum.]

• Adaptive learning-rate methods, i.e., Adadelta and RMSProp, are most suitable and provide the best convergence in these scenarios.

*source: animations from Alec Radford's blog

Next, momentum + adaptive learning rate
Adaptive Learning Rate Methods: Adam

3. Combination of momentum and adaptive learning rate
• Adam (ADAptive Moment estimation) [Kingma'2015] keeps both a momentum term and a moving average of squared gradients (see the sketch below).
• Can be seen as a momentum + RMSProp update.
• Other variants exist, e.g., Adamax [Kingma'14], Nadam [Dozat'16].

*source: Kingma and Ba. Adam: A method for stochastic optimization. ICLR 2015
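A minimal sketch of one Adam step (the standard update from the cited paper; the default hyperparameters shown are the commonly used ones):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-like first moment m, RMSProp-like second moment v, with bias correction."""
    m = beta1 * m + (1 - beta1) * grad          # momentum (average of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction; t is the step count starting at 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```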
Decaying the Learning Rate = Increasing the Batch Size

• In practice, SGD + momentum and Adam work well in many applications.
• But scheduling the learning rate is still critical! (It should be decayed appropriately.)
• [Smith'2017] shows that decaying the learning rate is equivalent to increasing the batch size.
  • (+) A large batch size allows fewer parameter updates, leading to parallelism!

*source: Smith et al., "Don't Decay the Learning Rate, Increase the Batch Size.", ICLR 2018
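As an illustration of the idea (a hedged sketch, not the authors' recipe; the function name and constants are made up), one could grow the batch size on the same schedule that would otherwise shrink the learning rate:

```python
def batch_size_schedule(epoch, base_batch=128, growth=2, every=30, max_batch=8192):
    """Instead of dividing the learning rate by `growth` every `every` epochs,
    multiply the batch size by the same factor (capped at max_batch)."""
    return min(base_batch * growth ** (epoch // every), max_batch)

# e.g., epochs 0-29 -> 128, 30-59 -> 256, 60-89 -> 512, ...
```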
Summary

• SGD has been an essential algorithm for deep learning, used together with back-propagation.
• Momentum methods improve the performance of gradient descent algorithms.
  • Nesterov's momentum
• Annealing the learning rate is critical for minimizing the training loss.
  • Exponential, harmonic, and cyclic decay methods
  • Adaptive learning rate methods (RMSProp, AdaGrad, AdaDelta, Adam, etc.)
• In practice, SGD + momentum shows successful results, outperforming Adam!
  • For example, in NLP (Huang et al., 2017) or machine translation (Wu et al., 2016).
References

• [Nesterov'1983] Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). 1983.
  link: http://mpawankumar.info/teaching/cdt-big-data/nesterov83.pdf
• [Duchi et al., 2011] "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011.
  link: http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
• [Tieleman'2012] Geoff Hinton's Lecture 6e of the Coursera class.
  link: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
• [Zeiler'2012] Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method.
  link: https://arxiv.org/pdf/1212.5701.pdf
• [Smith'2015] Smith, Leslie N. "Cyclical learning rates for training neural networks."
  link: https://arxiv.org/pdf/1506.01186.pdf
• [Kingma and Ba, 2015] Kingma and Ba. Adam: A method for stochastic optimization. ICLR 2015.
  link: https://arxiv.org/pdf/1412.6980.pdf
• [Dozat'2016] Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. ICLR Workshop.
  link: http://cs229.stanford.edu/proj2015/054_report.pdf
• [Smith et al., 2017] Smith, Samuel L., Pieter-Jan Kindermans, and Quoc V. Le. Don't Decay the Learning Rate, Increase the Batch Size. ICLR 2018.
  link: https://openreview.net/pdf?id=B1Yy1BxCZ
• [Loshchilov et al., 2017] Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017.
  link: https://arxiv.org/pdf/1608.03983.pdf