Variance-based Stochastic Gradient Descent (vSGD):
No More Pesky Learning Rates
Schaul et al., ICML 2013
The idea
- Remove the need for hand-tuning learning rates by updating them optimally from running estimates of the gradient, its variance, and the diagonal Hessian.
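The slide does not give the update itself, but a minimal per-parameter sketch in the spirit of Schaul et al. could look like the following. The fixed `decay`, the epsilon, and the caller-supplied Hessian-diagonal estimate `h_diag` are simplifications; the paper adapts the averaging window and estimates the curvature internally.

```python
import numpy as np

def vsgd_step(theta, grad, h_diag, state, decay=0.95):
    """One vSGD-style step with per-parameter rate eta_i = gbar_i**2 / (hbar_i * vbar_i)."""
    state["g_avg"] = decay * state["g_avg"] + (1 - decay) * grad            # running E[g]
    state["v_avg"] = decay * state["v_avg"] + (1 - decay) * grad ** 2       # running E[g^2]
    state["h_avg"] = decay * state["h_avg"] + (1 - decay) * np.abs(h_diag)  # running curvature
    eps = 1e-8                                                              # guards division by zero
    lr = state["g_avg"] ** 2 / (state["h_avg"] * state["v_avg"] + eps)      # per-parameter rate
    return theta - lr * grad

# Initialise e.g. state = {"g_avg": np.zeros_like(theta),
#                          "v_avg": np.ones_like(theta),
#                          "h_avg": np.ones_like(theta)}
```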
ADAM: A Method for Stochastic Optimization
Kingma & Ba, arXiv 2014
The idea
- Establish and update a trust region within which the gradient is assumed to hold.
- Attempts to combine the robustness of AdaGrad to sparse gradients with the robustness of RMSProp to non-stationary objectives.
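For reference, a minimal NumPy sketch of the Adam update rule with the paper's default hyperparameters (the function and variable names are mine):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad      # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)              # bias correction for zero initialisation
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```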
Alternative form: AdaMax
- In ADAM, the second moment is an exponentially decayed sum of squared gradients, and its square root is used in the update.
- Changing that power of two to a power of p and letting p go to infinity yields AdaMax.
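A minimal sketch of that limit: the root of the second moment becomes an exponentially weighted infinity norm (the `max` below), which needs no bias correction; the small `eps` guarding division by zero is my addition.

```python
import numpy as np

def adamax_step(theta, grad, m, u, t, lr=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaMax step; u is the infinity-norm analogue of Adam's second moment."""
    m = beta1 * m + (1 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))          # exponentially weighted infinity norm
    theta = theta - (lr / (1 - beta1**t)) * m / (u + eps)
    return theta, m, u
```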
Results
AdaGrad: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Duchi et al., COLT 2010
The idea
- Decrease the update over time by penalizing quickly moving parameters: divide each parameter's step by the square root of its accumulated squared gradients.
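A minimal sketch of the diagonal AdaGrad update (the base rate and epsilon are conventional choices, not from the slide):

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
    """One AdaGrad step: the accumulator only grows, so each parameter's
    effective learning rate only shrinks."""
    accum = accum + grad**2
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum
```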
The problem
- The learning rate only ever decreases.
- Complex problems may need more freedom.
Precursor to
- AdaDelta (Zeiler, arXiv 2012)
  - Uses the square root of an exponential moving average of squared gradients instead of just accumulating them.
  - Approximates a Hessian correction using the same kind of moving average over the parameter updates.
  - Removes the need for a learning rate (see the sketch after this list).
- AdaSecant (Gulcehre et al., arXiv 2014)
  - Uses expected values to reduce variance.
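A minimal sketch of the AdaDelta rule just described, using the decay and epsilon suggested in Zeiler's paper; the ratio of the two running RMS terms plays the role of the learning rate.

```python
import numpy as np

def adadelta_step(theta, grad, eg2, edx2, rho=0.95, eps=1e-6):
    """One AdaDelta step: no global learning rate.

    eg2:  running average of squared gradients
    edx2: running average of squared parameter updates (the units correction)
    """
    eg2 = rho * eg2 + (1 - rho) * grad**2
    delta = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad
    edx2 = rho * edx2 + (1 - rho) * delta**2
    return theta + delta, eg2, edx2
```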
Comparisons
- https://cs.stanford.edu/people/karpathy/convnetjs/demo/trainers.html
- Doesn't have ADAM in the default run, but ADAM is implemented and can be added.
- Doesn't have Batch Normalization, vSGD, AdaMax, or AdaSecant.
Questions?