Deep Learning Programming
Lecture 5: Optimization for Training Deep Models
Dr. Hanhe Lin, Universität Konstanz, 14.05.2018
1 Introduction
Optimization
- First-order methods
  - gradient descent
- Second-order methods
  - Newton's method
  - Conjugate gradients
  - BFGS

Figure: Minimizing the loss is like finding the lowest point in a hilly landscape
2 / 28 14.05.2018 Deep Learning Programming - Lecture 5 Dr. Hanhe Lin
1 Introduction
Learning differs from pure optimization
- In pure optimization, minimizing the cost J is a goal in and of itself
- Machine learning acts indirectly:
  - It reduces a different cost function J(θ) in the hope that doing so will improve performance P
  - We prefer to minimize the corresponding objective function where the expectation is taken across the data-generating distribution rather than just the finite training set
1 Introduction - 1.1 Mini-batch method
Batch and Mini-batch algorithms
- Motivation: computational cost and redundancy in the training set
- Batch/deterministic method:
  - processes all the training examples simultaneously in one large batch
  - fast to converge
  - computation is very expensive when the training set is very large
- Stochastic/online method:
  - processes only a single example at a time
  - suitable when examples are drawn from a stream of continually created examples
  - hard to converge
- Mini-batch/mini-batch stochastic method:
  - uses more than one but fewer than all the training examples
1 Introduction - 1.1 Mini-batch method
Why mini-batch?
- Larger batches provide a more accurate estimate of the gradient
- Multicore architectures are usually underutilized by extremely small batches
- Some kinds of hardware achieve better runtime with specific sizes of arrays, e.g., GPUs
- Small batches can offer a regularizing effect, perhaps due to the noise they add to the learning process
  - but may require a small learning rate
  - may increase the number of steps needed for convergence
1 Introduction - 1.1 Mini-batch method
Mini-batch
- It is crucial that the mini-batches be selected randomly ⇒ shuffle your data before selecting mini-batches
- In deep learning, one epoch means one pass over the full training set
- Make several epochs through the training set:
  - the first epoch follows the unbiased gradient of the generalization error
  - additional epochs decrease the gap between training error and test error
- On GPUs, it is common for power-of-2 batch sizes to offer better runtime
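As an illustration (not part of the original slides), shuffle-then-slice mini-batch selection can be sketched in NumPy; all names here are illustrative:

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield mini-batches after shuffling the whole training set."""
    idx = rng.permutation(X.shape[0])      # shuffle before selecting mini-batches
    for start in range(0, X.shape[0], batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)
batches = list(minibatches(X, y, batch_size=4, rng=rng))  # 3 batches: 4 + 4 + 2 examples
```

One full pass over `batches` is one epoch; calling `minibatches` again at the start of each epoch reshuffles the data.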
2 Optimization
Challenges in deep learning optimization
- Ill-conditioning: can cause SGD to get "stuck" in the sense that even very small steps increase the cost function
- Local minima: problematic if they have high cost in comparison to the global minimum
- Plateaus, saddle points and other flat regions: gradients are zero but the points are not local minima
- Cliffs and exploding gradients: updates can diverge
- Inexact gradients
- Theoretical limits of optimization
- . . .
2 Optimization
Example: local minima & saddle point
2 Optimization - 2.1 Mini-batch SGD
Basic algorithm
- Require: Learning rate α
  Require: Initial parameter θ
  while stopping criterion not met do
      Sample a mini-batch of m examples from the training set {x(1), x(2), . . . , x(m)} with corresponding targets y(i)
      Compute gradient estimate: g ← ∂J/∂θ
      Apply update: θ ← θ − αg
  end while
- We can speed up the optimization by changing either the learning rate or the gradient
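A minimal NumPy sketch of this loop (my own illustration, not code from the lecture), fitting a one-parameter model y = θx with squared loss:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 2.0 * X                      # data generated with true parameter θ* = 2
theta, alpha = 0.0, 0.1          # initial parameter and learning rate

for _ in range(200):             # stopping criterion: a fixed number of steps
    batch = rng.choice(100, size=10, replace=False)            # sample a mini-batch, m = 10
    g = np.mean(2 * (theta * X[batch] - y[batch]) * X[batch])  # gradient estimate ∂J/∂θ
    theta = theta - alpha * g                                  # apply update: θ ← θ − αg
```

Each iteration uses only 10 of the 100 examples, yet θ still converges to 2 because the mini-batch gradient is an unbiased estimate of the full gradient.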
2 Optimization - 2.1 Mini-batch SGD
Learning rate
- Learning rate is a crucial parameter for SGD
  - Too large: overshoots the local minimum, loss increases
  - Too small: makes very slow progress, can get stuck
  - Just right: makes steady progress towards a local minimum
- Batch gradient descent can use a fixed learning rate because the true gradient becomes small and then 0 when it reaches a minimum; however, this is not suitable for mini-batch SGD
2 Optimization - 2.1 Mini-batch SGD
Learning rate decay
- In practice, it is necessary to gradually decrease the learning rate over time:
  - Step decay: reduce the learning rate by some factor every few epochs (slightly preferable)
  - Exponential decay: α = α0 e^(−kt), where α0, k are hyperparameters and t is the iteration number
  - 1/t decay: α = α0/(1 + kt), where α0, k are hyperparameters and t is the iteration number
- Adapt the learning rate by monitoring learning curves that plot the objective function as a function of time (more of an art than a science)
- If you have enough computational power, use a slower decay and train for a longer time
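The three schedules above can be written directly (a sketch; `alpha0`, `k`, and the step-decay settings are the hyperparameters named on the slide):

```python
import math

def step_decay(alpha0, factor, epochs_per_drop, epoch):
    """Reduce the learning rate by `factor` every `epochs_per_drop` epochs."""
    return alpha0 * factor ** (epoch // epochs_per_drop)

def exp_decay(alpha0, k, t):
    """Exponential decay: α = α0 · e^(−kt)."""
    return alpha0 * math.exp(-k * t)

def inv_decay(alpha0, k, t):
    """1/t decay: α = α0 / (1 + kt)."""
    return alpha0 / (1 + k * t)
```

For example, `step_decay(0.1, 0.5, 10, epoch)` halves the rate every 10 epochs: 0.1 for epochs 0–9, 0.05 for epochs 10–19, and so on.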
2 Optimization - 2.1 Mini-batch SGD
Momentum
- Momentum is designed to accelerate learning, especially in the face of high curvature and small but consistent gradients
- Intuition: imagine a ball on the error surface:
  - The ball starts off by following the gradient, but once it has velocity, it no longer does pure gradient descent
  - Its momentum makes it keep going in the previous direction
- It damps oscillations in directions of high curvature by combining gradients with opposite signs
- It builds up speed in directions with a gentle but consistent gradient
2 Optimization - 2.1 Mini-batch SGD
Momentum
- Algorithm
  Require: Learning rate α, momentum parameter β ∈ [0, 1)
  Require: Initial parameter θ, initial velocity v
  while stopping criterion not met do
      Sample a mini-batch of m examples from the training set {x(1), x(2), . . . , x(m)} with corresponding targets y(i)
      Compute gradient estimate: g ← ∂J/∂θ
      Compute velocity update: v ← βv − αg
      Apply update: θ ← θ + v
  end while
- v ← βv − αg increments the previous velocity
- The larger β is relative to α, the more previous gradients affect the current direction
- If the algorithm always observes gradient g, then it will accelerate in the direction of −g
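The acceleration claim can be checked with a small sketch (an illustration of mine, not lecture code): feed a constant gradient g and watch the velocity build up.

```python
def momentum_step(theta, v, grad, alpha, beta):
    """One momentum update: v ← βv − αg, then θ ← θ + v."""
    v = beta * v - alpha * grad
    return theta + v, v

# With a constant gradient g, the velocity approaches the terminal
# value −αg/(1 − β), i.e. an effective step 1/(1 − β) times larger.
theta, v = 0.0, 0.0
for _ in range(500):
    theta, v = momentum_step(theta, v, grad=1.0, alpha=0.01, beta=0.9)
```

Here the terminal velocity is −0.01/(1 − 0.9) = −0.1, ten times the plain SGD step −αg = −0.01.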
2 Optimization - 2.1 Mini-batch SGD
Use of momentum
- Common values of β used in practice include 0.5, 0.9, and 0.99
  - At the beginning of learning there may be very large gradients, so it pays to use a small momentum (e.g., 0.5)
  - Once the large gradients have disappeared, the momentum can be smoothly raised to its final value (e.g., 0.9 or even 0.99)
- Like the learning rate, adapt β over time
  - Less important than shrinking α over time
2 Optimization - 2.1 Mini-batch SGD
Nesterov momentum
- The standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient
- Inspired by the Nesterov method for optimizing convex functions, Ilya Sutskever introduced a variant of the momentum algorithm
- Procedure:
  - make a big jump in the direction of the previous accumulated gradient
  - measure the gradient where you end up and make a correction
- Intuition: it is better to correct a mistake after you have made it!
2 Optimization - 2.1 Mini-batch SGD
Nesterov momentum
- Require: Learning rate α, momentum parameter β ∈ [0, 1)
  Require: Initial parameter θ, initial velocity v
  while stopping criterion not met do
      Sample a mini-batch of m examples from the training set {x(1), x(2), . . . , x(m)} with corresponding targets y(i)
      Apply interim update: θ̃ ← θ + βv
      Compute gradient (at interim point): g ← ∂J(θ̃)/∂θ̃
      Compute velocity update: v ← βv − αg
      Apply update: θ ← θ + v
  end while
Figure: A picture of the Nesterov method
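The look-ahead idea fits in a few lines (a sketch under my own naming, not the lecture's code): compute the gradient at the interim point θ + βv rather than at θ.

```python
def nesterov_step(theta, v, grad_fn, alpha, beta):
    """Nesterov momentum: measure the gradient at the look-ahead point."""
    g = grad_fn(theta + beta * v)   # jump first, then measure "where you end up"
    v = beta * v - alpha * g        # velocity update with the corrected gradient
    return theta + v, v

# Minimize f(x) = x², whose gradient is 2x, starting from x = 5
theta, v = 5.0, 0.0
for _ in range(300):
    theta, v = nesterov_step(theta, v, lambda x: 2 * x, alpha=0.05, beta=0.9)
```

The only difference from standard momentum is the argument of `grad_fn`: `theta + beta * v` instead of `theta`.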
2 Optimization - 2.2 Adaptive learning rates
Introduction
- The learning rate is one of the most difficult hyperparameters to set, and it has a significant impact on model performance
- Momentum and Nesterov manipulate the learning rate globally and equally for all parameters
- In deep models, the appropriate learning rates can vary widely between weights, e.g., small gradients in early layers
- Another solution: use a separate learning rate for each parameter and automatically adapt it through the course of learning
- Algorithms:
  - AdaGrad
  - RMSProp
  - Adam
  - . . .
2 Optimization - 2.2 Adaptive learning rates
AdaGrad
- Adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all the historical squared values of the gradient
- Large (small) partial derivatives of the loss ⇒ rapid (slow) decrease in learning rate
- The accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate
- Performs well for some but not all deep learning models
2 Optimization - 2.2 Adaptive learning rates
AdaGrad
- Algorithm
  Require: Global learning rate α, initial parameter θ, small constant ϵ, e.g., 10^−7
  Zero-initialize gradient accumulation variable r
  while stopping criterion not met do
      Sample a mini-batch of m examples from the training set {x(1), x(2), . . . , x(m)} with corresponding targets y(i)
      Compute gradient: g ← ∂J/∂θ
      Accumulate squared gradient: r ← r + g ⊙ g
      Compute update: ∆θ ← −(α/(ϵ + √r)) ⊙ g
      Apply update: θ ← θ + ∆θ
  end while
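Elementwise in NumPy, one AdaGrad step looks as follows (a sketch with illustrative names, not code from the lecture):

```python
import numpy as np

def adagrad_step(theta, r, grad, alpha, eps=1e-7):
    """AdaGrad: scale each parameter's rate by 1/(ϵ + √accumulated g²)."""
    r = r + grad * grad                                # accumulate squared gradient
    theta = theta - alpha / (eps + np.sqrt(r)) * grad  # per-parameter scaled update
    return theta, r

# Two parameters with very different gradient magnitudes
theta, r = np.zeros(2), np.zeros(2)
grad = np.array([10.0, 0.1])
theta, r = adagrad_step(theta, r, grad, alpha=0.1)
```

On the very first step the update is roughly −α·sign(g) for every parameter; afterwards the accumulated `r` keeps shrinking the effective rate, fastest where gradients are large.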
2 Optimization - 2.2 Adaptive learning rates
RMSProp
- Root Mean Square Propagation (RMSProp) modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average
- It discards history from the extreme past so that it can converge rapidly after finding a convex bowl
- Require: Global learning rate α, initial parameter θ, decay rate ρ, small constant ϵ, e.g., 10^−7
  Zero-initialize gradient accumulation variable r
  while stopping criterion not met do
      Sample a mini-batch of m examples from the training set {x(1), x(2), . . . , x(m)} with corresponding targets y(i)
      Compute gradient: g ← ∂J/∂θ
      Accumulate squared gradient: r ← ρr + (1 − ρ)g ⊙ g
      Compute update: ∆θ ← −(α/(ϵ + √r)) ⊙ g
      Apply update: θ ← θ + ∆θ
  end while
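The only change from AdaGrad is the accumulation line, which becomes a moving average. A sketch (illustrative names, typical default ρ = 0.9 assumed):

```python
import numpy as np

def rmsprop_step(theta, r, grad, alpha, rho=0.9, eps=1e-7):
    """RMSProp: exponentially weighted moving average of squared gradients."""
    r = rho * r + (1 - rho) * grad * grad              # old history decays by ρ each step
    theta = theta - alpha / (eps + np.sqrt(r)) * grad
    return theta, r

# One step from r = 0 with gradient 3: r = 0.1 · 9 = 0.9, update = −0.1 · 3/√0.9
theta, r = rmsprop_step(theta=0.0, r=0.0, grad=3.0, alpha=0.1, rho=0.9)
```

Because `r` is an average rather than a sum, it stays on the scale of recent squared gradients, so the effective learning rate does not shrink toward zero the way AdaGrad's does.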
2 Optimization - 2.2 Adaptive learning rates
RMSProp with Nesterov momentum
- Algorithm
  Require: Learning rate α, momentum parameter β ∈ [0, 1)
  Require: Initial parameter θ, initial velocity v, decay rate ρ, small constant ϵ, e.g., 10^−7
  while stopping criterion not met do
      Sample a mini-batch of m examples from the training set {x(1), x(2), . . . , x(m)} with corresponding targets y(i)
      Apply interim update: θ̃ ← θ + βv
      Compute gradient (at interim point): g ← ∂J(θ̃)/∂θ̃
      Accumulate squared gradient: r ← ρr + (1 − ρ)g ⊙ g
      Compute velocity update: v ← βv − (α/(ϵ + √r)) ⊙ g
      Apply update: θ ← θ + v
  end while
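The two ideas compose directly: a Nesterov look-ahead gradient that is then scaled by the RMSProp accumulator before entering the velocity. A sketch with illustrative names (checked below with a constant-gradient function):

```python
import numpy as np

def rmsprop_nesterov_step(theta, v, r, grad_fn, alpha, beta=0.9, rho=0.9, eps=1e-7):
    """RMSProp with Nesterov momentum: look-ahead gradient, RMS-scaled velocity."""
    g = grad_fn(theta + beta * v)                    # gradient at the interim point
    r = rho * r + (1 - rho) * g * g                  # moving average of squared gradients
    v = beta * v - alpha / (eps + np.sqrt(r)) * g    # RMS-scaled velocity update
    return theta + v, v, r

# One step from rest with a constant gradient of 4 and α = 0.05:
# r = 0.1 · 16 = 1.6, v = −0.05 · 4/√1.6 ≈ −0.1581
theta, v, r = rmsprop_nesterov_step(0.0, 0.0, 0.0, lambda x: 4.0, alpha=0.05)
```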
2 Optimization - 2.2 Adaptive learning rates
Adam
- Adaptive moments (Adam) is a variant on the combination of RMSProp and momentum
- It keeps exponentially weighted moving averages of both the first-order moment of the gradient (momentum) and the second-order moment (RMSProp)
- Less bias early in training than RMSProp because of bias corrections
- Robust to the choice of hyperparameters
2 Optimization - 2.2 Adaptive learning rates
Adam algorithm
- Require: Step size α, exponential decay rates for moment estimates ρ1, ρ2 ∈ [0, 1), small constant ϵ for numerical stabilization, initial parameters θ
  Zero-initialize 1st and 2nd moment variables s and r, initialize time step t = 0
  while stopping criterion not met do
      Sample a mini-batch of m examples from the training set {x(1), x(2), . . . , x(m)} with corresponding targets y(i)
      Compute gradient: g ← ∂J/∂θ
      t ← t + 1
      Update biased first moment estimate: s ← ρ1 s + (1 − ρ1)g
      Update biased second moment estimate: r ← ρ2 r + (1 − ρ2)g ⊙ g
      Correct bias in the first moment: ŝ ← s/(1 − ρ1^t)
      Correct bias in the second moment: r̂ ← r/(1 − ρ2^t)
      Compute update: ∆θ ← −α ŝ/(√r̂ + ϵ)
      Apply update: θ ← θ + ∆θ
  end while
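The whole update fits in one function (my sketch, with the commonly used defaults ρ1 = 0.9, ρ2 = 0.999 assumed):

```python
import numpy as np

def adam_step(theta, s, r, t, grad, alpha=0.001, rho1=0.9, rho2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moment estimates."""
    t += 1
    s = rho1 * s + (1 - rho1) * grad            # biased first moment (momentum part)
    r = rho2 * r + (1 - rho2) * grad * grad     # biased second moment (RMSProp part)
    s_hat = s / (1 - rho1 ** t)                 # bias corrections: without them the
    r_hat = r / (1 - rho2 ** t)                 # zero-initialized moments start too small
    theta = theta - alpha * s_hat / (np.sqrt(r_hat) + eps)
    return theta, s, r, t

# Thanks to bias correction, the very first update has magnitude ≈ α
# regardless of the gradient scale
theta, s, r, t = adam_step(theta=0.0, s=0.0, r=0.0, t=0, grad=5.0)
```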
2 Optimization - 2.2 Adaptive learning rates
Choosing the right optimization algorithm
- No consensus on this point
- Actively used algorithms include SGD, SGD with momentum, RMSProp, RMSProp with momentum, and Adam
- The choice of which algorithm to use seems to depend largely on the user's familiarity with the algorithm
2 Optimization - 2.3 Batch normalization
Motivation
- As learning progresses, the distribution of the layer inputs changes due to the parameter updates (internal covariate shift)
- This can slow down learning
- Batch normalization is a technique to reduce this effect:
  - Explicitly forces the layer activations to have zero mean and unit variance
  - Adds a learnable scale and bias term to allow the network to still use the nonlinearity
  - Can be applied to any input and hidden layer
2 Optimization - 2.3 Batch normalization
Algorithm
- Pros:
  - Easier to learn with gradient descent
  - Enables higher learning rates
  - Less careful initialization required
  - Regularizes the model
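The training-time forward pass can be sketched as follows (my illustration, not the lecture's code; the learnable scale and bias are `gamma` and `beta`):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch norm: normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)                        # per-feature batch mean
    var = x.var(axis=0)                        # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # zero mean, unit variance
    return gamma * x_hat + beta                # learnable scale and bias

rng = np.random.default_rng(0)
x = 3.0 * rng.normal(size=(8, 4)) + 2.0        # a mini-batch of 8 examples, 4 features
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma = 1` and `beta = 0` the output has (approximately) zero mean and unit variance per feature; training `gamma` and `beta` lets the network undo the normalization where that helps.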
2 Optimization - 2.3 Batch normalization
Batch normalization at test time
- At test time, batch normalization operates differently
- Instead of computing the mean and std on the batch, a single fixed empirical mean and std computed during training is applied

Figure: Adding batch normalization to a network
3 Summary
Summary
- Optimization for deep models is different from pure optimization:
  - acts indirectly
  - mini-batch SGD
  - non-convex surfaces, local minima and saddle points
- Shuffle your data before applying SGD
- Dynamically change the learning rate:
  - global, e.g., SGD with learning rate decay, SGD with momentum, . . .
  - individual, e.g., RMSProp, AdaGrad, . . .
  - hybrid, e.g., RMSProp with Nesterov momentum, Adam, . . .
- There is no consensus on which algorithm is better; try and see
- Batch normalization helps to address internal covariate shift and makes learning faster