Deep Learning Programming
Lecture 5: Optimization for Training Deep Models
Dr. Hanhe Lin, Universität Konstanz, 14.05.2018
1 Introduction
Optimization
- First-order methods
  - gradient descent
- Second-order methods
  - Newton's method
  - Conjugate gradients
  - BFGS

Figure: Minimizing the loss is like finding the lowest point in a hilly landscape
2 / 28 14.05.2018 Deep Learning Programming - Lecture 5 Dr. Hanhe Lin
1 Introduction
Learning differs from pure optimization
- In pure optimization, minimizing the cost J is a goal in and of itself
- Machine learning acts indirectly:
  - It reduces a different cost function J(θ) in the hope that doing so will improve performance P
  - We prefer to minimize the corresponding objective function where the expectation is taken across the data-generating distribution rather than just the finite training set
1 Introduction - 1.1 Mini-batch method
Batch and Mini-batch algorithms
- Motivation: computational cost and redundancy in the training set
- Batch/deterministic method:
  - processes all the training examples simultaneously in one large batch
  - fast to converge
  - computation is very expensive when the training set is very large
- Stochastic/online method:
  - processes only a single example at a time
  - suitable when examples are drawn from a stream of continually created examples
  - hard to converge
- Mini-batch/mini-batch stochastic method:
  - uses more than one but fewer than all the training examples
1 Introduction - 1.1 Mini-batch method
Why mini-batch?
- Larger batches provide a more accurate estimate of the gradient
- Multicore architectures are usually underutilized by extremely small batches
- Some kinds of hardware achieve better runtime with specific sizes of arrays, e.g., GPUs
- Small batches can offer a regularizing effect, perhaps due to the noise they add to the learning process
  - but may require a small learning rate
  - may increase the number of steps needed for convergence
1 Introduction - 1.1 Mini-batch method
Mini-batch
- It is crucial that the mini-batches be selected randomly ⇒ shuffle your data before selecting mini-batches
- In deep learning, one epoch means one pass over the full training set
- Make several epochs through the training set:
  - the first epoch follows the unbiased gradient of the generalization error
  - additional epochs decrease the gap between training error and test error
- On GPUs, it is common for power-of-2 batch sizes to offer better runtime
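As an illustration (not part of the original slides), shuffle-then-slice mini-batch selection can be sketched in NumPy; all names here are illustrative:

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield mini-batches after shuffling the whole training set."""
    idx = rng.permutation(X.shape[0])      # shuffle before selecting mini-batches
    for start in range(0, X.shape[0], batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)
batches = list(minibatches(X, y, batch_size=4, rng=rng))  # 3 batches: 4 + 4 + 2 examples
```

One full pass over `batches` is one epoch; calling `minibatches` again at the start of each epoch reshuffles the data.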
2 Optimization
Challenges in deep learning optimization
- Ill-conditioning: can cause SGD to get "stuck" in the sense that even very small steps increase the cost function
- Local minima: problematic if they have high cost in comparison to the global minimum
- Plateaus, saddle points and other flat regions: gradients are zero but the points are not local minima
- Cliffs and exploding gradients: updates can diverge
- Inexact gradients
- Theoretical limits of optimization
- . . .
2 Optimization
Example: local minima & saddle point
2 Optimization - 2.1 Mini-batch SGD
Basic algorithm
- Require: Learning rate α
  Require: Initial parameter θ
  while stopping criterion not met do
      Sample a mini-batch of m examples from the training set {x(1), x(2), . . . , x(m)} with corresponding targets y(i)
      Compute gradient estimate: g ← ∂J/∂θ
      Apply update: θ ← θ − αg
  end while
- We can speed up the optimization by changing either the learning rate or the gradient
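A minimal NumPy sketch of this loop (my own illustration, not code from the lecture), fitting a one-parameter model y = θx with squared loss:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 2.0 * X                      # data generated with true parameter θ* = 2
theta, alpha = 0.0, 0.1          # initial parameter and learning rate

for _ in range(200):             # stopping criterion: a fixed number of steps
    batch = rng.choice(100, size=10, replace=False)            # sample a mini-batch, m = 10
    g = np.mean(2 * (theta * X[batch] - y[batch]) * X[batch])  # gradient estimate ∂J/∂θ
    theta = theta - alpha * g                                  # apply update: θ ← θ − αg
```

Each iteration uses only 10 of the 100 examples, yet θ still converges to 2 because the mini-batch gradient is an unbiased estimate of the full gradient.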
2 Optimization - 2.1 Mini-batch SGD
Learning rate
- Learning rate is a crucial parameter for SGD
  - Too large: overshoots the local minimum, loss increases
  - Too small: makes very slow progress, can get stuck
  - Just right: makes steady progress towards a local minimum
- Batch gradient descent can use a fixed learning rate because the true gradient becomes small and then 0 when it reaches a minimum; however, this is not suitable for mini-batch SGD
2 Optimization - 2.1 Mini-batch SGD
Learning rate decay
- In practice, it is necessary to gradually decrease the learning rate over time:
  - Step decay: reduce the learning rate by some factor every few epochs (slightly preferable)
  - Exponential decay: α = α0 e^(−kt), where α0, k are hyperparameters and t is the iteration number
  - 1/t decay: α = α0/(1 + kt), where α0, k are hyperparameters and t is the iteration number
- Adapt the learning rate by monitoring learning curves that plot the objective function as a function of time (more of an art than a science)
- If you have enough computational power, use a slower decay and train for a longer time
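The three schedules above can be written directly (a sketch; `alpha0`, `k`, and the step-decay settings are the hyperparameters named on the slide):

```python
import math

def step_decay(alpha0, factor, epochs_per_drop, epoch):
    """Reduce the learning rate by `factor` every `epochs_per_drop` epochs."""
    return alpha0 * factor ** (epoch // epochs_per_drop)

def exp_decay(alpha0, k, t):
    """Exponential decay: α = α0 · e^(−kt)."""
    return alpha0 * math.exp(-k * t)

def inv_decay(alpha0, k, t):
    """1/t decay: α = α0 / (1 + kt)."""
    return alpha0 / (1 + k * t)
```

For example, `step_decay(0.1, 0.5, 10, epoch)` halves the rate every 10 epochs: 0.1 for epochs 0–9, 0.05 for epochs 10–19, and so on.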
2 Optimization - 2.1 Mini-batch SGD
Momentum
- Momentum is designed to accelerate learning, especially in the face of high curvature and small but consistent gradients
- Intuition: imagine a ball on the error surface:
  - The ball starts off by following the gradient, but once it has velocity, it no longer does pure gradient descent
  - Its momentum makes it keep going in the previous direction
- It damps oscillations in directions of high curvature by combining gradients with opposite signs
- It builds up speed in directions with a gentle but consistent gradient
2 Optimization - 2.1 Mini-batch SGD
Momentum
- Algorithm
  Require: Learning rate α, momentum parameter β ∈ [0, 1)
  Require: Initial parameter θ, initial velocity v
  while stopping criterion not met do
      Sample a mini-batch of m examples from the training set {x(1), x(2), . . . , x(m)} with corresponding targets y(i)
      Compute gradient estimate: g ← ∂J/∂θ
      Compute velocity update: v ← βv − αg
      Apply update: θ ← θ + v
  end while
- v ← βv − αg increments the previous velocity
- The larger β is relative to α, the more previous gradients affect the current direction
- If the algorithm always observes gradient g, then it will accelerate in the direction of −g
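The acceleration claim can be checked with a small sketch (an illustration of mine, not lecture code): feed a constant gradient g and watch the velocity build up.

```python
def momentum_step(theta, v, grad, alpha, beta):
    """One momentum update: v ← βv − αg, then θ ← θ + v."""
    v = beta * v - alpha * grad
    return theta + v, v

# With a constant gradient g, the velocity approaches the terminal
# value −αg/(1 − β), i.e. an effective step 1/(1 − β) times larger.
theta, v = 0.0, 0.0
for _ in range(500):
    theta, v = momentum_step(theta, v, grad=1.0, alpha=0.01, beta=0.9)
```

Here the terminal velocity is −0.01/(1 − 0.9) = −0.1, ten times the plain SGD step −αg = −0.01.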
2 Optimization - 2.1 Mini-batch SGD
Use of momentum
- Common values of β used in practice include 0.5, 0.9, and 0.99
  - At the beginning of learning there may be very large gradients, so it pays to use a small momentum (e.g., 0.5)
  - Once the large gradients have disappeared, the momentum can be smoothly raised to its final value (e.g., 0.9 or even 0.99)
- Like the learning rate, adapt β over time
  - Less important than shrinking α over time
2 Optimization - 2.1 Mini-batch SGD
Nesterov momentum
- The standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient
- Inspired by the Nesterov method for optimizing convex functions, Ilya Sutskever introduced a variant of the momentum algorithm
- Procedure:
  - make a big jump in the direction of the previous accumulated gradient
  - measure the gradient where you end up and make a correction
- Intuition: it is better to correct a mistake after you have made it!
2 Optimization - 2.1 Mini-batch SGD
Nesterov momentum
- Require: Learning rate α, momentum parameter β ∈ [0, 1)
  Require: Initial parameter θ, initial velocity v
  while stopping criterion not met do
      Sample a mini-batch of m examples from the training set {x(1), x(2), . . . , x(m)} with corresponding targets y(i)
      Apply interim update: θ̃ ← θ + βv
      Compute gradient (at interim point): g ← ∂J(θ̃)/∂θ̃
      Compute velocity update: v ← βv − αg
      Apply update: θ ← θ + v
  end while
Figure: A picture of the Nesterov method
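The look-ahead idea fits in a few lines (a sketch under my own naming, not the lecture's code): compute the gradient at the interim point θ + βv rather than at θ.

```python
def nesterov_step(theta, v, grad_fn, alpha, beta):
    """Nesterov momentum: measure the gradient at the look-ahead point."""
    g = grad_fn(theta + beta * v)   # jump first, then measure "where you end up"
    v = beta * v - alpha * g        # velocity update with the corrected gradient
    return theta + v, v

# Minimize f(x) = x², whose gradient is 2x, starting from x = 5
theta, v = 5.0, 0.0
for _ in range(300):
    theta, v = nesterov_step(theta, v, lambda x: 2 * x, alpha=0.05, beta=0.9)
```

The only difference from standard momentum is the argument of `grad_fn`: `theta + beta * v` instead of `theta`.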
2 Optimization - 2.2 Adaptive learning rates
Introduction
- The learning rate is one of the most difficult hyperparameters to set, and it has a significant impact on model performance
- Momentum and Nesterov manipulate the learning rate globally and equally for all parameters
- In deep models, the appropriate learning rates can vary widely between weights, e.g., small gradients in early layers
- Another solution: use a separate learning rate for each parameter and automatically adapt it through the course of learning
- Algorithms:
  - AdaGrad
  - RMSProp
  - Adam
  - . . .
2 Optimization - 2.2 Adaptive learning rates
AdaGrad
- Adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all the historical squared values of the gradient
- Large (small) partial derivatives of the loss ⇒ rapid (slow) decrease in learning rate
- The accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate
- Performs well for some but not all deep learning models
2 Optimization - 2.2 Adaptive learning rates
AdaGrad
- Algorithm
  Require: Global learning rate α, initial parameter θ, small constant ϵ, e.g., 10^−7
  Zero-initialize gradient accumulation variable r
  while stopping criterion not met do
      Sample a mini-batch of m examples from the training set {x(1), x(2), . . . , x(m)} with corresponding targets y(i)
      Compute gradient: g ← ∂J/∂θ
      Accumulate squared gradient: r ← r + g ⊙ g
      Compute update: ∆θ ← −(α/(ϵ + √r)) ⊙ g
      Apply update: θ ← θ + ∆θ
  end while
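Elementwise in NumPy, one AdaGrad step looks as follows (a sketch with illustrative names, not code from the lecture):

```python
import numpy as np

def adagrad_step(theta, r, grad, alpha, eps=1e-7):
    """AdaGrad: scale each parameter's rate by 1/(ϵ + √accumulated g²)."""
    r = r + grad * grad                                # accumulate squared gradient
    theta = theta - alpha / (eps + np.sqrt(r)) * grad  # per-parameter scaled update
    return theta, r

# Two parameters with very different gradient magnitudes
theta, r = np.zeros(2), np.zeros(2)
grad = np.array([10.0, 0.1])
theta, r = adagrad_step(theta, r, grad, alpha=0.1)
```

On the very first step the update is roughly −α·sign(g) for every parameter; afterwards the accumulated `r` keeps shrinking the effective rate, fastest where gradients are large.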
2 Optimization - 2.2 Adaptive learning rates
RMSProp
- Root Mean Square Propagation (RMSProp) modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average
- It discards history from the extreme past so that it can converge rapidly after finding a convex bowl
- Require: Global learning rate α, initial parameter θ, decay rate ρ, small constant ϵ, e.g., 10^−7
  Zero-initialize gradient accumulation variable r
  while stopping criterion not met do
      Sample a mini-batch of m examples from the training set {x(1), x(2), . . . , x(m)} with corresponding targets y(i)
      Compute gradient: g ← ∂J/∂θ
      Accumulate squared gradient: r ← ρr + (1 − ρ)g ⊙ g
      Compute update: ∆θ ← −(α/(ϵ + √r)) ⊙ g
      Apply update: θ ← θ + ∆θ
  end while
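The only change from AdaGrad is the accumulation line, which becomes a moving average. A sketch (illustrative names, typical default ρ = 0.9 assumed):

```python
import numpy as np

def rmsprop_step(theta, r, grad, alpha, rho=0.9, eps=1e-7):
    """RMSProp: exponentially weighted moving average of squared gradients."""
    r = rho * r + (1 - rho) * grad * grad              # old history decays by ρ each step
    theta = theta - alpha / (eps + np.sqrt(r)) * grad
    return theta, r

# One step from r = 0 with gradient 3: r = 0.1 · 9 = 0.9, update = −0.1 · 3/√0.9
theta, r = rmsprop_step(theta=0.0, r=0.0, grad=3.0, alpha=0.1, rho=0.9)
```

Because `r` is an average rather than a sum, it stays on the scale of recent squared gradients, so the effective learning rate does not shrink toward zero the way AdaGrad's does.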
2 Optimization - 2.2 Adaptive learning rates
RMSProp with Nesterov momentum
- Algorithm
  Require: Learning rate α, momentum parameter β ∈ [0, 1)
  Require: Initial parameter θ, initial velocity v, decay rate ρ, small constant ϵ, e.g., 10^−7
  while stopping criterion not met do
      Sample a mini-batch of m examples from the training set {x(1), x(2), . . . , x(m)} with corresponding targets y(i)
      Apply interim update: θ̃ ← θ + βv
      Compute gradient (at interim point): g ← ∂J(θ̃)/∂θ̃
      Accumulate squared gradient: r ← ρr + (1 − ρ)g ⊙ g
      Compute velocity update: v ← βv − (α/(ϵ + √r)) ⊙ g
      Apply update: θ ← θ + v
  end while
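The two ideas compose directly: a Nesterov look-ahead gradient that is then scaled by the RMSProp accumulator before entering the velocity. A sketch with illustrative names (checked below with a constant-gradient function):

```python
import numpy as np

def rmsprop_nesterov_step(theta, v, r, grad_fn, alpha, beta=0.9, rho=0.9, eps=1e-7):
    """RMSProp with Nesterov momentum: look-ahead gradient, RMS-scaled velocity."""
    g = grad_fn(theta + beta * v)                    # gradient at the interim point
    r = rho * r + (1 - rho) * g * g                  # moving average of squared gradients
    v = beta * v - alpha / (eps + np.sqrt(r)) * g    # RMS-scaled velocity update
    return theta + v, v, r

# One step from rest with a constant gradient of 4 and α = 0.05:
# r = 0.1 · 16 = 1.6, v = −0.05 · 4/√1.6 ≈ −0.1581
theta, v, r = rmsprop_nesterov_step(0.0, 0.0, 0.0, lambda x: 4.0, alpha=0.05)
```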
2 Optimization - 2.2 Adaptive learning rates
Adam
- Adaptive moments (Adam) is a variant on the combination of RMSProp and momentum
- It keeps exponentially weighted moving averages of both the first-order moment of the gradient (momentum) and the second-order moment (RMSProp)
- Less bias early in training than RMSProp because of bias corrections
- Robust to the choice of hyperparameters
2 Optimization - 2.2 Adaptive learning rates
Adam algorithm
- Require: Step size α, exponential decay rates for moment estimates ρ1, ρ2 ∈ [0, 1), small constant ϵ for numerical stabilization, initial parameters θ
  Zero-initialize 1st and 2nd moment variables s and r, initialize time step t = 0
  while stopping criterion not met do
      Sample a mini-batch of m examples from the training set {x(1), x(2), . . . , x(m)} with corresponding targets y(i)
      Compute gradient: g ← ∂J/∂θ
      t ← t + 1
      Update biased first moment estimate: s ← ρ1 s + (1 − ρ1)g
      Update biased second moment estimate: r ← ρ2 r + (1 − ρ2)g ⊙ g
      Correct bias in the first moment: ŝ ← s/(1 − ρ1^t)
      Correct bias in the second moment: r̂ ← r/(1 − ρ2^t)
      Compute update: ∆θ ← −α ŝ/(√r̂ + ϵ)
      Apply update: θ ← θ + ∆θ
  end while
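The whole update fits in one function (my sketch, with the commonly used defaults ρ1 = 0.9, ρ2 = 0.999 assumed):

```python
import numpy as np

def adam_step(theta, s, r, t, grad, alpha=0.001, rho1=0.9, rho2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moment estimates."""
    t += 1
    s = rho1 * s + (1 - rho1) * grad            # biased first moment (momentum part)
    r = rho2 * r + (1 - rho2) * grad * grad     # biased second moment (RMSProp part)
    s_hat = s / (1 - rho1 ** t)                 # bias corrections: without them the
    r_hat = r / (1 - rho2 ** t)                 # zero-initialized moments start too small
    theta = theta - alpha * s_hat / (np.sqrt(r_hat) + eps)
    return theta, s, r, t

# Thanks to bias correction, the very first update has magnitude ≈ α
# regardless of the gradient scale
theta, s, r, t = adam_step(theta=0.0, s=0.0, r=0.0, t=0, grad=5.0)
```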
2 Optimization - 2.2 Adaptive learning rates
Choosing the right optimization algorithm
- No consensus on this point
- Actively used algorithms include SGD, SGD with momentum, RMSProp, RMSProp with momentum, and Adam
- The choice of which algorithm to use seems to depend largely on the user's familiarity with the algorithm
2 Optimization - 2.3 Batch normalization
Motivation
- As learning progresses, the distribution of the layer inputs changes due to the parameter updates (internal covariate shift)
- This can slow down learning
- Batch normalization is a technique to reduce this effect:
  - Explicitly forces the layer activations to have zero mean and unit variance
  - Adds a learnable scale and bias term to allow the network to still use the nonlinearity
  - Can be applied to any input and hidden layer
2 Optimization - 2.3 Batch normalization
Algorithm
- Pros:
  - Easier to learn with gradient descent
  - Enables higher learning rates
  - Less careful initialization required
  - Regularizes the model
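The training-time forward pass can be sketched as follows (my illustration, not the lecture's code; the learnable scale and bias are `gamma` and `beta`):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch norm: normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)                        # per-feature batch mean
    var = x.var(axis=0)                        # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # zero mean, unit variance
    return gamma * x_hat + beta                # learnable scale and bias

rng = np.random.default_rng(0)
x = 3.0 * rng.normal(size=(8, 4)) + 2.0        # a mini-batch of 8 examples, 4 features
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma = 1` and `beta = 0` the output has (approximately) zero mean and unit variance per feature; training `gamma` and `beta` lets the network undo the normalization where that helps.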
2 Optimization - 2.3 Batch normalization
Batch normalization at test time
- At test time, batch normalization operates differently
- Instead of computing the mean and std on the batch, a single fixed empirical mean and std computed during training is applied

Figure: Adding batch normalization to a network
3 Summary
Summary
- Optimization for deep models is different from pure optimization:
  - acts indirectly
  - mini-batch SGD
  - non-convex surfaces, local minima and saddle points
- Shuffle your data before applying SGD
- Dynamically change the learning rate:
  - global, e.g., SGD with learning rate decay, SGD with momentum, . . .
  - individual, e.g., RMSProp, AdaGrad, . . .
  - hybrid, e.g., RMSProp with Nesterov momentum, Adam, . . .
- There is no consensus on which algorithm is better; try and see
- Batch normalization helps to address internal covariate shift and makes learning faster