
COMPUTATIONAL STATISTICS

OPTIMISATION

Luca Bortolussi

Department of Mathematics and Geosciences, University of Trieste

Office 238, third floor, H2bis, [email protected]

Trieste, Winter Semester 2016/2017


OUTLINE

1 STOCHASTIC GRADIENT DESCENT

2 CONJUGATE GRADIENTS

3 NEWTON’S METHODS


BASICS

Consider a function f(x) from R^n to R, twice differentiable. Its minima are points such that ∇f(x) = 0.
At a minimum x* of f, the Hessian matrix H_f(x*) is positive semidefinite, i.e. v^T H_f(x*) v ≥ 0 for all v.
Conversely, if a point x* is such that (a) ∇f(x*) = 0 and (b) H_f(x*) is positive definite, then x* is a (local) minimum of f.
For a quadratic function f(x) = (1/2) x^T A x − b^T x + c, the condition ∇f(x) = 0 reads Ax − b = 0.
If A is invertible and positive definite, then the point x* = A^{-1} b is the unique minimum of f, as f is a convex function.
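A minimal numerical check of this last fact, sketched with NumPy (the matrix A and vector b below are made-up example values):

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])      # symmetric, positive definite
b = np.array([1.0, 1.0])
c = 0.0

def f(x):
    # quadratic function f(x) = 1/2 x^T A x - b^T x + c
    return 0.5 * x @ A @ x - b @ x + c

x_star = np.linalg.solve(A, b)              # unique minimum x* = A^{-1} b
print(x_star, np.allclose(A @ x_star - b, 0.0))   # the gradient A x* - b vanishes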


GRADIENT DESCENT

Notation: x_k denotes the sequence of points of the descent, and g_k = ∇f(x_k). The update is in the direction p_k:

x_{k+1} = x_k + η_k p_k

In gradient descent, at a point x we take a step towards −∇f(x), hence in the update rule we set p_k = −g_k.
In the simplest case, η_k = η is constant. If η is not small enough, we can step over the minimum. If η is very small this usually does not happen, but convergence is very slow.
For a quadratic function, we have p_k = −A x_k + b.
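A minimal gradient descent sketch with a constant learning rate (assuming NumPy; the quadratic A, b are the same made-up example as above, and η = 0.1 is an arbitrary choice):

import numpy as np

def gradient_descent(grad, x0, eta=0.1, n_iter=1000):
    # x_{k+1} = x_k + eta_k p_k with eta_k = eta constant and p_k = -g_k = -grad(x_k)
    x = x0
    for _ in range(n_iter):
        x = x - eta * grad(x)
    return x

# for the quadratic f(x) = 1/2 x^T A x - b^T x + c the descent direction is -A x + b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = gradient_descent(lambda x: A @ x - b, np.zeros(2))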


STOCHASTIC GRADIENT DESCENT

If the function to minimise is of the form f(x) = Σ_{i=1}^N f_i(x), as is the case for ML problems, then we can use stochastic gradient descent, which, instead of taking a step along −g_k, steps along the direction −∇f_i(x_k) for a single observation i.
The algorithm iterates over the dataset one or more times, typically shuffling it each time.
Instead of one single observation, small batches (mini-batches) of observations can be used to improve the method.
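A minimal SGD sketch (one or more shuffled passes over the data); grad_fi is a hypothetical function returning ∇f_i(x) for observation i, and the default values are arbitrary:

import numpy as np

def sgd(grad_fi, x0, n_points, eta=0.01, n_epochs=10, seed=0):
    # stochastic gradient descent: step along -grad f_i(x) for one observation at a time
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(n_epochs):
        for i in rng.permutation(n_points):   # reshuffle the dataset at each epoch
            x = x - eta * grad_fi(x, i)
    return x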


SGD, CROSS ENTROPY, AND MINI-BATCHES

The cross-entropy between distributions p and q is:

H(p, q) = H(p) + D_KL(p ‖ q) = −Σ_x p(x) log q(x)

The empirical distribution p_emp of the dataset (x_i, y_i) gives each observed point probability 1/N, for N total points.
Maximizing the log likelihood is the same as minimizing the cross-entropy between the empirical distribution and the distribution predicted by the model. Calling the loss function L(f(x_i, θ), y_i) = −log p(y_i | x_i, θ), this is

H(p_emp, p) = (1/N) Σ_i L(f(x_i, θ), y_i)
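A small sketch of this identity for a logistic regression model on made-up data (assuming NumPy): the empirical cross-entropy H(p_emp, p) is just the average negative log-likelihood.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # N = 100 points, 2 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def cross_entropy(theta):
    # H(p_emp, p) = (1/N) sum_i L(f(x_i, theta), y_i) with L = -log p(y_i | x_i, theta)
    p = 1.0 / (1.0 + np.exp(-X @ theta))      # model probability p(y = 1 | x, theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(cross_entropy(np.array([1.0, 1.0])))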


SGD, CROSS ENTROPY, AND MINI-BATCHES

The cross-entropy between the model and the true data distribution is

H(p_data, p) = E_{(x,y)∼p_data}[L(f(x, θ), y)]

If we sample N points from p_data, H(p_data, p) is approximated by H(p_emp, p) in a statistical sense. Similarly, the gradient ∇_θ H(p_data, p) can be approximated by ∇_θ H(p_emp, p).
We can see the use of a mini-batch of size m (with a single pass over the data) in SGD as a statistical approximation of the gradient ∇_θ H(p_data, p): hence we minimize the generalization error.
For very big data, we may not even use all data points in training.


SGD AND LEARNING RATE

The learning rate η of the SGD algorithm cannot be kept fixed at each iteration: the algorithm would not converge in this case, due to the noisy evaluations of the gradient. Hence, η_k must depend on the iteration k.

A sufficient condition for convergence of SGD is:

Σ_{k=1}^∞ η_k = ∞   and   Σ_{k=1}^∞ η_k^2 < ∞

Typically, one sets η_k = (1 − k/τ) η_0 + (k/τ) η_τ,

where τ is the number of iterations in a few epochs of the algorithm (epoch = one iteration over the dataset). For deep (i.e. very complex) models, τ ≈ 100. Furthermore, η_τ ≈ 0.01 η_0.

The choice of η_0 is delicate: too large and the algorithm may diverge, too small and it may take forever. Strategy: monitor the first 50-100 iterations (plot the estimated cost function, using the same minibatch used for the gradient), find an "optimal" η_0, and then choose a larger one, avoiding instabilities.
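A sketch of the linear decay schedule above; η_0, η_τ and τ are the hyperparameters discussed in the text, and keeping η_k constant after iteration τ is a common convention assumed here:

def learning_rate(k, eta0, eta_tau, tau):
    # eta_k = (1 - k/tau) * eta0 + (k/tau) * eta_tau for k <= tau, then eta_tau
    if k >= tau:
        return eta_tau
    frac = k / tau
    return (1 - frac) * eta0 + frac * eta_tau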


MOMENTUM AND NESTEROV-MOMENTUM

Introduces memory in the gradient, by averaging the current value with previous ones:

[Excerpt from Goodfellow et al., Deep Learning, Sec. 8.3.2 (deeplearningbook.org/contents/optimization.html).]
The momentum method (Polyak, 1964) is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients. It accumulates an exponentially decaying moving average of past gradients in a velocity variable v and continues to move in their direction; a hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous gradients decay. The update rules (8.15)-(8.16) are

v ← α v − η ∇_θ (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i))
θ ← θ + v

Figure 8.5 of the book shows a quadratic loss with a poorly conditioned Hessian: the momentum trajectory traverses the long, narrow valley lengthwise, while plain gradient steps waste time oscillating back and forth across its narrow axis. (The excerpt also notes, citing Bottou and Bousquet (2008), that pursuing convergence faster than O(1/k) may not be worthwhile for machine learning tasks, since SGD's rapid initial progress on large datasets outweighs its slow asymptotic convergence.)

If all gradients are aligned, momentum accelerates by multiplying the step by a factor 1/(1 − α). Typically, α = 0.5, 0.9 or 0.99.
We can see the algorithm as a physical system subject to Newtonian forces and evolving in continuous time: the cost function is taken as a potential, modulated by η, and the momentum term corresponds to viscous friction (proportional to velocity). The initial velocity is equal to the initial gradient.


[The excerpt continues.] With momentum, the size of the step depends on how large and how aligned a sequence of gradients is; it is largest when many successive gradients point in exactly the same direction. If the algorithm always observes gradient g, it accelerates in the direction of −g until reaching a terminal velocity with step size η ‖g‖ / (1 − α), so it helps to think of the hyperparameter in terms of 1/(1 − α): α = 0.9 multiplies the maximum speed by 10 relative to gradient descent. Common values of α are 0.5, 0.9 and 0.99; like the learning rate, α may be adapted over time, but this is less important than shrinking η. Algorithm 8.2 (SGD with momentum): while the stopping criterion is not met, sample a minibatch of m examples, compute g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i)), update the velocity v ← α v − η g and apply θ ← θ + v. The excerpt finally views the algorithm as a particle subject to continuous-time Newtonian dynamics: the position is θ(t), the net force gives f(t) = ∂²θ(t)/∂t², and introducing the velocity v(t) = ∂θ(t)/∂t turns this into a first-order differential equation.
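A sketch of one momentum step as in (8.15)-(8.16); grad_minibatch is a hypothetical function returning the minibatch gradient at θ, and the default η, α are example values:

def momentum_step(theta, v, grad_minibatch, eta=0.01, alpha=0.9):
    # v <- alpha v - eta g ; theta <- theta + v
    g = grad_minibatch(theta)
    v = alpha * v - eta * g
    theta = theta + v
    return theta, v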


MOMENTUM AND NESTEROV-MOMENTUM

Nesterov momentum evaluates the gradient at an intermediate point. It can be shown that it improves the standard (batch) GD convergence rate to O(1/k^2).

[Excerpt from the same source, Sec. 8.3.3.] Sutskever et al. (2013) introduced a variant of momentum inspired by Nesterov's accelerated gradient method (Nesterov, 1983, 2004). The update rules (8.21)-(8.22) are

v ← α v − η ∇_θ (1/m) Σ_{i=1}^m L(f(x^(i); θ + α v), y^(i))
θ ← θ + v

where α and η play the same role as in standard momentum. The difference is where the gradient is evaluated: with Nesterov momentum it is evaluated after the current velocity is applied, which can be interpreted as adding a correction factor to standard momentum. Algorithm 8.3 (SGD with Nesterov momentum): while the stopping criterion is not met, sample a minibatch, apply the interim update θ̃ ← θ + α v, compute the gradient g at θ̃, update v ← α v − η g and apply θ ← θ + v. In the convex batch gradient case, Nesterov momentum brings the rate of convergence of the excess error from O(1/k) to O(1/k^2) (Nesterov, 1983); in the stochastic gradient case it does not improve the rate of convergence.
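A sketch of one Nesterov momentum step as in (8.21)-(8.22), with the gradient evaluated at the interim point θ + αv (grad_minibatch is again a hypothetical minibatch-gradient function):

def nesterov_step(theta, v, grad_minibatch, eta=0.01, alpha=0.9):
    # interim update, gradient at the look-ahead point, then velocity and parameter updates
    theta_interim = theta + alpha * v
    g = grad_minibatch(theta_interim)
    v = alpha * v - eta * g
    theta = theta + v
    return theta, v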



INITIALIZATION OF THE OPTIMISATION ALGORITHM

The initial point of the optimization algorithm is crucial for convergence, especially in high dimensions. If we are in the basin of attraction of a good minimum / an area with a good cost function, then SGD will work fine; otherwise not.
We can randomise the initial conditions and try the optimisation several times.
If we have some extra information about the solution, it is better to incorporate it. As a general rule, always use asymmetric initial conditions (especially if the model has symmetries: see neural networks).
Sample from a (zero-mean) Gaussian or a uniform distribution. The range is important: if too large, it may result in instabilities; if too small, it may introduce too little variation.
Heuristics depend on the model to learn.
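A minimal sketch of a symmetry-breaking random initialisation (assuming NumPy; the scale 0.01 is an arbitrary example value, not a recommendation, and the right range depends on the model):

import numpy as np

def init_params(shape, scale=0.01, seed=0):
    # zero-mean Gaussian initialisation; the scale controls the range trade-off above
    rng = np.random.default_rng(seed)
    return scale * rng.normal(size=shape)

theta0 = init_params((10,))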


ADAPTIVE LEARNING RATE: ADAGRAD

Introduce a different rate for each parameter, and modify the rates to take the curvature of the search space into account.
AdaGrad scales the learning rate of each parameter inversely proportional to the square root of the sum of all its historical squared gradient values. Good for convex problems.

[Excerpt from the same source, Sec. 8.5.] Algorithm 8.4 (AdaGrad): given a global learning rate η, an initial parameter θ and a small constant δ (e.g. 10^-7) for numerical stability, initialise the gradient accumulation variable r = 0; then, while the stopping criterion is not met, sample a minibatch of m examples, compute g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i)), accumulate the squared gradient r ← r + g ⊙ g, compute the update Δθ ← −(η / (δ + √r)) ⊙ g (division and square root applied element-wise) and apply θ ← θ + Δθ. The excerpt also notes that RMSProp replaces this accumulation with an exponentially decaying average, so that it can converge rapidly once it finds a convex bowl, and that Adam ("adaptive moments", Kingma and Ba, 2014) is best seen as a variant of RMSProp combined with momentum, where momentum is incorporated directly as an estimate of the first-order moment of the gradient.
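A sketch of one AdaGrad step as in Algorithm 8.4 (grad_minibatch is a hypothetical minibatch-gradient function; r accumulates the squared gradients):

import numpy as np

def adagrad_step(theta, r, grad_minibatch, eta=0.01, delta=1e-7):
    # accumulate squared gradients and scale the learning rate element-wise
    g = grad_minibatch(theta)
    r = r + g * g
    theta = theta - (eta / (delta + np.sqrt(r))) * g
    return theta, r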


ADAPTIVE LEARNING RATE: RMSPROP

RMSProp performs better in non-convex settings than AdaGrad, by changing the gradient accumulation into an exponentially weighted moving average.

[Excerpt from the same source.] Algorithm 8.5 (RMSProp): given a global learning rate η, a decay rate ρ and a small constant δ (usually 10^-6) to stabilise division by small numbers, initialise the accumulation variable r = 0; at each step, sample a minibatch, compute the gradient g, accumulate r ← ρ r + (1 − ρ) g ⊙ g, compute Δθ = −(η / √(δ + r)) ⊙ g (applied element-wise) and apply θ ← θ + Δθ. Unlike Adam, RMSProp does not include a bias correction for its second-moment estimate, which may therefore have high bias early in training; Adam corrects both the first- and second-moment estimates for their initialisation at the origin and is generally regarded as fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default. On choosing the right optimisation algorithm, Schaul et al. (2014) compared a large number of optimisers across a wide range of tasks: the family with adaptive learning rates performed fairly robustly, but no single best algorithm has emerged; the most popular algorithms in practice include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam, and the choice seems to depend largely on the user's familiarity with the algorithm.
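A sketch of one RMSProp step as in Algorithm 8.5; the defaults are example values, with ρ controlling the length of the moving average:

import numpy as np

def rmsprop_step(theta, r, grad_minibatch, eta=0.001, rho=0.9, delta=1e-6):
    # exponentially weighted moving average of the squared gradients
    g = grad_minibatch(theta)
    r = rho * r + (1 - rho) * g * g
    theta = theta - (eta / np.sqrt(delta + r)) * g
    return theta, r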


ADAPTIVE LEARNING RATE: RMSPROP

RMSProp can also be combined with Nesterov momentum. There is an extra hyperparameter ρ controlling the length scale of the moving average.

[Excerpt from the same source.] Algorithm 8.6 (RMSProp with Nesterov momentum): given a global learning rate η, a decay rate ρ and a momentum coefficient α, initialise the accumulation variable r = 0; at each step, sample a minibatch, compute the interim update θ̃ ← θ + α v, compute the gradient g at θ̃, accumulate r ← ρ r + (1 − ρ) g ⊙ g, compute the velocity update v ← α v − (η / √r) ⊙ g (applied element-wise) and apply θ ← θ + v. The excerpt then introduces approximate second-order methods, presented for simplicity for the empirical risk J(θ) = E_{(x,y)∼p̂_data}[L(f(x; θ), y)] = (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)); the most widely used second-order method is Newton's method.
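A sketch of one step of RMSProp with Nesterov momentum as in Algorithm 8.6 (hyperparameter values are arbitrary examples; grad_minibatch is a hypothetical minibatch-gradient function):

import numpy as np

def rmsprop_nesterov_step(theta, v, r, grad_minibatch, eta=0.001, rho=0.9, alpha=0.9):
    # gradient at the interim point, moving average of squared gradients, velocity update
    theta_interim = theta + alpha * v
    g = grad_minibatch(theta_interim)
    r = rho * r + (1 - rho) * g * g
    v = alpha * v - (eta / np.sqrt(r)) * g
    theta = theta + v
    return theta, v, r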


ADAPTIVE LEARNING RATE: ADAM

Adam integrates RMSProp with momentum, and introduces bias corrections to the first- and second-order moment estimates. It is quite stable w.r.t. the hyperparameters.

[Excerpt from the same source.] Algorithm 8.7 (Adam): the suggested defaults are step size η = 0.001, exponential decay rates ρ1 = 0.9 and ρ2 = 0.999 (both in [0, 1)) for the moment estimates, and δ = 10^-8 for numerical stabilisation. Initialise the first and second moment variables s = 0, r = 0 and the time step t = 0; at each step, sample a minibatch, compute the gradient g, set t ← t + 1, update the biased moment estimates s ← ρ1 s + (1 − ρ1) g and r ← ρ2 r + (1 − ρ2) g ⊙ g, correct the biases ŝ = s / (1 − ρ1^t) and r̂ = r / (1 − ρ2^t), compute Δθ = −η ŝ / (√r̂ + δ) (applied element-wise) and apply θ ← θ + Δθ. The excerpt then recalls Newton's method: approximating J(θ) near θ0 by a second-order Taylor expansion, J(θ) ≈ J(θ0) + (θ − θ0)^T ∇_θ J(θ0) + (1/2)(θ − θ0)^T H (θ − θ0), and solving for the critical point gives the Newton update θ* = θ0 − H^{-1} ∇_θ J(θ0), which for a locally quadratic function with positive definite H jumps directly to the minimum.
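A sketch of one Adam step as in Algorithm 8.7, using the suggested defaults quoted above (grad_minibatch is a hypothetical minibatch-gradient function):

import numpy as np

def adam_step(theta, s, r, t, grad_minibatch, eta=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    # biased first/second moment estimates, bias correction, element-wise update
    g = grad_minibatch(theta)
    t = t + 1
    s = rho1 * s + (1 - rho1) * g
    r = rho2 * r + (1 - rho2) * g * g
    s_hat = s / (1 - rho1 ** t)
    r_hat = r / (1 - rho2 ** t)
    theta = theta - eta * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r, t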


DIGRESSION: REGULARIZATION BY EARLY STOPPING

One of the most used regularization strategies, particularly for deep models, is early stopping. The idea is that, for a complex model (prone to overfitting), the best generalization is not found at an optimum; a better solution can be found along the trajectory going to it.
One uses a validation dataset to check how the validation error decreases during optimization, and stops at a minimum of the validation curve. Training time is thus treated as a hyperparameter.

[Excerpt from Goodfellow et al., Deep Learning, Sec. 7.8 (deeplearningbook.org/contents/regularization.html).] Figure 7.3 shows learning curves for a maxout network trained on MNIST: the training negative log-likelihood decreases consistently over time (epochs), but the validation set average loss eventually begins to increase again, forming an asymmetric U-shaped curve. When training large models with sufficient representational capacity to overfit the task, this behaviour occurs very reliably. This means we can obtain a model with better validation error (and thus, hopefully, better test error) by returning to the parameter setting with the lowest validation error: every time the validation error improves, we store a copy of the model parameters, and when the training algorithm terminates we return these parameters rather than the latest ones.


DIGRESSION: REGULARIZATION BY EARLY STOPPING

One can show that early stopping, for linear models, has a similar effect to L2 regularization. Early stopping is a very cheap form of regularization.

[Excerpt from the same source.] Figure 7.4 illustrates the effect of early stopping: the SGD trajectory starting from the origin stops at a point w̃ before reaching the minimiser w* of the cost, much as an L2 penalty causes the minimum of the total cost to lie nearer the origin than the minimum of the unregularised cost. To see the connection, set the initial parameter vector to the origin, w(0) = 0, and analyse gradient descent on the quadratic approximation Ĵ of the cost around w*, with Hessian H (the learning rate is written η here, matching the notation of these slides):

w(τ) = w(τ−1) − η ∇_w Ĵ(w(τ−1)) = w(τ−1) − η H (w(τ−1) − w*)
w(τ) − w* = (I − η H)(w(τ−1) − w*)

Rewriting this in the space of the eigenvectors of H, using the eigendecomposition H = Q Λ Q^T with Λ diagonal and Q an orthonormal basis of eigenvectors:

Q^T (w(τ) − w*) = (I − η Λ) Q^T (w(τ−1) − w*)

(For neural networks the parameters cannot all be initialised to 0, to obtain symmetry breaking between hidden units, but the argument holds for any other initial value w(0).)


DIGRESSION: REGULARIZATION BY EARLY STOPPING

[Excerpt from the same source.] Algorithm 7.1, the early stopping meta-algorithm for determining the best amount of time to train, is a general strategy that works well with a variety of training algorithms and ways of quantifying error on the validation set. Let n be the number of steps between evaluations, p the "patience" (the number of times to observe a worsening validation set error before giving up), and θ_o the initial parameters. Initialise θ ← θ_o, i ← 0, j ← 0, v ← ∞, θ* ← θ, i* ← i. While j < p: update θ by running the training algorithm for n steps; i ← i + n; v' ← ValidationSetError(θ); if v' < v, set j ← 0, θ* ← θ, i* ← i, v ← v'; otherwise set j ← j + 1. The best parameters are θ*, and the best number of training steps is i*. This strategy, known as early stopping, is probably the most commonly used form of regularization in deep learning; its popularity is due both to its effectiveness and to its simplicity. One way to think of it is as a very efficient hyperparameter selection algorithm, in which the number of training steps is just another hyperparameter with a U-shaped validation set performance curve.
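A sketch of the early stopping meta-algorithm above; train_n_steps and validation_error are hypothetical callables standing in for the training routine and the validation evaluation:

def early_stopping(train_n_steps, validation_error, theta0, n=100, p=5):
    # train in chunks of n steps, keep the parameters with the lowest validation error,
    # and stop after p evaluations without improvement (the "patience")
    theta, i, j = theta0, 0, 0
    best_v, best_theta, best_i = float('inf'), theta0, 0
    while j < p:
        theta = train_n_steps(theta, n)
        i += n
        v = validation_error(theta)
        if v < best_v:
            best_v, best_theta, best_i, j = v, theta, i, 0
        else:
            j += 1
    return best_theta, best_i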


OUTLINE

1 STOCHASTIC GRADIENT DESCENT

2 CONJUGATE GRADIENTS

3 NEWTON’S METHODS



GRADIENT DESCENT WITH LINE SEARCH

One possibility to improve gradient descent is to take the best step possible, i.e. set η_k to the value minimising the function f(x_k + λ p_k) along the line with direction p_k.
The minimum is obtained by solving for λ the equation

∇f(x_k + λ p_k)^T p_k = g_{k+1}^T p_k = 0

and setting η_k to this solution.
For a quadratic function, the best learning rate is given by

η_k = (b − A x_k)^T p_k / (p_k^T A p_k)
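A sketch of one steepest-descent step with exact line search for the quadratic case (A is assumed symmetric positive definite and the current gradient nonzero):

def exact_line_search_step(x, A, b):
    # p_k = -g_k = b - A x_k ;  eta_k = (b - A x_k)^T p_k / (p_k^T A p_k)
    p = b - A @ x
    eta = (p @ p) / (p @ (A @ p))
    return x + eta * p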


CONJUGATE GRADIENTS

Consider a quadratic minimisation problem. If the matrix A were diagonal, we could solve separately n different 1-dimensional optimisation problems.
We can change coordinates by an orthogonal matrix P that diagonalises the matrix A. By letting x = Py, we can rewrite the function f(x) as

f(y) = (1/2) y^T P^T A P y − b^T P y + c

The columns of P are called conjugate vectors and satisfy p_i^T A p_j = 0 for i ≠ j and p_i^T A p_i > 0. They are linearly independent and are very good directions to follow in a descent method.


CONJUGATE GRADIENTS

To construct conjugate vectors, we can use the Gram-Schmidt orthogonalisation procedure: if v is linearly independent of p_1, ..., p_k, then

p_{k+1} = v − Σ_{j=1}^k (p_j^T A v / p_j^T A p_j) p_j

We can start from a basis and construct the conjugate vectors p_1, ..., p_n.
In the conjugate vectors algorithm, we take step k + 1 along p_{k+1}. The best η_k, according to line search, is

η_k = −p_k^T g_k / (p_k^T A p_k)

It holds that ∇f(x_{k+1})^T p_i = 0 for all i = 1, ..., k (Luenberger's expanding subspace theorem).
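A sketch of the Gram-Schmidt conjugation of a given basis with respect to A (the basis vectors are assumed linearly independent and A symmetric positive definite):

def conjugate_directions(A, basis):
    # p_{k+1} = v - sum_j (p_j^T A v / p_j^T A p_j) p_j
    ps = []
    for v in basis:
        p = v.astype(float)
        for pj in ps:
            p = p - (pj @ (A @ v)) / (pj @ (A @ pj)) * pj
        ps.append(p)
    return ps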


CONJUGATE GRADIENTS

The conjugate gradient method constructs the p_k's on the fly. It works well also for non-quadratic problems; for quadratic problems it converges in at most n steps.
A good choice for a linearly independent vector v at step k + 1 to construct p_{k+1} is thus ∇f(x_{k+1}).
In this case, after some algebra, we can compute:

η_{k+1} = g_{k+1}^T g_{k+1} / (p_{k+1}^T A p_{k+1})

p_{k+1} = −g_{k+1} + β_k p_k

with

β_k = g_{k+1}^T g_{k+1} / (g_k^T g_k)   or   β_k = g_{k+1}^T (g_{k+1} − g_k) / (g_k^T g_k)

known as the Fletcher-Reeves and Polak-Ribière (preferable for non-quadratic problems) formulae.
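A sketch of the conjugate gradient method for a quadratic f(x) = 1/2 x^T A x − b^T x, using the Fletcher-Reeves formula for β_k (A is assumed symmetric positive definite):

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    # converges in at most n steps for an n-dimensional quadratic problem
    x, g = x0, A @ x0 - b              # g_k = gradient at x_k
    p = -g
    for _ in range(len(b)):
        if np.linalg.norm(g) < tol:
            break
        eta = (g @ g) / (p @ (A @ p))      # exact line search step
        x = x + eta * p
        g_new = A @ x - b
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves
        p = -g_new + beta * p
        g = g_new
    return x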


OUTLINE

1 STOCHASTIC GRADIENT DESCENT

2 CONJUGATE GRADIENTS

3 NEWTON’S METHODS


NEWTON-RAPHSON METHOD

As an alternative optimisation method for small n, we can use the Newton-Raphson method, which has better convergence properties than gradient descent.
By Taylor expansion

f(x + Δ) ≈ f(x) + Δ^T ∇f(x) + (1/2) Δ^T H_f(x) Δ

where H_f is the Hessian of f(x).
Differentiating w.r.t. Δ, the minimum of the r.h.s. is attained when ∇f(x) = −H_f(x) Δ, hence for Δ = −H_f(x)^{-1} ∇f(x).
Thus we obtain the update rule:

x_{k+1} = x_k − η H_f(x_k)^{-1} ∇f(x_k)

with 0 < η < 1 to improve convergence.
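A sketch of one damped Newton-Raphson step; grad and hess are hypothetical callables returning ∇f(x) and H_f(x), and the linear system is solved rather than inverting the Hessian explicitly:

import numpy as np

def newton_step(x, grad, hess, eta=0.5):
    # x_{k+1} = x_k - eta H_f(x_k)^{-1} grad f(x_k), with 0 < eta < 1 as in the slide
    delta = np.linalg.solve(hess(x), grad(x))
    return x - eta * delta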