Optimization for Data Science
Mini-batching, sampling, momentum and other tricks
Lecturer: Robert M. Gower & Alexandre Gramfort
Tutorials: Quentin Bertrand, Nidham Gazagnadou
The Stochastic Gradient Method
Sampled i.i.d.
Step size / learning rate
What about mini-batching?
The Stochastic Gradient Method: sample a mini-batch of size b.
What should b and the stepsize be? How does b influence the stepsize? How does the data influence the best mini-batch size and stepsize?
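To make the update concrete, here is a minimal NumPy sketch of mini-batch SGD on a small least-squares problem; the data, the batch size b and the stepsize gamma below are illustrative choices, not values prescribed in the lecture.

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_batch(x, idx):
    # Stochastic gradient of f(x) = (1/2n) * ||Ax - y||^2 restricted to the mini-batch idx.
    Ab, yb = A[idx], y[idx]
    return Ab.T @ (Ab @ x - yb) / len(idx)

b, gamma, iters = 32, 0.1, 2000   # mini-batch size and stepsize (illustrative)
x = np.zeros(d)
for _ in range(iters):
    idx = rng.choice(n, size=b, replace=False)   # sample a fresh mini-batch at every iteration
    x -= gamma * grad_batch(x, idx)

print("final objective:", 0.5 * np.mean((A @ x - y) ** 2))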
How to choose the minibatch size?
[Figure: cross validation score / objective function over a grid of minibatch sizes (x-axis) and step sizes (y-axis), with good and bad regions; a coarse grid around the best parameters can miss the best one.]
Need to figure out the functional relationship between minibatch size and step size.
Linear Scaling Rule: When the mini-batch size is multiplied by k, multiply the learning rate by k.
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al., CoRR 2017
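A small sketch of the Linear Scaling Rule in code; the base batch size and base learning rate below are placeholder values, not recommendations from the slides. Goyal et al. also combine the rule with a warm-up phase for very large mini-batches.

# Linear Scaling Rule (Goyal et al., 2017): when the mini-batch size is multiplied
# by k, multiply the learning rate by k.
base_batch_size = 32      # reference configuration (illustrative)
base_lr = 0.01

def scaled_lr(batch_size):
    k = batch_size / base_batch_size
    return base_lr * k

for bs in [32, 64, 128, 256, 512]:
    print(f"batch size {bs:4d} -> learning rate {scaled_lr(bs):.4f}")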
Stochastic Reformulation of Finite Sum Problems
Simple Stochastic Reformulation: introduce a random sampling vector and minimize the expectation of random linear combinations of the original functions; this is equivalent to the original finite sum problem.
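For reference, a hedged reconstruction of the reformulation in the notation of RMG, Richtárik and Bach (2018); the slides may use slightly different symbols. A random sampling vector v with E[v_i] = 1 turns the finite sum into an expectation while keeping the same minimizers:

\text{Original finite sum problem:}\qquad \min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n}\sum_{i=1}^{n} f_i(w)
\text{Simple stochastic reformulation:}\qquad \min_{w \in \mathbb{R}^d} \mathbb{E}_v\big[f_v(w)\big], \qquad f_v(w) := \frac{1}{n}\sum_{i=1}^{n} v_i f_i(w), \qquad \mathbb{E}[v_i] = 1 .

Since E[v_i] = 1, we get E[f_v(w)] = f(w) and E[grad f_v(w)] = grad f(w), which is the unbiasedness used on the next slides.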
SGD with arbitrary sampling
By design we have that the stochastic gradient is unbiased: E[grad f_v(w)] = grad f(w).
The distribution of the sampling vector encodes any form of i.i.d. mini-batching / non-uniform sampling.
Example: Gradient descent (the deterministic sampling vector of all ones).
Saves time for theorists: one representation for all forms of sampling.
Examples of arbitrary sampling: uniform single element
Random set S = {i} with i chosen uniformly at random: single element SGD.
Examples of arbitrary sampling: uniform mini-batching
Random set S of size b chosen uniformly without replacement: mini-batch SGD without replacement.
Examples of arbitrary sampling: non-uniform mini-batching
Random set S sampled with non-uniform probabilities: arbitrary sampling SGD.
Richtárik and Takáč (arXiv:1310.3438; Opt Letters 2016)
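As one concrete instance of non-uniform sampling (an assumed example, not necessarily the exact scheme from the slides): single-element SGD with importance sampling, where index i is drawn with probability p_i proportional to its smoothness constant L_i = ||a_i||^2 and the gradient is reweighted by 1/(n p_i) so the estimate stays unbiased.

import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 20
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d)

# Non-uniform sampling probabilities p_i proportional to L_i = ||a_i||^2.
L_i = np.sum(A ** 2, axis=1)
p = L_i / L_i.sum()

x, gamma = np.zeros(d), 0.01
for _ in range(5000):
    i = rng.choice(n, p=p)
    # Reweighting by 1/(n * p_i) keeps the stochastic gradient unbiased.
    g = A[i] * (A[i] @ x - y[i]) / (n * p[i])
    x -= gamma * g

print("objective:", 0.5 * np.mean((A @ x - y) ** 2))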
SGD with arbitrary sampling
Includes all forms of SGD (including GD).
How to analyse this general SGD? Look at the extremes: GD and single element SGD.
Assumptions and Convergence of Gradient Descent and SGD
Reminder: convergence of GD for strongly convex + smooth functions; smoothness gives the descent lemma.
Assumptions and Convergence of Gradient Descent
Assumptions: quasi strong convexity constant μ and smoothness constant L.
Iteration complexity of gradient descent: of the order (L/μ) log(1/ε).
Assumptions and Convergence of Stochastic Gradient Descent
Definition: L_max, the largest of the individual smoothness constants L_i; a bigger smoothness constant / stronger assumption.
Iteration complexity of SGD: governed by L_max/μ (plus a term from the gradient noise).
Needell, Srebro, Ward: Math. Prog., 2016.
Informal comparison between GD and SGD iteration complexity
GD: iteration complexity governed by L/μ, at a cost of n gradients per iteration. SGD: iteration complexity governed by L_max/μ, at a cost of one gradient per iteration.
In general L ≤ L_max ≤ nL. How do they compare? When n is big, SGD's cheaper iterations win.
Need a new "interpolating" notion of smoothness.
Key constant: Expected smoothness
Assumption (Expected Smoothness): we write (f, v) ∈ ES(𝓛) when E[||grad f_v(x) - grad f_v(x*)||^2] ≤ 2𝓛 (f(x) - f(x*)) for all x.
The expected smoothness constant 𝓛 depends on the sampling vector v and on f.
Lemma: a rough estimate (we can do better) is 𝓛 ≤ L_max.
Definition (Gradient noise): σ² := E[||grad f_v(x*)||^2], a generalization of the variance of the stochastic gradient at the optimum.
RMG, Richtárik and Bach (arXiv:1805.02632, 2018)
Example of Expected Smoothness
[Figure: expected smoothness plotted against the batch size b, where S is chosen uniformly at random from all subsets of size b.]
Measures how much the model fits the data.
EXE: In your list!
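The sketch below reproduces this kind of curve under an assumption: it uses the simple upper bound on the expected smoothness for b-nice sampling (S uniform over subsets of size b) reported in the referenced papers, which interpolates between L_max at b = 1 and L at b = n. The exact constants in the lecture's plot may differ.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
n, d = 400, 50
A = rng.standard_normal((n, d))

# Smoothness constants for f(x) = (1/2n) * sum_i (a_i^T x - y_i)^2.
L = np.linalg.eigvalsh(A.T @ A / n).max()    # smoothness of the full objective
L_max = np.max(np.sum(A ** 2, axis=1))       # largest individual smoothness constant

# Assumed bound for b-nice sampling:
#   L_exp(b) <= n(b-1)/(b(n-1)) * L + (n-b)/(b(n-1)) * L_max
b = np.arange(1, n + 1)
L_exp = n * (b - 1) / (b * (n - 1)) * L + (n - b) / (b * (n - 1)) * L_max

plt.plot(b, L_exp)
plt.xlabel("batch size b")
plt.ylabel("expected smoothness (upper bound)")
plt.show()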
Expected smoothness gives an awesome bound on the 2nd moment
Normally a bound on the gradient is an assumption: there exists B > 0 such that E[||grad f_v(x)||^2] ≤ B² for all x.
Recht, Wright & Niu, Hogwild!: NeurIPS 2011; Hazan & Kale, JMLR 2014; Rakhlin, Shamir & Sridharan, ICML 2012; Shamir & Zhang, ICML 2013.
Lemma: under expected smoothness, E[||grad f_v(x)||^2] ≤ 4𝓛 (f(x) - f(x*)) + 2σ², which is informative and holds with realistic assumptions.
Main Theorem (Linear convergence to a neighborhood)
Theorem: with a fixed stepsize, SGD with arbitrary sampling converges linearly to a neighborhood of the solution.
Corollary: fixed stepsize choice and the resulting complexity.
Saves time for theorists: includes GD and SGD as special cases, and is also tighter!
Proof is SUPER EASY: expand the update, take the expectation with respect to the sampling, apply the Lemma and quasi strong convexity, then take the total expectation.
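For reference, a hedged reconstruction of the statement, following the form in which it appears in RMG, Loizou et al. (ICML 2019); the constants on the slides may be arranged slightly differently. Assume expected smoothness with constant 𝓛, quasi strong convexity with constant μ, and gradient noise σ² = E[||grad f_v(x*)||^2]. Then for a fixed stepsize γ ≤ 1/(2𝓛):

\mathbb{E}\,\|x^{k} - x^{*}\|^{2} \;\le\; (1 - \gamma\mu)^{k}\,\|x^{0} - x^{*}\|^{2} \;+\; \frac{2\gamma\sigma^{2}}{\mu}.

This is linear convergence up to a neighborhood of size proportional to γσ²/μ; taking v to be the deterministic all-ones vector gives σ² = 0 and recovers the gradient descent rate.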
Exercises on Sampling, Expected Smoothness + gradient noise
Optimal mini-batch sizes
Corollary: total complexity for mini-batch SGD.
Total complexity = (number of iterations) × (number of stochastic gradients calculated in each iteration).
Total complexity is a simple function of the mini-batch size b.
Optimal mini-batch size
[Figure: the two terms of the total complexity as functions of b, one linearly increasing and one linearly decreasing; their sum is minimised at the optimal mini-batch size.]
The stepsize increases with b.
Optimal mini-batch size for models that interpolate the data
[Figure: for interpolating models the total complexity is linearly increasing in b, while the stepsize still increases with b.]
All gains in mini-batching are then due to multi-threading and cache memory?
[Figure: Stochastic Gradient Descent iterates with stepsize 𝛄 = 0.2.]
Learning schedule: Constant & decreasing step sizes
Theorem: use a constant learning rate up to a switch point, then switch to decreasing stepsizes. The switch point is governed by a stochastic condition number (𝓛/μ).
[Figure: Stochastic Gradient Descent with a switch to decreasing stepsizes; the switch point is marked.]
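A minimal sketch of such a schedule, assuming a constant stepsize up to the switch point and an O(1/t) decrease afterwards; the particular switch point and constants prescribed by the theorem are not reproduced here.

def stepsize_schedule(t, gamma0=0.5, switch_point=100):
    # Constant stepsize until the switch point, then decreasing like O(1/t),
    # matched so the schedule is continuous at the switch point.
    if t <= switch_point:
        return gamma0
    return gamma0 * switch_point / t

for t in [1, 50, 100, 200, 400, 1000]:
    print(t, round(stepsize_schedule(t), 4))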
Stochastic variance reduced methods
Recall the Simple Stochastic Reformulation: with a random sampling vector, we minimize the expectation of random linear combinations of the original functions, which is equivalent to the original finite sum problem.
What to do about the variance?
Controlled Stochastic Reformulation
Use covariates to control the variance: add covariate terms that cancel out in expectation, so the reformulation remains equivalent to the original finite sum problem.
Variance reduction with arbitrary sampling
By design we have that the gradient estimate remains unbiased.
How to choose the covariates?
Choosing the covariate
We would like the covariate to track the stochastic gradient, so that most of its variance cancels.
Use a linear approximation built at a reference point / snapshot.
SVRG: Stochastic Variance Reduced Gradients
Keep a reference point (snapshot), sample an index, and form the gradient estimate from the sampled gradient at the current point, the sampled gradient at the reference point, and the full gradient at the reference point.
Johnson & Zhang, NIPS 2013
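A minimal NumPy sketch of SVRG on a least-squares problem: compute the full gradient at the reference point (snapshot), then run an inner loop with the variance-reduced estimate grad_i(x) - grad_i(x_ref) + full_grad(x_ref). The data, inner-loop length and stepsize rule are illustrative choices, not the ones from the lecture.

import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 20
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d)

def grad_i(x, i):
    # Gradient of f_i(x) = (1/2) * (a_i^T x - y_i)^2.
    return A[i] * (A[i] @ x - y[i])

def full_grad(x):
    return A.T @ (A @ x - y) / n

L_max = np.max(np.sum(A ** 2, axis=1))   # largest individual smoothness constant
gamma = 1.0 / (4.0 * L_max)              # a conservative stepsize (illustrative)

x = np.zeros(d)
for epoch in range(20):
    x_ref = x.copy()               # reference point / snapshot
    g_ref = full_grad(x_ref)       # full gradient at the snapshot
    for _ in range(n):             # inner loop of length n (a common choice)
        i = rng.integers(n)
        # Unbiased gradient estimate whose variance vanishes near the solution.
        g = grad_i(x, i) - grad_i(x_ref, i) + g_ref
        x -= gamma * g

print("objective:", 0.5 * np.mean((A @ x - y) ** 2))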
Iteration complexity for SVRG and SAGA for arbitrary sampling
Theorem for SVRG: stepsize and iteration complexity. Sebbouh, Gazagnadou, Jelassi, Bach, G., 2019.
Theorem for SAGA (and the JacSketch family of methods): stepsize and iteration complexity. G., Bach, Richtárik, 2018.
Missing details due to extra definitions.
Total Complexity of mini-batch SVRG
[Figure: the two terms of the total complexity as functions of b, one non-linearly increasing and one linearly decreasing; the stepsize increases with b.]
Sebbouh, Gazagnadou, Jelassi, Bach, G., 2019
Total Complexity of mini-batch SAGA
[Figure: the two terms of the total complexity as functions of b, one linearly increasing and one linearly decreasing. The resulting optimal mini-batch size is always smaller than 25% of the data.]
Gazagnadou, G & Salmon, ICML 2019
Total Complexity of mini-batch SAGA: experiments
Real-sim data: (n, d) = (72'309, 20'958). The theory predicts a good total complexity.
Slice data: (n, d) = (53'500, 386). The theory predicts a good total complexity.
Take home message so far
Stochastic reformulations allow us to view all variants as simple SGD.
All forms of sampling used can be analysed through expected smoothness.
We can calculate the optimal mini-batch size of SGD, SAGA and SVRG.
The stepsize increases by orders of magnitude when the mini-batch size increases.
Momentum
Issue with Gradient Descent
The step size / learning rate is limited by the max local rate of change, even in regions where the local rate of change is small.
GD is the "steepest descent": it gets stuck in "flat" valleys.
Solution: give it momentum to keep going.
Adding some Momentum to GD
Heavy Ball Method: adds "inertia" to the update.
GD with momentum (GDm): adds "momentum" to the update.
GDm and Heavy Ball Equivalence
GD with momentum and the Heavy Ball Method are equivalent: eliminating the momentum variable from the GDm update recovers the heavy ball recursion (up to how the stepsize and momentum parameters are written).
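A short numerical check of the equivalence under one standard parametrization (the slides' exact parametrization may differ): with the momentum buffer started at zero, GD with momentum and the heavy ball recursion produce exactly the same iterates for the same γ and β.

import numpy as np

rng = np.random.default_rng(4)
d = 5
M = rng.standard_normal((d, d))
Q = M.T @ M + np.eye(d)           # strongly convex quadratic f(x) = 0.5 x^T Q x - b^T x
b = rng.standard_normal(d)

def grad(x):
    return Q @ x - b

gamma, beta, iters = 0.02, 0.9, 100

# GD with momentum: m_k = beta * m_{k-1} + grad(x_k),  x_{k+1} = x_k - gamma * m_k
x_m, m = np.zeros(d), np.zeros(d)
# Heavy ball:       x_{k+1} = x_k - gamma * grad(x_k) + beta * (x_k - x_{k-1})
x_hb, x_hb_prev = np.zeros(d), np.zeros(d)

for _ in range(iters):
    m = beta * m + grad(x_m)
    x_m = x_m - gamma * m
    x_hb, x_hb_prev = x_hb - gamma * grad(x_hb) + beta * (x_hb - x_hb_prev), x_hb

print("max difference between the two iterates:", np.max(np.abs(x_m - x_hb)))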
Convergence of Gradient Descent with Momentum
Theorem (Polyak 1964): for a suitable stepsize and momentum parameter, the heavy ball method converges at an accelerated rate.
Corollary: the stepsize and momentum parameter are chosen in terms of the smoothness constant L and the strong convexity constant μ.
Proof sketch: GDm convergence
Use the Fundamental Theorem of Calculus to rewrite the gradient. The update depends on the past iterates, which gives a difficult recurrence.
Proof: Convergence of Heavy Ball
Stacking consecutive iterates turns the heavy ball update into a simple recurrence!
EXE on eigenvalues: bound the rate through the eigenvalues of the recurrence matrix.
Adding Momentum to SGD
Stochastic Heavy Ball Method: adds "inertia" to the update.
SGD with momentum (SGDm): the index is sampled i.i.d.
Rumelhart, Hinton & Williams, 1986, Nature
SGDm and Averaging
SGD with momentum (SGDm) acts like an approximate variance reduction, since the momentum term is a (geometric) average of past stochastic gradients.
http://fa.bianp.net/teaching/2018/COMP-652/stochastic_gradient.html
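A minimal sketch of SGDm on a least-squares problem, written with the (1 - beta) averaging convention so the momentum buffer is literally a geometric average of past stochastic gradients (the momentum form m = beta * m + g used earlier differs only by a rescaling of the stepsize). Hyperparameters are illustrative.

import numpy as np

rng = np.random.default_rng(5)
n, d = 1000, 20
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

x, m = np.zeros(d), np.zeros(d)
gamma, beta = 0.005, 0.9                      # stepsize and momentum (illustrative)
for _ in range(5000):
    i = rng.integers(n)                       # index sampled i.i.d.
    g = A[i] * (A[i] @ x - y[i])              # stochastic gradient of f_i
    m = beta * m + (1 - beta) * g             # geometric average of past stochastic gradients
    x -= gamma * m

print("objective:", 0.5 * np.mean((A @ x - y) ** 2))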
RMG, P. Richtárik, F. Bach (2018), preprint online. Stochastic quasi-gradient methods: variance reduction via Jacobian sketching.
N. Gazagnadou, RMG, J. Salmon (2019), ICML 2019. Optimal mini-batch and step sizes for SAGA.
RMG, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, P. Richtárik (2019), ICML 2019. SGD: general analysis and improved rates.
O. Sebbouh, N. Gazagnadou, S. Jelassi, F. Bach, RMG (2019), NeurIPS, preprint online. Towards closing the gap between the theory and practice of SVRG.