Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch...

176
Optimization for Data Science Mini-batching, sampling, momentum and other tricks Lecturer: Robert M. Gower & Alexandre Gramfort Tutorials: Quenn Bertrand, Nidham Gazagnadou

Transcript of Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch...

Page 1: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Optimization for Data Science

Mini-batching, sampling, momentum and other tricks

Lecturer: Robert M. Gower & Alexandre Gramfort

Tutorials: Quentin Bertrand, Nidham Gazagnadou

Page 2: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Sampled i.i.d

The Stochastic Gradient Method

Step size/ Learning rate

Page 3: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Sampled i.i.d

The Stochastic Gradient Method

Step size/ Learning rate

What about mini-batchingWhat about mini-batching

Page 4: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Sample mini-batch with

The Stochastic Gradient Method

Page 5: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Sample mini-batch with

The Stochastic Gradient Method

What should b and be?

Page 6: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Sample mini-batch with

The Stochastic Gradient Method

What should b and be? How does b influence the stepsize ?

Page 7: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Sample mini-batch with

The Stochastic Gradient Method

What should b and be? How does b influence the stepsize ? How does the data influence the best

mini-batch and stepsize?

Page 8: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?

minibatch size

step

siz

e

Page 9: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?

minibatch size

step

siz

e

Page 10: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?

minibatch size

step

siz

e

Page 11: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?

minibatch size

step

siz

e

Page 12: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?

minibatch size

step

siz

e

Page 13: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?

minibatch size

step

siz

e

Page 14: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?

minibatch size

step

siz

e

Page 15: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?

minibatch size

step

siz

e

Page 16: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?

minibatch size

step

siz

e

Page 17: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?

minibatch size

step

siz

egood

bad

Cross validation score/objective function

Page 18: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?

minibatch size

step

siz

eBest

parametersgood

bad

Cross validation score/objective function

Page 19: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.

How to choose the minibatch size?

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al., CoRR 2017

minibatch size

step

siz

eBest

parametersgood

bad

Cross validation score/objective function

Page 20: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.

How to choose the minibatch size?

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al., CoRR 2017

minibatch size

step

siz

eBest

parametersgood

bad

Cross validation score/objective function

Page 21: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?

Linear Scaling Rule: When the mini-batch size is multiplied by k, multiply the learning rate by k.Linear Scaling Rule: When the mini-batch size is multiplied by k, multiply the learning rate by k.

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al., CoRR 2017

minibatch size

step

siz

e

Cross validation score

Page 22: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?

Linear Scaling Rule: When the mini-batch size is multiplied by k, multiply the learning rate by k.Linear Scaling Rule: When the mini-batch size is multiplied by k, multiply the learning rate by k.

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al., CoRR 2017

minibatch size

step

siz

e

Cross validation score

Page 23: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?good

bad

Linear Scaling Rule: When the mini-batch size is multiplied by k, multiply the learning rate by k.Linear Scaling Rule: When the mini-batch size is multiplied by k, multiply the learning rate by k.

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al., CoRR 2017

minibatch size

step

siz

e

Cross validation score

Page 24: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?good

bad

Linear Scaling Rule: When the mini-batch size is multiplied by k, multiply the learning rate by k.Linear Scaling Rule: When the mini-batch size is multiplied by k, multiply the learning rate by k.

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al., CoRR 2017

minibatch size

step

siz

e

Missed the best one

Cross validation score

Page 25: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

How to choose the minibatch size?good

bad

Need to figure out functional relationshipbetween minibatchsize and step size

Need to figure out functional relationshipbetween minibatchsize and step size

Linear Scaling Rule: When the mini-batch size is multiplied by k, multiply the learning rate by k.Linear Scaling Rule: When the mini-batch size is multiplied by k, multiply the learning rate by k.

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al., CoRR 2017

minibatch size

step

siz

e

Missed the best one

Cross validation score

Page 26: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Stochastic Reformulation of Finite sum problems

Page 27: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Random sampling vectorRandom sampling vector

Simple Stochastic Reformulation

Page 28: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Random sampling vectorRandom sampling vector

Simple Stochastic Reformulation

Page 29: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Random sampling vectorRandom sampling vector

Simple Stochastic Reformulation

Page 30: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Random sampling vectorRandom sampling vector

Stochastic Reformulation

Minimizing the expectation of random linear combinations of original function

Stochastic Reformulation

Minimizing the expectation of random linear combinations of original function

Original finite sum problem

Original finite sum problem

Simple Stochastic Reformulation

Page 31: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SGD with arbitrary sampling

Page 32: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SGD with arbitrary sampling

By design we have that

Page 33: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SGD with arbitrary sampling

The distribution encodes any form of i.i.d mini-batching/ non-uniform sampling.

By design we have thatExample: Gradient descent

Page 34: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SGD with arbitrary sampling

The distribution encodes any form of i.i.d mini-batching/ non-uniform sampling.

saves time for theorists: One representation for all forms of sampling

saves time for theorists: One representation for all forms of sampling

By design we have thatExample: Gradient descent

Page 35: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Examples of arbitrary sampling: uniform single element

Random setRandom set

Page 36: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Examples of arbitrary sampling: uniform single element

Random setRandom set

Page 37: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Examples of arbitrary sampling: uniform single element

Random setRandom set

Page 38: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Examples of arbitrary sampling: uniform single element

Random setRandom set

Single element SGD Single element SGD

Page 39: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Examples of arbitrary sampling: uniform mini-batching

Random setRandom set

Mini-batch SGD without replacementMini-batch SGD without replacement

Page 40: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Examples of arbitrary sampling: non-uniform mini-batching

Random setRandom set

Richtárik and Takáč (arXiv:1310.3438; Opt Letters 2016)

Page 41: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Examples of arbitrary sampling: non-uniform mini-batching

Random setRandom set

Richtárik and Takáč (arXiv:1310.3438; Opt Letters 2016)

Page 42: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Examples of arbitrary sampling: non-uniform mini-batching

Random setRandom set

Richtárik and Takáč (arXiv:1310.3438; Opt Letters 2016)

Page 43: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Examples of arbitrary sampling: non-uniform mini-batching

Random setRandom set

Arbitrary sampling SGD Arbitrary sampling SGD

Richtárik and Takáč (arXiv:1310.3438; Opt Letters 2016)

Page 44: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SGD with arbitrary sampling

Includes all forms of SGD (including GD)

Page 45: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SGD with arbitrary sampling

How to analyse this general SGD?

Includes all forms of SGD (including GD)

Page 46: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SGD with arbitrary sampling

How to analyse this general SGD?

Includes all forms of SGD (including GD)

Look at the extremes: GD and single element SGD

Page 47: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Assumption and convergence of Gradient Descent and SGD

Page 48: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Reminder: Convergence GD strongly convex + smooth

Now smoothness gives

Page 49: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Assumptions and Convergence of Gradient Descent quasi strong

convexity constant

Smoothness constant

Page 50: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Assumptions and Convergence of Gradient Descent

Iteration complexity of gradient descentIteration complexity of gradient descent

quasi strong convexity constant

Smoothness constant

Page 51: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Assumptions and Convergence of Stochastic Gradient Descent

Bigger smoothness constant/ stronger assumption

Page 52: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Assumptions and Convergence of Stochastic Gradient Descent

Definition Definition

Bigger smoothness constant/ stronger assumption

Page 53: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Assumptions and Convergence of Stochastic Gradient Descent

Iteration complexity of SGDIteration complexity of SGD

Definition Definition

Bigger smoothness constant/ stronger assumption

Needell, Srebro, Ward: Math. Prog, 2016.

Page 54: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Informal comparison between GD and SGD iteration complexity

Page 55: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Informal comparison between GD and SGD iteration complexity

GDGD SGDSGD

Page 56: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Informal comparison between GD and SGD iteration complexity

GDGD SGDSGD

In general:In general:How do they compare?

Page 57: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Informal comparison between GD and SGD iteration complexity

GDGD SGDSGD

In general:In general:How do they compare?

When n is big

Page 58: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Informal comparison between GD and SGD iteration complexity

GDGD SGDSGD

In general:In general:How do they compare?

Need new “interpolating” notion of smoothness

When n is big

Page 59: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Ass: Expected Smoothness. We write when Ass: Expected Smoothness. We write when

Key constant: Expected smoothness

Page 60: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Ass: Expected Smoothness. We write when Ass: Expected Smoothness. We write when

Key constant: Expected smoothness

Page 61: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Ass: Expected Smoothness. We write when Ass: Expected Smoothness. We write when

Key constant: Expected smoothness

Expected smoothness constantDepends on v and f

RMG, Richtárik and Bach (arXiv:1805.02632, 2018)

Page 62: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Ass: Expected Smoothness. We write when Ass: Expected Smoothness. We write when

Key constant: Expected smoothness

Lemma: Lemma:

Expected smoothness constantDepends on v and f

RMG, Richtárik and Bach (arXiv:1805.02632, 2018)

Page 63: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Ass: Expected Smoothness. We write when Ass: Expected Smoothness. We write when

Key constant: Expected smoothness

Lemma: Lemma:

Rough estimate (we can do better)

Expected smoothness constantDepends on v and f

RMG, Richtárik and Bach (arXiv:1805.02632, 2018)

Page 64: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Ass: Expected Smoothness. We write when Ass: Expected Smoothness. We write when

Key constant: Expected smoothness

Lemma: Lemma:

Rough estimate (we can do better)

Expected smoothness constantDepends on v and f

RMG, Richtárik and Bach (arXiv:1805.02632, 2018)

Definition: Gradient noiseDefinition: Gradient noise

Page 65: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Ass: Expected Smoothness. We write when Ass: Expected Smoothness. We write when

Key constant: Expected smoothness

Lemma: Lemma:

Rough estimate (we can do better)

Expected smoothness constantDepends on v and f

RMG, Richtárik and Bach (arXiv:1805.02632, 2018)

Definition: Gradient noiseDefinition: Gradient noise

Generalization of

Page 66: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Example of Expected Smoothness

1 2 3 4 5 6 7 8 9 10 11 12 13

50

100

150

200

250

300

350

400

batchsize

ex

pe

cte

d s

mo

oth

ne

ss

S is chosen uniformly at random from all subsets of size b

EXE: In your list!

Page 67: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Example of Expected Smoothness

1 2 3 4 5 6 7 8 9 10 11 12 13

50

100

150

200

250

300

350

400

batchsize

ex

pe

cte

d s

mo

oth

ne

ss

S is chosen uniformly at random from all subsets of size b

EXE: In your list!

Page 68: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Example of Expected Smoothness

1 2 3 4 5 6 7 8 9 10 11 12 13

50

100

150

200

250

300

350

400

batchsize

ex

pe

cte

d s

mo

oth

ne

ss

S is chosen uniformly at random from all subsets of size b

EXE: In your list!

Page 69: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Example of Expected Smoothness

1 2 3 4 5 6 7 8 9 10 11 12 13

50

100

150

200

250

300

350

400

batchsize

ex

pe

cte

d s

mo

oth

ne

ss

S is chosen uniformly at random from all subsets of size b

Measures how much model fits dataEXE: In your list!

Page 70: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Expected smoothness gives awesome bound on 2nd moment

Assumption There exists B>0 Assumption There exists B>0

Normally bound on gradient is an assumption

Recht, Wright & Niu, F. Hogwild: Neurips, 2011.

Hazan & Kale, JMLR 2014.

Rakhlin, Shamir, & Sridharan, ICML 2012

Shamir & Zhang, ICML 2013.

Page 71: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Expected smoothness gives awesome bound on 2nd moment

Assumption There exists B>0 Assumption There exists B>0

Normally bound on gradient is an assumption

Recht, Wright & Niu, F. Hogwild: Neurips, 2011.

Hazan & Kale, JMLR 2014.

Rakhlin, Shamir, & Sridharan, ICML 2012

Shamir & Zhang, ICML 2013.

Page 72: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Expected smoothness gives awesome bound on 2nd moment

Assumption There exists B>0 Assumption There exists B>0

Normally bound on gradient is an assumption

Recht, Wright & Niu, F. Hogwild: Neurips, 2011.

Hazan & Kale, JMLR 2014.

Rakhlin, Shamir, & Sridharan, ICML 2012

Shamir & Zhang, ICML 2013.

Page 73: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Expected smoothness gives awesome bound on 2nd moment

Lemma Lemma

Assumption There exists B>0 Assumption There exists B>0

Normally bound on gradient is an assumption

Recht, Wright & Niu, F. Hogwild: Neurips, 2011.

Hazan & Kale, JMLR 2014.

Rakhlin, Shamir, & Sridharan, ICML 2012

Shamir & Zhang, ICML 2013.

Page 74: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Expected smoothness gives awesome bound on 2nd moment

Lemma Lemma

informative: with realistic assumptionsinformative: with realistic assumptions

Assumption There exists B>0 Assumption There exists B>0

Normally bound on gradient is an assumption

Recht, Wright & Niu, F. Hogwild: Neurips, 2011.

Hazan & Kale, JMLR 2014.

Rakhlin, Shamir, & Sridharan, ICML 2012

Shamir & Zhang, ICML 2013.

Page 75: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Main Theorem (Linear convergence to a neighborhood)

Theorem Theorem

Page 76: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Main Theorem (Linear convergence to a neighborhood)

Theorem Theorem

Fixed stepsize

Page 77: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

CorollaryCorollary

Main Theorem (Linear convergence to a neighborhood)

Theorem Theorem

Fixed stepsize

Page 78: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

CorollaryCorollary

Main Theorem (Linear convergence to a neighborhood)

Theorem Theorem

Fixed stepsize

saves time for theorists: Includes GD and SGD as special cases. Also tighter!

Page 79: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Proof is SUPER EASY:

Taking expectation with respect to

Taking total expectationLemma Lemma

quasi strong conv

Page 80: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Exercises on Sampling, Expected Smoothness + gradient noise

Page 81: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Optimal mini-batch sizes

Page 82: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

CorollaryCorollary

Total complexity for mini-batch SGD

Page 83: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

CorollaryCorollary

Total complexity for mini-batch SGD

Page 84: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

CorollaryCorollary

Total complexity for mini-batch SGD

Page 85: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

CorollaryCorollary

Total complexity for mini-batch SGD

Total Complexity = #stochastic gradient calculated

in each iteration

Page 86: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

CorollaryCorollary

Total complexity for mini-batch SGD

Total complexity is a simple function of mini-batch size b

Total Complexity = #stochastic gradient calculated

in each iteration

Page 87: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Optimal mini-batch size

Page 88: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Optimal mini-batch size

Linearly increasing

Page 89: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Optimal mini-batch size

Linearly increasing Linearly decreasing

Page 90: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Optimal mini-batch size

Linearly increasing Linearly decreasing

Page 91: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Optimal mini-batch size

Linearly increasing Linearly decreasing

Stepsize increases with b

Page 92: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Optimal mini-batch size for models that interpolate data

Page 93: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Optimal mini-batch size for models that interpolate data

Page 94: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Optimal mini-batch size for models that interpolate data

Page 95: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Optimal mini-batch size for models that interpolate data

Page 96: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Optimal mini-batch size for models that interpolate data

Linearly increasing

increases with b

Page 97: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Optimal mini-batch size for models that interpolate data

Linearly increasing

increases with b

All gains in mini-batching are due to multi-threading and cache memory?

Page 98: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Stochastic Gradient Descent 𝛄 = 0.2

Page 99: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Learning schedule: Constant & decreasing step sizes

Theorem Theorem

Learning rate with switch point

Page 100: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Learning schedule: Constant & decreasing step sizes

Theorem Theorem

A stochastic condition number

Learning rate with switch point

Page 101: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Learning schedule: Constant & decreasing step sizes

Theorem Theorem

A stochastic condition number

Learning rate with switch point

Page 102: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Learning schedule: Constant & decreasing step sizes

Theorem Theorem

A stochastic condition number

Learning rate with switch point

Page 103: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

109Stochastic Gradient Descent with switch to decreasing stepsizes

Switch point

Page 104: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Stochastic variance reduced methods

Page 105: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Random sampling vectorRandom sampling vector

Stochastic Reformulation

Minimizing the expectation of random linear combinations of original function

Stochastic Reformulation

Minimizing the expectation of random linear combinations of original function

Original finite sum problem

Original finite sum problem

Simple Stochastic Reformulation

What to do about the variance?

Page 106: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Controlled Stochastic Reformulation

Page 107: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

covariate Cancel out

Controlled Stochastic Reformulation

Page 108: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

covariate Cancel out

Controlled Stochastic Reformulation

Page 109: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

covariate Cancel out

Controlled Stochastic Reformulation

Use covariates to control the variance

Controlled Stochastic Reformulation

Use covariates to control the variance

Original finite sum problem

Original finite sum problem

Controlled Stochastic Reformulation

Page 110: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Variance reduction with arbitrary sampling

Page 111: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Variance reduction with arbitrary sampling

Page 112: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Variance reduction with arbitrary sampling

Page 113: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Variance reduction with arbitrary sampling

By design we have that

Page 114: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Variance reduction with arbitrary sampling

By design we have that

How to choose ?How to choose ?

Page 115: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Choosing the covariate

Page 116: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Choosing the covariate

We would like:

Page 117: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Choosing the covariate

We would like:

Page 118: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Linear approximationLinear approximation

Choosing the covariate

We would like:

A reference point/ snap shot

Page 119: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SVRG: Stochastic Variance Reduced Gradients

Reference pointReference point

SampleSample

Grad. estimateGrad. estimate

Johnson & Zhang, 2013 NIPS

Page 120: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SVRG: Stochastic Variance Reduced Gradients

Reference pointReference point

SampleSample

Grad. estimateGrad. estimate

Johnson & Zhang, 2013 NIPS

Page 121: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SVRG: Stochastic Variance Reduced Gradients

Reference pointReference point

SampleSample

Grad. estimateGrad. estimate

Johnson & Zhang, 2013 NIPS

Page 122: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SVRG: Stochastic Variance Reduced Gradients

Reference pointReference point

SampleSample

Grad. estimateGrad. estimate

Johnson & Zhang, 2013 NIPS

Page 123: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Iteration complexity for SVRG and SAGA for arbitrary sampling

Theorem for SVRG Theorem for SVRG

stepsize Iteration complexity

Sebbouh, Gazagnadou, Jelassi, Bach, G., 2019

Page 124: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Iteration complexity for SVRG and SAGA for arbitrary sampling

Theorem for SVRG Theorem for SVRG

stepsize Iteration complexity

Sebbouh, Gazagnadou, Jelassi, Bach, G., 2019

Theorem for SAGA (and the JacSketch family of methods) Theorem for SAGA (and the JacSketch family of methods)

stepsize Iteration complexity

G., Bach, Richtarik, 2018

Page 125: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Iteration complexity for SVRG and SAGA for arbitrary sampling

Theorem for SVRG Theorem for SVRG

stepsize Iteration complexity

Sebbouh, Gazagnadou, Jelassi, Bach, G., 2019

Theorem for SAGA (and the JacSketch family of methods) Theorem for SAGA (and the JacSketch family of methods)

stepsize Iteration complexity

G., Bach, Richtarik, 2018

Missing details due to extra definitions

Page 126: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Total Complexity of mini-batch SVRG Sebbouh, Gazagnadou, Jelassi, Bach, G, 2019

Page 127: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Total Complexity of mini-batch SVRG

Non-linearly increasing

Sebbouh, Gazagnadou, Jelassi, Bach, G, 2019

Page 128: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Total Complexity of mini-batch SVRG

Non-linearly increasing

Linearly decreasing

Sebbouh, Gazagnadou, Jelassi, Bach, G, 2019

Page 129: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Total Complexity of mini-batch SVRG

Non-linearly increasing

Linearly decreasing

Sebbouh, Gazagnadou, Jelassi, Bach, G, 2019

Page 130: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Total Complexity of mini-batch SVRG

Non-linearly increasing

Linearly decreasing

Sebbouh, Gazagnadou, Jelassi, Bach, G, 2019

Page 131: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Total Complexity of mini-batch SVRG

Non-linearly increasing

Linearly decreasing

Stepsize increasing with b

Sebbouh, Gazagnadou, Jelassi, Bach, G, 2019

Page 132: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Total Complexity of mini-batch SAGA Gazagnadou, G & Salmon, ICML 2019

Page 133: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Total Complexity of mini-batch SAGA

Linearly increasing

Gazagnadou, G & Salmon, ICML 2019

Page 134: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Total Complexity of mini-batch SAGA

Linearly increasing Linearly decreasing

Gazagnadou, G & Salmon, ICML 2019

Page 135: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Total Complexity of mini-batch SAGA

Linearly increasing Linearly decreasing

Gazagnadou, G & Salmon, ICML 2019

Page 136: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Total Complexity of mini-batch SAGA

Linearly increasing Linearly decreasing

Always smallerthan 25% of data

Gazagnadou, G & Salmon, ICML 2019

Page 137: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Total Complexity of mini-batch SAGA

Real-sim data:(n,d) = (72’309, 20’958)

Predicts good total complexity

Page 138: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Total Complexity of mini-batch SAGA

Real-sim data:(n,d) = (72’309, 20’958)

Predicts good total complexity

Page 139: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Total Complexity of mini-batch SAGA

Slice data:(n,d) = (53’500, 386)

Page 140: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Total Complexity of mini-batch SAGA

Slice data:(n,d) = (53’500, 386)

Predicts good total complexity

Page 141: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Take home message so farStochastic reformulations allow

to view all variants as simple SGDStochastic reformulations allow

to view all variants as simple SGD

To analyse all forms of sampling used through expected smooth

To analyse all forms of sampling used through expected smooth

How to calculate optimal mini-batch size of SGD, SAGA and SVRG

How to calculate optimal mini-batch size of SGD, SAGA and SVRG

Stepsize increase by orders when mini-batch size increases

Stepsize increase by orders when mini-batch size increases

Page 142: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Take home message so farStochastic reformulations allow

to view all variants as simple SGDStochastic reformulations allow

to view all variants as simple SGD

To analyse all forms of sampling used through expected smooth

To analyse all forms of sampling used through expected smooth

How to calculate optimal mini-batch size of SGD, SAGA and SVRG

How to calculate optimal mini-batch size of SGD, SAGA and SVRG

Stepsize increase by orders when mini-batch size increases

Stepsize increase by orders when mini-batch size increases

Page 143: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Momentum

Page 144: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Issue with Gradient Descent

Step size/ Learning rate

Page 145: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Max local rateMax local rate

Local rate of changeLocal rate of change

Issue with Gradient Descent

GD is the “steepest descent”

Page 146: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Issue with Gradient Descent

Solution

Get’s stuck in “flat” valleys Give momentum to keep going

Page 147: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Heavey Ball Method:Heavey Ball Method:

Adding some Momentum to GD

Adds “Inertia” to update

Page 148: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Heavey Ball Method:Heavey Ball Method:

Adding some Momentum to GD

Adds “Inertia” to update

GD with momentum (GDm):GD with momentum (GDm):Adds “Momentum”

to update

Page 149: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

GDm and Heavy Ball EquivalenceGD with momentum:GD with momentum:

Page 150: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

GDm and Heavy Ball EquivalenceGD with momentum:GD with momentum:

Page 151: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

GDm and Heavy Ball EquivalenceGD with momentum:GD with momentum:

Page 152: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

GDm and Heavy Ball EquivalenceGD with momentum:GD with momentum:

Page 153: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Heavey Ball Method:Heavey Ball Method:

GDm and Heavy Ball EquivalenceGD with momentum:GD with momentum:

Page 154: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Convergence of Gradient Descent with Momentum

Theorem Theorem

stepsize

momentum parameter

Polyak 1964

Page 155: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Convergence of Gradient Descent with Momentum

Theorem Theorem

stepsize

momentum parameter

CorollaryCorollary

Polyak 1964

Page 156: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Fundamental Theorem of CalculusFundamental Theorem of Calculus

Proof sketch: GDm convergence

Page 157: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Fundamental Theorem of CalculusFundamental Theorem of Calculus

Proof sketch: GDm convergence

Page 158: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Fundamental Theorem of CalculusFundamental Theorem of Calculus

Proof sketch: GDm convergence

Page 159: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Fundamental Theorem of CalculusFundamental Theorem of Calculus

Proof sketch: GDm convergence

Page 160: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Fundamental Theorem of CalculusFundamental Theorem of Calculus

Proof sketch: GDm convergence

Depends on past. Difficult recurrence

Page 161: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Proof: Convergence of Heavy Ball

Page 162: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Proof: Convergence of Heavy Ball

Page 163: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Proof: Convergence of Heavy Ball

Page 164: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Proof: Convergence of Heavy Ball

Page 165: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Proof: Convergence of Heavy Ball

Simple recurrence!

Page 166: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Proof: Convergence of Heavy Ball

Simple recurrence!

Page 167: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Proof: Convergence of Heavy Ball

Page 168: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Proof: Convergence of Heavy Ball

EXE on Eigenvalues: EXE on Eigenvalues:

Page 169: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Proof: Convergence of Heavy Ball

EXE on Eigenvalues: EXE on Eigenvalues:

Page 170: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

Stochastic Heavey Ball Method:Stochastic Heavey Ball Method:

Adding Momentum to SGD

Adds “Inertia” to update

SGD with momentum (SGDm):SGD with momentum (SGDm):

Sampled i.i.d

Rumelhart, Hinton, Geoffrey, Ronald, 1986, Nature

Page 171: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SGDm and Averaging

http://fa.bianp.net/teaching/2018/COMP-652/stochastic_gradient.html

Page 172: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SGDm and Averaging

http://fa.bianp.net/teaching/2018/COMP-652/stochastic_gradient.html

Page 173: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SGD with momentum (SGDm):SGD with momentum (SGDm):

SGDm and Averaging

http://fa.bianp.net/teaching/2018/COMP-652/stochastic_gradient.html

Page 174: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SGD with momentum (SGDm):SGD with momentum (SGDm):

SGDm and Averaging

Acts like an approximate variance reduction since

http://fa.bianp.net/teaching/2018/COMP-652/stochastic_gradient.html

Page 175: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

SGD with momentum (SGDm):SGD with momentum (SGDm):

SGDm and Averaging

Acts like an approximate variance reduction since

http://fa.bianp.net/teaching/2018/COMP-652/stochastic_gradient.html

Page 176: Mini-batching, sampling, momentum and other tricks · 2021. 6. 25. · How to choose the minibatch size? good bad Need to figure out functional relationship between minibatch size

RMG, P. Richtarik, F. Bach (2018), preprint online Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching

N. Gazagnadou, RMG, J. Salmon (2019) , ICML 2019. Optimal mini-batch and step sizes for SAGA

RMG, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin and Peter Richtárik (2019), ICML SGD: general analysis and improved rates

O. Sebbouh, N. Gazagnadou, S. Jelassi, F. Bach, RMG Neurips 2019, preprint online. Towards closing the gap between the theory and practice of SVRG