Optimization for Data Science
Mini-batching, sampling, momentum and other tricks
Lecturer: Robert M. Gower & Alexandre Gramfort
Tutorials: Quentin Bertrand, Nidham Gazagnadou
The Stochastic Gradient Method
Sampled i.i.d.
Step size / learning rate
What about mini-batching?
The Stochastic Gradient Method: sample a mini-batch of size b.
What should b and the stepsize be? How does b influence the stepsize? How does the data influence the best mini-batch size and stepsize?
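To make the update concrete, here is a minimal NumPy sketch of mini-batch SGD on a small least-squares problem; the data, the batch size b and the stepsize gamma below are illustrative choices, not values prescribed in the lecture.

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_batch(x, idx):
    # Stochastic gradient of f(x) = (1/2n) * ||Ax - y||^2 restricted to the mini-batch idx.
    Ab, yb = A[idx], y[idx]
    return Ab.T @ (Ab @ x - yb) / len(idx)

b, gamma, iters = 32, 0.1, 2000   # mini-batch size and stepsize (illustrative)
x = np.zeros(d)
for _ in range(iters):
    idx = rng.choice(n, size=b, replace=False)   # sample a fresh mini-batch at every iteration
    x -= gamma * grad_batch(x, idx)

print("final objective:", 0.5 * np.mean((A @ x - y) ** 2))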
How to choose the minibatch size?
[Figure: cross validation score / objective function over a grid of minibatch sizes (x-axis) and step sizes (y-axis), with good and bad regions; a coarse grid around the best parameters can miss the best one.]
Need to figure out the functional relationship between minibatch size and step size.
Linear Scaling Rule: When the mini-batch size is multiplied by k, multiply the learning rate by k.
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al., CoRR 2017
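A small sketch of the Linear Scaling Rule in code; the base batch size and base learning rate below are placeholder values, not recommendations from the slides. Goyal et al. also combine the rule with a warm-up phase for very large mini-batches.

# Linear Scaling Rule (Goyal et al., 2017): when the mini-batch size is multiplied
# by k, multiply the learning rate by k.
base_batch_size = 32      # reference configuration (illustrative)
base_lr = 0.01

def scaled_lr(batch_size):
    k = batch_size / base_batch_size
    return base_lr * k

for bs in [32, 64, 128, 256, 512]:
    print(f"batch size {bs:4d} -> learning rate {scaled_lr(bs):.4f}")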
Stochastic Reformulation of Finite Sum Problems
Simple Stochastic Reformulation: introduce a random sampling vector and minimize the expectation of random linear combinations of the original functions; this is equivalent to the original finite sum problem.
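For reference, a hedged reconstruction of the reformulation in the notation of RMG, Richtárik and Bach (2018); the slides may use slightly different symbols. A random sampling vector v with E[v_i] = 1 turns the finite sum into an expectation while keeping the same minimizers:

\text{Original finite sum problem:}\qquad \min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n}\sum_{i=1}^{n} f_i(w)
\text{Simple stochastic reformulation:}\qquad \min_{w \in \mathbb{R}^d} \mathbb{E}_v\big[f_v(w)\big], \qquad f_v(w) := \frac{1}{n}\sum_{i=1}^{n} v_i f_i(w), \qquad \mathbb{E}[v_i] = 1 .

Since E[v_i] = 1, we get E[f_v(w)] = f(w) and E[grad f_v(w)] = grad f(w), which is the unbiasedness used on the next slides.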
SGD with arbitrary sampling
By design we have that the stochastic gradient is unbiased: E[grad f_v(w)] = grad f(w).
The distribution of the sampling vector encodes any form of i.i.d. mini-batching / non-uniform sampling.
Example: Gradient descent (the deterministic sampling vector of all ones).
Saves time for theorists: one representation for all forms of sampling.
Examples of arbitrary sampling: uniform single element
Random set S = {i} with i chosen uniformly at random: single element SGD.
Examples of arbitrary sampling: uniform mini-batching
Random set S of size b chosen uniformly without replacement: mini-batch SGD without replacement.
Examples of arbitrary sampling: non-uniform mini-batching
Random set S sampled with non-uniform probabilities: arbitrary sampling SGD.
Richtárik and Takáč (arXiv:1310.3438; Opt Letters 2016)
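As one concrete instance of non-uniform sampling (an assumed example, not necessarily the exact scheme from the slides): single-element SGD with importance sampling, where index i is drawn with probability p_i proportional to its smoothness constant L_i = ||a_i||^2 and the gradient is reweighted by 1/(n p_i) so the estimate stays unbiased.

import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 20
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d)

# Non-uniform sampling probabilities p_i proportional to L_i = ||a_i||^2.
L_i = np.sum(A ** 2, axis=1)
p = L_i / L_i.sum()

x, gamma = np.zeros(d), 0.01
for _ in range(5000):
    i = rng.choice(n, p=p)
    # Reweighting by 1/(n * p_i) keeps the stochastic gradient unbiased.
    g = A[i] * (A[i] @ x - y[i]) / (n * p[i])
    x -= gamma * g

print("objective:", 0.5 * np.mean((A @ x - y) ** 2))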
SGD with arbitrary sampling
Includes all forms of SGD (including GD).
How to analyse this general SGD? Look at the extremes: GD and single element SGD.
Assumptions and Convergence of Gradient Descent and SGD
Reminder: convergence of GD for strongly convex + smooth functions; smoothness gives the descent lemma.
Assumptions and Convergence of Gradient Descent
Assumptions: quasi strong convexity constant μ and smoothness constant L.
Iteration complexity of gradient descent: of the order (L/μ) log(1/ε).
Assumptions and Convergence of Stochastic Gradient Descent
Definition: L_max, the largest of the individual smoothness constants L_i; a bigger smoothness constant / stronger assumption.
Iteration complexity of SGD: governed by L_max/μ (plus a term from the gradient noise).
Needell, Srebro, Ward: Math. Prog., 2016.
Informal comparison between GD and SGD iteration complexity
GD: iteration complexity governed by L/μ, at a cost of n gradients per iteration. SGD: iteration complexity governed by L_max/μ, at a cost of one gradient per iteration.
In general L ≤ L_max ≤ nL. How do they compare? When n is big, SGD's cheaper iterations win.
Need a new "interpolating" notion of smoothness.
Key constant: Expected smoothness
Assumption (Expected Smoothness): we write (f, v) ∈ ES(𝓛) when E[||grad f_v(x) - grad f_v(x*)||^2] ≤ 2𝓛 (f(x) - f(x*)) for all x.
The expected smoothness constant 𝓛 depends on the sampling vector v and on f.
Lemma: a rough estimate (we can do better) is 𝓛 ≤ L_max.
Definition (Gradient noise): σ² := E[||grad f_v(x*)||^2], a generalization of the variance of the stochastic gradient at the optimum.
RMG, Richtárik and Bach (arXiv:1805.02632, 2018)
Example of Expected Smoothness
[Figure: expected smoothness plotted against the batch size b, where S is chosen uniformly at random from all subsets of size b.]
Measures how much the model fits the data.
EXE: In your list!
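The sketch below reproduces this kind of curve under an assumption: it uses the simple upper bound on the expected smoothness for b-nice sampling (S uniform over subsets of size b) reported in the referenced papers, which interpolates between L_max at b = 1 and L at b = n. The exact constants in the lecture's plot may differ.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
n, d = 400, 50
A = rng.standard_normal((n, d))

# Smoothness constants for f(x) = (1/2n) * sum_i (a_i^T x - y_i)^2.
L = np.linalg.eigvalsh(A.T @ A / n).max()    # smoothness of the full objective
L_max = np.max(np.sum(A ** 2, axis=1))       # largest individual smoothness constant

# Assumed bound for b-nice sampling:
#   L_exp(b) <= n(b-1)/(b(n-1)) * L + (n-b)/(b(n-1)) * L_max
b = np.arange(1, n + 1)
L_exp = n * (b - 1) / (b * (n - 1)) * L + (n - b) / (b * (n - 1)) * L_max

plt.plot(b, L_exp)
plt.xlabel("batch size b")
plt.ylabel("expected smoothness (upper bound)")
plt.show()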
Expected smoothness gives an awesome bound on the 2nd moment
Normally a bound on the gradient is an assumption: there exists B > 0 such that E[||grad f_v(x)||^2] ≤ B² for all x.
Recht, Wright & Niu, Hogwild!: NeurIPS 2011; Hazan & Kale, JMLR 2014; Rakhlin, Shamir & Sridharan, ICML 2012; Shamir & Zhang, ICML 2013.
Lemma: under expected smoothness, E[||grad f_v(x)||^2] ≤ 4𝓛 (f(x) - f(x*)) + 2σ², which is informative and holds with realistic assumptions.
Main Theorem (Linear convergence to a neighborhood)
Theorem: with a fixed stepsize, SGD with arbitrary sampling converges linearly to a neighborhood of the solution.
Corollary: fixed stepsize choice and the resulting complexity.
Saves time for theorists: includes GD and SGD as special cases, and is also tighter!
Proof is SUPER EASY: expand the update, take the expectation with respect to the sampling, apply the Lemma and quasi strong convexity, then take the total expectation.
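For reference, a hedged reconstruction of the statement, following the form in which it appears in RMG, Loizou et al. (ICML 2019); the constants on the slides may be arranged slightly differently. Assume expected smoothness with constant 𝓛, quasi strong convexity with constant μ, and gradient noise σ² = E[||grad f_v(x*)||^2]. Then for a fixed stepsize γ ≤ 1/(2𝓛):

\mathbb{E}\,\|x^{k} - x^{*}\|^{2} \;\le\; (1 - \gamma\mu)^{k}\,\|x^{0} - x^{*}\|^{2} \;+\; \frac{2\gamma\sigma^{2}}{\mu}.

This is linear convergence up to a neighborhood of size proportional to γσ²/μ; taking v to be the deterministic all-ones vector gives σ² = 0 and recovers the gradient descent rate.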
Exercises on Sampling, Expected Smoothness + gradient noise
Optimal mini-batch sizes
Corollary: total complexity for mini-batch SGD.
Total complexity = (number of iterations) × (number of stochastic gradients calculated in each iteration).
Total complexity is a simple function of the mini-batch size b.
Optimal mini-batch size
[Figure: the two terms of the total complexity as functions of b, one linearly increasing and one linearly decreasing; their sum is minimised at the optimal mini-batch size.]
The stepsize increases with b.
Optimal mini-batch size for models that interpolate the data
[Figure: for interpolating models the total complexity is linearly increasing in b, while the stepsize still increases with b.]
All gains in mini-batching are then due to multi-threading and cache memory?
[Figure: Stochastic Gradient Descent iterates with stepsize 𝛄 = 0.2.]
Learning schedule: Constant & decreasing step sizes
Theorem: use a constant learning rate up to a switch point, then switch to decreasing stepsizes. The switch point is governed by a stochastic condition number (𝓛/μ).
[Figure: Stochastic Gradient Descent with a switch to decreasing stepsizes; the switch point is marked.]
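A minimal sketch of such a schedule, assuming a constant stepsize up to the switch point and an O(1/t) decrease afterwards; the particular switch point and constants prescribed by the theorem are not reproduced here.

def stepsize_schedule(t, gamma0=0.5, switch_point=100):
    # Constant stepsize until the switch point, then decreasing like O(1/t),
    # matched so the schedule is continuous at the switch point.
    if t <= switch_point:
        return gamma0
    return gamma0 * switch_point / t

for t in [1, 50, 100, 200, 400, 1000]:
    print(t, round(stepsize_schedule(t), 4))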
Stochastic variance reduced methods
Recall the Simple Stochastic Reformulation: with a random sampling vector, we minimize the expectation of random linear combinations of the original functions, which is equivalent to the original finite sum problem.
What to do about the variance?
Controlled Stochastic Reformulation
Use covariates to control the variance: add covariate terms that cancel out in expectation, so the reformulation remains equivalent to the original finite sum problem.
Variance reduction with arbitrary sampling
By design we have that the gradient estimate remains unbiased.
How to choose the covariates?
Choosing the covariate
We would like the covariate to track the stochastic gradient, so that most of its variance cancels.
Use a linear approximation built at a reference point / snapshot.
SVRG: Stochastic Variance Reduced Gradients
Keep a reference point (snapshot), sample an index, and form the gradient estimate from the sampled gradient at the current point, the sampled gradient at the reference point, and the full gradient at the reference point.
Johnson & Zhang, NIPS 2013
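A minimal NumPy sketch of SVRG on a least-squares problem: compute the full gradient at the reference point (snapshot), then run an inner loop with the variance-reduced estimate grad_i(x) - grad_i(x_ref) + full_grad(x_ref). The data, inner-loop length and stepsize rule are illustrative choices, not the ones from the lecture.

import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 20
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d)

def grad_i(x, i):
    # Gradient of f_i(x) = (1/2) * (a_i^T x - y_i)^2.
    return A[i] * (A[i] @ x - y[i])

def full_grad(x):
    return A.T @ (A @ x - y) / n

L_max = np.max(np.sum(A ** 2, axis=1))   # largest individual smoothness constant
gamma = 1.0 / (4.0 * L_max)              # a conservative stepsize (illustrative)

x = np.zeros(d)
for epoch in range(20):
    x_ref = x.copy()               # reference point / snapshot
    g_ref = full_grad(x_ref)       # full gradient at the snapshot
    for _ in range(n):             # inner loop of length n (a common choice)
        i = rng.integers(n)
        # Unbiased gradient estimate whose variance vanishes near the solution.
        g = grad_i(x, i) - grad_i(x_ref, i) + g_ref
        x -= gamma * g

print("objective:", 0.5 * np.mean((A @ x - y) ** 2))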
Iteration complexity for SVRG and SAGA for arbitrary sampling
Theorem for SVRG: stepsize and iteration complexity. Sebbouh, Gazagnadou, Jelassi, Bach, G., 2019.
Theorem for SAGA (and the JacSketch family of methods): stepsize and iteration complexity. G., Bach, Richtárik, 2018.
Missing details due to extra definitions.
Total Complexity of mini-batch SVRG
[Figure: the two terms of the total complexity as functions of b, one non-linearly increasing and one linearly decreasing; the stepsize increases with b.]
Sebbouh, Gazagnadou, Jelassi, Bach, G., 2019
Total Complexity of mini-batch SAGA
[Figure: the two terms of the total complexity as functions of b, one linearly increasing and one linearly decreasing. The resulting optimal mini-batch size is always smaller than 25% of the data.]
Gazagnadou, G & Salmon, ICML 2019
Total Complexity of mini-batch SAGA: experiments
Real-sim data: (n, d) = (72'309, 20'958). The theory predicts a good total complexity.
Slice data: (n, d) = (53'500, 386). The theory predicts a good total complexity.
Take home message so far
Stochastic reformulations allow us to view all variants as simple SGD.
All forms of sampling used can be analysed through expected smoothness.
We can calculate the optimal mini-batch size of SGD, SAGA and SVRG.
The stepsize increases by orders of magnitude when the mini-batch size increases.
Momentum
Issue with Gradient Descent
The step size / learning rate is limited by the max local rate of change, even in regions where the local rate of change is small.
GD is the "steepest descent": it gets stuck in "flat" valleys.
Solution: give it momentum to keep going.
Adding some Momentum to GD
Heavy Ball Method: adds "inertia" to the update.
GD with momentum (GDm): adds "momentum" to the update.
GDm and Heavy Ball Equivalence
GD with momentum and the Heavy Ball Method are equivalent: eliminating the momentum variable from the GDm update recovers the heavy ball recursion (up to how the stepsize and momentum parameters are written).
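A short numerical check of the equivalence under one standard parametrization (the slides' exact parametrization may differ): with the momentum buffer started at zero, GD with momentum and the heavy ball recursion produce exactly the same iterates for the same γ and β.

import numpy as np

rng = np.random.default_rng(4)
d = 5
M = rng.standard_normal((d, d))
Q = M.T @ M + np.eye(d)           # strongly convex quadratic f(x) = 0.5 x^T Q x - b^T x
b = rng.standard_normal(d)

def grad(x):
    return Q @ x - b

gamma, beta, iters = 0.02, 0.9, 100

# GD with momentum: m_k = beta * m_{k-1} + grad(x_k),  x_{k+1} = x_k - gamma * m_k
x_m, m = np.zeros(d), np.zeros(d)
# Heavy ball:       x_{k+1} = x_k - gamma * grad(x_k) + beta * (x_k - x_{k-1})
x_hb, x_hb_prev = np.zeros(d), np.zeros(d)

for _ in range(iters):
    m = beta * m + grad(x_m)
    x_m = x_m - gamma * m
    x_hb, x_hb_prev = x_hb - gamma * grad(x_hb) + beta * (x_hb - x_hb_prev), x_hb

print("max difference between the two iterates:", np.max(np.abs(x_m - x_hb)))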
Convergence of Gradient Descent with Momentum
Theorem (Polyak 1964): for a suitable stepsize and momentum parameter, the heavy ball method converges at an accelerated rate.
Corollary: the stepsize and momentum parameter are chosen in terms of the smoothness constant L and the strong convexity constant μ.
Proof sketch: GDm convergence
Use the Fundamental Theorem of Calculus to rewrite the gradient. The update depends on the past iterates, which gives a difficult recurrence.
Proof: Convergence of Heavy Ball
Stacking consecutive iterates turns the heavy ball update into a simple recurrence!
EXE on eigenvalues: bound the rate through the eigenvalues of the recurrence matrix.
Adding Momentum to SGD
Stochastic Heavy Ball Method: adds "inertia" to the update.
SGD with momentum (SGDm): the index is sampled i.i.d.
Rumelhart, Hinton & Williams, 1986, Nature
SGDm and Averaging
SGD with momentum (SGDm) acts like an approximate variance reduction, since the momentum term is a (geometric) average of past stochastic gradients.
http://fa.bianp.net/teaching/2018/COMP-652/stochastic_gradient.html
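A minimal sketch of SGDm on a least-squares problem, written with the (1 - beta) averaging convention so the momentum buffer is literally a geometric average of past stochastic gradients (the momentum form m = beta * m + g used earlier differs only by a rescaling of the stepsize). Hyperparameters are illustrative.

import numpy as np

rng = np.random.default_rng(5)
n, d = 1000, 20
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

x, m = np.zeros(d), np.zeros(d)
gamma, beta = 0.005, 0.9                      # stepsize and momentum (illustrative)
for _ in range(5000):
    i = rng.integers(n)                       # index sampled i.i.d.
    g = A[i] * (A[i] @ x - y[i])              # stochastic gradient of f_i
    m = beta * m + (1 - beta) * g             # geometric average of past stochastic gradients
    x -= gamma * m

print("objective:", 0.5 * np.mean((A @ x - y) ** 2))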
RMG, P. Richtárik, F. Bach (2018), preprint online. Stochastic quasi-gradient methods: variance reduction via Jacobian sketching.
N. Gazagnadou, RMG, J. Salmon (2019), ICML 2019. Optimal mini-batch and step sizes for SAGA.
RMG, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, P. Richtárik (2019), ICML 2019. SGD: general analysis and improved rates.
O. Sebbouh, N. Gazagnadou, S. Jelassi, F. Bach, RMG (2019), NeurIPS, preprint online. Towards closing the gap between the theory and practice of SVRG.