
Optimization for Data Science

Mini-batching, sampling, momentum and other tricks

Lecturer: Robert M. Gower & Alexandre Gramfort

Tutorials: Quentin Bertrand, Nidham Gazagnadou

The Stochastic Gradient Method

Sampled i.i.d.

Step size / learning rate
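The update these labels annotate is not in the transcript; as a reminder, here is a minimal sketch of the standard single-element SGD step, with w^t the iterate, γ the step size and i sampled i.i.d. (symbols assumed, not the lecture's notation):

w^{t+1} = w^t - \gamma \nabla f_i(w^t), \qquad i \sim \mathrm{Uniform}\{1,\dots,n\}, \qquad \gamma = \text{step size / learning rate}.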

What about mini-batching?

The Stochastic Gradient Method

Sample mini-batch with

What should b and the step size be? How does b influence the step size? How does the data influence the best mini-batch and step size?
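To make the b-versus-step-size question concrete, here is a minimal mini-batch SGD sketch in Python (NumPy). The helper grad_i, the batch size b and the step size gamma are illustrative assumptions, not the lecture's exact setup.

```python
import numpy as np

# Minimal mini-batch SGD sketch: average the gradients of a random batch of
# size b and take a step of size gamma. grad_i(i, w) is assumed to return
# the gradient of f_i at w.
def minibatch_sgd(w0, grad_i, n, b=32, gamma=0.1, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(n_iters):
        batch = rng.choice(n, size=b, replace=False)  # uniform mini-batch
        g = np.mean([grad_i(i, w) for i in batch], axis=0)
        w = w - gamma * g
    return w
```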

How to choose the minibatch size?

[Figure: cross-validation score / objective function over a grid of minibatch sizes and step sizes, with good and bad regions, the best parameters marked, and a coarse grid search that missed the best one.]

Need to figure out the functional relationship between minibatch size and step size.

Linear Scaling Rule: when the minibatch size is multiplied by k, multiply the learning rate by k.

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al., CoRR 2017
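A tiny sketch of the linear scaling rule in code; the reference batch size and learning rate below are made-up placeholders, not values from the paper.

```python
# Linear scaling rule (Goyal et al., 2017): scale the learning rate
# proportionally to the mini-batch size, relative to a reference setting.
def scaled_learning_rate(b, b_ref=32, lr_ref=0.1):
    return lr_ref * (b / b_ref)  # b = k * b_ref  =>  lr = k * lr_ref
```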

Stochastic Reformulation of Finite Sum Problems

Random sampling vector

Original finite sum problem

Simple Stochastic Reformulation: minimizing the expectation of random linear combinations of the original functions
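In symbols, the reformulation used in the cited papers (up to notation, which is my assumption here):

\min_w f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)
\quad\Longleftrightarrow\quad
\min_w \mathbb{E}_v\big[f_v(w)\big], \qquad f_v(w) := \frac{1}{n}\sum_{i=1}^n v_i f_i(w),

where v is a random sampling vector with \mathbb{E}[v_i] = 1 for all i, so that \mathbb{E}_v[f_v(w)] = f(w).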

SGD with arbitrary sampling

By design we have that the reformulation is unbiased: E_v[f_v(w)] = f(w). Example: gradient descent (take v identically equal to one).

The distribution of the sampling vector encodes any form of i.i.d. mini-batching / non-uniform sampling.

Saves time for theorists: one representation for all forms of sampling.

Examples of arbitrary sampling: uniform single element

Random set: single element SGD

Examples of arbitrary sampling: uniform mini-batching

Random set: mini-batch SGD without replacement

Examples of arbitrary sampling: non-uniform mini-batching

Random set: arbitrary sampling SGD

Richtárik and Takáč (arXiv:1310.3438; Opt. Letters 2016)
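A hedged sketch of how these three samplings can be written as sampling vectors v with E[v_i] = 1 (NumPy); the scaling factors are standard, but treat the exact conventions as my assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_single_element(n):
    # Pick one index uniformly; v_i = n on that index, 0 elsewhere, so E[v_i] = 1.
    v = np.zeros(n)
    v[rng.integers(n)] = n
    return v

def uniform_minibatch(n, b):
    # Pick a subset of size b without replacement; v_i = n/b on the subset, so E[v_i] = 1.
    v = np.zeros(n)
    v[rng.choice(n, size=b, replace=False)] = n / b
    return v

def nonuniform_single_element(n, p):
    # Pick index i with probability p_i; v_i = 1/p_i, so again E[v_i] = 1.
    v = np.zeros(n)
    i = rng.choice(n, p=p)
    v[i] = 1.0 / p[i]
    return v
```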

SGD with arbitrary sampling

Includes all forms of SGD (including GD).

How to analyse this general SGD? Look at the extremes: GD and single element SGD.

Assumptions and convergence of Gradient Descent and SGD

Reminder: convergence of GD under strong convexity + smoothness. Now smoothness gives the key descent inequality.

Assumptions and Convergence of Gradient Descent

Quasi-strong convexity constant; smoothness constant.

Iteration complexity of gradient descent
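For reference, a minimal LaTeX sketch of the two constants and the resulting GD rate, with symbols μ (quasi-strong convexity) and L (smoothness) that I am introducing here; the lecture's exact statement may differ:

\langle \nabla f(w),\, w - w^* \rangle \;\ge\; f(w) - f(w^*) + \frac{\mu}{2}\|w - w^*\|^2 \qquad \text{(quasi-strong convexity)},

\|\nabla f(w) - \nabla f(y)\| \;\le\; L\,\|w - y\| \qquad \text{(L-smoothness)},

and gradient descent with step size \gamma = 1/L needs on the order of (L/\mu)\log(1/\epsilon) iterations to reach accuracy \epsilon.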

Assumptions and Convergence of Stochastic Gradient Descent

Definition: a bigger smoothness constant / a stronger assumption.

Iteration complexity of SGD

Needell, Srebro, Ward: Math. Prog., 2016.

Informal comparison between GD and SGD iteration complexity

GD vs. SGD: how do they compare in general? And when n is big?

Need a new "interpolating" notion of smoothness.
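One informal way to write the comparison (my paraphrase, with L_max the largest of the individual smoothness constants; the lecture's exact constants may differ):

\text{GD: } \;\frac{L}{\mu}\log\frac{1}{\epsilon}\ \text{iterations, each costing } n \text{ gradients};
\qquad
\text{SGD: } \;\approx \frac{L_{\max}}{\mu}\log\frac{1}{\epsilon}\ \text{iterations (to a neighborhood), each costing } 1 \text{ gradient}.

Since L \le L_{\max} (and L_{\max} \le nL for convex f_i), the single-gradient iterations of SGD win in total cost when n is big, which motivates a notion of smoothness that interpolates between L and L_{\max}.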

Key constant: Expected Smoothness

Assumption (Expected Smoothness): we say f is smooth in expectation with respect to the sampling vector v when the inequality sketched below holds. The expected smoothness constant depends on both v and f.

Lemma: a rough estimate (we can do better).

Definition: Gradient noise (a generalization of the gradient noise of single-element SGD).

RMG, Richtárik and Bach (arXiv:1805.02632, 2018)
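A LaTeX sketch of the two quantities as defined in the cited paper (arXiv:1805.02632), up to notation; the symbols \mathcal{L} and \sigma^2 are my assumptions:

\mathbb{E}_v\big[\|\nabla f_v(w) - \nabla f_v(w^*)\|^2\big] \;\le\; 2\mathcal{L}\,\big(f(w) - f(w^*)\big) \qquad \text{(expected smoothness)},

\sigma^2 := \mathbb{E}_v\big[\|\nabla f_v(w^*)\|^2\big] \qquad \text{(gradient noise)}.

For uniform single-element sampling, \sigma^2 reduces to \frac{1}{n}\sum_i \|\nabla f_i(w^*)\|^2, the usual SGD gradient noise.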

Example of Expected Smoothness

[Figure: the expected smoothness constant (y-axis, roughly 50 to 400) plotted against the batch size b (x-axis, 1 to 13), where S is chosen uniformly at random from all subsets of size b.]

Measures how much the model fits the data.

EXE: In your list!
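For the uniform-subsets sampling in the figure, the cited works (e.g. Gower, Loizou et al., 2019) give a simple bound on the expected smoothness constant. A sketch in Python, written from memory, so treat the exact constants as an assumption.

```python
# Upper bound on the expected smoothness constant for "b-nice" sampling
# (S uniform over all subsets of size b). It interpolates between L_max
# at b = 1 and L at b = n.
def expected_smoothness_bound(b, n, L, L_max):
    return (n * (b - 1)) / (b * (n - 1)) * L + (n - b) / (b * (n - 1)) * L_max
```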

Expected smoothness gives an awesome bound on the 2nd moment

Normally a bound on the gradient is an assumption: "there exists B > 0" bounding the second moment of the stochastic gradients.

Recht, Wright & Niu, Hogwild!, NeurIPS 2011.
Hazan & Kale, JMLR 2014.
Rakhlin, Shamir & Sridharan, ICML 2012.
Shamir & Zhang, ICML 2013.

Lemma: here the second-moment bound is informative and follows from realistic assumptions.
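The lemma's bound, as stated in the cited SGD papers (constants written from memory, so treat them as an assumption):

\mathbb{E}_v\big[\|\nabla f_v(w)\|^2\big] \;\le\; 4\mathcal{L}\,\big(f(w) - f(w^*)\big) + 2\sigma^2,

i.e. the second moment is automatically controlled by the suboptimality and the gradient noise, instead of being assumed bounded by a constant B.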

Main Theorem (Linear convergence to a neighborhood)

Theorem: fixed stepsize.

Corollary.

Saves time for theorists: includes GD and SGD as special cases. Also tighter!

Proof is SUPER EASY: expand the update, take expectation with respect to the sampling, apply the Lemma and quasi-strong convexity, then take total expectation.
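A sketch of the statement in the cited papers (Gower et al., 2019), again up to constants I am writing from memory: for \gamma \le \tfrac{1}{2\mathcal{L}}, SGD with arbitrary sampling satisfies

\mathbb{E}\|w^t - w^*\|^2 \;\le\; (1 - \gamma\mu)^t\,\|w^0 - w^*\|^2 + \frac{2\gamma\sigma^2}{\mu},

linear convergence to a neighborhood whose size is proportional to the stepsize and the gradient noise.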

Exercises on Sampling, Expected Smoothness + gradient noise

Optimal mini-batch sizes

Corollary: total complexity for mini-batch SGD.

Total complexity = total number of stochastic gradients calculated (b in each iteration).

Total complexity is a simple function of the mini-batch size b.

Optimal mini-batch size

One term of the total complexity is linearly increasing in b, the other linearly decreasing, and the stepsize increases with b.

Optimal mini-batch size for models that interpolate the data

Here the total complexity is linearly increasing in b, while the stepsize still increases with b. So all gains in mini-batching are due to multi-threading and cache memory?
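Why interpolation changes the picture, in one line (my paraphrase of the standard argument): if the model interpolates the data, then w^* minimizes every f_i, so \nabla f_i(w^*) = 0 for all i and hence

\sigma^2 = \mathbb{E}_v\|\nabla f_v(w^*)\|^2 = 0,

the neighborhood in the main theorem vanishes, the linearly decreasing (variance) term drops out of the total complexity, and only the term that grows with b remains.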

Stochastic Gradient Descent, γ = 0.2

Learning schedule: Constant & decreasing step sizes

Theorem: learning rate with a switch point; a stochastic condition number.

[Figure: Stochastic Gradient Descent with a switch to decreasing stepsizes; the switch point is marked.]
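A hypothetical sketch of such a schedule; the constant phase, the switch point and the O(1/t) decay describe the shape only, not the lecture's exact formula.

```python
# Constant-then-decreasing learning-rate schedule with a switch point.
def step_size(t, gamma0=0.5, t_switch=100):
    if t <= t_switch:
        return gamma0                 # constant phase
    return gamma0 * t_switch / t      # decreasing phase, O(1/t)
```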

Stochastic variance reduced methods

Random sampling vector

Stochastic Reformulation: minimizing the expectation of random linear combinations of the original functions.

Original finite sum problem; simple stochastic reformulation.

What to do about the variance?

Controlled Stochastic Reformulation

Use covariates to control the variance: add covariates that cancel out in expectation.

Original finite sum problem → controlled stochastic reformulation.
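The generic control-variate identity behind the slide, in notation I am assuming (c is the covariate):

g \;=\; \nabla f_v(w) - c + \mathbb{E}[c] \quad\Rightarrow\quad \mathbb{E}[g] = \nabla f(w),

and the variance of g is reduced whenever c is strongly correlated with \nabla f_v(w); the art is choosing a covariate whose expectation is cheap to compute.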

Variance reduction with arbitrary sampling

By design we have that the resulting gradient estimate is still unbiased.

How to choose the covariate?

Choosing the covariate

We would like the covariates to track the f_i as closely as possible: use their linear approximation at a reference point / snapshot.

SVRG: Stochastic Variance Reduced Gradients

Reference point; sample; gradient estimate.

Johnson & Zhang, NIPS 2013.
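A minimal SVRG sketch in Python (NumPy); grad_i, the loop lengths and the step size are illustrative assumptions, not the lecture's exact settings.

```python
import numpy as np

# SVRG (Johnson & Zhang, 2013) for f(w) = (1/n) sum_i f_i(w).
# grad_i(i, w) is assumed to return the gradient of f_i at w.
def svrg(w0, grad_i, n, gamma=0.1, n_outer=20, n_inner=None, seed=0):
    rng = np.random.default_rng(seed)
    n_inner = n_inner or n
    w_ref = w0.copy()
    for _ in range(n_outer):
        # full gradient at the reference point / snapshot
        full_grad = np.mean([grad_i(i, w_ref) for i in range(n)], axis=0)
        w = w_ref.copy()
        for _ in range(n_inner):
            i = rng.integers(n)
            # variance-reduced gradient estimate
            g = grad_i(i, w) - grad_i(i, w_ref) + full_grad
            w = w - gamma * g
        w_ref = w  # update the reference point
    return w_ref
```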

Iteration complexity for SVRG and SAGA for arbitrary sampling

Theorem for SVRG: stepsize and iteration complexity. (Sebbouh, Gazagnadou, Jelassi, Bach, G., 2019)

Theorem for SAGA (and the JacSketch family of methods): stepsize and iteration complexity. (G., Bach, Richtárik, 2018)

Missing details due to extra definitions.

Total Complexity of mini-batch SVRG

One term is non-linearly increasing in b, the other linearly decreasing, and the stepsize increases with b.

Sebbouh, Gazagnadou, Jelassi, Bach, G., 2019

Total Complexity of mini-batch SAGA

One term is linearly increasing in b, the other linearly decreasing. The optimal mini-batch size is always smaller than 25% of the data.

Real-sim data: (n, d) = (72'309, 20'958). Predicts a good total complexity.

Slice data: (n, d) = (53'500, 386). Predicts a good total complexity.

Gazagnadou, G. & Salmon, ICML 2019

Take home message so far

Stochastic reformulations allow us to view all variants as simple SGD.

Expected smoothness lets us analyse all the forms of sampling used.

We can calculate the optimal mini-batch size of SGD, SAGA and SVRG.

The stepsize increases by orders of magnitude when the mini-batch size increases.

Momentum

Issue with Gradient Descent

Step size / learning rate; max local rate; local rate of change.

GD is the "steepest descent": it gets stuck in "flat" valleys.

Solution: give momentum to keep going.

Adding some Momentum to GD

Heavy Ball Method: adds "inertia" to the update.

GD with momentum (GDm): adds "momentum" to the update.

GDm and Heavy Ball Equivalence: the two updates are the same method written in different variables.
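In symbols (my notation, with γ the stepsize and β the momentum/inertia parameter; the lecture's parametrization may differ):

\text{Heavy Ball:}\quad w^{t+1} = w^t - \gamma \nabla f(w^t) + \beta\,(w^t - w^{t-1}),

\text{GDm:}\quad m^t = \beta\, m^{t-1} + \nabla f(w^t), \qquad w^{t+1} = w^t - \gamma\, m^t.

Substituting -\gamma m^{t-1} = w^t - w^{t-1} into the GDm update recovers the Heavy Ball update, which is the claimed equivalence.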

Convergence of Gradient Descent with Momentum

Theorem: stepsize and momentum parameter.

Corollary.

Polyak 1964
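The classical parameter choice behind Polyak's result, quoted from the standard literature for L-smooth, μ-strongly convex quadratics (not necessarily the exact statement on the slide):

\gamma = \frac{4}{(\sqrt{L} + \sqrt{\mu})^2}, \qquad \beta = \left(\frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}\right)^2,

which yields the accelerated rate \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}} per iteration, i.e. an iteration complexity scaling with \sqrt{L/\mu} instead of L/\mu.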

Proof sketch: GDm convergence

Use the Fundamental Theorem of Calculus. The iterates depend on the past: a difficult recurrence.

Proof: Convergence of Heavy Ball

A simple recurrence!

EXE on eigenvalues.

Adding Momentum to SGD

Stochastic Heavy Ball Method: adds "inertia" to the update.

SGD with momentum (SGDm): sampled i.i.d.

Rumelhart, Hinton & Williams, 1986, Nature
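A minimal SGDm sketch in Python (NumPy); grad_i, gamma and beta are illustrative assumptions, not the lecture's notation.

```python
import numpy as np

# SGD with momentum (SGDm): keep a momentum buffer m that accumulates
# past stochastic gradients ("inertia"), then step along it.
def sgd_momentum(w0, grad_i, n, gamma=0.01, beta=0.9, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    w, m = w0.copy(), np.zeros_like(w0)
    for _ in range(n_iters):
        i = rng.integers(n)          # index sampled i.i.d.
        m = beta * m + grad_i(i, w)  # momentum buffer
        w = w - gamma * m
    return w
```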

SGDm and Averaging

SGD with momentum (SGDm) acts like an approximate variance reduction, since the momentum term is a weighted average of past stochastic gradients.

http://fa.bianp.net/teaching/2018/COMP-652/stochastic_gradient.html

RMG, P. Richtárik, F. Bach (2018), preprint online. Stochastic quasi-gradient methods: variance reduction via Jacobian sketching.

N. Gazagnadou, RMG, J. Salmon (2019), ICML 2019. Optimal mini-batch and step sizes for SAGA.

RMG, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin and P. Richtárik (2019), ICML 2019. SGD: general analysis and improved rates.

O. Sebbouh, N. Gazagnadou, S. Jelassi, F. Bach, RMG (2019), NeurIPS 2019, preprint online. Towards closing the gap between the theory and practice of SVRG.
