
Optimization for Data Science

Mini-batching, sampling, momentum and other tricks

Lecturer: Robert M. Gower & Alexandre Gramfort

Tutorials: Quentin Bertrand, Nidham Gazagnadou

The Stochastic Gradient Method

Sampled i.i.d.

Step size / learning rate
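The update these labels annotate is not in the transcript; as a reminder, here is a minimal sketch of the standard single-element SGD step, with w^t the iterate, γ the step size and i sampled i.i.d. (symbols assumed, not the lecture's notation):

w^{t+1} = w^t - \gamma \nabla f_i(w^t), \qquad i \sim \mathrm{Uniform}\{1,\dots,n\}, \qquad \gamma = \text{step size / learning rate}.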

What about mini-batching?

The Stochastic Gradient Method

Sample mini-batch with

What should b and the step size be? How does b influence the step size? How does the data influence the best mini-batch and step size?
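To make the b-versus-step-size question concrete, here is a minimal mini-batch SGD sketch in Python (NumPy). The helper grad_i, the batch size b and the step size gamma are illustrative assumptions, not the lecture's exact setup.

```python
import numpy as np

# Minimal mini-batch SGD sketch: average the gradients of a random batch of
# size b and take a step of size gamma. grad_i(i, w) is assumed to return
# the gradient of f_i at w.
def minibatch_sgd(w0, grad_i, n, b=32, gamma=0.1, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(n_iters):
        batch = rng.choice(n, size=b, replace=False)  # uniform mini-batch
        g = np.mean([grad_i(i, w) for i in batch], axis=0)
        w = w - gamma * g
    return w
```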

How to choose the minibatch size?

[Figure: cross-validation score / objective function over a grid of minibatch sizes and step sizes, with good and bad regions, the best parameters marked, and a coarse grid search that missed the best one.]

Need to figure out the functional relationship between minibatch size and step size.

Linear Scaling Rule: when the minibatch size is multiplied by k, multiply the learning rate by k.

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al., CoRR 2017
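A tiny sketch of the linear scaling rule in code; the reference batch size and learning rate below are made-up placeholders, not values from the paper.

```python
# Linear scaling rule (Goyal et al., 2017): scale the learning rate
# proportionally to the mini-batch size, relative to a reference setting.
def scaled_learning_rate(b, b_ref=32, lr_ref=0.1):
    return lr_ref * (b / b_ref)  # b = k * b_ref  =>  lr = k * lr_ref
```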

Stochastic Reformulation of Finite Sum Problems

Random sampling vector

Original finite sum problem

Simple Stochastic Reformulation: minimizing the expectation of random linear combinations of the original functions
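In symbols, the reformulation used in the cited papers (up to notation, which is my assumption here):

\min_w f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)
\quad\Longleftrightarrow\quad
\min_w \mathbb{E}_v\big[f_v(w)\big], \qquad f_v(w) := \frac{1}{n}\sum_{i=1}^n v_i f_i(w),

where v is a random sampling vector with \mathbb{E}[v_i] = 1 for all i, so that \mathbb{E}_v[f_v(w)] = f(w).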

SGD with arbitrary sampling

By design we have that the reformulation is unbiased: E_v[f_v(w)] = f(w). Example: gradient descent (take v identically equal to one).

The distribution of the sampling vector encodes any form of i.i.d. mini-batching / non-uniform sampling.

Saves time for theorists: one representation for all forms of sampling.

Examples of arbitrary sampling: uniform single element

Random set: single element SGD

Examples of arbitrary sampling: uniform mini-batching

Random set: mini-batch SGD without replacement

Examples of arbitrary sampling: non-uniform mini-batching

Random set: arbitrary sampling SGD

Richtárik and Takáč (arXiv:1310.3438; Opt. Letters 2016)
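A hedged sketch of how these three samplings can be written as sampling vectors v with E[v_i] = 1 (NumPy); the scaling factors are standard, but treat the exact conventions as my assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_single_element(n):
    # Pick one index uniformly; v_i = n on that index, 0 elsewhere, so E[v_i] = 1.
    v = np.zeros(n)
    v[rng.integers(n)] = n
    return v

def uniform_minibatch(n, b):
    # Pick a subset of size b without replacement; v_i = n/b on the subset, so E[v_i] = 1.
    v = np.zeros(n)
    v[rng.choice(n, size=b, replace=False)] = n / b
    return v

def nonuniform_single_element(n, p):
    # Pick index i with probability p_i; v_i = 1/p_i, so again E[v_i] = 1.
    v = np.zeros(n)
    i = rng.choice(n, p=p)
    v[i] = 1.0 / p[i]
    return v
```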

SGD with arbitrary sampling

Includes all forms of SGD (including GD).

How to analyse this general SGD? Look at the extremes: GD and single element SGD.

Assumptions and convergence of Gradient Descent and SGD

Reminder: convergence of GD under strong convexity + smoothness. Now smoothness gives the key descent inequality.

Assumptions and Convergence of Gradient Descent

Quasi-strong convexity constant; smoothness constant.

Iteration complexity of gradient descent
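For reference, a minimal LaTeX sketch of the two constants and the resulting GD rate, with symbols μ (quasi-strong convexity) and L (smoothness) that I am introducing here; the lecture's exact statement may differ:

\langle \nabla f(w),\, w - w^* \rangle \;\ge\; f(w) - f(w^*) + \frac{\mu}{2}\|w - w^*\|^2 \qquad \text{(quasi-strong convexity)},

\|\nabla f(w) - \nabla f(y)\| \;\le\; L\,\|w - y\| \qquad \text{(L-smoothness)},

and gradient descent with step size \gamma = 1/L needs on the order of (L/\mu)\log(1/\epsilon) iterations to reach accuracy \epsilon.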

Assumptions and Convergence of Stochastic Gradient Descent

Definition: a bigger smoothness constant / a stronger assumption.

Iteration complexity of SGD

Needell, Srebro, Ward: Math. Prog., 2016.

Informal comparison between GD and SGD iteration complexity

GD vs. SGD: how do they compare in general? And when n is big?

Need a new "interpolating" notion of smoothness.
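One informal way to write the comparison (my paraphrase, with L_max the largest of the individual smoothness constants; the lecture's exact constants may differ):

\text{GD: } \;\frac{L}{\mu}\log\frac{1}{\epsilon}\ \text{iterations, each costing } n \text{ gradients};
\qquad
\text{SGD: } \;\approx \frac{L_{\max}}{\mu}\log\frac{1}{\epsilon}\ \text{iterations (to a neighborhood), each costing } 1 \text{ gradient}.

Since L \le L_{\max} (and L_{\max} \le nL for convex f_i), the single-gradient iterations of SGD win in total cost when n is big, which motivates a notion of smoothness that interpolates between L and L_{\max}.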

Key constant: Expected Smoothness

Assumption (Expected Smoothness): we say f is smooth in expectation with respect to the sampling vector v when the inequality sketched below holds. The expected smoothness constant depends on both v and f.

Lemma: a rough estimate (we can do better).

Definition: Gradient noise (a generalization of the gradient noise of single-element SGD).

RMG, Richtárik and Bach (arXiv:1805.02632, 2018)
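A LaTeX sketch of the two quantities as defined in the cited paper (arXiv:1805.02632), up to notation; the symbols \mathcal{L} and \sigma^2 are my assumptions:

\mathbb{E}_v\big[\|\nabla f_v(w) - \nabla f_v(w^*)\|^2\big] \;\le\; 2\mathcal{L}\,\big(f(w) - f(w^*)\big) \qquad \text{(expected smoothness)},

\sigma^2 := \mathbb{E}_v\big[\|\nabla f_v(w^*)\|^2\big] \qquad \text{(gradient noise)}.

For uniform single-element sampling, \sigma^2 reduces to \frac{1}{n}\sum_i \|\nabla f_i(w^*)\|^2, the usual SGD gradient noise.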

Example of Expected Smoothness

[Figure: the expected smoothness constant (y-axis, roughly 50 to 400) plotted against the batch size b (x-axis, 1 to 13), where S is chosen uniformly at random from all subsets of size b.]

Measures how much the model fits the data.

EXE: In your list!
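For the uniform-subsets sampling in the figure, the cited works (e.g. Gower, Loizou et al., 2019) give a simple bound on the expected smoothness constant. A sketch in Python, written from memory, so treat the exact constants as an assumption.

```python
# Upper bound on the expected smoothness constant for "b-nice" sampling
# (S uniform over all subsets of size b). It interpolates between L_max
# at b = 1 and L at b = n.
def expected_smoothness_bound(b, n, L, L_max):
    return (n * (b - 1)) / (b * (n - 1)) * L + (n - b) / (b * (n - 1)) * L_max
```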

Expected smoothness gives an awesome bound on the 2nd moment

Normally a bound on the gradient is an assumption: "there exists B > 0" bounding the second moment of the stochastic gradients.

Recht, Wright & Niu, Hogwild!, NeurIPS 2011.
Hazan & Kale, JMLR 2014.
Rakhlin, Shamir & Sridharan, ICML 2012.
Shamir & Zhang, ICML 2013.

Lemma: here the second-moment bound is informative and follows from realistic assumptions.
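The lemma's bound, as stated in the cited SGD papers (constants written from memory, so treat them as an assumption):

\mathbb{E}_v\big[\|\nabla f_v(w)\|^2\big] \;\le\; 4\mathcal{L}\,\big(f(w) - f(w^*)\big) + 2\sigma^2,

i.e. the second moment is automatically controlled by the suboptimality and the gradient noise, instead of being assumed bounded by a constant B.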

Main Theorem (Linear convergence to a neighborhood)

Theorem: fixed stepsize.

Corollary.

Saves time for theorists: includes GD and SGD as special cases. Also tighter!

Proof is SUPER EASY: expand the update, take expectation with respect to the sampling, apply the Lemma and quasi-strong convexity, then take total expectation.
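A sketch of the statement in the cited papers (Gower et al., 2019), again up to constants I am writing from memory: for \gamma \le \tfrac{1}{2\mathcal{L}}, SGD with arbitrary sampling satisfies

\mathbb{E}\|w^t - w^*\|^2 \;\le\; (1 - \gamma\mu)^t\,\|w^0 - w^*\|^2 + \frac{2\gamma\sigma^2}{\mu},

linear convergence to a neighborhood whose size is proportional to the stepsize and the gradient noise.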

Exercises on Sampling, Expected Smoothness + gradient noise

Optimal mini-batch sizes

Corollary: total complexity for mini-batch SGD.

Total complexity = total number of stochastic gradients calculated (b in each iteration).

Total complexity is a simple function of the mini-batch size b.

Optimal mini-batch size

One term of the total complexity is linearly increasing in b, the other linearly decreasing, and the stepsize increases with b.

Optimal mini-batch size for models that interpolate the data

Here the total complexity is linearly increasing in b, while the stepsize still increases with b. So all gains in mini-batching are due to multi-threading and cache memory?
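Why interpolation changes the picture, in one line (my paraphrase of the standard argument): if the model interpolates the data, then w^* minimizes every f_i, so \nabla f_i(w^*) = 0 for all i and hence

\sigma^2 = \mathbb{E}_v\|\nabla f_v(w^*)\|^2 = 0,

the neighborhood in the main theorem vanishes, the linearly decreasing (variance) term drops out of the total complexity, and only the term that grows with b remains.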

Stochastic Gradient Descent, γ = 0.2

Learning schedule: Constant & decreasing step sizes

Theorem: learning rate with a switch point; a stochastic condition number.

[Figure: Stochastic Gradient Descent with a switch to decreasing stepsizes; the switch point is marked.]
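A hypothetical sketch of such a schedule; the constant phase, the switch point and the O(1/t) decay describe the shape only, not the lecture's exact formula.

```python
# Constant-then-decreasing learning-rate schedule with a switch point.
def step_size(t, gamma0=0.5, t_switch=100):
    if t <= t_switch:
        return gamma0                 # constant phase
    return gamma0 * t_switch / t      # decreasing phase, O(1/t)
```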

Stochastic variance reduced methods

Random sampling vector

Stochastic Reformulation: minimizing the expectation of random linear combinations of the original functions.

Original finite sum problem; simple stochastic reformulation.

What to do about the variance?

Controlled Stochastic Reformulation

Use covariates to control the variance: add covariates that cancel out in expectation.

Original finite sum problem → controlled stochastic reformulation.
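The generic control-variate identity behind the slide, in notation I am assuming (c is the covariate):

g \;=\; \nabla f_v(w) - c + \mathbb{E}[c] \quad\Rightarrow\quad \mathbb{E}[g] = \nabla f(w),

and the variance of g is reduced whenever c is strongly correlated with \nabla f_v(w); the art is choosing a covariate whose expectation is cheap to compute.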

Variance reduction with arbitrary sampling

By design we have that the resulting gradient estimate is still unbiased.

How to choose the covariate?

Choosing the covariate

We would like the covariates to track the f_i as closely as possible: use their linear approximation at a reference point / snapshot.

SVRG: Stochastic Variance Reduced Gradients

Reference point; sample; gradient estimate.

Johnson & Zhang, NIPS 2013.
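A minimal SVRG sketch in Python (NumPy); grad_i, the loop lengths and the step size are illustrative assumptions, not the lecture's exact settings.

```python
import numpy as np

# SVRG (Johnson & Zhang, 2013) for f(w) = (1/n) sum_i f_i(w).
# grad_i(i, w) is assumed to return the gradient of f_i at w.
def svrg(w0, grad_i, n, gamma=0.1, n_outer=20, n_inner=None, seed=0):
    rng = np.random.default_rng(seed)
    n_inner = n_inner or n
    w_ref = w0.copy()
    for _ in range(n_outer):
        # full gradient at the reference point / snapshot
        full_grad = np.mean([grad_i(i, w_ref) for i in range(n)], axis=0)
        w = w_ref.copy()
        for _ in range(n_inner):
            i = rng.integers(n)
            # variance-reduced gradient estimate
            g = grad_i(i, w) - grad_i(i, w_ref) + full_grad
            w = w - gamma * g
        w_ref = w  # update the reference point
    return w_ref
```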

Iteration complexity for SVRG and SAGA for arbitrary sampling

Theorem for SVRG: stepsize and iteration complexity. (Sebbouh, Gazagnadou, Jelassi, Bach, G., 2019)

Theorem for SAGA (and the JacSketch family of methods): stepsize and iteration complexity. (G., Bach, Richtárik, 2018)

Missing details due to extra definitions.

Total Complexity of mini-batch SVRG

One term is non-linearly increasing in b, the other linearly decreasing, and the stepsize increases with b.

Sebbouh, Gazagnadou, Jelassi, Bach, G., 2019

Total Complexity of mini-batch SAGA

One term is linearly increasing in b, the other linearly decreasing. The optimal mini-batch size is always smaller than 25% of the data.

Real-sim data: (n, d) = (72'309, 20'958). Predicts a good total complexity.

Slice data: (n, d) = (53'500, 386). Predicts a good total complexity.

Gazagnadou, G. & Salmon, ICML 2019

Take home message so far

Stochastic reformulations allow us to view all variants as simple SGD.

Expected smoothness lets us analyse all the forms of sampling used.

We can calculate the optimal mini-batch size of SGD, SAGA and SVRG.

The stepsize increases by orders of magnitude when the mini-batch size increases.

Momentum

Issue with Gradient Descent

Step size / learning rate; max local rate; local rate of change.

GD is the "steepest descent": it gets stuck in "flat" valleys.

Solution: give momentum to keep going.

Adding some Momentum to GD

Heavy Ball Method: adds "inertia" to the update.

GD with momentum (GDm): adds "momentum" to the update.

GDm and Heavy Ball Equivalence: the two updates are the same method written in different variables.
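In symbols (my notation, with γ the stepsize and β the momentum/inertia parameter; the lecture's parametrization may differ):

\text{Heavy Ball:}\quad w^{t+1} = w^t - \gamma \nabla f(w^t) + \beta\,(w^t - w^{t-1}),

\text{GDm:}\quad m^t = \beta\, m^{t-1} + \nabla f(w^t), \qquad w^{t+1} = w^t - \gamma\, m^t.

Substituting -\gamma m^{t-1} = w^t - w^{t-1} into the GDm update recovers the Heavy Ball update, which is the claimed equivalence.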

Convergence of Gradient Descent with Momentum

Theorem: stepsize and momentum parameter.

Corollary.

Polyak 1964
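The classical parameter choice behind Polyak's result, quoted from the standard literature for L-smooth, μ-strongly convex quadratics (not necessarily the exact statement on the slide):

\gamma = \frac{4}{(\sqrt{L} + \sqrt{\mu})^2}, \qquad \beta = \left(\frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}\right)^2,

which yields the accelerated rate \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}} per iteration, i.e. an iteration complexity scaling with \sqrt{L/\mu} instead of L/\mu.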

Proof sketch: GDm convergence

Use the Fundamental Theorem of Calculus. The iterates depend on the past: a difficult recurrence.

Proof: Convergence of Heavy Ball

A simple recurrence!

EXE on eigenvalues.

Adding Momentum to SGD

Stochastic Heavy Ball Method: adds "inertia" to the update.

SGD with momentum (SGDm): sampled i.i.d.

Rumelhart, Hinton & Williams, 1986, Nature
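A minimal SGDm sketch in Python (NumPy); grad_i, gamma and beta are illustrative assumptions, not the lecture's notation.

```python
import numpy as np

# SGD with momentum (SGDm): keep a momentum buffer m that accumulates
# past stochastic gradients ("inertia"), then step along it.
def sgd_momentum(w0, grad_i, n, gamma=0.01, beta=0.9, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    w, m = w0.copy(), np.zeros_like(w0)
    for _ in range(n_iters):
        i = rng.integers(n)          # index sampled i.i.d.
        m = beta * m + grad_i(i, w)  # momentum buffer
        w = w - gamma * m
    return w
```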

SGDm and Averaging

SGD with momentum (SGDm) acts like an approximate variance reduction, since the momentum term is a weighted average of past stochastic gradients.

http://fa.bianp.net/teaching/2018/COMP-652/stochastic_gradient.html

RMG, P. Richtárik, F. Bach (2018), preprint online. Stochastic quasi-gradient methods: variance reduction via Jacobian sketching.

N. Gazagnadou, RMG, J. Salmon (2019), ICML 2019. Optimal mini-batch and step sizes for SAGA.

RMG, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin and P. Richtárik (2019), ICML 2019. SGD: general analysis and improved rates.

O. Sebbouh, N. Gazagnadou, S. Jelassi, F. Bach, RMG (2019), NeurIPS 2019, preprint online. Towards closing the gap between the theory and practice of SVRG.
