stochastic gradient descent and robustness to ill … › cohen › sanservolotalks › bach.pdf–...

55
Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Sup´ erieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion, Eric Moulines - Venezia, May 2016

Upload: others

Post on 07-Jun-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Stochastic gradient descentand robustness to ill-conditioning

Francis Bach

INRIA - Ecole Normale Superieure, Paris, France

ÉCOLE NORMALE

S U P É R I E U R E

Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

Eric Moulines - Venezia, May 2016

Page 2: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

“Big data” revolution?

A new scientific context

• Data everywhere: size does not (always) matter

• Science and industry

• Size and variety

• Learning from examples

– n observations in dimension p

Page 3: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Search engines - Advertising

Page 4: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Visual object recognition

Page 5: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Bioinformatics

• Protein: Crucial elements of cell life

• Massive data: 2 millions for humans

• Complex data

Page 6: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Context

Machine learning for “big data”

• Large-scale machine learning: large p, large n

– p : dimension of each observation (input)

– n : number of observations

• Examples: computer vision, bioinformatics, advertising

– Ideal running-time complexity: O(pn)

– Going back to simple methods

– Stochastic gradient methods (Robbins and Monro, 1951)

– Mixing statistics and optimization

– Using smoothness to go beyond stochastic gradient descent

Page 7: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Context

Machine learning for “big data”

• Large-scale machine learning: large p, large n

– p : dimension of each observation (input)

– n : number of observations

• Examples: computer vision, bioinformatics, advertising

• Ideal running-time complexity: O(pn)

– Going back to simple methods

– Stochastic gradient methods (Robbins and Monro, 1951)

– Mixing statistics and optimization

– Using smoothness to go beyond stochastic gradient descent

Page 8: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Context

Machine learning for “big data”

• Large-scale machine learning: large p, large n

– p : dimension of each observation (input)

– n : number of observations

• Examples: computer vision, bioinformatics, advertising

• Ideal running-time complexity: O(pn)

• Going back to simple methods

– Stochastic gradient methods (Robbins and Monro, 1951)

– Mixing statistics and optimization

– Using smoothness to go beyond stochastic gradient descent

Page 9: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Prediction as a linear function 〈θ,Φ(x)〉 of features Φ(x) ∈ Rp

– Explicit features adapted to inputs (can be learned as well)

– Using Hilbert spaces for non-linear / non-parametric estimation

Page 10: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Prediction as a linear function 〈θ,Φ(x)〉 of features Φ(x) ∈ Rp

• (regularized) empirical risk minimization: find θ solution of

minθ∈Rp

1

n

n∑

i=1

ℓ(

yi, 〈θ,Φ(xi)〉)

+ µΩ(θ)

convex data fitting term + regularizer

Page 11: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Usual losses

• Regression: y ∈ R, prediction y = 〈θ,Φ(x)〉– quadratic loss 1

2(y − y)2 = 12(y − 〈θ,Φ(x)〉)2

Page 12: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Usual losses

• Regression: y ∈ R, prediction y = 〈θ,Φ(x)〉– quadratic loss 1

2(y − y)2 = 12(y − 〈θ,Φ(x)〉)2

• Classification : y ∈ −1, 1, prediction y = sign(〈θ,Φ(x)〉)– loss of the form ℓ(y 〈θ,Φ(x)〉)– “True” 0-1 loss: ℓ(y 〈θ,Φ(x)〉) = 1y 〈θ,Φ(x)〉<0

– Usual convex losses:

−3 −2 −1 0 1 2 3 40

1

2

3

4

5

0−1hingesquarelogistic

Page 13: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Prediction as a linear function 〈θ,Φ(x)〉 of features Φ(x) ∈ Rp

• (regularized) empirical risk minimization: find θ solution of

minθ∈Rp

1

n

n∑

i=1

ℓ(

yi, 〈θ,Φ(xi)〉)

+ µΩ(θ)

convex data fitting term + regularizer

Page 14: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Prediction as a linear function 〈θ,Φ(x)〉 of features Φ(x) ∈ Rp

• (regularized) empirical risk minimization: find θ solution of

minθ∈Rp

1

n

n∑

i=1

ℓ(

yi, 〈θ,Φ(xi)〉)

+ µΩ(θ)

convex data fitting term + regularizer

• Empirical risk: f(θ) = 1n

∑ni=1 ℓ(yi, 〈θ,Φ(xi)〉) training cost

• Expected risk: f(θ) = E(x,y)ℓ(y, 〈θ,Φ(x)〉) testing cost

• Two fundamental questions: (1) computing θ and (2) analyzing θ

Page 15: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Smoothness and strong convexity

• A function g : Rp → R is L-smooth if and only if it is twice

differentiable and

∀θ ∈ Rp, eigenvalues

[

g′′(θ)]

6 L

smooth non−smooth

Page 16: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Smoothness and strong convexity

• A function g : Rp → R is L-smooth if and only if it is twice

differentiable and

∀θ ∈ Rp, eigenvalues

[

g′′(θ)]

6 L

• Machine learning

– with g(θ) = 1n

∑ni=1 ℓ(yi, 〈θ,Φ(xi)〉)

– Hessian ≈ covariance matrix 1n

∑ni=1Φ(xi)⊗ Φ(xi)

– Bounded data: ‖Φ(x)‖ 6 R ⇒ L = O(R2)

Page 17: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Smoothness and strong convexity

• A twice differentiable function g : Rp → R is µ-strongly convex if

and only if

∀θ ∈ Rp, eigenvalues

[

g′′(θ)]

> µ

convexstronglyconvex

Page 18: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Smoothness and strong convexity

• A twice differentiable function g : Rp → R is µ-strongly convex if

and only if

∀θ ∈ Rp, eigenvalues

[

g′′(θ)]

> µ

(large µ/L) (small µ/L)

Page 19: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Smoothness and strong convexity

• A twice differentiable function g : Rp → R is µ-strongly convex if

and only if

∀θ ∈ Rp, eigenvalues

[

g′′(θ)]

> µ

• Machine learning

– with g(θ) = 1n

∑ni=1 ℓ(yi, 〈θ,Φ(xi)〉)

– Hessian ≈ covariance matrix 1n

∑ni=1Φ(xi)⊗ Φ(xi)

– Data with invertible covariance matrix (low correlation/dimension)

Page 20: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Smoothness and strong convexity

• A twice differentiable function g : Rp → R is µ-strongly convex if

and only if

∀θ ∈ Rp, eigenvalues

[

g′′(θ)]

> µ

• Machine learning

– with g(θ) = 1n

∑ni=1 ℓ(yi, 〈θ,Φ(xi)〉)

– Hessian ≈ covariance matrix 1n

∑ni=1Φ(xi)⊗ Φ(xi)

– Data with invertible covariance matrix (low correlation/dimension)

• Adding regularization by µ2‖θ‖2

– creates additional bias unless µ is small

Page 21: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on Rp

• Gradient descent: θt = θt−1 − γt g′(θt−1)

– O(1/t) convergence rate for convex functions

– O(e−(µ/L)t) convergence rate for strongly convex functions

Page 22: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on Rp

• Gradient descent: θt = θt−1 − γt g′(θt−1)

– O(1/t) convergence rate for convex functions

– O(e−(µ/L)t) convergence rate for strongly convex functions

• Newton method: θt = θt−1 − g′′(θt−1)−1g′(θt−1)

– O(

e−ρ2t)

convergence rate

Page 23: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on Rp

• Gradient descent: θt = θt−1 − γt g′(θt−1)

– O(1/t) convergence rate for convex functions

– O(e−(µ/L)t) convergence rate for strongly convex functions

• Newton method: θt = θt−1 − g′′(θt−1)−1g′(θt−1)

– O(

e−ρ2t)

convergence rate

• Key insights from Bottou and Bousquet (2008)

1. In machine learning, no need to optimize below statistical error

2. In machine learning, cost functions are averages

⇒ Stochastic approximation

Page 24: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Stochastic approximation

• Goal: Minimizing a function f defined on Rp

– given only unbiased estimates f ′n(θn) of its gradients f ′(θn) at

certain points θn ∈ Rp

Page 25: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Stochastic approximation

• Goal: Minimizing a function f defined on Rp

– given only unbiased estimates f ′n(θn) of its gradients f ′(θn) at

certain points θn ∈ Rp

• Machine learning - statistics

– f(θ) = Efn(θ) = E ℓ(yn, 〈θ,Φ(xn)〉) = generalization error

– Loss for a single pair of observations: fn(θ) = ℓ(yn, 〈θ,Φ(xn)〉)– Expected gradient:

f ′(θ) = Ef ′n(θ) = E

ℓ′(yn, 〈θ,Φ(xn)〉)Φ(xn)

• Beyond convex optimization: see, e.g., Benveniste et al. (2012)

Page 26: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Convex stochastic approximation

• Key assumption: smoothness and/or strong convexity

• Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro)

θn = θn−1 − γn f′n(θn−1)

– Polyak-Ruppert averaging: θn = 1n+1

∑nk=0 θk

– Which learning rate sequence γn? Classical setting: γn = Cn−α

Page 27: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Convex stochastic approximation

• Key assumption: smoothness and/or strong convexity

• Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro)

θn = θn−1 − γn f′n(θn−1)

– Polyak-Ruppert averaging: θn = 1n+1

∑nk=0 θk

– Which learning rate sequence γn? Classical setting: γn = Cn−α

• Running-time = O(np)

– Single pass through the data

– One line of code among many

Page 28: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Convex stochastic approximation

Existing analysis

• Known global minimax rates of convergence for non-smooth

problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)

– Strongly convex: O((µn)−1)

Attained by averaged stochastic gradient descent with γn ∝ (µn)−1

– Non-strongly convex: O(n−1/2)

Attained by averaged stochastic gradient descent with γn ∝ n−1/2

Page 29: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Convex stochastic approximation

Existing analysis

• Known global minimax rates of convergence for non-smooth

problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)

– Strongly convex: O((µn)−1)

Attained by averaged stochastic gradient descent with γn ∝ (µn)−1

– Non-strongly convex: O(n−1/2)

Attained by averaged stochastic gradient descent with γn ∝ n−1/2

• Asymptotic analysis of averaging (Polyak and Juditsky, 1992;

Ruppert, 1988)

– All step sizes γn = Cn−α with α ∈ (1/2, 1) lead to O(n−1) for

smooth strongly convex problems

A single algorithm with global adaptive convergence rate for

smooth problems?

Page 30: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Convex stochastic approximation

Existing analysis

• Known global minimax rates of convergence for non-smooth

problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)

– Strongly convex: O((µn)−1)

Attained by averaged stochastic gradient descent with γn ∝ (µn)−1

– Non-strongly convex: O(n−1/2)

Attained by averaged stochastic gradient descent with γn ∝ n−1/2

• Asymptotic analysis of averaging (Polyak and Juditsky, 1992;

Ruppert, 1988)

– All step sizes γn = Cn−α with α ∈ (1/2, 1) lead to O(n−1) for

smooth strongly convex problems

• A single algorithm for smooth problems with global convergence

rate O(1/n) in all situations?

Page 31: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Least-mean-square algorithm

• Least-squares: f(θ) = 12E

[

(yn − 〈Φ(xn), θ〉)2]

with θ ∈ Rp

– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)

– usually studied without averaging and decreasing step-sizes

– with strong convexity assumption E[

Φ(xn)⊗Φ(xn)]

= H < µ · Id

Page 32: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Least-mean-square algorithm

• Least-squares: f(θ) = 12E

[

(yn − 〈Φ(xn), θ〉)2]

with θ ∈ Rp

– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)

– usually studied without averaging and decreasing step-sizes

– with strong convexity assumption E[

Φ(xn)⊗Φ(xn)]

= H < µ · Id

• New analysis for averaging and constant step-size γ = 1/(4R2)

– Assume ‖Φ(xn)‖ 6 R and |yn − 〈Φ(xn), θ∗〉| 6 σ almost surely

– No assumption regarding lowest eigenvalues of H

– Main result: Ef(θn)− f(θ∗) 64σ2p

n+

4R2‖θ0 − θ∗‖2n

• Matches statistical lower bound (Tsybakov, 2003)

– Non-asymptotic robust version of Gyorfi and Walk (1996)

Page 33: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Markov chain interpretation of constant step sizes

• LMS recursion for fn(θ) =12

(

yn − 〈Φ(xn), θ〉)2

θn = θn−1 − γ(

〈Φ(xn), θn−1〉 − yn)

Φ(xn)

• The sequence (θn)n is a homogeneous Markov chain

– convergence to a stationary distribution πγ

– with expectation θγdef=

θπγ(dθ)

- For least-squares, θγ = θ∗

θγ

θ0

θn

Page 34: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Markov chain interpretation of constant step sizes

• LMS recursion for fn(θ) =12

(

yn − 〈Φ(xn), θ〉)2

θn = θn−1 − γ(

〈Φ(xn), θn−1〉 − yn)

Φ(xn)

• The sequence (θn)n is a homogeneous Markov chain

– convergence to a stationary distribution πγ

– with expectation θγdef=

θπγ(dθ)

• For least-squares, θγ = θ∗

θ∗

θ0

θn

Page 35: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Markov chain interpretation of constant step sizes

• LMS recursion for fn(θ) =12

(

yn − 〈Φ(xn), θ〉)2

θn = θn−1 − γ(

〈Φ(xn), θn−1〉 − yn)

Φ(xn)

• The sequence (θn)n is a homogeneous Markov chain

– convergence to a stationary distribution πγ

– with expectation θγdef=

θπγ(dθ)

• For least-squares, θγ = θ∗

θ∗

θ0

θn

θn

Page 36: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Markov chain interpretation of constant step sizes

• LMS recursion for fn(θ) =12

(

yn − 〈Φ(xn), θ〉)2

θn = θn−1 − γ(

〈Φ(xn), θn−1〉 − yn)

Φ(xn)

• The sequence (θn)n is a homogeneous Markov chain

– convergence to a stationary distribution πγ

– with expectation θγdef=

θπγ(dθ)

• For least-squares, θγ = θ∗

– θn does not converge to θ∗ but oscillates around it

– oscillations of order√γ

• Ergodic theorem:

– Averaged iterates converge to θγ = θ∗ at rate O(1/n)

Page 37: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Simulations - synthetic examples

• Gaussian distributions - p = 20

0 2 4 6−5

−4

−3

−2

−1

0

log10

(n)

log 10

[f(θ)

−f(

θ *)]

synthetic square

1/2R2

1/8R2

1/32R2

1/2R2n1/2

Page 38: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Simulations - benchmarks

• alpha (p = 500, n = 500 000), news (p = 1 300 000, n = 20 000)

0 2 4 6

−2

−1.5

−1

−0.5

0

0.5

1

log10

(n)

log 10

[f(θ)

−f(

θ *)]

alpha square C=1 test

1/R2

1/R2n1/2

SAG

0 2 4 6

−2

−1.5

−1

−0.5

0

0.5

1

log10

(n)

alpha square C=opt test

C/R2

C/R2n1/2

SAG

0 2 4

−0.8

−0.6

−0.4

−0.2

0

0.2

log10

(n)

log 10

[f(θ)

−f(

θ *)]

news square C=1 test

1/R2

1/R2n1/2

SAG

0 2 4

−0.8

−0.6

−0.4

−0.2

0

0.2

log10

(n)

news square C=opt test

C/R2

C/R2n1/2

SAG

Page 39: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Isn’t least-squares regression a “regression”?

Page 40: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Isn’t least-squares regression a “regression”?

• Least-squares regression

– Simpler to analyze and understand

– Explicit relationship to bias/variance trade-offs (next slides)

• Many important loss functions are not quadratic

– Beyond least-squares with online Newton steps

– Complexity of O(p) per iteration with rate O(p/n)

– See Bach and Moulines (2013) for details

Page 41: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Optimal bounds for least-squares?

• Least-squares: cannot beat σ2p/n (Tsybakov, 2003). Really?

Page 42: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Optimal bounds for least-squares?

• Least-squares: cannot beat σ2p/n (Tsybakov, 2003). Really?

• Refined analysis (Defossez and Bach, 2015)

Ef(θn)− f(θ∗) 6σ2p

n+

‖H−1/2(θ0 − θ∗)‖22γ2n2

– In practice: bias may be larger than variance, σ2p/n pessimistic

Page 43: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Optimal bounds for least-squares?

• Least-squares: cannot beat σ2p/n (Tsybakov, 2003). Really?

• Refined analysis (Defossez and Bach, 2015)

Ef(θn)− f(θ∗) 6σ2p

n+

‖H−1/2(θ0 − θ∗)‖22γ2n2

– In practice: bias may be larger than variance, σ2p/n pessimistic

• Refined assumptions with adaptivity (Dieuleveut and Bach, 2014)

Ef(θn)− f(θ∗) 6σ2γ1/α trH1/α

n1−1/α+

‖H1/2−r(θ0 − θ∗)‖22γ2rn2minr,1

– SGD is adaptive to the covariance matrix eigenvalue decay

– Leads to optimal rates for non-parametric regression

Page 44: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Achieving optimal bias and variance terms

• Current results with averaged SGD (ill-conditioned problems)

– Variance (starting from optimal θ∗) =σ2p

n

– Bias (no noise)= minR2‖θ0 − θ∗‖2

n,R4〈θ0 − θ∗,H

−1(θ0 − θ∗)〉n2

Page 45: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Achieving optimal bias and variance terms

• Current results with averaged SGD (ill-conditioned problems)

– Variance (starting from optimal θ∗) =σ2p

n

– Bias (no noise)= minR2‖θ0 − θ∗‖2

n,R4〈θ0 − θ∗,H

−1(θ0 − θ∗)〉n2

Page 46: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Achieving optimal bias and variance terms

• Current results with averaged SGD (ill-conditioned problems)

– Variance (starting from optimal θ∗) =σ2p

n

– Bias (no noise)= minR2‖θ0 − θ∗‖2

n,R4〈θ0 − θ∗,H

−1(θ0 − θ∗)〉n2

Bias Variance

Averaged gradient descent

(Bach and Moulines, 2013)R2‖θ0 − θ∗‖2

n

σ2p

n

Page 47: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Achieving optimal bias and variance terms

Bias Variance

Averaged gradient descent

(Bach and Moulines, 2013)R2‖θ0 − θ∗‖2

n

σ2p

n

Page 48: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Achieving optimal bias and variance terms

Bias Variance

Averaged gradient descent

(Bach and Moulines, 2013)R2‖θ0 − θ∗‖2

n

σ2p

n

Accelerated gradient descent

(Nesterov, 1983)R2‖θ0 − θ∗‖2

n2σ2p

• Acceleration is notoriously non-robust to noise (d’Aspremont,

2008; Schmidt et al., 2011)

– For non-structured noise, see Lan (2012)

Page 49: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Achieving optimal bias and variance terms

Bias Variance

Averaged gradient descent

(Bach and Moulines, 2013)R2‖θ0 − θ∗‖2

n

σ2p

n

Accelerated gradient descent

(Nesterov, 1983)R2‖θ0 − θ∗‖2

n2σ2p

“Between” averaging and acceleration

(Flammarion and Bach, 2015)R2‖θ0 − θ∗‖2

n1+α

σ2p

n1−α

Page 50: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Achieving optimal bias and variance terms

Bias Variance

Averaged gradient descent

(Bach and Moulines, 2013)R2‖θ0 − θ∗‖2

n

σ2p

n

Accelerated gradient descent

(Nesterov, 1983)R2‖θ0 − θ∗‖2

n2σ2p

“Between” averaging and acceleration

(Flammarion and Bach, 2015)R2‖θ0 − θ∗‖2

n1+α

σ2p

n1−α

Averaging and acceleration

(Dieuleveut, Flammarion, and Bach, 2016)R2‖θ0 − θ∗‖2

n2

σ2p

n

Page 51: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Conclusions

• Constant-step-size averaged stochastic gradient descent

– Reaches convergence rate O(1/n) in all regimes

– Improves on the O(1/√n) lower-bound of non-smooth problems

– Efficient online Newton step for non-quadratic problems

– Robustness to step-size selection and adaptivity

Page 52: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Conclusions

• Constant-step-size averaged stochastic gradient descent

– Reaches convergence rate O(1/n) in all regimes

– Improves on the O(1/√n) lower-bound of non-smooth problems

– Efficient online Newton step for non-quadratic problems

– Robustness to step-size selection and adaptivity

• Extensions and future work

– Going beyond a single pass (Le Roux, Schmidt, and Bach, 2012;

Defazio, Bach, and Lacoste-Julien, 2014)

– Proximal extensions fo non-differentiable terms

– Kernels and nonparametric estimation (Dieuleveut and Bach, 2014)

– Parallelization

– Non-convex problems

Page 53: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

References

A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds

on the oracle complexity of stochastic convex optimization. Information Theory, IEEE Transactions

on, 58(5):3235–3249, 2012.

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence

rate o(1/n). Technical Report 00831977, HAL, 2013.

Albert Benveniste, Michel Metivier, and Pierre Priouret. Adaptive algorithms and stochastic

approximations. Springer Publishing Company, Incorporated, 2012.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

A. d’Aspremont. Smooth optimization with approximate gradient. SIAM J. Optim., 19(3):1171–1183,

2008.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method

with support for non-strongly convex composite objectives. In Advances in Neural Information

Processing Systems, pages 1646–1654, 2014.

A. Defossez and F. Bach. Constant step size least-mean-square: Bias-variance trade-offs and optimal

sampling distributions. 2015.

A. Dieuleveut and F. Bach. Non-parametric Stochastic Approximation with Large Step sizes. Technical

report, ArXiv, 2014.

A. Dieuleveut, N. Flammarion, and F. Bach. Harder, better, faster, stronger convergence rates for

least-squares regression. Technical Report 1602.05419, arXiv, 2016.

Page 54: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

N. Flammarion and F. Bach. From averaging to acceleration, there is only a step-size. arXiv preprint

arXiv:1504.01577, 2015.

L. Gyorfi and H. Walk. On the averaged stochastic approximation for linear regression. SIAM Journal

on Control and Optimization, 34(1):31–61, 1996.

G. Lan. An optimal method for stochastic composite optimization. Math. Program., 133(1-2, Ser. A):

365–397, 2012.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets. In Adv. NIPS, 2012.

O. Macchi. Adaptive processing: The least mean squares approach with applications in transmission.

Wiley West Sussex, 1995.

A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. Wiley

& Sons, 1983.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k2).

Soviet Math. Doklady, 269(3):543–547, 1983.

B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal

on Control and Optimization, 30(4):838–855, 1992.

H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22:400–407,

1951. ISSN 0003-4851.

D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report

781, Cornell University Operations Research and Industrial Engineering, 1988.

M. Schmidt, N. Le Roux, and F. Bach. Convergence rates for inexact proximal-gradient method. In

Page 55: Stochastic gradient descent and robustness to ill … › cohen › Sanservolotalks › bach.pdf– Mixing statistics and optimization – Using smoothness to go beyond stochastic

Adv. NIPS, 2011.

A. B. Tsybakov. Optimal rates of aggregation. 2003.