TRANSCRIPT
Perspectives on Stochastic Gradient Descent for Machine Learning problems
Loucas Pillaud-Vivien
Cermics seminar, October 3, 2019
Outline
1 Supervised learning
  Machine learning: general context
  Mathematical framework
2 Stochastic Gradient Descent
  General results
  SGD for least squares in finite dimension
  SGD for least squares: classification problem
3 SGD in RKHS: non-parametric rates
  SGD in RKHS
  Multiple passes over the data
Examples of tasks
Goal: explain/predict a phenomenon given observations.
Two examples:
- Bio-informatics ($n \sim 10^3$, $d \sim 10^6$). Input: DNA sequence. Output: disease prediction.
- Vision ($n \sim 10^9$, $d \sim 10^6$). Input: digit image. Output: digit prediction.
Large-scale machine learning: large dimensionality $d$ and large number of samples $n$.
Supervised learning: mathematical framework
Input/output pair $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ distributed according to $\rho$.
$\rho$ is unknown; we only have access to $n$ i.i.d. samples. $\mathcal{Y} = \mathbb{R}$ for regression and $\mathcal{Y} = \{-1, 1\}$ for classification.
Goal: find a prediction function $g : \mathcal{X} \to \mathcal{Y}$ such that $g(X) \approx Y$.
Measure of the error through the risk, or generalization error:
$$\mathcal{R}(g) = \mathbb{E}_{(X,Y) \sim \rho}\,[\ell(Y, g(X))].$$
Supervised learning: parametrization
$$\mathcal{R}(g) = \mathbb{E}_{(X,Y) \sim \rho}\,[\ell(Y, g(X))]$$
Parametric case: $g$ belongs to a functional space $\mathcal{H}$ parametrized by some parameter $\theta \in \mathbb{R}^d$: $g(X) = g_\theta(X)$.
- Linear prediction: $g_\theta(x) = \langle \theta, \phi(x)\rangle$, with features $\phi(x) \in \mathbb{R}^d$.
- Neural networks: $g_\theta(x) = \theta_m^\top \sigma(\theta_{m-1}^\top \sigma(\cdots \sigma(\theta_1^\top x)))$.
Non-parametric case: prediction as a function $g \in \mathcal{H}$, for $\mathcal{H}$ an infinite-dimensional space.
Supervised learning cast as optimization
Machine Learning can be cast as an optimization problem:
Find $\inf_{g \in \mathcal{H}} \mathcal{R}(g) = \mathbb{E}_{(X,Y) \sim \rho}\,[\ell(Y, g(X))]$.
Questions:
- How to find the optimum, with i.i.d. samples only?
- How to compute it efficiently?
[Figure: the 0-1 loss and the square, hinge, and logistic losses as functions of the margin.]
Where optimization meets statistics

Data: $n$ i.i.d. observations $(x_i, y_i)_i \in \mathcal{X} \times \mathcal{Y}$.
Replace the unknown test error $\mathbb{E}_\rho\,(Y - g(X))^2$ by the computable training error (empirical risk):
$$\hat{\mathcal{R}}(g) := \frac{1}{n}\sum_{i=1}^n (y_i - g(x_i))^2.$$
If $\mathcal{H}$ is too rich −→ problem: overfitting.
General approach: regularize,
$$g_n := \operatorname*{argmin}_{g \in \mathcal{H}}\; \hat{\mathcal{R}}(g) + \lambda\,\Omega(g) \qquad \text{(data-fitting term + regularizer)}.$$
Tradeoffs in Machine Learning: example of linear regression

Linear regression in $\mathbb{R}^d$: $g_\theta(x) = \langle \theta, x\rangle$.
Let $f_i(\theta) := (y_i - \langle \theta, x_i\rangle)^2 + \lambda \|\theta\|_2^2$; then
$$\theta_n := \operatorname*{argmin}_{\theta \in \mathbb{R}^d}\; \frac{1}{n}\sum_{i=1}^n f_i(\theta).$$
Comparison of three methods:
- Explicit solution: with $X = (x_1, \ldots, x_n)^\top$ and $Y = (y_1, \ldots, y_n)^\top$, $\theta_n = X^\top (XX^\top + \lambda n I)^{-1} Y$ (equivalently $(X^\top X + \lambda n I)^{-1} X^\top Y$).
  Stability problems with the inversion, and cost $O(n^2 d + n^3)$.
- (Full) gradient descent: $\theta_{k+1} = \theta_k - \frac{\gamma}{n}\sum_{i=1}^n \nabla f_i(\theta_k)$.
  Cost of one iteration: $O(nd)$.
- Stochastic gradient descent on the ERM: $\theta_{k+1} = \theta_k - \gamma\, \nabla f_{i_k}(\theta_k)$, where $i_k$ is picked uniformly at random in $\{1, \ldots, n\}$.
  Cost of one iteration: $O(d)$.

Two important insights for ML [Bottou and Bousquet, 2008]:
1 No need to optimize below the statistical error.
2 The true risk is more important than the empirical risk.
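As an illustration (not part of the slides), here is a minimal numpy sketch of the three methods on synthetic data; the sample size, step sizes, and regularization level are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 10, 1e-2
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))                      # rows are the x_i
y = X @ theta_star + 0.1 * rng.normal(size=n)

# Explicit ridge solution via one linear solve (d x d primal form;
# the n x n dual form on the slide gives the same estimator).
theta_exp = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

# Full gradient descent: each step touches all n samples, O(nd) per iteration.
theta_gd = np.zeros(d)
for _ in range(500):
    grad = 2 * X.T @ (X @ theta_gd - y) / n + 2 * lam * theta_gd
    theta_gd -= 0.1 * grad

# SGD on the empirical risk: one random sample per step, O(d) per iteration.
theta_sgd = np.zeros(d)
for k in range(20 * n):
    i = rng.integers(n)                          # i_k uniform in {1, ..., n}
    grad_i = 2 * (X[i] @ theta_sgd - y[i]) * X[i] + 2 * lam * theta_sgd
    theta_sgd -= 0.02 * grad_i

# Both iterative methods end up close to the explicit solution.
print(np.linalg.norm(theta_gd - theta_exp), np.linalg.norm(theta_sgd - theta_exp))
```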
SGD: general setting
Goal: $\min_{\theta \in \mathbb{R}^d} f(\theta)$, given unbiased estimates $(\nabla f_i)_{i \le n}$ of the true gradient; let $\theta_* := \operatorname*{argmin}_{\theta \in \mathbb{R}^d} f(\theta)$.
Key algorithm: Stochastic Gradient Descent (SGD) [Robbins and Monro, 1951]:
$$\theta_t = \theta_{t-1} - \gamma_t\, \nabla f_t(\theta_{t-1}), \qquad \mathbb{E}\,[\nabla f_t(\theta_{t-1}) \mid \mathcal{F}_{t-1}] = \nabla f(\theta_{t-1}),$$
where $\theta_t$ is $\mathcal{F}_t$-measurable.
For a constant step size $\gamma_t = \gamma$, $(\theta_t)_t$ is a homogeneous Markov chain: its law converges to a stationary distribution $\pi_\gamma$ of radius $O(\gamma^{1/2})$ around $\theta_*$ [Dieuleveut, Durmus, Bach, 2017].
To make it converge, either use a decaying step size or average the iterates:
$$\bar\theta_t = \frac{1}{t}\sum_{i=1}^{t} \theta_i.$$
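A toy illustration of this dichotomy (not from the slides): on a one-dimensional quadratic, the constant-step iterate keeps fluctuating around the optimum while the running Polyak-Ruppert average settles much closer. The objective, noise model, and step size below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star, gamma, T = 3.0, 0.1, 20000

# f(theta) = 0.5 * (theta - theta_star)^2; stochastic gradients are the true
# gradient plus centred noise, hence unbiased.
theta, theta_bar = 0.0, 0.0
for t in range(1, T + 1):
    g = (theta - theta_star) + rng.normal()   # unbiased estimate of f'(theta)
    theta -= gamma * g                        # constant-step SGD iterate
    theta_bar += (theta - theta_bar) / t      # running Polyak-Ruppert average

print(abs(theta - theta_star))      # fluctuates at scale ~ sqrt(gamma) around theta_star
print(abs(theta_bar - theta_star))  # the averaged iterate is much closer
```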
SGD for Machine Learning
Data: $n$ i.i.d. observations $(x_i, y_i)_{i \le n} \in \mathcal{X} \times \mathcal{Y}$.
Loss for a single pair of observations, for $t \le n$:
$$f_t(\theta) = \ell(y_t, \langle \theta, x_t\rangle).$$
SGD for the true risk: $\mathcal{R}(\theta) = \mathbb{E}_\rho\,[\ell(Y, \langle \theta, X\rangle)]$.
Filtration adapted to the problem: $\mathcal{F}_t = \sigma\big((x_i, y_i)_{i \le t}\big)$.
Unbiased estimates of the gradient for each observation:
$$\nabla_\theta f_t(\theta) = \nabla_\theta\, \ell(y_t, \langle \theta, x_t\rangle), \qquad \mathbb{E}\,[\nabla_\theta f_t(\theta_{t-1}) \mid \mathcal{F}_{t-1}] = \nabla \mathcal{R}(\theta_{t-1}).$$
Single pass through the data – “Automatic” regularization.
SGD for least squares in finite dimension

Problem setting:
Data: $n$ i.i.d. observations $(x_i, y_i) \in \mathcal{X} \times \mathbb{R}$ with distribution $\rho$.
Least squares: find the optimal predictor $\theta_*$ minimizing
$$\mathcal{R}(\theta) = \mathbb{E}_\rho\,(y - \langle \theta, \phi(x)\rangle)^2.$$
Prediction: linear functions of feature vectors in $\mathbb{R}^d$: $g(x) = \langle \theta, \phi(x)\rangle$, with $\theta, \phi(x) \in \mathbb{R}^d$.
Assumptions: $\|\phi(x)\| \le R$, $|y| \le M$, $|y - \langle \theta_*, \phi(x)\rangle| \le \sigma$.
Statistical performance of $\theta_n$: defined as $\mathbb{E}\,\mathcal{R}(\theta_n) - \mathcal{R}(\theta_*)$.

In finite dimension, single-pass SGD achieves the optimal rate
$$\frac{4\sigma^2 \dim \mathcal{H}}{n} + \frac{4 R^2 \|\theta_*\|^2}{n} \qquad \text{[Bach, Moulines, 2013].}$$
Stochastic Gradient Descent
Regression problems: best convergence rates O(1/n).
Can it be faster for classification problems?
Binary classification: problem setting

Data: $(x, y) \in \mathcal{X} \times \{-1, 1\}$ distributed according to $\rho$.
Prediction: $y = \operatorname{sign} g(x)$, with $g(x) = \langle g, \phi(x)\rangle_{\mathcal{H}}$.
Aim: minimize over $g \in \mathcal{H}$ the error
$$\mathcal{R}_{01}(g) = \mathbb{E}\,\ell_{01}(y, g(x)) = \mathbb{E}\,\mathbf{1}_{y\,g(x) < 0}.$$

From error to losses. As $\ell_{01}$ is non-convex, we use convex surrogates, here the square loss:
[Figure: the 0-1 loss and the square, hinge, and logistic surrogates as functions of the margin $y\,g(x)$.]

Square loss: $\mathcal{R}(g) = \mathbb{E}\,\ell(y, g(x)) = \mathbb{E}\,(y - g(x))^2$, minimized by $g_*(x) = \mathbb{E}(y \mid x)$. Good news: $g_*^{01}(x) = \operatorname{sign}\mathbb{E}(y \mid x)$.
Ridge regression: $\mathcal{R}_\lambda(g) = \mathbb{E}\,(y - g(x))^2 + \lambda \|g\|^2_{\mathcal{H}}$, minimized by $g_\lambda$.

Excess error and loss [Bartlett et al., 2006]:
$$\underbrace{\mathbb{E}\,\ell_{01}(y, g(x)) - \ell_{01}^*}_{\text{excess error}} \;\le\; \sqrt{\underbrace{\mathbb{E}\,(y - g(x))^2 - \ell^*}_{\text{excess loss}}}.$$
If we plug in existing results for SGD: $\mathbb{E}\,\ell_{01}(y, g(x)) - \ell_{01}^* \le \frac{1}{\sqrt{\lambda n}}$ −→ not exponential.
Main assumptions

Margin condition (Mammen and Tsybakov, 1999)
Hard inputs to predict: $\mathbb{P}(y = 1 \mid x) = 1/2$, i.e., $\mathbb{E}(y \mid x) = 0$.
Easy inputs to predict: $\mathbb{P}(y = 1 \mid x) \in \{0, 1\}$, i.e., $|\mathbb{E}(y \mid x)| = 1$.

(A1) Margin condition: $\exists\, \delta > 0$ such that $|\mathbb{E}(y \mid x)| \ge \delta$ for all $x \in \operatorname{supp}(\rho_X)$.
(A2) Technical condition: $\exists\, \lambda > 0$ such that $\operatorname{sign}(\mathbb{E}(y \mid x))\, g_\lambda(x) \ge \delta/2$ for all $x \in \operatorname{supp}(\rho_X)$.

Consequence: if $\|g_\lambda - g\|_{L^\infty} < \delta/2$, then $\operatorname{sign} g(x) = \operatorname{sign}(\mathbb{E}(y \mid x))$.
Main result
Single-pass SGD through the data on the regularized problem:
$$g_n = g_{n-1} - \gamma_n \left[ (\langle \phi(x_n), g_{n-1}\rangle - y_n)\,\phi(x_n) + \lambda (g_{n-1} - g_0) \right].$$
Take the tail-averaged estimator $g_n^{\text{tail}} = \frac{1}{n/2} \sum_{i=n/2}^{n} g_i$ [Jain et al., 2016].

Theorem (P., Rudi, Bach, 2018)
Assume (A1), (A2), $n \ge \frac{1}{\lambda\gamma}\log\frac{R}{\delta}$ and $\gamma \le 1/(4R^2)$; then
$$\mathbb{E}_{x_1 \ldots x_n}\, \mathbb{E}\,\ell_{01}\big(y, g_n^{\text{tail}}(x)\big) - \ell_{01}^* \;\lesssim\; 4\exp\!\big(-\lambda^2 \delta^2 n / R^2\big).$$
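The following sketch (not from the slides) runs this regularized recursion and forms the tail-averaged estimator, in a plain finite-dimensional feature space rather than an RKHS; the toy data is built so that labels are deterministic and bounded away from the decision boundary, in the spirit of the margin condition, and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, gamma, lam = 5, 5000, 0.05, 1e-2

# Toy data with a hard margin: the label is the sign of the first feature,
# and that feature is bounded away from 0, so |E(y|x)| = 1.
X = rng.uniform(-1.0, 1.0, size=(n, d))
X[:, 0] += 0.25 * np.sign(X[:, 0])              # push the first coordinate away from 0
y = np.sign(X[:, 0])

g = np.zeros(d)                                  # g_0 = 0
iterates = []
for t in range(n):
    # Regularized SGD recursion:
    # g_n = g_{n-1} - gamma [ (<phi(x_n), g_{n-1}> - y_n) phi(x_n) + lam (g_{n-1} - g_0) ]
    g = g - gamma * ((X[t] @ g - y[t]) * X[t] + lam * g)
    iterates.append(g.copy())

g_tail = np.mean(iterates[n // 2:], axis=0)      # tail average over the last n/2 iterates

err_01 = np.mean(np.sign(X @ g_tail) != y)       # 0-1 error (on held-in points, for illustration)
print(err_01)                                    # often exactly 0 once n is moderately large
```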
Synthetic experiments
Comparing test/train losses and errors for tail-averaged SGD ($\mathcal{X} = [0, 1]$, $\mathcal{H}$ a Sobolev space).
Conclusion
Take-home message:
- Exponential convergence of the test error, not of the test loss
- Importance of the margin condition
Non-parametric random-design least-squares regression

Goal:
$$\min_{g}\; \mathcal{R}(g) = \mathbb{E}_\rho\,(Y - g(X))^2$$
$\rho_X$ is the marginal distribution of $X$ on $\mathcal{X}$; $L^2_{\rho_X}$ is the set of square-integrable functions w.r.t. $\rho_X$.
The Bayes predictor minimizes the quadratic risk over $L^2_{\rho_X}$:
$$g_\rho(X) = \mathbb{E}\,[Y \mid X].$$
Moreover, for any function $g$ in $L^2_{\rho_X}$, the excess risk is
$$\mathcal{R}(g) - \mathcal{R}(g_\rho) = \|g - g_\rho\|^2_{L^2_{\rho_X}}.$$
$\mathcal{H}$ a space of functions: there exists $g_{\mathcal{H}} \in \overline{\mathcal{H}}^{L^2_{\rho_X}}$ such that $\mathcal{R}(g_{\mathcal{H}}) = \inf_{g \in \mathcal{H}} \mathcal{R}(g)$.
Reproducing Kernel Hilbert Space
A Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ is a space of functions from $\mathcal{X}$ into $\mathbb{R}$ such that there exists a reproducing kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfying:
- For any $x \in \mathcal{X}$, $\mathcal{H}$ contains the function $K_x$ defined by $K_x : z \mapsto K(x, z)$.
- For any $x \in \mathcal{X}$ and $f \in \mathcal{H}$, the reproducing property holds: $\langle K_x, f\rangle_{\mathcal{H}} = f(x)$.
Why are RKHS so nice?
Computation:
- Linear spaces of functions.
- Existence of gradients (Hilbert space).
- Only deal with functions in $\operatorname{span}\{K_{x_i},\ i = 1, \ldots, n\}$ (representer theorem).
−→ the algebraic framework is preserved!
Approximation: many kernels satisfy $\overline{\mathcal{H}}^{L^2_{\rho_X}} = L^2_{\rho_X}$: there is no approximation error!
Representation: the feature map $\mathcal{X} \to \mathcal{H}$, $x \mapsto K_x$, maps points from any set into a linear space, where a linear method can be applied.
Stochastic approximation in the RKHS

As $\mathcal{R}(g) = \mathbb{E}\big[(\langle g, K_X\rangle_{\mathcal{H}} - Y)^2\big]$, for each pair of observations,
$$(\langle g, K_{x_n}\rangle_{\mathcal{H}} - y_n)\, K_{x_n} = (g(x_n) - y_n)\, K_{x_n}$$
is an unbiased stochastic gradient of $\mathcal{R}$ at $g$.
Stochastic gradient recursion, starting from $g_0 \in \mathcal{H}$:
$$g_n = g_{n-1} - \gamma\, [\langle g_{n-1}, K_{x_n}\rangle_{\mathcal{H}} - y_n]\, K_{x_n},$$
where $\gamma$ is the step size. Thus
$$g_n = \sum_{i=1}^{n} a_i K_{x_i}, \qquad \text{with } a_n = -\gamma\,(g_{n-1}(x_n) - y_n) \text{ for } n \ge 1.$$
With averaging,
$$\bar g_n = \frac{1}{n+1}\sum_{k=0}^{n} g_k.$$
Total complexity: $O(n^2)$.
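As an illustration (not from the slides), here is a minimal numpy sketch of this recursion, keeping the coefficients $(a_i)$ and a running average of the iterates; the min kernel, the target $\sin(2\pi x)$, and the step size are arbitrary illustrative choices. Each iteration costs $O(n)$ kernel evaluations, giving the $O(n^2)$ total complexity.

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(u, v):
    # Min kernel on [0, 1] (first-order Sobolev kernel), an illustrative choice.
    return np.minimum(u, v)

n, gamma = 500, 0.25
x = rng.uniform(0.0, 1.0, size=n)                    # stream of i.i.d. inputs
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

a = np.zeros(n)        # coefficients of g_n = sum_i a_i K_{x_i}
a_bar = np.zeros(n)    # coefficients of the running average of the iterates

for t in range(n):
    # g_{t-1}(x_t) = sum_{i<t} a_i K(x_i, x_t): O(t) kernel evaluations.
    pred = a[:t] @ kernel(x[:t], x[t])
    a[t] = -gamma * (pred - y[t])                    # new coefficient from the recursion
    a_bar = a_bar * t / (t + 1) + a / (t + 1)        # running average (ignoring g_0 = 0)

# Rough check of the fit of the averaged predictor on a few test points.
x_test = np.linspace(0.0, 1.0, 5)
g_bar = np.array([a_bar @ kernel(x, xt) for xt in x_test])
print(np.c_[x_test, g_bar, np.sin(2 * np.pi * x_test)])
```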
Kernel regression: Analysis
Assume $\mathbb{E}\,[K(X, X)]$ and $\mathbb{E}\,[Y^2]$ are finite. Define the covariance operator
$$\Sigma = \mathbb{E}\big[K_X K_X^\top\big], \qquad \langle f, \Sigma g\rangle = \mathbb{E}_{\rho_X}\big(f(X)\, g(X)\big).$$
Parametrization of the problem:
- Capacity condition: eigenvalue decay of $\Sigma$.
- Source condition: position of $g_{\mathcal{H}}$ w.r.t. the kernel space $\mathcal{H}$.
$\Sigma$ is a trace-class operator that can be decomposed over its eigenspaces; its powers $\Sigma^\tau$, $\tau > 0$, are thus well defined.
Capacity condition (CC)

CC($\alpha$): for some $\alpha \ge 1$, we assume that $\operatorname{tr}(\Sigma^{1/\alpha}) < \infty$.
If we denote $(\mu_i)_{i \in I}$ the sequence of non-zero eigenvalues of the operator $\Sigma$, in decreasing order, then $\mu_i = O(i^{-\alpha})$.

[Figure: eigenvalue decay of the covariance operator, $\log_{10}(\mu_i)$ vs $\log_{10}(i)$, for the first-order Sobolev (min) kernel and the Gaussian kernel.]
Left: min kernel, $\rho_X = \mathcal{U}[0, 1]$ −→ CC($\alpha = 2$).
Right: Gaussian kernel, $\rho_X = \mathcal{U}[-1, 1]$ −→ CC($\alpha$) for all $\alpha \ge 1$.
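A quick numerical check of these two examples (not from the slides): approximate the eigenvalues of $\Sigma$ by those of the normalized kernel matrix $K/n$ and fit the decay on a log-log scale. The sample size, bandwidth, and fitting range are arbitrary choices; the fitted slope for the min kernel should come out close to $-2$, i.e., CC($\alpha = 2$), while the Gaussian kernel decays much faster.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

x_min = rng.uniform(0.0, 1.0, n)       # rho_X = U[0, 1] for the min kernel
x_gauss = rng.uniform(-1.0, 1.0, n)    # rho_X = U[-1, 1] for the Gaussian kernel

K_min = np.minimum.outer(x_min, x_min)                          # min (Sobolev) kernel
K_gauss = np.exp(-(x_gauss[:, None] - x_gauss[None, :]) ** 2)   # Gaussian kernel, bandwidth 1

for name, K in [("min", K_min), ("Gaussian", K_gauss)]:
    # The eigenvalues of K/n approximate the eigenvalues of the covariance operator Sigma.
    mu = np.sort(np.linalg.eigvalsh(K / n))[::-1]
    i = np.arange(5, 30)
    slope = np.polyfit(np.log10(i + 1.0), np.log10(np.maximum(mu[i], 1e-15)), 1)[0]
    print(f"{name} kernel: log-log slope of the eigenvalue decay ~ {slope:.2f}")
```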
Source condition (SC)

Concerning the optimal function $g_{\mathcal{H}}$, we assume:
SC($r$): for some $r > 0$, $g_{\mathcal{H}} \in \Sigma^r\big(L^2_{\rho_X}\big)$, i.e., $\|\Sigma^{-r} g_{\mathcal{H}}\|_{L^2_{\rho_X}} < \infty$.

[Figure: nested spaces $L^2_{\rho_X} \supset \Sigma^r(L^2_{\rho_X})$ and $\mathcal{H} = \Sigma^{1/2}(L^2_{\rho_X})$, showing where $g_{\mathcal{H}}$ lies in the three regimes $r < 0.5$, $r = 0.5$, and $r > 0.5$.]
Non-parametric stochastic approximation (NPSA) with large step sizes

Theorem (Dieuleveut, Bach, 2016)
Assume CC($\alpha$) and SC($r$). Then for any $\gamma \le \frac{1}{4R^2}$,
$$\mathbb{E}\,\mathcal{R}(\bar g_n) - \mathcal{R}(g_{\mathcal{H}}) \;\le\; \frac{4\sigma^2 \gamma^{1/\alpha}\operatorname{tr}(\Sigma^{1/\alpha})}{n^{1-1/\alpha}} + \frac{4\,\|\Sigma^{-r}(g_{\mathcal{H}} - g_0)\|^2_{L^2_{\rho_X}}}{\gamma^{2r}\, n^{\min(2r,\,2)}}.$$
For $\gamma = \gamma_0\, n^{-\frac{2\alpha r + 1 - \alpha}{2\alpha r + 1}}$ and $\frac{\alpha - 1}{2\alpha} \le r \le 1$,
$$\mathbb{E}\,\mathcal{R}(\bar g_n) - \mathcal{R}(g_{\mathcal{H}}) \;\le\; n^{-\frac{2\alpha r}{2\alpha r + 1}} \Big( 4\sigma^2 \operatorname{tr}(\Sigma^{1/\alpha}) + 4\,\|\Sigma^{-r}(g_{\mathcal{H}} - g_0)\|^2_{L^2_{\rho_X}} \Big).$$

Statistically optimal rate [Caponnetto and De Vito, 2007]. Beyond: online setting, minimal assumptions...
Result for multipass SGD

$$\mathbb{E}\,\mathcal{R}(\bar g_n) - \mathcal{R}(g_{\mathcal{H}}) \;\le\; \frac{4\sigma^2 \gamma^{1/\alpha}\operatorname{tr}(\Sigma^{1/\alpha})}{n^{1-1/\alpha}} + \frac{4\,\|\Sigma^{-r}(g_{\mathcal{H}} - g_0)\|^2_{L^2_{\rho_X}}}{\gamma^{2r}\, n^{\min(2r,\,2)}}.$$

This single-pass result is optimal only in the easy cases $r \ge \frac{\alpha - 1}{2\alpha}$, because the bias term saturates.
Idea: take multiple passes over the data to make the bias decrease. Aims:
- Get optimality in the hard cases with multiple passes.
- Bridge the gap between theory and practice.

[Figure: the $(\alpha, r)$ plane, with the level $r = 1/2$ marked, split into easy problems ($r \ge \frac{\alpha - 1}{2\alpha}$) and hard problems below this curve.]
Sampling with replacement from $n$ i.i.d. observations $(x_i, y_i)$:
$$g_t = g_{t-1} - \gamma_t\, (\langle K_{x_{i(t)}}, g_{t-1}\rangle - y_{i(t)})\, K_{x_{i(t)}},$$
where each $i(u)$ is uniform over $\{1, \ldots, n\}$. Averaged estimator over $t \ge n$ iterations: $\bar g_t = \frac{1}{t}\sum_{i=1}^{t} g_i$.

Theorem (Convergence of multiple-pass SGD for hard problems; P., Rudi, Bach, 2018)
Let $n \in \mathbb{N}^*$, $t \ge n$ and $\gamma = 1/(4R^2)$; here $\mu$ is an additional regularity parameter of the problem.
- For $\mu\alpha < 2r\alpha + 1 < \alpha$, after $t = \Theta\big(n^{\alpha/(2r\alpha+1)}\big)$ iterations:
$$\mathbb{E}\,\mathcal{R}(\bar g_t) - \mathcal{R}(g_*) = O\big(n^{-2r\alpha/(2r\alpha+1)}\big) \quad \text{(optimal)}$$
- For $\mu\alpha > 2r\alpha + 1$, after $t = \Theta\big(n^{1/\mu}\,(\log n)^{1/\mu}\big)$ iterations:
$$\mathbb{E}\,\mathcal{R}(\bar g_t) - \mathcal{R}(g_*) \le O\big(n^{-2r/\mu}\big) \quad \text{(improved)}$$
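A sketch of this sampling-with-replacement recursion (not from the slides), in the same spirit as the single-pass kernel sketch earlier; the number of passes, the kernel, and the step size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def kernel(u, v):
    return np.minimum(u, v)           # min kernel on [0, 1], an illustrative choice

n, passes, gamma = 300, 5, 0.25
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

T = passes * n                        # t > n iterations, i.e., multiple passes over the data
a = np.zeros(n)                       # g_t = sum_i a_i K_{x_i}: only the n support points appear
a_bar = np.zeros(n)

for t in range(T):
    i = rng.integers(n)               # i(t) uniform over {1, ..., n}: sampling with replacement
    pred = a @ kernel(x, x[i])        # g_{t-1}(x_{i(t)})
    a[i] -= gamma * (pred - y[i])     # the update only touches the coefficient of K_{x_{i(t)}}
    a_bar += (a - a_bar) / (t + 1)    # averaged estimator over the t iterations

train_mse = np.mean((np.array([a_bar @ kernel(x, xi) for xi in x]) - y) ** 2)
print(train_mse)
```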
Recall the statistically optimal rate for this problem:
$$O\big(n^{-\frac{2r\alpha}{2r\alpha+1}}\big).$$
The main theorem can be summed up in the following figure:
[Figure: the $(\alpha, r)$ plane, with $r = 1/2$ marked, partitioned into regions with optimal rates with one pass, optimal rates with multiple passes, and improved rates with multiple passes.]
Conclusion
We showed that the story for SGD with least squares is (almost) finished:
- Optimal rates for parametric SGD for regression
- Optimal rates for parametric SGD for classification
- Optimal rates for non-parametric SGD in almost all settings

What about going beyond least squares? Beyond linear settings? Dealing with non-linear activations?

Thanks for your attention!