TRANSCRIPT
Perspectives on Stochastic Gradient Descent for Machine Learning problems
Loucas Pillaud-Vivien
Cermics seminar, October 3, 2019
Outline
1 Supervised learning
  Machine learning: general context
  Mathematical framework
2 Stochastic Gradient Descent
  General results
  SGD for least squares in finite dimension
  SGD for least squares: classification problem
3 SGD in RKHS: non-parametric rates
  SGD in RKHS
  Multiple passes over the data
Examples of tasks
Goal: explain/predict a phenomenon given observations.
Two examples:
- Bio-informatics ($n \sim 10^3$, $d \sim 10^6$). Input: DNA sequence. Output: disease prediction.
- Vision ($n \sim 10^9$, $d \sim 10^6$). Input: digit image. Output: digit prediction.
Large-scale machine learning: large dimensionality $d$ and large number of samples $n$.
Supervised learning: mathematical framework
Input/output pair $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ distributed according to $\rho$.
$\rho$ is unknown; we only have access to $n$ i.i.d. samples. $\mathcal{Y} = \mathbb{R}$ for regression and $\mathcal{Y} = \{-1, 1\}$ for classification.
Goal: find a prediction function $g : \mathcal{X} \to \mathcal{Y}$ such that $g(X) \approx Y$.
Measure of the error through the risk, or generalization error:
$$\mathcal{R}(g) = \mathbb{E}_{(X,Y) \sim \rho}\,[\ell(Y, g(X))].$$
Supervised learning: parametrization
$$\mathcal{R}(g) = \mathbb{E}_{(X,Y) \sim \rho}\,[\ell(Y, g(X))]$$
Parametric case: $g$ belongs to a functional space $\mathcal{H}$ parametrized by some parameter $\theta \in \mathbb{R}^d$: $g(X) = g_\theta(X)$.
- Linear prediction: $g_\theta(x) = \langle \theta, \phi(x)\rangle$, with features $\phi(x) \in \mathbb{R}^d$.
- Neural networks: $g_\theta(x) = \theta_m^\top \sigma(\theta_{m-1}^\top \sigma(\cdots \sigma(\theta_1^\top x)))$.
Non-parametric case: prediction as a function $g \in \mathcal{H}$, for $\mathcal{H}$ an infinite-dimensional space.
Supervised learning cast as optimization
Machine Learning can be cast as an optimization problem:
Find $\inf_{g \in \mathcal{H}} \mathcal{R}(g) = \mathbb{E}_{(X,Y) \sim \rho}\,[\ell(Y, g(X))]$.
Questions:
- How to find the optimum, with i.i.d. samples only?
- How to compute it efficiently?
[Figure: the 0-1 loss and the square, hinge, and logistic losses as functions of the margin.]
Where optimization meets statistics

Data: $n$ i.i.d. observations $(x_i, y_i)_i \in \mathcal{X} \times \mathcal{Y}$.
Replace the unknown test error $\mathbb{E}_\rho\,(Y - g(X))^2$ by the computable training error (empirical risk):
$$\hat{\mathcal{R}}(g) := \frac{1}{n}\sum_{i=1}^n (y_i - g(x_i))^2.$$
If $\mathcal{H}$ is too rich −→ problem: overfitting.
General approach: regularize,
$$g_n := \operatorname*{argmin}_{g \in \mathcal{H}}\; \hat{\mathcal{R}}(g) + \lambda\,\Omega(g) \qquad \text{(data-fitting term + regularizer)}.$$
Tradeoffs in Machine Learning: example of linear regression

Linear regression in $\mathbb{R}^d$: $g_\theta(x) = \langle \theta, x\rangle$.
Let $f_i(\theta) := (y_i - \langle \theta, x_i\rangle)^2 + \lambda \|\theta\|_2^2$; then
$$\theta_n := \operatorname*{argmin}_{\theta \in \mathbb{R}^d}\; \frac{1}{n}\sum_{i=1}^n f_i(\theta).$$
Comparison of three methods:
- Explicit solution: with $X = (x_1, \ldots, x_n)^\top$ and $Y = (y_1, \ldots, y_n)^\top$, $\theta_n = X^\top (XX^\top + \lambda n I)^{-1} Y$ (equivalently $(X^\top X + \lambda n I)^{-1} X^\top Y$).
  Stability problems with the inversion, and cost $O(n^2 d + n^3)$.
- (Full) gradient descent: $\theta_{k+1} = \theta_k - \frac{\gamma}{n}\sum_{i=1}^n \nabla f_i(\theta_k)$.
  Cost of one iteration: $O(nd)$.
- Stochastic gradient descent on the ERM: $\theta_{k+1} = \theta_k - \gamma\, \nabla f_{i_k}(\theta_k)$, where $i_k$ is picked uniformly at random in $\{1, \ldots, n\}$.
  Cost of one iteration: $O(d)$.

Two important insights for ML [Bottou and Bousquet, 2008]:
1 No need to optimize below the statistical error.
2 The true risk is more important than the empirical risk.
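As an illustration (not part of the slides), here is a minimal numpy sketch of the three methods on synthetic data; the sample size, step sizes, and regularization level are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 10, 1e-2
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))                      # rows are the x_i
y = X @ theta_star + 0.1 * rng.normal(size=n)

# Explicit ridge solution via one linear solve (d x d primal form;
# the n x n dual form on the slide gives the same estimator).
theta_exp = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

# Full gradient descent: each step touches all n samples, O(nd) per iteration.
theta_gd = np.zeros(d)
for _ in range(500):
    grad = 2 * X.T @ (X @ theta_gd - y) / n + 2 * lam * theta_gd
    theta_gd -= 0.1 * grad

# SGD on the empirical risk: one random sample per step, O(d) per iteration.
theta_sgd = np.zeros(d)
for k in range(20 * n):
    i = rng.integers(n)                          # i_k uniform in {1, ..., n}
    grad_i = 2 * (X[i] @ theta_sgd - y[i]) * X[i] + 2 * lam * theta_sgd
    theta_sgd -= 0.02 * grad_i

# Both iterative methods end up close to the explicit solution.
print(np.linalg.norm(theta_gd - theta_exp), np.linalg.norm(theta_sgd - theta_exp))
```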
SGD: general setting
Goal: $\min_{\theta \in \mathbb{R}^d} f(\theta)$, given unbiased estimates $(\nabla f_i)_{i \le n}$ of the true gradient; let $\theta_* := \operatorname*{argmin}_{\theta \in \mathbb{R}^d} f(\theta)$.
Key algorithm: Stochastic Gradient Descent (SGD) [Robbins and Monro, 1951]:
$$\theta_t = \theta_{t-1} - \gamma_t\, \nabla f_t(\theta_{t-1}), \qquad \mathbb{E}\,[\nabla f_t(\theta_{t-1}) \mid \mathcal{F}_{t-1}] = \nabla f(\theta_{t-1}),$$
where $\theta_t$ is $\mathcal{F}_t$-measurable.
For a constant step size $\gamma_t = \gamma$, $(\theta_t)_t$ is a homogeneous Markov chain: its law converges to a stationary distribution $\pi_\gamma$ of radius $O(\gamma^{1/2})$ around $\theta_*$ [Dieuleveut, Durmus, Bach, 2017].
To make it converge, either use a decaying step size or average the iterates:
$$\bar\theta_t = \frac{1}{t}\sum_{i=1}^{t} \theta_i.$$
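A toy illustration of this dichotomy (not from the slides): on a one-dimensional quadratic, the constant-step iterate keeps fluctuating around the optimum while the running Polyak-Ruppert average settles much closer. The objective, noise model, and step size below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star, gamma, T = 3.0, 0.1, 20000

# f(theta) = 0.5 * (theta - theta_star)^2; stochastic gradients are the true
# gradient plus centred noise, hence unbiased.
theta, theta_bar = 0.0, 0.0
for t in range(1, T + 1):
    g = (theta - theta_star) + rng.normal()   # unbiased estimate of f'(theta)
    theta -= gamma * g                        # constant-step SGD iterate
    theta_bar += (theta - theta_bar) / t      # running Polyak-Ruppert average

print(abs(theta - theta_star))      # fluctuates at scale ~ sqrt(gamma) around theta_star
print(abs(theta_bar - theta_star))  # the averaged iterate is much closer
```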
SGD for Machine Learning
Data: $n$ i.i.d. observations $(x_i, y_i)_{i \le n} \in \mathcal{X} \times \mathcal{Y}$.
Loss for a single pair of observations, for $t \le n$:
$$f_t(\theta) = \ell(y_t, \langle \theta, x_t\rangle).$$
SGD for the true risk: $\mathcal{R}(\theta) = \mathbb{E}_\rho\,[\ell(Y, \langle \theta, X\rangle)]$.
Filtration adapted to the problem: $\mathcal{F}_t = \sigma\big((x_i, y_i)_{i \le t}\big)$.
Unbiased estimates of the gradient for each observation:
$$\nabla_\theta f_t(\theta) = \nabla_\theta\, \ell(y_t, \langle \theta, x_t\rangle), \qquad \mathbb{E}\,[\nabla_\theta f_t(\theta_{t-1}) \mid \mathcal{F}_{t-1}] = \nabla \mathcal{R}(\theta_{t-1}).$$
Single pass through the data – “Automatic” regularization.
SGD for least squares in finite dimension

Problem setting:
Data: $n$ i.i.d. observations $(x_i, y_i) \in \mathcal{X} \times \mathbb{R}$ with distribution $\rho$.
Least squares: find the optimal predictor $\theta_*$ minimizing
$$\mathcal{R}(\theta) = \mathbb{E}_\rho\,(y - \langle \theta, \phi(x)\rangle)^2.$$
Prediction: linear functions of feature vectors in $\mathbb{R}^d$: $g(x) = \langle \theta, \phi(x)\rangle$, with $\theta, \phi(x) \in \mathbb{R}^d$.
Assumptions: $\|\phi(x)\| \le R$, $|y| \le M$, $|y - \langle \theta_*, \phi(x)\rangle| \le \sigma$.
Statistical performance of $\theta_n$: defined as $\mathbb{E}\,\mathcal{R}(\theta_n) - \mathcal{R}(\theta_*)$.

In finite dimension, single-pass SGD achieves the optimal rate
$$\frac{4\sigma^2 \dim \mathcal{H}}{n} + \frac{4 R^2 \|\theta_*\|^2}{n} \qquad \text{[Bach, Moulines, 2013].}$$
Stochastic Gradient Descent
Regression problems: best convergence rates O(1/n).
Can it be faster for classification problems?
Binary classification: problem setting

Data: $(x, y) \in \mathcal{X} \times \{-1, 1\}$ distributed according to $\rho$.
Prediction: $y = \operatorname{sign} g(x)$, with $g(x) = \langle g, \phi(x)\rangle_{\mathcal{H}}$.
Aim: minimize over $g \in \mathcal{H}$ the error
$$\mathcal{R}_{01}(g) = \mathbb{E}\,\ell_{01}(y, g(x)) = \mathbb{E}\,\mathbf{1}_{y\,g(x) < 0}.$$

From error to losses. As $\ell_{01}$ is non-convex, we use convex surrogates, here the square loss:
[Figure: the 0-1 loss and the square, hinge, and logistic surrogates as functions of the margin $y\,g(x)$.]

Square loss: $\mathcal{R}(g) = \mathbb{E}\,\ell(y, g(x)) = \mathbb{E}\,(y - g(x))^2$, minimized by $g_*(x) = \mathbb{E}(y \mid x)$. Good news: $g_*^{01}(x) = \operatorname{sign}\mathbb{E}(y \mid x)$.
Ridge regression: $\mathcal{R}_\lambda(g) = \mathbb{E}\,(y - g(x))^2 + \lambda \|g\|^2_{\mathcal{H}}$, minimized by $g_\lambda$.

Excess error and loss [Bartlett et al., 2006]:
$$\underbrace{\mathbb{E}\,\ell_{01}(y, g(x)) - \ell_{01}^*}_{\text{excess error}} \;\le\; \sqrt{\underbrace{\mathbb{E}\,(y - g(x))^2 - \ell^*}_{\text{excess loss}}}.$$
If we plug in existing results for SGD: $\mathbb{E}\,\ell_{01}(y, g(x)) - \ell_{01}^* \le \frac{1}{\sqrt{\lambda n}}$ −→ not exponential.
Main assumptions

Margin condition (Mammen and Tsybakov, 1999)
Hard inputs to predict: $\mathbb{P}(y = 1 \mid x) = 1/2$, i.e., $\mathbb{E}(y \mid x) = 0$.
Easy inputs to predict: $\mathbb{P}(y = 1 \mid x) \in \{0, 1\}$, i.e., $|\mathbb{E}(y \mid x)| = 1$.

(A1) Margin condition: $\exists\, \delta > 0$ such that $|\mathbb{E}(y \mid x)| \ge \delta$ for all $x \in \operatorname{supp}(\rho_X)$.
(A2) Technical condition: $\exists\, \lambda > 0$ such that $\operatorname{sign}(\mathbb{E}(y \mid x))\, g_\lambda(x) \ge \delta/2$ for all $x \in \operatorname{supp}(\rho_X)$.

Consequence: if $\|g_\lambda - g\|_{L^\infty} < \delta/2$, then $\operatorname{sign} g(x) = \operatorname{sign}(\mathbb{E}(y \mid x))$.
Main result
Single-pass SGD through the data on the regularized problem:
$$g_n = g_{n-1} - \gamma_n \left[ (\langle \phi(x_n), g_{n-1}\rangle - y_n)\,\phi(x_n) + \lambda (g_{n-1} - g_0) \right].$$
Take the tail-averaged estimator $g_n^{\text{tail}} = \frac{1}{n/2} \sum_{i=n/2}^{n} g_i$ [Jain et al., 2016].

Theorem (P., Rudi, Bach, 2018)
Assume (A1), (A2), $n \ge \frac{1}{\lambda\gamma}\log\frac{R}{\delta}$ and $\gamma \le 1/(4R^2)$; then
$$\mathbb{E}_{x_1 \ldots x_n}\, \mathbb{E}\,\ell_{01}\big(y, g_n^{\text{tail}}(x)\big) - \ell_{01}^* \;\lesssim\; 4\exp\!\big(-\lambda^2 \delta^2 n / R^2\big).$$
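The following sketch (not from the slides) runs this regularized recursion and forms the tail-averaged estimator, in a plain finite-dimensional feature space rather than an RKHS; the toy data is built so that labels are deterministic and bounded away from the decision boundary, in the spirit of the margin condition, and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, gamma, lam = 5, 5000, 0.05, 1e-2

# Toy data with a hard margin: the label is the sign of the first feature,
# and that feature is bounded away from 0, so |E(y|x)| = 1.
X = rng.uniform(-1.0, 1.0, size=(n, d))
X[:, 0] += 0.25 * np.sign(X[:, 0])              # push the first coordinate away from 0
y = np.sign(X[:, 0])

g = np.zeros(d)                                  # g_0 = 0
iterates = []
for t in range(n):
    # Regularized SGD recursion:
    # g_n = g_{n-1} - gamma [ (<phi(x_n), g_{n-1}> - y_n) phi(x_n) + lam (g_{n-1} - g_0) ]
    g = g - gamma * ((X[t] @ g - y[t]) * X[t] + lam * g)
    iterates.append(g.copy())

g_tail = np.mean(iterates[n // 2:], axis=0)      # tail average over the last n/2 iterates

err_01 = np.mean(np.sign(X @ g_tail) != y)       # 0-1 error (on held-in points, for illustration)
print(err_01)                                    # often exactly 0 once n is moderately large
```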
Synthetic experiments
Comparing test/train losses and errors for tail-averaged SGD ($\mathcal{X} = [0, 1]$, $\mathcal{H}$ a Sobolev space).
Conclusion
Take-home message:
- Exponential convergence of the test error, not of the test loss
- Importance of the margin condition
Non-parametric random-design least-squares regression

Goal:
$$\min_{g}\; \mathcal{R}(g) = \mathbb{E}_\rho\,(Y - g(X))^2$$
$\rho_X$ is the marginal distribution of $X$ on $\mathcal{X}$; $L^2_{\rho_X}$ is the set of square-integrable functions w.r.t. $\rho_X$.
The Bayes predictor minimizes the quadratic risk over $L^2_{\rho_X}$:
$$g_\rho(X) = \mathbb{E}\,[Y \mid X].$$
Moreover, for any function $g$ in $L^2_{\rho_X}$, the excess risk is
$$\mathcal{R}(g) - \mathcal{R}(g_\rho) = \|g - g_\rho\|^2_{L^2_{\rho_X}}.$$
$\mathcal{H}$ a space of functions: there exists $g_{\mathcal{H}} \in \overline{\mathcal{H}}^{L^2_{\rho_X}}$ such that $\mathcal{R}(g_{\mathcal{H}}) = \inf_{g \in \mathcal{H}} \mathcal{R}(g)$.
Reproducing Kernel Hilbert Space
A Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ is a space of functions from $\mathcal{X}$ into $\mathbb{R}$ such that there exists a reproducing kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfying:
- For any $x \in \mathcal{X}$, $\mathcal{H}$ contains the function $K_x$ defined by $K_x : z \mapsto K(x, z)$.
- For any $x \in \mathcal{X}$ and $f \in \mathcal{H}$, the reproducing property holds: $\langle K_x, f\rangle_{\mathcal{H}} = f(x)$.
Why are RKHS so nice?
Computation:
- Linear spaces of functions.
- Existence of gradients (Hilbert space).
- Only deal with functions in $\operatorname{span}\{K_{x_i},\ i = 1, \ldots, n\}$ (representer theorem).
−→ the algebraic framework is preserved!
Approximation: many kernels satisfy $\overline{\mathcal{H}}^{L^2_{\rho_X}} = L^2_{\rho_X}$: there is no approximation error!
Representation: the feature map $\mathcal{X} \to \mathcal{H}$, $x \mapsto K_x$, maps points from any set into a linear space, where a linear method can be applied.
Stochastic approximation in the RKHS

As $\mathcal{R}(g) = \mathbb{E}\big[(\langle g, K_X\rangle_{\mathcal{H}} - Y)^2\big]$, for each pair of observations,
$$(\langle g, K_{x_n}\rangle_{\mathcal{H}} - y_n)\, K_{x_n} = (g(x_n) - y_n)\, K_{x_n}$$
is an unbiased stochastic gradient of $\mathcal{R}$ at $g$.
Stochastic gradient recursion, starting from $g_0 \in \mathcal{H}$:
$$g_n = g_{n-1} - \gamma\, [\langle g_{n-1}, K_{x_n}\rangle_{\mathcal{H}} - y_n]\, K_{x_n},$$
where $\gamma$ is the step size. Thus
$$g_n = \sum_{i=1}^{n} a_i K_{x_i}, \qquad \text{with } a_n = -\gamma\,(g_{n-1}(x_n) - y_n) \text{ for } n \ge 1.$$
With averaging,
$$\bar g_n = \frac{1}{n+1}\sum_{k=0}^{n} g_k.$$
Total complexity: $O(n^2)$.
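As an illustration (not from the slides), here is a minimal numpy sketch of this recursion, keeping the coefficients $(a_i)$ and a running average of the iterates; the min kernel, the target $\sin(2\pi x)$, and the step size are arbitrary illustrative choices. Each iteration costs $O(n)$ kernel evaluations, giving the $O(n^2)$ total complexity.

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(u, v):
    # Min kernel on [0, 1] (first-order Sobolev kernel), an illustrative choice.
    return np.minimum(u, v)

n, gamma = 500, 0.25
x = rng.uniform(0.0, 1.0, size=n)                    # stream of i.i.d. inputs
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

a = np.zeros(n)        # coefficients of g_n = sum_i a_i K_{x_i}
a_bar = np.zeros(n)    # coefficients of the running average of the iterates

for t in range(n):
    # g_{t-1}(x_t) = sum_{i<t} a_i K(x_i, x_t): O(t) kernel evaluations.
    pred = a[:t] @ kernel(x[:t], x[t])
    a[t] = -gamma * (pred - y[t])                    # new coefficient from the recursion
    a_bar = a_bar * t / (t + 1) + a / (t + 1)        # running average (ignoring g_0 = 0)

# Rough check of the fit of the averaged predictor on a few test points.
x_test = np.linspace(0.0, 1.0, 5)
g_bar = np.array([a_bar @ kernel(x, xt) for xt in x_test])
print(np.c_[x_test, g_bar, np.sin(2 * np.pi * x_test)])
```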
Kernel regression: Analysis
Assume $\mathbb{E}\,[K(X, X)]$ and $\mathbb{E}\,[Y^2]$ are finite. Define the covariance operator
$$\Sigma = \mathbb{E}\big[K_X K_X^\top\big], \qquad \langle f, \Sigma g\rangle = \mathbb{E}_{\rho_X}\big(f(X)\, g(X)\big).$$
Parametrization of the problem:
- Capacity condition: eigenvalue decay of $\Sigma$.
- Source condition: position of $g_{\mathcal{H}}$ w.r.t. the kernel space $\mathcal{H}$.
$\Sigma$ is a trace-class operator that can be decomposed over its eigenspaces; its powers $\Sigma^\tau$, $\tau > 0$, are thus well defined.
Capacity condition (CC)

CC($\alpha$): for some $\alpha \ge 1$, we assume that $\operatorname{tr}(\Sigma^{1/\alpha}) < \infty$.
If we denote $(\mu_i)_{i \in I}$ the sequence of non-zero eigenvalues of the operator $\Sigma$, in decreasing order, then $\mu_i = O(i^{-\alpha})$.

[Figure: eigenvalue decay of the covariance operator, $\log_{10}(\mu_i)$ vs $\log_{10}(i)$, for the first-order Sobolev (min) kernel and the Gaussian kernel.]
Left: min kernel, $\rho_X = \mathcal{U}[0, 1]$ −→ CC($\alpha = 2$).
Right: Gaussian kernel, $\rho_X = \mathcal{U}[-1, 1]$ −→ CC($\alpha$) for all $\alpha \ge 1$.
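A quick numerical check of these two examples (not from the slides): approximate the eigenvalues of $\Sigma$ by those of the normalized kernel matrix $K/n$ and fit the decay on a log-log scale. The sample size, bandwidth, and fitting range are arbitrary choices; the fitted slope for the min kernel should come out close to $-2$, i.e., CC($\alpha = 2$), while the Gaussian kernel decays much faster.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

x_min = rng.uniform(0.0, 1.0, n)       # rho_X = U[0, 1] for the min kernel
x_gauss = rng.uniform(-1.0, 1.0, n)    # rho_X = U[-1, 1] for the Gaussian kernel

K_min = np.minimum.outer(x_min, x_min)                          # min (Sobolev) kernel
K_gauss = np.exp(-(x_gauss[:, None] - x_gauss[None, :]) ** 2)   # Gaussian kernel, bandwidth 1

for name, K in [("min", K_min), ("Gaussian", K_gauss)]:
    # The eigenvalues of K/n approximate the eigenvalues of the covariance operator Sigma.
    mu = np.sort(np.linalg.eigvalsh(K / n))[::-1]
    i = np.arange(5, 30)
    slope = np.polyfit(np.log10(i + 1.0), np.log10(np.maximum(mu[i], 1e-15)), 1)[0]
    print(f"{name} kernel: log-log slope of the eigenvalue decay ~ {slope:.2f}")
```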
Source condition (SC)

Concerning the optimal function $g_{\mathcal{H}}$, we assume:
SC($r$): for some $r > 0$, $g_{\mathcal{H}} \in \Sigma^r\big(L^2_{\rho_X}\big)$, i.e., $\|\Sigma^{-r} g_{\mathcal{H}}\|_{L^2_{\rho_X}} < \infty$.

[Figure: nested spaces $L^2_{\rho_X} \supset \Sigma^r(L^2_{\rho_X})$ and $\mathcal{H} = \Sigma^{1/2}(L^2_{\rho_X})$, showing where $g_{\mathcal{H}}$ lies in the three regimes $r < 0.5$, $r = 0.5$, and $r > 0.5$.]
Non-parametric stochastic approximation (NPSA) with large step sizes

Theorem (Dieuleveut, Bach, 2016)
Assume CC($\alpha$) and SC($r$). Then for any $\gamma \le \frac{1}{4R^2}$,
$$\mathbb{E}\,\mathcal{R}(\bar g_n) - \mathcal{R}(g_{\mathcal{H}}) \;\le\; \frac{4\sigma^2 \gamma^{1/\alpha}\operatorname{tr}(\Sigma^{1/\alpha})}{n^{1-1/\alpha}} + \frac{4\,\|\Sigma^{-r}(g_{\mathcal{H}} - g_0)\|^2_{L^2_{\rho_X}}}{\gamma^{2r}\, n^{\min(2r,\,2)}}.$$
For $\gamma = \gamma_0\, n^{-\frac{2\alpha r + 1 - \alpha}{2\alpha r + 1}}$ and $\frac{\alpha - 1}{2\alpha} \le r \le 1$,
$$\mathbb{E}\,\mathcal{R}(\bar g_n) - \mathcal{R}(g_{\mathcal{H}}) \;\le\; n^{-\frac{2\alpha r}{2\alpha r + 1}} \Big( 4\sigma^2 \operatorname{tr}(\Sigma^{1/\alpha}) + 4\,\|\Sigma^{-r}(g_{\mathcal{H}} - g_0)\|^2_{L^2_{\rho_X}} \Big).$$

Statistically optimal rate [Caponnetto and De Vito, 2007]. Beyond: online setting, minimal assumptions...
Result for multipass SGD

$$\mathbb{E}\,\mathcal{R}(\bar g_n) - \mathcal{R}(g_{\mathcal{H}}) \;\le\; \frac{4\sigma^2 \gamma^{1/\alpha}\operatorname{tr}(\Sigma^{1/\alpha})}{n^{1-1/\alpha}} + \frac{4\,\|\Sigma^{-r}(g_{\mathcal{H}} - g_0)\|^2_{L^2_{\rho_X}}}{\gamma^{2r}\, n^{\min(2r,\,2)}}.$$

This single-pass result is optimal only in the easy cases $r \ge \frac{\alpha - 1}{2\alpha}$, because the bias term saturates.
Idea: take multiple passes over the data to make the bias decrease. Aims:
- Get optimality in the hard cases with multiple passes.
- Bridge the gap between theory and practice.

[Figure: the $(\alpha, r)$ plane, with the level $r = 1/2$ marked, split into easy problems ($r \ge \frac{\alpha - 1}{2\alpha}$) and hard problems below this curve.]
Sampling with replacement from $n$ i.i.d. observations $(x_i, y_i)$:
$$g_t = g_{t-1} - \gamma_t\, (\langle K_{x_{i(t)}}, g_{t-1}\rangle - y_{i(t)})\, K_{x_{i(t)}},$$
where each $i(u)$ is uniform over $\{1, \ldots, n\}$. Averaged estimator over $t \ge n$ iterations: $\bar g_t = \frac{1}{t}\sum_{i=1}^{t} g_i$.

Theorem (Convergence of multiple-pass SGD for hard problems; P., Rudi, Bach, 2018)
Let $n \in \mathbb{N}^*$, $t \ge n$ and $\gamma = 1/(4R^2)$; here $\mu$ is an additional regularity parameter of the problem.
- For $\mu\alpha < 2r\alpha + 1 < \alpha$, after $t = \Theta\big(n^{\alpha/(2r\alpha+1)}\big)$ iterations:
$$\mathbb{E}\,\mathcal{R}(\bar g_t) - \mathcal{R}(g_*) = O\big(n^{-2r\alpha/(2r\alpha+1)}\big) \quad \text{(optimal)}$$
- For $\mu\alpha > 2r\alpha + 1$, after $t = \Theta\big(n^{1/\mu}\,(\log n)^{1/\mu}\big)$ iterations:
$$\mathbb{E}\,\mathcal{R}(\bar g_t) - \mathcal{R}(g_*) \le O\big(n^{-2r/\mu}\big) \quad \text{(improved)}$$
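A sketch of this sampling-with-replacement recursion (not from the slides), in the same spirit as the single-pass kernel sketch earlier; the number of passes, the kernel, and the step size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def kernel(u, v):
    return np.minimum(u, v)           # min kernel on [0, 1], an illustrative choice

n, passes, gamma = 300, 5, 0.25
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

T = passes * n                        # t > n iterations, i.e., multiple passes over the data
a = np.zeros(n)                       # g_t = sum_i a_i K_{x_i}: only the n support points appear
a_bar = np.zeros(n)

for t in range(T):
    i = rng.integers(n)               # i(t) uniform over {1, ..., n}: sampling with replacement
    pred = a @ kernel(x, x[i])        # g_{t-1}(x_{i(t)})
    a[i] -= gamma * (pred - y[i])     # the update only touches the coefficient of K_{x_{i(t)}}
    a_bar += (a - a_bar) / (t + 1)    # averaged estimator over the t iterations

train_mse = np.mean((np.array([a_bar @ kernel(x, xi) for xi in x]) - y) ** 2)
print(train_mse)
```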
Recall the statistically optimal rate for this problem:
$$O\big(n^{-\frac{2r\alpha}{2r\alpha+1}}\big).$$
The main theorem can be summed up in the following figure:
[Figure: the $(\alpha, r)$ plane, with $r = 1/2$ marked, partitioned into regions with optimal rates with one pass, optimal rates with multiple passes, and improved rates with multiple passes.]
Conclusion
We showed that the story for SGD with least squares is (almost) finished:
- Optimal rates for parametric SGD for regression
- Optimal rates for parametric SGD for classification
- Optimal rates for non-parametric SGD in almost all settings

What about going beyond least squares? Beyond linear settings? Dealing with non-linear activations?

Thanks for your attention!