Exercise List: Convergence rates and complexity
Francis Bach and Robert M. Gower
February 10, 2019
1 Rate of convergence and complexity
All the algorithm we discuss in the course generate a sequence of random vectors xt thatconverge to a desired x∗ in some sense. Because the xt’s are random we always proveconvergence in expectation. In particular, we focus on two forms of convergence, eithershowing that the difference of function values converges
E[f(xt)− f(x∗)
]−→ 0,
or the expected norm difference of the iterates converges
E[‖xt − x∗‖2
]−→ 0.
Two important questions: 1) How fast is this convergence and 2) given an ε how manyiterations t are needed before E
[f(xt)− f(x∗)
]< ε or E
[‖xt − x∗‖2
]< ε.
Ex. 1 — Consider a sequence (αt)t ∈ R+ that converge to zero according to
αt ≤C
t,
where C > 0. Given an ε > 0, show that
t ≥ C
ε⇒ αt < ε.
We refer to this result as a O(1/ε) iteration complexity.
Ex. 2 — Using that1
1− ρlog
(1
ρ
)≥ 1, (1)
prove the following lemma.
1
Lemma 1.1. Consider the sequence (αk)k ∈ R+ of positive scalars that converges tozero according to
αk ≤ ρk α0, (2)
where ρ ∈ [0, 1). For a given 1 > ε > 0 we have that
k ≥ 1
1− ρlog
(1
ε
)⇒ αk ≤ ε α0. (3)
We refer to this as a O(log(1/ε)) iteration complexity.
Following the introduction, we can write αt def= E
[f(xt)− f(x∗)
]or αt def
= E[‖xt − x∗‖2
].
The type of convergence (2) is known as linear convergence at a rate of ρk.
2
Exercise List: Proving convergence of the Stochastic Gradient
Descent and Coordinate Descent on the Ridge Regression
Problem.
Robert M. Gower & Francis Bach & Nidham Gazagnadou
February 10, 2019
Introduction
Consider the task of learning a rule that maps the feature vector x ∈ Rd to outputs y ∈ R.Furthermore you are given a set of labelled observations (xi, yi) for i = 1, . . . , n. Werestrict ourselves to linear mappings. That is, we need to find w ∈ Rd such that
x>i w ≈ yi, for i = 1, . . . , n. (1)
That is the hypothesis function is parametrized by w and is given by hw : x 7→ w>x.1 Tochoose a w such that each x>i w is close to yi, we use the squared loss `(y) = y2/2 and thesquared regularizor. That is, we minimize
w∗ = arg minw
1
n
n∑i=1
1
2(x>i w − yi)2 +
λ
2‖w‖22, (2)
where λ > 0 is the regularization parameter. We now have a complete training prob-lem (2)2.
Using the matrix notation
Xdef= [x1, . . . , xn] ∈ Rd×n, and y = [y1, . . . , yn] ∈ Rn, (3)
we can re-write the objective function in (2) as
f(w)def=
1
2n‖X>w − y‖22 +
λ
2‖w‖22. (4)
First we introduce some necessary notation.
1We need only consider a linear mapping as opposed to the more general affine mapping xi 7→ w>xi +β,because the zero order term β ∈ R can be incorporated by defining a new feature vectors xi = [x1, 1] andnew variable w = [w, β] so that x>i w = x>i w + β
2Excluding the issue of selection λ using something like crossvalidation https://en.wikipedia.org/
wiki/Cross-validation_(statistics)
1
Notation: For every x,w,∈ Rd let 〈x,w〉 def= x>y and let ‖x‖2 =√〈x, x〉. Let A ∈ Rd×d
be a matrix and let σmin(A) and σmax(A) be the smallest and largest singular values of Adefined by
σmin(A)def= min
x∈Rd, x 6=0
‖Ax‖2‖x‖2
and σmax(A)def= max
x∈Rd, x 6=0
‖Ax‖2‖x‖2
. (5)
Finally, a result you will need, if A is a symmetric positive semi-definite matrix thelargest singular value of A can be defined instead as
σmax(A) = maxx∈Rd, x 6=0
〈Ax, x〉2‖x‖22
= maxx∈Rd, x 6=0
‖Ax‖2‖x‖2
. (6)
Therefore〈Ax, x〉‖x‖22
≤ σmax(A), ∀x ∈ Rd \ {0}. (7)
and‖Ax‖2‖x‖2
≤ σmax(A), ∀x ∈ Rd \ {0}. (8)
We will now solve the following ridge regression problem
w∗ = arg minw∈Rd
(1
2n‖X>w − y‖22 +
λ
2‖w‖22
def= f(w)
), (9)
using stochastic gradient descent and stochastic coordinate descent.
Exercise 1 : Stochastic Gradient Descent (SGD)
Some more notation: Let ‖A‖2Fdef= Tr
(A>A
)denote the Frobenius norm of A. Let
Adef= 1
nXX> + λI ∈ Rd×d and b
def= 1
nXy. (10)
We can exploit the separability of the objective function (2) to design a stochasticgradient method. For this, first we re-write the problem Aw = b as different linear leastsquares problem
w∗ = arg minw
12‖Aw−b‖
22 = arg min
w
d∑i=1
12(Ai:w−bi)2
def= arg min
w
d∑i=1
pifi(w), (11)
where fi(w) = 12pi
(Ai:w − bi)2, Ai: denotes the ith row of A, bi denotes the ith element of
b and pi =‖Ai:‖22‖A‖2F
for i = 1, . . . , d. Note that∑d
i=1 pi = 1 thus the pi’s are probabilities.
2
From a given w0 ∈ Rd, consider the iterates
wt+1 = wt − α∇fj(wt), (12)
where
α =1
‖A‖2F, (13)
and j is a random index chosen from {1, . . . , d} sampled with probability pj . In other words,
P(j = i) = pi =‖Ai:‖22‖A‖2F
for all i ∈ {1, . . . , d}.
Question 1.1: Show that the solution w∗ to (11) and the solution to w∗ to (9) are equal.
Question 1.2: Show that
∇fj(w) =1
pjA>j:Aj:(w − w∗) (14)
and that
Ej∼p [∇fj(w)]def=
d∑i=1
pi∇fi(w) = A>A(w − w∗) ,
thus ∇fj(w) is an unbiased estimator of the full gradient of the objective function in (11).This justifies applying the stochastic gradient method.
Question 1.3: Let Πjdef=
A>j:Aj:
‖Aj:‖22, show that
ΠjΠj = Πj , (15)
and(I −Πj)(I −Πj) = I −Πj . (16)
In other words, Πj is a projection operator which projects orthogonally onto Range (Aj:) .Furthermore, if j ∼ pj verify that
E [Πj ] =d∑i=1
piΠi =A>A
‖A‖2F. (17)
Question 1.4: Show the following equality ruling the squared norm of the distance tothe solution
‖wt+1 − w∗‖22 = ‖wt − w∗‖22 −
⟨A>j:Aj:
‖Aj:‖22(wt − w∗), wt − w∗
⟩. (18)
3
Question 1.5: Using previous answer and analogous techniques from the course, showthat the iterates (12) converge according to
E[‖wt+1 − w∗‖22
]≤
(1− σmin(A)2
‖A‖2F
)E[‖wt − w∗‖22
]. (19)
Remark: This is an amazing and recent result [2], since it shows that SGD convergesexponentially fast despite the fact that the iterates (14) only require access to a single rowof A at a time! This result can be extended to solving any linear system Aw = b, includingthe case where A rank deficient. Indeed, so long as there exists a solution to Aw = b, the
iterates (14) converge to the solution of least norm and at rate of(
1− σ+min(A)
2
‖A‖2F
)where
σ+min(A) is the smallest nonzero singular value of A [1]. Thus this method can solve anylinear system.
4
BONUS
Exercise 2: Stochastic Coordinate Descent (CD)
Consider the minimization problem
w∗ = arg minx∈Rd
(f(w)
def=
1
2w>Aw − w>b
), (20)
where A ∈ Rd×d is a symmetric positive definite matrix, and w, b ∈ Rd.
Question 2.1: First show that, using the notation (10), solving (20) is equivalent tosolving (9).
Question 2.2: Show that∂f(w)
∂wi= Ai:w − bi , (21)
where Ai: is the ith row of A. Furthermore note that w∗ = A−1b, thus
∂f(w)
∂wi= e>i (Aw − b) = e>i A(w − w∗) . (22)
Question 2.3: Consider a step of the stochastic coordinate descent method
wk+1 = wk − αi∂f(wk)
∂xiei, (23)
where ei ∈ Rd is the ith unit coordinate vector, αi =1
Aii, and i ∈ {1, . . . , d} is sampled
i.i.d at each step according to i ∼ pi where pi =Aii
Tr (A). Let ‖x‖2A
def= x>Ax.
First, prove that
‖wk+1 − w∗‖2A =⟨
(I −Π>i )A(I −Πi)(wk − w∗), wk − w∗
⟩. (24)
Question 2.4: Let rkdef= A1/2(wk − w∗). Deduce from (24) that
‖rk+1‖22 = ‖rk‖22 −
⟨A1/2eie
>i A
1/2
Aiirk, rk
⟩. (25)
5
Question 2.5: Finally, prove the convergence of the iterates of CD (23) convergeaccording to
E[‖wk+1 − w∗‖2A
]≤
(1− λmin(A)
Tr (A)
)E[‖wk − w∗‖2A
](26)
thus (23) converges to the solution.Hint: Since A is symmetric positive definite you can use that
λmin(A) = infx∈Rd,x 6=0
x>Ax
‖x‖22.
You will need to use that x>Ax ≥ λmin(A)‖x‖22 at some point.
Question 2.6: When is this stochastic gradient method (14) faster than the stochasticcoordinate descent method of gradient descent? Note that the each iteration of SGD andCD costs O(d) floating point operations while an iteration of the GD method costs O(d2)floating point operations (assuming that A has been previously calculated and stored).What happens if d is very big? What if ‖A‖2F is very large? Discuss this.
References
[1] R. M. Gower and P. Richtarik. “Stochastic Dual Ascent for Solving Linear Systems”.In: arXiv:1512.06890 (2015).
[2] T. Strohmer and R. Vershynin. “A Randomized Kaczmarz Algorithm with ExponentialConvergence”. In: Journal of Fourier Analysis and Applications 15.2 (2009), pp. 262–278.
6
(BONUS) Exercise List: Proving convergence of the Stochastic
Gradient Descent for smooth and convex functions.
Robert M. Gower
February 10, 2019
1 Introduction
Consider the problem
w∗ ∈ arg minw
(1
n
n∑i=1
fi(w)def= f(w)
), (1)
where we assume that f(w) is µ–strongly quasi-convex
f(w∗) ≥ f(w) + 〈w∗ − w,∇f(w)〉+µ
2‖w − w∗‖2, (2)
and each fi is convex and Li–smooth
fi(w + h) ≤ fi(w) + 〈∇fi(w), h〉+Li2‖h‖2, for i = 1, . . . , n. (3)
Here we will provide a modern proof of the convergence of the SGD algorithm
wt+1 = wt − γt∇fit(wt), where it ∼1
n. (4)
The result we will prove is given in the following theorem.
Theorem 1.1. Assume f is µ-quasi-strongly convex and the fi’s are convex and Li–smooth. LetLmax = maxi=1,...,n Li and let
σ2def=
n∑i=1
1
n‖∇fi(w∗)‖2. (5)
Choose γt = γ ∈ (0, 12Lmax
] for all t. Then the iterates of SGD given by (4) satisfy:
E‖wt − w∗‖2 ≤ (1− γµ)t ‖w0 − w∗‖2 + 2γσ2
µ . (6)
2 Proof of Theorem 1.1
We will now give a modern proof of the convergance of SGD.
1
Ex. 1 — Let Et [·] def= E
[· |wt
]and consider the tth iteration of the SGD method (4). Show that
Et[∇fit(wt)
]= ∇f(wt).
Ex. 2 — Let Et [·] def= E
[· |wt
]be the expectation conditioned on wt. Using a step of SGD (4)
show that
Et[‖wt+1 − w∗‖2
]= ‖wt − w∗‖2 − 2γ
⟨wt − w∗,∇f(wt)
⟩+ γ2
n∑i=1
1
n‖∇fi(wt)‖2. (7)
Ex. 3 — Now we need to bound the term∑n
i=11n‖∇fi(w
t)‖2 to continue the proof. We breakthis into the following steps.
Part I
Using that each fi is Li–smooth and convex and using Lemma A.1 in the appendix show that
n∑i=1
1
2nLi‖∇fi(w)−∇fi(w∗)‖22 ≤ f(w)− f(w∗). (9)
Hint : Remember that ∇f(w∗) = 0.Now let Lmax = maxi=1,...,n Li and conlude that
n∑i=1
1
n‖∇fi(w)−∇fi(w∗)‖22 ≤ 2Lmax(f(w)− f(w∗)). (10)
Part II
Using (10) and Definition 5 show that
n∑i=1
1
n‖∇fi(w)‖2 ≤ 4Lmax(f(w)− f(w∗)) + 2σ2. (11)
Ex. 4 — Using (11) together with (7) and the strong quasi-convexity (2) of f(w) show that
Et[‖wt+1 − w∗‖2
]≤ (1− µγ)‖wt − w∗‖2 + 2γ(2γLmax − 1)(f(wt)− f(w∗)) + 2σ2γ2. (15)
Ex. 5 — Using that γ ∈ (0, 12Lmax
] conclude the proof by taking expectation again, and unrollingthe recurrence.
Ex. 6 — BONUS importance sampling: Let it ∼ pi in the SGD update (4), where pi > 0 areprobabilities with
∑ni=1 pi = 1. What should the pi’s be so that SGD has the fastest convergence?
3 Decreasing step-sizes
Based on Theorem 1.1 we can introduce a decreasing stepsize.
2
Theorem 3.1 (Decreasing stepsizes). Let f be µ–strongly quasi-convex and each fi be Li–smooth
and convex. Let K def= Lmax/µ and
γt =
1
2Lmaxfor t ≤ 4dKe
2t+1(t+1)2µ
for t > 4dKe.(18)
If t ≥ 4dKe, then SGD iterates given by (4) satisfy:
E‖wt − w∗‖2 ≤ σ2
µ28
t+
16
e2dKe2
t2‖w0 − w∗‖2. (19)
Proof. Let γtdef= 2t+1
(t+1)2µand let t∗ be an integer that satisfies γt∗ ≤ 1
2Lmax. In particular this holds
fort∗ ≥ d4K − 1e.
Note that γt is decreasing in t and consequently γt ≤ 12Lmax
for all t ≥ t∗. This in turn guaranteesthat (6) holds for all t ≥ t∗ with γt in place of γ, that is
E‖rt+1‖2 ≤ t2
(t+ 1)2E‖rt‖2 +
2σ2
µ2(2t+ 1)2
(t+ 1)4. (20)
Multiplying both sides by (t+ 1)2 we obtain
(t+ 1)2E‖rt+1‖2 ≤ t2E‖rt‖2 +2σ2
µ2
(2t+ 1
t+ 1
)2
≤ t2E‖rt‖2 +8σ2
µ2,
where the second inequality holds because 2t+1t+1 < 2. Rearranging and summing from j = t∗ . . . t
we obtain:t∑
j=t∗
[(j + 1)2E‖rj+1‖2 − j2E‖rj‖2
]≤
t∑j=t∗
8σ2
µ2. (21)
Using telescopic cancellation gives
(t+ 1)2E‖rt+1‖2 ≤ (t∗)2E‖rt∗‖2 +8σ2(t− t∗)
µ2.
Dividing the above by (t+ 1)2 gives
E‖rt+1‖2 ≤ (t∗)2
(t+ 1)2E‖rt∗‖2 +
8σ2(t− t∗)µ2(t+ 1)2
. (22)
For t ≤ t∗ we have that (6) holds, which combined with (22), gives
E‖rt+1‖2 ≤ (t∗)2
(t+ 1)2
(1− µ
2Lmax
)t∗‖r0‖2
+σ2
µ2(t+ 1)2
(8(t− t∗) +
(t∗)2
K
). (23)
3
Choosing t∗ that minimizes the second line of the above gives t∗ = 4dKe, which when insertedinto (23) becomes
E‖rt+1‖2 ≤ 16dKe2
(t+ 1)2
(1− 1
2K
)4dKe‖r0‖2
+σ2
µ28(t− 2dKe)
(t+ 1)2
≤ 16dKe2
e2(t+ 1)2‖r0‖2 +
σ2
µ28
t+ 1, (24)
where we have used that(1− 1
2x
)4x ≤ e−2 for all x ≥ 1.
A Appendix: Auxiliary smooth and convex lemma
As a consequence of the fi’s being smooth and convex we have that f is also smooth and convex.In particular f is convex since it is a convex combination of the fi’s. This gives us the followinguseful lemma.
Lemma A.1. If f is both L–smooth
f(z) ≤ f(w) + 〈∇f(w), z − w〉+L
2‖z − w‖22 (25)
and convexf(z) ≥ f(y) + 〈∇f(y), z − y〉 , (26)
then we have that
f(y)− f(w) ≤ 〈∇f(y), y − w〉 − 1
2L‖∇f(y)−∇f(w)‖22. (27)
Proof. To prove (27), it follows that
f(y)− f(w) = f(y)− f(z) + f(z)− f(w)
(26)+(25)
≤ 〈∇f(y), y − z〉+ 〈∇f(w), z − w〉+L
2‖z − w‖22.
To get the tightest upper bound on the right hand side, we can minimize the right hand side in z,which gives
z = w − 1
L(∇f(w)−∇f(y)). (28)
Substituting this in gives
f(y)− f(w) =
⟨∇f(y), y − w +
1
L(∇f(w)−∇f(y))
⟩− 1
L〈∇f(w),∇f(w)−∇f(y)〉+
1
2L‖∇f(w)−∇f(y)‖22
= 〈∇f(y), y − w〉 − 1
L‖∇f(w)−∇f(y)‖22 +
1
2L‖∇f(w)−∇f(y)‖22
= 〈∇f(y), y − w〉 − 1
2L‖∇f(w)−∇f(y)‖22.
4