A geometric alternative to Nesterov accelerated gradient descent
Bubeck, Lee, Singh, 2015
Yulia Rubanova
University of Toronto
November 29, 2017
Bubeck, Lee, Singh, 2015 Geometric alternative to Nesterov GD
Problem statement
Goal: minimize a β-smooth and α-strongly convex function
$$\min_{x \in \mathbb{R}^n} f(x)$$
α-strongly convex function ("not too flat"):
$$\forall x, y \in \mathbb{R}^n: \quad f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\alpha}{2} \|y - x\|^2$$
β-smooth function ("not too curvy"):
$$\forall x, y \in \mathbb{R}^n: \quad f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{\beta}{2} \|y - x\|^2$$
Problem statement
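[Figure: $f$ together with the quadratic bounds $f^{\beta}_{\mathrm{sm}}$ and $f^{\alpha}_{\mathrm{conv}}$ built at $x$; the minimizer $x^*$ is marked.]
$$f^{\beta}_{\mathrm{sm}}(y) = f(x) + \nabla f(x)^T (y - x) + \frac{\beta}{2} \|y - x\|^2$$
$$f^{\alpha}_{\mathrm{conv}}(y) = f(x) + \nabla f(x)^T (y - x) + \frac{\alpha}{2} \|y - x\|^2$$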
Classic approach: gradient descent
Gradient descent:
$$x_{t+1} = x_t - \gamma \nabla f(x_t), \quad t = 0, 1, \dots$$
GD convergence rate: $O\left(\left(1 - \frac{1}{\kappa}\right)^t\right)$, where $\kappa = \frac{\beta}{\alpha}$
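A minimal sketch of the gradient descent update, assuming a hypothetical diagonal quadratic test function and the step size $\gamma = 1/\beta$ (the slide leaves $\gamma$ unspecified). The names `LAM`, `SHIFT`, and `gradient_descent` are illustrative and do not come from the slides. The run records the squared distance to the minimizer so that the $(1 - 1/\kappa)^t$ contraction can be checked numerically.

```python
# Hypothetical test problem: f(x) = 0.5 * sum_i LAM[i] * (x[i] - SHIFT[i])**2,
# which is ALPHA-strongly convex and BETA-smooth with minimizer SHIFT.
LAM = [1.0, 4.0, 10.0]
SHIFT = [1.0, -2.0, 0.5]
ALPHA, BETA = min(LAM), max(LAM)
KAPPA = BETA / ALPHA


def grad(x):
    """Gradient of the test quadratic."""
    return [lam * (xi - si) for lam, xi, si in zip(LAM, x, SHIFT)]


def dist_sq(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def gradient_descent(x0, steps, gamma):
    """Iterate x_{t+1} = x_t - gamma * grad f(x_t).

    Returns the final iterate and the squared distances to the minimizer.
    """
    x = list(x0)
    errors = [dist_sq(x, SHIFT)]
    for _ in range(steps):
        x = [xi - gamma * gi for xi, gi in zip(x, grad(x))]
        errors.append(dist_sq(x, SHIFT))
    return x, errors


# Step size 1/BETA is an assumption made for this sketch.
x_final, errors = gradient_descent([0.0, 0.0, 0.0], steps=50, gamma=1.0 / BETA)
```

On this test problem each entry `errors[t]` stays below `(1 - 1/KAPPA)**t * errors[0]`, in line with the stated rate.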
Can we obtain faster convergence? Yes!
Can we obtain faster convergence?
Accelerated Gradient Descent algorithm [Nesterov, 83]:
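Let $y_0 = x_0$. For $t = 0, 1, 2, \dots$, update:
$$x_{t+1} = y_t - \frac{1}{\beta} \nabla f(y_t)$$
$$y_{t+1} = x_{t+1} + \frac{\sqrt{\beta} - \sqrt{\alpha}}{\sqrt{\beta} + \sqrt{\alpha}} (x_{t+1} - x_t)$$
Convergence rate: $O\left(\left(1 - \frac{1}{\sqrt{\kappa}}\right)^t\right)$
- Convergence rate is optimal
- Pure algebraic trick; no intuition behind the method.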
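The two-line AGD update can be sketched as follows, assuming a hypothetical diagonal quadratic with $\kappa = 100$ as the test function. The names `LAM`, `SHIFT`, and `nesterov_agd` are illustrative only; the update itself follows the formulas on the slide.

```python
import math

# Hypothetical test problem: f(x) = 0.5 * sum_i LAM[i] * (x[i] - SHIFT[i])**2.
LAM = [1.0, 4.0, 100.0]
SHIFT = [1.0, -2.0, 0.5]          # minimizer x*
ALPHA, BETA = min(LAM), max(LAM)


def grad(x):
    """Gradient of the test quadratic."""
    return [lam * (xi - si) for lam, xi, si in zip(LAM, x, SHIFT)]


def dist_sq(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def nesterov_agd(x0, steps):
    """Accelerated gradient descent with the constant momentum from the slide."""
    momentum = (math.sqrt(BETA) - math.sqrt(ALPHA)) / (math.sqrt(BETA) + math.sqrt(ALPHA))
    x = list(x0)
    y = list(x0)                  # y_0 = x_0
    for _ in range(steps):
        # Gradient step from the extrapolated point y_t.
        x_next = [yi - gi / BETA for yi, gi in zip(y, grad(y))]
        # Extrapolate along the direction of the last move.
        y = [xn + momentum * (xn - xo) for xn, xo in zip(x_next, x)]
        x = x_next
    return x


x_agd = nesterov_agd([0.0, 0.0, 0.0], steps=200)
error_agd = dist_sq(x_agd, SHIFT)
```

With $\kappa = 100$ the contraction factor per step is governed by $1 - 1/\sqrt{\kappa} = 0.9$ rather than $1 - 1/\kappa = 0.99$, which is why 200 steps are enough for a very small `error_agd` here.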
Attempts to interpret Nesterov AGD
[Allen-Zhu and Orecchia, 2014] view AGD as a linear coupling of Gradient Descent and Mirror Descent
[Su, Boyd and Candes, 2015] view AGD as the discretization of a certain second-order ODE
[Bubeck, Lee, Singh, 2015] A new algorithm with the same theoretical guarantees as AGD and a geometric interpretation.
Notation
f (x) is β-smooth and α-strongly convex
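Denote $\kappa = \frac{\beta}{\alpha}$
$$x^+ = x - \frac{1}{\beta} \nabla f(x), \qquad x^{++} = x - \frac{1}{\alpha} \nabla f(x)$$
The ball with center $x$ and radius $r$:
$$B(x, r^2) = \{ y \in \mathbb{R}^n : \|y - x\|^2 \le r^2 \}$$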
Preliminaries
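Theorem. For an α-strongly convex function $f(x)$ and $\forall x \in \mathbb{R}^n$:
$$x^* \in B\left(x^{++}, \frac{\|\nabla f(x)\|^2}{\alpha^2} - \frac{2}{\alpha}\left(f(x) - f(x^*)\right)\right)$$
It is equivalent to show that:
$$\|x^* - x^{++}\|^2 \le \frac{\|\nabla f(x)\|^2}{\alpha^2} - \frac{2}{\alpha}\left(f(x) - f(x^*)\right)$$
Recall the definition $x^{++} = x - \frac{1}{\alpha} \nabla f(x)$:
$$\|x^* - x^{++}\|^2 = \|x - x^*\|^2 + \frac{2}{\alpha} \nabla f(x)^T (x^* - x) + \frac{1}{\alpha^2} \|\nabla f(x)\|^2$$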
Preliminaries
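$$\|x^* - x^{++}\|^2 = \|x - x^*\|^2 + \frac{2}{\alpha} \nabla f(x)^T (x^* - x) + \frac{1}{\alpha^2} \|\nabla f(x)\|^2$$
By definition of strong convexity:
$$\|x^* - x\|^2 + \frac{2}{\alpha} \nabla f(x)^T (x^* - x) \le -\frac{2}{\alpha}\left(f(x) - f(x^*)\right)$$
Substituting into the previous equation for the ball:
$$\|x^* - x^{++}\|^2 \le \frac{1}{\alpha^2} \|\nabla f(x)\|^2 - \frac{2}{\alpha}\left(f(x) - f(x^*)\right)$$
Recall that $x^*$ is a minimum and $f(x) \ge f(x^*)$:
$$\|x^* - x^{++}\|^2 \le \frac{1}{\alpha^2} \|\nabla f(x)\|^2$$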
Interpretation
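(recall $x^{++} = x - \frac{1}{\alpha} \nabla f(x)$)
$$\|x^{++} - x^*\|^2 \le \|x^{++} - x\|^2 = \frac{1}{\alpha^2} \|\nabla f(x)\|^2$$
[Figure: $f$ with the bounds $f^{\beta}_{\mathrm{sm}}$ and $f^{\alpha}_{\mathrm{conv}}$ built at $x$; the points $x^+$, $x^{++}$ and the minimizer $x^*$ are marked.]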
Preliminaries
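Theorem. For an α-strongly convex and β-smooth function $f(x)$ and $\forall x \in \mathbb{R}^n$:
$$x^* \in B\left(x^{++}, \frac{\|\nabla f(x)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\left(f(x^+) - f(x^*)\right)\right)$$
From the strong convexity:
$$x^* \in B\left(x^{++}, \frac{\|\nabla f(x)\|^2}{\alpha^2} - \frac{2}{\alpha}\left(f(x) - f(x^*)\right)\right)$$
Using the definition of smoothness (recall $x^+ = x - \frac{1}{\beta} \nabla f(x)$):
$$f(x^+) \le f(x) + \nabla f(x)^T (x^+ - x) + \frac{\beta}{2} \|x^+ - x\|^2 = f(x) - \frac{1}{2\beta} \|\nabla f(x)\|^2$$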
Preliminaries
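From smoothness:
$$f(x) \ge f(x^+) + \frac{1}{2\beta} \|\nabla f(x)\|^2$$
Substituting into $x^* \in B\left(x^{++}, \frac{\|\nabla f(x)\|^2}{\alpha^2} - \frac{2}{\alpha}\left(f(x) - f(x^*)\right)\right)$:
$$x^* \in B\left(x^{++}, \frac{\|\nabla f(x)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\left(f(x^+) - f(x^*)\right)\right)$$
Note that $f(x^+) \ge f(x^*)$ because $x^*$ is a minimum of $f(x)$:
$$x^* \in B\left(x^{++}, \frac{\|\nabla f(x)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right)\right)$$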
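As a sanity check, the two ball guarantees (strong convexity alone, then strong convexity plus smoothness) can be evaluated numerically. The sketch below assumes a hypothetical diagonal quadratic whose minimizer is known in closed form; `LAM`, `SHIFT`, and the query point `x` are illustrative choices, not values from the slides.

```python
# Hypothetical test problem: f(x) = 0.5 * sum_i LAM[i] * (x[i] - SHIFT[i])**2.
LAM = [1.0, 4.0, 10.0]
SHIFT = [1.0, -2.0, 0.5]          # minimizer x*
ALPHA, BETA = min(LAM), max(LAM)
KAPPA = BETA / ALPHA


def f(x):
    """Value of the test quadratic."""
    return 0.5 * sum(lam * (xi - si) ** 2 for lam, xi, si in zip(LAM, x, SHIFT))


def grad(x):
    """Gradient of the test quadratic."""
    return [lam * (xi - si) for lam, xi, si in zip(LAM, x, SHIFT)]


def dist_sq(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


x = [3.0, -1.0, 2.0]              # arbitrary query point
g = grad(x)
g_sq = sum(gi * gi for gi in g)
x_plus = [xi - gi / BETA for xi, gi in zip(x, g)]    # x^+
x_pp = [xi - gi / ALPHA for xi, gi in zip(x, g)]     # x^{++}
f_star = f(SHIFT)

# Left-hand side shared by both theorems: ||x* - x^{++}||^2.
lhs = dist_sq(SHIFT, x_pp)

# Squared radius from strong convexity alone.
radius_sq_convex = g_sq / ALPHA ** 2 - (2.0 / ALPHA) * (f(x) - f_star)

# Squared radius once smoothness is used as well.
radius_sq_smooth = (g_sq / ALPHA ** 2) * (1.0 - 1.0 / KAPPA) \
    - (2.0 / ALPHA) * (f(x_plus) - f_star)

# One-step decrease implied by smoothness.
decrease_holds = f(x_plus) <= f(x) - g_sq / (2.0 * BETA) + 1e-12
```

Both `radius_sq_convex` and `radius_sq_smooth` are at least `lhs` on this example, so $x^*$ lies in both balls.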
Interpretation
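Introduce $\tilde{x}$ on $f^{\alpha}_{\mathrm{conv}}$ such that $f(\tilde{x}) = f(x^+) \ge f(x^*)$:
$$\|x^{++} - x^*\| \le \|x^{++} - \tilde{x}\|$$
How to find $\tilde{x}$?
$$\tilde{x} = x - t \nabla f(x)$$
[Figure: $f$ with the bounds $f^{\beta}_{\mathrm{sm}}$ and $f^{\alpha}_{\mathrm{conv}}$ built at $x$; the points $x^+$, $x^{++}$, $\tilde{x}$ and the minimizer $x^*$ are marked.]
Our requirement on $\tilde{x}$:
$$f^{\alpha}_{\mathrm{conv}}(\tilde{x}) = f(x) - t \|\nabla f(x)\|^2 + \frac{t^2 \alpha}{2} \|\nabla f(x)\|^2 = f^{\beta}_{\mathrm{sm}}(x^+) = f(x) - \frac{1}{2\beta} \|\nabla f(x)\|^2$$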
Interpretation
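Solving the quadratic equation:
$$\alpha t^2 - 2t + \frac{1}{\beta} = 0 \implies t = \frac{1}{\alpha}\left(1 - \sqrt{1 - \frac{1}{\kappa}}\right)$$
$$\|\tilde{x} - x^{++}\|^2 = \left\|(x - t \nabla f(x)) - \left(x - \frac{1}{\alpha} \nabla f(x)\right)\right\|^2 = \left\|\frac{1}{\alpha} \nabla f(x) - \frac{1}{\alpha}\left(1 - \sqrt{1 - \frac{1}{\kappa}}\right) \nabla f(x)\right\|^2 = \left(1 - \frac{1}{\kappa}\right) \frac{\|\nabla f(x)\|^2}{\alpha^2}$$
$$\|x^* - x^{++}\|^2 \le \|\tilde{x} - x^{++}\|^2 = \left(1 - \frac{1}{\kappa}\right) \frac{\|\nabla f(x)\|^2}{\alpha^2}$$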
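The root $t$ and the resulting squared distance can be checked with plain arithmetic. The values of `ALPHA`, `BETA`, and `GRAD_SQ` below are hypothetical numbers chosen only for illustration.

```python
import math

# Hypothetical constants for the check.
ALPHA, BETA = 2.0, 10.0
KAPPA = BETA / ALPHA
GRAD_SQ = 9.0                     # stands in for ||grad f(x)||^2

# Smaller root of alpha * t^2 - 2 t + 1 / beta = 0.
t = (1.0 - math.sqrt(1.0 - 1.0 / KAPPA)) / ALPHA
residual = ALPHA * t ** 2 - 2.0 * t + 1.0 / BETA

# x~ - x^{++} = (1/alpha - t) * grad f(x), so the squared norm scales by (1/alpha - t)^2.
dist_sq_tilde = (1.0 / ALPHA - t) ** 2 * GRAD_SQ
expected = (1.0 - 1.0 / KAPPA) * GRAD_SQ / ALPHA ** 2
```

Here `residual` vanishes up to rounding and `dist_sq_tilde` agrees with `expected`, matching the closed form $(1 - \frac{1}{\kappa}) \frac{\|\nabla f(x)\|^2}{\alpha^2}$.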
A suboptimal algorithm
We have just proven:
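$$x^* \in B_{\mathrm{sm-conv}}\left(x_0^{++}, \left(1 - \frac{1}{\kappa}\right) \frac{\|\nabla f(x_0)\|^2}{\alpha^2}\right)$$
Assume that we are given $x_0$ such that
$$x^* \in B_{\mathrm{guarantee}}(x_0, R_0^2)$$
We can intersect these two balls and get a ball with a smaller radius.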
A suboptimal algorithm
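[Figure: the balls $B_{\mathrm{guarantee}}$ (center $x_0$, radius $R_0$) and $B_{\mathrm{sm-conv}}$ (center $x_0^{++}$), with the points $x_0^+$, $x_1$, $x^*$ and the direction $-\nabla f(x_0)$.]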
A suboptimal algorithm
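From geometry:
$$x^* \in B_{\mathrm{intersect}}\left(x_1, \left(1 - \frac{1}{\kappa}\right) R_0^2\right)$$
where $x_1$ is the center of a new ball.
Repeat the procedure using $B\left(x_1, \left(1 - \frac{1}{\kappa}\right) R_0^2\right)$ as a guarantee.
We have reduced the squared radius of the original ball $B_{\mathrm{guarantee}}(x_0, R_0^2)$ by a factor of $\left(1 - \frac{1}{\kappa}\right)$.
After $t$ iterations:
$$\|x_t - x^*\|^2 \le \left(1 - \frac{1}{\kappa}\right)^t R_0^2$$
Same rate of convergence as gradient descent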
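The ball-intersection procedure can be sketched in code under a few assumptions. The test function is a hypothetical diagonal quadratic, and the first guarantee ball is taken to be $B_{\mathrm{sm-conv}}$ around $x_0^{++}$. The helper `enclosing_ball` uses a closed-form minimal enclosing ball derived from elementary geometry, which the slides do not spell out. All names are illustrative.

```python
# Hypothetical test problem: f(x) = 0.5 * sum_i LAM[i] * (x[i] - SHIFT[i])**2.
LAM = [1.0, 3.0, 8.0]
SHIFT = [1.0, -2.0, 0.5]          # minimizer x*, used only to check the result
ALPHA, BETA = min(LAM), max(LAM)
KAPPA = BETA / ALPHA


def grad(x):
    """Gradient of the test quadratic."""
    return [lam * (xi - si) for lam, xi, si in zip(LAM, x, SHIFT)]


def dist_sq(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def enclosing_ball(xa, ra_sq, xb, rb_sq):
    """Center and squared radius of the smallest ball containing
    B(xa, ra_sq) intersected with B(xb, rb_sq); assumes the balls intersect."""
    d_sq = dist_sq(xa, xb)
    if d_sq > 0.0 and d_sq >= abs(ra_sq - rb_sq):
        # The boundary spheres meet in a plane that crosses the segment
        # between the centers; the new center is where it crosses.
        w = (d_sq + ra_sq - rb_sq) / (2.0 * d_sq)
        center = [a + w * (b - a) for a, b in zip(xa, xb)]
        return center, ra_sq - w * w * d_sq
    # Otherwise the intersection contains a full central cross-section of
    # the smaller ball, so that ball is itself the minimal enclosing ball.
    if rb_sq < ra_sq:
        return list(xb), rb_sq
    return list(xa), ra_sq


def suboptimal_geometric_descent(x0, steps):
    """Repeatedly intersect the current guarantee ball with B_sm-conv."""
    g = grad(x0)
    g_sq = sum(gi * gi for gi in g)
    center = [xi - gi / ALPHA for xi, gi in zip(x0, g)]          # x_0^{++}
    r_sq = (1.0 - 1.0 / KAPPA) * g_sq / ALPHA ** 2
    centers, radii_sq = [center], [r_sq]
    for _ in range(steps):
        x = center
        g = grad(x)
        g_sq = sum(gi * gi for gi in g)
        x_pp = [xi - gi / ALPHA for xi, gi in zip(x, g)]         # x^{++}
        center, r_sq = enclosing_ball(
            x, r_sq, x_pp, (1.0 - 1.0 / KAPPA) * g_sq / ALPHA ** 2)
        centers.append(center)
        radii_sq.append(r_sq)
    return centers, radii_sq


centers, radii_sq = suboptimal_geometric_descent([0.0, 0.0, 0.0], steps=40)
```

On this example every ball in `centers`/`radii_sq` contains the minimizer, and each squared radius is at most $(1 - \frac{1}{\kappa})$ times the previous one.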
Acceleration
What if we had a better guarantee?
Assume that we are given $x_0$ such that
$$x^* \in B_{\mathrm{guarantee}}\left(x_0, R_0^2 - \frac{2}{\alpha}\left(f(y) - f(x^*)\right)\right),$$
where $y$ is such that $f(x_0) \le f(y)$ (for example, $y := x_0$).
From the definition of smoothness and $f(x_0) \le f(y)$:
$$f(x_0^+) \le f(x_0) + \nabla f(x_0)^T (x_0^+ - x_0) + \frac{\beta}{2} \|x_0^+ - x_0\|^2 = f(x_0) - \frac{1}{2\beta} \|\nabla f(x_0)\|^2 \le f(y) - \frac{1}{2\beta} \|\nabla f(x_0)\|^2$$
Acceleration
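$$f(x_0^+) \le f(y) - \frac{1}{2\beta} \|\nabla f(x_0)\|^2$$
$$f(y) \ge f(x_0^+) + \frac{1}{2\beta} \|\nabla f(x_0)\|^2$$
Substituting $f(y)$ in $B_{\mathrm{guarantee}}\left(x_0, R_0^2 - \frac{2}{\alpha}\left(f(y) - f(x^*)\right)\right)$:
$$x^* \in B_{\mathrm{guarantee}}\left(x_0, R_0^2 - \frac{\|\nabla f(x_0)\|^2}{\alpha^2 \kappa} - \frac{2}{\alpha}\left(f(x_0^+) - f(x^*)\right)\right)$$
This ball is smaller than $B_{\mathrm{guarantee}}(x_0, R_0^2)$.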
Acceleration
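$$x^* \in B_{\mathrm{guarantee}}\left(x_0, R_0^2 - \frac{\|\nabla f(x_0)\|^2}{\alpha^2 \kappa} - \frac{2}{\alpha}\left(f(x_0^+) - f(x^*)\right)\right)$$
For the second ball, we use the same one as before:
$$x^* \in B_{\mathrm{sm-conv}}\left(x_0^{++}, \frac{\|\nabla f(x_0)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\left(f(x_0^+) - f(x^*)\right)\right)$$
We take the intersection of the two balls. Note that if one ball became smaller, the intersection is also smaller.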
Lemma: intersection of the two balls
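Denote:
$$\varepsilon = \frac{1}{\kappa}; \quad g^2 = \frac{\|\nabla f(x_0)\|^2}{\alpha^2}; \quad \delta = \frac{2}{\alpha}\left(f(x_0^+) - f(x^*)\right)$$
$$x^* \in B_{\mathrm{guarantee}}\left(x_0, R_0^2 - \varepsilon g^2 - \delta\right) \cap B_{\mathrm{sm-conv}}\left(x_0^{++}, g^2 (1 - \varepsilon) - \delta\right)$$
Theorem. Consider $g, \varepsilon$ such that $\varepsilon \in (0, 1)$, $g \ge 0$. There exists $c$ such that for any $\delta > 0$:
$$B(0, R - \varepsilon g^2 - \delta) \cap B(g, g^2 (1 - \varepsilon) - \delta) \subset B(c, R (1 - \sqrt{\varepsilon}) - \delta)$$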
Sketch of the proof
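[Figure: the two balls centered at $0$ and $g$, labeled $1 - \varepsilon g^2 - \delta$ and $g^2 (1 - \varepsilon) - \delta$, with the radius $r$ of their intersection and the distances $x$ and $g - x$ along the line between the centers.]
$$r^2 = 1 - \varepsilon g^2 - \delta - x^2 = g^2 (1 - \varepsilon) - \delta - (g - x)^2 \implies x = \frac{1}{2g}$$
$$r^2 = 1 - \varepsilon g^2 - \frac{1}{(2g)^2} - \delta$$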
Sketch of the proof
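We will use the AM-GM inequality to estimate $\varepsilon g^2 + \frac{1}{(2g)^2}$:
$$\frac{1}{2}\left(\varepsilon g^2 + \frac{1}{(2g)^2}\right) \ge \sqrt{\varepsilon g^2 \cdot \frac{1}{4 g^2}} = \frac{\sqrt{\varepsilon}}{2} \implies \varepsilon g^2 + \frac{1}{(2g)^2} \ge \sqrt{\varepsilon}$$
Therefore,
$$r^2 = 1 - \varepsilon g^2 - \frac{1}{(2g)^2} - \delta \le 1 - \sqrt{\varepsilon} - \delta$$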
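The AM-GM bound can be checked numerically for a few values of $g$. In the sketch below, `EPS`, `DELTA`, and `G_VALUES` are hypothetical sample values. The point `g_tight`, where $\varepsilon g^2 = \frac{1}{(2g)^2}$, is the equality case of AM-GM.

```python
import math


def intersection_radius_sq(eps, g, delta):
    """r^2 = 1 - eps * g^2 - 1 / (2g)^2 - delta, as in the proof sketch."""
    return 1.0 - eps * g * g - 1.0 / (2.0 * g) ** 2 - delta


# Hypothetical sample values for the check.
EPS = 0.25
DELTA = 0.1
G_VALUES = [0.6, 0.8, 1.0, 1.5, 2.0]

bound = 1.0 - math.sqrt(EPS) - DELTA
radii_sq = [intersection_radius_sq(EPS, g, DELTA) for g in G_VALUES]

# AM-GM is tight when eps * g^2 == 1 / (2g)^2, i.e. g^2 = 1 / (2 * sqrt(eps)).
g_tight = math.sqrt(1.0 / (2.0 * math.sqrt(EPS)))
r_sq_tight = intersection_radius_sq(EPS, g_tight, DELTA)
```

Every entry of `radii_sq` stays at or below `bound`, and `r_sq_tight` meets it, which shows that the factor $1 - \sqrt{\varepsilon}$ cannot be improved by this argument.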
Main result
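Using the theorem, we take the smallest ball that contains the intersection of the two:
$$x^* \in B_{\mathrm{intersect}}\left(x_1, R_0^2 \left(1 - \frac{1}{\sqrt{\kappa}}\right) - \frac{2}{\alpha}\left(f(x_0^+) - f(x^*)\right)\right)$$
The intersection ball is always shrinking:
$$R_t^2 \le \left(1 - \frac{1}{\sqrt{\kappa}}\right)^t R_0^2$$
$$\|x^* - x_t\|^2 \le \left(1 - \frac{1}{\sqrt{\kappa}}\right)^t R_0^2$$
Same convergence rate as in Nesterov AGD
Much faster than the $\left(1 - \frac{1}{\kappa}\right)^t$ rate in Gradient Descent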
Caveats
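How do we find such $x_0, R_0, y$ for the guarantee ball?
$$x^* \in B_{\mathrm{guarantee}}\left(x_0, R_0^2 - \frac{2}{\alpha}\left(f(y) - f(x^*)\right)\right),$$
where $y$ is such that $f(x_0) \le f(y)$
For the first iteration, we can use $B_{\mathrm{sm-conv}}$:
$$x^* \in B_{\mathrm{sm-conv}}\left(x_0^{++}, \frac{\|\nabla f(x_0)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\left(f(x_0^+) - f(x^*)\right)\right)$$
For other iterations we need a trick which we won't cover in this lecture.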
Algorithm
Geometric gradient descent
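1: Input: convexity parameter $\alpha$ and initial point $x_0$
2: $c_0 = x_0^{++}$; $R_0^2 = \frac{\|\nabla f(x_0)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right)$
3: for $t = 1, 2, \dots$ do
4: $x_t := c_{t-1}$
5: Ball 1: $B_{\mathrm{guarantee}}\left(x_t, R_{t-1}^2 - \frac{\|\nabla f(x_t)\|^2}{\alpha^2 \kappa}\right)$
6: Ball 2: $B_{\mathrm{sm-conv}}\left(x_t^{++}, \frac{\|\nabla f(x_t)\|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right)\right)$
7: Find $B(c_t, R_t^2)$, the minimum ball enclosing the intersection; $R_t^2 \le \left(1 - \frac{1}{\sqrt{\kappa}}\right) R_{t-1}^2$
8: end for
9: return $x_t$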
Summary
- Goal: minimize a β-smooth and α-strongly convex function
- We have provided a geometric interpretation of GD and Accelerated GD
- Gradient descent has a $\left(1 - \frac{1}{\kappa}\right)^t$ convergence rate
- Accelerated GD achieves a $\left(1 - \frac{1}{\sqrt{\kappa}}\right)^t$ convergence rate
Thank you for listening