Alternate Views of AGD
Big Data Optimization – IE 598
Hongyi Li and Samantha Thrush
29 Nov. 2016
Overview
Introduction/Overview of AGD
The Geometrical Interpretation (Bubeck, Lee, and Singh, 2015)
The Game Theory Perspective (Lan and Zhou, 2015)
The ODE Viewpoint (Su, Boyd, and Candes, 2015)
Questions
Original AGD
What you know and love: works on $L$-smooth and $\mu$-strongly convex functions $f$
$L$-smooth convergence rate: $O\!\left(\frac{L D^2}{t^2}\right)$
$L$-smooth and $\mu$-strongly convex convergence: $O\!\left(\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{2t}\right)$
Created in 1983 by Yurii Nesterov for smooth functions; has since been generalized to non-smooth and composite problems
A pure “algebraic trick”: no true physical intuition behind the original method
Optimal convergence rates
AGD Scheme
A two-step process: let $x_0 = y_0$, and for $t = 0, 1, 2, 3, \dots$ define $\alpha_{t+1}$ by
$\alpha_{t+1}^2 = (1 - \alpha_{t+1})\,\alpha_t^2 + \frac{\mu}{L}\,\alpha_{t+1}$
Then update $x$ and $y$ in the following fashion:
$x_{t+1} = y_t - \frac{1}{L}\,\nabla f(y_t)$
$y_{t+1} = x_{t+1} + \beta_t\,(x_{t+1} - x_t)$, where
$\beta_t = \frac{\alpha_t (1 - \alpha_t)}{\alpha_t^2 + \alpha_{t+1}}$
For the '83 Nesterov scheme, the momentum coefficient was constant instead:
$\beta_t = \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}$
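The two-step scheme above can be sketched in a few lines of Python. This is a minimal illustration, assuming only access to $\nabla f$ and the constants $L$ and $\mu$; the quadratic test at the end is made up for demonstration:

```python
import numpy as np

def agd(grad_f, x0, L, mu, iters=500):
    """Sketch of the two-step AGD scheme: a gradient step from y,
    then a momentum extrapolation with beta_t from the alpha recursion."""
    x = y = np.asarray(x0, dtype=float)
    q = mu / L
    alpha = np.sqrt(q) if mu > 0 else 0.5  # any alpha_0 in (0, 1) works
    for _ in range(iters):
        # alpha_{t+1}^2 = (1 - alpha_{t+1}) alpha_t^2 + q alpha_{t+1}
        # -> positive root of a quadratic in alpha_{t+1}
        b = q - alpha**2
        alpha_next = (b + np.sqrt(b**2 + 4 * alpha**2)) / 2
        beta = alpha * (1 - alpha) / (alpha**2 + alpha_next)
        x_next = y - grad_f(y) / L        # x_{t+1} = y_t - (1/L) grad f(y_t)
        y = x_next + beta * (x_next - x)  # y_{t+1} = x_{t+1} + beta_t (x_{t+1} - x_t)
        x, alpha = x_next, alpha_next
    return x
```

On an ill-conditioned quadratic this reaches the minimizer far faster than a plain gradient step with the same $1/L$ step size.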
The Geometrical Interpretation
From smoothness and strong convexity, we have:
$x^+ = x - \frac{1}{L}\,\nabla f(x)$
$x^{++} = x - \frac{1}{\mu}\,\nabla f(x)$
$\|x^* - x^{++}\|^2 \le \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x)\|^2}{\mu^2} - \frac{2}{\mu}\big(f(x^+) - f^*\big)$
The Geometrical Interpretation
$R^2 = \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x)\|^2}{\mu^2} - \frac{2}{\mu}\big(f(x^+) - f^*\big), \qquad \kappa = \frac{L}{\mu}$
It looks like this:
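The same bound can also be sanity-checked numerically. The sketch below is a hypothetical check (random strongly convex quadratic, assumed constants, not from the slides) verifying that $x^*$ indeed lies in $B(x^{++}, R^2)$, where the second argument of $B$ is a squared radius:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
# random quadratic f(x) = 0.5 x^T H x with mu*I <= H <= L*I
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag([1.0, 2.0, 5.0, 10.0]) @ Q.T   # mu = 1, L = 10
mu, L, kappa = 1.0, 10.0, 10.0
x_star = np.zeros(d)                           # minimizer, f* = 0

f = lambda v: 0.5 * v @ H @ v
grad = lambda v: H @ v

x = rng.standard_normal(d)                     # arbitrary query point
x_plus = x - grad(x) / L                       # short (gradient) step
x_pp = x - grad(x) / mu                        # long step x^{++}
g2 = grad(x) @ grad(x)
R2 = (1 - 1 / kappa) * g2 / mu**2 - (2 / mu) * (f(x_plus) - f(x_star))
# x* must lie in the ball B(x^{++}, R^2):
assert np.sum((x_star - x_pp) ** 2) <= R2 + 1e-12
```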
The Geometrical Interpretation
$B(0, 1) \cap B\!\left(g,\ 1 - \varepsilon\|g\|^2\right) \subseteq B(x, 1 - \varepsilon)$ for some $x$ (throughout, the second argument of $B$ is the squared radius)
The Geometrical Interpretation
$x^* \in B(x_0, R_0^2)$
$x^* \in B\!\left(x_0^{++},\ \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x_0)\|^2}{\mu^2}\right)$
Let $x_1$ be the $x$ in the previous graph, then repeat:
$\|x_k - x^*\|^2 \le \left(1 - \frac{1}{\kappa}\right)^k R_0^2$, weaker than the $\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{2k}$ convergence
The Geometrical Interpretation
$B\!\left(0,\ 1 - \varepsilon g^2 - \delta\right) \cap B\!\left(a,\ 1 - \varepsilon g^2 - \delta\right) \subseteq B(x, 1 - \varepsilon - \delta)$ for some $x$, provided $\|a\| \ge g$
The Geometrical Interpretation
Intuition: $\|x^* - x^{++}\|^2 \le \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x)\|^2}{\mu^2} - \frac{2}{\mu}\big(f(x^+) - f^*\big)$; setting $\delta = \frac{2}{\mu}\big(f(x^+) - f^*\big)$, the $\sqrt{\kappa}$ rate can be achieved.
The Geometrical Interpretation
$x_{k+1} = \text{line search}(c_k, x_k)$ (where $c_k$ is the current enclosing-ball center)
$x^* \in B\!\left(c_k,\ R_k^2 - \frac{\|\nabla f(x_{k+1})\|^2}{\mu^2 \kappa} - \delta\right)$
$x^* \in B\!\left(x_{k+1}^{++},\ \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x_{k+1})\|^2}{\mu^2} - \delta\right)$
Let $c_{k+1}$, $R_{k+1}^2$ be updated so that $x^* \in B(c_{k+1}, R_{k+1}^2)$; then
$R_k^2 \le \left(1 - \frac{1}{\sqrt{\kappa}}\right)^k R_0^2 \lesssim O\!\left(\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{k}\right)$ for large $\kappa$; still weaker than the $\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{2k}$ convergence
The Geometrical Interpretation
Fact of big and small balls:
The Geometrical Interpretation
$x^+ = \text{line search}(x,\ x - \nabla f(x))$
$x^* \in B\!\left(c_k,\ R_k^2 - \frac{2}{\mu}\big(f(x_k^+) - f(x_{k+1}^+)\big)\right)$, $c_k$ the current ball center
$x^* \in B\!\left(x_{k+1}^{++},\ \frac{\|\nabla f(x_{k+1})\|^2}{\mu^2} - \frac{2}{\mu}\big(f(x_{k+1}) - f(x_{k+1}^+)\big)\right)$
At least as fast as the previous scheme. All of these methods need (a lower bound on) $\mu$!
The Geometrical Interpretation
Example: Logistic Regression (L2)
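A miniature version of such an experiment can be sketched as follows. The synthetic data and constants are assumptions (not the slides' setup): plain gradient descent versus constant-momentum AGD with the '83 coefficient $\beta = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}$ on L2-regularized logistic regression:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
A = rng.standard_normal((n, d))
y = np.sign(A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

def loss(w):  # average logistic loss + L2 penalty
    return np.mean(np.log1p(np.exp(-y * (A @ w)))) + 0.5 * lam * w @ w

def grad(w):
    p = 1.0 / (1.0 + np.exp(y * (A @ w)))      # sigmoid(-y_i a_i^T w)
    return -(A.T @ (y * p)) / n + lam * w

L = np.linalg.norm(A, 2) ** 2 / (4 * n) + lam  # smoothness constant
mu = lam                                       # strong convexity from the L2 term
beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))

w_gd = np.zeros(d)
w_agd = yv = np.zeros(d)
for _ in range(300):
    w_gd = w_gd - grad(w_gd) / L               # plain gradient descent
    w_new = yv - grad(yv) / L                  # AGD gradient step from y_t
    yv = w_new + beta * (w_new - w_agd)        # constant-momentum extrapolation
    w_agd = w_new
```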
The Game Theory Perspective
The Game Theory Perspective (GTP)
“An optimal randomized incremental gradient method” - Lan and Zhou (2015)
Two different “games” and two different techniques
A different subset of gradient descent methods: Accelerated Gradient Descent is a special case of the primal-dual gradient method
GTP- Inspiration
Started off with a scheme not unlike a buyer/seller arrangement
Had two updated variables, primal and dual, and searched for the optimal solution of a saddle point problem; both needed to converge to the same value
Two different quantities being updated that depend on each other, not unlike a two-player game
The buyer/seller scheme can be thought of in game-theory terms: buyer and seller both want to “win” (get/keep the most money), so they look for the best order quantity and product price
GTP – Game 1 (Primal-Dual Gradient)
Two players: Buyer (primal) and Seller (dual)
The goal of the game is to come to an equilibrium of cost and price (neither wants to be taken advantage of, but both want a fair price/profit)
What they know: their local cost (buyer: $h(x) + \mu\omega(x)$; seller: $J_f(g)$) and the interactive cost (buyer: revenue; seller: $\langle x, g \rangle$), but not their partner's local cost
Three different steps, taken iteratively.
GTP – Game 1, Steps
Saddle point problem
Seller predicts demand $\bar{x}_t$ by looking two steps into the past
Seller determines the price to best bring home the Benjamins: essentially maximizing $\langle \bar{x}_t, g \rangle - J_f(g)$, regularized by the dual prox-function $D_f(g_{t-1}, g)$ with weight $\tau_t \ge 0$
Buyer then finds the best number of objects to buy in order to minimize cost: essentially minimizing $h(x) + \mu\omega(x) + \langle x, g \rangle$, regularized by the primal prox-function $P(x_{t-1}, x)$ with weight $\eta_t \ge 0$
This repeats iteratively until an equilibrium (best order quantity and product price) is found
If the price is too high, the buyer won't order much; if it is too low, orders will be huge and the seller gets taken advantage of.
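The three steps can be sketched on a toy scalar saddle problem. This is a Chambolle–Pock-style primal-dual iteration, a simplification and not Lan and Zhou's exact scheme: take the seller cost $J_f(g) = g^2/2$ (so the induced $f(x) = x^2/2$) and a buyer cost $h(x) = (x-1)^2/2$, whose saddle point is $x^* = g^* = 1/2$. The "look two steps into the past" prediction is the extrapolation $\bar{x}_t = 2x_{t-1} - x_{t-2}$:

```python
# Toy primal-dual ("buyer/seller") iteration for
#   min_x max_g  <x, g> - J_f(g) + h(x),
# with J_f(g) = g^2/2 and h(x) = (x - 1)^2/2. Saddle point: x* = g* = 1/2.
tau, eta = 0.9, 0.9            # dual / primal prox weights (tau * eta <= 1)
x_prev, x, g = 0.0, 0.0, 0.0
for _ in range(2000):
    x_bar = 2 * x - x_prev                 # seller predicts demand from the past
    g = (g + tau * x_bar) / (1 + tau)      # seller sets price: prox step on J_f
    x_prev = x
    x = (x - eta * g + eta) / (1 + eta)    # buyer picks quantity: prox step on h
print(x, g)  # both approach 0.5
```

Both closed-form updates are exact prox steps for the quadratic costs; with any other $J_f$ and $h$ the same structure applies with their prox operators.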
GTP – Game 2 (Randomized Primal-Dual Gradient)
More than 2 players: 1 buyer and $m$ suppliers
The goal of the game is to come to an equilibrium of cost and price (neither wants to be taken advantage of, but both want a fair price/profit)
What they know: their local cost (buyer: $h(x) + \mu\omega(x)$; sellers: $J_{f_i}(g_i)$) and the interactive cost (buyer: revenue; suppliers: $\langle x, g_i \rangle$), but not their partner's local cost
But! The buyer must purchase the same amount of product from each supplier
And a random supplier can change its price during each iteration according to the predicted demand
GTP – Game 2 (Randomized Primal-Dual Gradient) (cont.)
Four different steps, taken iteratively
Solves a saddle point problem (again)
Randomization lets you reduce the total number of gradient evaluations needed (at the cost of more primal prox-mappings)
Must make sure the $f_i$'s are differentiable over $\mathbb{R}^n$ (or else the method might fail)
GTP – Game 2, Steps
Sellers predict demand $\bar{x}_t$ by looking two steps into the past
All sellers determine prices to best bring home the Benjamins
Then one random seller additionally changes their price, so that they have a small advantage (price-wise)
The buyer then finds the best number of products to buy from all sellers in each iteration so as to minimize cost
This repeats iteratively until an equilibrium (ideal cost and ideal amount to buy) is found
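A minimal randomized sketch of the one-seller-at-a-time idea (a SAG-flavored simplification under assumed quadratic costs, not Lan and Zhou's exact algorithm): $m$ sellers with costs $f_i(x) = (x - b_i)^2/2$, where at each round one randomly chosen seller refreshes its price $g_i$ to the marginal cost at current demand, and the buyer steps against the average price:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 5
b = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # seller cost parameters, f_i(x) = (x - b_i)^2 / 2
g = np.zeros(m)                            # each seller's current "price" (dual variable)
x = 0.0                                    # buyer's order quantity
eta = 0.05                                 # small buyer step for stability
for _ in range(20000):
    i = rng.integers(m)                    # one random seller updates its price...
    g[i] = x - b[i]                        # ...to the marginal cost at current demand
    x -= eta * g.mean()                    # buyer steps against the average price
# at equilibrium every g_i = x - b_i and mean(g) = 0, so x = mean(b) = 2
```

Only one of the $m$ "gradients" is refreshed per round, which is the point: the total number of gradient evaluations drops, while the buyer keeps prox-stepping every round.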
Differences between AGD and this method
AGD is a special case of the primal-dual gradient methods presented here:
“… the computation of the gradient at the extrapolation point of the accelerated gradient method is equivalent to a primal prediction step combined with a dual ascent step (employed with the aforementioned dual prox-function) in the PDG method.”- Lan and Zhou
Essentially, the primal-dual gradient method can be thought of as re-labeling some constants found in AGD; both methods update their variables in the same manner.
It can be considered a stand-in for AGD if you consider the more generalized algorithms (not Nesterov's originals).
Convergence rates are comparable to original AGD for the first method.
The ODE Viewpoint
The ODE Viewpoint – Inspiration & Precedent
There is a traditional link between ODEs and optimization (Helmke and Moore 1996, and others). Examples:
Linear regression via the linearized Bregman iteration algorithm (Osher 2014)
Control design via a continuous-time Nesterov-like algorithm (Dürr and Ebenbauer 2012)
Others
Numerically, this is done by taking extremely small steps in the optimization scheme, so that the iterates converge to a curve modeled by an ODE.
The ODE Viewpoint
Nesterov's accelerated gradient descent scheme:
$x_{k+1} = y_k - s\,\nabla f(y_k)$
$y_{k+1} = x_{k+1} + \frac{k}{k+3}\,(x_{k+1} - x_k)$
What if $s \to 0$?
Let $\Delta t = \sqrt{s}$, $t_k = k\,\Delta t$
The ODE Viewpoint
$X'' + \frac{3}{t}\,X' + \nabla f(X) = 0$
More generally,
$y_{k+1} = x_{k+1} + \frac{k}{k+\gamma}\,(x_{k+1} - x_k) \;\longrightarrow\; X'' + \frac{\gamma}{t}\,X' + \nabla f(X) = 0, \quad \gamma \ge 3$
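The limiting ODE can be integrated directly. Below is a semi-implicit Euler sketch (step size and test function are assumptions) for $X'' + \frac{\gamma}{t}X' + \nabla f(X) = 0$ starting from rest at $x_0$; for $\gamma = 3$ the objective gap along the exact trajectory is known to decay like $2\|x_0 - x^*\|^2 / t^2$:

```python
import numpy as np

def ode_path(grad_f, x0, gamma=3.0, dt=1e-3, T=10.0):
    """Integrate X'' + (gamma/t) X' + grad f(X) = 0 from rest at x0,
    treating the friction term implicitly for stability near t = 0."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    t = dt
    while t < T:
        v = (v - dt * grad_f(x)) / (1 + dt * gamma / t)  # implicit friction
        x = x + dt * v
        t += dt
    return x

H = np.diag([1.0, 10.0])
xT = ode_path(lambda z: H @ z, [1.0, 1.0], T=10.0)
f = lambda z: 0.5 * z @ H @ z
# continuous-time bound at T = 10: 2 * ||x0 - x*||^2 / T^2 = 0.04
```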
The ODE Viewpoint
Time Invariance:
under $t \to ct$, $Y(\tau) = X(c\tau)$ satisfies $Y'' + \frac{\gamma}{\tau}\,Y' + c^2\,\nabla f(Y) = 0$
Rotation Invariance: if $X(t)$ minimizes $f$, $Q$ is a rotation, and $g(x) = f(Qx)$, then $Q^{\top} X(t)$ minimizes $g$.
Initial Asymptotic:
for small $t$ (with $\gamma = 3$), $X(t) = x_0 - \frac{\nabla f(x_0)\,t^2}{8} + o(t^2)$
The ODE Viewpoint
$f$ $L$-smooth:
$f(X(t)) - f^* \le \frac{2\,\|x_0 - x^*\|^2}{t^2}$
$\int_0^{\infty} t\,\big(f(X(t)) - f^*\big)\,dt < \infty$
Energy function $\mathcal{E}(t)$:
$\mathcal{E}(t) = \frac{2t^2}{\gamma - 1}\big(f(X(t)) - f^*\big) + (\gamma - 1)\left\| X(t) + \frac{t}{\gamma - 1}\,X'(t) - x^* \right\|^2$
The ODE Viewpoint
Proximal gradient for $f = h + g$, with $g$ $L$-smooth and $h$ non-smooth. There is no ODE when $h$ is not smooth.
$f(x_k) - f^* \le \frac{C\,\|x_0 - x^*\|^2}{(k + \gamma - 2)^2}$, where $C$ depends on $\gamma > 3$ and $s < 1/L$.
$\sum_{k=1}^{\infty} (k + \gamma - 1)\big(f(x_k) - f^*\big) < \infty$
Discrete energy function $\mathcal{E}(k)$:
$\mathcal{E}(k) = \frac{2s\,(k + \gamma - 2)^2}{\gamma - 1}\big(f(x_k) - f^*\big) + (\gamma - 1)\,\|z_k - x^*\|^2, \qquad z_k = \frac{k + \gamma - 1}{\gamma - 1}\,y_k - \frac{k}{\gamma - 1}\,x_k$
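The discrete proximal scheme can be sketched as a generalized FISTA: a gradient step on the smooth part $g$, a prox step on $h$, and $\frac{k}{k+\gamma}$ momentum. The LASSO instance below uses made-up data for illustration, not the slides' experiment:

```python
import numpy as np

def acc_prox_grad(grad_g, prox_h, x0, s, gamma=3.0, iters=2000):
    """Accelerated proximal gradient: x_{k+1} = prox_{s h}(y_k - s grad g(y_k)),
    then y_{k+1} = x_{k+1} + k/(k+gamma) (x_{k+1} - x_k)."""
    x = y = np.asarray(x0, dtype=float)
    for k in range(iters):
        x_next = prox_h(y - s * grad_g(y), s)
        y = x_next + (k / (k + gamma)) * (x_next - x)
        x = x_next
    return x

# LASSO instance: g(x) = 0.5 ||Ax - b||^2, h(x) = lam ||x||_1
rng = np.random.default_rng(3)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1
grad_g = lambda x: A.T @ (A @ x - b)
soft = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - lam * s, 0.0)  # prox of s*lam*||.||_1
s = 1.0 / np.linalg.norm(A, 2) ** 2    # step size 1/L with L = ||A||_2^2
x_hat = acc_prox_grad(grad_g, soft, np.zeros(5), s)
```

At a minimizer the iterate is a fixed point of the prox-gradient map, which gives a convenient optimality check.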
The ODE Viewpoint
$f$ $L$-smooth and $\mu$-strongly convex ($\gamma \ge 4.5$):
$f(X(t)) - f^* \le \frac{C\,\|x_0 - x^*\|^2}{t^3}$, where $C$ depends on $\gamma$ and $\mu$.
Energy function $\mathcal{E}(t)$:
$\mathcal{E}(t) = t^3\big(f(X(t)) - f^*\big) + \frac{(2\gamma - 3)^2\,t}{8}\left\| X(t) + \frac{2t}{2\gamma - 3}\,X'(t) - x^* \right\|^2$
The ODE Viewpoint
For $\gamma = 3$, even for a quadratic function, the rate cannot be better than $O\!\left(\frac{1}{t^3}\right)$
($\Delta t = \sqrt{s}$, $s = 10^{-6}$; observe every 1000 steps)
The ODE Viewpoint
Old guess: for $f$ $L$-smooth and $\mu$-strongly convex,
$f(X(t)) - f^* \sim O\!\left(\frac{1}{t^{\gamma}}\right)$
New discovery (Journal of Machine Learning Research, published 9/16):
$f$ $L$-smooth and $\mu$-strongly convex: $f(X(t)) - f^* \lesssim O\!\left(\frac{1}{t^{2\gamma/3}}\right)$
$f = h + g$, with $g$ $L$-smooth and $\mu$-strongly convex: $f(X(t)) - f^* \lesssim O\!\left(\frac{1}{t^{3}}\right)$
The ODE Viewpoint
Restart when $\frac{d}{dt}\|X'(t)\|^2 \le 0$: keep $X(t)$ moving fast
$y_{k+1} = x_{k+1} + \frac{k}{k+3}\,(x_{k+1} - x_k)$; on restart, reset $k = 0$, so that $y_{k+1} = x_{k+1}$
$f$ $L$-smooth and $\mu$-strongly convex ($\gamma = 3$):
$f(X(t)) - f^* \le C\,\|x_0 - x^*\|^2 \exp(-C' t)$, where $C, C' \ge 0$ depend on $\mu$ and $L$.
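The speed-restart rule has a simple discrete analogue: reset the momentum counter whenever the step length $\|x_{k+1} - x_k\|$ stops growing. A sketch (test function and constants below are assumptions):

```python
import numpy as np

def agd_speed_restart(grad_f, x0, L, iters=500):
    """Nesterov (gamma = 3) with speed restart: when ||x_{k+1} - x_k||
    shrinks, the discrete analogue of d||X'(t)||^2/dt <= 0, reset the
    momentum counter k to 0, so that y_{k+1} = x_{k+1}."""
    x = y = np.asarray(x0, dtype=float)
    k, prev_step = 0, 0.0
    for _ in range(iters):
        x_next = y - grad_f(y) / L
        step = np.linalg.norm(x_next - x)
        if step < prev_step:
            k = 0                    # restart: kill the momentum
        y = x_next + (k / (k + 3)) * (x_next - x)
        x, prev_step, k = x_next, step, k + 1
    return x

H = np.diag([1.0, 10.0])
x_hat = agd_speed_restart(lambda z: H @ z, [1.0, 1.0], L=10.0)
```

Because restarting only ever removes momentum, the scheme is never slower than plain gradient descent, and with strong convexity it recovers the linear rate quoted above.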
The ODE Viewpoint
Example: LASSO ($f = h + g$)
Questions??