Alternate Views of AGD
Big Data Optimization – IE 598
Hongyi Li and Samantha Thrush
29 Nov. 2016
Overview
Introduction/Overview of AGD
The Geometrical Interpretation (Bubeck, Lee, and Singh, 2015)
The Game Theory Perspective (Lan and Zhou, 2015)
The ODE Viewpoint (Su, Boyd, and Candes, 2015)
Questions
Original AGD
What you know and love: works on $L$-smooth and $\mu$-strongly convex functions $f$
$L$-smooth convergence rate: $O\!\left(\frac{L D^2}{t^2}\right)$
$L$-smooth and $\mu$-strongly convex convergence: $O\!\left(\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{2t}\right)$
Created in 1983 by Yurii Nesterov for smooth functions; has since been generalized to non-smooth and composite problems
A pure “algebraic trick”: no true physical intuition behind the original method
Optimal convergence rates
AGD Scheme
A two-step process: let $x_0 = y_0$, and for $t = 0, 1, 2, 3, \dots$ define $\alpha_{t+1}$ by
$\alpha_{t+1}^2 = (1 - \alpha_{t+1})\,\alpha_t^2 + \frac{\mu}{L}\,\alpha_{t+1}$
Then update $x$ and $y$ in the following fashion:
$x_{t+1} = y_t - \frac{1}{L}\,\nabla f(y_t)$
$y_{t+1} = x_{t+1} + \beta_t\,(x_{t+1} - x_t)$, where
$\beta_t = \frac{\alpha_t (1 - \alpha_t)}{\alpha_t^2 + \alpha_{t+1}}$
For the '83 Nesterov scheme, the momentum coefficient was constant instead:
$\beta_t = \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}$
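The two-step scheme above can be sketched in a few lines of Python. This is a minimal illustration, assuming only access to $\nabla f$ and the constants $L$ and $\mu$; the quadratic test at the end is made up for demonstration:

```python
import numpy as np

def agd(grad_f, x0, L, mu, iters=500):
    """Sketch of the two-step AGD scheme: a gradient step from y,
    then a momentum extrapolation with beta_t from the alpha recursion."""
    x = y = np.asarray(x0, dtype=float)
    q = mu / L
    alpha = np.sqrt(q) if mu > 0 else 0.5  # any alpha_0 in (0, 1) works
    for _ in range(iters):
        # alpha_{t+1}^2 = (1 - alpha_{t+1}) alpha_t^2 + q alpha_{t+1}
        # -> positive root of a quadratic in alpha_{t+1}
        b = q - alpha**2
        alpha_next = (b + np.sqrt(b**2 + 4 * alpha**2)) / 2
        beta = alpha * (1 - alpha) / (alpha**2 + alpha_next)
        x_next = y - grad_f(y) / L        # x_{t+1} = y_t - (1/L) grad f(y_t)
        y = x_next + beta * (x_next - x)  # y_{t+1} = x_{t+1} + beta_t (x_{t+1} - x_t)
        x, alpha = x_next, alpha_next
    return x
```

On an ill-conditioned quadratic this reaches the minimizer far faster than a plain gradient step with the same $1/L$ step size.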
The Geometrical Interpretation
From smoothness and strong convexity, we have:
$x^+ = x - \frac{1}{L}\,\nabla f(x)$
$x^{++} = x - \frac{1}{\mu}\,\nabla f(x)$
$\|x^* - x^{++}\|^2 \le \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x)\|^2}{\mu^2} - \frac{2}{\mu}\big(f(x^+) - f^*\big)$
The Geometrical Interpretation
$R^2 = \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x)\|^2}{\mu^2} - \frac{2}{\mu}\big(f(x^+) - f^*\big), \qquad \kappa = \frac{L}{\mu}$
It looks like this:
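The same bound can also be sanity-checked numerically. The sketch below is a hypothetical check (random strongly convex quadratic, assumed constants, not from the slides) verifying that $x^*$ indeed lies in $B(x^{++}, R^2)$, where the second argument of $B$ is a squared radius:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
# random quadratic f(x) = 0.5 x^T H x with mu*I <= H <= L*I
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag([1.0, 2.0, 5.0, 10.0]) @ Q.T   # mu = 1, L = 10
mu, L, kappa = 1.0, 10.0, 10.0
x_star = np.zeros(d)                           # minimizer, f* = 0

f = lambda v: 0.5 * v @ H @ v
grad = lambda v: H @ v

x = rng.standard_normal(d)                     # arbitrary query point
x_plus = x - grad(x) / L                       # short (gradient) step
x_pp = x - grad(x) / mu                        # long step x^{++}
g2 = grad(x) @ grad(x)
R2 = (1 - 1 / kappa) * g2 / mu**2 - (2 / mu) * (f(x_plus) - f(x_star))
# x* must lie in the ball B(x^{++}, R^2):
assert np.sum((x_star - x_pp) ** 2) <= R2 + 1e-12
```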
The Geometrical Interpretation
$B(0, 1) \cap B\!\left(g,\ 1 - \varepsilon\|g\|^2\right) \subseteq B(x, 1 - \varepsilon)$ for some $x$ (throughout, the second argument of $B$ is the squared radius)
The Geometrical Interpretation
$x^* \in B(x_0, R_0^2)$
$x^* \in B\!\left(x_0^{++},\ \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x_0)\|^2}{\mu^2}\right)$
Let $x_1$ be the $x$ in the previous graph, then repeat:
$\|x_k - x^*\|^2 \le \left(1 - \frac{1}{\kappa}\right)^k R_0^2$, weaker than the $\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{2k}$ convergence
The Geometrical Interpretation
$B\!\left(0,\ 1 - \varepsilon g^2 - \delta\right) \cap B\!\left(a,\ 1 - \varepsilon g^2 - \delta\right) \subseteq B(x, 1 - \varepsilon - \delta)$ for some $x$, provided $\|a\| \ge g$
The Geometrical Interpretation
Intuition: $\|x^* - x^{++}\|^2 \le \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x)\|^2}{\mu^2} - \frac{2}{\mu}\big(f(x^+) - f^*\big)$; setting $\delta = \frac{2}{\mu}\big(f(x^+) - f^*\big)$, the $\sqrt{\kappa}$ rate can be achieved.
The Geometrical Interpretation
$x_{k+1} = \text{line search}(c_k, x_k)$ (where $c_k$ is the current enclosing-ball center)
$x^* \in B\!\left(c_k,\ R_k^2 - \frac{\|\nabla f(x_{k+1})\|^2}{\mu^2 \kappa} - \delta\right)$
$x^* \in B\!\left(x_{k+1}^{++},\ \left(1 - \frac{1}{\kappa}\right)\frac{\|\nabla f(x_{k+1})\|^2}{\mu^2} - \delta\right)$
Let $c_{k+1}$, $R_{k+1}^2$ be updated so that $x^* \in B(c_{k+1}, R_{k+1}^2)$; then
$R_k^2 \le \left(1 - \frac{1}{\sqrt{\kappa}}\right)^k R_0^2 \lesssim O\!\left(\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{k}\right)$ for large $\kappa$; still weaker than the $\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{2k}$ convergence
The Geometrical Interpretation
Fact of big and small balls:
The Geometrical Interpretation
$x^+ = \text{line search}(x,\ x - \nabla f(x))$
$x^* \in B\!\left(c_k,\ R_k^2 - \frac{2}{\mu}\big(f(x_k^+) - f(x_{k+1}^+)\big)\right)$, $c_k$ the current ball center
$x^* \in B\!\left(x_{k+1}^{++},\ \frac{\|\nabla f(x_{k+1})\|^2}{\mu^2} - \frac{2}{\mu}\big(f(x_{k+1}) - f(x_{k+1}^+)\big)\right)$
At least as fast as the previous scheme. All of these methods need (a lower bound on) $\mu$!
The Geometrical Interpretation
Example: Logistic Regression (L2)
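A miniature version of such an experiment can be sketched as follows. The synthetic data and constants are assumptions (not the slides' setup): plain gradient descent versus constant-momentum AGD with the '83 coefficient $\beta = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}$ on L2-regularized logistic regression:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
A = rng.standard_normal((n, d))
y = np.sign(A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

def loss(w):  # average logistic loss + L2 penalty
    return np.mean(np.log1p(np.exp(-y * (A @ w)))) + 0.5 * lam * w @ w

def grad(w):
    p = 1.0 / (1.0 + np.exp(y * (A @ w)))      # sigmoid(-y_i a_i^T w)
    return -(A.T @ (y * p)) / n + lam * w

L = np.linalg.norm(A, 2) ** 2 / (4 * n) + lam  # smoothness constant
mu = lam                                       # strong convexity from the L2 term
beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))

w_gd = np.zeros(d)
w_agd = yv = np.zeros(d)
for _ in range(300):
    w_gd = w_gd - grad(w_gd) / L               # plain gradient descent
    w_new = yv - grad(yv) / L                  # AGD gradient step from y_t
    yv = w_new + beta * (w_new - w_agd)        # constant-momentum extrapolation
    w_agd = w_new
```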
The Game Theory Perspective
The Game Theory Perspective (GTP)
“An optimal randomized incremental gradient method” - Lan and Zhou (2015)
Two different “games” and two different techniques
A different subset of gradient descent methods: Accelerated Gradient Descent is a special case of the primal-dual gradient method
GTP- Inspiration
Started off with a scheme not unlike a buyer/seller arrangement
Had two updated variables, primal and dual, and searched for the optimal solution of a saddle point problem; both needed to converge to the same value
Two different quantities being updated that depend on each other, not unlike a two-player game
The buyer/seller scheme can be thought of in game-theory terms: buyer and seller both want to “win” (get/keep the most money), so they look for the best order quantity and product price
GTP – Game 1 (Primal-Dual Gradient)
Two players: Buyer (primal) and Seller (dual)
The goal of the game is to come to an equilibrium of cost and price (neither wants to be taken advantage of, but both want a fair price/profit)
What they know: their local cost (buyer: $h(x) + \mu\omega(x)$; seller: $J_f(g)$) and the interactive cost (buyer: revenue; seller: $\langle x, g \rangle$), but not their partner's local cost
Three different steps, taken iteratively.
GTP – Game 1, Steps
Saddle point problem
Seller predicts demand $\bar{x}_t$ by looking two steps into the past
Seller determines the price to best bring home the Benjamins: essentially maximizing $\langle \bar{x}_t, g \rangle - J_f(g)$, regularized by the dual prox-function $D_f(g_{t-1}, g)$ with weight $\tau_t \ge 0$
Buyer then finds the best number of objects to buy in order to minimize cost: essentially minimizing $h(x) + \mu\omega(x) + \langle x, g \rangle$, regularized by the primal prox-function $P(x_{t-1}, x)$ with weight $\eta_t \ge 0$
This repeats iteratively until an equilibrium (best order quantity and product price) is found
If the price is too high, the buyer won't order much; if it is too low, orders will be huge and the seller gets taken advantage of.
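The three steps can be sketched on a toy scalar saddle problem. This is a Chambolle–Pock-style primal-dual iteration, a simplification and not Lan and Zhou's exact scheme: take the seller cost $J_f(g) = g^2/2$ (so the induced $f(x) = x^2/2$) and a buyer cost $h(x) = (x-1)^2/2$, whose saddle point is $x^* = g^* = 1/2$. The "look two steps into the past" prediction is the extrapolation $\bar{x}_t = 2x_{t-1} - x_{t-2}$:

```python
# Toy primal-dual ("buyer/seller") iteration for
#   min_x max_g  <x, g> - J_f(g) + h(x),
# with J_f(g) = g^2/2 and h(x) = (x - 1)^2/2. Saddle point: x* = g* = 1/2.
tau, eta = 0.9, 0.9            # dual / primal prox weights (tau * eta <= 1)
x_prev, x, g = 0.0, 0.0, 0.0
for _ in range(2000):
    x_bar = 2 * x - x_prev                 # seller predicts demand from the past
    g = (g + tau * x_bar) / (1 + tau)      # seller sets price: prox step on J_f
    x_prev = x
    x = (x - eta * g + eta) / (1 + eta)    # buyer picks quantity: prox step on h
print(x, g)  # both approach 0.5
```

Both closed-form updates are exact prox steps for the quadratic costs; with any other $J_f$ and $h$ the same structure applies with their prox operators.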
GTP – Game 2 (Randomized Primal-Dual Gradient)
More than 2 players: 1 buyer and $m$ suppliers
The goal of the game is to come to an equilibrium of cost and price (neither wants to be taken advantage of, but both want a fair price/profit)
What they know: their local cost (buyer: $h(x) + \mu\omega(x)$; sellers: $J_{f_i}(g_i)$) and the interactive cost (buyer: revenue; suppliers: $\langle x, g_i \rangle$), but not their partner's local cost
But! The buyer must purchase the same amount of product from each supplier
And a random supplier can change its price during each iteration according to the predicted demand
GTP – Game 2 (Randomized Primal-Dual Gradient) (cont.)
Four different steps, taken iteratively
Solves a saddle point problem (again)
Randomization lets you reduce the total number of gradient evaluations needed (at the cost of more primal prox-mappings)
Must make sure the $f_i$'s are differentiable over $\mathbb{R}^n$ (or else the method might fail)
GTP – Game 2, Steps
Sellers predict demand $\bar{x}_t$ by looking two steps into the past
All sellers determine prices to best bring home the Benjamins
Then one random seller additionally changes their price, so that they have a small advantage (price-wise)
The buyer then finds the best number of products to buy from all sellers in each iteration so as to minimize cost
This repeats iteratively until an equilibrium (ideal cost and ideal amount to buy) is found
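A minimal randomized sketch of the one-seller-at-a-time idea (a SAG-flavored simplification under assumed quadratic costs, not Lan and Zhou's exact algorithm): $m$ sellers with costs $f_i(x) = (x - b_i)^2/2$, where at each round one randomly chosen seller refreshes its price $g_i$ to the marginal cost at current demand, and the buyer steps against the average price:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 5
b = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # seller cost parameters, f_i(x) = (x - b_i)^2 / 2
g = np.zeros(m)                            # each seller's current "price" (dual variable)
x = 0.0                                    # buyer's order quantity
eta = 0.05                                 # small buyer step for stability
for _ in range(20000):
    i = rng.integers(m)                    # one random seller updates its price...
    g[i] = x - b[i]                        # ...to the marginal cost at current demand
    x -= eta * g.mean()                    # buyer steps against the average price
# at equilibrium every g_i = x - b_i and mean(g) = 0, so x = mean(b) = 2
```

Only one of the $m$ "gradients" is refreshed per round, which is the point: the total number of gradient evaluations drops, while the buyer keeps prox-stepping every round.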
Differences between AGD and this method
AGD is a special case of the primal-dual gradient methods presented here:
“… the computation of the gradient at the extrapolation point of the accelerated gradient method is equivalent to a primal prediction step combined with a dual ascent step (employed with the aforementioned dual prox-function) in the PDG method.”- Lan and Zhou
Essentially, the primal-dual gradient method can be thought of as re-labeling some constants found in AGD; both methods update their variables in the same manner.
It can be considered a stand-in for AGD if you consider the more generalized algorithms (not Nesterov's originals).
Convergence rates are comparable to original AGD for the first method.
The ODE Viewpoint
The ODE Viewpoint – Inspiration & Precedent
There is a traditional link between ODEs and optimization (Helmke and Moore 1996, and others). Examples:
Linear regression via the linearized Bregman iteration algorithm (Osher 2014)
Control design via a continuous-time Nesterov-like algorithm (Dürr and Ebenbauer 2012)
Others
Numerically, this is done by taking extremely small steps in the optimization scheme, so that the iterates converge to a curve modeled by an ODE.
The ODE Viewpoint
Nesterov's accelerated gradient descent scheme:
$x_{k+1} = y_k - s\,\nabla f(y_k)$
$y_{k+1} = x_{k+1} + \frac{k}{k+3}\,(x_{k+1} - x_k)$
What if $s \to 0$?
Let $\Delta t = \sqrt{s}$, $t_k = k\,\Delta t$
The ODE Viewpoint
$X'' + \frac{3}{t}\,X' + \nabla f(X) = 0$
More generally,
$y_{k+1} = x_{k+1} + \frac{k}{k+\gamma}\,(x_{k+1} - x_k) \;\longrightarrow\; X'' + \frac{\gamma}{t}\,X' + \nabla f(X) = 0, \quad \gamma \ge 3$
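The limiting ODE can be integrated directly. Below is a semi-implicit Euler sketch (step size and test function are assumptions) for $X'' + \frac{\gamma}{t}X' + \nabla f(X) = 0$ starting from rest at $x_0$; for $\gamma = 3$ the objective gap along the exact trajectory is known to decay like $2\|x_0 - x^*\|^2 / t^2$:

```python
import numpy as np

def ode_path(grad_f, x0, gamma=3.0, dt=1e-3, T=10.0):
    """Integrate X'' + (gamma/t) X' + grad f(X) = 0 from rest at x0,
    treating the friction term implicitly for stability near t = 0."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    t = dt
    while t < T:
        v = (v - dt * grad_f(x)) / (1 + dt * gamma / t)  # implicit friction
        x = x + dt * v
        t += dt
    return x

H = np.diag([1.0, 10.0])
xT = ode_path(lambda z: H @ z, [1.0, 1.0], T=10.0)
f = lambda z: 0.5 * z @ H @ z
# continuous-time bound at T = 10: 2 * ||x0 - x*||^2 / T^2 = 0.04
```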
The ODE Viewpoint
Time Invariance:
under $t \to ct$, $Y(\tau) = X(c\tau)$ satisfies $Y'' + \frac{\gamma}{\tau}\,Y' + c^2\,\nabla f(Y) = 0$
Rotation Invariance: if $X(t)$ minimizes $f$, $Q$ is a rotation, and $g(x) = f(Qx)$, then $Q^{\top} X(t)$ minimizes $g$.
Initial Asymptotic:
for small $t$ (with $\gamma = 3$), $X(t) = x_0 - \frac{\nabla f(x_0)\,t^2}{8} + o(t^2)$
The ODE Viewpoint
$f$ $L$-smooth:
$f(X(t)) - f^* \le \frac{2\,\|x_0 - x^*\|^2}{t^2}$
$\int_0^{\infty} t\,\big(f(X(t)) - f^*\big)\,dt < \infty$
Energy function $\mathcal{E}(t)$:
$\mathcal{E}(t) = \frac{2t^2}{\gamma - 1}\big(f(X(t)) - f^*\big) + (\gamma - 1)\left\| X(t) + \frac{t}{\gamma - 1}\,X'(t) - x^* \right\|^2$
The ODE Viewpoint
Proximal gradient for $f = h + g$, with $g$ $L$-smooth and $h$ non-smooth. There is no ODE when $h$ is not smooth.
$f(x_k) - f^* \le \frac{C\,\|x_0 - x^*\|^2}{(k + \gamma - 2)^2}$, where $C$ depends on $\gamma > 3$ and $s < 1/L$.
$\sum_{k=1}^{\infty} (k + \gamma - 1)\big(f(x_k) - f^*\big) < \infty$
Discrete energy function $\mathcal{E}(k)$:
$\mathcal{E}(k) = \frac{2s\,(k + \gamma - 2)^2}{\gamma - 1}\big(f(x_k) - f^*\big) + (\gamma - 1)\,\|z_k - x^*\|^2, \qquad z_k = \frac{k + \gamma - 1}{\gamma - 1}\,y_k - \frac{k}{\gamma - 1}\,x_k$
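The discrete proximal scheme can be sketched as a generalized FISTA: a gradient step on the smooth part $g$, a prox step on $h$, and $\frac{k}{k+\gamma}$ momentum. The LASSO instance below uses made-up data for illustration, not the slides' experiment:

```python
import numpy as np

def acc_prox_grad(grad_g, prox_h, x0, s, gamma=3.0, iters=2000):
    """Accelerated proximal gradient: x_{k+1} = prox_{s h}(y_k - s grad g(y_k)),
    then y_{k+1} = x_{k+1} + k/(k+gamma) (x_{k+1} - x_k)."""
    x = y = np.asarray(x0, dtype=float)
    for k in range(iters):
        x_next = prox_h(y - s * grad_g(y), s)
        y = x_next + (k / (k + gamma)) * (x_next - x)
        x = x_next
    return x

# LASSO instance: g(x) = 0.5 ||Ax - b||^2, h(x) = lam ||x||_1
rng = np.random.default_rng(3)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1
grad_g = lambda x: A.T @ (A @ x - b)
soft = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - lam * s, 0.0)  # prox of s*lam*||.||_1
s = 1.0 / np.linalg.norm(A, 2) ** 2    # step size 1/L with L = ||A||_2^2
x_hat = acc_prox_grad(grad_g, soft, np.zeros(5), s)
```

At a minimizer the iterate is a fixed point of the prox-gradient map, which gives a convenient optimality check.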
The ODE Viewpoint
$f$ $L$-smooth and $\mu$-strongly convex ($\gamma \ge 4.5$):
$f(X(t)) - f^* \le \frac{C\,\|x_0 - x^*\|^2}{t^3}$, where $C$ depends on $\gamma$ and $\mu$.
Energy function $\mathcal{E}(t)$:
$\mathcal{E}(t) = t^3\big(f(X(t)) - f^*\big) + \frac{(2\gamma - 3)^2\,t}{8}\left\| X(t) + \frac{2t}{2\gamma - 3}\,X'(t) - x^* \right\|^2$
The ODE Viewpoint
For $\gamma = 3$, even for a quadratic function, the rate cannot be better than $O\!\left(\frac{1}{t^3}\right)$
($\Delta t = \sqrt{s}$, $s = 10^{-6}$; observe every 1000 steps)
The ODE Viewpoint
Old guess: for $f$ $L$-smooth and $\mu$-strongly convex,
$f(X(t)) - f^* \sim O\!\left(\frac{1}{t^{\gamma}}\right)$
New discovery (Journal of Machine Learning Research, published 9/16):
$f$ $L$-smooth and $\mu$-strongly convex: $f(X(t)) - f^* \lesssim O\!\left(\frac{1}{t^{2\gamma/3}}\right)$
$f = h + g$, with $g$ $L$-smooth and $\mu$-strongly convex: $f(X(t)) - f^* \lesssim O\!\left(\frac{1}{t^{3}}\right)$
The ODE Viewpoint
Restart when $\frac{d}{dt}\|X'(t)\|^2 \le 0$: keep $X(t)$ moving fast
$y_{k+1} = x_{k+1} + \frac{k}{k+3}\,(x_{k+1} - x_k)$; on restart, reset $k = 0$, so that $y_{k+1} = x_{k+1}$
$f$ $L$-smooth and $\mu$-strongly convex ($\gamma = 3$):
$f(X(t)) - f^* \le C\,\|x_0 - x^*\|^2 \exp(-C' t)$, where $C, C' \ge 0$ depend on $\mu$ and $L$.
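The speed-restart rule has a simple discrete analogue: reset the momentum counter whenever the step length $\|x_{k+1} - x_k\|$ stops growing. A sketch (test function and constants below are assumptions):

```python
import numpy as np

def agd_speed_restart(grad_f, x0, L, iters=500):
    """Nesterov (gamma = 3) with speed restart: when ||x_{k+1} - x_k||
    shrinks, the discrete analogue of d||X'(t)||^2/dt <= 0, reset the
    momentum counter k to 0, so that y_{k+1} = x_{k+1}."""
    x = y = np.asarray(x0, dtype=float)
    k, prev_step = 0, 0.0
    for _ in range(iters):
        x_next = y - grad_f(y) / L
        step = np.linalg.norm(x_next - x)
        if step < prev_step:
            k = 0                    # restart: kill the momentum
        y = x_next + (k / (k + 3)) * (x_next - x)
        x, prev_step, k = x_next, step, k + 1
    return x

H = np.diag([1.0, 10.0])
x_hat = agd_speed_restart(lambda z: H @ z, [1.0, 1.0], L=10.0)
```

Because restarting only ever removes momentum, the scheme is never slower than plain gradient descent, and with strong convexity it recovers the linear rate quoted above.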
The ODE Viewpoint
Example: LASSO ($f = h + g$)
Questions??