Randomized Smoothing Techniques in Optimization (jduchi/projects/DuchiJoWaWi14_slides.pdf)
TRANSCRIPT
Randomized Smoothing Techniques in Optimization
John Duchi
Based on joint work with Peter Bartlett, Michael Jordan,
Martin Wainwright, Andre Wibisono
Stanford University
Information Systems Laboratory Seminar, October 2014
Problem Statement
Goal: solve the following problem
minimize f(x)
subject to x ∈ X
Often, we will assume
f(x) := (1/n) ∑_{i=1}^n F(x; ξi)   or   f(x) := E[F(x; ξ)]
Gradient Descent
Goal: solve
minimize f(x)
Technique: go down the slope,
xt+1 = xt − α∇f(xt)
[Figure: the tangent-line lower bound f(y) + ⟨∇f(y), x − y⟩ at the point (y, f(y))]
When is optimization easy?
Easy problem: function is convex, nice, and smooth
Not so easy problem: function is non-smooth
Even harder problems:
◮ We cannot compute gradients ∇f(x)
◮ Function f is non-convex and non-smooth
Example 1: Robust regression
◮ Data in pairs ξi = (ai, bi) ∈ Rᵈ × R
◮ Want to estimate bi ≈ ai⊤x
◮ To avoid outliers, minimize
f(x) = (1/n) ∑_{i=1}^n |ai⊤x − bi| = (1/n) ‖Ax − b‖₁
[Figure: the absolute loss plotted as a function of ai⊤x − bi]
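As a concrete illustration of this objective (not from the slides), here is a minimal numpy sketch that evaluates f and one subgradient of the ℓ1 regression loss; the data A, b are synthetic placeholders.

```python
import numpy as np

def robust_regression_loss(x, A, b):
    """f(x) = (1/n) * ||Ax - b||_1, the robust (absolute-loss) regression objective."""
    return np.abs(A @ x - b).mean()

def robust_regression_subgradient(x, A, b):
    """One subgradient: (1/n) * A^T sign(Ax - b)."""
    n = A.shape[0]
    return A.T @ np.sign(A @ x - b) / n

# Tiny synthetic example (placeholder data).
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = A @ np.ones(5) + 0.1 * rng.standard_normal(100)
x = np.zeros(5)
print(robust_regression_loss(x, A, b), robust_regression_subgradient(x, A, b).shape)
```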
Example 2: Protein Structure Prediction
[Figure: amino-acid sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV with six labeled residues and two candidate matchings (A) and (B) over nodes 1–6 (Grace et al., PNAS 2004)]
Protein Structure Prediction
Featurize edges e in the graph: vector ξe. Labels ν are matchings in the graph; the set V is all matchings.
[Figure: graph on nodes 1–6 with edges weighted by x⊤ξ_{1→2}, x⊤ξ_{1→3}, x⊤ξ_{4→6}]
Goal: Learn weights x so that
argmax_{ν̂∈V} { ∑_{e∈ν̂} ξe⊤x } = ν,
i.e. learn x so that the maximum matching in the graph with edge weights x⊤ξe is the correct matching ν.
Loss function: L(ν, ν̂) is the number of disagreements between the matchings, and
F(x; {ξ, ν}) := max_{ν̂∈V} ( L(ν, ν̂) + x⊤ ∑_{e∈ν̂} ξe − x⊤ ∑_{e∈ν} ξe ).
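To make the structured loss concrete, a small numpy sketch (an illustration, not the authors' code) evaluates F(x; {ξ, ν}) by brute force over a tiny hypothetical candidate set V of matchings, using the size of the symmetric difference of edge sets as the disagreement count L.

```python
import numpy as np

def structured_hinge(x, edge_features, true_matching, candidate_matchings):
    """F(x; {xi, nu}) = max over candidate matchings nu_hat of
    L(nu, nu_hat) + x^T sum_{e in nu_hat} xi_e - x^T sum_{e in nu} xi_e."""
    def score(matching):
        return sum(x @ edge_features[e] for e in matching)

    def disagreements(nu, nu_hat):
        return len(nu ^ nu_hat)  # symmetric difference of the two edge sets

    true_score = score(true_matching)
    return max(disagreements(true_matching, m) + score(m) - true_score
               for m in candidate_matchings)

# Toy example: two hypothetical matchings over edges of a 4-node graph.
edge_features = {(1, 2): np.array([1.0, 0.0]), (3, 4): np.array([0.0, 1.0]),
                 (1, 3): np.array([0.5, 0.5]), (2, 4): np.array([0.2, 0.1])}
V = [frozenset({(1, 2), (3, 4)}), frozenset({(1, 3), (2, 4)})]
nu = V[0]
x = np.array([0.3, -0.2])
print(structured_hinge(x, edge_features, nu, V))
```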
One technique to address many of these
Instead of only using f(x) and ∇f(x) to solve
minimize f(x),
get more global information
![Page 17: Randomized Smoothing Techniques in Optimizationjduchi/projects/DuchiJoWaWi14_slides.pdf · Randomized Smoothing Techniques in Optimization John Duchi Based on joint work with Peter](https://reader036.vdocuments.net/reader036/viewer/2022070112/60536fe561d8c8075b26e76e/html5/thumbnails/17.jpg)
One technique to address many of these
Instead of only using f(x) and ∇f(x) to solve
minimize f(x),
get more global information
Let Z be a random variable,and for small u, look at fnear points
f(x+ uZ),
where u is small
![Page 18: Randomized Smoothing Techniques in Optimizationjduchi/projects/DuchiJoWaWi14_slides.pdf · Randomized Smoothing Techniques in Optimization John Duchi Based on joint work with Peter](https://reader036.vdocuments.net/reader036/viewer/2022070112/60536fe561d8c8075b26e76e/html5/thumbnails/18.jpg)
One technique to address many of these
Instead of only using f(x) and ∇f(x) to solve
minimize f(x),
get more global information
Let Z be a random variable,and for small u, look at fnear points
f(x+ uZ),
where u is small
u
![Page 19: Randomized Smoothing Techniques in Optimizationjduchi/projects/DuchiJoWaWi14_slides.pdf · Randomized Smoothing Techniques in Optimization John Duchi Based on joint work with Peter](https://reader036.vdocuments.net/reader036/viewer/2022070112/60536fe561d8c8075b26e76e/html5/thumbnails/19.jpg)
One technique to address many of these
Instead of only using f(x) and ∇f(x) to solve
minimize f(x),
get more global information
Let Z be a random variable,and for small u, look at fnear points
f(x+ uZ),
where u is small
u
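The smoothed surrogate implicit in this idea is fu(x) = E[f(x + uZ)]. A minimal numpy sketch, assuming Z is standard Gaussian, estimates fu and its gradient by Monte Carlo from evaluations at perturbed points.

```python
import numpy as np

def smoothed_value_and_grad(f, grad_f, x, u=0.1, m=100, rng=None):
    """Monte Carlo estimate of f_u(x) = E[f(x + u Z)] and of its gradient
    E[grad f(x + u Z)], with Z ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    Z = rng.standard_normal((m, x.size))
    points = x + u * Z
    values = np.array([f(p) for p in points])
    grads = np.array([grad_f(p) for p in points])
    return values.mean(), grads.mean(axis=0)

# Example on f(x) = ||x||_1, which is non-smooth at 0.
f = lambda x: np.abs(x).sum()
grad_f = lambda x: np.sign(x)          # an (a.e.) gradient of the l1 norm
val, grad = smoothed_value_and_grad(f, grad_f, np.zeros(3), u=0.5, m=1000)
print(val, grad)
```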
Three instances
I Solving previously unsolvable problems [Burke, Lewis, Overton 2005]
  ◮ Non-smooth, non-convex problems
II Optimal convergence guarantees for problems with existing algorithms [D., Jordan, Wainwright, Wibisono 2014]
  ◮ Smooth and non-smooth zero-order stochastic and non-stochastic optimization problems
III Parallelism: really fast solutions for large-scale problems [D., Bartlett, Wainwright 2013]
  ◮ Smooth and non-smooth stochastic optimization problems
Instance I: Gradient Sampling Algorithm
Problem: Solve
minimize_{x∈X} f(x)
where f is potentially non-smooth and non-convex (but assume it is continuous and a.e. differentiable) [Burke, Lewis, Overton 2005]
At each iteration t,
◮ Draw Z1, . . . , Zm i.i.d. with ‖Zi‖ ≤ 1
◮ Set git = ∇f(xt + uZi)
◮ Set the gradient gt as
gt = argmin_g { ‖g‖₂² : g = ∑_i λi git, λ ≥ 0, ∑_i λi = 1 }
◮ Update xt+1 = xt − αgt, where α > 0 is chosen by line search
[Figure: gradients sampled within a radius u of xt]
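A rough numpy/scipy sketch of one gradient-sampling step (illustrative only; the line search and stopping test are omitted, and the constants are arbitrary). The minimum-norm convex combination is computed with scipy's SLSQP solver.

```python
import numpy as np
from scipy.optimize import minimize

def gradient_sampling_step(grad_f, x, u=0.1, m=10, alpha=0.05, rng=None):
    """One gradient-sampling step: sample gradients in a ball of radius u,
    take the minimum-norm element of their convex hull, and step against it."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    # Directions Z_i with ||Z_i|| <= 1 (uniform in the unit ball).
    Z = rng.standard_normal((m, d))
    Z *= rng.uniform(0, 1, size=(m, 1)) ** (1.0 / d) / np.linalg.norm(Z, axis=1, keepdims=True)
    G = np.array([grad_f(x + u * z) for z in Z])          # sampled gradients, shape (m, d)

    # Minimum-norm convex combination: min ||G^T lam||^2 s.t. lam >= 0, sum(lam) = 1.
    obj = lambda lam: np.sum((G.T @ lam) ** 2)
    cons = ({'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0},)
    res = minimize(obj, np.full(m, 1.0 / m), bounds=[(0, None)] * m,
                   constraints=cons, method='SLSQP')
    g = G.T @ res.x
    return x - alpha * g

# Example on the non-smooth function f(x) = ||x||_1.
grad_f = lambda x: np.sign(x)
x = np.array([1.0, -2.0, 0.5])
for _ in range(50):
    x = gradient_sampling_step(grad_f, x, u=0.1, m=8, alpha=0.1)
print(x)
```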
Define the set
Gu(x) := Conv{ ∇f(x′) : ‖x′ − x‖₂ ≤ u, ∇f(x′) exists }
Proposition (Burke, Lewis, Overton): There exist cluster points x̄ of the sequence xt, and for any such cluster point,
0 ∈ Gu(x̄)
Instance II: Zero Order Optimization
Problem: We want to solve
minimize_{x∈X} f(x) = E[F(x; ξ)]
but we are only allowed to observe function values f(x) (or F(x; ξ)).
Idea: Approximate the gradient by function differences
f′(y) ≈ gu := (f(y + u) − f(y)) / u
◮ Long history in optimization: Kiefer-Wolfowitz, Spall, Robbins-Monro
◮ Can randomized perturbations give insights?
[Figure: secant approximations f(y) + ⟨gu, x − y⟩ for two difference widths u1 ≪ u0]
Stochastic Gradient Descent
Algorithm: At iteration t
◮ Choose random ξ, set
gt = ∇F(xt; ξ)
◮ Update
xt+1 = xt − αgt
Theorem (Russians): Let x̂T = (1/T) ∑_{t=1}^T xt and assume R ≥ ‖x* − x1‖₂ and G² ≥ E[‖gt‖₂²]. Then
E[f(x̂T) − f(x*)] ≤ RG · (1/√T)
Note: Dependence on G important
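A minimal averaged-SGD sketch in numpy (an illustration of the scheme above, not code from the talk), using the robust-regression loss F(x; (a, b)) = |a⊤x − b| on synthetic data.

```python
import numpy as np

def averaged_sgd(grad_F, data, x0, alpha=0.01, T=1000, rng=None):
    """Stochastic gradient descent with iterate averaging:
    x_{t+1} = x_t - alpha * grad_F(x_t; xi), returning (1/T) * sum_t x_t."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    x_bar = np.zeros_like(x0)
    for t in range(T):
        xi = data[rng.integers(len(data))]      # choose a random sample xi
        x = x - alpha * grad_F(x, xi)
        x_bar += x / T
    return x_bar

# Robust regression example: F(x; (a, b)) = |a^T x - b|.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 3)); b = A @ np.array([1.0, -1.0, 0.5])
data = list(zip(A, b))
grad_F = lambda x, ab: np.sign(ab[0] @ x - ab[1]) * ab[0]
print(averaged_sgd(grad_F, data, np.zeros(3), alpha=0.05, T=2000, rng=rng))
```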
Derivative-free gradient descent
E[f(x̂T) − f(x*)] ≤ RG · (1/√T)
Question: How well can we estimate the gradient ∇f using only function differences? And how small is the norm of this estimate?
First idea gradient estimator:
◮ Sample Z ∼ µ satisfying Eµ[ZZ⊤] = I_{d×d}
◮ Gradient estimator at x:
g = [(f(x + uZ) − f(x)) / u] · Z
Perform gradient descent using these g
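A numpy sketch of this estimator (illustrative only), taking Z ∼ N(0, I) so that E[ZZ⊤] = I; averaging many estimates shows they are roughly unbiased for ∇f at a point of differentiability.

```python
import numpy as np

def randomized_gradient_estimate(f, x, u=1e-3, rng=None):
    """g = [(f(x + u Z) - f(x)) / u] * Z with Z ~ N(0, I), so that E[Z Z^T] = I."""
    rng = np.random.default_rng() if rng is None else rng
    Z = rng.standard_normal(x.size)
    return (f(x + u * Z) - f(x)) / u * Z

# Check against the true gradient of the smooth quadratic f(x) = 0.5 ||x||^2.
f = lambda x: 0.5 * np.sum(x ** 2)
x = np.array([1.0, 2.0, -1.0])
rng = np.random.default_rng(0)
estimates = np.array([randomized_gradient_estimate(f, x, rng=rng) for _ in range(20000)])
print(estimates.mean(axis=0), "vs true gradient", x)
```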
Two-point gradient estimates
◮ At any point x and any direction z, for small u > 0,
(f(x + uz) − f(x)) / u ≈ f′(x, z) := lim_{h↓0} (f(x + hz) − f(x)) / h
◮ If ∇f(x) exists, f′(x, z) = ⟨∇f(x), z⟩
◮ If E[ZZ⊤] = I, then E[f′(x, Z)Z] = E[ZZ⊤∇f(x)] = ∇f(x)
[Figure: individual random estimates f′(x, Zi)Zi whose average ≈ ∇f(x)]
Two-point stochastic gradient: differentiable functions
To solve the d-dimensional problem
minimize_{x∈X⊂Rᵈ} f(x) := E[F(x; ξ)]
Algorithm: Iterate
◮ Draw ξ according to its distribution, draw Z ∼ µ with Cov(Z) = I
◮ Set ut = u/t and
gt = [F(xt + utZ; ξ) − F(xt; ξ)] / ut · Z
◮ Update xt+1 = xt − αgt
Theorem (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x* − x1‖₂ and E[‖∇F(x; ξ)‖₂²] ≤ G² for all x, then
E[f(x̂T) − f(x*)] ≤ RG · √d/√T + O(u² (log T)/T).
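A compact numpy sketch of this two-point zero-order loop (illustrative; the stepsize α and the constant u are arbitrary choices, and X is taken to be all of Rᵈ so that no projection is needed).

```python
import numpy as np

def two_point_zero_order_sgd(F, data, x0, alpha=0.05, u=1.0, T=2000, rng=None):
    """Zero-order stochastic gradient method: at step t, draw xi and Z ~ N(0, I),
    set u_t = u / (t + 1), g_t = [F(x + u_t Z; xi) - F(x; xi)] / u_t * Z,
    and take a gradient step. Returns the averaged iterate."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    x_bar = np.zeros_like(x0)
    for t in range(T):
        xi = data[rng.integers(len(data))]
        Z = rng.standard_normal(x.size)
        u_t = u / (t + 1)
        g = (F(x + u_t * Z, xi) - F(x, xi)) / u_t * Z
        x = x - alpha * g
        x_bar += x / T
    return x_bar

# Least-squares example: F(x; (a, b)) = 0.5 * (a^T x - b)^2 (smooth in x).
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 3)); b = A @ np.array([1.0, -1.0, 0.5])
data = list(zip(A, b))
F = lambda x, ab: 0.5 * (ab[0] @ x - ab[1]) ** 2
print(two_point_zero_order_sgd(F, data, np.zeros(3), rng=rng))
```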
Comparisons to knowing gradient
Convergence rate scaling
RG · (1/√T)   versus   RG · (√d/√T)
◮ To achieve ε-accuracy: 1/ε² versus d/ε²
[Figure: optimality gap versus iterations for SGD and the zero-order method (left), and the same curves plotted against rescaled iterations (right)]
Two-point stochastic gradient: non-differentiable functions
Problem: If f is non-differentiable, “kinks” make estimates too large
Example: Let f(x) = ‖x‖₂. Then if E[ZZ⊤] = I_{d×d}, at x = 0,
E[‖(f(uZ) − 0)Z/u‖₂²] = E[‖Z‖₂² ‖Z‖₂²] ≥ d²
Idea: More randomization!
[Figure: perturbations at two scales u1 and u2]
Proposition (D., Jordan, Wainwright, Wibisono): If Z1, Z2 are N(0, I_{d×d}) or uniform on ‖z‖₂ ≤ √d, then
g := [f(x + u1Z1 + u2Z2) − f(x + u1Z1)] / u2 · Z2
satisfies
E[g] = ∇f_{u1}(x) + O(u2/u1)   and   E[‖g‖₂²] ≤ d(√(u2/u1) · d + log(2d)).
Note: If u2/u1 → 0, scaling linear in d
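A numpy sketch of the double-perturbation estimator (illustrative, with Gaussian Z1, Z2 and arbitrary scales u1 ≫ u2): the function is differenced across the small perturbation u2·Z2 around the already-perturbed point x + u1·Z1.

```python
import numpy as np

def two_scale_gradient_estimate(f, x, u1=1e-1, u2=1e-3, rng=None):
    """g = [f(x + u1 Z1 + u2 Z2) - f(x + u1 Z1)] / u2 * Z2 with Z1, Z2 ~ N(0, I).
    The outer perturbation u1 Z1 smooths away kinks; the inner one forms the difference."""
    rng = np.random.default_rng() if rng is None else rng
    Z1 = rng.standard_normal(x.size)
    Z2 = rng.standard_normal(x.size)
    base = x + u1 * Z1
    return (f(base + u2 * Z2) - f(base)) / u2 * Z2

# Average many estimates for the non-smooth f(x) = ||x||_2 away from 0.
f = lambda x: np.linalg.norm(x)
x = np.array([3.0, 4.0])
rng = np.random.default_rng(0)
g_bar = np.mean([two_scale_gradient_estimate(f, x, rng=rng) for _ in range(20000)], axis=0)
print(g_bar, "vs gradient", x / np.linalg.norm(x))
```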
Two-point sub-gradient: non-differentiable functions
To solve d-dimensional problem
minimize f(x) subject to x ∈ X ⊂ Rd
Algorithm: Iterate
◮ Draw Z1 ∼ µ and Z2 ∼ µ with Cov(Z) = I
◮ Set ut,1 = u/t, ut,2 = u/t², and
gt = [f(xt + ut,1Z1 + ut,2Z2) − f(xt + ut,1Z1)] / ut,2 · Z2
◮ Update xt+1 = xt − αgt
Theorem (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x* − x1‖₂ and ‖∂f(x)‖₂² ≤ G² for all x, then
E[f(x̂T) − f(x*)] ≤ RG · √(d log d)/√T + O(u (log T)/T).
Two-point sub-gradient: non-differentiable functions
To solve the d-dimensional problem
minimize_{x∈X⊂Rᵈ} f(x) := E[F(x; ξ)]
Algorithm: Iterate
◮ Draw ξ according to its distribution, draw Z1 ∼ µ and Z2 ∼ µ with Cov(Z) = I
◮ Set ut,1 = u/t, ut,2 = u/t², and
gt = [F(xt + ut,1Z1 + ut,2Z2; ξ) − F(xt + ut,1Z1; ξ)] / ut,2 · Z2
◮ Update xt+1 = xt − αgt
Corollary (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x* − x1‖₂ and E[‖∇F(x; ξ)‖₂²] ≤ G² for all x, then
E[f(x̂T) − f(x*)] ≤ RG · √(d log d)/√T + O(u (log T)/T).
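Putting the pieces together, a numpy sketch of the stochastic two-point sub-gradient loop (illustrative; unconstrained, with arbitrary α and u), using the decreasing scales ut,1 = u/t and ut,2 = u/t².

```python
import numpy as np

def two_point_subgradient_method(F, data, x0, alpha=0.05, u=1.0, T=2000, rng=None):
    """Zero-order method for non-smooth stochastic objectives: at step t draw xi
    and Z1, Z2 ~ N(0, I), set u_{t,1} = u/t, u_{t,2} = u/t^2, and use
    g_t = [F(x + u_{t,1} Z1 + u_{t,2} Z2; xi) - F(x + u_{t,1} Z1; xi)] / u_{t,2} * Z2."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    x_bar = np.zeros_like(x0)
    for t in range(1, T + 1):
        xi = data[rng.integers(len(data))]
        Z1, Z2 = rng.standard_normal(x.size), rng.standard_normal(x.size)
        u1, u2 = u / t, u / t ** 2
        base = x + u1 * Z1
        g = (F(base + u2 * Z2, xi) - F(base, xi)) / u2 * Z2
        x = x - alpha * g
        x_bar += x / T
    return x_bar

# Robust regression (non-smooth): F(x; (a, b)) = |a^T x - b|.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 3)); b = A @ np.array([1.0, -1.0, 0.5])
data = list(zip(A, b))
F = lambda x, ab: abs(ab[0] @ x - ab[1])
print(two_point_subgradient_method(F, data, np.zeros(3), rng=rng))
```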
Wrapping up zero-order gradient methods
◮ If gradients are available, convergence rates of √(1/T)
◮ If only zero-order information is available, in both the smooth and non-smooth cases, convergence rates of √(d/T)
◮ Time to ε-accuracy: 1/ε² ↦ d/ε²
◮ Sharpness: In the stochastic case, no algorithms exist that can do better than those we have provided. That is, a lower bound for all zero-order algorithms of
RG · √d/√T.
◮ Open question: Non-stochastic lower bounds? (Sebastian Pokutta, next week.)
Instance III: Parallelization and fast algorithms
Goal: solve the following problem
minimize f(x)
subject to x ∈ X
where
f(x) := (1/n) ∑_{i=1}^n F(x; ξi)   or   f(x) := E[F(x; ξ)]
Stochastic Gradient Descent
Problem: Tough to compute
f(x) = (1/n) ∑_{i=1}^n F(x; ξi).
Instead: At iteration t
◮ Choose random ξi, set
gt = ∇F (xt; ξi)
◮ Update
xt+1 = xt − αgt
What everyone “knows” we should do
Obviously: get a lower-variance estimate of the gradient.
Sample gj,t with E[gj,t] = ∇f(xt) and use gt = (1/m) ∑_{j=1}^m gj,t
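A small numpy sketch of mini-batched stochastic gradients (illustrative), averaging m per-sample gradients to lower the variance of each update.

```python
import numpy as np

def minibatch_gradient(grad_F, data, x, m, rng):
    """gt = (1/m) * sum_j grad_F(x; xi_j) over a random mini-batch of size m."""
    idx = rng.integers(len(data), size=m)
    return np.mean([grad_F(x, data[i]) for i in idx], axis=0)

# Mini-batch SGD on a smooth least-squares objective F(x; (a, b)) = 0.5 (a^T x - b)^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 3)); b = A @ np.array([1.0, -1.0, 0.5])
data = list(zip(A, b))
grad_F = lambda x, ab: (ab[0] @ x - ab[1]) * ab[0]
x = np.zeros(3)
for _ in range(500):
    x -= 0.05 * minibatch_gradient(grad_F, data, x, m=20, rng=rng)
print(x)
```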
Problem: only works for smooth functions.
Non-smooth problems we care about:
◮ Classification
F(x; ξ) = F(x; (a, b)) = [1 − b x⊤a]₊
◮ Robust regression
F(x; (a, b)) = |b − x⊤a|
◮ Structured prediction (ranking, parsing, learning matchings)
F(x; {ξ, ν}) = max_{ν̂∈V} [ L(ν, ν̂) + x⊤Φ(ξ, ν̂) − x⊤Φ(ξ, ν) ]
Difficulties of non-smooth
Intuition: Gradient is poor indicator of global structure
Better global estimators
Idea: Ask for subgradients from multiple points
The algorithm
Normal approach: sample ξ at random,
gj,t ∈ ∂F(xt; ξ).
Our approach: add noise to x,
gj,t ∈ ∂F(xt + utZj; ξ)
Decrease the magnitude ut over time
[Figure: subgradients sampled within a radius ut of the current iterate xt]
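An illustrative numpy sketch of this perturbed mini-batch: each of the m subgradients is taken at a randomly perturbed copy of the current point, and the perturbation magnitude ut shrinks with t (all constants are arbitrary choices).

```python
import numpy as np

def perturbed_minibatch_subgradient(subgrad_F, data, x, u_t, m, rng):
    """g_t = (1/m) * sum_j g_{j,t} with g_{j,t} in dF(x + u_t Z_j; xi_j):
    subgradients evaluated at randomly perturbed points."""
    idx = rng.integers(len(data), size=m)
    Z = rng.standard_normal((m, x.size))
    return np.mean([subgrad_F(x + u_t * Z[j], data[i]) for j, i in enumerate(idx)], axis=0)

# Non-smooth hinge-loss example: F(x; (a, b)) = [1 - b * a^T x]_+.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 3)); b = np.sign(A @ np.array([1.0, -1.0, 0.5]))
data = list(zip(A, b))
subgrad_F = lambda x, ab: -ab[1] * ab[0] if 1 - ab[1] * (ab[0] @ x) > 0 else np.zeros_like(x)
x = np.zeros(3)
for t in range(1, 501):
    x -= 0.1 / np.sqrt(t) * perturbed_minibatch_subgradient(subgrad_F, data, x, u_t=1.0 / t, m=10, rng=rng)
print(x)
```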
Algorithm
Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). Have a query point and an exploration point.
I. Get query point and gradients:
yt = (1 − θt)xt + θt zt
Sample ξj,t and Zj,t, compute the gradient approximation
gt = (1/m) ∑_{j=1}^m gj,t,   gj,t ∈ ∂F(yt + utZj,t; ξj,t)
II. Solve for the exploration point
zt+1 = argmin_{x∈X} { ∑_{τ=0}^t (1/θτ) ⟨gτ, x⟩  (approximate f)  +  (1/(2αt)) ‖x‖₂²  (regularize) }
III. Interpolate:
xt+1 = (1 − θt)xt + θt zt+1
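A schematic numpy sketch of these three steps (not the authors' implementation), under the assumed choices θt = 2/(t+2) and a constant αt, and with X = Rᵈ so that step II has the closed form zt+1 = −αt ∑_{τ≤t} gτ/θτ.

```python
import numpy as np

def accelerated_smoothed_method(subgrad_F, data, x0, T=300, m=10, u=1.0, alpha=0.1, rng=None):
    """Schematic accelerated dual-averaging scheme with randomized smoothing:
    query point y_t, exploration point z_t, averaged perturbed subgradients.
    Assumes X = R^d (so step II is in closed form) and theta_t = 2/(t+2)."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    z = x0.copy()
    s = np.zeros_like(x0)                      # running sum of g_t / theta_t
    for t in range(T):
        theta = 2.0 / (t + 2)
        u_t = u / (t + 1)
        y = (1 - theta) * x + theta * z        # I. query point
        idx = rng.integers(len(data), size=m)
        Z = rng.standard_normal((m, x.size))
        g = np.mean([subgrad_F(y + u_t * Z[j], data[i]) for j, i in enumerate(idx)], axis=0)
        s += g / theta
        z = -alpha * s                          # II. exploration point (unconstrained argmin)
        x = (1 - theta) * x + theta * z        # III. interpolate
    return x

# Hinge-loss example as before.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 3)); b = np.sign(A @ np.array([1.0, -1.0, 0.5]))
data = list(zip(A, b))
subgrad_F = lambda x, ab: -ab[1] * ab[0] if 1 - ab[1] * (ab[0] @ x) > 0 else np.zeros_like(x)
print(accelerated_smoothed_method(subgrad_F, data, np.zeros(3), rng=rng))
```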
Theoretical Results
Objective:
minimize_{x∈X} f(x) where f(x) = E[F(x; ξ)]
using m gradient samples for stochastic gradients.
Non-strongly convex objectives:
f(xT) − f(x*) = O(C/T + 1/√(Tm))
λ-strongly convex objectives:
f(xT) − f(x*) = O(C/T² + 1/(λTm))
A few remarks on distributing
Convergence rate:
f(xT) − f(x*) = O(1/T + 1/√(Tm))
◮ If communication is expensive, use larger batch sizes m:
(a) Communication cost is c
(b) n computers with batch size m
(c) S total update steps
Backsolve: after T = S(m + c) units of time, the error is
O( (m + c)/T + 1/√(Tn) · √((m + c)/m) )
[Figure: n machines each computing gradients on batches of size m, with communication cost c]
Experimental results
Iteration complexity simulations
Define T(ε, m) = min{t ∈ N : f(xt) − f(x*) ≤ ε}, and solve the robust regression problem
f(x) = (1/n) ∑_{i=1}^n |ai⊤x − bi| = (1/n) ‖Ax − b‖₁
[Figure: iterations to ε-optimality versus the number m of gradient samples, actual T(ε, m) against predicted T(ε, m), on log-log axes (two panels)]
Robustness to stepsize and smoothing
◮ Two parameters: smoothing parameter u, stepsize η
[Figure: optimality gap f(x̂) + ϕ(x̂) − f(x*) − ϕ(x*) as a function of the stepsize η and of 1/u, on logarithmic scales]
Plot: optimality gap after 2000 iterations on a synthetic SVM problem
f(x) + ϕ(x) := (1/n) ∑_{i=1}^n [1 − ξi⊤x]₊ + (λ/2)‖x‖₂²
Text Classification
Reuters RCV1 dataset, time to ε-optimal solution for
(1/n) ∑_{i=1}^n [1 − ξi⊤x]₊ + (λ/2)‖x‖₂²
[Figure: mean time (seconds) to optimality gaps of 0.004 and 0.002 versus the number of worker threads, for batch sizes 10 and 20]
Text Classification
Reuters RCV1 dataset, optimization speed for minimizing
(1/n) ∑_{i=1}^n [1 − ξi⊤x]₊ + (λ/2)‖x‖₂²
[Figure: optimality gap versus time (s) for the accelerated method with 1, 2, 3, 4, 6, and 8 gradient samples, compared with Pegasos]
Parsing
Penn Treebank dataset, learning weights for a hypergraph parser (here ξ is a sentence and ν ∈ V is a parse tree):
(1/n) ∑_{i=1}^n max_{ν̂∈V} [ L(νi, ν̂) + x⊤(Φ(ξi, ν̂) − Φ(ξi, νi)) ] + (λ/2)‖x‖₂².
[Figure: optimality gap f(x) + ϕ(x) − f(x*) − ϕ(x*) versus iterations and versus time (seconds) for 1, 2, 4, 6, and 8 gradient samples, compared with RDA]
Is smoothing necessary?
Solve the multiple-median problem
f(x) = (1/n) ∑_{i=1}^n ‖x − ξi‖₁,
with ξi ∈ {−1, 1}ᵈ. Compare against standard stochastic gradient:
[Figure: iterations to ε-optimality versus the number m of gradient samples, smoothed versus unsmoothed]
Discussion
◮ Randomized smoothing allows
  ◮ Stationary points of non-convex, non-smooth problems
  ◮ Optimal solutions in zero-order problems (including non-smooth)
  ◮ Parallelization for non-smooth problems
◮ Current experiments in consultation with Google for large-scale parsing/translation tasks
◮ Open questions: non-stochastic optimality guarantees? True zero-order optimization?
Thanks!
References:
◮ Randomized Smoothing for Stochastic Optimization (D., Bartlett, Wainwright). SIAM Journal on Optimization, 22(2), pages 674–701.
◮ Optimal rates for zero-order convex optimization: the power of two function evaluations (D., Jordan, Wainwright, Wibisono). arXiv:1312.2139 [math.OC].
Available on my webpage (http://web.stanford.edu/~jduchi)