Randomized Smoothing Techniques in Optimization (jduchi/projects/DuchiJoWaWi14_slides.pdf)
TRANSCRIPT
Randomized Smoothing Techniques in Optimization
John Duchi
Based on joint work with Peter Bartlett, Michael Jordan,
Martin Wainwright, Andre Wibisono
Stanford University
Information Systems Laboratory Seminar, October 2014
Problem Statement
Goal: solve the following problem
minimize f(x)
subject to x ∈ X
Often, we will assume
f(x) := (1/n) ∑_{i=1}^n F(x; ξi)   or   f(x) := E[F(x; ξ)]
Gradient Descent
Goal: solve
minimize f(x)
Technique: go down the slope,
xt+1 = xt − α∇f(xt)
[Figure: the tangent-line lower bound f(y) + ⟨∇f(y), x − y⟩ at the point (y, f(y))]
When is optimization easy?
Easy problem: function is convex, nice, and smooth
Not so easy problem: function is non-smooth
Even harder problems:
◮ We cannot compute gradients ∇f(x)
◮ Function f is non-convex and non-smooth
Example 1: Robust regression
◮ Data in pairs ξi = (ai, bi) ∈ Rᵈ × R
◮ Want to estimate bi ≈ ai⊤x
◮ To avoid outliers, minimize
f(x) = (1/n) ∑_{i=1}^n |ai⊤x − bi| = (1/n) ‖Ax − b‖₁
[Figure: the absolute loss plotted as a function of ai⊤x − bi]
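As a concrete illustration of this objective (not from the slides), here is a minimal numpy sketch that evaluates f and one subgradient of the ℓ1 regression loss; the data A, b are synthetic placeholders.

```python
import numpy as np

def robust_regression_loss(x, A, b):
    """f(x) = (1/n) * ||Ax - b||_1, the robust (absolute-loss) regression objective."""
    return np.abs(A @ x - b).mean()

def robust_regression_subgradient(x, A, b):
    """One subgradient: (1/n) * A^T sign(Ax - b)."""
    n = A.shape[0]
    return A.T @ np.sign(A @ x - b) / n

# Tiny synthetic example (placeholder data).
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = A @ np.ones(5) + 0.1 * rng.standard_normal(100)
x = np.zeros(5)
print(robust_regression_loss(x, A, b), robust_regression_subgradient(x, A, b).shape)
```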
Example 2: Protein Structure Prediction
[Figure: amino-acid sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV with six labeled residues and two candidate matchings (A) and (B) over nodes 1–6 (Grace et al., PNAS 2004)]
Protein Structure Prediction
Featurize edges e in the graph: vector ξe. Labels ν are matchings in the graph; the set V is all matchings.
[Figure: graph on nodes 1–6 with edges weighted by x⊤ξ_{1→2}, x⊤ξ_{1→3}, x⊤ξ_{4→6}]
Goal: Learn weights x so that
argmax_{ν̂∈V} { ∑_{e∈ν̂} ξe⊤x } = ν,
i.e. learn x so that the maximum matching in the graph with edge weights x⊤ξe is the correct matching ν.
Loss function: L(ν, ν̂) is the number of disagreements between the matchings, and
F(x; {ξ, ν}) := max_{ν̂∈V} ( L(ν, ν̂) + x⊤ ∑_{e∈ν̂} ξe − x⊤ ∑_{e∈ν} ξe ).
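To make the structured loss concrete, a small numpy sketch (an illustration, not the authors' code) evaluates F(x; {ξ, ν}) by brute force over a tiny hypothetical candidate set V of matchings, using the size of the symmetric difference of edge sets as the disagreement count L.

```python
import numpy as np

def structured_hinge(x, edge_features, true_matching, candidate_matchings):
    """F(x; {xi, nu}) = max over candidate matchings nu_hat of
    L(nu, nu_hat) + x^T sum_{e in nu_hat} xi_e - x^T sum_{e in nu} xi_e."""
    def score(matching):
        return sum(x @ edge_features[e] for e in matching)

    def disagreements(nu, nu_hat):
        return len(nu ^ nu_hat)  # symmetric difference of the two edge sets

    true_score = score(true_matching)
    return max(disagreements(true_matching, m) + score(m) - true_score
               for m in candidate_matchings)

# Toy example: two hypothetical matchings over edges of a 4-node graph.
edge_features = {(1, 2): np.array([1.0, 0.0]), (3, 4): np.array([0.0, 1.0]),
                 (1, 3): np.array([0.5, 0.5]), (2, 4): np.array([0.2, 0.1])}
V = [frozenset({(1, 2), (3, 4)}), frozenset({(1, 3), (2, 4)})]
nu = V[0]
x = np.array([0.3, -0.2])
print(structured_hinge(x, edge_features, nu, V))
```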
One technique to address many of these
Instead of only using f(x) and ∇f(x) to solve
minimize f(x),
get more global information
![Page 17: Randomized Smoothing Techniques in Optimizationjduchi/projects/DuchiJoWaWi14_slides.pdf · Randomized Smoothing Techniques in Optimization John Duchi Based on joint work with Peter](https://reader036.vdocuments.net/reader036/viewer/2022070112/60536fe561d8c8075b26e76e/html5/thumbnails/17.jpg)
One technique to address many of these
Instead of only using f(x) and ∇f(x) to solve
minimize f(x),
get more global information
Let Z be a random variable,and for small u, look at fnear points
f(x+ uZ),
where u is small
![Page 18: Randomized Smoothing Techniques in Optimizationjduchi/projects/DuchiJoWaWi14_slides.pdf · Randomized Smoothing Techniques in Optimization John Duchi Based on joint work with Peter](https://reader036.vdocuments.net/reader036/viewer/2022070112/60536fe561d8c8075b26e76e/html5/thumbnails/18.jpg)
One technique to address many of these
Instead of only using f(x) and ∇f(x) to solve
minimize f(x),
get more global information
Let Z be a random variable,and for small u, look at fnear points
f(x+ uZ),
where u is small
u
![Page 19: Randomized Smoothing Techniques in Optimizationjduchi/projects/DuchiJoWaWi14_slides.pdf · Randomized Smoothing Techniques in Optimization John Duchi Based on joint work with Peter](https://reader036.vdocuments.net/reader036/viewer/2022070112/60536fe561d8c8075b26e76e/html5/thumbnails/19.jpg)
One technique to address many of these
Instead of only using f(x) and ∇f(x) to solve
minimize f(x),
get more global information
Let Z be a random variable,and for small u, look at fnear points
f(x+ uZ),
where u is small
u
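The smoothed surrogate implicit in this idea is fu(x) = E[f(x + uZ)]. A minimal numpy sketch, assuming Z is standard Gaussian, estimates fu and its gradient by Monte Carlo from evaluations at perturbed points.

```python
import numpy as np

def smoothed_value_and_grad(f, grad_f, x, u=0.1, m=100, rng=None):
    """Monte Carlo estimate of f_u(x) = E[f(x + u Z)] and of its gradient
    E[grad f(x + u Z)], with Z ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    Z = rng.standard_normal((m, x.size))
    points = x + u * Z
    values = np.array([f(p) for p in points])
    grads = np.array([grad_f(p) for p in points])
    return values.mean(), grads.mean(axis=0)

# Example on f(x) = ||x||_1, which is non-smooth at 0.
f = lambda x: np.abs(x).sum()
grad_f = lambda x: np.sign(x)          # an (a.e.) gradient of the l1 norm
val, grad = smoothed_value_and_grad(f, grad_f, np.zeros(3), u=0.5, m=1000)
print(val, grad)
```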
Three instances
I Solving previously unsolvable problems [Burke, Lewis, Overton 2005]
  ◮ Non-smooth, non-convex problems
II Optimal convergence guarantees for problems with existing algorithms [D., Jordan, Wainwright, Wibisono 2014]
  ◮ Smooth and non-smooth zero-order stochastic and non-stochastic optimization problems
III Parallelism: really fast solutions for large-scale problems [D., Bartlett, Wainwright 2013]
  ◮ Smooth and non-smooth stochastic optimization problems
Instance I: Gradient Sampling Algorithm
Problem: Solve
minimize_{x∈X} f(x)
where f is potentially non-smooth and non-convex (but assume it is continuous and a.e. differentiable) [Burke, Lewis, Overton 2005]
At each iteration t,
◮ Draw Z1, . . . , Zm i.i.d. with ‖Zi‖ ≤ 1
◮ Set git = ∇f(xt + uZi)
◮ Set the gradient gt as
gt = argmin_g { ‖g‖₂² : g = ∑_i λi git, λ ≥ 0, ∑_i λi = 1 }
◮ Update xt+1 = xt − αgt, where α > 0 is chosen by line search
[Figure: gradients sampled within a radius u of xt]
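A rough numpy/scipy sketch of one gradient-sampling step (illustrative only; the line search and stopping test are omitted, and the constants are arbitrary). The minimum-norm convex combination is computed with scipy's SLSQP solver.

```python
import numpy as np
from scipy.optimize import minimize

def gradient_sampling_step(grad_f, x, u=0.1, m=10, alpha=0.05, rng=None):
    """One gradient-sampling step: sample gradients in a ball of radius u,
    take the minimum-norm element of their convex hull, and step against it."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    # Directions Z_i with ||Z_i|| <= 1 (uniform in the unit ball).
    Z = rng.standard_normal((m, d))
    Z *= rng.uniform(0, 1, size=(m, 1)) ** (1.0 / d) / np.linalg.norm(Z, axis=1, keepdims=True)
    G = np.array([grad_f(x + u * z) for z in Z])          # sampled gradients, shape (m, d)

    # Minimum-norm convex combination: min ||G^T lam||^2 s.t. lam >= 0, sum(lam) = 1.
    obj = lambda lam: np.sum((G.T @ lam) ** 2)
    cons = ({'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0},)
    res = minimize(obj, np.full(m, 1.0 / m), bounds=[(0, None)] * m,
                   constraints=cons, method='SLSQP')
    g = G.T @ res.x
    return x - alpha * g

# Example on the non-smooth function f(x) = ||x||_1.
grad_f = lambda x: np.sign(x)
x = np.array([1.0, -2.0, 0.5])
for _ in range(50):
    x = gradient_sampling_step(grad_f, x, u=0.1, m=8, alpha=0.1)
print(x)
```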
Define the set
Gu(x) := Conv{ ∇f(x′) : ‖x′ − x‖₂ ≤ u, ∇f(x′) exists }
Proposition (Burke, Lewis, Overton): There exist cluster points x̄ of the sequence xt, and for any such cluster point,
0 ∈ Gu(x̄)
Instance II: Zero Order Optimization
Problem: We want to solve
minimize_{x∈X} f(x) = E[F(x; ξ)]
but we are only allowed to observe function values f(x) (or F(x; ξ)).
Idea: Approximate the gradient by function differences
f′(y) ≈ gu := (f(y + u) − f(y)) / u
◮ Long history in optimization: Kiefer-Wolfowitz, Spall, Robbins-Monro
◮ Can randomized perturbations give insights?
[Figure: secant approximations f(y) + ⟨gu, x − y⟩ for two difference widths u1 ≪ u0]
Stochastic Gradient Descent
Algorithm: At iteration t
◮ Choose random ξ, set
gt = ∇F(xt; ξ)
◮ Update
xt+1 = xt − αgt
Theorem (Russians): Let x̂T = (1/T) ∑_{t=1}^T xt and assume R ≥ ‖x* − x1‖₂ and G² ≥ E[‖gt‖₂²]. Then
E[f(x̂T) − f(x*)] ≤ RG · (1/√T)
Note: Dependence on G important
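A minimal averaged-SGD sketch in numpy (an illustration of the scheme above, not code from the talk), using the robust-regression loss F(x; (a, b)) = |a⊤x − b| on synthetic data.

```python
import numpy as np

def averaged_sgd(grad_F, data, x0, alpha=0.01, T=1000, rng=None):
    """Stochastic gradient descent with iterate averaging:
    x_{t+1} = x_t - alpha * grad_F(x_t; xi), returning (1/T) * sum_t x_t."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    x_bar = np.zeros_like(x0)
    for t in range(T):
        xi = data[rng.integers(len(data))]      # choose a random sample xi
        x = x - alpha * grad_F(x, xi)
        x_bar += x / T
    return x_bar

# Robust regression example: F(x; (a, b)) = |a^T x - b|.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 3)); b = A @ np.array([1.0, -1.0, 0.5])
data = list(zip(A, b))
grad_F = lambda x, ab: np.sign(ab[0] @ x - ab[1]) * ab[0]
print(averaged_sgd(grad_F, data, np.zeros(3), alpha=0.05, T=2000, rng=rng))
```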
Derivative-free gradient descent
E[f(x̂T) − f(x*)] ≤ RG · (1/√T)
Question: How well can we estimate the gradient ∇f using only function differences? And how small is the norm of this estimate?
First idea gradient estimator:
◮ Sample Z ∼ µ satisfying Eµ[ZZ⊤] = I_{d×d}
◮ Gradient estimator at x:
g = [(f(x + uZ) − f(x)) / u] · Z
Perform gradient descent using these g
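A numpy sketch of this estimator (illustrative only), taking Z ∼ N(0, I) so that E[ZZ⊤] = I; averaging many estimates shows they are roughly unbiased for ∇f at a point of differentiability.

```python
import numpy as np

def randomized_gradient_estimate(f, x, u=1e-3, rng=None):
    """g = [(f(x + u Z) - f(x)) / u] * Z with Z ~ N(0, I), so that E[Z Z^T] = I."""
    rng = np.random.default_rng() if rng is None else rng
    Z = rng.standard_normal(x.size)
    return (f(x + u * Z) - f(x)) / u * Z

# Check against the true gradient of the smooth quadratic f(x) = 0.5 ||x||^2.
f = lambda x: 0.5 * np.sum(x ** 2)
x = np.array([1.0, 2.0, -1.0])
rng = np.random.default_rng(0)
estimates = np.array([randomized_gradient_estimate(f, x, rng=rng) for _ in range(20000)])
print(estimates.mean(axis=0), "vs true gradient", x)
```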
Two-point gradient estimates
◮ At any point x and any direction z, for small u > 0,
(f(x + uz) − f(x)) / u ≈ f′(x, z) := lim_{h↓0} (f(x + hz) − f(x)) / h
◮ If ∇f(x) exists, f′(x, z) = ⟨∇f(x), z⟩
◮ If E[ZZ⊤] = I, then E[f′(x, Z)Z] = E[ZZ⊤∇f(x)] = ∇f(x)
[Figure: individual random estimates f′(x, Zi)Zi whose average ≈ ∇f(x)]
Two-point stochastic gradient: differentiable functions
To solve the d-dimensional problem
minimize_{x∈X⊂Rᵈ} f(x) := E[F(x; ξ)]
Algorithm: Iterate
◮ Draw ξ according to its distribution, draw Z ∼ µ with Cov(Z) = I
◮ Set ut = u/t and
gt = [F(xt + utZ; ξ) − F(xt; ξ)] / ut · Z
◮ Update xt+1 = xt − αgt
Theorem (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x* − x1‖₂ and E[‖∇F(x; ξ)‖₂²] ≤ G² for all x, then
E[f(x̂T) − f(x*)] ≤ RG · √d/√T + O(u² (log T)/T).
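A compact numpy sketch of this two-point zero-order loop (illustrative; the stepsize α and the constant u are arbitrary choices, and X is taken to be all of Rᵈ so that no projection is needed).

```python
import numpy as np

def two_point_zero_order_sgd(F, data, x0, alpha=0.05, u=1.0, T=2000, rng=None):
    """Zero-order stochastic gradient method: at step t, draw xi and Z ~ N(0, I),
    set u_t = u / (t + 1), g_t = [F(x + u_t Z; xi) - F(x; xi)] / u_t * Z,
    and take a gradient step. Returns the averaged iterate."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    x_bar = np.zeros_like(x0)
    for t in range(T):
        xi = data[rng.integers(len(data))]
        Z = rng.standard_normal(x.size)
        u_t = u / (t + 1)
        g = (F(x + u_t * Z, xi) - F(x, xi)) / u_t * Z
        x = x - alpha * g
        x_bar += x / T
    return x_bar

# Least-squares example: F(x; (a, b)) = 0.5 * (a^T x - b)^2 (smooth in x).
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 3)); b = A @ np.array([1.0, -1.0, 0.5])
data = list(zip(A, b))
F = lambda x, ab: 0.5 * (ab[0] @ x - ab[1]) ** 2
print(two_point_zero_order_sgd(F, data, np.zeros(3), rng=rng))
```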
Comparisons to knowing gradient
Convergence rate scaling
RG · (1/√T)   versus   RG · (√d/√T)
◮ To achieve ε-accuracy: 1/ε² versus d/ε²
[Figure: optimality gap versus iterations for SGD and the zero-order method (left), and the same curves plotted against rescaled iterations (right)]
Two-point stochastic gradient: non-differentiable functions
Problem: If f is non-differentiable, “kinks” make estimates too large
Example: Let f(x) = ‖x‖₂. Then if E[ZZ⊤] = I_{d×d}, at x = 0,
E[‖(f(uZ) − 0)Z/u‖₂²] = E[‖Z‖₂² ‖Z‖₂²] ≥ d²
Idea: More randomization!
[Figure: perturbations at two scales u1 and u2]
Proposition (D., Jordan, Wainwright, Wibisono): If Z1, Z2 are N(0, I_{d×d}) or uniform on ‖z‖₂ ≤ √d, then
g := [f(x + u1Z1 + u2Z2) − f(x + u1Z1)] / u2 · Z2
satisfies
E[g] = ∇f_{u1}(x) + O(u2/u1)   and   E[‖g‖₂²] ≤ d(√(u2/u1) · d + log(2d)).
Note: If u2/u1 → 0, scaling linear in d
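A numpy sketch of the double-perturbation estimator (illustrative, with Gaussian Z1, Z2 and arbitrary scales u1 ≫ u2): the function is differenced across the small perturbation u2·Z2 around the already-perturbed point x + u1·Z1.

```python
import numpy as np

def two_scale_gradient_estimate(f, x, u1=1e-1, u2=1e-3, rng=None):
    """g = [f(x + u1 Z1 + u2 Z2) - f(x + u1 Z1)] / u2 * Z2 with Z1, Z2 ~ N(0, I).
    The outer perturbation u1 Z1 smooths away kinks; the inner one forms the difference."""
    rng = np.random.default_rng() if rng is None else rng
    Z1 = rng.standard_normal(x.size)
    Z2 = rng.standard_normal(x.size)
    base = x + u1 * Z1
    return (f(base + u2 * Z2) - f(base)) / u2 * Z2

# Average many estimates for the non-smooth f(x) = ||x||_2 away from 0.
f = lambda x: np.linalg.norm(x)
x = np.array([3.0, 4.0])
rng = np.random.default_rng(0)
g_bar = np.mean([two_scale_gradient_estimate(f, x, rng=rng) for _ in range(20000)], axis=0)
print(g_bar, "vs gradient", x / np.linalg.norm(x))
```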
Two-point sub-gradient: non-differentiable functions
To solve d-dimensional problem
minimize f(x) subject to x ∈ X ⊂ Rd
Algorithm: Iterate
◮ Draw Z1 ∼ µ and Z2 ∼ µ with Cov(Z) = I
◮ Set ut,1 = u/t, ut,2 = u/t², and
gt = [f(xt + ut,1Z1 + ut,2Z2) − f(xt + ut,1Z1)] / ut,2 · Z2
◮ Update xt+1 = xt − αgt
Theorem (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x* − x1‖₂ and ‖∂f(x)‖₂² ≤ G² for all x, then
E[f(x̂T) − f(x*)] ≤ RG · √(d log d)/√T + O(u (log T)/T).
Two-point sub-gradient: non-differentiable functions
To solve the d-dimensional problem
minimize_{x∈X⊂Rᵈ} f(x) := E[F(x; ξ)]
Algorithm: Iterate
◮ Draw ξ according to its distribution, draw Z1 ∼ µ and Z2 ∼ µ with Cov(Z) = I
◮ Set ut,1 = u/t, ut,2 = u/t², and
gt = [F(xt + ut,1Z1 + ut,2Z2; ξ) − F(xt + ut,1Z1; ξ)] / ut,2 · Z2
◮ Update xt+1 = xt − αgt
Corollary (D., Jordan, Wainwright, Wibisono): With appropriate α, if R ≥ ‖x* − x1‖₂ and E[‖∇F(x; ξ)‖₂²] ≤ G² for all x, then
E[f(x̂T) − f(x*)] ≤ RG · √(d log d)/√T + O(u (log T)/T).
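Putting the pieces together, a numpy sketch of the stochastic two-point sub-gradient loop (illustrative; unconstrained, with arbitrary α and u), using the decreasing scales ut,1 = u/t and ut,2 = u/t².

```python
import numpy as np

def two_point_subgradient_method(F, data, x0, alpha=0.05, u=1.0, T=2000, rng=None):
    """Zero-order method for non-smooth stochastic objectives: at step t draw xi
    and Z1, Z2 ~ N(0, I), set u_{t,1} = u/t, u_{t,2} = u/t^2, and use
    g_t = [F(x + u_{t,1} Z1 + u_{t,2} Z2; xi) - F(x + u_{t,1} Z1; xi)] / u_{t,2} * Z2."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    x_bar = np.zeros_like(x0)
    for t in range(1, T + 1):
        xi = data[rng.integers(len(data))]
        Z1, Z2 = rng.standard_normal(x.size), rng.standard_normal(x.size)
        u1, u2 = u / t, u / t ** 2
        base = x + u1 * Z1
        g = (F(base + u2 * Z2, xi) - F(base, xi)) / u2 * Z2
        x = x - alpha * g
        x_bar += x / T
    return x_bar

# Robust regression (non-smooth): F(x; (a, b)) = |a^T x - b|.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 3)); b = A @ np.array([1.0, -1.0, 0.5])
data = list(zip(A, b))
F = lambda x, ab: abs(ab[0] @ x - ab[1])
print(two_point_subgradient_method(F, data, np.zeros(3), rng=rng))
```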
Wrapping up zero-order gradient methods
◮ If gradients are available, convergence rates of √(1/T)
◮ If only zero-order information is available, in both the smooth and non-smooth cases, convergence rates of √(d/T)
◮ Time to ε-accuracy: 1/ε² ↦ d/ε²
◮ Sharpness: In the stochastic case, no algorithms exist that can do better than those we have provided. That is, a lower bound for all zero-order algorithms of
RG · √d/√T.
◮ Open question: Non-stochastic lower bounds? (Sebastian Pokutta, next week.)
Instance III: Parallelization and fast algorithms
Goal: solve the following problem
minimize f(x)
subject to x ∈ X
where
f(x) := (1/n) ∑_{i=1}^n F(x; ξi)   or   f(x) := E[F(x; ξ)]
Stochastic Gradient Descent
Problem: Tough to compute
f(x) = (1/n) ∑_{i=1}^n F(x; ξi).
Instead: At iteration t
◮ Choose random ξi, set
gt = ∇F (xt; ξi)
◮ Update
xt+1 = xt − αgt
What everyone “knows” we should do
Obviously: get a lower-variance estimate of the gradient.
Sample gj,t with E[gj,t] = ∇f(xt) and use gt = (1/m) ∑_{j=1}^m gj,t
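A small numpy sketch of mini-batched stochastic gradients (illustrative), averaging m per-sample gradients to lower the variance of each update.

```python
import numpy as np

def minibatch_gradient(grad_F, data, x, m, rng):
    """gt = (1/m) * sum_j grad_F(x; xi_j) over a random mini-batch of size m."""
    idx = rng.integers(len(data), size=m)
    return np.mean([grad_F(x, data[i]) for i in idx], axis=0)

# Mini-batch SGD on a smooth least-squares objective F(x; (a, b)) = 0.5 (a^T x - b)^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 3)); b = A @ np.array([1.0, -1.0, 0.5])
data = list(zip(A, b))
grad_F = lambda x, ab: (ab[0] @ x - ab[1]) * ab[0]
x = np.zeros(3)
for _ in range(500):
    x -= 0.05 * minibatch_gradient(grad_F, data, x, m=20, rng=rng)
print(x)
```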
Problem: only works for smooth functions.
Non-smooth problems we care about:
◮ Classification
F(x; ξ) = F(x; (a, b)) = [1 − b x⊤a]₊
◮ Robust regression
F(x; (a, b)) = |b − x⊤a|
◮ Structured prediction (ranking, parsing, learning matchings)
F(x; {ξ, ν}) = max_{ν̂∈V} [ L(ν, ν̂) + x⊤Φ(ξ, ν̂) − x⊤Φ(ξ, ν) ]
Difficulties of non-smooth
Intuition: Gradient is poor indicator of global structure
Better global estimators
Idea: Ask for subgradients from multiple points
The algorithm
Normal approach: sample ξ at random,
gj,t ∈ ∂F(xt; ξ).
Our approach: add noise to x,
gj,t ∈ ∂F(xt + utZj; ξ)
Decrease the magnitude ut over time
[Figure: subgradients sampled within a radius ut of the current iterate xt]
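An illustrative numpy sketch of this perturbed mini-batch: each of the m subgradients is taken at a randomly perturbed copy of the current point, and the perturbation magnitude ut shrinks with t (all constants are arbitrary choices).

```python
import numpy as np

def perturbed_minibatch_subgradient(subgrad_F, data, x, u_t, m, rng):
    """g_t = (1/m) * sum_j g_{j,t} with g_{j,t} in dF(x + u_t Z_j; xi_j):
    subgradients evaluated at randomly perturbed points."""
    idx = rng.integers(len(data), size=m)
    Z = rng.standard_normal((m, x.size))
    return np.mean([subgrad_F(x + u_t * Z[j], data[i]) for j, i in enumerate(idx)], axis=0)

# Non-smooth hinge-loss example: F(x; (a, b)) = [1 - b * a^T x]_+.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 3)); b = np.sign(A @ np.array([1.0, -1.0, 0.5]))
data = list(zip(A, b))
subgrad_F = lambda x, ab: -ab[1] * ab[0] if 1 - ab[1] * (ab[0] @ x) > 0 else np.zeros_like(x)
x = np.zeros(3)
for t in range(1, 501):
    x -= 0.1 / np.sqrt(t) * perturbed_minibatch_subgradient(subgrad_F, data, x, u_t=1.0 / t, m=10, rng=rng)
print(x)
```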
Algorithm
Generalization of accelerated gradient methods (Nesterov 1983, Tseng 2008, Lan 2010). Have a query point and an exploration point.
I. Get query point and gradients:
yt = (1 − θt)xt + θt zt
Sample ξj,t and Zj,t, compute the gradient approximation
gt = (1/m) ∑_{j=1}^m gj,t,   gj,t ∈ ∂F(yt + utZj,t; ξj,t)
II. Solve for the exploration point
zt+1 = argmin_{x∈X} { ∑_{τ=0}^t (1/θτ) ⟨gτ, x⟩  (approximate f)  +  (1/(2αt)) ‖x‖₂²  (regularize) }
III. Interpolate:
xt+1 = (1 − θt)xt + θt zt+1
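A schematic numpy sketch of these three steps (not the authors' implementation), under the assumed choices θt = 2/(t+2) and a constant αt, and with X = Rᵈ so that step II has the closed form zt+1 = −αt ∑_{τ≤t} gτ/θτ.

```python
import numpy as np

def accelerated_smoothed_method(subgrad_F, data, x0, T=300, m=10, u=1.0, alpha=0.1, rng=None):
    """Schematic accelerated dual-averaging scheme with randomized smoothing:
    query point y_t, exploration point z_t, averaged perturbed subgradients.
    Assumes X = R^d (so step II is in closed form) and theta_t = 2/(t+2)."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    z = x0.copy()
    s = np.zeros_like(x0)                      # running sum of g_t / theta_t
    for t in range(T):
        theta = 2.0 / (t + 2)
        u_t = u / (t + 1)
        y = (1 - theta) * x + theta * z        # I. query point
        idx = rng.integers(len(data), size=m)
        Z = rng.standard_normal((m, x.size))
        g = np.mean([subgrad_F(y + u_t * Z[j], data[i]) for j, i in enumerate(idx)], axis=0)
        s += g / theta
        z = -alpha * s                          # II. exploration point (unconstrained argmin)
        x = (1 - theta) * x + theta * z        # III. interpolate
    return x

# Hinge-loss example as before.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 3)); b = np.sign(A @ np.array([1.0, -1.0, 0.5]))
data = list(zip(A, b))
subgrad_F = lambda x, ab: -ab[1] * ab[0] if 1 - ab[1] * (ab[0] @ x) > 0 else np.zeros_like(x)
print(accelerated_smoothed_method(subgrad_F, data, np.zeros(3), rng=rng))
```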
Theoretical Results
Objective:
minimize_{x∈X} f(x) where f(x) = E[F(x; ξ)]
using m gradient samples for stochastic gradients.
Non-strongly convex objectives:
f(xT) − f(x*) = O(C/T + 1/√(Tm))
λ-strongly convex objectives:
f(xT) − f(x*) = O(C/T² + 1/(λTm))
A few remarks on distributing
Convergence rate:
f(xT) − f(x*) = O(1/T + 1/√(Tm))
◮ If communication is expensive, use larger batch sizes m:
(a) Communication cost is c
(b) n computers with batch size m
(c) S total update steps
Backsolve: after T = S(m + c) units of time, the error is
O( (m + c)/T + 1/√(Tn) · √((m + c)/m) )
[Figure: n machines each computing gradients on batches of size m, with communication cost c]
Experimental results
Iteration complexity simulations
Define T(ε, m) = min{t ∈ N : f(xt) − f(x*) ≤ ε}, and solve the robust regression problem
f(x) = (1/n) ∑_{i=1}^n |ai⊤x − bi| = (1/n) ‖Ax − b‖₁
[Figure: iterations to ε-optimality versus the number m of gradient samples, actual T(ε, m) against predicted T(ε, m), on log-log axes (two panels)]
Robustness to stepsize and smoothing
◮ Two parameters: smoothing parameter u, stepsize η
[Figure: optimality gap f(x̂) + ϕ(x̂) − f(x*) − ϕ(x*) as a function of the stepsize η and of 1/u, on logarithmic scales]
Plot: optimality gap after 2000 iterations on a synthetic SVM problem
f(x) + ϕ(x) := (1/n) ∑_{i=1}^n [1 − ξi⊤x]₊ + (λ/2)‖x‖₂²
Text Classification
Reuters RCV1 dataset, time to ε-optimal solution for
(1/n) ∑_{i=1}^n [1 − ξi⊤x]₊ + (λ/2)‖x‖₂²
[Figure: mean time (seconds) to optimality gaps of 0.004 and 0.002 versus the number of worker threads, for batch sizes 10 and 20]
Text Classification
Reuters RCV1 dataset, optimization speed for minimizing
(1/n) ∑_{i=1}^n [1 − ξi⊤x]₊ + (λ/2)‖x‖₂²
[Figure: optimality gap versus time (s) for the accelerated method with 1, 2, 3, 4, 6, and 8 gradient samples, compared with Pegasos]
Parsing
Penn Treebank dataset, learning weights for a hypergraph parser (here ξ is a sentence and ν ∈ V is a parse tree):
(1/n) ∑_{i=1}^n max_{ν̂∈V} [ L(νi, ν̂) + x⊤(Φ(ξi, ν̂) − Φ(ξi, νi)) ] + (λ/2)‖x‖₂².
[Figure: optimality gap f(x) + ϕ(x) − f(x*) − ϕ(x*) versus iterations and versus time (seconds) for 1, 2, 4, 6, and 8 gradient samples, compared with RDA]
Is smoothing necessary?
Solve the multiple-median problem
f(x) = (1/n) ∑_{i=1}^n ‖x − ξi‖₁,
with ξi ∈ {−1, 1}ᵈ. Compare against standard stochastic gradient:
[Figure: iterations to ε-optimality versus the number m of gradient samples, smoothed versus unsmoothed]
Discussion
◮ Randomized smoothing allows
  ◮ Stationary points of non-convex, non-smooth problems
  ◮ Optimal solutions in zero-order problems (including non-smooth)
  ◮ Parallelization for non-smooth problems
◮ Current experiments in consultation with Google for large-scale parsing/translation tasks
◮ Open questions: non-stochastic optimality guarantees? True zero-order optimization?
Thanks!
References:
◮ Randomized Smoothing for Stochastic Optimization (D., Bartlett, Wainwright). SIAM Journal on Optimization, 22(2), pages 674–701.
◮ Optimal rates for zero-order convex optimization: the power of two function evaluations (D., Jordan, Wainwright, Wibisono). arXiv:1312.2139 [math.OC].
Available on my webpage (http://web.stanford.edu/~jduchi)