Stochastic Alternating Direction Method of Multipliers
Taiji Suzuki
Tokyo Institute of Technology
1 / 49
Goal of this study
Machine learning problems with a structured regularization:

\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w)

We propose stochastic optimization methods to solve the above problem:
Stochastic optimization + Alternating Direction Method of Multipliers (ADMM).
Advantageous points
Few samples are needed for each update (we don't need to read all the samples at each iteration).
Wide range of applications
Easy implementation
Nice convergence properties
Parallelizable
2 / 49
Proposed method and existing methods
Stochastic methods for regularized learning problems (normal method → ADMM counterpart):

Online:
- Proximal gradient type (Nesterov, 2007): FOBOS (Duchi and Singer, 2009) → OPG-ADMM (Suzuki, 2013; Ouyang et al., 2013)
- Dual averaging type (Nesterov, 2009): RDA (Xiao, 2009) → RDA-ADMM (Suzuki, 2013)
- Online-ADMM (Wang and Banerjee, 2012)

Batch:
- SDCA (Stochastic Dual Coordinate Ascent; Shalev-Shwartz and Zhang, 2013b) → SDCA-ADMM (Suzuki, 2014)
- SAG (Stochastic Averaging Gradient; Le Roux et al., 2013)
- SVRG (Stochastic Variance Reduced Gradient; Johnson and Zhang, 2013)

Online type: O(1/√T) in general, O(log(T)/T) for a strongly convex objective.
Batch type: linear convergence.
3 / 49
Outline
1 Problem settings
2 Stochastic ADMM for online data
3 Stochastic ADMM for batch data
4 / 49
High dimensional data analysis
Redundant information deteriorates the estimation accuracy.
[Figure: examples of high dimensional data — bio-informatics, text data, image data.]
5 / 49
Sparse estimation
Cut off redundant information → sparsity
R. Tibshirani (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, Vol. 58, No. 1, pages 267–288. Citations: 10,185 (as of 25 May 2014).
6 / 49
Regularized learning problem
Lasso:
\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n (y_i - z_i^\top w)^2 + \underbrace{\|w\|_1}_{\text{regularization}}
Sparsity: useful for high dimensional learning problems.
Sparsity-inducing regularization is usually non-smooth.
General regularized learning problem:
\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(w).
7 / 49
Proximal mapping
Regularized learning problem:
\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(w).

g_t \in \partial_w \Big( \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) \Big)\Big|_{w = w_t}, \qquad \bar{g}_t = \frac{1}{t} \sum_{\tau=1}^t g_\tau.

Proximal gradient descent: w_{t+1} = \mathop{\mathrm{argmin}}_{w \in \mathbb{R}^p} \Big\{ g_t^\top w + \psi(w) + \frac{1}{2\eta_t}\|w - w_t\|^2 \Big\}.

Regularized dual averaging: w_{t+1} = \mathop{\mathrm{argmin}}_{w \in \mathbb{R}^p} \Big\{ \bar{g}_t^\top w + \psi(w) + \frac{1}{2\eta_t}\|w\|^2 \Big\}.
A key computation is the proximal mapping:
\mathrm{prox}(q \mid \psi) := \mathop{\mathrm{argmin}}_x \Big\{ \psi(x) + \frac{1}{2}\|x - q\|^2 \Big\}.
8 / 49
Example of Proximal mapping: ℓ1 regularization
\mathrm{prox}(q \mid C\|\cdot\|_1) = \mathop{\mathrm{argmin}}_x \Big\{ C\|x\|_1 + \frac{1}{2}\|x - q\|^2 \Big\} = \big(\mathrm{sign}(q_j)\max(|q_j| - C, 0)\big)_j.
→ Soft-thresholding function. Analytic form!
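As a quick illustration, here is a minimal NumPy sketch of this soft-thresholding proximal mapping (the function and variable names are ours, not from the slides):

```python
import numpy as np

def prox_l1(q, C):
    """Proximal mapping of C*||.||_1: elementwise soft-thresholding."""
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

# Entries with |q_j| <= C are set exactly to zero; the rest are shrunk toward zero.
q = np.array([3.0, -0.5, 1.2, -2.0])
print(prox_l1(q, C=1.0))  # approx [2.0, 0.0, 0.2, -1.0]
```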
9 / 49
Motivation
Structured regularization with a large number of samples:

\min_{w \in \mathbb{R}^p} \underbrace{\frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w)}_{\text{large samples}} + \underbrace{\psi(w)}_{\text{complicated}}
Difficulties (complex regularization and many samples)
The proximal mapping cannot be easily computed:

\mathrm{prox}(q \mid \psi) := \mathop{\mathrm{argmin}}_x \Big\{ \psi(x) + \frac{1}{2}\|x - q\|^2 \Big\}.

→ ADMM (Alternating Direction Method of Multipliers).

Computation of the loss term is heavy for a large number of samples.
→ Stochastic optimization.
10 / 49
Examples
Overlapped group lasso
\psi(w) = C \sum_{g \in \mathcal{G}} \|w_g\|

The groups have overlaps.
It is difficult to compute the proximal mapping.

Solution: prepare a function \tilde{\psi} whose proximal mapping is easily computable.
Let \tilde{\psi}(B^\top w) = \psi(w), and utilize the proximal mapping w.r.t. \tilde{\psi}.
11 / 49
Decomposition technique
Decompose into independent groups:
B^\top w duplicates the overlapping coordinates so that the resulting groups \mathcal{G}' are disjoint:

\tilde{\psi}(y) = C \sum_{g' \in \mathcal{G}'} \|y_{g'}\|, \qquad \mathrm{prox}(q \mid \tilde{\psi}) = \Big( q_{g'} \max\Big\{ 1 - \frac{C}{\|q_{g'}\|},\, 0 \Big\} \Big)_{g' \in \mathcal{G}'}
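As an illustration of this group-wise proximal mapping over disjoint groups, a small NumPy sketch (our own naming; the groups are assumed to be given as index lists):

```python
import numpy as np

def prox_group_lasso(q, groups, C):
    """prox of C * sum_{g'} ||y_{g'}|| for disjoint groups: blockwise shrinkage."""
    y = q.copy()
    for g in groups:                          # g: indices of one (non-overlapping) group
        norm = np.linalg.norm(q[g])
        scale = max(1.0 - C / norm, 0.0) if norm > 0 else 0.0
        y[g] = scale * q[g]
    return y

# Two disjoint groups: the first is shrunk, the second is zeroed out entirely.
q = np.array([3.0, 4.0, 0.1, 0.2])
print(prox_group_lasso(q, groups=[[0, 1], [2, 3]], C=1.0))  # approx [2.4, 3.2, 0.0, 0.0]
```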
Graph regularization (Jacob et al., 2009),
Low rank tensor estimation (Signoretto et al., 2010; Tomioka et al., 2011),
Dictionary learning (Kasiviswanathan et al., 2012; Rakotomamonjy, 2013).
12 / 49
Another example
Graph regularization (Jacob et al., 2009)
\psi(w) = C \sum_{(i,j) \in E} |w_i - w_j|.

\tilde{\psi}(y) = C \sum_{e \in E} |y_e|, \qquad B^\top w = (w_i - w_j)_{(i,j) \in E}

\;\Rightarrow\; \tilde{\psi}(B^\top w) = \psi(w), \qquad \mathrm{prox}(q \mid \tilde{\psi}) = \Big( q_e \max\Big\{ 1 - \frac{C}{|q_e|},\, 0 \Big\} \Big)_{e \in E}.
Soft-Thresholding function.
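To make the roles of B and the edge-wise soft-thresholding concrete, a short NumPy sketch for a toy graph (all names are our own illustration):

```python
import numpy as np

def edge_differences(w, edges):
    """B^T w: one entry w_i - w_j per edge (i, j) of the graph."""
    return np.array([w[i] - w[j] for (i, j) in edges])

def prox_edge_l1(q, C):
    """prox of C * sum_e |y_e|: elementwise soft-thresholding over the edges."""
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

w = np.array([1.0, 0.2, -0.5])
edges = [(0, 1), (1, 2)]
y = edge_differences(w, edges)     # B^T w = [0.8, 0.7]
print(prox_edge_l1(y, C=0.5))      # [0.3, 0.2]
```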
13 / 49
Optimizing composite objective function
\min_w \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w) \;\Leftrightarrow\; \min_{w, y} \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(y) \quad \text{s.t. } y = B^\top w.
Augmented Lagrangian
L(w, y, \lambda) = \frac{1}{n} \sum_i f_i(z_i^\top w) + \psi(y) + \lambda^\top(y - B^\top w) + \frac{\rho}{2}\|y - B^\top w\|^2

\sup_\lambda \inf_{w, y} L(w, y, \lambda) yields the optimum of the original problem.
The augmented Lagrangian is the basis of the method of multipliers (Hestenes, 1969; Powell, 1969; Rockafellar, 1976).
14 / 49
ADMM
\min_w \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w) \;\Leftrightarrow\; \min_{w, y} \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(y) \quad \text{s.t. } y = B^\top w.

L(w, y, \lambda) = \frac{1}{n} \sum_i f_i(z_i^\top w) + \psi(y) + \lambda^\top(y - B^\top w) + \frac{\rho}{2}\|y - B^\top w\|^2.
ADMM (Gabay and Mercier, 1976)
w_t = \mathop{\mathrm{argmin}}_w \; \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \lambda_{t-1}^\top(-B^\top w) + \frac{\rho}{2}\|y_{t-1} - B^\top w\|^2

y_t = \mathop{\mathrm{argmin}}_y \; \psi(y) + \lambda_{t-1}^\top y + \frac{\rho}{2}\|y - B^\top w_t\|^2 \quad \big(= \mathrm{prox}(B^\top w_t - \lambda_{t-1}/\rho \mid \psi/\rho)\big)

\lambda_t = \lambda_{t-1} - \rho(B^\top w_t - y_t)
The computational difficulty of proximal mapping is resolved.
However, the computation of \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) is still heavy.
→ Stochastic version of ADMM.
Online type: Suzuki (2013); Wang and Banerjee (2012); Ouyang et al. (2013)
Batch type: Suzuki (2014) → linear convergence.
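To make the iteration concrete, here is a minimal NumPy sketch of these batch ADMM updates for the special case of a squared loss and ψ = C‖·‖₁ (so the w-step is a linear solve and the y-step is soft-thresholding); the function and variable names are our own illustration, not the authors' implementation:

```python
import numpy as np

def admm_squared_loss_l1(Z, a, B, C, rho=1.0, iters=100):
    """ADMM for  min_w (1/2n)||a - Z^T w||^2 + C ||B^T w||_1,
    with Z = [z_1, ..., z_n] in R^{p x n}, targets a in R^n, B in R^{p x d}."""
    p, n = Z.shape
    d = B.shape[1]
    w, y, lam = np.zeros(p), np.zeros(d), np.zeros(d)
    A = Z @ Z.T / n + rho * B @ B.T          # system matrix of the closed-form w-step
    for _ in range(iters):
        # w-step: minimize the loss plus the augmented-Lagrangian terms (linear solve here).
        w = np.linalg.solve(A, Z @ a / n + B @ (lam + rho * y))
        # y-step: prox(B^T w - lam/rho | psi/rho) = soft-thresholding at level C/rho.
        q = B.T @ w - lam / rho
        y = np.sign(q) * np.maximum(np.abs(q) - C / rho, 0.0)
        # dual update.
        lam = lam - rho * (B.T @ w - y)
    return w
```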
15 / 49
Outline
1 Problem settings
2 Stochastic ADMM for online data
3 Stochastic ADMM for batch data
16 / 49
Online setting
[Figure: data z_1, z_2, \ldots drawn from P(z); the iterate w_t is updated to w_{t+1} after each observation.]

\min_w \frac{1}{T} \sum_{t=1}^T f(z_t, w) + \psi(w)
We don’t want to read all the samples.
w_t is updated per one (or a few) observations.
17 / 49
Existing online optimization methods

g_t \in \partial f_t(w_t), \qquad \bar{g}_t = \frac{1}{t} \sum_{\tau=1}^t g_\tau \qquad (f_t(w) = f(z_t, w))
OPG (Online Proximal Gradient Descent)
w_{t+1} = \mathop{\mathrm{argmin}}_w \Big\{ \langle g_t, w \rangle + \psi(w) + \frac{1}{2\eta_t}\|w - w_t\|^2 \Big\}
RDA (Regularized Dual Averaging; Xiao (2009); Nesterov (2009))
w_{t+1} = \mathop{\mathrm{argmin}}_w \Big\{ \langle \bar{g}_t, w \rangle + \psi(w) + \frac{1}{2\eta_t}\|w\|^2 \Big\}

These update rules are computed via the proximal mapping associated with ψ.
Efficient for simple regularizations such as the ℓ1 regularization.
How about structured regularization? → ADMM.
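For the special case ψ(w) = C‖w‖₁, both updates reduce to soft-thresholding; a minimal sketch of one step of each (our own function names, not from the slides):

```python
import numpy as np

def soft_threshold(q, tau):
    return np.sign(q) * np.maximum(np.abs(q) - tau, 0.0)

def opg_step(w, g, eta, C):
    """OPG with psi = C||.||_1: prox step around the current iterate w_t."""
    return soft_threshold(w - eta * g, eta * C)

def rda_step(g_bar, eta, C):
    """RDA with psi = C||.||_1: prox step driven by the averaged gradient."""
    return soft_threshold(-eta * g_bar, eta * C)
```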
18 / 49
Proposed method 1: RDA-ADMM
Ordinary RDA: w_{t+1} = \mathop{\mathrm{argmin}}_w \big\{ \langle \bar{g}_t, w \rangle + \psi(w) + \frac{1}{2\eta_t}\|w\|^2 \big\}

RDA-ADMM
Let \bar{x}_t = \frac{1}{t}\sum_{\tau=1}^t x_\tau, \; \bar{\lambda}_t = \frac{1}{t}\sum_{\tau=1}^t \lambda_\tau, \; \bar{y}_t = \frac{1}{t}\sum_{\tau=1}^t y_\tau, \; \bar{g}_t = \frac{1}{t}\sum_{\tau=1}^t g_\tau.

x_{t+1} = \mathop{\mathrm{argmin}}_{x \in X} \Big\{ \bar{g}_t^\top x - \bar{\lambda}_t^\top B^\top x + \frac{\rho}{2t}\|B^\top x\|^2 + \rho(B^\top \bar{x}_t - \bar{y}_t)^\top B^\top x + \frac{1}{2\eta_t}\|x\|^2_{G_t} \Big\},

y_{t+1} = \mathop{\mathrm{argmin}}_{y \in Y} \Big\{ \psi(y) - \lambda_t^\top(B^\top x_{t+1} - y) + \frac{\rho}{2}\|B^\top x_{t+1} - y\|^2 \Big\},

\lambda_{t+1} = \lambda_t - \rho(B^\top x_{t+1} - y_{t+1}).
Given x_{t+1}, the update rules of y_{t+1} and \lambda_{t+1} are the same as in the ordinary ADMM:

y_{t+1} = \mathrm{prox}(B^\top x_{t+1} - \lambda_t/\rho \mid \psi/\rho)
19 / 49
Proposed method 2: OPG-ADMM
Ordinary OPG: w_{t+1} = \mathop{\mathrm{argmin}}_w \big\{ \langle g_t, w \rangle + \psi(w) + \frac{1}{2\eta_t}\|w - w_t\|^2 \big\}.
OPG-ADMM
x_{t+1} = \mathop{\mathrm{argmin}}_{x \in X} \Big\{ g_t^\top x - \lambda_t^\top(B^\top x - y_t) + \frac{\rho}{2}\|B^\top x - y_t\|^2 + \frac{1}{2\eta_t}\|x - x_t\|^2_{G_t} \Big\},

y_{t+1} = \mathrm{prox}(B^\top x_{t+1} - \lambda_t/\rho \mid \psi/\rho),

\lambda_{t+1} = \lambda_t - \rho(B^\top x_{t+1} - y_{t+1}).

Given x_{t+1}, the update rules of y_{t+1} and \lambda_{t+1} are the same as in the ordinary ADMM.
20 / 49
Simplified version

Letting G_t take a specific form, the update rule is much simplified
(G_t = \gamma I - \frac{\rho\eta_t}{t} BB^\top for RDA-ADMM, and G_t = \gamma I - \rho\eta_t BB^\top for OPG-ADMM).

Online stochastic ADMM

1. Update of x:
(RDA-ADMM) x_{t+1} = \Pi_X\Big[ -\frac{\eta_t}{\gamma}\big\{\bar{g}_t - B(\bar{\lambda}_t - \rho B^\top \bar{x}_t + \rho \bar{y}_t)\big\} \Big],
(OPG-ADMM) x_{t+1} = \Pi_X\Big[ -\frac{\eta_t}{\gamma}\big\{g_t - B(\lambda_t - \rho B^\top x_t + \rho y_t)\big\} + x_t \Big].

2. Update of y:
y_{t+1} = \mathrm{prox}(B^\top x_{t+1} - \lambda_t/\rho \mid \psi/\rho).

3. Update of λ:
\lambda_{t+1} = \lambda_t - \rho(B^\top x_{t+1} - y_{t+1}).
Fast computation. Easy implementation.
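As an illustration, a compact NumPy sketch of one pass of the simplified OPG-ADMM update above, with ψ = C‖·‖₁ and the projection Π_X omitted (i.e., X = R^p); all names are ours, not the authors' code:

```python
import numpy as np

def soft_threshold(q, tau):
    return np.sign(q) * np.maximum(np.abs(q) - tau, 0.0)

def opg_admm_step(x, y, lam, g, B, rho, eta, gamma, C):
    """One OPG-ADMM iteration; g is a stochastic (sub)gradient of the loss at x."""
    # x-update: simplified gradient-style step (Pi_X omitted).
    x_new = x - (eta / gamma) * (g - B @ (lam - rho * (B.T @ x) + rho * y))
    # y-update: prox(B^T x_{t+1} - lam/rho | psi/rho) = soft-thresholding at C/rho.
    y_new = soft_threshold(B.T @ x_new - lam / rho, C / rho)
    # dual update.
    lam_new = lam - rho * (B.T @ x_new - y_new)
    return x_new, y_new, lam_new
```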
21 / 49
Convergence analysis
We bound the expected risk:
Expected risk: F(x) = \mathbb{E}_Z[f(Z, x)] + \psi(x).

Assumptions:
(A1) ∃G s.t. every g ∈ \partial_x f(z, x) satisfies \|g\| \le G for all z, x.
(A2) ∃L s.t. every g ∈ \partial\psi(y) satisfies \|g\| \le L for all y.
(A3) ∃R s.t. every x ∈ X satisfies \|x\| \le R.
22 / 49
Convergence rate: bounded gradient

(A1) ∃G s.t. every g ∈ \partial_x f(z, x) satisfies \|g\| \le G for all z, x.
(A2) ∃L s.t. every g ∈ \partial\psi(y) satisfies \|g\| \le L for all y.
(A3) ∃R s.t. every x ∈ X satisfies \|x\| \le R.
Theorem (Convergence rate of RDA-ADMM)
Under (A1), (A2), (A3), we have
\mathbb{E}_{z_{1:T-1}}[F(\bar{x}_T) - F(x^*)] \le \frac{1}{T} \sum_{t=2}^T \frac{\eta_{t-1}}{2(t-1)} G^2 + \frac{\gamma}{\eta_T}\|x^*\|^2 + \frac{K}{T}.
Theorem (Convergence rate of OPG-ADMM)
Under (A1), (A2), (A3), we have

\mathbb{E}_{z_{1:T-1}}[F(\bar{x}_T) - F(x^*)] \le \frac{1}{2T} \sum_{t=2}^T \max\Big\{\frac{\gamma}{\eta_t} - \frac{\gamma}{\eta_{t-1}},\, 0\Big\} R^2 + \frac{1}{T} \sum_{t=1}^T \frac{\eta_t}{2} G^2 + \frac{K}{T}.

Both methods have convergence rate O\big(\frac{1}{\sqrt{T}}\big) by letting \eta_t = \eta_0\sqrt{t} for RDA-ADMM and \eta_t = \eta_0/\sqrt{t} for OPG-ADMM.
23 / 49
Convergence rate: strongly convex
(A4) There exist \sigma_f, \sigma_\psi \ge 0 and P, Q \succeq O such that

\mathbb{E}_Z[f(Z, x) + (x' - x)^\top \nabla_x f(Z, x)] + \frac{\sigma_f}{2}\|x - x'\|_P^2 \le \mathbb{E}_Z[f(Z, x')],

\psi(y) + (y' - y)^\top \nabla\psi(y) + \frac{\sigma_\psi}{2}\|y - y'\|_Q^2 \le \psi(y'),

for all x, x' ∈ X and y, y' ∈ Y, and ∃\sigma > 0 satisfying

\sigma I \preceq \sigma_f P + \frac{\rho\sigma_\psi}{2\rho + \sigma_\psi} Q.

The update rule of RDA-ADMM is modified as

x_{t+1} = \mathop{\mathrm{argmin}}_{x \in X} \Big\{ \bar{g}_t^\top x - \bar{\lambda}_t^\top B^\top x + \frac{\rho}{2t}\|B^\top x\|^2 + \rho(B^\top \bar{x}_t - \bar{y}_t)^\top B^\top x + \frac{1}{2\eta_t}\|x\|^2_{G_t} + \frac{\sigma}{2}\|x - \bar{x}_t\|^2 \Big\}.
24 / 49
Convergence analysis: strongly convex
Theorem (Convergence rate of RDA-ADMM)
Under (A1), (A2), (A3), (A4), we have
\mathbb{E}_{z_{1:T-1}}[F(\bar{x}_T) - F(x^*)] \le \frac{1}{2T} \sum_{t=2}^T \frac{G^2}{\frac{t-1}{\eta_{t-1}} + t\sigma} + \frac{\gamma}{\eta_T}\|x^*\|^2 + \frac{K}{T}.
Theorem (Convergence rate of OPG-ADMM)
Under (A1), (A2), (A3), (A4), we have

\mathbb{E}_{z_{1:T-1}}[F(\bar{x}_T) - F(x^*)] \le \frac{1}{2T} \sum_{t=2}^T \max\Big\{\frac{\gamma}{\eta_t} - \frac{\gamma}{\eta_{t-1}} - \sigma,\, 0\Big\} R^2 + \frac{1}{T} \sum_{t=1}^T \frac{\eta_t}{2} G^2 + \frac{K}{T}.

Both methods have convergence rate O\big(\frac{\log(T)}{T}\big) by letting \eta_t = \eta_0 t for RDA-ADMM and \eta_t = \eta_0/t for OPG-ADMM.
25 / 49
Numerical experiments
[Figure: Artificial data (1024 dimensions, 512 samples), overlapped group lasso. Expected risk vs. CPU time (s) for OPG-ADMM, RDA-ADMM, NL-ADMM, RDA, and OPG.]

[Figure: Real data (Adult, a9a; 15,252 dimensions, 32,561 samples), overlapped group lasso + ℓ1 regularization. Classification error vs. CPU time (s) for OPG-ADMM, RDA-ADMM, NL-ADMM, and RDA.]
26 / 49
Outline
1 Problem settings
2 Stochastic ADMM for online data
3 Stochastic ADMM for batch data
27 / 49
Batch setting
In the batch setting, the data are fixed. We just minimize the objective function defined by

\frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w).

We construct a method that
observes only a few samples per iteration (like an online method), and
converges linearly (unlike online methods).
28 / 49
Dual problem
The dual formulation is utilized to obtain our algorithm.
Primal: \min_w \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w)

Dual problem

\min_{x \in \mathbb{R}^n,\, y \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n f_i^*(x_i) + \psi^*\Big(\frac{y}{n}\Big) \quad \text{s.t. } Zx + By = 0,

where Z = [z_1, z_2, \ldots, z_n] ∈ \mathbb{R}^{p \times n}.

Optimality condition:

z_i^\top w^* \in \nabla f_i^*(x_i^*), \qquad \frac{1}{n} y^* \in \nabla\psi(u)\big|_{u = B^\top w^*}, \qquad Zx^* + By^* = 0.
⋆ Each coordinate x_i corresponds to the sample z_i.
29 / 49
Dual of a loss function
[Figure: the logistic loss and its convex conjugate.]

\ell(y, x) = \log(1 + \exp(-yx)), \qquad \ell^*(y, u) = -yu\log(-yu) + (1 + yu)\log(1 + yu) \quad (-1 \le yu \le 0).
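As a sanity check, a small NumPy snippet (ours, not from the talk) that verifies the conjugate pair numerically by brute-force maximization of ux − ℓ(y, x):

```python
import numpy as np

def logistic_loss(y, x):
    return np.logaddexp(0.0, -y * x)          # log(1 + exp(-y x)), numerically stable

def logistic_conjugate(y, u):
    """Closed form -yu*log(-yu) + (1+yu)*log(1+yu), finite only for -1 <= yu <= 0."""
    t = -y * u
    return t * np.log(t) + (1.0 - t) * np.log(1.0 - t)

y, u = 1.0, -0.3
xs = np.linspace(-50, 50, 200001)
numeric = np.max(u * xs - logistic_loss(y, xs))    # sup_x { u x - loss(y, x) }
print(numeric, logistic_conjugate(y, u))           # both approximately -0.6109
```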
30 / 49
Dual of regularization functions
[Figure: the ℓ1 and elastic-net regularization functions and their duals.]
31 / 49
Proposed algorithm
Basic algorithm
Split the index set \{1, \ldots, n\} into K groups (I_1, I_2, \ldots, I_K).
For each t = 1, 2, \ldots:
Choose k ∈ \{1, \ldots, K\} uniformly at random, and set I = I_k.

y^{(t)} \leftarrow \mathop{\mathrm{argmin}}_y \Big\{ n\psi^*(y/n) - \langle w^{(t-1)}, Zx^{(t-1)} + By \rangle + \frac{\rho}{2}\|Zx^{(t-1)} + By\|^2 + \frac{1}{2}\|y - y^{(t-1)}\|_Q^2 \Big\},

x_I^{(t)} \leftarrow \mathop{\mathrm{argmin}}_{x_I} \Big\{ \sum_{i \in I} f_i^*(x_i) - \langle w^{(t-1)}, Z_I x_I + By^{(t)} \rangle + \frac{\rho}{2}\|Z_I x_I + Z_{\setminus I} x_{\setminus I}^{(t-1)} + By^{(t)}\|^2 + \frac{1}{2}\|x_I - x_I^{(t-1)}\|^2_{G_{I,I}} \Big\},

w^{(t)} \leftarrow w^{(t-1)} - \gamma\rho\big\{ n(Zx^{(t)} + By^{(t)}) - (n - n/K)(Zx^{(t-1)} + By^{(t-1)}) \big\},

where Q and G are appropriate positive semidefinite matrices.
32 / 49
We observe only a few samples at each iteration.
Mini-batch technique (Takac et al., 2013; Shalev-Shwartz and Zhang, 2013a).
If K = n, we need to observe only one sample at each iteration.
33 / 49
Simplified algorithm
Setting Q = \rho(\eta_B I_d - B^\top B) and G_{I,I} = \rho(\eta_{Z,I} I_{|I|} - Z_I^\top Z_I),
then, by the relation \mathrm{prox}(q \mid \psi) + \mathrm{prox}(q \mid \psi^*) = q, we obtain the following update rule:
Simplified algorithm
For q^{(t)} = y^{(t-1)} + \frac{B^\top}{\rho\eta_B}\big\{ w^{(t-1)} - \rho(Zx^{(t-1)} + By^{(t-1)}) \big\}, let

y^{(t)} \leftarrow q^{(t)} - \mathrm{prox}\big(q^{(t)} \mid n\psi(\rho\eta_B \cdot)/(\rho\eta_B)\big).

For p_I^{(t)} = x_I^{(t-1)} + \frac{Z_I^\top}{\rho\eta_{Z,I}}\big\{ w^{(t-1)} - \rho(Zx^{(t-1)} + By^{(t)}) \big\}, let

x_i^{(t)} \leftarrow \mathrm{prox}\big(p_i^{(t)} \mid f_i^*/(\rho\eta_{Z,I})\big) \quad (\forall i \in I).

⋆ The update of x can be parallelized.
34 / 49
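A schematic NumPy sketch of one iteration of this simplified update, with the two proximal mappings supplied as functions by the caller; this is a rough illustration under our own naming and interface assumptions, not the authors' code:

```python
import numpy as np

def sdca_admm_step(x, y, w, Z, B, I, prox_psi_scaled, prox_f_conj,
                   rho, eta_B, eta_Z, gamma, n, K):
    """One pass for the sampled index block I (a NumPy integer array)."""
    r_prev = Z @ x + B @ y                                     # Z x^{(t-1)} + B y^{(t-1)}
    # y-update via the Moreau decomposition: q - prox(q | n*psi(rho*eta_B .)/(rho*eta_B)).
    q = y + (B.T @ (w - rho * r_prev)) / (rho * eta_B)
    y_new = q - prox_psi_scaled(q)
    # x-update on the sampled coordinates only: prox of f_i^*/(rho*eta_Z).
    p = x[I] + (Z[:, I].T @ (w - rho * (Z @ x + B @ y_new))) / (rho * eta_Z)
    x_new = x.copy()
    x_new[I] = prox_f_conj(p, I, 1.0 / (rho * eta_Z))
    # w-update with the correction term from the slide.
    r_new = Z @ x_new + B @ y_new
    w_new = w - gamma * rho * (n * r_new - (n - n / K) * r_prev)
    return x_new, y_new, w_new
```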
Convergence Analysis
x^*: the optimal variable of x.
Y^*: the set of optimal variables y (not necessarily unique).
w^*: the optimal variable of w.

Assumptions:

There exists v > 0 such that, for all x_i ∈ \mathbb{R},

f_i^*(x_i) - f_i^*(x_i^*) \ge \langle \nabla f_i^*(x_i^*), x_i - x_i^* \rangle + \frac{v}{2}\|x_i - x_i^*\|^2.

There exist h, v_\psi > 0 such that, for all y, u, there exists y^* ∈ Y^* such that

\psi^*(y/n) - \psi^*(y^*/n) \ge \langle B^\top w^*, y/n - y^*/n \rangle + \frac{v_\psi}{2}\|P_{\mathrm{Ker}(B)}(y/n - y^*/n)\|^2,

\psi(u) - \psi(B^\top w^*) \ge \langle y^*/n, u - B^\top w^* \rangle + \frac{h}{2}\|u - B^\top w^*\|^2.
B⊤ is injective.
35 / 49
F_D(x, y) := \frac{1}{n}\sum_{i=1}^n f_i^*(x_i) + \psi^*\Big(\frac{y}{n}\Big) - \Big\langle w^*, \frac{Zx}{n} - \frac{By}{n} \Big\rangle.

R_D(x, y, w) := F_D(x, y) - F_D(x^*, y^*) + \frac{1}{2n^2\gamma\rho}\|w - w^*\|^2 + \frac{\rho(1-\gamma)}{2n}\|Zx + By\|^2 + \frac{1}{2n}\|x - x^*\|^2_{vI + H} + \frac{1}{2nK}\|y - y^*\|^2_Q.
Theorem (Linear convergence of SDCA-ADMM)
Let H be a matrix such that H_{I,I} = \rho Z_I^\top Z_I + G_{I,I} for all I ∈ \{I_1, \ldots, I_K\},

\mu = \min\left\{ \frac{v}{4(v + \sigma_{\max}(H))},\; \frac{h\rho\sigma_{\min}(B^\top B)}{2\max\{1/n,\, 4h\rho,\, 4h\sigma_{\max}(Q)\}},\; \frac{K v_\psi/n}{4\sigma_{\max}(Q)},\; \frac{K v\,\sigma_{\min}(BB^\top)}{4\sigma_{\max}(Q)(\rho\sigma_{\max}(Z^\top Z) + 4v)} \right\},

\gamma = \frac{1}{4n}, and C_1 = R_D(x^{(0)}, y^{(0)}, w^{(0)}); then we have

(dual residual) \quad \mathbb{E}[R_D(x^{(t)}, y^{(t)}, w^{(t)})] \le \Big(1 - \frac{\mu}{K}\Big)^t C_1,

(primal variable) \quad \mathbb{E}[\|w^{(t)} - w^*\|^2] \le \frac{n\rho}{2}\Big(1 - \frac{\mu}{K}\Big)^t C_1.

For t \ge \frac{C'K}{\mu}\log\Big(\frac{C''n}{\epsilon}\Big), we have \mathbb{E}[\|w^{(t)} - w^*\|^2] \le \epsilon.
36 / 49
Numerical Experiments: Loss function
Binary classification.
Smoothed hinge loss:

f_i(u) = \begin{cases} 0 & (y_i u \ge 1), \\ \frac{1}{2} - y_i u & (y_i u < 0), \\ \frac{1}{2}(1 - y_i u)^2 & (\text{otherwise}). \end{cases}

[Figure: the smoothed hinge loss as a function of the margin y_i u.]
⇒ proximal mapping is analytically obtained.
\mathrm{prox}(u \mid f_i^*/C) = \begin{cases} \frac{Cu - y_i}{1 + C} & \big(-1 \le \frac{Cuy_i - 1}{1 + C} \le 0\big), \\ -y_i & \big(\frac{Cuy_i - 1}{1 + C} < -1\big), \\ 0 & (\text{otherwise}). \end{cases}
37 / 49
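A small NumPy sketch of this loss and the closed-form proximal mapping of f_i^*/C given above (our own naming, not from the slides):

```python
import numpy as np

def smoothed_hinge(u, y):
    """Smoothed hinge loss as a function of the margin y*u."""
    m = y * u
    return np.where(m >= 1, 0.0, np.where(m < 0, 0.5 - m, 0.5 * (1.0 - m) ** 2))

def prox_conj_smoothed_hinge(u, y, C):
    """Closed-form prox(u | f_i^*/C) for the smoothed hinge loss."""
    s = (C * u * y - 1.0) / (1.0 + C)
    interior = (C * u - y) / (1.0 + C)        # unconstrained minimizer
    return np.where(s < -1.0, -y, np.where(s > 0.0, 0.0, interior))

u, y = np.array([2.0, -3.0]), np.array([1.0, 1.0])
print(prox_conj_smoothed_hinge(u, y, C=1.0))   # [ 0. -1.]
```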
Numerical Experiments (Artificial data): Setting
Artificial data: Overlapped group regularization:
\psi(X) = C\Big( \sum_{i=1}^{32} \|X_{:,i}\| + \sum_{j=1}^{32} \|X_{j,:}\| + 0.01 \times \sum_{i,j} X_{i,j}^2/2 \Big), \qquad X ∈ \mathbb{R}^{32 \times 32}.
38 / 49
Numerical Experiments (Artificial data): Results
[Figure: Artificial data (n = 5,120), overlapped group lasso, mini-batch size n/K = 50. Panels (a) excess objective value, (b) test loss, (c) classification error, each vs. CPU time (s), for RDA, OPG-ADMM, RDA-ADMM, OL-ADMM, Batch-ADMM, and SDCA-ADMM.]
39 / 49
Numerical Experiments (Artificial data): Results
[Figure: Artificial data (n = 51,200), overlapped group lasso, mini-batch size n/K = 50. Panels (a) excess objective value, (b) test loss, (c) classification error, each vs. CPU time (s), for RDA, OPG-ADMM, RDA-ADMM, OL-ADMM, Batch-ADMM, and SDCA-ADMM.]
40 / 49
Numerical Experiments (Real data): Setting
Real data: Graph regularization:
\psi(w) = C_1 \sum_{i=1}^p |w_i| + C_2 \sum_{(i,j) \in E} |w_i - w_j| + 0.01 \times \Big( C_1 \sum_{i=1}^p |w_i|^2 + C_2 \sum_{(i,j) \in E} |w_i - w_j|^2 \Big),

where E is the set of edges obtained from the similarity matrix.
The similarity graph is obtained by Graphical Lasso (Yuan and Lin, 2007; Banerjee et al., 2008).
41 / 49
Numerical Experiments (Real data): Results
[Figure: Real data (20 Newsgroups, n = 12,995), graph regularization, mini-batch size n/K = 50. Panels (a) objective function, (b) test loss, (c) classification error, each vs. CPU time (s), for OPG-ADMM, RDA-ADMM, OL-ADMM, and SDCA-ADMM.]
42 / 49
Numerical Experiments (Real data): Results
[Figure: Real data (a9a, n = 32,561), graph regularization, mini-batch size n/K = 50. Panels (a) objective function, (b) test loss, (c) classification error, each vs. CPU time (s), for OPG-ADMM, RDA-ADMM, OL-ADMM, and SDCA-ADMM.]
43 / 49
Summary
New stochastic optimization methods with ADMM are proposed.
Applicable to a wide range of regularization functions.
Easy implementation.
Online setting:
RDA-ADMM, OPG-ADMM
O(1/√T) in general, and O(log(T)/T) for a strongly convex objective.
Batch setting:
SDCA-ADMM (stochastic dual coordinate ascent ADMM)
Linear convergence under strong convexity around the optimum.
The convergence rate is controlled by the number of mini-batches K (equivalently, the mini-batch size n/K).
44 / 49
References I
O. Banerjee, L. E. Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.

J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873–2908, 2009.

D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers & Mathematics with Applications, 2:17–40, 1976.

M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory & Applications, 4:303–320, 1969.
45 / 49
References II
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 315–323. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems 25, 2013.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2007.
46 / 49
References III
Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, Series B, 120:221–259, 2009.

H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, 2013.

M. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, pages 283–298. Academic Press, London, New York, 1969.

R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1:97–116, 1976.

S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, 2013a.
47 / 49
References IV
S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. Technical report, 2013b. arXiv:1211.2717.

T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning, pages 392–400, 2013.

T. Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 736–744, 2014.

M. Takac, A. Bijral, P. Richtarik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning, 2013.

H. Wang and A. Banerjee. Online alternating direction method. In Proceedings of the 29th International Conference on Machine Learning, 2012.
48 / 49
References V
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23, 2009.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
49 / 49