
Page 1: Stochastic Alternating Direction Method of Multipliers

Stochastic Alternating Direction Method of Multipliers

Taiji Suzuki

Tokyo Institute of Technology

1 / 49

Page 2: Stochastic Alternating Direction Method of Multipliers

Goal of this study

Machine learning problems with structured regularization.

$$\min_{w\in\mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w)$$

We propose stochastic optimization methods to solve the above problem.

Stochastic optimization + Alternating Direction Method of Multipliers (ADMM)

Advantageous points

Few samples are needed for each update. (We don't need to read all the samples at each iteration.)

Wide range of applications

Easy implementation

Nice convergence properties

Parallelizable

2 / 49

Page 3: Stochastic Alternating Direction Method of Multipliers

Proposed method and existing methods

Stochastic methods for regularized learning problems (normal version vs. ADMM version):

Online:

Proximal gradient type (Nesterov, 2007): FOBOS (Duchi and Singer, 2009) → Online-ADMM (Wang and Banerjee, 2012); OPG-ADMM (Suzuki, 2013; Ouyang et al., 2013)

Dual averaging type (Nesterov, 2009): RDA (Xiao, 2009) → RDA-ADMM (Suzuki, 2013)

Batch:

SDCA (Stochastic Dual Coordinate Ascent; Shalev-Shwartz and Zhang, 2013b), SAG (Stochastic Average Gradient; Le Roux et al., 2013), SVRG (Stochastic Variance Reduced Gradient; Johnson and Zhang, 2013) → SDCA-ADMM (Suzuki, 2014)

Online type: O(1/√T) in general, O(log(T)/T) for a strongly convex objective.

Batch type: linear convergence.

3 / 49

Page 4: Stochastic Alternating Direction Method of Multipliers

Outline

1 Problem settings

2 Stochastic ADMM for online data

3 Stochastic ADMM for batch data

4 / 49

Page 5: Stochastic Alternating Direction Method of Multipliers

High dimensional data analysis

Redundant information deteriorates the estimation accuracy.

(Examples: bio-informatics, text data, image data)

5 / 49

Page 7: Stochastic Alternating Direction Method of Multipliers

Sparse estimation

Cut off redundant information → sparsity

R. Tibshirani (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, Vol. 58, No. 1, pages 267–288. Citations: 10,185 (as of 25 May 2014)

6 / 49

Page 8: Stochastic Alternating Direction Method of Multipliers

Regularized learning problem

Lasso:

$$\min_{w\in\mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n (y_i - z_i^\top w)^2 + \underbrace{\|w\|_1}_{\text{regularization}}.$$

Sparsity: useful for high dimensional learning problems. Sparsity-inducing regularization is usually non-smooth.

General regularized learning problem:

$$\min_{w\in\mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \psi(w).$$

7 / 49

Page 10: Stochastic Alternating Direction Method of Multipliers

Proximal mapping

Regularized learning problem:

$$\min_{w\in\mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \psi(w).$$

$g_t \in \partial_w\big(\frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w)\big)\big|_{w=w_t}$, $\quad \bar g_t = \frac{1}{t}\sum_{\tau=1}^t g_\tau$.

Proximal gradient descent: $w_{t+1} = \operatorname*{argmin}_{w\in\mathbb{R}^p}\big\{ g_t^\top w + \psi(w) + \frac{1}{2\eta_t}\|w - w_t\|^2 \big\}$.

Regularized dual averaging: $w_{t+1} = \operatorname*{argmin}_{w\in\mathbb{R}^p}\big\{ \bar g_t^\top w + \psi(w) + \frac{1}{2\eta_t}\|w\|^2 \big\}$.

A key computation is the proximal mapping:
$$\mathrm{prox}(q\,|\,\psi) := \operatorname*{argmin}_x\Big\{ \psi(x) + \frac{1}{2}\|x - q\|^2 \Big\}.$$

8 / 49

Page 11: Stochastic Alternating Direction Method of Multipliers

Example of Proximal mapping: ℓ1 regularization

$$\mathrm{prox}(q\,|\,C\|\cdot\|_1) = \operatorname*{argmin}_x\Big\{ C\|x\|_1 + \frac{1}{2}\|x - q\|^2 \Big\} = \big(\mathrm{sign}(q_j)\max(|q_j| - C,\,0)\big)_j.$$

→ Soft-thresholding function. Analytic form!
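For concreteness, a minimal NumPy sketch of this soft-thresholding map (the function name is ours, not from the slides):

```python
import numpy as np

def prox_l1(q, C):
    """Soft-thresholding: argmin_x { C*||x||_1 + 0.5*||x - q||^2 }, coordinate-wise."""
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

# example: prox_l1(np.array([1.5, -0.7, 0.2]), C=0.5) -> array([ 1. , -0.2,  0. ])
```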

9 / 49

Page 12: Stochastic Alternating Direction Method of Multipliers

Motivation

Structured regularization with large sample

$$\min_{w\in\mathbb{R}^p}\ \underbrace{\frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w)}_{\text{large samples}} + \underbrace{\psi(w)}_{\text{complicated}}$$

Difficulties (complex regularization and many samples)

The proximal mapping cannot be easily computed:
$$\mathrm{prox}(q\,|\,\psi) := \operatorname*{argmin}_x\Big\{ \psi(x) + \frac{1}{2}\|x - q\|^2 \Big\}.$$

→ ADMM (Alternating Direction Method of Multipliers).

Computation of the loss function is heavy for a large number of samples.

→ Stochastic optimization.

10 / 49

Page 14: Stochastic Alternating Direction Method of Multipliers

Examples

Overlapped group lasso

$$\psi(w) = C\sum_{g\in\mathcal{G}} \|w_g\|$$

The groups have overlaps.

It is difficult to compute the proximal mapping.

Solution: prepare a function ψ̃ whose proximal mapping is easily computable, let ψ̃(B^⊤w) = ψ(w), and utilize the proximal mapping w.r.t. ψ̃.

[Figure: B^⊤w]

11 / 49

Page 16: Stochastic Alternating Direction Method of Multipliers

Decomposition technique

Decompose into independent groups:

B^⊤w = [figure: the overlapping coordinates of w are duplicated so that the groups become disjoint groups g′ ∈ G′]

$$\tilde\psi(y) = C\sum_{g'\in\mathcal{G}'}\|y_{g'}\|,\qquad \mathrm{prox}(q\,|\,\tilde\psi) = \Big(q_{g'}\max\Big\{1 - \frac{C}{\|q_{g'}\|},\,0\Big\}\Big)_{g'\in\mathcal{G}'}$$
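A small sketch of this group-wise (block) soft-thresholding, assuming the duplicated vector B^⊤w is passed as a list of per-group arrays (the function name and data layout are our own choices):

```python
import numpy as np

def prox_group_l1(q_groups, C):
    """Block soft-thresholding: each group q_g is shrunk to q_g * max(1 - C/||q_g||, 0)."""
    out = []
    for q in q_groups:
        norm = np.linalg.norm(q)
        scale = max(1.0 - C / norm, 0.0) if norm > 0 else 0.0
        out.append(scale * q)
    return out

# example: prox_group_l1([np.array([3.0, 4.0]), np.array([0.1, 0.1])], C=1.0)
# -> [array([2.4, 3.2]), array([0., 0.])]
```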

Graph regularization (Jacob et al., 2009),

Low rank tensor estimation (Signoretto et al., 2010; Tomioka et al., 2011)

Dictionary learning (Kasiviswanathan et al., 2012; Rakotomamonjy,2013).

12 / 49

Page 17: Stochastic Alternating Direction Method of Multipliers

Another example

Graph regularization (Jacob et al., 2009)

$$\psi(w) = C\sum_{(i,j)\in E}|w_i - w_j|.$$

$$\tilde\psi(y) = C\sum_{e\in E}|y_e|,\qquad B^\top w = (w_i - w_j)_{(i,j)\in E},\qquad \tilde\psi(B^\top w) = \psi(w),$$

$$\mathrm{prox}(q\,|\,\tilde\psi) = \Big(q_e\max\Big\{1 - \frac{C}{|q_e|},\,0\Big\}\Big)_{e\in E}.$$

Soft-Thresholding function.
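As an illustration, a short sketch of the edge-difference map B^⊤w and its prox for an explicit edge list (function names and the edge-list format are ours):

```python
import numpy as np

def bt_w(w, edges):
    """(B^T w)_e = w_i - w_j for each edge e = (i, j)."""
    return np.array([w[i] - w[j] for (i, j) in edges])

def prox_edge_l1(q, C):
    """Edge-wise soft-thresholding: q_e * max(1 - C/|q_e|, 0)."""
    return np.where(np.abs(q) > C, q - C * np.sign(q), 0.0)

# example: w = np.array([1.0, 0.2, 0.1]); bt_w(w, [(0, 1), (1, 2)]) -> [0.8, 0.1]
```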

13 / 49

Page 18: Stochastic Alternating Direction Method of Multipliers

Optimizing composite objective function

$$\min_{w}\ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w) \;\Leftrightarrow\; \min_{w,y}\ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \psi(y)\quad\text{s.t. } y = B^\top w.$$

Augmented Lagrangian

$$L(w,y,\lambda) = \frac{1}{n}\sum_i f_i(z_i^\top w) + \psi(y) + \lambda^\top(y - B^\top w) + \frac{\rho}{2}\|y - B^\top w\|^2$$

$\sup_\lambda \inf_{w,y} L(w,y,\lambda)$ yields the optimum of the original problem.

The augmented Lagrangian is the basis of the method of multipliers (Hestenes, 1969; Powell, 1969; Rockafellar, 1976).

14 / 49

Page 19: Stochastic Alternating Direction Method of Multipliers

ADMM

$$\min_{w}\ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w) \;\Leftrightarrow\; \min_{w,y}\ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \psi(y)\quad\text{s.t. } y = B^\top w.$$

$$L(w,y,\lambda) = \frac{1}{n}\sum_i f_i(z_i^\top w) + \psi(y) + \lambda^\top(y - B^\top w) + \frac{\rho}{2}\|y - B^\top w\|^2.$$

ADMM (Gabay and Mercier, 1976)

$$w_t = \operatorname*{argmin}_w\ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \lambda_{t-1}^\top(-B^\top w) + \frac{\rho}{2}\|y_{t-1} - B^\top w\|^2,$$
$$y_t = \operatorname*{argmin}_y\ \psi(y) + \lambda_{t-1}^\top y + \frac{\rho}{2}\|y - B^\top w_t\|^2 \;\big(= \mathrm{prox}(B^\top w_t - \lambda_{t-1}/\rho\,|\,\psi/\rho)\big),$$
$$\lambda_t = \lambda_{t-1} - \rho(B^\top w_t - y_t).$$

The computational difficulty of the proximal mapping is resolved.

However, the computation of $\frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w)$ is still heavy.

→ Stochastic version of ADMM.

Online type: Suzuki (2013); Wang and Banerjee (2012); Ouyang et al. (2013)

Batch type: Suzuki (2014) → linear convergence.
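To make the three ADMM steps above concrete, here is a hedged sketch of a batch ADMM loop for the special case of a squared loss, where the w-update is a linear solve; the squared loss, variable names, and prox interface are our own illustration rather than the exact setting of the slides:

```python
import numpy as np

def admm_squared_loss(Z, Y, B, prox_psi, rho=1.0, iters=100):
    """ADMM for min_w (1/n) * sum_i (Y_i - z_i^T w)^2 + psi(B^T w).

    Z: (n, p) design matrix with rows z_i; B: (p, d); prox_psi(q, tau) = prox of tau*psi at q.
    """
    n, p = Z.shape
    d = B.shape[1]
    w, y, lam = np.zeros(p), np.zeros(d), np.zeros(d)
    # The w-step minimizes a quadratic; its Hessian is fixed across iterations.
    A = (2.0 / n) * Z.T @ Z + rho * B @ B.T
    for _ in range(iters):
        # w-update: argmin_w (1/n)||Y - Zw||^2 + lam^T(y - B^T w) + (rho/2)||y - B^T w||^2
        rhs = (2.0 / n) * Z.T @ Y + B @ (lam + rho * y)
        w = np.linalg.solve(A, rhs)
        # y-update: prox of psi/rho at B^T w - lam/rho
        y = prox_psi(B.T @ w - lam / rho, 1.0 / rho)
        # multiplier update
        lam = lam - rho * (B.T @ w - y)
    return w
```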

15 / 49

Page 21: Stochastic Alternating Direction Method of Multipliers

Outline

1 Problem settings

2 Stochastic ADMM for online data

3 Stochastic ADMM for batch data

16 / 49

Page 22: Stochastic Alternating Direction Method of Multipliers

Online setting

[Figure: data stream z_{t−1}, z_t, … drawn from P(z); the iterate w_t is updated to w_{t+1} after each observation.]

$$\min_w\ \frac{1}{T}\sum_{t=1}^T f(z_t, w) + \psi(w)$$

We don’t want to read all the samples.

w_t is updated per one (or a few) observations.

17 / 49

Page 23: Stochastic Alternating Direction Method of Multipliers

Existing online optimization methods

$g_t \in \partial f_t(w_t)$, $\quad \bar g_t = \frac{1}{t}\sum_{\tau=1}^t g_\tau$ (where $f_t(w) = f(z_t, w)$).

OPG (Online Proximal Gradient Descent)
$$w_{t+1} = \operatorname*{argmin}_w\Big\{\langle g_t, w\rangle + \psi(w) + \frac{1}{2\eta_t}\|w - w_t\|^2\Big\}$$

RDA (Regularized Dual Averaging; Xiao (2009); Nesterov (2009))
$$w_{t+1} = \operatorname*{argmin}_w\Big\{\langle\bar g_t, w\rangle + \psi(w) + \frac{1}{2\eta_t}\|w\|^2\Big\}$$

These update rules are computed via the proximal mapping associated with ψ.

Efficient for a simple regularization such as ℓ1 regularization.
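For instance, with ψ(w) = C‖w‖₁ both updates reduce to soft-thresholding; a minimal sketch (function names are ours):

```python
import numpy as np

def soft_threshold(q, tau):
    return np.sign(q) * np.maximum(np.abs(q) - tau, 0.0)

def opg_step(w, g, eta, C):
    # argmin <g_t, w> + C||w||_1 + (1/(2 eta_t)) ||w - w_t||^2  =  prox(w_t - eta*g | eta*C*||.||_1)
    return soft_threshold(w - eta * g, eta * C)

def rda_step(g_bar, eta, C):
    # argmin <gbar_t, w> + C||w||_1 + (1/(2 eta_t)) ||w||^2  =  prox(-eta*gbar_t | eta*C*||.||_1)
    return soft_threshold(-eta * g_bar, eta * C)
```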

How about structured regularization? → ADMM.

18 / 49

Page 24: Stochastic Alternating Direction Method of Multipliers

Proposed method 1: RDA-ADMM

Ordinary RDA: $w_{t+1} = \operatorname*{argmin}_{w}\big\{\langle\bar g_t, w\rangle + \psi(w) + \frac{1}{2\eta_t}\|w\|^2\big\}$

RDA-ADMM

Let $\bar x_t = \frac{1}{t}\sum_{\tau=1}^t x_\tau$, $\bar\lambda_t = \frac{1}{t}\sum_{\tau=1}^t \lambda_\tau$, $\bar y_t = \frac{1}{t}\sum_{\tau=1}^t y_\tau$, $\bar g_t = \frac{1}{t}\sum_{\tau=1}^t g_\tau$.

$$x_{t+1} = \operatorname*{argmin}_{x\in\mathcal{X}}\Big\{ \bar g_t^\top x - \bar\lambda_t^\top B^\top x + \frac{\rho}{2t}\|B^\top x\|^2 + \rho(B^\top \bar x_t - \bar y_t)^\top B^\top x + \frac{1}{2\eta_t}\|x\|^2_{G_t} \Big\},$$
$$y_{t+1} = \operatorname*{argmin}_{y\in\mathcal{Y}}\Big\{ \psi(y) - \lambda_t^\top(B^\top x_{t+1} - y) + \frac{\rho}{2}\|B^\top x_{t+1} - y\|^2 \Big\},$$
$$\lambda_{t+1} = \lambda_t - \rho(B^\top x_{t+1} - y_{t+1}).$$

The update rules of $y_{t+1}$ and $\lambda_{t+1}$ are the same as in the ordinary ADMM.

$y_{t+1} = \mathrm{prox}(B^\top x_{t+1} - \lambda_t/\rho\,|\,\psi)$

19 / 49

Page 25: Stochastic Alternating Direction Method of Multipliers

Proposed method 2: OPG-ADMM

Ordinary OPG: $w_{t+1} = \operatorname*{argmin}_w\big\{\langle g_t, w\rangle + \psi(w) + \frac{1}{2\eta_t}\|w - w_t\|^2\big\}$.

OPG-ADMM

$$x_{t+1} = \operatorname*{argmin}_{x\in\mathcal{X}}\Big\{ g_t^\top x - \lambda_t^\top(B^\top x - y_t) + \frac{\rho}{2}\|B^\top x - y_t\|^2 + \frac{1}{2\eta_t}\|x - x_t\|^2_{G_t} \Big\},$$
$$y_{t+1} = \mathrm{prox}(B^\top x_{t+1} - \lambda_t/\rho\,|\,\psi),\qquad \lambda_{t+1} = \lambda_t - \rho(B^\top x_{t+1} - y_{t+1}).$$

The update rules of $y_{t+1}$ and $\lambda_{t+1}$ are the same as in the ordinary ADMM.

20 / 49

Page 26: Stochastic Alternating Direction Method of Multipliers

Simplified version

Letting $G_t$ take a specific form, the update rule is much simplified ($G_t = \gamma I - \frac{\rho\eta_t}{t}BB^\top$ for RDA-ADMM, and $G_t = \gamma I - \rho\eta_t BB^\top$ for OPG-ADMM).

Online stochastic ADMM

1. Update of x:
$$\text{(RDA-ADMM)}\quad x_{t+1} = \Pi_{\mathcal{X}}\Big[-\frac{\eta_t}{\gamma}\big\{\bar g_t - B(\bar\lambda_t - \rho B^\top \bar x_t + \rho \bar y_t)\big\}\Big],$$
$$\text{(OPG-ADMM)}\quad x_{t+1} = \Pi_{\mathcal{X}}\Big[-\frac{\eta_t}{\gamma}\big\{g_t - B(\lambda_t - \rho B^\top x_t + \rho y_t)\big\} + x_t\Big].$$

2. Update of y: $y_{t+1} = \mathrm{prox}(B^\top x_{t+1} - \lambda_t/\rho\,|\,\psi)$.

3. Update of λ: $\lambda_{t+1} = \lambda_t - \rho(B^\top x_{t+1} - y_{t+1})$.

Fast computation. Easy implementation.
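A sketch of one step of the simplified OPG-ADMM update above (for RDA-ADMM one would instead pass the averaged quantities ḡ_t, λ̄_t, x̄_t, ȳ_t and drop the + x_t term); all argument names are ours:

```python
import numpy as np

def opg_admm_step(x, y, lam, g, B, prox_psi, rho, eta, gamma, project=lambda v: v):
    """One simplified OPG-ADMM iteration; g is a stochastic (sub)gradient at x,
    prox_psi(q) = prox(q | psi), and `project` is the projection onto X (identity by default)."""
    # 1. x-update: projected gradient-like step
    x_new = project(x - (eta / gamma) * (g - B @ (lam - rho * (B.T @ x) + rho * y)))
    # 2. y-update: proximal mapping w.r.t. psi
    y_new = prox_psi(B.T @ x_new - lam / rho)
    # 3. dual update
    lam_new = lam - rho * (B.T @ x_new - y_new)
    return x_new, y_new, lam_new
```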

21 / 49

Page 27: Stochastic Alternating Direction Method of Multipliers

Convergence analysis

We bound the expected risk:

Expected risk: $F(x) = \mathbb{E}_Z[f(Z, x)] + \psi(x)$.

Assumptions:

(A1) ∃G s.t. every $g \in \partial_x f(z, x)$ satisfies $\|g\| \le G$ for all z, x.

(A2) ∃L s.t. every $g \in \partial\psi(y)$ satisfies $\|g\| \le L$ for all y.

(A3) ∃R s.t. every $x \in \mathcal{X}$ satisfies $\|x\| \le R$.

22 / 49

Page 28: Stochastic Alternating Direction Method of Multipliers

Convergence rate: bounded gradient

(A1) ∃G s.t. every $g \in \partial_x f(z,x)$ satisfies $\|g\| \le G$ for all z, x.
(A2) ∃L s.t. every $g \in \partial\psi(y)$ satisfies $\|g\| \le L$ for all y.
(A3) ∃R s.t. every $x \in \mathcal{X}$ satisfies $\|x\| \le R$.

Theorem (Convergence rate of RDA-ADMM)

Under (A1), (A2), (A3), we have
$$\mathbb{E}_{z_{1:T-1}}[F(\bar x_T) - F(x^*)] \le \frac{1}{T}\sum_{t=2}^T \frac{\eta_{t-1}}{2(t-1)}G^2 + \frac{\gamma}{\eta_T}\|x^*\|^2 + \frac{K}{T}.$$

Theorem (Convergence rate of OPG-ADMM)

Under (A1), (A2), (A3), we have
$$\mathbb{E}_{z_{1:T-1}}[F(\bar x_T) - F(x^*)] \le \frac{1}{2T}\sum_{t=2}^T \max\Big\{\frac{\gamma}{\eta_t} - \frac{\gamma}{\eta_{t-1}},\,0\Big\}R^2 + \frac{1}{T}\sum_{t=1}^T \frac{\eta_t}{2}G^2 + \frac{K}{T}.$$

Both methods have convergence rate $O\big(\frac{1}{\sqrt{T}}\big)$ by letting $\eta_t = \eta_0\sqrt{t}$ for RDA-ADMM and $\eta_t = \eta_0/\sqrt{t}$ for OPG-ADMM.

23 / 49

Page 29: Stochastic Alternating Direction Method of Multipliers

Convergence rate: strongly convex

(A4) There exist $\sigma_f, \sigma_\psi \ge 0$ and $P, Q \succeq O$ such that
$$\mathbb{E}_Z[f(Z,x) + (x'-x)^\top\nabla_x f(Z,x)] + \frac{\sigma_f}{2}\|x - x'\|_P^2 \le \mathbb{E}_Z[f(Z,x')],$$
$$\psi(y) + (y'-y)^\top\nabla\psi(y) + \frac{\sigma_\psi}{2}\|y - y'\|_Q^2 \le \psi(y'),$$
for all $x, x' \in \mathcal{X}$ and $y, y' \in \mathcal{Y}$, and ∃σ > 0 satisfying
$$\sigma I \preceq \sigma_f P + \frac{\rho\sigma_\psi}{2\rho + \sigma_\psi}Q.$$

The update rule of RDA-ADMM is modified as
$$x_{t+1} = \operatorname*{argmin}_{x\in\mathcal{X}}\Big\{ \bar g_t^\top x - \bar\lambda_t^\top B^\top x + \frac{\rho}{2t}\|B^\top x\|^2 + \rho(B^\top\bar x_t - \bar y_t)^\top B^\top x + \frac{1}{2\eta_t}\|x\|_{G_t}^2 + \frac{t\sigma}{2}\|x - \bar x_t\|^2 \Big\}.$$

24 / 49

Page 30: Stochastic Alternating Direction Method of Multipliers

Convergence analysis: strongly convex

Theorem (Convergence rate of RDA-ADMM)

Under (A1), (A2), (A3), (A4), we have
$$\mathbb{E}_{z_{1:T-1}}[F(\bar x_T) - F(x^*)] \le \frac{1}{2T}\sum_{t=2}^T \frac{G^2}{\frac{t-1}{\eta_{t-1}} + t\sigma} + \frac{\gamma}{\eta_T}\|x^*\|^2 + \frac{K}{T}.$$

Theorem (Convergence rate of OPG-ADMM)

Under (A1), (A2), (A3), (A4), we have
$$\mathbb{E}_{z_{1:T-1}}[F(\bar x_T) - F(x^*)] \le \frac{1}{2T}\sum_{t=2}^T \max\Big\{\frac{\gamma}{\eta_t} - \frac{\gamma}{\eta_{t-1}} - \sigma,\,0\Big\}R^2 + \frac{1}{T}\sum_{t=1}^T \frac{\eta_t}{2}G^2 + \frac{K}{T}.$$

Both methods have convergence rate $O\big(\frac{\log(T)}{T}\big)$ by letting $\eta_t = \eta_0 t$ for RDA-ADMM and $\eta_t = \eta_0/t$ for OPG-ADMM.

25 / 49

Page 31: Stochastic Alternating Direction Method of Multipliers

Numerical experiments

[Figure: expected risk vs. CPU time. Methods: OPG-ADMM, RDA-ADMM, NL-ADMM, RDA, OPG.]

Figure: Artificial data: 1024 dim, 512 samples, overlapped group lasso.

[Figure: classification error vs. CPU time. Methods: OPG-ADMM, RDA-ADMM, NL-ADMM, RDA.]

Figure: Real data (Adult, a9a): 15,252 dim, 32,561 samples, overlapped group lasso + ℓ1 regularization.

26 / 49

Page 32: Stochastic Alternating Direction Method of Multipliers

Outline

1 Problem settings

2 Stochastic ADMM for online data

3 Stochastic ADMM for batch data

27 / 49

Page 33: Stochastic Alternating Direction Method of Multipliers

Batch setting

In the batch setting, the data are fixed. We just minimize the objective function defined by
$$\frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w).$$

We construct a method that

observes only a few samples per iteration (like an online method),

converges linearly (unlike online methods).

28 / 49

Page 34: Stochastic Alternating Direction Method of Multipliers

Dual problem

The dual formulation is utilized to obtain our algorithm.

Primal: $\min_w\ \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w)$

Dual problem
$$\min_{x\in\mathbb{R}^n,\,y\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^n f_i^*(x_i) + \psi^*\Big(\frac{y}{n}\Big)\quad\text{s.t. } Zx + By = 0,$$
where $Z = [z_1, z_2, \ldots, z_n] \in \mathbb{R}^{p\times n}$.

Optimality condition:
$$z_i^\top w^* \in \nabla f_i^*(x_i^*),\qquad \frac{1}{n}y^* \in \nabla\psi(u)\big|_{u=B^\top w^*},\qquad Zx^* + By^* = 0.$$

⋆ Each coordinate $x_i$ corresponds to each sample $z_i$.
29 / 49

Page 35: Stochastic Alternating Direction Method of Multipliers

Dual of a loss function

[Figure: the logistic loss and its convex conjugate.]

$$\ell(y, x) = \log(1 + \exp(-yx)),\qquad \ell^*(y, u) = -yu\log(-yu) + (1 + yu)\log(1 + yu).$$
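A quick numerical check of this conjugate pair via the Fenchel–Young equality $\ell(y,x) + \ell^*(y,u) = ux$ at $u = \partial\ell/\partial x$ (function names are ours):

```python
import numpy as np

def logistic_loss(y, x):
    # l(y, x) = log(1 + exp(-y x))
    return np.log1p(np.exp(-y * x))

def logistic_conjugate(y, u):
    # l*(y, u) = (-y u) log(-y u) + (1 + y u) log(1 + y u), finite for -1 <= y u <= 0
    s = -y * u
    ent = lambda a: a * np.log(a) if a > 0 else 0.0   # 0 log 0 := 0
    return ent(s) + ent(1.0 - s)

y, x = 1.0, 0.3
u = -y / (1.0 + np.exp(y * x))          # gradient of the loss w.r.t. x
assert abs(logistic_loss(y, x) + logistic_conjugate(y, u) - u * x) < 1e-12
```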

30 / 49

Page 36: Stochastic Alternating Direction Method of Multipliers

Dual of regularization functions

[Figure: ℓ1 and elastic-net regularization functions and their duals.]

31 / 49

Page 37: Stochastic Alternating Direction Method of Multipliers

Proposed algorithm

Basic algorithm

Split the index set {1, . . . , n} into K groups $(I_1, I_2, \ldots, I_K)$. For each t = 1, 2, . . .

Choose $k \in \{1, \ldots, K\}$ uniformly at random, and set $I = I_k$.

$$y^{(t)} \leftarrow \operatorname*{argmin}_{y}\Big\{ n\psi^*(y/n) - \langle w^{(t-1)}, Zx^{(t-1)} + By\rangle + \frac{\rho}{2}\|Zx^{(t-1)} + By\|^2 + \frac{1}{2}\|y - y^{(t-1)}\|_Q^2 \Big\},$$

$$x_I^{(t)} \leftarrow \operatorname*{argmin}_{x_I}\Big\{ \sum_{i\in I} f_i^*(x_i) - \langle w^{(t-1)}, Z_I x_I + By^{(t)}\rangle + \frac{\rho}{2}\|Z_I x_I + Z_{\setminus I}x^{(t-1)}_{\setminus I} + By^{(t)}\|^2 + \frac{1}{2}\|x_I - x_I^{(t-1)}\|^2_{G_{I,I}} \Big\},$$

$$w^{(t)} \leftarrow w^{(t-1)} - \gamma\rho\big\{ n(Zx^{(t)} + By^{(t)}) - (n - n/K)(Zx^{(t-1)} + By^{(t-1)}) \big\},$$

where Q, G are some appropriate positive semidefinite matrices.
32 / 49

Page 38: Stochastic Alternating Direction Method of Multipliers

We observe only a few samples at each iteration. Mini-batch technique (Takac et al., 2013; Shalev-Shwartz and Zhang, 2013a).

If K = n, we need to observe only one sample at each iteration.
33 / 49

Page 39: Stochastic Alternating Direction Method of Multipliers

Simplified algorithm

Setting
$$Q = \rho(\eta_B I_d - B^\top B),\qquad G_{I,I} = \rho(\eta_{Z,I} I_{|I|} - Z_I^\top Z_I),$$
then, by the relation prox(q|ψ) + prox(q|ψ*) = q, we obtain the following update rule:

Simplified algorithm

For $q^{(t)} = y^{(t-1)} + \frac{B^\top}{\rho\eta_B}\{w^{(t-1)} - \rho(Zx^{(t-1)} + By^{(t-1)})\}$, let
$$y^{(t)} \leftarrow q^{(t)} - \mathrm{prox}\big(q^{(t)}\,\big|\, n\psi(\rho\eta_B\,\cdot\,)/(\rho\eta_B)\big).$$

For $p_I^{(t)} = x_I^{(t-1)} + \frac{Z_I^\top}{\rho\eta_{Z,I}}\{w^{(t-1)} - \rho(Zx^{(t-1)} + By^{(t)})\}$, let
$$x_i^{(t)} \leftarrow \mathrm{prox}\big(p_i^{(t)}\,\big|\, f_i^*/(\rho\eta_{Z,I})\big)\quad(\forall i \in I).$$

⋆ The update of x can be parallelized.
34 / 49
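The derivation above uses the Moreau decomposition prox(q|ψ) + prox(q|ψ*) = q; a small numerical sanity check for the particular choice ψ = C‖·‖₁, whose conjugate is the indicator of the ℓ∞-ball of radius C (our example, not from the slides):

```python
import numpy as np

C = 0.7
q = np.array([1.5, -0.3, 0.05, -2.0])

prox_psi = np.sign(q) * np.maximum(np.abs(q) - C, 0.0)   # soft-thresholding
prox_psi_star = np.clip(q, -C, C)                        # projection onto {|x_j| <= C}

assert np.allclose(prox_psi + prox_psi_star, q)          # Moreau decomposition holds
```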

Page 40: Stochastic Alternating Direction Method of Multipliers

Convergence Analysis

x^*: the optimal variable x
Y^*: the set of optimal variables y (not necessarily unique)
w^*: the optimal variable w

Assumptions:

There exists v > 0 such that, for all $x_i \in \mathbb{R}$,
$$f_i^*(x_i) - f_i^*(x_i^*) \ge \langle \nabla f_i^*(x_i^*),\, x_i - x_i^*\rangle + \frac{v\|x_i - x_i^*\|^2}{2}.$$

There exist $h, v_\psi > 0$ such that, for all y, u, there exists $y^* \in Y^*$ such that
$$\psi^*(y/n) - \psi^*(y^*/n) \ge \langle B^\top w^*,\, y/n - y^*/n\rangle + \frac{v_\psi}{2}\|P_{\mathrm{Ker}(B)}(y/n - y^*/n)\|^2,$$
$$\psi(u) - \psi(B^\top w^*) \ge \langle y^*/n,\, u - B^\top w^*\rangle + \frac{h}{2}\|u - B^\top w^*\|^2.$$

B⊤ is injective.

35 / 49

Page 41: Stochastic Alternating Direction Method of Multipliers

$$F_D(x,y) := \frac{1}{n}\sum_{i=1}^n f_i^*(x_i) + \psi^*\Big(\frac{y}{n}\Big) - \Big\langle w^*,\, Z\frac{x}{n} - B\frac{y}{n}\Big\rangle.$$
$$R_D(x,y,w) := F_D(x,y) - F_D(x^*,y^*) + \frac{1}{2n^2\gamma\rho}\|w - w^*\|^2 + \frac{\rho(1-\gamma)}{2n}\|Zx + By\|^2 + \frac{1}{2n}\|x - x^*\|^2_{vI_p+H} + \frac{1}{2nK}\|y - y^*\|^2_Q.$$

Theorem (Linear convergence of SDCA-ADMM)

Let H be a matrix such that $H_{I,I} = \rho Z_I^\top Z_I + G_{I,I}$ for all $I \in \{I_1, \ldots, I_K\}$,
$$\mu = \min\Big\{ \frac{v}{4(v+\sigma_{\max}(H))},\ \frac{h\rho\,\sigma_{\min}(B^\top B)}{2\max\{1/n,\,4h\rho,\,4h\sigma_{\max}(Q)\}},\ \frac{Kv_\psi/n}{4\sigma_{\max}(Q)},\ \frac{Kv\,\sigma_{\min}(BB^\top)}{4\sigma_{\max}(Q)(\rho\sigma_{\max}(Z^\top Z)+4v)} \Big\},$$
$\gamma = \frac{1}{4n}$, and $C_1 = R_D(x^{(0)}, y^{(0)}, w^{(0)})$. Then we have that

(dual residual) $\mathbb{E}[R_D(x^{(t)}, y^{(t)}, w^{(t)})] \le \big(1 - \frac{\mu}{K}\big)^t C_1$,

(primal variable) $\mathbb{E}[\|w^{(t)} - w^*\|^2] \le \frac{n\rho}{2}\big(1 - \frac{\mu}{K}\big)^t C_1$.

For $t \ge \frac{C'K}{\mu}\log\big(\frac{C''n}{\epsilon}\big)$, we have that $\mathbb{E}[\|w^{(t)} - w^*\|^2] \le \epsilon$.

36 / 49

Page 42: Stochastic Alternating Direction Method of Multipliers

Numerical Experiments: Loss function

Binary classification.

Smoothed hinge loss:
$$f_i(u) = \begin{cases} 0, & (y_i u \ge 1),\\ \frac{1}{2} - y_i u, & (y_i u < 0),\\ \frac{1}{2}(1 - y_i u)^2, & (\text{otherwise}). \end{cases}$$

[Figure: plot of the smoothed hinge loss.]

⇒ the proximal mapping is obtained analytically:
$$\mathrm{prox}(u\,|\,f_i^*/C) = \begin{cases} \dfrac{Cu - y_i}{1+C}, & \big(-1 \le \frac{Cuy_i - 1}{1+C} \le 0\big),\\[4pt] -y_i, & \big(-1 > \frac{Cuy_i - 1}{1+C}\big),\\[4pt] 0, & (\text{otherwise}). \end{cases}$$
37 / 49
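A direct NumPy transcription of this closed form, vectorized over coordinates (the function name is ours):

```python
import numpy as np

def prox_smoothed_hinge_conj(u, y, C):
    """prox(u | f_i^*/C) for the smoothed hinge loss, elementwise in (u, y) with y in {-1, +1}."""
    x = (C * u - y) / (1.0 + C)       # unconstrained minimizer (middle case)
    t = y * x                         # equals (C u y - 1)/(1 + C); must lie in [-1, 0]
    x = np.where(t < -1.0, -y, x)     # clip to y * x = -1
    x = np.where(t > 0.0, 0.0, x)     # clip to y * x = 0
    return x
```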

Page 43: Stochastic Alternating Direction Method of Multipliers

Numerical Experiments (Artificial data): Setting

Artificial data: Overlapped group regularization:

$$\psi(X) = C\Big(\sum_{i=1}^{32}\|X_{:,i}\| + \sum_{j=1}^{32}\|X_{j,:}\| + 0.01\times\sum_{i,j} X_{i,j}^2/2\Big),\qquad X \in \mathbb{R}^{32\times 32}.$$

38 / 49

Page 44: Stochastic Alternating Direction Method of Multipliers

Numerical Experiments (Artificial data): Results

[Figure panels: (a) excess objective value, (b) test loss, (c) classification error, each vs. CPU time. Methods: RDA, OPG-ADMM, RDA-ADMM, OL-ADMM, Batch-ADMM, SDCA-ADMM.]

Figure: Artificial data (n=5,120). Overlapped group lasso. Mini-batch size n/K = 50.

39 / 49

Page 45: Stochastic Alternating Direction Method of Multipliers

Numerical Experiments (Artificial data): Results

[Figure panels: (a) excess objective value, (b) test loss, (c) classification error, each vs. CPU time. Methods: RDA, OPG-ADMM, RDA-ADMM, OL-ADMM, Batch-ADMM, SDCA-ADMM.]

Figure: Artificial data (n=51,200). Overlapped group lasso. Mini-batch size n/K = 50.

40 / 49

Page 46: Stochastic Alternating Direction Method of Multipliers

Numerical Experiments (Real data): Setting

Real data: Graph regularization:

$$\psi(w) = C_1\sum_{i=1}^p |w_i| + C_2\sum_{(i,j)\in E}|w_i - w_j| + 0.01\times\Big(C_1\sum_{i=1}^p |w_i|^2 + C_2\sum_{(i,j)\in E}|w_i - w_j|^2\Big),$$
where E is the set of edges obtained from the similarity matrix. The similarity graph is obtained by Graphical Lasso (Yuan and Lin, 2007; Banerjee et al., 2008).
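A sketch of how the edge-difference block of B^⊤ could be assembled as a sparse matrix for this regularizer (the full B^⊤ would also stack an identity block for the ℓ1 part); the helper name and edge-list format are our assumptions:

```python
import numpy as np
from scipy import sparse

def edge_difference_matrix(p, edges):
    """Sparse B^T with one row per edge: (B^T w)_e = w_i - w_j for e = (i, j)."""
    rows, cols, vals = [], [], []
    for e, (i, j) in enumerate(edges):
        rows += [e, e]
        cols += [i, j]
        vals += [1.0, -1.0]
    return sparse.csr_matrix((vals, (rows, cols)), shape=(len(edges), p))

# usage: Bt = edge_difference_matrix(5, [(0, 1), (1, 2), (3, 4)]); y = Bt @ np.ones(5)
```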

41 / 49

Page 47: Stochastic Alternating Direction Method of Multipliers

Numerical Experiments (Real data): Results

[Figure panels: (a) objective function, (b) test loss, (c) classification error, each vs. CPU time. Methods: OPG-ADMM, RDA-ADMM, OL-ADMM, SDCA-ADMM.]

Figure: Real data (20 Newsgroups, n=12,995). Graph regularization. Mini-batch size n/K = 50.

42 / 49

Page 48: Stochastic Alternating Direction Method of Multipliers

Numerical Experiments (Real data): Results

[Figure panels: (a) objective function, (b) test loss, (c) classification error, each vs. CPU time. Methods: OPG-ADMM, RDA-ADMM, OL-ADMM, SDCA-ADMM.]

Figure: Real data (a9a, n=32,561). Graph regularization. Mini-batch size n/K = 50.

43 / 49

Page 49: Stochastic Alternating Direction Method of Multipliers

Summary

New stochastic optimization methods with ADMM are proposed.

Applicable to a wide range of regularization functions.

Easy implementation.

Online setting:

RDA-ADMM, OPG-ADMM: O(1/√T) in general, and O(log(T)/T) for a strongly convex objective.

Batch setting:

SDCA-ADMM (stochastic dual coordinate ascent with ADMM): linear convergence with strong convexity around the optimum. The convergence rate is controlled by the number of mini-batches K.

44 / 49

Page 50: Stochastic Alternating Direction Method of Multipliers

References I

O. Banerjee, L. E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.

J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873–2908, 2009.

D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers & Mathematics with Applications, 2:17–40, 1976.

M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory & Applications, 4:303–320, 1969.

45 / 49

Page 51: Stochastic Alternating Direction Method of Multipliers

References II

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 315–323. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems 25, 2013.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2007.

46 / 49

Page 52: Stochastic Alternating Direction Method of Multipliers

References III

Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, Series B, 120:221–259, 2009.

H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, 2013.

M. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, pages 283–298. Academic Press, London, New York, 1969.

R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1:97–116, 1976.

S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, 2013a.

47 / 49

Page 53: Stochastic Alternating Direction Method of Multipliers

References IV

S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. Technical report, 2013b. arXiv:1211.2717.

T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning, pages 392–400, 2013.

T. Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 736–744, 2014.

M. Takac, A. Bijral, P. Richtarik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning, 2013.

H. Wang and A. Banerjee. Online alternating direction method. In Proceedings of the 29th International Conference on Machine Learning, 2012.

48 / 49

Page 54: Stochastic Alternating Direction Method of Multipliers

References V

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23, 2009.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.

49 / 49