Stochastic Alternating Direction Method of Multipliers
Taiji Suzuki
Tokyo Institute of Technology
1 / 49
Goal of this study
Machine learning problems with a structured regularization:

\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w)

We propose stochastic optimization methods to solve the above problem:
Stochastic optimization + Alternating Direction Method of Multipliers (ADMM).
Advantageous points
Few samples are needed for each update (we don't need to read all the samples at each iteration).
Wide range of applications
Easy implementation
Nice convergence properties
Parallelizable
2 / 49
Proposed method and existing methods
Stochastic methods for regularized learning problems (normal method → ADMM counterpart):

Online:
- Proximal gradient type (Nesterov, 2007): FOBOS (Duchi and Singer, 2009) → OPG-ADMM (Suzuki, 2013; Ouyang et al., 2013)
- Dual averaging type (Nesterov, 2009): RDA (Xiao, 2009) → RDA-ADMM (Suzuki, 2013)
- Online-ADMM (Wang and Banerjee, 2012)

Batch:
- SDCA (Stochastic Dual Coordinate Ascent; Shalev-Shwartz and Zhang, 2013b) → SDCA-ADMM (Suzuki, 2014)
- SAG (Stochastic Averaging Gradient; Le Roux et al., 2013)
- SVRG (Stochastic Variance Reduced Gradient; Johnson and Zhang, 2013)

Online type: O(1/√T) in general, O(log(T)/T) for a strongly convex objective.
Batch type: linear convergence.
3 / 49
Outline
1 Problem settings
2 Stochastic ADMM for online data
3 Stochastic ADMM for batch data
4 / 49
High dimensional data analysis
Redundant information deteriorates the estimation accuracy.
[Figure: examples of high dimensional data — bio-informatics, text data, image data.]
5 / 49
Sparse estimation
Cut off redundant information → sparsity
R. Tibshirani (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, Vol. 58, No. 1, pages 267–288. Citations: 10,185 (as of 25 May 2014).
6 / 49
Regularized learning problem
Lasso:
\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n (y_i - z_i^\top w)^2 + \underbrace{\|w\|_1}_{\text{regularization}}
Sparsity: useful for high dimensional learning problems.
Sparsity-inducing regularization is usually non-smooth.
General regularized learning problem:
\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(w).
7 / 49
Proximal mapping
Regularized learning problem:
\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(w).

g_t \in \partial_w \Big( \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) \Big)\Big|_{w = w_t}, \qquad \bar{g}_t = \frac{1}{t} \sum_{\tau=1}^t g_\tau.

Proximal gradient descent: w_{t+1} = \mathop{\mathrm{argmin}}_{w \in \mathbb{R}^p} \Big\{ g_t^\top w + \psi(w) + \frac{1}{2\eta_t}\|w - w_t\|^2 \Big\}.

Regularized dual averaging: w_{t+1} = \mathop{\mathrm{argmin}}_{w \in \mathbb{R}^p} \Big\{ \bar{g}_t^\top w + \psi(w) + \frac{1}{2\eta_t}\|w\|^2 \Big\}.
A key computation is the proximal mapping:
\mathrm{prox}(q \mid \psi) := \mathop{\mathrm{argmin}}_x \Big\{ \psi(x) + \frac{1}{2}\|x - q\|^2 \Big\}.
8 / 49
Example of Proximal mapping: ℓ1 regularization
\mathrm{prox}(q \mid C\|\cdot\|_1) = \mathop{\mathrm{argmin}}_x \Big\{ C\|x\|_1 + \frac{1}{2}\|x - q\|^2 \Big\} = \big(\mathrm{sign}(q_j)\max(|q_j| - C, 0)\big)_j.
→ Soft-thresholding function. Analytic form!
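As a quick illustration, here is a minimal NumPy sketch of this soft-thresholding proximal mapping (the function and variable names are ours, not from the slides):

```python
import numpy as np

def prox_l1(q, C):
    """Proximal mapping of C*||.||_1: elementwise soft-thresholding."""
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

# Entries with |q_j| <= C are set exactly to zero; the rest are shrunk toward zero.
q = np.array([3.0, -0.5, 1.2, -2.0])
print(prox_l1(q, C=1.0))  # approx [2.0, 0.0, 0.2, -1.0]
```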
9 / 49
Motivation
Structured regularization with a large number of samples:

\min_{w \in \mathbb{R}^p} \underbrace{\frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w)}_{\text{large samples}} + \underbrace{\psi(w)}_{\text{complicated}}
Difficulties (complex regularization and many samples)
The proximal mapping cannot be easily computed:

\mathrm{prox}(q \mid \psi) := \mathop{\mathrm{argmin}}_x \Big\{ \psi(x) + \frac{1}{2}\|x - q\|^2 \Big\}.

→ ADMM (Alternating Direction Method of Multipliers).

Computation of the loss term is heavy for a large number of samples.
→ Stochastic optimization.
10 / 49
Examples
Overlapped group lasso
\psi(w) = C \sum_{g \in \mathcal{G}} \|w_g\|

The groups have overlaps.
It is difficult to compute the proximal mapping.

Solution: prepare a function \tilde{\psi} whose proximal mapping is easily computable.
Let \tilde{\psi}(B^\top w) = \psi(w), and utilize the proximal mapping w.r.t. \tilde{\psi}.
11 / 49
Decomposition technique
Decompose into independent groups:
B^\top w duplicates the overlapping coordinates so that the resulting groups \mathcal{G}' are disjoint:

\tilde{\psi}(y) = C \sum_{g' \in \mathcal{G}'} \|y_{g'}\|, \qquad \mathrm{prox}(q \mid \tilde{\psi}) = \Big( q_{g'} \max\Big\{ 1 - \frac{C}{\|q_{g'}\|},\, 0 \Big\} \Big)_{g' \in \mathcal{G}'}
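As an illustration of this group-wise proximal mapping over disjoint groups, a small NumPy sketch (our own naming; the groups are assumed to be given as index lists):

```python
import numpy as np

def prox_group_lasso(q, groups, C):
    """prox of C * sum_{g'} ||y_{g'}|| for disjoint groups: blockwise shrinkage."""
    y = q.copy()
    for g in groups:                          # g: indices of one (non-overlapping) group
        norm = np.linalg.norm(q[g])
        scale = max(1.0 - C / norm, 0.0) if norm > 0 else 0.0
        y[g] = scale * q[g]
    return y

# Two disjoint groups: the first is shrunk, the second is zeroed out entirely.
q = np.array([3.0, 4.0, 0.1, 0.2])
print(prox_group_lasso(q, groups=[[0, 1], [2, 3]], C=1.0))  # approx [2.4, 3.2, 0.0, 0.0]
```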
Graph regularization (Jacob et al., 2009),
Low rank tensor estimation (Signoretto et al., 2010; Tomioka et al., 2011),
Dictionary learning (Kasiviswanathan et al., 2012; Rakotomamonjy, 2013).
12 / 49
Another example
Graph regularization (Jacob et al., 2009)
\psi(w) = C \sum_{(i,j) \in E} |w_i - w_j|.

\tilde{\psi}(y) = C \sum_{e \in E} |y_e|, \qquad B^\top w = (w_i - w_j)_{(i,j) \in E}

\;\Rightarrow\; \tilde{\psi}(B^\top w) = \psi(w), \qquad \mathrm{prox}(q \mid \tilde{\psi}) = \Big( q_e \max\Big\{ 1 - \frac{C}{|q_e|},\, 0 \Big\} \Big)_{e \in E}.
Soft-Thresholding function.
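To make the roles of B and the edge-wise soft-thresholding concrete, a short NumPy sketch for a toy graph (all names are our own illustration):

```python
import numpy as np

def edge_differences(w, edges):
    """B^T w: one entry w_i - w_j per edge (i, j) of the graph."""
    return np.array([w[i] - w[j] for (i, j) in edges])

def prox_edge_l1(q, C):
    """prox of C * sum_e |y_e|: elementwise soft-thresholding over the edges."""
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

w = np.array([1.0, 0.2, -0.5])
edges = [(0, 1), (1, 2)]
y = edge_differences(w, edges)     # B^T w = [0.8, 0.7]
print(prox_edge_l1(y, C=0.5))      # [0.3, 0.2]
```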
13 / 49
Optimizing composite objective function
\min_w \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w) \;\Leftrightarrow\; \min_{w, y} \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(y) \quad \text{s.t. } y = B^\top w.
Augmented Lagrangian
L(w, y, \lambda) = \frac{1}{n} \sum_i f_i(z_i^\top w) + \psi(y) + \lambda^\top(y - B^\top w) + \frac{\rho}{2}\|y - B^\top w\|^2

\sup_\lambda \inf_{w, y} L(w, y, \lambda) yields the optimum of the original problem.
The augmented Lagrangian is the basis of the method of multipliers (Hestenes, 1969; Powell, 1969; Rockafellar, 1976).
14 / 49
ADMM
\min_w \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w) \;\Leftrightarrow\; \min_{w, y} \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(y) \quad \text{s.t. } y = B^\top w.

L(w, y, \lambda) = \frac{1}{n} \sum_i f_i(z_i^\top w) + \psi(y) + \lambda^\top(y - B^\top w) + \frac{\rho}{2}\|y - B^\top w\|^2.
ADMM (Gabay and Mercier, 1976)
w_t = \mathop{\mathrm{argmin}}_w \; \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \lambda_{t-1}^\top(-B^\top w) + \frac{\rho}{2}\|y_{t-1} - B^\top w\|^2

y_t = \mathop{\mathrm{argmin}}_y \; \psi(y) + \lambda_{t-1}^\top y + \frac{\rho}{2}\|y - B^\top w_t\|^2 \quad \big(= \mathrm{prox}(B^\top w_t - \lambda_{t-1}/\rho \mid \psi/\rho)\big)

\lambda_t = \lambda_{t-1} - \rho(B^\top w_t - y_t)
The computational difficulty of proximal mapping is resolved.
However, the computation of \frac{1}{n}\sum_{i=1}^n f_i(z_i^\top w) is still heavy.
→ Stochastic version of ADMM.
Online type: Suzuki (2013); Wang and Banerjee (2012); Ouyang et al. (2013)
Batch type: Suzuki (2014) → linear convergence.
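To make the iteration concrete, here is a minimal NumPy sketch of these batch ADMM updates for the special case of a squared loss and ψ = C‖·‖₁ (so the w-step is a linear solve and the y-step is soft-thresholding); the function and variable names are our own illustration, not the authors' implementation:

```python
import numpy as np

def admm_squared_loss_l1(Z, a, B, C, rho=1.0, iters=100):
    """ADMM for  min_w (1/2n)||a - Z^T w||^2 + C ||B^T w||_1,
    with Z = [z_1, ..., z_n] in R^{p x n}, targets a in R^n, B in R^{p x d}."""
    p, n = Z.shape
    d = B.shape[1]
    w, y, lam = np.zeros(p), np.zeros(d), np.zeros(d)
    A = Z @ Z.T / n + rho * B @ B.T          # system matrix of the closed-form w-step
    for _ in range(iters):
        # w-step: minimize the loss plus the augmented-Lagrangian terms (linear solve here).
        w = np.linalg.solve(A, Z @ a / n + B @ (lam + rho * y))
        # y-step: prox(B^T w - lam/rho | psi/rho) = soft-thresholding at level C/rho.
        q = B.T @ w - lam / rho
        y = np.sign(q) * np.maximum(np.abs(q) - C / rho, 0.0)
        # dual update.
        lam = lam - rho * (B.T @ w - y)
    return w
```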
15 / 49
Outline
1 Problem settings
2 Stochastic ADMM for online data
3 Stochastic ADMM for batch data
16 / 49
Online setting
[Figure: data z_1, z_2, \ldots drawn from P(z); the iterate w_t is updated to w_{t+1} after each observation.]

\min_w \frac{1}{T} \sum_{t=1}^T f(z_t, w) + \psi(w)
We don’t want to read all the samples.
w_t is updated per one (or a few) observations.
17 / 49
Existing online optimization methods

g_t \in \partial f_t(w_t), \qquad \bar{g}_t = \frac{1}{t} \sum_{\tau=1}^t g_\tau \qquad (f_t(w) = f(z_t, w))
OPG (Online Proximal Gradient Descent)
w_{t+1} = \mathop{\mathrm{argmin}}_w \Big\{ \langle g_t, w \rangle + \psi(w) + \frac{1}{2\eta_t}\|w - w_t\|^2 \Big\}
RDA (Regularized Dual Averaging; Xiao (2009); Nesterov (2009))
w_{t+1} = \mathop{\mathrm{argmin}}_w \Big\{ \langle \bar{g}_t, w \rangle + \psi(w) + \frac{1}{2\eta_t}\|w\|^2 \Big\}

These update rules are computed via the proximal mapping associated with ψ.
Efficient for simple regularizations such as the ℓ1 regularization.
How about structured regularization? → ADMM.
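For the special case ψ(w) = C‖w‖₁, both updates reduce to soft-thresholding; a minimal sketch of one step of each (our own function names, not from the slides):

```python
import numpy as np

def soft_threshold(q, tau):
    return np.sign(q) * np.maximum(np.abs(q) - tau, 0.0)

def opg_step(w, g, eta, C):
    """OPG with psi = C||.||_1: prox step around the current iterate w_t."""
    return soft_threshold(w - eta * g, eta * C)

def rda_step(g_bar, eta, C):
    """RDA with psi = C||.||_1: prox step driven by the averaged gradient."""
    return soft_threshold(-eta * g_bar, eta * C)
```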
18 / 49
Proposed method 1: RDA-ADMM
Ordinary RDA: w_{t+1} = \mathop{\mathrm{argmin}}_w \big\{ \langle \bar{g}_t, w \rangle + \psi(w) + \frac{1}{2\eta_t}\|w\|^2 \big\}

RDA-ADMM
Let \bar{x}_t = \frac{1}{t}\sum_{\tau=1}^t x_\tau, \; \bar{\lambda}_t = \frac{1}{t}\sum_{\tau=1}^t \lambda_\tau, \; \bar{y}_t = \frac{1}{t}\sum_{\tau=1}^t y_\tau, \; \bar{g}_t = \frac{1}{t}\sum_{\tau=1}^t g_\tau.

x_{t+1} = \mathop{\mathrm{argmin}}_{x \in X} \Big\{ \bar{g}_t^\top x - \bar{\lambda}_t^\top B^\top x + \frac{\rho}{2t}\|B^\top x\|^2 + \rho(B^\top \bar{x}_t - \bar{y}_t)^\top B^\top x + \frac{1}{2\eta_t}\|x\|^2_{G_t} \Big\},

y_{t+1} = \mathop{\mathrm{argmin}}_{y \in Y} \Big\{ \psi(y) - \lambda_t^\top(B^\top x_{t+1} - y) + \frac{\rho}{2}\|B^\top x_{t+1} - y\|^2 \Big\},

\lambda_{t+1} = \lambda_t - \rho(B^\top x_{t+1} - y_{t+1}).
Given x_{t+1}, the update rules of y_{t+1} and \lambda_{t+1} are the same as in the ordinary ADMM:

y_{t+1} = \mathrm{prox}(B^\top x_{t+1} - \lambda_t/\rho \mid \psi/\rho)
19 / 49
Proposed method 2: OPG-ADMM
Ordinary OPG: w_{t+1} = \mathop{\mathrm{argmin}}_w \big\{ \langle g_t, w \rangle + \psi(w) + \frac{1}{2\eta_t}\|w - w_t\|^2 \big\}.
OPG-ADMM
x_{t+1} = \mathop{\mathrm{argmin}}_{x \in X} \Big\{ g_t^\top x - \lambda_t^\top(B^\top x - y_t) + \frac{\rho}{2}\|B^\top x - y_t\|^2 + \frac{1}{2\eta_t}\|x - x_t\|^2_{G_t} \Big\},

y_{t+1} = \mathrm{prox}(B^\top x_{t+1} - \lambda_t/\rho \mid \psi/\rho),

\lambda_{t+1} = \lambda_t - \rho(B^\top x_{t+1} - y_{t+1}).

Given x_{t+1}, the update rules of y_{t+1} and \lambda_{t+1} are the same as in the ordinary ADMM.
20 / 49
Simplified version

Letting G_t take a specific form, the update rule is much simplified
(G_t = \gamma I - \frac{\rho\eta_t}{t} BB^\top for RDA-ADMM, and G_t = \gamma I - \rho\eta_t BB^\top for OPG-ADMM).

Online stochastic ADMM

1. Update of x:
(RDA-ADMM) x_{t+1} = \Pi_X\Big[ -\frac{\eta_t}{\gamma}\big\{\bar{g}_t - B(\bar{\lambda}_t - \rho B^\top \bar{x}_t + \rho \bar{y}_t)\big\} \Big],
(OPG-ADMM) x_{t+1} = \Pi_X\Big[ -\frac{\eta_t}{\gamma}\big\{g_t - B(\lambda_t - \rho B^\top x_t + \rho y_t)\big\} + x_t \Big].

2. Update of y:
y_{t+1} = \mathrm{prox}(B^\top x_{t+1} - \lambda_t/\rho \mid \psi/\rho).

3. Update of λ:
\lambda_{t+1} = \lambda_t - \rho(B^\top x_{t+1} - y_{t+1}).
Fast computation. Easy implementation.
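As an illustration, a compact NumPy sketch of one pass of the simplified OPG-ADMM update above, with ψ = C‖·‖₁ and the projection Π_X omitted (i.e., X = R^p); all names are ours, not the authors' code:

```python
import numpy as np

def soft_threshold(q, tau):
    return np.sign(q) * np.maximum(np.abs(q) - tau, 0.0)

def opg_admm_step(x, y, lam, g, B, rho, eta, gamma, C):
    """One OPG-ADMM iteration; g is a stochastic (sub)gradient of the loss at x."""
    # x-update: simplified gradient-style step (Pi_X omitted).
    x_new = x - (eta / gamma) * (g - B @ (lam - rho * (B.T @ x) + rho * y))
    # y-update: prox(B^T x_{t+1} - lam/rho | psi/rho) = soft-thresholding at C/rho.
    y_new = soft_threshold(B.T @ x_new - lam / rho, C / rho)
    # dual update.
    lam_new = lam - rho * (B.T @ x_new - y_new)
    return x_new, y_new, lam_new
```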
21 / 49
Convergence analysis
We bound the expected risk:
Expected risk: F(x) = \mathbb{E}_Z[f(Z, x)] + \psi(x).

Assumptions:
(A1) ∃G s.t. every g ∈ \partial_x f(z, x) satisfies \|g\| \le G for all z, x.
(A2) ∃L s.t. every g ∈ \partial\psi(y) satisfies \|g\| \le L for all y.
(A3) ∃R s.t. every x ∈ X satisfies \|x\| \le R.
22 / 49
Convergence rate: bounded gradient

(A1) ∃G s.t. every g ∈ \partial_x f(z, x) satisfies \|g\| \le G for all z, x.
(A2) ∃L s.t. every g ∈ \partial\psi(y) satisfies \|g\| \le L for all y.
(A3) ∃R s.t. every x ∈ X satisfies \|x\| \le R.
Theorem (Convergence rate of RDA-ADMM)
Under (A1), (A2), (A3), we have
\mathbb{E}_{z_{1:T-1}}[F(\bar{x}_T) - F(x^*)] \le \frac{1}{T} \sum_{t=2}^T \frac{\eta_{t-1}}{2(t-1)} G^2 + \frac{\gamma}{\eta_T}\|x^*\|^2 + \frac{K}{T}.
Theorem (Convergence rate of OPG-ADMM)
Under (A1), (A2), (A3), we have

\mathbb{E}_{z_{1:T-1}}[F(\bar{x}_T) - F(x^*)] \le \frac{1}{2T} \sum_{t=2}^T \max\Big\{\frac{\gamma}{\eta_t} - \frac{\gamma}{\eta_{t-1}},\, 0\Big\} R^2 + \frac{1}{T} \sum_{t=1}^T \frac{\eta_t}{2} G^2 + \frac{K}{T}.

Both methods have convergence rate O\big(\frac{1}{\sqrt{T}}\big) by letting \eta_t = \eta_0\sqrt{t} for RDA-ADMM and \eta_t = \eta_0/\sqrt{t} for OPG-ADMM.
23 / 49
Convergence rate: strongly convex
(A4) There exist \sigma_f, \sigma_\psi \ge 0 and P, Q \succeq O such that

\mathbb{E}_Z[f(Z, x) + (x' - x)^\top \nabla_x f(Z, x)] + \frac{\sigma_f}{2}\|x - x'\|_P^2 \le \mathbb{E}_Z[f(Z, x')],

\psi(y) + (y' - y)^\top \nabla\psi(y) + \frac{\sigma_\psi}{2}\|y - y'\|_Q^2 \le \psi(y'),

for all x, x' ∈ X and y, y' ∈ Y, and ∃\sigma > 0 satisfying

\sigma I \preceq \sigma_f P + \frac{\rho\sigma_\psi}{2\rho + \sigma_\psi} Q.

The update rule of RDA-ADMM is modified as

x_{t+1} = \mathop{\mathrm{argmin}}_{x \in X} \Big\{ \bar{g}_t^\top x - \bar{\lambda}_t^\top B^\top x + \frac{\rho}{2t}\|B^\top x\|^2 + \rho(B^\top \bar{x}_t - \bar{y}_t)^\top B^\top x + \frac{1}{2\eta_t}\|x\|^2_{G_t} + \frac{\sigma}{2}\|x - \bar{x}_t\|^2 \Big\}.
24 / 49
Convergence analysis: strongly convex
Theorem (Convergence rate of RDA-ADMM)
Under (A1), (A2), (A3), (A4), we have
\mathbb{E}_{z_{1:T-1}}[F(\bar{x}_T) - F(x^*)] \le \frac{1}{2T} \sum_{t=2}^T \frac{G^2}{\frac{t-1}{\eta_{t-1}} + t\sigma} + \frac{\gamma}{\eta_T}\|x^*\|^2 + \frac{K}{T}.
Theorem (Convergence rate of OPG-ADMM)
Under (A1), (A2), (A3), (A4), we have

\mathbb{E}_{z_{1:T-1}}[F(\bar{x}_T) - F(x^*)] \le \frac{1}{2T} \sum_{t=2}^T \max\Big\{\frac{\gamma}{\eta_t} - \frac{\gamma}{\eta_{t-1}} - \sigma,\, 0\Big\} R^2 + \frac{1}{T} \sum_{t=1}^T \frac{\eta_t}{2} G^2 + \frac{K}{T}.

Both methods have convergence rate O\big(\frac{\log(T)}{T}\big) by letting \eta_t = \eta_0 t for RDA-ADMM and \eta_t = \eta_0/t for OPG-ADMM.
25 / 49
Numerical experiments
[Figure: Artificial data (1024 dimensions, 512 samples), overlapped group lasso. Expected risk vs. CPU time (s) for OPG-ADMM, RDA-ADMM, NL-ADMM, RDA, and OPG.]

[Figure: Real data (Adult, a9a; 15,252 dimensions, 32,561 samples), overlapped group lasso + ℓ1 regularization. Classification error vs. CPU time (s) for OPG-ADMM, RDA-ADMM, NL-ADMM, and RDA.]
26 / 49
Outline
1 Problem settings
2 Stochastic ADMM for online data
3 Stochastic ADMM for batch data
27 / 49
Batch setting
In the batch setting, the data are fixed. We just minimize the objective function defined by

\frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w).

We construct a method that
observes only a few samples per iteration (like an online method), and
converges linearly (unlike online methods).
28 / 49
Dual problem
The dual formulation is utilized to obtain our algorithm.
Primal: \min_w \frac{1}{n} \sum_{i=1}^n f_i(z_i^\top w) + \psi(B^\top w)

Dual problem

\min_{x \in \mathbb{R}^n,\, y \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n f_i^*(x_i) + \psi^*\Big(\frac{y}{n}\Big) \quad \text{s.t. } Zx + By = 0,

where Z = [z_1, z_2, \ldots, z_n] ∈ \mathbb{R}^{p \times n}.

Optimality condition:

z_i^\top w^* \in \nabla f_i^*(x_i^*), \qquad \frac{1}{n} y^* \in \nabla\psi(u)\big|_{u = B^\top w^*}, \qquad Zx^* + By^* = 0.
⋆ Each coordinate x_i corresponds to the sample z_i.
29 / 49
Dual of a loss function
[Figure: the logistic loss and its convex conjugate.]

\ell(y, x) = \log(1 + \exp(-yx)), \qquad \ell^*(y, u) = -yu\log(-yu) + (1 + yu)\log(1 + yu) \quad (-1 \le yu \le 0).
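As a sanity check, a small NumPy snippet (ours, not from the talk) that verifies the conjugate pair numerically by brute-force maximization of ux − ℓ(y, x):

```python
import numpy as np

def logistic_loss(y, x):
    return np.logaddexp(0.0, -y * x)          # log(1 + exp(-y x)), numerically stable

def logistic_conjugate(y, u):
    """Closed form -yu*log(-yu) + (1+yu)*log(1+yu), finite only for -1 <= yu <= 0."""
    t = -y * u
    return t * np.log(t) + (1.0 - t) * np.log(1.0 - t)

y, u = 1.0, -0.3
xs = np.linspace(-50, 50, 200001)
numeric = np.max(u * xs - logistic_loss(y, xs))    # sup_x { u x - loss(y, x) }
print(numeric, logistic_conjugate(y, u))           # both approximately -0.6109
```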
30 / 49
Dual of regularization functions
[Figure: the ℓ1 and elastic-net regularization functions and their duals.]
31 / 49
Proposed algorithm
Basic algorithm
Split the index set \{1, \ldots, n\} into K groups (I_1, I_2, \ldots, I_K).
For each t = 1, 2, \ldots:
Choose k ∈ \{1, \ldots, K\} uniformly at random, and set I = I_k.

y^{(t)} \leftarrow \mathop{\mathrm{argmin}}_y \Big\{ n\psi^*(y/n) - \langle w^{(t-1)}, Zx^{(t-1)} + By \rangle + \frac{\rho}{2}\|Zx^{(t-1)} + By\|^2 + \frac{1}{2}\|y - y^{(t-1)}\|_Q^2 \Big\},

x_I^{(t)} \leftarrow \mathop{\mathrm{argmin}}_{x_I} \Big\{ \sum_{i \in I} f_i^*(x_i) - \langle w^{(t-1)}, Z_I x_I + By^{(t)} \rangle + \frac{\rho}{2}\|Z_I x_I + Z_{\setminus I} x_{\setminus I}^{(t-1)} + By^{(t)}\|^2 + \frac{1}{2}\|x_I - x_I^{(t-1)}\|^2_{G_{I,I}} \Big\},

w^{(t)} \leftarrow w^{(t-1)} - \gamma\rho\big\{ n(Zx^{(t)} + By^{(t)}) - (n - n/K)(Zx^{(t-1)} + By^{(t-1)}) \big\},

where Q and G are appropriate positive semidefinite matrices.
32 / 49
We observe only a few samples at each iteration.
Mini-batch technique (Takac et al., 2013; Shalev-Shwartz and Zhang, 2013a).
If K = n, we need to observe only one sample at each iteration.
33 / 49
Simplified algorithm
Setting Q = \rho(\eta_B I_d - B^\top B) and G_{I,I} = \rho(\eta_{Z,I} I_{|I|} - Z_I^\top Z_I),
then, by the relation \mathrm{prox}(q \mid \psi) + \mathrm{prox}(q \mid \psi^*) = q, we obtain the following update rule:
Simplified algorithm
For q^{(t)} = y^{(t-1)} + \frac{B^\top}{\rho\eta_B}\big\{ w^{(t-1)} - \rho(Zx^{(t-1)} + By^{(t-1)}) \big\}, let

y^{(t)} \leftarrow q^{(t)} - \mathrm{prox}\big(q^{(t)} \mid n\psi(\rho\eta_B \cdot)/(\rho\eta_B)\big).

For p_I^{(t)} = x_I^{(t-1)} + \frac{Z_I^\top}{\rho\eta_{Z,I}}\big\{ w^{(t-1)} - \rho(Zx^{(t-1)} + By^{(t)}) \big\}, let

x_i^{(t)} \leftarrow \mathrm{prox}\big(p_i^{(t)} \mid f_i^*/(\rho\eta_{Z,I})\big) \quad (\forall i \in I).

⋆ The update of x can be parallelized.
34 / 49
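A schematic NumPy sketch of one iteration of this simplified update, with the two proximal mappings supplied as functions by the caller; this is a rough illustration under our own naming and interface assumptions, not the authors' code:

```python
import numpy as np

def sdca_admm_step(x, y, w, Z, B, I, prox_psi_scaled, prox_f_conj,
                   rho, eta_B, eta_Z, gamma, n, K):
    """One pass for the sampled index block I (a NumPy integer array)."""
    r_prev = Z @ x + B @ y                                     # Z x^{(t-1)} + B y^{(t-1)}
    # y-update via the Moreau decomposition: q - prox(q | n*psi(rho*eta_B .)/(rho*eta_B)).
    q = y + (B.T @ (w - rho * r_prev)) / (rho * eta_B)
    y_new = q - prox_psi_scaled(q)
    # x-update on the sampled coordinates only: prox of f_i^*/(rho*eta_Z).
    p = x[I] + (Z[:, I].T @ (w - rho * (Z @ x + B @ y_new))) / (rho * eta_Z)
    x_new = x.copy()
    x_new[I] = prox_f_conj(p, I, 1.0 / (rho * eta_Z))
    # w-update with the correction term from the slide.
    r_new = Z @ x_new + B @ y_new
    w_new = w - gamma * rho * (n * r_new - (n - n / K) * r_prev)
    return x_new, y_new, w_new
```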
Convergence Analysis
x^*: the optimal variable of x.
Y^*: the set of optimal variables y (not necessarily unique).
w^*: the optimal variable of w.

Assumptions:

There exists v > 0 such that, for all x_i ∈ \mathbb{R},

f_i^*(x_i) - f_i^*(x_i^*) \ge \langle \nabla f_i^*(x_i^*), x_i - x_i^* \rangle + \frac{v}{2}\|x_i - x_i^*\|^2.

There exist h, v_\psi > 0 such that, for all y, u, there exists y^* ∈ Y^* such that

\psi^*(y/n) - \psi^*(y^*/n) \ge \langle B^\top w^*, y/n - y^*/n \rangle + \frac{v_\psi}{2}\|P_{\mathrm{Ker}(B)}(y/n - y^*/n)\|^2,

\psi(u) - \psi(B^\top w^*) \ge \langle y^*/n, u - B^\top w^* \rangle + \frac{h}{2}\|u - B^\top w^*\|^2.
B⊤ is injective.
35 / 49
F_D(x, y) := \frac{1}{n}\sum_{i=1}^n f_i^*(x_i) + \psi^*\Big(\frac{y}{n}\Big) - \Big\langle w^*, \frac{Zx}{n} - \frac{By}{n} \Big\rangle.

R_D(x, y, w) := F_D(x, y) - F_D(x^*, y^*) + \frac{1}{2n^2\gamma\rho}\|w - w^*\|^2 + \frac{\rho(1-\gamma)}{2n}\|Zx + By\|^2 + \frac{1}{2n}\|x - x^*\|^2_{vI + H} + \frac{1}{2nK}\|y - y^*\|^2_Q.
Theorem (Linear convergence of SDCA-ADMM)
Let H be a matrix such that H_{I,I} = \rho Z_I^\top Z_I + G_{I,I} for all I ∈ \{I_1, \ldots, I_K\},

\mu = \min\left\{ \frac{v}{4(v + \sigma_{\max}(H))},\; \frac{h\rho\sigma_{\min}(B^\top B)}{2\max\{1/n,\, 4h\rho,\, 4h\sigma_{\max}(Q)\}},\; \frac{K v_\psi/n}{4\sigma_{\max}(Q)},\; \frac{K v\,\sigma_{\min}(BB^\top)}{4\sigma_{\max}(Q)(\rho\sigma_{\max}(Z^\top Z) + 4v)} \right\},

\gamma = \frac{1}{4n}, and C_1 = R_D(x^{(0)}, y^{(0)}, w^{(0)}); then we have

(dual residual) \quad \mathbb{E}[R_D(x^{(t)}, y^{(t)}, w^{(t)})] \le \Big(1 - \frac{\mu}{K}\Big)^t C_1,

(primal variable) \quad \mathbb{E}[\|w^{(t)} - w^*\|^2] \le \frac{n\rho}{2}\Big(1 - \frac{\mu}{K}\Big)^t C_1.

For t \ge \frac{C'K}{\mu}\log\Big(\frac{C''n}{\epsilon}\Big), we have \mathbb{E}[\|w^{(t)} - w^*\|^2] \le \epsilon.
36 / 49
Numerical Experiments: Loss function
Binary classification.
Smoothed hinge loss:

f_i(u) = \begin{cases} 0 & (y_i u \ge 1), \\ \frac{1}{2} - y_i u & (y_i u < 0), \\ \frac{1}{2}(1 - y_i u)^2 & (\text{otherwise}). \end{cases}

[Figure: the smoothed hinge loss as a function of the margin y_i u.]
⇒ proximal mapping is analytically obtained.
\mathrm{prox}(u \mid f_i^*/C) = \begin{cases} \frac{Cu - y_i}{1 + C} & \big(-1 \le \frac{Cuy_i - 1}{1 + C} \le 0\big), \\ -y_i & \big(\frac{Cuy_i - 1}{1 + C} < -1\big), \\ 0 & (\text{otherwise}). \end{cases}
37 / 49
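A small NumPy sketch of this loss and the closed-form proximal mapping of f_i^*/C given above (our own naming, not from the slides):

```python
import numpy as np

def smoothed_hinge(u, y):
    """Smoothed hinge loss as a function of the margin y*u."""
    m = y * u
    return np.where(m >= 1, 0.0, np.where(m < 0, 0.5 - m, 0.5 * (1.0 - m) ** 2))

def prox_conj_smoothed_hinge(u, y, C):
    """Closed-form prox(u | f_i^*/C) for the smoothed hinge loss."""
    s = (C * u * y - 1.0) / (1.0 + C)
    interior = (C * u - y) / (1.0 + C)        # unconstrained minimizer
    return np.where(s < -1.0, -y, np.where(s > 0.0, 0.0, interior))

u, y = np.array([2.0, -3.0]), np.array([1.0, 1.0])
print(prox_conj_smoothed_hinge(u, y, C=1.0))   # [ 0. -1.]
```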
Numerical Experiments (Artificial data): Setting
Artificial data: Overlapped group regularization:
\psi(X) = C\Big( \sum_{i=1}^{32} \|X_{:,i}\| + \sum_{j=1}^{32} \|X_{j,:}\| + 0.01 \times \sum_{i,j} X_{i,j}^2/2 \Big), \qquad X ∈ \mathbb{R}^{32 \times 32}.
38 / 49
Numerical Experiments (Artificial data): Results
[Figure: Artificial data (n = 5,120), overlapped group lasso, mini-batch size n/K = 50. Panels (a) excess objective value, (b) test loss, (c) classification error, each vs. CPU time (s), for RDA, OPG-ADMM, RDA-ADMM, OL-ADMM, Batch-ADMM, and SDCA-ADMM.]
39 / 49
Numerical Experiments (Artificial data): Results
[Figure: Artificial data (n = 51,200), overlapped group lasso, mini-batch size n/K = 50. Panels (a) excess objective value, (b) test loss, (c) classification error, each vs. CPU time (s), for RDA, OPG-ADMM, RDA-ADMM, OL-ADMM, Batch-ADMM, and SDCA-ADMM.]
40 / 49
Numerical Experiments (Real data): Setting
Real data: Graph regularization:
\psi(w) = C_1 \sum_{i=1}^p |w_i| + C_2 \sum_{(i,j) \in E} |w_i - w_j| + 0.01 \times \Big( C_1 \sum_{i=1}^p |w_i|^2 + C_2 \sum_{(i,j) \in E} |w_i - w_j|^2 \Big),

where E is the set of edges obtained from the similarity matrix.
The similarity graph is obtained by Graphical Lasso (Yuan and Lin, 2007; Banerjee et al., 2008).
41 / 49
Numerical Experiments (Real data): Results
[Figure: Real data (20 Newsgroups, n = 12,995), graph regularization, mini-batch size n/K = 50. Panels (a) objective function, (b) test loss, (c) classification error, each vs. CPU time (s), for OPG-ADMM, RDA-ADMM, OL-ADMM, and SDCA-ADMM.]
42 / 49
Numerical Experiments (Real data): Results
[Figure: Real data (a9a, n = 32,561), graph regularization, mini-batch size n/K = 50. Panels (a) objective function, (b) test loss, (c) classification error, each vs. CPU time (s), for OPG-ADMM, RDA-ADMM, OL-ADMM, and SDCA-ADMM.]
43 / 49
Summary
New stochastic optimization methods with ADMM are proposed.
Applicable to a wide range of regularization functions.
Easy implementation.
Online setting:
RDA-ADMM, OPG-ADMM
O(1/√T) in general, and O(log(T)/T) for a strongly convex objective.
Batch setting:
SDCA-ADMM (stochastic dual coordinate ascent ADMM)
Linear convergence under strong convexity around the optimum.
The convergence rate is controlled by the number of mini-batches K (equivalently, the mini-batch size n/K).
44 / 49
References I
O. Banerjee, L. E. Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.

J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873–2908, 2009.

D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers & Mathematics with Applications, 2:17–40, 1976.

M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory & Applications, 4:303–320, 1969.
45 / 49
References II
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 315–323. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems 25, 2013.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2007.
46 / 49
References III
Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, Series B, 120:221–259, 2009.

H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, 2013.

M. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, pages 283–298. Academic Press, London, New York, 1969.

R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1:97–116, 1976.

S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, 2013a.
47 / 49
References IV
S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. Technical report, 2013b. arXiv:1211.2717.

T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning, pages 392–400, 2013.

T. Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 736–744, 2014.

M. Takac, A. Bijral, P. Richtarik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning, 2013.

H. Wang and A. Banerjee. Online alternating direction method. In Proceedings of the 29th International Conference on Machine Learning, 2012.
48 / 49
References V
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23, 2009.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
49 / 49