TRANSCRIPT

Communication-Efficient Algorithms for Distributed Optimization
Juan Xu, Kaiqing Zhang
December 1, 2016

Overview
1 Motivation and Problem Description
2 Algorithms with One-round Communication
3 Algorithms with Multi-round Communications
4 Communication Complexity of Multi-round Algorithms
5 Conclusions
Motivation and Problem Description

Motivation
Consider a stochastic optimization formulation
$\min_w \; F(w) = \mathbb{E}_{X\sim P}[f(w, X)].$
Population risk function $F(w): \mathbb{R}^d \to \mathbb{R}$.
Population risk minimizer $w^* := \arg\min_w F(w)$; $w^*$ is assumed to be unique.
Loss function $f(w, X)$: $f$ is real-valued and convex, and $X$ is a random variable following distribution $P$ ($P$ is unknown).

Motivation and Problem Description
MotivationInstead of the unavailable probability distribution P, we have access toN = nm samples which are i.i.d. according to distribution P.
Empirical risk function φ(w) : Rd → R
φ(w) = 1
N
N∑i=1
f (w , xi ).
Empirical risk minimizer w
w ∈ arg minwφ(w),
→ Finite Sum Problem.
♠ Can we solve it when the sample size N = nm is so large to be storedin a single computer, i.e., in a centralized setting?→ Distributed Optimization.
Juan Xu, Kaiqing Zhang Distributed Optimization 4 / 32
Motivation and Problem Description
MotivationInstead of the unavailable probability distribution P, we have access toN = nm samples which are i.i.d. according to distribution P.
Empirical risk function φ(w) : Rd → R
φ(w) = 1
N
N∑i=1
f (w , xi ).
Empirical risk minimizer w
w ∈ arg minwφ(w),
→ Finite Sum Problem.
♠ Can we solve it when the sample size N = nm is so large to be storedin a single computer, i.e., in a centralized setting?→ Distributed Optimization.
Juan Xu, Kaiqing Zhang Distributed Optimization 4 / 32
Motivation and Problem Description
MotivationInstead of the unavailable probability distribution P, we have access toN = nm samples which are i.i.d. according to distribution P.
Empirical risk function φ(w) : Rd → R
φ(w) = 1
N
N∑i=1
f (w , xi ).
Empirical risk minimizer w
w ∈ arg minwφ(w),
→ Finite Sum Problem.
♠ Can we solve it when the sample size N = nm is so large to be storedin a single computer, i.e., in a centralized setting?→ Distributed Optimization.
Juan Xu, Kaiqing Zhang Distributed Optimization 4 / 32
Motivation and Problem Description
MotivationInstead of the unavailable probability distribution P, we have access toN = nm samples which are i.i.d. according to distribution P.
Empirical risk function φ(w) : Rd → R
φ(w) = 1
N
N∑i=1
f (w , xi ).
Empirical risk minimizer w
w ∈ arg minwφ(w),
→ Finite Sum Problem.
♠ Can we solve it when the sample size N = nm is so large to be storedin a single computer, i.e., in a centralized setting?
→ Distributed Optimization.
Juan Xu, Kaiqing Zhang Distributed Optimization 4 / 32
Motivation and Problem Description
MotivationInstead of the unavailable probability distribution P, we have access toN = nm samples which are i.i.d. according to distribution P.
Empirical risk function φ(w) : Rd → R
φ(w) = 1
N
N∑i=1
f (w , xi ).
Empirical risk minimizer w
w ∈ arg minwφ(w),
→ Finite Sum Problem.
♠ Can we solve it when the sample size N = nm is so large to be storedin a single computer, i.e., in a centralized setting?→ Distributed Optimization.
Juan Xu, Kaiqing Zhang Distributed Optimization 4 / 32
Distributed Optimization
Distributed Optimization: divide the large-scale data into $m$ pieces, and use $m$ machines, one per piece.
Key trade-off: pay an expensive communication cost in exchange for memory feasibility.

Two Streams in Distributed Optimization
Figure: Distributed Optimization
One-round Communication
Multi-round Communication

Problem Settings
Given a dataset of $N = nm$ samples drawn i.i.d. from distribution $P$.
Divide the samples evenly among $m$ machines; each machine gets a subsampled dataset of size $n$.
Local empirical risk function $\phi_i(w): \mathbb{R}^d \to \mathbb{R}$:
$\phi_i(w) := \frac{1}{n}\sum_{j=1}^{n} f(w, x_j).$
→ Finite Sum Problem.
Assumption. Each local empirical risk function $\phi_i(w)$ is $L$-smooth and $\lambda$-strongly convex.

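To make the setup concrete, here is a minimal sketch (not from the slides) of splitting the $N = nm$ samples across $m$ machines and evaluating a local empirical risk; the squared loss with an added $\frac{\lambda}{2}\|w\|_2^2$ term anticipates the ridge-regression simulation used later, and all function names are illustrative.

```python
# Minimal sketch (illustrative, not from the slides): split N = n*m samples
# across m machines and evaluate one local empirical risk phi_i and its
# gradient.  The squared loss with a lam/2 * ||w||^2 term matches the
# ridge-regression simulation used later; any L-smooth, strongly convex
# loss would do.
import numpy as np

def split_data(X, y, m):
    """Evenly divide the N = n*m samples into m local datasets of size n."""
    return list(zip(np.array_split(X, m), np.array_split(y, m)))

def local_risk(w, Xi, yi, lam):
    """phi_i(w) = (1/n) * sum_j (x_j^T w - y_j)^2 + lam/2 * ||w||_2^2."""
    n = Xi.shape[0]
    r = Xi @ w - yi
    return r @ r / n + 0.5 * lam * w @ w

def local_grad(w, Xi, yi, lam):
    """Gradient of phi_i(w)."""
    n = Xi.shape[0]
    return 2.0 * Xi.T @ (Xi @ w - yi) / n + lam * w
```

The later sketches reuse this data layout: `local_data = split_data(X, y, m)` plays the role of the $m$ local datasets.
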
Algorithms with One-round Communication

Background
Figure: Algorithms with One-round Communication
Averaging methods?
Accuracy, i.e., mean-squared error with respect to the population minimizer $w^*$?
Optimization methods for the local problem?
The results on algorithms with one-round communication are mainly based on [He16] and [Zhang13].

Averaging Mixture (AVGM) Method
AVGM Method.
Machine $i$ uses its $n$ samples to solve the local empirical risk minimization problem
$w_{1,i} \in \arg\min_w \phi_i(w).$
AVGM combines these $m$ estimates by averaging:
$\bar{w}_1 := \frac{1}{m}\sum_{i=1}^{m} w_{1,i}.$

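A minimal sketch of the one-round AVGM pipeline under the ridge-regression setup introduced above (an illustrative choice, not prescribed by the slides): each machine solves its local problem in closed form, then the $m$ local minimizers are averaged.

```python
# Minimal AVGM sketch for the ridge-regression local objectives, where the
# local empirical risk minimization has a closed-form solution.
import numpy as np

def local_ridge_minimizer(Xi, yi, lam):
    """w_{1,i} = argmin_w (1/n)||X_i w - y_i||^2 + lam/2 ||w||^2."""
    n, d = Xi.shape
    A = 2.0 * Xi.T @ Xi / n + lam * np.eye(d)
    b = 2.0 * Xi.T @ yi / n
    return np.linalg.solve(A, b)

def avgm(local_data, lam):
    """One communication round: average the m local minimizers."""
    return np.mean([local_ridge_minimizer(Xi, yi, lam) for Xi, yi in local_data],
                   axis=0)
```

With the data split from the earlier sketch, `avgm(split_data(X, y, m), lam)` would produce the one-round estimate $\bar{w}_1$.
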
Mean-squared Error for the AVGM Method
Theorem 1
The mean-squared error of the AVGM method is upper bounded as
$\mathbb{E}\left[\|\bar{w}_1 - w^*\|_2^2\right] \le O\!\left(\frac{1}{\lambda^2 nm} + \frac{1}{\lambda^6 n^2} + \frac{L^2 \log d}{\lambda^4 n^2} + \frac{1}{n^2 m} + \frac{1}{n^3}\right).$
Remark on Theorem 1.
The leading term decays proportionally to $(nm)^{-1} = N^{-1}$ under the assumption $m \le n$.
If we assume $F(w)$ is $L$-smooth, then
$\mathbb{E}\left[F(\bar{w}_1) - F(w^*)\right] \le \frac{L}{2}\,\mathbb{E}\left[\|\bar{w}_1 - w^*\|_2^2\right].$
There may exist a constant gap, because we often have $\lambda = O(1/\sqrt{nm})$.

Subsampled Averaging Mixture (SAVGM) Method
Fix a subsampling rate $r \in (0, 1)$.
Each machine draws a subset of size $\lceil rn \rceil$ by sampling uniformly at random without replacement from its local dataset of size $n$.
Machine $i$ computes both the local empirical minimizer $w_{1,i}$ and the subsampled empirical minimizer
$w_{2,i} \in \arg\min_w \phi_{2,i}(w) = \frac{1}{\lceil rn \rceil}\sum_{j=1}^{\lceil rn \rceil} f(w, x_j).$
Compute the bootstrap average $\bar{w}_2 := \frac{1}{m}\sum_{i=1}^{m} w_{2,i}$.
SAVGM returns the weighted combination
$w_{\mathrm{SAVGM}} := \frac{\bar{w}_1 - r\,\bar{w}_2}{1 - r}.$

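A minimal sketch of SAVGM under the same setup, reusing the `local_ridge_minimizer` helper from the AVGM sketch above: each machine additionally solves a subsampled local problem at rate $r$, and the two averages are debiased by the weighted combination.

```python
# Minimal SAVGM sketch; assumes local_ridge_minimizer from the AVGM sketch.
import numpy as np

def savgm(local_data, lam, r, seed=0):
    rng = np.random.default_rng(seed)
    w1_list, w2_list = [], []
    for Xi, yi in local_data:
        n = Xi.shape[0]
        k = int(np.ceil(r * n))
        idx = rng.choice(n, size=k, replace=False)   # subsample w/o replacement
        w1_list.append(local_ridge_minimizer(Xi, yi, lam))
        w2_list.append(local_ridge_minimizer(Xi[idx], yi[idx], lam))
    w1_bar = np.mean(w1_list, axis=0)
    w2_bar = np.mean(w2_list, axis=0)
    return (w1_bar - r * w2_bar) / (1.0 - r)          # debiased combination
```
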
Mean-squared Error for the SAVGM Method
Theorem 2
The mean-squared error of the SAVGM method is upper bounded as
$\mathbb{E}\left[\|w_{\mathrm{SAVGM}} - w^*\|_2^2\right] \le O\!\left(\frac{1}{\lambda^2 nm} + \frac{1}{n^2 m} + \frac{1}{\lambda^6 n^3} + \frac{L^2 d \log d}{\lambda^4 n^3}\right).$
Remark on Theorem 2.
If we assume $F(w)$ is $L$-smooth, then
$\mathbb{E}\left[F(w_{\mathrm{SAVGM}}) - F(w^*)\right] \le \frac{L}{2}\,\mathbb{E}\left[\|w_{\mathrm{SAVGM}} - w^*\|_2^2\right].$
The $O(n^{-2})$ term is eliminated here, but the constant-gap problem still exists.

Optimization Methods for the Local Problem
What optimization method can machine $i$ use to obtain $w_{1,i}$ and $w_{2,i}$, i.e., how do we solve the finite-sum problems on each machine?
$w_{1,i} \in \arg\min_w \phi_i(w) = \frac{1}{n}\sum_{j=1}^{n} f(w, x_j);$
$w_{2,i} \in \arg\min_w \phi_{2,i}(w) = \frac{1}{\lceil rn \rceil}\sum_{j=1}^{\lceil rn \rceil} f(w, x_j).$
Under the assumption that $\phi_i(w)$ and $\phi_{2,i}(w)$ are both $L$-smooth and $\lambda$-strongly convex for each $i$:
Gradient Descent (GD): $O\!\left(\frac{L}{\lambda}\log\frac{1}{\varepsilon}\right)$;
Stochastic Gradient Descent (SGD): $O\!\left(\frac{L}{\lambda^2\varepsilon}\right)$.

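A minimal sketch of the two local solvers for a generic $L$-smooth, $\lambda$-strongly convex finite-sum objective; the step-size choices below are standard textbook defaults, not prescriptions from the slides.

```python
# Minimal sketches of the two local solvers; step sizes are illustrative
# standard choices for an L-smooth, lambda-strongly convex objective.
import numpy as np

def gradient_descent(grad, w0, L, num_iters):
    """Full-gradient descent with step size 1/L."""
    w = np.array(w0, dtype=float)
    for _ in range(num_iters):
        w = w - grad(w) / L
    return w

def stochastic_gradient_descent(sample_grad, w0, lam, n, num_iters, seed=0):
    """SGD with step size 1/(lam*(t+1)); sample_grad(w, j) returns the
    gradient of the j-th summand (plus regularizer) at w."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for t in range(num_iters):
        j = rng.integers(n)                    # pick one sample uniformly
        w = w - sample_grad(w, j) / (lam * (t + 1))
    return w
```
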
Simulation setup
$(x_j, y_j) \in \mathbb{R}^d \times \mathbb{R}$ are drawn i.i.d. with $x_j \sim \mathcal{N}(0, \Sigma)$, where $\Sigma \in \mathbb{R}^{d\times d}$ is diagonal with $\Sigma_{k,k} = k^{-2.5}$, and $y_j = x_j^T w_0 + \varepsilon_j$ with $w_0 = \mathbf{1}$ and $\varepsilon_j \sim \mathcal{N}(0, 1)$.
Solve
$\min_w \; \phi(w) = \frac{1}{m}\sum_{i=1}^{m} \phi_i(w) + \frac{\lambda}{2}\|w\|_2^2, \quad \text{where } \phi_i(w) = \frac{1}{n}\sum_{j=1}^{n} \left(x_j^T w - y_j\right)^2, \quad \lambda = 1/\sqrt{nm}.$
Set $n = 1000$, $m = 5$, $d = 100$.

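A minimal sketch of this data-generation step (the function name and random seed are illustrative):

```python
# Minimal sketch of the simulated data: x_j ~ N(0, Sigma) with Sigma diagonal,
# Sigma_kk = k^(-2.5); y_j = x_j^T w0 + eps_j, w0 = 1, eps_j ~ N(0, 1).
import numpy as np

def generate_data(n, m, d, seed=0):
    rng = np.random.default_rng(seed)
    std = np.arange(1, d + 1) ** -1.25           # sqrt of the diagonal of Sigma
    X = rng.standard_normal((n * m, d)) * std    # rows are x_j ~ N(0, Sigma)
    w0 = np.ones(d)
    y = X @ w0 + rng.standard_normal(n * m)
    return X, y, w0

n, m, d = 1000, 5, 100
lam = 1.0 / np.sqrt(n * m)
X, y, w0 = generate_data(n, m, d)
```
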
Algorithm Comparison
Figure: panels comparing GD and SGD on the local problems, and the AVGM and SAVGM estimates.

Algorithms with Multi-round Communications

Suboptimality of AVGM and SAVGM
Recall
$\mathbb{E}[F(\bar{w}_1)] - F(w^*) \le O\!\left(\frac{L}{\lambda^2 nm} + \frac{L}{\lambda^6 n^2} + \frac{L^3 \log d}{\lambda^4 n^2}\right);$
the mean-squared error of the one-round communication algorithms can be much worse if $\lambda \to 0$.
In fact, $\lambda$ decays as $\frac{1}{\sqrt{nm}}$ in many learning applications, e.g., regularized loss minimization and SVM-type problems [Sridharan09, Shalev09].
There exists a worst case [Shamir14] in which the centralized empirical minimizer satisfies $\mathbb{E}[F(\hat{w})] - F(w^*) \le \frac{C_1}{\lambda nm}$, while the one-round averaged estimator satisfies $\mathbb{E}[F(\bar{w}_1)] - F(w^*) \ge \frac{C_2}{\lambda n}$.
Multi-round communications are necessary for estimators to converge to the empirical minimizer $\hat{w}$.

Accelerated gradient averaging
Each machine calculates the local gradient $\nabla\phi_i\!\left(w^{(t)}\right)$.
An averaged full-gradient step with $\nabla\phi\!\left(w^{(t)}\right) = \frac{1}{m}\sum_{i=1}^{m}\nabla\phi_i\!\left(w^{(t)}\right)$ is taken for the update, and $w^{(t+1)}$ is distributed back to the machines.
The number of communication rounds / iterations needed for an $\varepsilon$-suboptimal solution is
$O\!\left(\sqrt{\frac{L}{\lambda}}\,\log\frac{1}{\varepsilon}\right).$
This polynomial dependence on the condition number is undesirable: since $\lambda$ typically decays as $1/\sqrt{nm}$, it results in an $O(\sqrt[4]{nm})$ dependence on the sample size.

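A minimal sketch of this distributed accelerated scheme; `local_grads` is a list of callables returning $\nabla\phi_i$, and the Nesterov momentum weight $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ is a standard illustrative choice rather than something specified on the slides.

```python
# Minimal sketch of distributed accelerated gradient averaging: one round of
# communication per iteration to average the local gradients at the query
# point, followed by a Nesterov-style step at the center.
import numpy as np

def distributed_agd(local_grads, w0, L, lam, num_rounds):
    kappa = L / lam
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)   # momentum weight
    w = np.array(w0, dtype=float)
    w_prev = w.copy()
    for _ in range(num_rounds):
        v = w + beta * (w - w_prev)                                  # extrapolation
        g = np.mean([grad_i(v) for grad_i in local_grads], axis=0)   # avg gradients
        w_prev, w = w, v - g / L                                     # gradient step
    return w
```
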
Alternating Direction Method of Multipliers (ADMM)
The machines alternate between computing shared dual variables and solving augmented Lagrangian subproblems with local data.
Recall that for the problem
$\min \; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c,$
the ADMM iteration (in scaled form) follows
$x^{(t+1)} = \arg\min_x \; f(x) + \frac{\rho}{2}\left\|Ax + Bz^{(t)} - c + u^{(t)}\right\|_2^2$
$z^{(t+1)} = \arg\min_z \; g(z) + \frac{\rho}{2}\left\|Ax^{(t+1)} + Bz - c + u^{(t)}\right\|_2^2$
$u^{(t+1)} = u^{(t)} + \left(Ax^{(t+1)} + Bz^{(t+1)} - c\right).$

Distributed ADMM
The distributed optimization problem can be reformulated in consensus form as
$\min \; \frac{1}{m}\sum_{i=1}^{m}\phi_i(w_i) \quad \text{s.t.} \quad w_i - z = 0, \;\forall i.$
The distributed ADMM iteration thus follows
$w_i^{(t+1)} = \arg\min_{w_i} \; \phi_i(w_i) + \frac{\rho}{2}\left\|w_i - z^{(t)} + u_i^{(t)}\right\|_2^2, \;\forall i$
$z^{(t+1)} = \frac{1}{m}\sum_{i=1}^{m}\left(w_i^{(t+1)} + u_i^{(t)}\right)$
$u_i^{(t+1)} = u_i^{(t)} + \left(w_i^{(t+1)} - z^{(t+1)}\right), \;\forall i.$
The number of iterations / communication rounds scales strongly with the condition number, i.e., with the sample size in learning applications [Deng12, Hong12].

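A minimal sketch of this consensus-ADMM loop, again specialized to the ridge-regression local objectives so that the $w_i$-update has a closed form; $\rho$ is the augmented-Lagrangian parameter and its value is an illustrative tuning choice.

```python
# Minimal sketch of distributed (consensus) ADMM for the ridge-regression
# local objectives; each w_i-update is solved in closed form.
import numpy as np

def distributed_admm(local_data, lam, rho, d, num_rounds):
    m = len(local_data)
    w = np.zeros((m, d))
    u = np.zeros((m, d))
    z = np.zeros(d)
    for _ in range(num_rounds):
        for i, (Xi, yi) in enumerate(local_data):
            n = Xi.shape[0]
            # w_i-update: argmin_w phi_i(w) + rho/2 ||w - z + u_i||^2
            A = 2.0 * Xi.T @ Xi / n + (lam + rho) * np.eye(d)
            b = 2.0 * Xi.T @ yi / n + rho * (z - u[i])
            w[i] = np.linalg.solve(A, b)
        z = np.mean(w + u, axis=0)       # z-update (one communication round)
        u = u + (w - z)                  # scaled dual update
    return z
```
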
Distributed Optimization using an Approximate Newton-Type (DANE) Method
[Shamir14]: an approximate Newton-type step is taken at each iteration, while no local Hessian $\nabla^2\phi_i(w)$ is explicitly computed or communicated.
With parameter $\eta > 0$ and regularizer $\mu > 0$:
$\nabla\phi\!\left(w^{(t)}\right) = \frac{1}{m}\sum_{i=1}^{m}\nabla\phi_i\!\left(w^{(t)}\right)$
$w_i^{(t+1)} = \arg\min_w \; \phi_i(w) - \left(\nabla\phi_i\!\left(w^{(t)}\right) - \eta\nabla\phi\!\left(w^{(t)}\right)\right)^T w + \frac{\mu}{2}\left\|w - w^{(t)}\right\|_2^2$
$w^{(t+1)} = \frac{1}{m}\sum_{i=1}^{m} w_i^{(t+1)}$
Two rounds of communication per iteration.

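A minimal sketch of one DANE iteration for the ridge-regression local objectives, where the regularized local subproblem has a closed form; $\eta$ and $\mu$ are the parameters from the slides, and the default values below are illustrative only.

```python
# Minimal sketch of one DANE iteration for ridge-regression local objectives.
import numpy as np

def dane_step(local_data, lam, w, eta=1.0, mu=0.0):
    d = w.shape[0]
    # Communication round 1: average the local gradients at w.
    grads = [2.0 * Xi.T @ (Xi @ w - yi) / Xi.shape[0] + lam * w
             for Xi, yi in local_data]
    g = np.mean(grads, axis=0)
    # Communication round 2: each machine solves its local subproblem; average.
    new_ws = []
    for (Xi, yi), gi in zip(local_data, grads):
        n = Xi.shape[0]
        # argmin_v phi_i(v) - (grad phi_i(w) - eta*g)^T v + mu/2 ||v - w||^2
        A = 2.0 * Xi.T @ Xi / n + (lam + mu) * np.eye(d)
        b = 2.0 * Xi.T @ yi / n + (gi - eta * g) + mu * w
        new_ws.append(np.linalg.solve(A, b))
    return np.mean(new_ws, axis=0)
```

Iterating `w = dane_step(local_data, lam, w)` from `w = np.zeros(d)` gives the DANE trajectory for this quadratic example.
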
DANE Method
The local optimization can be related to a mirror descent update with distance-generating function $h_i(w) = \phi_i(w) + \frac{\mu}{2}\|w\|_2^2$ and step size $\eta$:
$w_i^{(t+1)} = \arg\min_w \; \phi_i(w) - \left(\nabla\phi_i\!\left(w^{(t)}\right) - \eta\nabla\phi\!\left(w^{(t)}\right)\right)^T w + \frac{\mu}{2}\left\|w - w^{(t)}\right\|_2^2$
$\qquad\;\;\, = \arg\min_w \; V_{\phi_i}\!\left(w, w^{(t)}\right) + \left(\eta\nabla\phi\!\left(w^{(t)}\right)\right)^T w + \frac{\mu}{2}\left\|w - w^{(t)}\right\|_2^2$
$\qquad\;\;\, = \arg\min_w \; V_{h_i}\!\left(w, w^{(t)}\right) + \left(\eta\nabla\phi\!\left(w^{(t)}\right)\right)^T w.$
When $\mu$ is large, the update reduces to gradient descent and no information about the local $\phi_i(w)$ is used; when $\mu = 0$ and $\phi_i(w) \approx \phi(w)$, it approximates Newton's method, since
$V_{h_i}\!\left(w, w^{(t)}\right) \to \frac{1}{2}\left(w - w^{(t)}\right)^T \nabla^2\phi\!\left(w^{(t)}\right)\left(w - w^{(t)}\right).$

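To see why the last limit holds, note that the Bregman divergence of any quadratic generating function is exactly the Hessian-weighted quadratic; a short check of this step (filled in here, not spelled out on the slides), using the definition $V_h(w, w') = h(w) - h(w') - \nabla h(w')^\top (w - w')$:

```latex
% Bregman divergence of a quadratic h(w) = (1/2) w^T H w + b^T w + c:
V_h(w, w') = h(w) - h(w') - \nabla h(w')^\top (w - w')
           = \tfrac{1}{2} w^\top H w - \tfrac{1}{2} w'^\top H w' + b^\top (w - w')
             - (H w' + b)^\top (w - w')
           = \tfrac{1}{2} (w - w')^\top H (w - w').
% With \mu = 0 and \phi_i \approx \phi quadratic, h_i has Hessian
% H = \nabla^2 \phi(w^{(t)}), which gives the displayed limit.
```
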
DANE Method
The iteration / communication complexity for an $L$-smooth, $\lambda$-strongly convex quadratic objective $\phi(w)$ is
$O\!\left(\frac{(L/\lambda)^2}{n}\log(dm)\log\frac{1}{\varepsilon}\right).$
In practice, when $\lambda$ scales as $1/\sqrt{nm}$, the complexity only scales roughly linearly with the number of machines $m$, not with the sample size $N = nm$:
$O\!\left(L^2 m \log(dm)\log\frac{1}{\varepsilon}\right).$

Simulation setup
The same synthetic regression data and regularized objective as in the one-round experiments: $x_j \sim \mathcal{N}(0, \Sigma)$ with $\Sigma_{k,k} = k^{-2.5}$, $y_j = x_j^T w_0 + \varepsilon_j$, $w_0 = \mathbf{1}$, $\varepsilon_j \sim \mathcal{N}(0, 1)$, and
$\min_w \; \phi(w) = \frac{1}{m}\sum_{i=1}^{m}\phi_i(w) + \frac{\lambda}{2}\|w\|_2^2, \quad \phi_i(w) = \frac{1}{n}\sum_{j=1}^{n}\left(x_j^T w - y_j\right)^2, \quad \lambda = 1/\sqrt{nm}.$

Simulation results
To compare the convergence rates, set $n = 1000$, $m = 5$, $d = 100$.
Figure: Comparison of the convergence rate of different algorithms.

Simulation results
To compare the number of communication rounds against the sample size, set $n \in \{100, 500, 1000, 5000, 10000\}$, $m = 5$, $d = 100$, $\varepsilon = 10^{-6}$.
Table: Communication rounds needed by different algorithms to reach an $\varepsilon$-suboptimal solution as the sample size $N = nm$ increases

Algorithm        n=100   n=500   n=1000   n=5000   n=10000
Accelerated GD     35      51      61       86       101
ADMM               58      95      122      252      345
DANE               10      8       8        8        8

Communication Complexity of Multi-round Algorithms

Communication Complexity of Multi-round Algorithms
[Arjevani15]: How many rounds of communication are necessary to solve the distributed optimization problem to accuracy $\varepsilon$?
Assume the algorithms only operate on vectors, gradients, and vector products, and that only vectors or gradients are communicated.
Then there exist $m$ $1$-smooth, $\lambda$-strongly convex local quadratic objective functions that require at least
$\Omega\!\left(\sqrt{\frac{\delta}{\lambda}}\,\log\frac{\lambda\|w^*\|^2}{\varepsilon}\right)$
communication rounds, where $\delta \in (0, 1)$ characterizes the similarity of the $\phi_i(w)$ across the $m$ machines, typically decaying as $O\!\left(\frac{1}{\sqrt{n}}\right)$ for randomly split data [Shamir14].

Communication Complexity of Multi-round Algorithms
Accelerated gradient averaging can achieve the bound for unrelated local functions, i.e., $\delta = \Omega(1)$:
$O\!\left(\sqrt{\frac{1}{\lambda}}\,\log\frac{1}{\varepsilon}\right) \quad \text{vs.} \quad \Omega\!\left(\sqrt{\frac{\delta}{\lambda}}\,\log\frac{\lambda\|w^*\|^2}{\varepsilon}\right),$
but can be suboptimal when $\delta \to 0$.
DANE can be suboptimal when $\lambda$ decays as $\frac{1}{\sqrt{nm}}$ and $\delta$ decays as $\frac{1}{\sqrt{n}}$:
$O\!\left(m\log\frac{1}{\varepsilon}\right) \quad \text{vs.} \quad \Omega\!\left(\sqrt[4]{m}\,\log\frac{1}{\varepsilon}\right).$
Designing optimal algorithms for general objective functions remains open.

Communication Complexity of Multi-round Algorithms
Summary
Table: Communication efficiency of several distributed algorithms for quadratic objective functions when $\lambda \sim O\!\left(\frac{1}{\sqrt{nm}}\right)$ and $\delta \sim O\!\left(\frac{1}{\sqrt{n}}\right)$

Algorithm        Number of Communication Rounds
Accelerated GD   $O(\sqrt[4]{nm}\,\log(1/\varepsilon))$
ADMM             $O(\sqrt[4]{nm}\,\log(1/\varepsilon))$
DANE             $O(m\,\log(1/\varepsilon))$
Optimal          $O(\sqrt[4]{m}\,\log(1/\varepsilon))$

Conclusions

Take-aways
Distributed algorithms with only one round of communication suffer from constant estimation errors that cannot be mitigated by cleverer averaging.
DANE outperforms accelerated GD and ADMM in terms of communication complexity by exploiting the similarity of the local functions.
Many communication rounds can be necessary to achieve $\varepsilon$-suboptimality, even when the computation capability of the local machines is unlimited.

References
He, Niao. "IE598 Big Data Optimization." University of Illinois at Urbana-Champaign, 2016.
Zhang, Yuchen, Martin J. Wainwright, and John C. Duchi. "Communication-efficient algorithms for statistical optimization." In Advances in Neural Information Processing Systems, pp. 1502-1510, 2012.
Zhang, Yuchen, Martin J. Wainwright, and John C. Duchi. "Communication-efficient algorithms for statistical optimization." Journal of Machine Learning Research, pp. 3321-3363, 2013.
Sridharan, Karthik, Shai Shalev-Shwartz, and Nathan Srebro. "Fast rates for regularized objectives." In Advances in Neural Information Processing Systems, pp. 1545-1552, 2009.
Shalev-Shwartz, Shai, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. "Stochastic Convex Optimization." In COLT, 2009.
Shamir, Ohad, Nathan Srebro, and Tong Zhang. "Communication-Efficient Distributed Optimization using an Approximate Newton-type Method." In ICML, vol. 32, no. 1, pp. 1000-1008, 2014.
Deng, Wei, and Wotao Yin. "On the global and linear convergence of the generalized alternating direction method of multipliers." Journal of Scientific Computing 66, no. 3 (2016): 889-916.
Hong, Mingyi, and Zhi-Quan Luo. "On the linear convergence of the alternating direction method of multipliers." arXiv preprint arXiv:1208.3922, 2012.
Arjevani, Yossi, and Ohad Shamir. "Communication complexity of distributed convex learning and optimization." In Advances in Neural Information Processing Systems, pp. 1756-1764, 2015.

Thank you!
Q&A