
Communication-Efficient Algorithms for Distributed Optimization

Juan Xu Kaiqing Zhang

December 1, 2016


Overview

1 Motivation and Problem Description

2 Algorithms with One-round Communication

3 Algorithms with Multi-round Communications

4 Communication Complexity of Multi-round Algorithms

5 Conclusions



Motivation and Problem Description

Motivation

Consider a Stochastic Optimization formulation

    min_w F(w) = E_{X∼P}[ f(w, X) ].

Population risk function F(w): R^d → R.

Population risk minimizer w*:

    w* := arg min_w F(w),

where w* is assumed to be unique.

Loss function f(w, X):

f is real-valued and convex; X is a random variable following the distribution P (P is unknown).

Instead of the unavailable probability distribution P, we have access to N = nm samples drawn i.i.d. from P.

Empirical risk function φ(w): R^d → R

    φ(w) = (1/N) ∑_{i=1}^N f(w, x_i).

Empirical risk minimizer ŵ

    ŵ ∈ arg min_w φ(w),

→ Finite Sum Problem.

♠ Can we solve it when the sample size N = nm is too large to be stored on a single computer, i.e., in a centralized setting?

→ Distributed Optimization.

Distributed Optimization

Distributed Optimization: divide the large-scale data into m pieces, and use m machines, one to deal with each piece.

Key trade-off: pay an expensive communication cost in exchange for memory feasibility.

Two Streams in Distributed Optimization

Figure: Distributed Optimization

One-round Communication

Multi-round Communication

Problem Settings

Given a dataset of N = nm samples drawn i.i.d. from the distribution P.

Divide the samples evenly among m machines; each machine gets a subsampled dataset of size n.

Local empirical risk function φ_i(w): R^d → R

    φ_i(w) := (1/n) ∑_{j=1}^n f(w, x_j),

where x_1, ..., x_n are the samples stored on machine i.

→ Finite Sum Problem.

Assumption. Each local empirical risk function φ_i(w) is L-smooth and λ-strongly convex.
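A minimal sketch (in Python/NumPy; not part of the original slides) of this setting: the N = nm samples are split across m simulated machines, and each machine can evaluate its local empirical risk φ_i and gradient ∇φ_i. The squared loss is an illustrative choice of f.

```python
import numpy as np

def partition(X, y, m):
    """Split N = n*m samples evenly into m local datasets (simulated machines)."""
    return list(zip(np.array_split(X, m), np.array_split(y, m)))

def local_risk(w, Xi, yi):
    """Local empirical risk phi_i(w) with squared loss f(w, (x, y)) = (x^T w - y)^2."""
    r = Xi @ w - yi
    return np.mean(r ** 2)

def local_grad(w, Xi, yi):
    """Gradient of phi_i at w."""
    r = Xi @ w - yi
    return 2.0 * Xi.T @ r / len(yi)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m, d = 100, 4, 5
    X = rng.normal(size=(n * m, d))
    y = X @ np.ones(d) + rng.normal(size=n * m)
    machines = partition(X, y, m)
    w = np.zeros(d)
    print([round(local_risk(w, Xi, yi), 3) for Xi, yi in machines])
```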

Algorithms with One-round Communication


Background

Figure: Algorithms with One-round Communication

Averaging methods?

Accuracy, i.e., mean-squared error to the population minimizer w*?

Optimization methods for the local problems?

The results on algorithms with one-round communication are mainly based on [He16] and [Zhang13].

Averaging Mixture (AVGM) Method

AVGM method. Machine i uses its n samples to solve the local empirical risk minimization problem

    w_{1,i} ∈ arg min_w φ_i(w).

AVGM combines these m estimates by averaging:

    w̄_1 := (1/m) ∑_{i=1}^m w_{1,i}.
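A minimal AVGM sketch under the least-squares setup used later in the simulations; solving each local problem in closed form is an illustrative assumption (the slides leave the local solver abstract).

```python
import numpy as np

def avgm(machines, lam=0.0):
    """AVGM: each machine exactly solves its local (ridge-regularized)
    least-squares ERM, then the m local minimizers are averaged."""
    d = machines[0][0].shape[1]
    local_minimizers = []
    for Xi, yi in machines:
        n_i = len(yi)
        # w_{1,i} = argmin_w (1/n_i)*||Xi w - yi||^2 + (lam/2)*||w||^2
        A = 2.0 * Xi.T @ Xi / n_i + lam * np.eye(d)
        b = 2.0 * Xi.T @ yi / n_i
        local_minimizers.append(np.linalg.solve(A, b))
    return np.mean(local_minimizers, axis=0)  # w_bar_1

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, m, d = 200, 5, 10
    X = rng.normal(size=(n * m, d))
    y = X @ np.ones(d) + rng.normal(size=n * m)
    machines = list(zip(np.array_split(X, m), np.array_split(y, m)))
    print(avgm(machines, lam=1.0 / np.sqrt(n * m))[:3])
```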

Mean-squared Error for AVGM Method

Theorem 1. The mean-squared error of the AVGM method is upper bounded as

    E[ ‖w̄_1 − w*‖_2^2 ] ≤ O( 1/(λ^2 nm) + 1/(λ^6 n^2) + L^2 log d/(λ^4 n^2) + 1/(n^2 m) + 1/n^3 ).

Remark on Theorem 1.

The leading term decays proportionally to (nm)^{-1} = N^{-1} under the assumption m ≤ n.

If we further assume F(w) is L-smooth, then

    E[ F(w̄_1) − F(w*) ] ≤ (L/2) E[ ‖w̄_1 − w*‖_2^2 ].

There may exist a constant gap, because we often have λ = O(1/√(nm)).

Subsampled Averaging Mixture (SAVGM) Method

Define a fixed subsampling rate r ∈ (0, 1).

On each machine, draw a subset of size ⌈rn⌉ by sampling uniformly at random without replacement from the local dataset of size n.

Machine i computes both the local empirical minimizer w_{1,i} and the subsampled empirical minimizer

    w_{2,i} ∈ arg min_w φ_{2,i}(w) = (1/⌈rn⌉) ∑_{j=1}^{⌈rn⌉} f(w, x_j).

Compute the bootstrap average w̄_2 := (1/m) ∑_{i=1}^m w_{2,i}.

SAVGM returns the weighted combination

    w̄_SAVGM := (w̄_1 − r·w̄_2) / (1 − r).
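A sketch of the SAVGM combination for the same least-squares setting; the subsample of size ⌈rn⌉ follows the description above, while the closed-form local solves and the parameter values are illustrative assumptions.

```python
import numpy as np

def ridge_erm(Xi, yi, lam):
    """Exact minimizer of (1/n_i)*||Xi w - yi||^2 + (lam/2)*||w||^2."""
    n_i, d = Xi.shape
    A = 2.0 * Xi.T @ Xi / n_i + lam * np.eye(d)
    return np.linalg.solve(A, 2.0 * Xi.T @ yi / n_i)

def savgm(machines, r, lam, rng):
    """SAVGM: average of full-data local minimizers (w_bar_1) and of subsampled
    local minimizers (w_bar_2), combined as (w_bar_1 - r*w_bar_2) / (1 - r)."""
    w1, w2 = [], []
    for Xi, yi in machines:
        n_i = len(yi)
        w1.append(ridge_erm(Xi, yi, lam))
        # subsample ceil(r*n_i) points uniformly at random without replacement
        idx = rng.choice(n_i, size=int(np.ceil(r * n_i)), replace=False)
        w2.append(ridge_erm(Xi[idx], yi[idx], lam))
    w1_bar, w2_bar = np.mean(w1, axis=0), np.mean(w2, axis=0)
    return (w1_bar - r * w2_bar) / (1.0 - r)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n, m, d = 200, 5, 10
    X = rng.normal(size=(n * m, d))
    y = X @ np.ones(d) + rng.normal(size=n * m)
    machines = list(zip(np.array_split(X, m), np.array_split(y, m)))
    print(savgm(machines, r=0.1, lam=1.0 / np.sqrt(n * m), rng=rng)[:3])
```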

Mean-squared Error for SAVGM Method

Theorem 2. The mean-squared error of the SAVGM method is upper bounded as

    E[ ‖w̄_SAVGM − w*‖_2^2 ] ≤ O( 1/(λ^2 nm) + 1/(n^2 m) + 1/(λ^6 n^3) + L^2 d log d/(λ^4 n^3) ).

Remark on Theorem 2.

If we assume F(w) is L-smooth, then

    E[ F(w̄_SAVGM) − F(w*) ] ≤ (L/2) E[ ‖w̄_SAVGM − w*‖_2^2 ].

The O(n^{-2}) term is eliminated here, but the constant-gap problem still exists.

Optimization Methods for Local Problem

What optimization method can we use to get w_{1,i} and w_{2,i} on machine i, i.e., how do we solve the finite-sum problems on each machine?

    w_{1,i} ∈ arg min_w φ_i(w) = (1/n) ∑_{j=1}^n f(w, x_j);

    w_{2,i} ∈ arg min_w φ_{2,i}(w) = (1/⌈rn⌉) ∑_{j=1}^{⌈rn⌉} f(w, x_j).

Under the assumption that φ_i(w) and φ_{2,i}(w) are both L-smooth and λ-strongly convex for each i, the iteration complexities are:

Gradient Descent (GD): O( (L/λ) log(1/ε) );

Stochastic Gradient Descent (SGD): O( L/(λ^2 ε) ).
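A sketch of plain gradient descent as the local solver on one machine, for the ridge-regularized least-squares objective used in the simulations; the step size 1/L is the standard choice for an L-smooth objective, and the eigenvalue computation is only there to obtain L in this quadratic case.

```python
import numpy as np

def local_gd(Xi, yi, lam, n_iters=500):
    """Gradient descent on phi_i(w) = (1/n_i)*||Xi w - yi||^2 + (lam/2)*||w||^2.
    For an L-smooth, lam-strongly convex objective, about (L/lam)*log(1/eps)
    iterations give an eps-accurate solution."""
    n_i, d = Xi.shape
    H = 2.0 * Xi.T @ Xi / n_i + lam * np.eye(d)   # Hessian (quadratic objective)
    b = 2.0 * Xi.T @ yi / n_i
    L = np.linalg.eigvalsh(H).max()               # smoothness constant
    w = np.zeros(d)
    for _ in range(n_iters):
        w -= (H @ w - b) / L                      # gradient step with size 1/L
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    Xi = rng.normal(size=(1000, 20))
    yi = Xi @ np.ones(20) + rng.normal(size=1000)
    print(local_gd(Xi, yi, lam=0.01)[:3])
```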

Simulation setup

(x_j, y_j) ∈ R^d × R are drawn i.i.d. with x_j ∼ N(0, Σ), where Σ ∈ R^{d×d} is diagonal with Σ_{k,k} = k^{−2.5}; y_j = x_j^T w_0 + ε_j, where w_0 = 1 and ε_j ∼ N(0, 1).

Solve

    min_w φ(w) = (1/m) ∑_{i=1}^m φ_i(w) + (λ/2)‖w‖_2^2,

where φ_i(w) = (1/n) ∑_{j=1}^n (x_j^T w − y_j)^2 and λ = 1/√(nm).

Set n = 1000, m = 5, d = 100.
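A sketch of this data-generating process and of the centralized (full-data) regularized least-squares solution that the distributed methods are compared against; reading w_0 = 1 as the all-ones vector is an assumption.

```python
import numpy as np

def make_data(n, m, d, rng):
    """x_j ~ N(0, Sigma), Sigma diagonal with Sigma_kk = k^(-2.5);
    y_j = x_j^T w0 + eps_j with w0 = 1 (assumed: the all-ones vector)
    and eps_j ~ N(0, 1)."""
    variances = np.arange(1, d + 1, dtype=float) ** -2.5
    X = rng.normal(size=(n * m, d)) * np.sqrt(variances)
    y = X @ np.ones(d) + rng.normal(size=n * m)
    return X, y

def centralized_solution(X, y, lam):
    """Exact minimizer of (1/N)*||Xw - y||^2 + (lam/2)*||w||^2 over all N samples,
    which equals the objective (1/m)*sum_i phi_i(w) + (lam/2)*||w||^2."""
    N, d = X.shape
    A = 2.0 * X.T @ X / N + lam * np.eye(d)
    return np.linalg.solve(A, 2.0 * X.T @ y / N)

if __name__ == "__main__":
    n, m, d = 1000, 5, 100
    rng = np.random.default_rng(0)
    X, y = make_data(n, m, d, rng)
    w_hat = centralized_solution(X, y, lam=1.0 / np.sqrt(n * m))
    print(w_hat[:5])
```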

Algorithm Comparison

Figure: GD vs. SGD as local solvers for AVGM, SAVGM, and their comparison.

Algorithms with Multi-round Communications


Suboptimality of AVGM and SAVGM

Recall that

    E[ F(w̄_1) ] − F(w*) ≤ O( L/(λ^2 nm) + L/(λ^6 n^2) + L^3 log(d)/(λ^4 n^2) ),

so the error of the one-round communication algorithms can be much worse if λ → 0.

In fact, λ decays as 1/√(nm) in many learning applications, e.g., regularized loss minimization and SVM-type problems [Sridharan09, Shalev09].

There exists a worst case [Shamir14] where

    E[ F(ŵ) ] − F(w*) ≤ C_1/(λnm),  while  E[ F(w̄_1) ] − F(w*) ≥ C_2/(λn).

Multi-round communications are necessary for estimators to converge to the empirical minimizer ŵ.

Accelerated gradient averaging

Each machine calculates its local gradient ∇φ_i(w^(t)).

The averaged full gradient ∇φ(w^(t)) = (1/m) ∑_{i=1}^m ∇φ_i(w^(t)) is used for the (accelerated) update, and w^(t+1) is distributed back to the machines.

The number of communication rounds / iterations needed for an ε-suboptimal solution is

    O( √(L/λ) log(1/ε) ).

This polynomial dependence on the condition number is undesirable: since λ typically decays as 1/√(nm), it results in an O( (nm)^{1/4} ) dependence on the sample size.
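A sketch of gradient averaging with Nesterov acceleration for the quadratic simulation objective; the constant-momentum variant with coefficient (√κ − 1)/(√κ + 1) and step size 1/L is one standard choice and is an assumption here, not necessarily the exact scheme used in the slides' experiments.

```python
import numpy as np

def accelerated_gradient_averaging(machines, lam, n_rounds=100):
    """Each round: every machine sends the gradient of its local risk
    phi_i(w) + (lam/2)*||w||^2 at the query point; the center averages the
    gradients, takes a Nesterov-accelerated step, and broadcasts the new iterate."""
    d = machines[0][0].shape[1]
    # Local Hessians/linear terms of the quadratic objectives (used for gradients and L, mu).
    H = [2.0 * Xi.T @ Xi / len(yi) + lam * np.eye(d) for Xi, yi in machines]
    b = [2.0 * Xi.T @ yi / len(yi) for Xi, yi in machines]
    eigs = np.linalg.eigvalsh(np.mean(H, axis=0))
    L, mu = eigs.max(), eigs.min()
    beta = (np.sqrt(L / mu) - 1.0) / (np.sqrt(L / mu) + 1.0)  # momentum coefficient

    w = np.zeros(d)
    v = np.zeros(d)
    for _ in range(n_rounds):
        grad = np.mean([H[i] @ v - b[i] for i in range(len(machines))], axis=0)
        w_next = v - grad / L              # gradient step at the extrapolated point
        v = w_next + beta * (w_next - w)   # Nesterov extrapolation
        w = w_next
    return w
```

Called with the machines list and λ = 1/√(nm) from the simulation setup above, this would serve as the accelerated-GD baseline in that setting.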

Alternating Direction Method of Multipliers (ADMM)

The machines alternate between computing shared dual variables and solving augmented-Lagrangian subproblems with local data.

Recall that for the problem

    min f(x) + g(z)   s.t.   Ax + Bz = c,

the (scaled-form) ADMM iteration follows

    x^(t+1) = arg min_x  f(x) + (ρ/2) ‖Ax + Bz^(t) − c + u^(t)‖_2^2
    z^(t+1) = arg min_z  g(z) + (ρ/2) ‖Ax^(t+1) + Bz − c + u^(t)‖_2^2
    u^(t+1) = u^(t) + ( Ax^(t+1) + Bz^(t+1) − c )

Distributed ADMM

The distributed optimization problem can be reformulated as

    min  (1/m) ∑_{i=1}^m φ_i(w_i)   s.t.   w_i − z = 0,  ∀i.

The distributed ADMM iteration thus follows

    w_i^(t+1) = arg min_{w_i}  φ_i(w_i) + (ρ/2) ‖w_i − z^(t) + u_i^(t)‖_2^2,  ∀i
    z^(t+1) = (1/m) ∑_{i=1}^m ( w_i^(t+1) + u_i^(t) )
    u_i^(t+1) = u_i^(t) + ( w_i^(t+1) − z^(t+1) ),  ∀i

The number of iterations / communication rounds scales strongly with the condition number, and hence with the sample size in learning applications [Deng12, Hong12].
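A sketch of this consensus ADMM iteration for the quadratic local objectives of the simulation, where the w_i-update has a closed form; folding the ridge term into each local subproblem and the choice ρ = 1 are illustrative assumptions.

```python
import numpy as np

def distributed_admm(machines, lam, rho=1.0, n_rounds=200):
    """Consensus ADMM for min (1/m)*sum_i [phi_i(w_i) + (lam/2)*||w_i||^2]
    s.t. w_i = z: parallel closed-form w_i-updates, averaging for z, dual updates u_i."""
    m = len(machines)
    d = machines[0][0].shape[1]
    z = np.zeros(d)
    u = [np.zeros(d) for _ in range(m)]
    # Precompute the matrices of the local quadratic subproblems:
    # argmin_w phi_i(w) + (lam/2)*||w||^2 + (rho/2)*||w - z + u_i||^2
    A = [2.0 * Xi.T @ Xi / len(yi) + (lam + rho) * np.eye(d) for Xi, yi in machines]
    b = [2.0 * Xi.T @ yi / len(yi) for Xi, yi in machines]
    for _ in range(n_rounds):
        w = [np.linalg.solve(A[i], b[i] + rho * (z - u[i])) for i in range(m)]
        z = np.mean([w[i] + u[i] for i in range(m)], axis=0)   # consensus (averaging) step
        u = [u[i] + w[i] - z for i in range(m)]                # scaled dual update
    return z
```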

Distributed Optimization using an Approximate Newton-Type (DANE) Method

[Shamir14]: an approximate Newton-type step is taken at each iteration, while no local Hessian ∇^2 φ_i(w) is explicitly computed or communicated.

With parameter η > 0 and regularizer µ > 0:

    ∇φ(w^(t)) = (1/m) ∑_{i=1}^m ∇φ_i(w^(t))

    w_i^(t+1) = arg min_w  φ_i(w) − ( ∇φ_i(w^(t)) − η∇φ(w^(t)) )^T w + (µ/2) ‖w − w^(t)‖_2^2

    w^(t+1) = (1/m) ∑_{i=1}^m w_i^(t+1)

Two rounds of communication per iteration.
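A sketch of the DANE iteration above for the quadratic simulation objective (with the ridge term folded into each local risk), where the local subproblem reduces to a linear solve; η = 1 and µ = 0 are illustrative parameter choices.

```python
import numpy as np

def dane(machines, lam, eta=1.0, mu=0.0, n_rounds=20):
    """DANE for the quadratic local risks phi_i(w) + (lam/2)*||w||^2:
    round 1 communicates local gradients (to form the averaged gradient),
    round 2 communicates the local subproblem solutions (to be averaged)."""
    m = len(machines)
    d = machines[0][0].shape[1]
    H = [2.0 * Xi.T @ Xi / len(yi) + lam * np.eye(d) for Xi, yi in machines]  # local Hessians
    b = [2.0 * Xi.T @ yi / len(yi) for Xi, yi in machines]
    w = np.zeros(d)
    for _ in range(n_rounds):
        local_grads = [H[i] @ w - b[i] for i in range(m)]
        g = np.mean(local_grads, axis=0)                       # averaged full gradient
        w_new = []
        for i in range(m):
            # argmin_v phi_i(v) - (grad_i - eta*g)^T v + (mu/2)*||v - w||^2
            # reduces to a linear system for a quadratic phi_i:
            rhs = b[i] + local_grads[i] - eta * g + mu * w
            w_new.append(np.linalg.solve(H[i] + mu * np.eye(d), rhs))
        w = np.mean(w_new, axis=0)
    return w
```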

DANE Method

The local optimization can be related to a mirror descent update with distance-generating function h_i(w) = φ_i(w) + (µ/2)‖w‖_2^2 and step size η:

    w_i^(t+1) = arg min_w  φ_i(w) − ( ∇φ_i(w^(t)) − η∇φ(w^(t)) )^T w + (µ/2) ‖w − w^(t)‖_2^2
              = arg min_w  V_{φ_i}(w, w^(t)) + ( η∇φ(w^(t)) )^T w + (µ/2) ‖w − w^(t)‖_2^2
              = arg min_w  V_{h_i}(w, w^(t)) + ( η∇φ(w^(t)) )^T w

When µ is large, the update reduces to gradient descent and no information about the local φ_i(w) is used; when µ = 0 and φ_i(w) ≈ φ(w), it approximates Newton's method, since

    V_{h_i}(w, w^(t)) → (1/2) (w − w^(t))^T ∇^2 φ(w^(t)) (w − w^(t)).

DANE Method

The iteration / communication complexity for an L-smooth, λ-strongly convex quadratic objective φ(w) is

    O( (L/λ)^2 / n · log(dm) log(1/ε) ).

In practice, when λ scales as 1/√(nm), the complexity scales roughly linearly with the number of machines m, not with the sample size N = nm:

    O( L^2 m log(dm) log(1/ε) ).

Simulation setup

(x_j, y_j) ∈ R^d × R are drawn i.i.d. with x_j ∼ N(0, Σ), where Σ ∈ R^{d×d} is diagonal with Σ_{k,k} = k^{−2.5}; y_j = x_j^T w_0 + ε_j, where w_0 = 1 and ε_j ∼ N(0, 1).

Solve

    min_w φ(w) = (1/m) ∑_{i=1}^m φ_i(w) + (λ/2)‖w‖_2^2,

where φ_i(w) = (1/n) ∑_{j=1}^n (x_j^T w − y_j)^2 and λ = 1/√(nm).

Simulation results

To compare the convergence rates, set n = 1000, m = 5, d = 100.

Figure: Comparison of the convergence rate of different algorithms.

Simulation results

To compare communication rounds versus sample size, set n ∈ {100, 500, 1000, 5000, 10000}, m = 5, d = 100, ε = 10^{-6}.

Table: Communication rounds needed by different algorithms to reach ε-suboptimality as the sample size N = nm increases

Algorithm        n=100   n=500   n=1000   n=5000   n=10000
Accelerated GD   35      51      61       86       101
ADMM             58      95      122      252      345
DANE             10      8       8        8        8

Communication Complexity of Multi-round Algorithms


[Arjevani15]: How many rounds of communication are necessary to solve the distributed optimization problem to accuracy ε?

Assume the algorithms operate only over vectors, gradients, and vector products, and that only vectors or gradients are communicated.

Then there exist m 1-smooth, λ-strongly convex local quadratic objective functions that require at least

    Ω( √(δ/λ) log( λ‖w*‖^2 / ε ) )

communication rounds, where δ ∈ (0, 1) characterizes the similarity of the φ_i(w) across the m machines, typically decaying as O(1/√n) for randomly split data [Shamir14].

Accelerated gradient averaging can achieve the lower bound for unrelated local functions, i.e., δ = Ω(1):

    O( √(1/λ) log(1/ε) )   vs.   Ω( √(δ/λ) log( λ‖w*‖^2 / ε ) ),

but can be suboptimal when δ → 0.

DANE can be suboptimal when λ decays as 1/√(nm) and δ decays as 1/√n:

    O( m log(1/ε) )   vs.   Ω( m^{1/4} log(1/ε) ).

Designing optimal algorithms for general objective functions remains open.

Summary

Table: Communication efficiency of several distributed algorithms for quadratic objective functions when λ ∼ O(1/√(nm)) and δ ∼ O(1/√n)

Algorithm        Number of Communication Rounds
Accelerated GD   O( (nm)^{1/4} log(1/ε) )
ADMM             O( (nm)^{1/4} log(1/ε) )
DANE             O( m log(1/ε) )
Optimal          O( m^{1/4} log(1/ε) )

Conclusions


Take-aways

Distributed algorithms with only one round of communication suffer from constant estimation errors that cannot be mitigated by cleverer averaging.

DANE outperforms accelerated GD and ADMM in terms of communication complexity by exploiting the similarity of the local functions.

Many communication rounds can be necessary to reach ε-suboptimality, even when the computational capability of the local machines is unlimited.

References

He, Niao. "IE598: Big Data Optimization." University of Illinois at Urbana-Champaign, 2016.

Zhang, Yuchen, Martin J. Wainwright, and John C. Duchi. "Communication-efficient algorithms for statistical optimization." In Advances in Neural Information Processing Systems, pp. 1502-1510, 2012.

Zhang, Yuchen, Martin J. Wainwright, and John C. Duchi. "Communication-efficient algorithms for statistical optimization." Journal of Machine Learning Research, pp. 3321-3363, 2013.

Sridharan, Karthik, Shai Shalev-Shwartz, and Nathan Srebro. "Fast rates for regularized objectives." In Advances in Neural Information Processing Systems, pp. 1545-1552, 2009.

Shalev-Shwartz, Shai, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. "Stochastic convex optimization." In COLT, 2009.

Shamir, Ohad, Nathan Srebro, and Tong Zhang. "Communication-efficient distributed optimization using an approximate Newton-type method." In ICML, vol. 32, no. 1, pp. 1000-1008, 2014.

Deng, Wei, and Wotao Yin. "On the global and linear convergence of the generalized alternating direction method of multipliers." Journal of Scientific Computing 66, no. 3 (2016): 889-916.

Hong, Mingyi, and Zhi-Quan Luo. "On the linear convergence of the alternating direction method of multipliers." arXiv preprint arXiv:1208.3922 (2012).

Arjevani, Yossi, and Ohad Shamir. "Communication complexity of distributed convex learning and optimization." In Advances in Neural Information Processing Systems, pp. 1756-1764, 2015.


Thank you!

Q&A
