
Communication-Efficient Algorithms for Distributed Optimization

Juan Xu Kaiqing Zhang

December 1, 2016


Overview

1 Motivation and Problem Description

2 Algorithms with One-round Communication

3 Algorithms with Multi-round Communications

4 Communication Complexity of Multi-round Algorithms

5 Conclusions



Motivation and Problem Description

Motivation

Consider a Stochastic Optimization formulation

    min_w F(w) = E_{X∼P}[ f(w, X) ].

Population risk function F(w): R^d → R.

Population risk minimizer w*:

    w* := arg min_w F(w),

where w* is assumed to be unique.

Loss function f(w, X):

f is real-valued and convex; X is a random variable following the distribution P (P is unknown).

Instead of the unavailable probability distribution P, we have access to N = nm samples drawn i.i.d. from P.

Empirical risk function φ(w): R^d → R

    φ(w) = (1/N) ∑_{i=1}^N f(w, x_i).

Empirical risk minimizer ŵ

    ŵ ∈ arg min_w φ(w),

→ Finite Sum Problem.

♠ Can we solve it when the sample size N = nm is too large to be stored on a single computer, i.e., in a centralized setting?

→ Distributed Optimization.

Distributed Optimization

Distributed Optimization: divide the large-scale data into m pieces, and use m machines, one to deal with each piece.

Key trade-off: pay an expensive communication cost in exchange for memory feasibility.

Two Streams in Distributed Optimization

Figure: Distributed Optimization

One-round Communication

Multi-round Communication

Problem Settings

Given a dataset of N = nm samples drawn i.i.d. from the distribution P.

Divide the samples evenly among m machines; each machine gets a subsampled dataset of size n.

Local empirical risk function φ_i(w): R^d → R

    φ_i(w) := (1/n) ∑_{j=1}^n f(w, x_j),

where x_1, ..., x_n are the samples stored on machine i.

→ Finite Sum Problem.

Assumption. Each local empirical risk function φ_i(w) is L-smooth and λ-strongly convex.
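A minimal sketch (in Python/NumPy; not part of the original slides) of this setting: the N = nm samples are split across m simulated machines, and each machine can evaluate its local empirical risk φ_i and gradient ∇φ_i. The squared loss is an illustrative choice of f.

```python
import numpy as np

def partition(X, y, m):
    """Split N = n*m samples evenly into m local datasets (simulated machines)."""
    return list(zip(np.array_split(X, m), np.array_split(y, m)))

def local_risk(w, Xi, yi):
    """Local empirical risk phi_i(w) with squared loss f(w, (x, y)) = (x^T w - y)^2."""
    r = Xi @ w - yi
    return np.mean(r ** 2)

def local_grad(w, Xi, yi):
    """Gradient of phi_i at w."""
    r = Xi @ w - yi
    return 2.0 * Xi.T @ r / len(yi)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m, d = 100, 4, 5
    X = rng.normal(size=(n * m, d))
    y = X @ np.ones(d) + rng.normal(size=n * m)
    machines = partition(X, y, m)
    w = np.zeros(d)
    print([round(local_risk(w, Xi, yi), 3) for Xi, yi in machines])
```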

Algorithms with One-round Communication


Background

Figure: Algorithms with One-round Communication

Averaging methods?

Accuracy, i.e., mean-squared error to the population minimizer w*?

Optimization methods for the local problems?

The results on algorithms with one-round communication are mainly based on [He16] and [Zhang13].

Averaging Mixture (AVGM) Method

AVGM method. Machine i uses its n samples to solve the local empirical risk minimization problem

    w_{1,i} ∈ arg min_w φ_i(w).

AVGM combines these m estimates by averaging:

    w̄_1 := (1/m) ∑_{i=1}^m w_{1,i}.
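A minimal AVGM sketch under the least-squares setup used later in the simulations; solving each local problem in closed form is an illustrative assumption (the slides leave the local solver abstract).

```python
import numpy as np

def avgm(machines, lam=0.0):
    """AVGM: each machine exactly solves its local (ridge-regularized)
    least-squares ERM, then the m local minimizers are averaged."""
    d = machines[0][0].shape[1]
    local_minimizers = []
    for Xi, yi in machines:
        n_i = len(yi)
        # w_{1,i} = argmin_w (1/n_i)*||Xi w - yi||^2 + (lam/2)*||w||^2
        A = 2.0 * Xi.T @ Xi / n_i + lam * np.eye(d)
        b = 2.0 * Xi.T @ yi / n_i
        local_minimizers.append(np.linalg.solve(A, b))
    return np.mean(local_minimizers, axis=0)  # w_bar_1

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, m, d = 200, 5, 10
    X = rng.normal(size=(n * m, d))
    y = X @ np.ones(d) + rng.normal(size=n * m)
    machines = list(zip(np.array_split(X, m), np.array_split(y, m)))
    print(avgm(machines, lam=1.0 / np.sqrt(n * m))[:3])
```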

Mean-squared Error for AVGM Method

Theorem 1. The mean-squared error of the AVGM method is upper bounded as

    E[ ‖w̄_1 − w*‖_2^2 ] ≤ O( 1/(λ^2 nm) + 1/(λ^6 n^2) + L^2 log d/(λ^4 n^2) + 1/(n^2 m) + 1/n^3 ).

Remark on Theorem 1.

The leading term decays proportionally to (nm)^{-1} = N^{-1} under the assumption m ≤ n.

If we further assume F(w) is L-smooth, then

    E[ F(w̄_1) − F(w*) ] ≤ (L/2) E[ ‖w̄_1 − w*‖_2^2 ].

There may exist a constant gap, because we often have λ = O(1/√(nm)).

Subsampled Averaging Mixture (SAVGM) Method

Define a fixed subsampling rate r ∈ (0, 1).

On each machine, draw a subset of size ⌈rn⌉ by sampling uniformly at random without replacement from the local dataset of size n.

Machine i computes both the local empirical minimizer w_{1,i} and the subsampled empirical minimizer

    w_{2,i} ∈ arg min_w φ_{2,i}(w) = (1/⌈rn⌉) ∑_{j=1}^{⌈rn⌉} f(w, x_j).

Compute the bootstrap average w̄_2 := (1/m) ∑_{i=1}^m w_{2,i}.

SAVGM returns the weighted combination

    w̄_SAVGM := (w̄_1 − r·w̄_2) / (1 − r).
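A sketch of the SAVGM combination for the same least-squares setting; the subsample of size ⌈rn⌉ follows the description above, while the closed-form local solves and the parameter values are illustrative assumptions.

```python
import numpy as np

def ridge_erm(Xi, yi, lam):
    """Exact minimizer of (1/n_i)*||Xi w - yi||^2 + (lam/2)*||w||^2."""
    n_i, d = Xi.shape
    A = 2.0 * Xi.T @ Xi / n_i + lam * np.eye(d)
    return np.linalg.solve(A, 2.0 * Xi.T @ yi / n_i)

def savgm(machines, r, lam, rng):
    """SAVGM: average of full-data local minimizers (w_bar_1) and of subsampled
    local minimizers (w_bar_2), combined as (w_bar_1 - r*w_bar_2) / (1 - r)."""
    w1, w2 = [], []
    for Xi, yi in machines:
        n_i = len(yi)
        w1.append(ridge_erm(Xi, yi, lam))
        # subsample ceil(r*n_i) points uniformly at random without replacement
        idx = rng.choice(n_i, size=int(np.ceil(r * n_i)), replace=False)
        w2.append(ridge_erm(Xi[idx], yi[idx], lam))
    w1_bar, w2_bar = np.mean(w1, axis=0), np.mean(w2, axis=0)
    return (w1_bar - r * w2_bar) / (1.0 - r)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n, m, d = 200, 5, 10
    X = rng.normal(size=(n * m, d))
    y = X @ np.ones(d) + rng.normal(size=n * m)
    machines = list(zip(np.array_split(X, m), np.array_split(y, m)))
    print(savgm(machines, r=0.1, lam=1.0 / np.sqrt(n * m), rng=rng)[:3])
```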

Mean-squared Error for SAVGM Method

Theorem 2. The mean-squared error of the SAVGM method is upper bounded as

    E[ ‖w̄_SAVGM − w*‖_2^2 ] ≤ O( 1/(λ^2 nm) + 1/(n^2 m) + 1/(λ^6 n^3) + L^2 d log d/(λ^4 n^3) ).

Remark on Theorem 2.

If we assume F(w) is L-smooth, then

    E[ F(w̄_SAVGM) − F(w*) ] ≤ (L/2) E[ ‖w̄_SAVGM − w*‖_2^2 ].

The O(n^{-2}) term is eliminated here, but the constant-gap problem still exists.

Optimization Methods for Local Problem

What optimization method can we use to get w_{1,i} and w_{2,i} on machine i, i.e., how do we solve the finite-sum problems on each machine?

    w_{1,i} ∈ arg min_w φ_i(w) = (1/n) ∑_{j=1}^n f(w, x_j);

    w_{2,i} ∈ arg min_w φ_{2,i}(w) = (1/⌈rn⌉) ∑_{j=1}^{⌈rn⌉} f(w, x_j).

Under the assumption that φ_i(w) and φ_{2,i}(w) are both L-smooth and λ-strongly convex for each i, the iteration complexities are:

Gradient Descent (GD): O( (L/λ) log(1/ε) );

Stochastic Gradient Descent (SGD): O( L/(λ^2 ε) ).
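A sketch of plain gradient descent as the local solver on one machine, for the ridge-regularized least-squares objective used in the simulations; the step size 1/L is the standard choice for an L-smooth objective, and the eigenvalue computation is only there to obtain L in this quadratic case.

```python
import numpy as np

def local_gd(Xi, yi, lam, n_iters=500):
    """Gradient descent on phi_i(w) = (1/n_i)*||Xi w - yi||^2 + (lam/2)*||w||^2.
    For an L-smooth, lam-strongly convex objective, about (L/lam)*log(1/eps)
    iterations give an eps-accurate solution."""
    n_i, d = Xi.shape
    H = 2.0 * Xi.T @ Xi / n_i + lam * np.eye(d)   # Hessian (quadratic objective)
    b = 2.0 * Xi.T @ yi / n_i
    L = np.linalg.eigvalsh(H).max()               # smoothness constant
    w = np.zeros(d)
    for _ in range(n_iters):
        w -= (H @ w - b) / L                      # gradient step with size 1/L
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    Xi = rng.normal(size=(1000, 20))
    yi = Xi @ np.ones(20) + rng.normal(size=1000)
    print(local_gd(Xi, yi, lam=0.01)[:3])
```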

Simulation setup

(x_j, y_j) ∈ R^d × R are drawn i.i.d. with x_j ∼ N(0, Σ), where Σ ∈ R^{d×d} is diagonal with Σ_{k,k} = k^{−2.5}; y_j = x_j^T w_0 + ε_j, where w_0 = 1 and ε_j ∼ N(0, 1).

Solve

    min_w φ(w) = (1/m) ∑_{i=1}^m φ_i(w) + (λ/2)‖w‖_2^2,

where φ_i(w) = (1/n) ∑_{j=1}^n (x_j^T w − y_j)^2 and λ = 1/√(nm).

Set n = 1000, m = 5, d = 100.
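A sketch of this data-generating process and of the centralized (full-data) regularized least-squares solution that the distributed methods are compared against; reading w_0 = 1 as the all-ones vector is an assumption.

```python
import numpy as np

def make_data(n, m, d, rng):
    """x_j ~ N(0, Sigma), Sigma diagonal with Sigma_kk = k^(-2.5);
    y_j = x_j^T w0 + eps_j with w0 = 1 (assumed: the all-ones vector)
    and eps_j ~ N(0, 1)."""
    variances = np.arange(1, d + 1, dtype=float) ** -2.5
    X = rng.normal(size=(n * m, d)) * np.sqrt(variances)
    y = X @ np.ones(d) + rng.normal(size=n * m)
    return X, y

def centralized_solution(X, y, lam):
    """Exact minimizer of (1/N)*||Xw - y||^2 + (lam/2)*||w||^2 over all N samples,
    which equals the objective (1/m)*sum_i phi_i(w) + (lam/2)*||w||^2."""
    N, d = X.shape
    A = 2.0 * X.T @ X / N + lam * np.eye(d)
    return np.linalg.solve(A, 2.0 * X.T @ y / N)

if __name__ == "__main__":
    n, m, d = 1000, 5, 100
    rng = np.random.default_rng(0)
    X, y = make_data(n, m, d, rng)
    w_hat = centralized_solution(X, y, lam=1.0 / np.sqrt(n * m))
    print(w_hat[:5])
```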

Algorithm Comparison

Figure: GD vs. SGD as local solvers for AVGM, SAVGM, and their comparison.

Algorithms with Multi-round Communications


Suboptimality of AVGM and SAVGM

Recall that

    E[ F(w̄_1) ] − F(w*) ≤ O( L/(λ^2 nm) + L/(λ^6 n^2) + L^3 log(d)/(λ^4 n^2) ),

so the error of the one-round communication algorithms can be much worse if λ → 0.

In fact, λ decays as 1/√(nm) in many learning applications, e.g., regularized loss minimization and SVM-type problems [Sridharan09, Shalev09].

There exists a worst case [Shamir14] where

    E[ F(ŵ) ] − F(w*) ≤ C_1/(λnm),  while  E[ F(w̄_1) ] − F(w*) ≥ C_2/(λn).

Multi-round communications are necessary for estimators to converge to the empirical minimizer ŵ.

Accelerated gradient averaging

Each machine calculates its local gradient ∇φ_i(w^(t)).

The averaged full gradient ∇φ(w^(t)) = (1/m) ∑_{i=1}^m ∇φ_i(w^(t)) is used for the (accelerated) update, and w^(t+1) is distributed back to the machines.

The number of communication rounds / iterations needed for an ε-suboptimal solution is

    O( √(L/λ) log(1/ε) ).

This polynomial dependence on the condition number is undesirable: since λ typically decays as 1/√(nm), it results in an O( (nm)^{1/4} ) dependence on the sample size.
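A sketch of gradient averaging with Nesterov acceleration for the quadratic simulation objective; the constant-momentum variant with coefficient (√κ − 1)/(√κ + 1) and step size 1/L is one standard choice and is an assumption here, not necessarily the exact scheme used in the slides' experiments.

```python
import numpy as np

def accelerated_gradient_averaging(machines, lam, n_rounds=100):
    """Each round: every machine sends the gradient of its local risk
    phi_i(w) + (lam/2)*||w||^2 at the query point; the center averages the
    gradients, takes a Nesterov-accelerated step, and broadcasts the new iterate."""
    d = machines[0][0].shape[1]
    # Local Hessians/linear terms of the quadratic objectives (used for gradients and L, mu).
    H = [2.0 * Xi.T @ Xi / len(yi) + lam * np.eye(d) for Xi, yi in machines]
    b = [2.0 * Xi.T @ yi / len(yi) for Xi, yi in machines]
    eigs = np.linalg.eigvalsh(np.mean(H, axis=0))
    L, mu = eigs.max(), eigs.min()
    beta = (np.sqrt(L / mu) - 1.0) / (np.sqrt(L / mu) + 1.0)  # momentum coefficient

    w = np.zeros(d)
    v = np.zeros(d)
    for _ in range(n_rounds):
        grad = np.mean([H[i] @ v - b[i] for i in range(len(machines))], axis=0)
        w_next = v - grad / L              # gradient step at the extrapolated point
        v = w_next + beta * (w_next - w)   # Nesterov extrapolation
        w = w_next
    return w
```

Called with the machines list and λ = 1/√(nm) from the simulation setup above, this would serve as the accelerated-GD baseline in that setting.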

Alternating Direction Method of Multipliers (ADMM)

The machines alternate between computing shared dual variables and solving augmented-Lagrangian subproblems with local data.

Recall that for the problem

    min f(x) + g(z)   s.t.   Ax + Bz = c,

the (scaled-form) ADMM iteration follows

    x^(t+1) = arg min_x  f(x) + (ρ/2) ‖Ax + Bz^(t) − c + u^(t)‖_2^2
    z^(t+1) = arg min_z  g(z) + (ρ/2) ‖Ax^(t+1) + Bz − c + u^(t)‖_2^2
    u^(t+1) = u^(t) + ( Ax^(t+1) + Bz^(t+1) − c )

Distributed ADMM

The distributed optimization problem can be reformulated as

    min  (1/m) ∑_{i=1}^m φ_i(w_i)   s.t.   w_i − z = 0,  ∀i.

The distributed ADMM iteration thus follows

    w_i^(t+1) = arg min_{w_i}  φ_i(w_i) + (ρ/2) ‖w_i − z^(t) + u_i^(t)‖_2^2,  ∀i
    z^(t+1) = (1/m) ∑_{i=1}^m ( w_i^(t+1) + u_i^(t) )
    u_i^(t+1) = u_i^(t) + ( w_i^(t+1) − z^(t+1) ),  ∀i

The number of iterations / communication rounds scales strongly with the condition number, and hence with the sample size in learning applications [Deng12, Hong12].
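A sketch of this consensus ADMM iteration for the quadratic local objectives of the simulation, where the w_i-update has a closed form; folding the ridge term into each local subproblem and the choice ρ = 1 are illustrative assumptions.

```python
import numpy as np

def distributed_admm(machines, lam, rho=1.0, n_rounds=200):
    """Consensus ADMM for min (1/m)*sum_i [phi_i(w_i) + (lam/2)*||w_i||^2]
    s.t. w_i = z: parallel closed-form w_i-updates, averaging for z, dual updates u_i."""
    m = len(machines)
    d = machines[0][0].shape[1]
    z = np.zeros(d)
    u = [np.zeros(d) for _ in range(m)]
    # Precompute the matrices of the local quadratic subproblems:
    # argmin_w phi_i(w) + (lam/2)*||w||^2 + (rho/2)*||w - z + u_i||^2
    A = [2.0 * Xi.T @ Xi / len(yi) + (lam + rho) * np.eye(d) for Xi, yi in machines]
    b = [2.0 * Xi.T @ yi / len(yi) for Xi, yi in machines]
    for _ in range(n_rounds):
        w = [np.linalg.solve(A[i], b[i] + rho * (z - u[i])) for i in range(m)]
        z = np.mean([w[i] + u[i] for i in range(m)], axis=0)   # consensus (averaging) step
        u = [u[i] + w[i] - z for i in range(m)]                # scaled dual update
    return z
```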

Distributed Optimization using an Approximate Newton-Type (DANE) Method

[Shamir14]: an approximate Newton-type step is taken at each iteration, while no local Hessian ∇^2 φ_i(w) is explicitly computed or communicated.

With parameter η > 0 and regularizer µ > 0:

    ∇φ(w^(t)) = (1/m) ∑_{i=1}^m ∇φ_i(w^(t))

    w_i^(t+1) = arg min_w  φ_i(w) − ( ∇φ_i(w^(t)) − η∇φ(w^(t)) )^T w + (µ/2) ‖w − w^(t)‖_2^2

    w^(t+1) = (1/m) ∑_{i=1}^m w_i^(t+1)

Two rounds of communication per iteration.
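A sketch of the DANE iteration above for the quadratic simulation objective (with the ridge term folded into each local risk), where the local subproblem reduces to a linear solve; η = 1 and µ = 0 are illustrative parameter choices.

```python
import numpy as np

def dane(machines, lam, eta=1.0, mu=0.0, n_rounds=20):
    """DANE for the quadratic local risks phi_i(w) + (lam/2)*||w||^2:
    round 1 communicates local gradients (to form the averaged gradient),
    round 2 communicates the local subproblem solutions (to be averaged)."""
    m = len(machines)
    d = machines[0][0].shape[1]
    H = [2.0 * Xi.T @ Xi / len(yi) + lam * np.eye(d) for Xi, yi in machines]  # local Hessians
    b = [2.0 * Xi.T @ yi / len(yi) for Xi, yi in machines]
    w = np.zeros(d)
    for _ in range(n_rounds):
        local_grads = [H[i] @ w - b[i] for i in range(m)]
        g = np.mean(local_grads, axis=0)                       # averaged full gradient
        w_new = []
        for i in range(m):
            # argmin_v phi_i(v) - (grad_i - eta*g)^T v + (mu/2)*||v - w||^2
            # reduces to a linear system for a quadratic phi_i:
            rhs = b[i] + local_grads[i] - eta * g + mu * w
            w_new.append(np.linalg.solve(H[i] + mu * np.eye(d), rhs))
        w = np.mean(w_new, axis=0)
    return w
```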

DANE Method

The local optimization can be related to a mirror descent update with distance-generating function h_i(w) = φ_i(w) + (µ/2)‖w‖_2^2 and step size η:

    w_i^(t+1) = arg min_w  φ_i(w) − ( ∇φ_i(w^(t)) − η∇φ(w^(t)) )^T w + (µ/2) ‖w − w^(t)‖_2^2
              = arg min_w  V_{φ_i}(w, w^(t)) + ( η∇φ(w^(t)) )^T w + (µ/2) ‖w − w^(t)‖_2^2
              = arg min_w  V_{h_i}(w, w^(t)) + ( η∇φ(w^(t)) )^T w

When µ is large, the update reduces to gradient descent and no information about the local φ_i(w) is used; when µ = 0 and φ_i(w) ≈ φ(w), it approximates Newton's method, since

    V_{h_i}(w, w^(t)) → (1/2) (w − w^(t))^T ∇^2 φ(w^(t)) (w − w^(t)).

DANE Method

The iteration / communication complexity for an L-smooth, λ-strongly convex quadratic objective φ(w) is

    O( (L/λ)^2 / n · log(dm) log(1/ε) ).

In practice, when λ scales as 1/√(nm), the complexity scales roughly linearly with the number of machines m, not with the sample size N = nm:

    O( L^2 m log(dm) log(1/ε) ).

Simulation setup

(x_j, y_j) ∈ R^d × R are drawn i.i.d. with x_j ∼ N(0, Σ), where Σ ∈ R^{d×d} is diagonal with Σ_{k,k} = k^{−2.5}; y_j = x_j^T w_0 + ε_j, where w_0 = 1 and ε_j ∼ N(0, 1).

Solve

    min_w φ(w) = (1/m) ∑_{i=1}^m φ_i(w) + (λ/2)‖w‖_2^2,

where φ_i(w) = (1/n) ∑_{j=1}^n (x_j^T w − y_j)^2 and λ = 1/√(nm).

Simulation results

To compare the convergence rates, set n = 1000, m = 5, d = 100.

Figure: Comparison of the convergence rate of different algorithms.

Simulation results

To compare communication rounds versus sample size, set n ∈ {100, 500, 1000, 5000, 10000}, m = 5, d = 100, ε = 10^{-6}.

Table: Communication rounds needed by different algorithms to reach ε-suboptimality as the sample size N = nm increases

Algorithm        n=100   n=500   n=1000   n=5000   n=10000
Accelerated GD   35      51      61       86       101
ADMM             58      95      122      252      345
DANE             10      8       8        8        8

Communication Complexity of Multi-round Algorithms


[Arjevani15]: How many rounds of communication are necessary to solve the distributed optimization problem to accuracy ε?

Assume the algorithms operate only over vectors, gradients, and vector products, and that only vectors or gradients are communicated.

Then there exist m 1-smooth, λ-strongly convex local quadratic objective functions that require at least

    Ω( √(δ/λ) log( λ‖w*‖^2 / ε ) )

communication rounds, where δ ∈ (0, 1) characterizes the similarity of the φ_i(w) across the m machines, typically decaying as O(1/√n) for randomly split data [Shamir14].

Accelerated gradient averaging can achieve the lower bound for unrelated local functions, i.e., δ = Ω(1):

    O( √(1/λ) log(1/ε) )   vs.   Ω( √(δ/λ) log( λ‖w*‖^2 / ε ) ),

but can be suboptimal when δ → 0.

DANE can be suboptimal when λ decays as 1/√(nm) and δ decays as 1/√n:

    O( m log(1/ε) )   vs.   Ω( m^{1/4} log(1/ε) ).

Designing optimal algorithms for general objective functions remains open.

Summary

Table: Communication efficiency of several distributed algorithms for quadratic objective functions when λ ∼ O(1/√(nm)) and δ ∼ O(1/√n)

Algorithm        Number of Communication Rounds
Accelerated GD   O( (nm)^{1/4} log(1/ε) )
ADMM             O( (nm)^{1/4} log(1/ε) )
DANE             O( m log(1/ε) )
Optimal          O( m^{1/4} log(1/ε) )

Conclusions


Take-aways

Distributed algorithms with only one round of communication suffer from constant estimation errors that cannot be mitigated by cleverer averaging.

DANE outperforms accelerated GD and ADMM in terms of communication complexity by exploiting the similarity of the local functions.

Many communication rounds can be necessary to reach ε-suboptimality, even when the computational capability of the local machines is unlimited.

References

He, Niao. "IE598: Big Data Optimization." University of Illinois at Urbana-Champaign, 2016.

Zhang, Yuchen, Martin J. Wainwright, and John C. Duchi. "Communication-efficient algorithms for statistical optimization." In Advances in Neural Information Processing Systems, pp. 1502-1510, 2012.

Zhang, Yuchen, Martin J. Wainwright, and John C. Duchi. "Communication-efficient algorithms for statistical optimization." Journal of Machine Learning Research, pp. 3321-3363, 2013.

Sridharan, Karthik, Shai Shalev-Shwartz, and Nathan Srebro. "Fast rates for regularized objectives." In Advances in Neural Information Processing Systems, pp. 1545-1552, 2009.

Shalev-Shwartz, Shai, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. "Stochastic convex optimization." In COLT, 2009.

Shamir, Ohad, Nathan Srebro, and Tong Zhang. "Communication-efficient distributed optimization using an approximate Newton-type method." In ICML, vol. 32, no. 1, pp. 1000-1008, 2014.

Deng, Wei, and Wotao Yin. "On the global and linear convergence of the generalized alternating direction method of multipliers." Journal of Scientific Computing 66, no. 3 (2016): 889-916.

Hong, Mingyi, and Zhi-Quan Luo. "On the linear convergence of the alternating direction method of multipliers." arXiv preprint arXiv:1208.3922 (2012).

Arjevani, Yossi, and Ohad Shamir. "Communication complexity of distributed convex learning and optimization." In Advances in Neural Information Processing Systems, pp. 1756-1764, 2015.


Thank you!

Q&A
