
Communication-Efficient Algorithms for Distributed Optimization

Juan Xu, Kaiqing Zhang

December 1, 2016

Overview

1 Motivation and Problem Description
2 Algorithms with One-round Communication
3 Algorithms with Multi-round Communications
4 Communication Complexity of Multi-round Algorithms
5 Conclusions


Motivation and Problem Description

Motivation

Consider a stochastic optimization formulation

    \min_w F(w) = \mathbb{E}_{X \sim P}[f(w, X)].

Population risk function F(w): R^d → R.
Population risk minimizer w^*:

    w^* := \arg\min_w F(w),

where w^* is assumed to be unique.
Loss function f(w, X):
f is real-valued and convex;
X is a random variable following distribution P (P is unknown).


Motivation

Instead of the unavailable probability distribution P, we have access to N = nm samples drawn i.i.d. from P.

Empirical risk function φ(w): R^d → R,

    \phi(w) = \frac{1}{N} \sum_{i=1}^{N} f(w, x_i).

Empirical risk minimizer \bar{w}:

    \bar{w} \in \arg\min_w \phi(w)

→ Finite Sum Problem.

♠ Can we still solve it when the sample size N = nm is too large to be stored on a single computer, i.e., in a centralized setting?
→ Distributed Optimization.


Distributed Optimization

Distributed Optimization: divide the large-scale data into m pieces, and use m machines, one per piece.
Key Trade-off: pay an expensive communication cost in exchange for memory feasibility.


Two Streams in Distributed Optimization

Figure: Distributed Optimization

One-round Communication
Multi-round Communication


Problem Settings

Given a dataset of N = nm samples drawn i.i.d. from distribution P.
Divide the samples evenly among m machines; each machine gets a subsampled dataset of size n.
Local empirical risk function φ_i(w): R^d → R,

    \phi_i(w) := \frac{1}{n} \sum_{j=1}^{n} f(w, x_j),

where the sum runs over the n samples held by machine i. → Finite Sum Problem.

Assumption. Each local empirical risk function φ_i(w) is L-smooth and λ-strongly convex.



Algorithms with One-round Communication

Background

Figure: Algorithms with One-round Communication

Averaging methods?
Accuracy, i.e., mean-squared error to the population minimizer w^*?
Optimization methods for the local problem?

The results on algorithms with one-round communication are mainly based on [He16] and [Zhang13].


Averaging Mixture (AVGM) Method

Machine i uses its n samples to solve the local empirical risk minimization problem

    w_{1,i} \in \arg\min_w \phi_i(w).

AVGM combines these m estimates by averaging:

    \bar{w}_1 := \frac{1}{m} \sum_{i=1}^{m} w_{1,i}.

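As a concrete illustration, here is a minimal NumPy sketch of AVGM for the regularized least-squares objective used in the simulations later in the deck; the closed-form `local_erm` solver and all function names are our own illustrative assumptions, not code from the presentation.

```python
import numpy as np

def local_erm(X, y, lam):
    """Closed-form local ridge ERM:
    argmin_w (1/n) * ||X w - y||^2 + (lam/2) * ||w||^2."""
    n, d = X.shape
    return np.linalg.solve((2.0 / n) * X.T @ X + lam * np.eye(d),
                           (2.0 / n) * X.T @ y)

def avgm(Xs, ys, lam):
    """AVGM: each of the m machines solves its local ERM; average the results."""
    return np.mean([local_erm(X, y, lam) for X, y in zip(Xs, ys)], axis=0)
```

Only a single round of communication is needed: each machine ships its d-dimensional minimizer w_{1,i} to the center, which averages them.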

Mean-squared Error for the AVGM Method

Theorem 1. The mean-squared error of the AVGM method is upper bounded as

    \mathbb{E}\left[ \|\bar{w}_1 - w^*\|_2^2 \right] \le O\left( \frac{1}{\lambda^2 n m} + \frac{1}{\lambda^6 n^2} + \frac{L^2 \log d}{\lambda^4 n^2} + \frac{1}{n^2 m} + \frac{1}{n^3} \right).

Remark on Theorem 1.
The leading term decays proportionally to (nm)^{-1} = N^{-1}, assuming m ≤ n.
If we assume F(w) is L-smooth, then

    \mathbb{E}[F(\bar{w}_1) - F(w^*)] \le \frac{L}{2} \mathbb{E}\left[ \|\bar{w}_1 - w^*\|_2^2 \right].

There may exist a constant gap, because we often have λ = O(1/√(nm)).


Subsampled Averaging Mixture (SAVGM) Method

Define a fixed subsampling rate r ∈ [0, 1].
On each machine, draw a subset of size ⌈rn⌉ by sampling uniformly at random without replacement from the local dataset of size n.
Machine i computes both the local empirical minimizer w_{1,i} and the subsampled empirical minimizer

    w_{2,i} \in \arg\min_w \phi_{2,i}(w) = \frac{1}{\lceil rn \rceil} \sum_{j=1}^{\lceil rn \rceil} f(w, x_j).

Compute the bootstrap average \bar{w}_2 := \frac{1}{m} \sum_{i=1}^{m} w_{2,i}.
SAVGM returns the weighted combination

    w_{SAVGM} := \frac{\bar{w}_1 - r \bar{w}_2}{1 - r}.

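A matching sketch of SAVGM, reusing the assumed `local_erm` helper from the AVGM example above; the subsampling follows the slide (size ⌈rn⌉, without replacement):

```python
def savgm(Xs, ys, lam, r, rng=None):
    """SAVGM: de-bias the AVGM average w1 with a bootstrap average w2."""
    rng = rng if rng is not None else np.random.default_rng(0)
    w1 = np.mean([local_erm(X, y, lam) for X, y in zip(Xs, ys)], axis=0)
    w2_locals = []
    for X, y in zip(Xs, ys):
        n = X.shape[0]
        idx = rng.choice(n, int(np.ceil(r * n)), replace=False)  # subsample w/o replacement
        w2_locals.append(local_erm(X[idx], y[idx], lam))
    w2 = np.mean(w2_locals, axis=0)
    return (w1 - r * w2) / (1.0 - r)  # weighted combination from the slide
```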

Mean-squared Error for the SAVGM Method

Theorem 2. The mean-squared error of the SAVGM method is upper bounded as

    \mathbb{E}\left[ \|w_{SAVGM} - w^*\|_2^2 \right] \le O\left( \frac{1}{\lambda^2 n m} + \frac{1}{n^2 m} + \frac{1}{\lambda^6 n^3} + \frac{L^2 d \log d}{\lambda^4 n^3} \right).

Remark on Theorem 2.
If we assume F(w) is L-smooth, then

    \mathbb{E}[F(w_{SAVGM}) - F(w^*)] \le \frac{L}{2} \mathbb{E}\left[ \|w_{SAVGM} - w^*\|_2^2 \right].

The O(n^{-2}) term is eliminated here, but the constant-gap problem still exists.


Optimization Methods for the Local Problem

What optimization method can we use to obtain w_{1,i} and w_{2,i} on machine i, i.e., how do we solve the finite-sum problem on each machine?

    w_{1,i} \in \arg\min_w \phi_i(w) = \frac{1}{n} \sum_{j=1}^{n} f(w, x_j);
    w_{2,i} \in \arg\min_w \phi_{2,i}(w) = \frac{1}{\lceil rn \rceil} \sum_{j=1}^{\lceil rn \rceil} f(w, x_j).

Under the assumption that φ_i(w) and φ_{2,i}(w) are both L-smooth and λ-strongly convex for each i, the iteration complexities are:

Gradient Descent (GD): O((L/λ) log(1/ε));
Stochastic Gradient Descent (SGD): O(L/(λ^2 ε)).

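When the local problem has no closed form, a plain gradient-descent local solver suffices; here is a minimal sketch under the slide's smoothness assumption, with `grad_phi_i` an assumed callable returning ∇φ_i(w):

```python
def local_gd(grad_phi_i, w0, L, num_iters):
    """Gradient descent on an L-smooth, lam-strongly convex local objective,
    with the standard 1/L step size; needs O((L/lam) log(1/eps)) iterations."""
    w = w0.copy()
    for _ in range(num_iters):
        w = w - (1.0 / L) * grad_phi_i(w)
    return w
```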

Simulation Setup

(x_j, y_j) ∈ R^d × R are drawn i.i.d. with x_j ~ N(0, Σ), where Σ ∈ R^{d×d} is diagonal with Σ_{k,k} = k^{-2.5}, and y_j = x_j^T w_0 + ε_j, where w_0 = 1 and ε_j ~ N(0, 1).

Solve

    \min_w \phi(w) = \frac{1}{m} \sum_{i=1}^{m} \phi_i(w) + \frac{\lambda}{2} \|w\|_2^2,

where \phi_i(w) = \frac{1}{n} \sum_{j=1}^{n} (x_j^T w - y_j)^2 and λ = 1/√(nm).

Set n = 1000, m = 5, d = 100.

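A small NumPy sketch of this data-generating model (the function name and RNG seeding are our own choices):

```python
def make_data(n, m, d, rng=None):
    """x ~ N(0, Sigma), Sigma diagonal with Sigma_kk = k^{-2.5};
    y = x^T w0 + eps with w0 = 1 and eps ~ N(0, 1)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    std = np.sqrt(np.arange(1, d + 1) ** -2.5)  # per-coordinate standard deviations
    w0 = np.ones(d)
    Xs, ys = [], []
    for _ in range(m):                          # one block of n samples per machine
        X = rng.normal(size=(n, d)) * std
        y = X @ w0 + rng.normal(size=n)
        Xs.append(X)
        ys.append(y)
    return Xs, ys, w0

Xs, ys, w0 = make_data(n=1000, m=5, d=100)
lam = 1.0 / np.sqrt(1000 * 5)                   # lambda = 1/sqrt(nm)
```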

Algorithm Comparison

Figure: Error comparison of AVGM and SAVGM under GD and SGD local solvers (rows: AVGM, SAVGM, comparison; columns: GD, SGD, comparison).


Algorithms with Multi-round Communications

Suboptimality of AVGM and SAVGM

Recall

    \mathbb{E}[F(\bar{w}_1)] - F(w^*) \le O\left( \frac{L}{\lambda^2 n m} + \frac{L}{\lambda^6 n^2} + \frac{L^3 \log d}{\lambda^4 n^2} \right),

so the error of the one-round communication algorithms can be much worse as λ → 0.
In fact, λ decays as 1/√(nm) in many learning applications, e.g., regularized loss minimization and SVM-type problems [Sridharan09, Shalev09].
There exists a worst case [Shamir14] where the empirical minimizer satisfies \mathbb{E}[F(\bar{w})] - F(w^*) \le C_1/(\lambda n m), while the one-round estimator suffers \mathbb{E}[F(\bar{w}_1)] - F(w^*) \ge C_2/(\lambda n).
Multi-round communications are therefore necessary for estimators to converge to the empirical minimizer \bar{w}.


Accelerated Gradient Averaging

Each machine calculates the local gradient ∇φ_i(w^{(t)}).
The averaged full gradient ∇φ(w^{(t)}) = \frac{1}{m} \sum_{i=1}^{m} \nabla\phi_i(w^{(t)}) is used for the update, and w^{(t+1)} is distributed back to the machines.
The number of communication rounds / iterations needed for an ε-suboptimal solution is

    O\left( \sqrt{L/\lambda} \log(1/\varepsilon) \right).

Polynomial dependence on the condition number is undesirable: since λ typically decays as 1/√(nm), this yields a dependence of O((nm)^{1/4}) on the sample size.

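A minimal sketch of one such scheme (Nesterov's strongly convex acceleration with averaged gradients); the momentum formula is the textbook choice for this setting, assumed rather than taken from the deck, and `grads` is a list of local-gradient callables:

```python
def accelerated_gradient_averaging(grads, w0, L, lam, num_rounds):
    """Each round: average the m local gradients at the extrapolation point,
    then take an accelerated step. One communication round per iteration."""
    kappa = L / lam
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)  # momentum coefficient
    w, w_prev = w0.copy(), w0.copy()
    for _ in range(num_rounds):
        v = w + beta * (w - w_prev)                         # extrapolation point
        g = np.mean([grad(v) for grad in grads], axis=0)    # the communication step
        w_prev, w = w, v - (1.0 / L) * g
    return w
```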

Alternating Direction Method of Multipliers (ADMM)

The machines alternate between computing shared dual variables and solving augmented Lagrangian problems with local data.
Recall that for the problem

    \min f(x) + g(z)  s.t.  Ax + Bz = c,

the ADMM iteration follows

    x^{(t+1)} = \arg\min_x f(x) + \frac{\rho}{2} \|Ax + Bz^{(t)} - c + u^{(t)}\|_2^2,
    z^{(t+1)} = \arg\min_z g(z) + \frac{\rho}{2} \|Ax^{(t+1)} + Bz - c + u^{(t)}\|_2^2,
    u^{(t+1)} = u^{(t)} + (Ax^{(t+1)} + Bz^{(t+1)} - c).


Distributed ADMM

Distributed optimization can be reformulated as

    \min \frac{1}{m} \sum_{i=1}^{m} \phi_i(w_i)  s.t.  w_i - z = 0, ∀i.

The distributed ADMM iteration thus follows

    w_i^{(t+1)} = \arg\min_{w_i} \phi_i(w_i) + \frac{\rho}{2} \|w_i - z^{(t)} + u_i^{(t)}\|_2^2, ∀i,
    z^{(t+1)} = \frac{1}{m} \sum_{i=1}^{m} \left( w_i^{(t+1)} + u_i^{(t)} \right),
    u_i^{(t+1)} = u_i^{(t)} + \left( w_i^{(t+1)} - z^{(t+1)} \right), ∀i.

The number of iterations / communication rounds scales strongly with the condition number, i.e., with the sample size in learning applications [Deng12, Hong12].

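A compact sketch of this consensus ADMM loop; `local_prox` is an assumed per-machine oracle returning argmin_w φ_i(w) + (ρ/2)‖w − v‖_2^2 (the ρ is folded into the oracle):

```python
def consensus_admm(local_prox, m, d, num_rounds):
    """Consensus ADMM: local proximal updates, averaging, dual ascent."""
    z = np.zeros(d)
    u = [np.zeros(d) for _ in range(m)]
    for _ in range(num_rounds):
        ws = [local_prox(i, z - u[i]) for i in range(m)]       # local w_i-updates
        z = np.mean([ws[i] + u[i] for i in range(m)], axis=0)  # consensus (communication)
        u = [u[i] + ws[i] - z for i in range(m)]               # dual updates
    return z
```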

Distributed Optimization using an Approximate Newton-Type (DANE) Method

[Shamir14]: An approximate Newton-type step is taken at each iteration, while no local Hessian ∇²φ_i(w) is explicitly computed or communicated.
With parameter η > 0 and regularizer µ > 0:

    \nabla\phi(w^{(t)}) = \frac{1}{m} \sum_{i=1}^{m} \nabla\phi_i(w^{(t)}),
    w_i^{(t+1)} = \arg\min_w \phi_i(w) - \left( \nabla\phi_i(w^{(t)}) - \eta\nabla\phi(w^{(t)}) \right)^T w + \frac{\mu}{2} \|w - w^{(t)}\|_2^2,
    w^{(t+1)} = \frac{1}{m} \sum_{i=1}^{m} w_i^{(t+1)}.

Two rounds of communication per iteration.

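A minimal sketch of the DANE loop; `grads[i]` returns ∇φ_i(w), and `local_solve(i, b, w_t)` is an assumed oracle returning argmin_w φ_i(w) − bᵀw + (µ/2)‖w − w_t‖_2^2 (µ is folded into the oracle):

```python
def dane(local_solve, grads, w0, eta, num_rounds):
    """DANE: average gradients (round 1), solve the regularized local
    problems, then average the local solutions (round 2)."""
    w = w0.copy()
    for _ in range(num_rounds):
        local_grads = [g(w) for g in grads]
        avg_grad = np.mean(local_grads, axis=0)      # communication round 1
        w_locals = [local_solve(i, local_grads[i] - eta * avg_grad, w)
                    for i in range(len(grads))]
        w = np.mean(w_locals, axis=0)                # communication round 2
    return w
```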

DANE Method

The local optimization can be related to a mirror descent update with distance-generating function h_i(w) = φ_i(w) + (µ/2)‖w‖_2^2 and step size η:

    w_i^{(t+1)} = \arg\min_w \phi_i(w) - \left( \nabla\phi_i(w^{(t)}) - \eta\nabla\phi(w^{(t)}) \right)^T w + \frac{\mu}{2} \|w - w^{(t)}\|_2^2
                = \arg\min_w V_{\phi_i}(w, w^{(t)}) + \left( \eta\nabla\phi(w^{(t)}) \right)^T w + \frac{\mu}{2} \|w - w^{(t)}\|_2^2
                = \arg\min_w V_{h_i}(w, w^{(t)}) + \left( \eta\nabla\phi(w^{(t)}) \right)^T w.

When µ is large, DANE reduces to gradient descent and no information from the local φ_i(w) is used; when µ = 0 and φ_i(w) ≈ φ(w), it approximates Newton's method, since

    V_{h_i}(w, w^{(t)}) → \frac{1}{2} (w - w^{(t)})^T \nabla^2\phi(w^{(t)}) (w - w^{(t)}).

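To see why the first and last lines agree, expand the Bregman divergence of h_i; this short derivation is our own filling-in, using only the definitions on this slide:

    V_{h_i}(w, w^{(t)}) = h_i(w) - h_i(w^{(t)}) - \nabla h_i(w^{(t)})^T (w - w^{(t)})
                        = \phi_i(w) - \nabla\phi_i(w^{(t)})^T w + \frac{\mu}{2} \|w - w^{(t)}\|_2^2 + \text{const},

where const collects terms independent of w. Adding (η∇φ(w^{(t)}))^T w therefore recovers the DANE local objective up to a constant, which does not affect the arg min.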

DANE Method

The iteration / communication complexity for an L-smooth, λ-strongly convex quadratic objective φ(w) is

    O\left( \frac{(L/\lambda)^2}{n} \log(dm) \log(1/\varepsilon) \right).

In practice, when λ scales as 1/√(nm), the complexity scales only roughly linearly with the number of machines m, not with the sample size N = nm:

    O\left( L^2 m \log(dm) \log(1/\varepsilon) \right).


Simulation Setup

Same data-generating model and regularized least-squares objective as in the one-round experiments, with λ = 1/√(nm).

Simulation Results

To compare the convergence rates, set n = 1000, m = 5, d = 100.

Figure: Comparison of the convergence rate of different algorithms.


Simulation Results

To compare communication rounds versus sample size, set n ∈ {100, 500, 1000, 5000, 10000}, m = 5, d = 100, ε = 10^{-6}.

Table: Communication rounds needed by different algorithms for an ε-suboptimal solution as the sample size N = nm increases

Algorithm        n=100   n=500   n=1000   n=5000   n=10000
Accelerated GD      35      51       61       86       101
ADMM                58      95      122      252       345
DANE                10       8        8        8         8



Page 86: Communication-Efficient Algorithms for Distributed ...niaohe.ise.illinois.edu/IE598_2016/pdf/IE598... · Communication-Efficient Algorithms for Distributed Optimization Juan Xu Kaiqing

Communication Complexity of Multi-round Algorithms

Communication Complexity of Multi-round Algorithms

[Arjevani15]: How many rounds of communications are necessary tosolve the distributed optimization with accuracy ε?

Assume the algorithms only operate over vectors, gradients, andvector products, and only vectors or gradients are communicatedThen there exist m 1-smooth λ-strongly convex local quadraticobjective functions that require at least communication rounds

Ω( √(δ/λ) · log( λ‖w*‖² / ε ) ),

where δ ∈ (0, 1) characterizes the similarity of the φ_i(w) across the m machines, typically decaying as O(1/√n) for randomly split data [Shamir14].
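Plugging the typical decay rates into this bound explains the m^{1/4} scaling that appears in the comparison and summary below. With δ ∼ O(1/√n) and λ ∼ O(1/√(nm)),

√(δ/λ) ∼ √( (1/√n) / (1/√(nm)) ) = √( √(nm)/√n ) = √(√m) = m^{1/4},

so the lower bound becomes Ω( m^{1/4} · log(λ‖w*‖²/ε) ).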


Communication Complexity of Multi-round Algorithms

Communication Complexity of Multi-round Algorithms

Accelerated gradient averaging achieves this lower bound for unrelated local functions, i.e., δ = Ω(1):

O( √(1/λ) · log(1/ε) )  vs.  Ω( √(δ/λ) · log( λ‖w*‖² / ε ) ),

but it can be suboptimal when δ → 0.

DANE can be suboptimal when λ decays as 1/√(nm) and δ decays as 1/√n:

O( m · log(1/ε) )  vs.  Ω( m^{1/4} · log(1/ε) ).

Designing optimal algorithms for general objective functions remains open.


Communication Complexity of Multi-round Algorithms

Communication Complexity of Multi-round Algorithms

Summary

Table: Communication efficiency of several distributed algorithms for quadratic objective functions when λ ∼ O(1/√(nm)) and δ ∼ O(1/√n).

Algorithm        Number of Communication Rounds
Accelerated GD   O( (nm)^{1/4} · log(1/ε) )
ADMM             O( (nm)^{1/4} · log(1/ε) )
DANE             O( m · log(1/ε) )
Optimal          O( m^{1/4} · log(1/ε) )
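The first two rows follow from the usual √(1/λ) dependence of accelerated methods on the condition number (the ADMM row scales the same way in this regime): with λ ∼ 1/√(nm),

√(1/λ) ∼ √(√(nm)) = (nm)^{1/4},

giving O( (nm)^{1/4} · log(1/ε) ) rounds, whereas the optimal count grows only as m^{1/4}, independently of n.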


Conclusions

Overview

1 Motivation and Problem Description

2 Algorithms with One-round Communication

3 Algorithms with Multi-round Communications

4 Communication Complexity of Multi-round Algorithms

5 Conclusions


Conclusions

Take-aways

Distributed algorithms with only one round of communication suffer from constant estimation errors that cannot be mitigated by cleverer averaging.

DANE outperforms AGD and ADMM in terms of communication complexity by exploiting the similarity of the local functions.

Many communication rounds can be necessary to reach ε-suboptimality, even when the computational capability of the local machines is unlimited.


Conclusions

References

He, Niao. "IE598 Big Data Optimization." University of Illinois at Urbana-Champaign, 2016.

Zhang, Yuchen, Martin J. Wainwright, and John C. Duchi. "Communication-efficient algorithms for statistical optimization." In Advances in Neural Information Processing Systems, pp. 1502-1510. 2012.

Zhang, Yuchen, Martin J. Wainwright, and John C. Duchi. "Communication-efficient algorithms for statistical optimization." Journal of Machine Learning Research, pp. 3321-3363. 2013.

Sridharan, Karthik, Shai Shalev-Shwartz, and Nathan Srebro. "Fast rates for regularized objectives." In Advances in Neural Information Processing Systems, pp. 1545-1552. 2009.

Shalev-Shwartz, Shai, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. "Stochastic Convex Optimization." In COLT. 2009.


Shamir, Ohad, Nathan Srebro, and Tong Zhang. "Communication-Efficient Distributed Optimization using an Approximate Newton-type Method." In ICML, vol. 32, no. 1, pp. 1000-1008. 2014.

Deng, Wei, and Wotao Yin. "On the global and linear convergence of the generalized alternating direction method of multipliers." Journal of Scientific Computing 66, no. 3 (2016): 889-916.

Hong, Mingyi, and Zhi-Quan Luo. "On the linear convergence of the alternating direction method of multipliers." arXiv preprint arXiv:1208.3922 (2012).

Arjevani, Yossi, and Ohad Shamir. "Communication complexity of distributed convex learning and optimization." In Advances in Neural Information Processing Systems, pp. 1756-1764. 2015.


Conclusions

Thank you!

Q&A
