TRANSCRIPT

Communication-Efficient Algorithms for Distributed Optimization
Juan Xu, Kaiqing Zhang
December 1, 2016

Overview
1 Motivation and Problem Description
2 Algorithms with One-round Communication
3 Algorithms with Multi-round Communications
4 Communication Complexity of Multi-round Algorithms
5 Conclusions
Motivation and Problem Description

Motivation
Consider a stochastic optimization formulation
$\min_w \; F(w) = \mathbb{E}_{X\sim P}[f(w, X)].$
Population risk function $F(w): \mathbb{R}^d \to \mathbb{R}$.
Population risk minimizer $w^* := \arg\min_w F(w)$; $w^*$ is assumed to be unique.
Loss function $f(w, X)$: $f$ is real-valued and convex, and $X$ is a random variable following distribution $P$ ($P$ is unknown).

Motivation and Problem Description
MotivationInstead of the unavailable probability distribution P, we have access toN = nm samples which are i.i.d. according to distribution P.
Empirical risk function φ(w) : Rd → R
φ(w) = 1
N
N∑i=1
f (w , xi ).
Empirical risk minimizer w
w ∈ arg minwφ(w),
→ Finite Sum Problem.
♠ Can we solve it when the sample size N = nm is so large to be storedin a single computer, i.e., in a centralized setting?→ Distributed Optimization.
Juan Xu, Kaiqing Zhang Distributed Optimization 4 / 32
Motivation and Problem Description
MotivationInstead of the unavailable probability distribution P, we have access toN = nm samples which are i.i.d. according to distribution P.
Empirical risk function φ(w) : Rd → R
φ(w) = 1
N
N∑i=1
f (w , xi ).
Empirical risk minimizer w
w ∈ arg minwφ(w),
→ Finite Sum Problem.
♠ Can we solve it when the sample size N = nm is so large to be storedin a single computer, i.e., in a centralized setting?→ Distributed Optimization.
Juan Xu, Kaiqing Zhang Distributed Optimization 4 / 32
Motivation and Problem Description
MotivationInstead of the unavailable probability distribution P, we have access toN = nm samples which are i.i.d. according to distribution P.
Empirical risk function φ(w) : Rd → R
φ(w) = 1
N
N∑i=1
f (w , xi ).
Empirical risk minimizer w
w ∈ arg minwφ(w),
→ Finite Sum Problem.
♠ Can we solve it when the sample size N = nm is so large to be storedin a single computer, i.e., in a centralized setting?→ Distributed Optimization.
Juan Xu, Kaiqing Zhang Distributed Optimization 4 / 32
Motivation and Problem Description
MotivationInstead of the unavailable probability distribution P, we have access toN = nm samples which are i.i.d. according to distribution P.
Empirical risk function φ(w) : Rd → R
φ(w) = 1
N
N∑i=1
f (w , xi ).
Empirical risk minimizer w
w ∈ arg minwφ(w),
→ Finite Sum Problem.
♠ Can we solve it when the sample size N = nm is so large to be storedin a single computer, i.e., in a centralized setting?
→ Distributed Optimization.
Juan Xu, Kaiqing Zhang Distributed Optimization 4 / 32
Motivation and Problem Description
MotivationInstead of the unavailable probability distribution P, we have access toN = nm samples which are i.i.d. according to distribution P.
Empirical risk function φ(w) : Rd → R
φ(w) = 1
N
N∑i=1
f (w , xi ).
Empirical risk minimizer w
w ∈ arg minwφ(w),
→ Finite Sum Problem.
♠ Can we solve it when the sample size N = nm is so large to be storedin a single computer, i.e., in a centralized setting?→ Distributed Optimization.
Juan Xu, Kaiqing Zhang Distributed Optimization 4 / 32
Distributed Optimization
Distributed Optimization: divide the large-scale data into $m$ pieces, and use $m$ machines, one per piece.
Key trade-off: pay an expensive communication cost in exchange for memory feasibility.

Two Streams in Distributed Optimization
Figure: Distributed Optimization
One-round Communication
Multi-round Communication

Problem Settings
Given a dataset of $N = nm$ samples drawn i.i.d. from distribution $P$.
Divide the samples evenly among $m$ machines; each machine gets a subsampled dataset of size $n$.
Local empirical risk function $\phi_i(w): \mathbb{R}^d \to \mathbb{R}$:
$\phi_i(w) := \frac{1}{n}\sum_{j=1}^{n} f(w, x_j).$
→ Finite Sum Problem.
Assumption. Each local empirical risk function $\phi_i(w)$ is $L$-smooth and $\lambda$-strongly convex.

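To make the setup concrete, here is a minimal sketch (not from the slides) of splitting the $N = nm$ samples across $m$ machines and evaluating a local empirical risk; the squared loss with an added $\frac{\lambda}{2}\|w\|_2^2$ term anticipates the ridge-regression simulation used later, and all function names are illustrative.

```python
# Minimal sketch (illustrative, not from the slides): split N = n*m samples
# across m machines and evaluate one local empirical risk phi_i and its
# gradient.  The squared loss with a lam/2 * ||w||^2 term matches the
# ridge-regression simulation used later; any L-smooth, strongly convex
# loss would do.
import numpy as np

def split_data(X, y, m):
    """Evenly divide the N = n*m samples into m local datasets of size n."""
    return list(zip(np.array_split(X, m), np.array_split(y, m)))

def local_risk(w, Xi, yi, lam):
    """phi_i(w) = (1/n) * sum_j (x_j^T w - y_j)^2 + lam/2 * ||w||_2^2."""
    n = Xi.shape[0]
    r = Xi @ w - yi
    return r @ r / n + 0.5 * lam * w @ w

def local_grad(w, Xi, yi, lam):
    """Gradient of phi_i(w)."""
    n = Xi.shape[0]
    return 2.0 * Xi.T @ (Xi @ w - yi) / n + lam * w
```

The later sketches reuse this data layout: `local_data = split_data(X, y, m)` plays the role of the $m$ local datasets.
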
Algorithms with One-round Communication

Background
Figure: Algorithms with One-round Communication
Averaging methods?
Accuracy, i.e., mean-squared error with respect to the population minimizer $w^*$?
Optimization methods for the local problem?
The results on algorithms with one-round communication are mainly based on [He16] and [Zhang13].

Averaging Mixture (AVGM) Method
AVGM Method.
Machine $i$ uses its $n$ samples to solve the local empirical risk minimization problem
$w_{1,i} \in \arg\min_w \phi_i(w).$
AVGM combines these $m$ estimates by averaging:
$\bar{w}_1 := \frac{1}{m}\sum_{i=1}^{m} w_{1,i}.$

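A minimal sketch of the one-round AVGM pipeline under the ridge-regression setup introduced above (an illustrative choice, not prescribed by the slides): each machine solves its local problem in closed form, then the $m$ local minimizers are averaged.

```python
# Minimal AVGM sketch for the ridge-regression local objectives, where the
# local empirical risk minimization has a closed-form solution.
import numpy as np

def local_ridge_minimizer(Xi, yi, lam):
    """w_{1,i} = argmin_w (1/n)||X_i w - y_i||^2 + lam/2 ||w||^2."""
    n, d = Xi.shape
    A = 2.0 * Xi.T @ Xi / n + lam * np.eye(d)
    b = 2.0 * Xi.T @ yi / n
    return np.linalg.solve(A, b)

def avgm(local_data, lam):
    """One communication round: average the m local minimizers."""
    return np.mean([local_ridge_minimizer(Xi, yi, lam) for Xi, yi in local_data],
                   axis=0)
```

With the data split from the earlier sketch, `avgm(split_data(X, y, m), lam)` would produce the one-round estimate $\bar{w}_1$.
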
Mean-squared Error for the AVGM Method
Theorem 1
The mean-squared error of the AVGM method is upper bounded as
$\mathbb{E}\left[\|\bar{w}_1 - w^*\|_2^2\right] \le O\!\left(\frac{1}{\lambda^2 nm} + \frac{1}{\lambda^6 n^2} + \frac{L^2 \log d}{\lambda^4 n^2} + \frac{1}{n^2 m} + \frac{1}{n^3}\right).$
Remark on Theorem 1.
The leading term decays proportionally to $(nm)^{-1} = N^{-1}$ under the assumption $m \le n$.
If we assume $F(w)$ is $L$-smooth, then
$\mathbb{E}\left[F(\bar{w}_1) - F(w^*)\right] \le \frac{L}{2}\,\mathbb{E}\left[\|\bar{w}_1 - w^*\|_2^2\right].$
There may exist a constant gap, because we often have $\lambda = O(1/\sqrt{nm})$.

Subsampled Averaging Mixture (SAVGM) Method
Fix a subsampling rate $r \in (0, 1)$.
Each machine draws a subset of size $\lceil rn \rceil$ by sampling uniformly at random without replacement from its local dataset of size $n$.
Machine $i$ computes both the local empirical minimizer $w_{1,i}$ and the subsampled empirical minimizer
$w_{2,i} \in \arg\min_w \phi_{2,i}(w) = \frac{1}{\lceil rn \rceil}\sum_{j=1}^{\lceil rn \rceil} f(w, x_j).$
Compute the bootstrap average $\bar{w}_2 := \frac{1}{m}\sum_{i=1}^{m} w_{2,i}$.
SAVGM returns the weighted combination
$w_{\mathrm{SAVGM}} := \frac{\bar{w}_1 - r\,\bar{w}_2}{1 - r}.$

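A minimal sketch of SAVGM under the same setup, reusing the `local_ridge_minimizer` helper from the AVGM sketch above: each machine additionally solves a subsampled local problem at rate $r$, and the two averages are debiased by the weighted combination.

```python
# Minimal SAVGM sketch; assumes local_ridge_minimizer from the AVGM sketch.
import numpy as np

def savgm(local_data, lam, r, seed=0):
    rng = np.random.default_rng(seed)
    w1_list, w2_list = [], []
    for Xi, yi in local_data:
        n = Xi.shape[0]
        k = int(np.ceil(r * n))
        idx = rng.choice(n, size=k, replace=False)   # subsample w/o replacement
        w1_list.append(local_ridge_minimizer(Xi, yi, lam))
        w2_list.append(local_ridge_minimizer(Xi[idx], yi[idx], lam))
    w1_bar = np.mean(w1_list, axis=0)
    w2_bar = np.mean(w2_list, axis=0)
    return (w1_bar - r * w2_bar) / (1.0 - r)          # debiased combination
```
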
Mean-squared Error for the SAVGM Method
Theorem 2
The mean-squared error of the SAVGM method is upper bounded as
$\mathbb{E}\left[\|w_{\mathrm{SAVGM}} - w^*\|_2^2\right] \le O\!\left(\frac{1}{\lambda^2 nm} + \frac{1}{n^2 m} + \frac{1}{\lambda^6 n^3} + \frac{L^2 d \log d}{\lambda^4 n^3}\right).$
Remark on Theorem 2.
If we assume $F(w)$ is $L$-smooth, then
$\mathbb{E}\left[F(w_{\mathrm{SAVGM}}) - F(w^*)\right] \le \frac{L}{2}\,\mathbb{E}\left[\|w_{\mathrm{SAVGM}} - w^*\|_2^2\right].$
The $O(n^{-2})$ term is eliminated here, but the constant-gap problem still exists.

Optimization Methods for the Local Problem
What optimization method can machine $i$ use to obtain $w_{1,i}$ and $w_{2,i}$, i.e., how do we solve the finite-sum problems on each machine?
$w_{1,i} \in \arg\min_w \phi_i(w) = \frac{1}{n}\sum_{j=1}^{n} f(w, x_j);$
$w_{2,i} \in \arg\min_w \phi_{2,i}(w) = \frac{1}{\lceil rn \rceil}\sum_{j=1}^{\lceil rn \rceil} f(w, x_j).$
Under the assumption that $\phi_i(w)$ and $\phi_{2,i}(w)$ are both $L$-smooth and $\lambda$-strongly convex for each $i$:
Gradient Descent (GD): $O\!\left(\frac{L}{\lambda}\log\frac{1}{\varepsilon}\right)$;
Stochastic Gradient Descent (SGD): $O\!\left(\frac{L}{\lambda^2\varepsilon}\right)$.

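A minimal sketch of the two local solvers for a generic $L$-smooth, $\lambda$-strongly convex finite-sum objective; the step-size choices below are standard textbook defaults, not prescriptions from the slides.

```python
# Minimal sketches of the two local solvers; step sizes are illustrative
# standard choices for an L-smooth, lambda-strongly convex objective.
import numpy as np

def gradient_descent(grad, w0, L, num_iters):
    """Full-gradient descent with step size 1/L."""
    w = np.array(w0, dtype=float)
    for _ in range(num_iters):
        w = w - grad(w) / L
    return w

def stochastic_gradient_descent(sample_grad, w0, lam, n, num_iters, seed=0):
    """SGD with step size 1/(lam*(t+1)); sample_grad(w, j) returns the
    gradient of the j-th summand (plus regularizer) at w."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for t in range(num_iters):
        j = rng.integers(n)                    # pick one sample uniformly
        w = w - sample_grad(w, j) / (lam * (t + 1))
    return w
```
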
Simulation setup
$(x_j, y_j) \in \mathbb{R}^d \times \mathbb{R}$ are drawn i.i.d. with $x_j \sim \mathcal{N}(0, \Sigma)$, where $\Sigma \in \mathbb{R}^{d\times d}$ is diagonal with $\Sigma_{k,k} = k^{-2.5}$, and $y_j = x_j^T w_0 + \varepsilon_j$ with $w_0 = \mathbf{1}$ and $\varepsilon_j \sim \mathcal{N}(0, 1)$.
Solve
$\min_w \; \phi(w) = \frac{1}{m}\sum_{i=1}^{m} \phi_i(w) + \frac{\lambda}{2}\|w\|_2^2, \quad \text{where } \phi_i(w) = \frac{1}{n}\sum_{j=1}^{n} \left(x_j^T w - y_j\right)^2, \quad \lambda = 1/\sqrt{nm}.$
Set $n = 1000$, $m = 5$, $d = 100$.

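A minimal sketch of this data-generation step (the function name and random seed are illustrative):

```python
# Minimal sketch of the simulated data: x_j ~ N(0, Sigma) with Sigma diagonal,
# Sigma_kk = k^(-2.5); y_j = x_j^T w0 + eps_j, w0 = 1, eps_j ~ N(0, 1).
import numpy as np

def generate_data(n, m, d, seed=0):
    rng = np.random.default_rng(seed)
    std = np.arange(1, d + 1) ** -1.25           # sqrt of the diagonal of Sigma
    X = rng.standard_normal((n * m, d)) * std    # rows are x_j ~ N(0, Sigma)
    w0 = np.ones(d)
    y = X @ w0 + rng.standard_normal(n * m)
    return X, y, w0

n, m, d = 1000, 5, 100
lam = 1.0 / np.sqrt(n * m)
X, y, w0 = generate_data(n, m, d)
```
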
Algorithm Comparison
Figure: panels comparing GD and SGD on the local problems, and the AVGM and SAVGM estimates.

Algorithms with Multi-round Communications

Suboptimality of AVGM and SAVGM
Recall
$\mathbb{E}[F(\bar{w}_1)] - F(w^*) \le O\!\left(\frac{L}{\lambda^2 nm} + \frac{L}{\lambda^6 n^2} + \frac{L^3 \log d}{\lambda^4 n^2}\right);$
the mean-squared error of the one-round communication algorithms can be much worse if $\lambda \to 0$.
In fact, $\lambda$ decays as $\frac{1}{\sqrt{nm}}$ in many learning applications, e.g., regularized loss minimization and SVM-type problems [Sridharan09, Shalev09].
There exists a worst case [Shamir14] in which the centralized empirical minimizer satisfies $\mathbb{E}[F(\hat{w})] - F(w^*) \le \frac{C_1}{\lambda nm}$, while the one-round averaged estimator satisfies $\mathbb{E}[F(\bar{w}_1)] - F(w^*) \ge \frac{C_2}{\lambda n}$.
Multi-round communications are necessary for estimators to converge to the empirical minimizer $\hat{w}$.

Accelerated gradient averaging
Each machine calculates the local gradient $\nabla\phi_i\!\left(w^{(t)}\right)$.
An averaged full-gradient step with $\nabla\phi\!\left(w^{(t)}\right) = \frac{1}{m}\sum_{i=1}^{m}\nabla\phi_i\!\left(w^{(t)}\right)$ is taken for the update, and $w^{(t+1)}$ is distributed back to the machines.
The number of communication rounds / iterations needed for an $\varepsilon$-suboptimal solution is
$O\!\left(\sqrt{\frac{L}{\lambda}}\,\log\frac{1}{\varepsilon}\right).$
This polynomial dependence on the condition number is undesirable: since $\lambda$ typically decays as $1/\sqrt{nm}$, it results in an $O(\sqrt[4]{nm})$ dependence on the sample size.

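A minimal sketch of this distributed accelerated scheme; `local_grads` is a list of callables returning $\nabla\phi_i$, and the Nesterov momentum weight $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ is a standard illustrative choice rather than something specified on the slides.

```python
# Minimal sketch of distributed accelerated gradient averaging: one round of
# communication per iteration to average the local gradients at the query
# point, followed by a Nesterov-style step at the center.
import numpy as np

def distributed_agd(local_grads, w0, L, lam, num_rounds):
    kappa = L / lam
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)   # momentum weight
    w = np.array(w0, dtype=float)
    w_prev = w.copy()
    for _ in range(num_rounds):
        v = w + beta * (w - w_prev)                                  # extrapolation
        g = np.mean([grad_i(v) for grad_i in local_grads], axis=0)   # avg gradients
        w_prev, w = w, v - g / L                                     # gradient step
    return w
```
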
Alternating Direction Method of Multipliers (ADMM)
The machines alternate between computing shared dual variables and solving augmented Lagrangian subproblems with local data.
Recall that for the problem
$\min \; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c,$
the ADMM iteration (in scaled form) follows
$x^{(t+1)} = \arg\min_x \; f(x) + \frac{\rho}{2}\left\|Ax + Bz^{(t)} - c + u^{(t)}\right\|_2^2$
$z^{(t+1)} = \arg\min_z \; g(z) + \frac{\rho}{2}\left\|Ax^{(t+1)} + Bz - c + u^{(t)}\right\|_2^2$
$u^{(t+1)} = u^{(t)} + \left(Ax^{(t+1)} + Bz^{(t+1)} - c\right).$

Distributed ADMM
The distributed optimization problem can be reformulated in consensus form as
$\min \; \frac{1}{m}\sum_{i=1}^{m}\phi_i(w_i) \quad \text{s.t.} \quad w_i - z = 0, \;\forall i.$
The distributed ADMM iteration thus follows
$w_i^{(t+1)} = \arg\min_{w_i} \; \phi_i(w_i) + \frac{\rho}{2}\left\|w_i - z^{(t)} + u_i^{(t)}\right\|_2^2, \;\forall i$
$z^{(t+1)} = \frac{1}{m}\sum_{i=1}^{m}\left(w_i^{(t+1)} + u_i^{(t)}\right)$
$u_i^{(t+1)} = u_i^{(t)} + \left(w_i^{(t+1)} - z^{(t+1)}\right), \;\forall i.$
The number of iterations / communication rounds scales strongly with the condition number, i.e., with the sample size in learning applications [Deng12, Hong12].

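A minimal sketch of this consensus-ADMM loop, again specialized to the ridge-regression local objectives so that the $w_i$-update has a closed form; $\rho$ is the augmented-Lagrangian parameter and its value is an illustrative tuning choice.

```python
# Minimal sketch of distributed (consensus) ADMM for the ridge-regression
# local objectives; each w_i-update is solved in closed form.
import numpy as np

def distributed_admm(local_data, lam, rho, d, num_rounds):
    m = len(local_data)
    w = np.zeros((m, d))
    u = np.zeros((m, d))
    z = np.zeros(d)
    for _ in range(num_rounds):
        for i, (Xi, yi) in enumerate(local_data):
            n = Xi.shape[0]
            # w_i-update: argmin_w phi_i(w) + rho/2 ||w - z + u_i||^2
            A = 2.0 * Xi.T @ Xi / n + (lam + rho) * np.eye(d)
            b = 2.0 * Xi.T @ yi / n + rho * (z - u[i])
            w[i] = np.linalg.solve(A, b)
        z = np.mean(w + u, axis=0)       # z-update (one communication round)
        u = u + (w - z)                  # scaled dual update
    return z
```
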
Distributed Optimization using an Approximate Newton-Type (DANE) Method
[Shamir14]: an approximate Newton-type step is taken at each iteration, while no local Hessian $\nabla^2\phi_i(w)$ is explicitly computed or communicated.
With parameter $\eta > 0$ and regularizer $\mu > 0$:
$\nabla\phi\!\left(w^{(t)}\right) = \frac{1}{m}\sum_{i=1}^{m}\nabla\phi_i\!\left(w^{(t)}\right)$
$w_i^{(t+1)} = \arg\min_w \; \phi_i(w) - \left(\nabla\phi_i\!\left(w^{(t)}\right) - \eta\nabla\phi\!\left(w^{(t)}\right)\right)^T w + \frac{\mu}{2}\left\|w - w^{(t)}\right\|_2^2$
$w^{(t+1)} = \frac{1}{m}\sum_{i=1}^{m} w_i^{(t+1)}$
Two rounds of communication per iteration.

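A minimal sketch of one DANE iteration for the ridge-regression local objectives, where the regularized local subproblem has a closed form; $\eta$ and $\mu$ are the parameters from the slides, and the default values below are illustrative only.

```python
# Minimal sketch of one DANE iteration for ridge-regression local objectives.
import numpy as np

def dane_step(local_data, lam, w, eta=1.0, mu=0.0):
    d = w.shape[0]
    # Communication round 1: average the local gradients at w.
    grads = [2.0 * Xi.T @ (Xi @ w - yi) / Xi.shape[0] + lam * w
             for Xi, yi in local_data]
    g = np.mean(grads, axis=0)
    # Communication round 2: each machine solves its local subproblem; average.
    new_ws = []
    for (Xi, yi), gi in zip(local_data, grads):
        n = Xi.shape[0]
        # argmin_v phi_i(v) - (grad phi_i(w) - eta*g)^T v + mu/2 ||v - w||^2
        A = 2.0 * Xi.T @ Xi / n + (lam + mu) * np.eye(d)
        b = 2.0 * Xi.T @ yi / n + (gi - eta * g) + mu * w
        new_ws.append(np.linalg.solve(A, b))
    return np.mean(new_ws, axis=0)
```

Iterating `w = dane_step(local_data, lam, w)` from `w = np.zeros(d)` gives the DANE trajectory for this quadratic example.
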
DANE Method
The local optimization can be related to a mirror descent update with distance-generating function $h_i(w) = \phi_i(w) + \frac{\mu}{2}\|w\|_2^2$ and step size $\eta$:
$w_i^{(t+1)} = \arg\min_w \; \phi_i(w) - \left(\nabla\phi_i\!\left(w^{(t)}\right) - \eta\nabla\phi\!\left(w^{(t)}\right)\right)^T w + \frac{\mu}{2}\left\|w - w^{(t)}\right\|_2^2$
$\qquad\;\;\, = \arg\min_w \; V_{\phi_i}\!\left(w, w^{(t)}\right) + \left(\eta\nabla\phi\!\left(w^{(t)}\right)\right)^T w + \frac{\mu}{2}\left\|w - w^{(t)}\right\|_2^2$
$\qquad\;\;\, = \arg\min_w \; V_{h_i}\!\left(w, w^{(t)}\right) + \left(\eta\nabla\phi\!\left(w^{(t)}\right)\right)^T w.$
When $\mu$ is large, the update reduces to gradient descent and no information about the local $\phi_i(w)$ is used; when $\mu = 0$ and $\phi_i(w) \approx \phi(w)$, it approximates Newton's method, since
$V_{h_i}\!\left(w, w^{(t)}\right) \to \frac{1}{2}\left(w - w^{(t)}\right)^T \nabla^2\phi\!\left(w^{(t)}\right)\left(w - w^{(t)}\right).$

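To see why the last limit holds, note that the Bregman divergence of any quadratic generating function is exactly the Hessian-weighted quadratic; a short check of this step (filled in here, not spelled out on the slides), using the definition $V_h(w, w') = h(w) - h(w') - \nabla h(w')^\top (w - w')$:

```latex
% Bregman divergence of a quadratic h(w) = (1/2) w^T H w + b^T w + c:
V_h(w, w') = h(w) - h(w') - \nabla h(w')^\top (w - w')
           = \tfrac{1}{2} w^\top H w - \tfrac{1}{2} w'^\top H w' + b^\top (w - w')
             - (H w' + b)^\top (w - w')
           = \tfrac{1}{2} (w - w')^\top H (w - w').
% With \mu = 0 and \phi_i \approx \phi quadratic, h_i has Hessian
% H = \nabla^2 \phi(w^{(t)}), which gives the displayed limit.
```
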
DANE Method
The iteration / communication complexity for an $L$-smooth, $\lambda$-strongly convex quadratic objective $\phi(w)$ is
$O\!\left(\frac{(L/\lambda)^2}{n}\log(dm)\log\frac{1}{\varepsilon}\right).$
In practice, when $\lambda$ scales as $1/\sqrt{nm}$, the complexity only scales roughly linearly with the number of machines $m$, not with the sample size $N = nm$:
$O\!\left(L^2 m \log(dm)\log\frac{1}{\varepsilon}\right).$

Simulation setup
The same synthetic regression data and regularized objective as in the one-round experiments: $x_j \sim \mathcal{N}(0, \Sigma)$ with $\Sigma_{k,k} = k^{-2.5}$, $y_j = x_j^T w_0 + \varepsilon_j$, $w_0 = \mathbf{1}$, $\varepsilon_j \sim \mathcal{N}(0, 1)$, and
$\min_w \; \phi(w) = \frac{1}{m}\sum_{i=1}^{m}\phi_i(w) + \frac{\lambda}{2}\|w\|_2^2, \quad \phi_i(w) = \frac{1}{n}\sum_{j=1}^{n}\left(x_j^T w - y_j\right)^2, \quad \lambda = 1/\sqrt{nm}.$

Simulation results
To compare the convergence rates, set $n = 1000$, $m = 5$, $d = 100$.
Figure: Comparison of the convergence rate of different algorithms.

Simulation results
To compare the number of communication rounds against the sample size, set $n \in \{100, 500, 1000, 5000, 10000\}$, $m = 5$, $d = 100$, $\varepsilon = 10^{-6}$.
Table: Communication rounds needed by different algorithms to reach an $\varepsilon$-suboptimal solution as the sample size $N = nm$ increases

Algorithm        n=100   n=500   n=1000   n=5000   n=10000
Accelerated GD     35      51      61       86       101
ADMM               58      95      122      252      345
DANE               10      8       8        8        8

Communication Complexity of Multi-round Algorithms

Communication Complexity of Multi-round Algorithms
[Arjevani15]: How many rounds of communication are necessary to solve the distributed optimization problem to accuracy $\varepsilon$?
Assume the algorithms only operate on vectors, gradients, and vector products, and that only vectors or gradients are communicated.
Then there exist $m$ $1$-smooth, $\lambda$-strongly convex local quadratic objective functions that require at least
$\Omega\!\left(\sqrt{\frac{\delta}{\lambda}}\,\log\frac{\lambda\|w^*\|^2}{\varepsilon}\right)$
communication rounds, where $\delta \in (0, 1)$ characterizes the similarity of the $\phi_i(w)$ across the $m$ machines, typically decaying as $O\!\left(\frac{1}{\sqrt{n}}\right)$ for randomly split data [Shamir14].

Communication Complexity of Multi-round Algorithms
Accelerated gradient averaging can achieve the bound for unrelated local functions, i.e., $\delta = \Omega(1)$:
$O\!\left(\sqrt{\frac{1}{\lambda}}\,\log\frac{1}{\varepsilon}\right) \quad \text{vs.} \quad \Omega\!\left(\sqrt{\frac{\delta}{\lambda}}\,\log\frac{\lambda\|w^*\|^2}{\varepsilon}\right),$
but can be suboptimal when $\delta \to 0$.
DANE can be suboptimal when $\lambda$ decays as $\frac{1}{\sqrt{nm}}$ and $\delta$ decays as $\frac{1}{\sqrt{n}}$:
$O\!\left(m\log\frac{1}{\varepsilon}\right) \quad \text{vs.} \quad \Omega\!\left(\sqrt[4]{m}\,\log\frac{1}{\varepsilon}\right).$
Designing optimal algorithms for general objective functions remains open.

Communication Complexity of Multi-round Algorithms
Summary
Table: Communication efficiency of several distributed algorithms for quadratic objective functions when $\lambda \sim O\!\left(\frac{1}{\sqrt{nm}}\right)$ and $\delta \sim O\!\left(\frac{1}{\sqrt{n}}\right)$

Algorithm        Number of Communication Rounds
Accelerated GD   $O(\sqrt[4]{nm}\,\log(1/\varepsilon))$
ADMM             $O(\sqrt[4]{nm}\,\log(1/\varepsilon))$
DANE             $O(m\,\log(1/\varepsilon))$
Optimal          $O(\sqrt[4]{m}\,\log(1/\varepsilon))$

Conclusions

Take-aways
Distributed algorithms with only one round of communication suffer from constant estimation errors that cannot be mitigated by cleverer averaging.
DANE outperforms accelerated GD and ADMM in terms of communication complexity by exploiting the similarity of the local functions.
Many communication rounds can be necessary to achieve $\varepsilon$-suboptimality, even when the computation capability of the local machines is unlimited.

References
He, Niao. "IE598 Big Data Optimization." University of Illinois at Urbana-Champaign, 2016.
Zhang, Yuchen, Martin J. Wainwright, and John C. Duchi. "Communication-efficient algorithms for statistical optimization." In Advances in Neural Information Processing Systems, pp. 1502-1510, 2012.
Zhang, Yuchen, Martin J. Wainwright, and John C. Duchi. "Communication-efficient algorithms for statistical optimization." Journal of Machine Learning Research, pp. 3321-3363, 2013.
Sridharan, Karthik, Shai Shalev-Shwartz, and Nathan Srebro. "Fast rates for regularized objectives." In Advances in Neural Information Processing Systems, pp. 1545-1552, 2009.
Shalev-Shwartz, Shai, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. "Stochastic Convex Optimization." In COLT, 2009.
Shamir, Ohad, Nathan Srebro, and Tong Zhang. "Communication-Efficient Distributed Optimization using an Approximate Newton-type Method." In ICML, vol. 32, no. 1, pp. 1000-1008, 2014.
Deng, Wei, and Wotao Yin. "On the global and linear convergence of the generalized alternating direction method of multipliers." Journal of Scientific Computing 66, no. 3 (2016): 889-916.
Hong, Mingyi, and Zhi-Quan Luo. "On the linear convergence of the alternating direction method of multipliers." arXiv preprint arXiv:1208.3922, 2012.
Arjevani, Yossi, and Ohad Shamir. "Communication complexity of distributed convex learning and optimization." In Advances in Neural Information Processing Systems, pp. 1756-1764, 2015.

Thank you!
Q&A