


JOURNAL OF TSP LATEX CLASS FILES, VOL. X, NO. X, SEP 2018 1

A Decentralized Proximal-Gradient Method with Network Independent Step-sizes and Separated Convergence Rates

Zhi Li, Wei Shi, and Ming Yan

Abstract—This paper proposes a novel proximal-gradient algorithm for a decentralized optimization problem with a composite objective containing smooth and non-smooth terms. Specifically, the smooth and non-smooth terms are dealt with by gradient and proximal updates, respectively. The proposed algorithm is closely related to a previous algorithm, PG-EXTRA [1], but has a few advantages. First of all, agents use uncoordinated step-sizes, and the stable upper bounds on the step-sizes are independent of network topologies. The step-sizes depend on local objective functions, and they can be as large as those of gradient descent. Secondly, for the special case without non-smooth terms, linear convergence can be achieved under the strong convexity assumption. The dependence of the convergence rate on the objective functions and the network are separated, and the convergence rate of the new algorithm is as good as one of the two convergence rates that match the typical rates for general gradient descent and consensus averaging. We provide numerical experiments to demonstrate the efficacy of the introduced algorithm and validate our theoretical discoveries.

Index Terms—decentralized optimization, proximal-gradient, convergence rates, network independent

I. INTRODUCTION

THIS paper focuses on the following decentralized optimization problem:

minimize_{x∈R^p}  f̄(x) := (1/n) ∑_{i=1}^n ( s_i(x) + r_i(x) ),    (1)

where s_i : R^p → R and r_i : R^p → R ∪ {+∞} are two lower semi-continuous proper convex functions held privately by agent i to encode the agent’s objective. We

Zhi Li is with the Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA. ([email protected])

Wei Shi was with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544, USA.

Ming Yan is with the Department of Computational Mathematics, Science and Engineering and the Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA. ([email protected])

This work is supported in part by the NSF grant DMS-1621798.

assume that one function (e.g., without loss of generality, s_i) is differentiable and has a Lipschitz continuous gradient with parameter L > 0, and the other function r_i is proximable, i.e., its proximal mapping

prox_{λr_i}(y) = arg min_{x∈R^p}  λr_i(x) + (1/2)‖x − y‖²,

has a closed-form solution or can be computed easily. Examples of s_i include linear functions, quadratic functions, and logistic functions, while r_i could be the ℓ1 norm, 1D total variation, or indicator functions of simple convex sets. In addition, we assume that the agents are connected through a fixed bi-directional communication network. Every agent in the network wants to obtain an optimal solution of (1) while it can only receive/send nonsensitive messages^1 from/to its immediate neighbors.
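To make the proximal mapping concrete: for r_i = λ‖·‖₁, prox_{λr_i} has the well-known closed-form soft-thresholding solution. A minimal sketch (the function name `prox_l1` is our own, not from the paper):

```python
import numpy as np

def prox_l1(y, lam):
    """prox_{lam*||.||_1}(y): entrywise soft-thresholding, the closed-form
    minimizer of lam*||x||_1 + 0.5*||x - y||^2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([1.5, -0.3, 0.0, 2.0])
print(prox_l1(y, 0.5))  # entries shrink toward 0 by 0.5; small entries vanish
```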

Specific problems of form (1) that require a decentralized computing architecture have appeared in various areas including networked multi-vehicle coordination, distributed information processing and decision making in sensor networks, as well as distributed estimation and learning. Some examples include distributed average consensus [2]–[4], distributed spectrum sensing [5], information control [6], [7], power systems control [8], [9], and statistical inference and learning [10]–[12]. In general, decentralized optimization fits scenarios where the data is collected and/or stored in a distributed network, a fusion center is either inapplicable or unaffordable, and/or computing is required to be performed in a distributed but collaborative manner by multiple agents or by network designers.

A. Literature Review

The study on distributed algorithms dates back to the early 1980s [13], [14]. Since then, due to the emergence

^1 We believe that agent i’s instantaneous estimation of the optimal solution is not a piece of sensitive information, but s_i and r_i are.

arXiv:1704.07807v2 [math.OC] 18 Jun 2019


of large-scale networks, decentralized (optimization) algorithms, as a special type of distributed algorithms for solving problem (1), have received attention. Many efforts have been made on star networks with one master agent and multiple slave agents [15], [16]. This scheme is “centralized” due to the use of a “master” agent. It may suffer from a single point of failure and may violate the privacy requirement in certain applications. In this paper, we focus on solving (1) in a decentralized fashion, where no “master” agent is used.

Incremental algorithms [17]–[22] can solve (1) without the need of a “master” agent; they are based on a directed ring network. To handle general (possibly time-varying) networks, the distributed subgradient algorithm was proposed in [23]. This algorithm and its variants [24], [25] are intuitive and simple but usually slow due to the diminishing step-size that is needed to obtain a consensual and optimal solution, even if the objective functions are differentiable and strongly convex. With a fixed step-size, these distributed methods can be fast, but they only converge to a neighborhood of the solution set whose size depends on the step-size. This phenomenon creates an exactness-speed dilemma [26].

A class of distributed approaches that bypass this dilemma is based on introducing the Lagrangian dual. The resulting algorithms include distributed dual decomposition [27] and the decentralized alternating direction method of multipliers (ADMM) [28]. The decentralized ADMM and its proximal-gradient variant can employ a fixed step-size to achieve an O(1/k) rate under general convexity assumptions [29]–[31]. Under the strong convexity assumption, the decentralized ADMM has linear convergence for time-invariant undirected graphs [32]. There exist some other distributed methods that do not (explicitly) use dual variables but can still converge to an optimal consensual solution with fixed step-sizes. In particular, the works in [33], [34] employ multi-consensus inner loops, Nesterov’s acceleration, and/or the adapt-then-combine (ATC) strategy. Under the assumption that the objectives have bounded and Lipschitz continuous gradients^2, the algorithm proposed in [34] has an O(ln(k)/k²) rate. References [1], [36] use a difference structure to cancel the steady-state error in decentralized gradient descent [23], [26], thereby developing the algorithm EXTRA and its proximal-gradient variant PG-EXTRA. EXTRA converges at an O(1/k) rate when the objective function in (1) is convex and has a linear convergence rate when the objective function is strongly convex and r_i(x) = 0 for all i.

^2 This means that the non-smooth terms r_i’s are absent. Such an assumption is much stronger than the one used for achieving the O(1/k²) rate in Nesterov’s optimal gradient method [35].

A number of recent works employed the so-called gradient tracking [37] to conquer different issues [38]–[42]. To be specific, the works [38], [42] relax the step-size rule to allow uncoordinated step-sizes across agents. Paper [39] solves non-convex optimization problems. Paper [41] aims at achieving geometric convergence over time-varying graphs. Work [40] improves the convergence rate over EXTRA, and its formulation is the same as that in [42].

Another topic of interest is decentralized optimization over directed graphs [41], [43]–[46], which is beyond the scope of this paper.

B. Proposed Algorithm and its Advantages

To proceed, let us introduce some basic notation first. Agent i holds a local variable x_i ∈ R^p, and we denote x_i^k as its value at the k-th iteration. Then, we introduce a new function that is the average of all the local functions with local variables as

f(x) := (1/n) ∑_{i=1}^n ( s_i(x_i) + r_i(x_i) ),    (2)

where

x := [ x_1ᵀ; x_2ᵀ; · · · ; x_nᵀ ] ∈ R^{n×p},    (3)

i.e., the matrix whose i-th row is x_iᵀ (here and below, semicolons separate the rows of a matrix).

If all local variables are identical, i.e., x_1 = · · · = x_n, we say that x is consensual. In addition, we define

s(x) := (1/n) ∑_{i=1}^n s_i(x_i),    r(x) := (1/n) ∑_{i=1}^n r_i(x_i).    (4)

We have f(x) = s(x) + r(x). The gradient of s at x is given in the same way as x in (3) by

∇s(x) := [ (∇s_1(x_1))ᵀ; (∇s_2(x_2))ᵀ; · · · ; (∇s_n(x_n))ᵀ ] ∈ R^{n×p},    (5)

i.e., the matrix whose i-th row is (∇s_i(x_i))ᵀ.

By making a simple modification over PG-EXTRA [1], our proposed algorithm brings a big improvement in the speed and the dependency of convergence over networks. To better expose this simple

^3 In the original EXTRA, two mixing matrices W and W̃ are used. For simplicity, we let W̃ = (I + W)/2 here.


modification, let us compare a special case of our proposed algorithm with EXTRA for the smooth case, i.e., r(x) = 0.

(EXTRA^3)  x^{k+2} = ((I + W)/2)(2x^{k+1} − x^k) − α∇s(x^{k+1}) + α∇s(x^k),    (6a)

(Proposed NIDS)  x^{k+2} = ((I + W)/2)( 2x^{k+1} − x^k − α∇s(x^{k+1}) + α∇s(x^k) ).    (6b)

Here, W ∈ R^{n×n} is a matrix that represents information exchange between neighboring agents (more details about this matrix are in Assumption 1) and α is the step-size. The only difference between EXTRA and the proposed algorithm is the information exchanged between the agents. EXTRA exchanges only the estimations 2x^{k+1} − x^k, while the proposed algorithm exchanges the gradient-adapted estimations, i.e., 2x^{k+1} − x^k − α∇s(x^{k+1}) + α∇s(x^k). Because of this small modification, the proposed algorithm has a Network InDependent Step-size, which will be explained later. Therefore we name the proposed algorithm NIDS and will use this abbreviation throughout the paper. For the non-smooth case, a more detailed comparison between PG-EXTRA and NIDS will be given in Section III.
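To see updates (6a) and (6b) in action, the following sketch runs both on a toy 3-agent path graph with quadratic local objectives s_i(x) = (x − b_i)²/2, so L = 1 and the consensual minimizer is the average of the b_i. The graph, mixing matrix, and step-sizes are our own illustrative choices; here NIDS is run with α = 1, which already exceeds the EXTRA bound (5 + 3λ_n(W))/(4L) ≈ 0.875 for this W.

```python
import numpy as np

# 3-agent path graph: W = I - Laplacian/2, eigenvalues {1, 0.5, -0.5}
W = np.eye(3) - 0.5 * np.array([[ 1., -1.,  0.],
                                [-1.,  2., -1.],
                                [ 0., -1.,  1.]])
Wt = (np.eye(3) + W) / 2              # (I + W)/2 used by both updates
b = np.array([0., 3., 6.])            # s_i(x) = (x - b_i)^2/2, grad s_i = x - b_i
grad = lambda x: x - b

def run(alpha, nids, iters=300):
    # x^0 = 0 makes the initialization x^1 = x^0 - alpha*grad(x^0) valid
    # for both methods (since W @ 0 = 0).
    x_prev = np.zeros(3)
    x = x_prev - alpha * grad(x_prev)
    for _ in range(iters):
        gdiff = -alpha * grad(x) + alpha * grad(x_prev)
        if nids:    # (6b): the gradient correction is mixed as well
            x_new = Wt @ (2 * x - x_prev + gdiff)
        else:       # (6a), EXTRA: only the iterates are mixed
            x_new = Wt @ (2 * x - x_prev) + gdiff
        x_prev, x = x, x_new
    return x

print(run(0.8, nids=False))  # EXTRA, alpha below its bound 0.875: near [3, 3, 3]
print(run(1.0, nids=True))   # NIDS with a larger, network-independent alpha
```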

A large and network independent step-size for NIDS: All works mentioned above either employ pessimistic step-sizes or have network dependent upper bounds on step-sizes. Furthermore, the step-sizes for the strongly convex case are more conservative. For example, the step-size used to achieve linear convergence rates for EXTRA in [36], [48] is in the order of O(µ/L²), where µ and L are the strong convexity constant of s(x) and the Lipschitz constant of ∇s(x), respectively. As a contrast, centralized gradient descent can choose a step-size in the order of O(1/L). The upper bound of the step-size for EXTRA was recently improved to (5 + 3λ_n(W))/(4L) in [47]. We can choose 1/(2L) for any W satisfying Assumption 1. Another example of employing a constant step-size in distributed optimization is DIGing [41]. Although its ATC variant in [42] was shown to converge faster than DIGing, the step-size is still very conservative compared to O(1/L). We will show that the step-size of NIDS can have the same upper bound 2/L as that of centralized gradient descent. The achievable step-sizes of NIDS for the o(1/k) rate in the general convex case and the linear convergence rate in the strongly convex case are both in the order of O(1/L). Furthermore, NIDS allows each agent to have an individual step-size. Each agent i can choose a step-size α_i that is as large as 2/L_i on any connected network, where L_i is the Lipschitz constant of ∇s_i(x) and L_i ≤ L for all i (L = max_i L_i). Apart from the step-sizes, to run NIDS, a common/public parameter c is needed for the construction of W̃ (see (8) for the algorithm and W̃). This parameter c can be chosen without any knowledge of the network (or the mixing matrix W), for example, c = 1/(2 max_i α_i). Table I provides an overview of algorithmic configurations for EXTRA and NIDS. NIDS works as long as each agent can estimate its local functional parameter: no agent needs any other global information, including the number of agents in the whole network, except the largest step-size, if it is not the same for all agents.

In the line of research on optimization over heterogeneous networks, after the initial work [38] regarding uncoordinated step-sizes, references [49] and [50] introduce and analyze a diffusion strategy with corrections that achieves exact linear convergence with uncoordinated step-sizes. This exact diffusion algorithm is still related to the Lagrangian method but can be considered as having incorporated a CTA structure. The CTA strategy can, similar to the ATC strategy, improve the convergence speed of certain consensus optimization algorithms (see [41, Remark 3] and [46, Section II.C]). However, the analysis in [50], though allowing step-size mismatch across the network, does not take into consideration the heterogeneity of agents’ functional conditions. Furthermore, their upper bound for the step-size is in the order of O(µ/L²).

Sublinear convergence rate for the general case: Under the general convexity assumption, we show that NIDS has a convergence rate of o(1/k), which is slightly better than the O(1/k) rate of PG-EXTRA. Because the step-size of NIDS does not depend on the network topology and is much larger than that of PG-EXTRA, NIDS can be much faster than PG-EXTRA, as shown in the numerical experiments.

Linear convergence rate for the strongly convex case: Let us first define “scalability”. When the iterate x^k of an algorithm converges to the optimal solution x* linearly, i.e., ‖x^k − x*‖² = O((1 − 1/S)^k) with some positive constant S, we say that the algorithm needs to run O(S log(1/ε)) iterations to reach ε-accuracy, and we call O(S) the scalability of the algorithm.

TABLE I
Summary of algorithmic parameters used in EXTRA and NIDS. The step-size bound on α for EXTRA comes from reference [47], which improves that given in reference [36].

Method                    | α or α_i                          | c
EXTRA (λ_n(W) given)      | α < (5 + 3λ_n(W))/(4 max_i L_i)   | –
EXTRA (λ_n(W) not given)  | α < 1/(2 max_i L_i)               | –
NIDS (λ_n(W) given)       | α_i < 2/L_i                       | c ≤ 1/((1 − λ_n(W)) max_i α_i)
NIDS (λ_n(W) not given)   | α_i < 2/L_i                       | c ≤ 1/(2 max_i α_i)

For the case where the non-smooth terms are absent and the functions {s_i}_{i=1}^n are strongly convex, we show that NIDS achieves a linear convergence rate whose dependencies on the functions {s_i}_{i=1}^n and the network topology are decoupled. To be specific, to reach ε-accuracy, the number of iterations needed for NIDS is

O( max( L/µ, (1 − λ_n(W))/(1 − λ_2(W)) ) ) log(1/ε),

where λ_i(W) is the i-th largest eigenvalue of W. Both L/µ and (1 − λ_n(W))/(1 − λ_2(W)) are typical in the literature of optimization and average consensus, respectively. The value L/µ, also called the condition number of the objective function, is aligned with the scalability of standard gradient descent [35]. The value (1 − λ_n(W))/(1 − λ_2(W)) is understood as the condition number^4 of the network and is aligned with the scalability of the simplest linear iterations for distributed averaging [51].

Separating the condition numbers of the objective function and the network provides a way to determine the bottleneck of NIDS for a specific problem and a given network. Therefore, the system designer might be able to smartly apply preconditioning on {s_i}_{i=1}^n or improve the connectivity of the network to cost-effectively obtain a better convergence.
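As an illustration of the network condition number, the snippet below computes (1 − λ_n(W))/(1 − λ_2(W)) for a ring graph with W = I − τŁ (τ = 1/4 is our own illustrative choice); the value grows roughly like n², matching the O(n²) worst-case scaling discussed in this paper.

```python
import numpy as np

def ring_mixing(n, tau=0.25):
    """Mixing matrix W = I - tau*Laplacian for an n-agent ring graph."""
    L = 2.0 * np.eye(n)
    for i in range(n):
        L[i, (i + 1) % n] = L[i, (i - 1) % n] = -1.0
    return np.eye(n) - tau * L

def condition_number(W):
    lam = np.sort(np.linalg.eigvalsh(W))       # ascending eigenvalues
    return (1.0 - lam[0]) / (1.0 - lam[-2])    # (1 - lambda_n)/(1 - lambda_2)

for n in (8, 16, 32):
    print(n, condition_number(ring_mixing(n)))  # roughly quadruples as n doubles
```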

Summary and comparison of state-of-the-art algorithms: We list the properties of a few relevant algorithms in Table II. We let σ := (1 − λ_n(W))/(1 − λ_2(W)). This quantity is directly affected by the network topology and how the matrix W is defined, and thus it is also related to the consensus ability of a network. When the network is fully connected (a complete graph), we can choose W so that λ_2(W) = λ_n(W) = 0 and thus σ = 1 (the best case); in general σ ≥ 1 since 0 < 1 − λ_2(W) ≤ 1 − λ_n(W) < 2; in the worst case, we have σ ≤ 1/(1 − λ_2(W)) = O(n²) [4, Section 2.3]. We keep σ in the bounds/rates of the involved algorithms for a fair comparison instead of focusing on the worst case that often gives pessimistic/conservative results. We omit “O(·)” in “Bounds of step-sizes” and “Scalabilities” for brevity and only compare the effect of functional properties (µ and L) and network properties (σ and/or n). Before getting into details, let us clarify a few points. In EXTRA, µ_g is a quantity that is associated with the strong convexity of the original function f(x), so it covers a larger class of problems; in DIGing, µ̄ is the mean value of the strong convexity constants of the local objectives; in Acc-DNGD, the step-size for the convex case contains k, the current number of iterations, and thus it represents a diminishing step-size sequence; in Optimal [53], [54], the total number of iterations K is used to determine the step-size for the convex case. In addition, they apply to problems in which the objectives are dual friendly (see [54] for its definition). Note that some types of objectives are suitable for gradient updates, some are suitable for dual gradient updates (dual friendly), and some are suitable for proximal updates.

^4 When we choose W = I − τŁ, where Ł is the Laplacian of the underlying graph and τ is a positive tunable constant, we have (1 − λ_n(W))/(1 − λ_2(W)) = λ_1(Ł)/λ_{n−1}(Ł), which is the finite condition number of Ł. Note that λ_n(Ł) = 0.

Finding the algorithm with the lowest per-iteration cost depends on the problem (functions). Apparently, our bounds on step-sizes and the corresponding scalability/rate are better than those given in EXTRA and Harness (see Table II). When σ is close to 1 (the graph is well connected), the step-size bound and scalability given in DIGing are the same as for NIDS. However, when σ is large, their result becomes rather conservative. Acc-DNGD and Optimal have improved the scalability/rate of gradient-based distributed optimization by employing Nesterov’s acceleration technique on the primal and dual problems, respectively. For the convex case, our rate is worse than theirs because our algorithm does not employ Nesterov’s acceleration. For the primal distributed gradient method after acceleration [52], the scalability in σ is still worse than our result. Algorithm Optimal achieves the optimal scalability/rate for distributed optimization. However, as we have mentioned above, their algorithms are dual based and thus apply to a different class of problems. In addition, NIDS supports proximable non-smooth functions and uncoordinated step-sizes, while these have not been considered in Acc-DNGD and Optimal. To sum up, we have reached the best possible performance of first-order algorithms for distributed optimization without acceleration. Further improving the performance by incorporating Nesterov’s techniques into our algorithm will be a future direction.


TABLE II
Summary of a few relevant algorithms. Here, µ (or µ_g or µ̄) is the strong convexity constant of the objective function (or that of a modified objective function); L (or L̄) is the Lipschitz constant of the objective gradient (or its modified version). Also, σ = (1 − λ_n(W))/(1 − λ_2(W)) is considered as the condition number of the network, which scales at the order of O(n²) in the worst-case scenario. We omit “O(·)” in “Orders of step-size bounds” and “Scalability” for brevity. Note that quantities involving K only hold for a finite K.

Algorithm          | Prox. operators | Step-size bounds (strongly convex; convex)                 | Uncoordinated step-size | Scalability (strongly convex)                          | Rate (convex)
EXTRA [1], [36]    | Yes             | µ_g/L²; 1/L                                                | No                      | (L/µ_g)²                                               | o(1/k)
Aug-DGM [38]       | No              | small enough                                               | Yes                     | –                                                      | converges
Harness [40]       | No              | µ/(L²σ²); 1/(Lσ²)                                          | No                      | ((L/µ)σ)²                                              | O(1/k)
Acc-DNGD [52]      | No              | ((σ−1)³/(σ⁶L))(µ/L)^{3/7}; min{(1−σ⁻¹)², σ⁻³}/(Lk^{0.6})   | No                      | (σ³/(σ−1)^{1.5})(L/µ)^{5/7}                            | O(1/k^{1.4})
DIGing [41], [42]  | No              | min{µ̄^{0.5}/(σ(σ−1)L^{1.5}n^{0.5}), 1/L}; –               | Yes                     | max{L/µ̄, ((σ−1)²n^{0.5}L^{1.5} + σ²µ̄^{1.5})/µ̄^{1.5}} | –
Optimal [53], [54] | No              | µ; Lσ/K²                                                   | No                      | ((L/µ)σ)^{0.5}                                         | O(1/K²)
NIDS               | Yes             | 1/L; 1/L                                                   | Yes                     | max{L/µ, σ}                                            | o(1/k)

Finally, we note that references [49], [50], appearing simultaneously with this work, also proposed (6b) to enlarge the step-size, and they use column stochastic matrices rather than symmetric doubly stochastic matrices. However, their algorithm only works for smooth problems, and their analysis seems to be restrictive and requires twice differentiability and strong convexity of {s_i}_{i=1}^n. The step-size is also in the order of µ/L² [48].

C. Future Works

The capability of our algorithm of using purely locally determined parameters increases its potential to be extended to dynamic networks with a time-varying number of nodes. Given such flexibility, we may use similar schemes to solve decentralized empirical risk minimization problems. Furthermore, it also enhances the privacy of the agents by allowing each agent to perform its own optimization procedure without negotiation on any parameter.

By using Nesterov’s acceleration technique, reference [4] shows that the scalability of a new average consensus protocol can be improved to O(n); when the non-smooth terms r_i’s are absent, reference [53] shows that the scalability of a new dual-based accelerated distributed gradient method can be improved to O(√(σL/µ)). One of our future works is exploring the convergence rates/scalability of the Nesterov-accelerated version of our algorithm.

D. Paper Organization

The rest of this paper is organized as follows. To facilitate the description of the technical ideas, the algorithms, and the analysis, we introduce additional notation in Subsection I-E. The intuition for the network-independent step-size is provided in Section II. In Section III, we introduce our algorithm NIDS and discuss its relation to some other existing algorithms. In Section IV, we first show that NIDS can be understood as an iterative algorithm for seeking a fixed point. Following this, we establish that NIDS converges at an o(1/k) rate for the general convex case and a linear rate for the strongly convex case. Then, numerical simulations are given in Section V to corroborate our theoretical claims. Final remarks are given in Section VI.

E. Notation

We use bold upper-case letters such as W to denote matrices in R^{n×n} and bold lower-case letters such as x and z to denote matrices in R^{n×p} (when p = 1, they are vectors). Let 1 and 0 be matrices with all ones and zeros, respectively; their dimensions are provided when necessary. For matrices x, y ∈ R^{n×p}, we define


their inner product as ⟨x, y⟩ = tr(xᵀy) and the norm as ‖x‖ = √⟨x, x⟩. Additionally, by an abuse of notation, we define ⟨x, y⟩_Q = tr(xᵀQy) and ‖x‖²_Q = ⟨x, x⟩_Q for any given symmetric matrix Q ∈ R^{n×n}. Note that ⟨·, ·⟩_Q is an inner product defined on R^{n×p} if and only if Q is positive definite. However, when Q is not positive definite, ⟨x, y⟩_Q can still be an inner product defined on a subspace of R^{n×p}; see Lemma 3 for more details. We define the range of A ∈ R^{n×n} by range(A) := {x ∈ R^{n×p} : x = Ay, y ∈ R^{n×p}}. The largest eigenvalue of a symmetric matrix A is also denoted as λ_max(A). For two symmetric matrices A, B ∈ R^{n×n}, A ≻ B (or A ⪰ B) means that A − B is positive definite (or positive semidefinite). Moreover, we use N_i to represent the set of agents that can directly send messages to agent i.

II. INTUITION FOR NETWORK-INDEPENDENT STEP-SIZE

In this section, we provide an intuition for the network-independent step-size of NIDS with only the differentiable function s. The decentralized optimization problem is equivalent to

minimize_x  s(x),    s.t.  (I − W)^{1/2} x = 0,

where (I − W)^{1/2} is the square root of I − W, and the constraint is the same as the consensual condition with the mixing matrix W given in Assumption 1. Denote Ł = (I − W)^{1/2}. The corresponding optimality condition, with the introduction of the dual variable p, is

[ 0, Ł; −Ł, 0 ] [ x*; p* ] = − [ ∇s(x*); 0 ].

EXTRA is equivalent to the Condat–Vu primal-dual algorithm [47], [55], and it can be further explained as a forward-backward splitting applied to this equation, i.e.,

[ [ (1/α)I, −Ł; −Ł, 2αI ] + [ 0, Ł; −Ł, 0 ] ] [ x^{k+1}; p^{k+1} ] = [ (1/α)I, −Ł; −Ł, 2αI ] [ x^k; p^k ] − [ ∇s(x^k); 0 ].

The update is

x^{k+1} = x^k − αŁp^k − α∇s(x^k),
αp^{k+1} = αp^k − (1/2)Łx^k + Łx^{k+1}.

It is equivalent to EXTRA after p is eliminated. In this case, the new metric is a full matrix, and therefore, the upper bound of the step-size α depends on the matrix Ł. To be more specific,

[ (1/α)I, −Ł; −Ł, 2αI ] ⪰ [ (L/2)I, 0; 0, 0 ],

which gives α ≤ 2(1 + λ_min(W))/L. A larger and optimal upper bound for the step-size of EXTRA is shown in [47] (see Table I), and it still depends on W. However, we choose a block diagonal metric and have

[ [ (1/α)I, 0; 0, α(I + W) ] + [ 0, Ł; −Ł, 0 ] ] [ x^{k+1}; p^{k+1} ] = [ (1/α)I, 0; 0, α(I + W) ] [ x^k; p^k ] − [ ∇s(x^k); 0 ].

The update becomes

2αp^{k+1} = α(I + W)p^k + Łx^k − αŁ∇s(x^k),
x^{k+1} = x^k − α∇s(x^k) − αŁp^{k+1},

which is equivalent to NIDS after p is eliminated. Because the new metric is block diagonal, the nonexpansiveness of the forward step depends on the function only, i.e., α ≤ 2/L.
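The block-diagonal-metric iteration above can be simulated directly; the following sketch (our own toy setup: a 3-agent path graph, quadratic s_i with L = 1, and α = 1.5 < 2/L) runs the (x, p) updates without eliminating p and converges to the consensual minimizer.

```python
import numpy as np

W = np.eye(3) - 0.5 * np.array([[ 1., -1.,  0.],
                                [-1.,  2., -1.],
                                [ 0., -1.,  1.]])
# Lroot = (I - W)^{1/2} via eigendecomposition of the symmetric I - W
d, U = np.linalg.eigh(np.eye(3) - W)
Lroot = U @ np.diag(np.sqrt(np.maximum(d, 0.0))) @ U.T

b = np.array([1., 2., 6.])     # s_i(x) = (x - b_i)^2/2, so grad s(x) = x - b
alpha = 1.5                    # below the network-independent bound 2/L = 2
x, p = np.zeros(3), np.zeros(3)
for _ in range(2000):
    g = x - b
    # 2*alpha*p^{k+1} = alpha*(I+W)p^k + Lroot x^k - alpha*Lroot grad s(x^k)
    p = 0.5 * ((np.eye(3) + W) @ p + Lroot @ x / alpha - Lroot @ g)
    # x^{k+1} = x^k - alpha*grad s(x^k) - alpha*Lroot p^{k+1}
    x = x - alpha * g - alpha * Lroot @ p

print(x)  # all entries approach mean(b) = 3
```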

III. PROPOSED ALGORITHM NIDS

In this section, we describe our proposed NIDS in Algorithm 1 for solving (1) in more detail and explain the connections to other related methods.

Algorithm 1 NIDS
Each agent i obtains its mixing values w_ij, ∀j ∈ N_i;
Each agent i chooses its own step-size α_i > 0 and the same parameter c (e.g., c = 0.5/max_i α_i);
Each agent i sets the new mixing values w̃_ij := cα_i w_ij, ∀j ∈ N_i, and w̃_ii := 1 − cα_i + cα_i w_ii;
Each agent i picks an arbitrary initial x_i^0 ∈ R^p and performs

    z_i^1 = x_i^0 − α_i ∇s_i(x_i^0),
    x_i^1 = arg min_{x∈R^p} α_i r_i(x) + (1/2)‖x − z_i^1‖².

for k = 1, 2, 3, . . . do
  Each agent i performs

    z_i^{k+1} = z_i^k − x_i^k + ∑_{j∈N_i∪{i}} w̃_ij ( 2x_j^k − x_j^{k−1} − α_j ∇s_j(x_j^k) + α_j ∇s_j(x_j^{k−1}) ),
    x_i^{k+1} = arg min_{x∈R^p} α_i r_i(x) + (1/2)‖x − z_i^{k+1}‖².

end for
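A compact sketch of Algorithm 1 on a toy instance of our own choosing: a 3-agent path graph, s_i(x) = ‖x − b_i‖²/2 (so L_i = 1), and r_i = ρ‖·‖₁, whose proximal mapping is soft-thresholding; the step-sizes are deliberately uncoordinated. For this instance the exact solution is the soft-thresholding of the average of the b_i.

```python
import numpy as np

W = np.eye(3) - 0.5 * np.array([[ 1., -1.,  0.],
                                [-1.,  2., -1.],
                                [ 0., -1.,  1.]])
b = np.array([[1.0, -2.0], [3.0, 0.5], [2.0, -1.0]])   # row i: data of agent i
rho = 0.5                                              # r_i = rho*||.||_1 for every agent
grad = lambda x: x - b                                 # row i: grad s_i(x_i), L_i = 1

alpha = np.array([1.0, 1.2, 0.8])                      # uncoordinated, each < 2/L_i = 2
c = 0.5 / alpha.max()                                  # the common parameter c
Wt = np.eye(3) - c * np.diag(alpha) @ (np.eye(3) - W)  # tilde(W) = I - c*Lambda*(I - W)

soft = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
prox = lambda z: soft(z, (alpha * rho)[:, None])       # row i: prox of alpha_i * r_i

x_prev = np.zeros((3, 2))                              # x^0
z = x_prev - alpha[:, None] * grad(x_prev)             # z^1
x = prox(z)                                            # x^1
for _ in range(5000):
    z = z - x + Wt @ (2 * x - x_prev
                      - alpha[:, None] * grad(x)
                      + alpha[:, None] * grad(x_prev))
    x_prev, x = x, prox(z)

print(x)  # every row approaches soft(mean(b), rho), i.e. roughly [1.5, -0.3333]
```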

The mixing matrix satisfies the following assumption, which comes from [1], [36].

Assumption 1 (Mixing matrix): The connected network G = {V, E} consists of a set of agents V = {1, 2, · · · , n} and a set of undirected edges E. An undirected edge (i, j) ∈ E means that there is a connection between agents i and j and that both agents can exchange data. The mixing matrix W = [w_ij] ∈ R^{n×n} satisfies:

1) (Decentralized property) If i ≠ j and (i, j) ∉ E, then w_ij = 0;
2) (Symmetry) W = Wᵀ;
3) (Null space property) Null(I − W) = span(1_{n×1});
4) (Spectral property) 2I ⪰ W + I ≻ 0_{n×n}.

Remark 1: Assumption 1 implies that the eigenvalues of W lie in (−1, 1] and that the multiplicity of the eigenvalue 1 is one, i.e., 1 = λ_1(W) > λ_2(W) ≥ · · · ≥ λ_n(W) > −1. Item 3 of Assumption 1 shows that (I − W)1_{n×1} = 0 and that the orthogonal complement of span(1_{n×1}) is the row space of I − W, which is also the column space of I − W because of the symmetry of W.

The functions {s_i}_{i=1}^n and {r_i}_{i=1}^n satisfy the following assumption.

Assumption 2: The functions {s_i(x)}_{i=1}^n and {r_i(x)}_{i=1}^n are lower semi-continuous proper convex, and {s_i(x)}_{i=1}^n have Lipschitz continuous gradients with constants {L_i}_{i=1}^n, respectively. Thus, we have

⟨x − y, ∇s(x) − ∇s(y)⟩ ≥ ‖∇s(x) − ∇s(y)‖²_{L⁻¹},    (7)

where L = Diag(L_1, · · · , L_n) is the diagonal matrix with the Lipschitz constants [35].

Instead of using the same step-size for all the agents, we allow agent $i$ to choose its own step-size $\alpha_i$ and let $\Lambda = \mathrm{Diag}(\alpha_1, \cdots, \alpha_n) \in \mathbb{R}^{n\times n}$. Then NIDS can be expressed as
$$z^{k+1} = z^k - x^k + \widetilde{W}\big(2x^k - x^{k-1} - \Lambda\nabla s(x^k) + \Lambda\nabla s(x^{k-1})\big), \qquad (8a)$$
$$x^{k+1} = \arg\min_{x\in\mathbb{R}^{n\times p}}\ r(x) + \frac{1}{2}\|x - z^{k+1}\|^2_{\Lambda^{-1}}, \qquad (8b)$$
where $\widetilde{W} = I - c\Lambda(I - W)$ and $c$ is chosen such that $\Lambda^{-1/2}\widetilde{W}\Lambda^{1/2} = I - c\Lambda^{1/2}(I - W)\Lambda^{1/2} \succeq 0$. This condition shows that the upper bound on the parameter $c$ depends on $W$ and $\Lambda$. When the information about $W$ is not given, we can simply let $c = 1/(2\max_i \alpha_i)$ because $\lambda_n(W) > -1$. Setting this parameter requires a preprocessing step to obtain the maximum. However, since the maximum can easily be computed in a connected network in at most $n-1$ rounds of communication, wherein each node repeatedly takes the maximum of the values from its neighbors, the cost of this preprocessing is essentially negligible compared to the worst-case running time of our optimization protocol.
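The max-consensus preprocessing can be sketched as follows. This is a minimal illustration: the 4-node path graph and the step-size values are hypothetical, not part of the experiments reported later.

```python
# Sketch of max-consensus preprocessing: each agent repeatedly replaces
# its value with the maximum over its closed neighborhood. On a connected
# n-node graph, every agent holds the global maximum after n - 1 rounds.

def max_consensus(values, neighbors):
    """values: list of local scalars (e.g., step-sizes alpha_i);
    neighbors: dict mapping node -> list of neighbor nodes."""
    n = len(values)
    v = list(values)
    for _ in range(n - 1):  # worst case: path graph of diameter n - 1
        v = [max([v[i]] + [v[j] for j in neighbors[i]]) for i in range(n)]
    return v

# Hypothetical example: a path graph 0-1-2-3 with uncoordinated step-sizes.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
alphas = [0.5, 1.2, 0.8, 0.3]
print(max_consensus(alphas, neighbors))  # every agent ends with 1.2
```

Each value spreads one hop per round, which is why $n-1$ rounds suffice on any connected $n$-node graph.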

If all the agents choose the same step-size, i.e., $\Lambda = \alpha I$, and we let $c = 1/(2\alpha)$, then (8) becomes
$$z^{k+1} = z^k - x^k + \frac{I + W}{2}\big(2x^k - x^{k-1} - \alpha\nabla s(x^k) + \alpha\nabla s(x^{k-1})\big), \qquad (9a)$$
$$x^{k+1} = \arg\min_{x\in\mathbb{R}^{n\times p}}\ r(x) + \frac{1}{2\alpha}\|x - z^{k+1}\|^2. \qquad (9b)$$
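A minimal numerical sketch of iteration (9) in the smooth case ($r = 0$, so $x^{k+1} = z^{k+1}$) is given below. The 4-cycle mixing matrix and the local quadratics $s_i(x) = \frac{1}{2}\|x - b_i\|^2$ are illustrative assumptions made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 3
b = rng.standard_normal((n, p))      # local data: s_i(x) = 0.5*||x - b_i||^2
x_star = b.mean(axis=0)              # the consensus minimizer of (1)

# Symmetric doubly stochastic mixing matrix for a 4-cycle (Assumption 1).
W = np.array([[.50, .25, .00, .25],
              [.25, .50, .25, .00],
              [.00, .25, .50, .25],
              [.25, .00, .25, .50]])

grad = lambda x: x - b               # row i is the gradient of s_i at x_i
alpha = 1.0                          # alpha < 2/L_i = 2
W_tilde = (np.eye(n) + W) / 2        # c = 1/(2*alpha) gives (I + W)/2

x_prev = np.zeros((n, p))
x = x_prev - alpha * grad(x_prev)    # initialization from Algorithm 1
for _ in range(200):                 # iteration (9a); (9b) is identity here
    x, x_prev = W_tilde @ (2*x - x_prev - alpha*(grad(x) - grad(x_prev))), x
print(np.abs(x - x_star).max())      # every row converges to x_star
```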

Remark 2: The update of PG-EXTRA is
$$z^{k+1} = z^k - x^k + \frac{I + W}{2}\big(2x^k - x^{k-1}\big) - \alpha\nabla s(x^k) + \alpha\nabla s(x^{k-1}), \qquad (10a)$$
$$x^{k+1} = \arg\min_{x\in\mathbb{R}^{n\times p}}\ r(x) + \frac{1}{2\alpha}\|x - z^{k+1}\|^2. \qquad (10b)$$
The only difference between NIDS and PG-EXTRA is that in NIDS the mixing operation is also applied to the successive difference of the gradients $-\alpha\nabla s(x^k) + \alpha\nabla s(x^{k-1})$.

When there is no function $r(x)$, (8) becomes
$$x^{k+1} = \widetilde{W}\big(2x^k - x^{k-1} - \Lambda\nabla s(x^k) + \Lambda\nabla s(x^{k-1})\big),$$
and it further reduces to (6b) when $\Lambda = \alpha I$ and $c = 1/(2\alpha)$. Note that, although (6b) appears in [49], [50], its convergence still requires a small step-size that also depends on the network topology and the strong convexity constant. In Theorem 1 of [50], the upper bound on the step-size is also $O(\mu/L^2)$, which is the same as that of PG-EXTRA.

IV. CONVERGENCE ANALYSIS OF NIDS

In order to show the convergence of NIDS, we also need the following assumption.

Assumption 3 (Solution existence): Problem (1) has at least one solution.

To simplify the analysis, we introduce a new sequence $\{d^k\}_{k\geq 0}$ defined as
$$d^k := \Lambda^{-1}(x^{k-1} - z^k) - \nabla s(x^{k-1}). \qquad (11)$$

Using the sequence $\{x^k\}_{k\geq 0}$, we obtain a recursive (update) relation for $\{d^k\}_{k\geq 0}$:
$$\begin{aligned}
d^{k+1} &= \Lambda^{-1}(x^k - z^{k+1}) - \nabla s(x^k)\\
&= \Lambda^{-1}(x^k - z^k + x^k) - \nabla s(x^k) - \Lambda^{-1}\widetilde{W}\big(2x^k - x^{k-1} - \Lambda\nabla s(x^k) + \Lambda\nabla s(x^{k-1})\big)\\
&= \Lambda^{-1}(x^k - z^k + x^k - 2x^k + x^{k-1}) - \nabla s(x^k) + \nabla s(x^k) - \nabla s(x^{k-1})\\
&\quad + c(I - W)\big(2x^k - x^{k-1} - \Lambda\nabla s(x^k) + \Lambda\nabla s(x^{k-1})\big)\\
&= d^k + c(I - W)\big(2x^k - z^k - \Lambda\nabla s(x^k) - \Lambda d^k\big),
\end{aligned}$$


where the second equality comes from the update of $z^{k+1}$ in (8a) and the last one holds because of the definition of $d^k$ in (11). Therefore, the iteration (8) is equivalent to, with the update order $(x, d, z)$,

$$x^k = \arg\min_{x\in\mathbb{R}^{n\times p}}\ r(x) + \frac{1}{2}\|x - z^k\|^2_{\Lambda^{-1}}, \qquad (12a)$$
$$d^{k+1} = d^k + c(I - W)\big(2x^k - z^k - \Lambda\nabla s(x^k) - \Lambda d^k\big), \qquad (12b)$$
$$z^{k+1} = x^k - \Lambda\nabla s(x^k) - \Lambda d^{k+1}, \qquad (12c)$$
in the sense that both (8) and (12) generate the same sequence $\{x^k, z^k\}_{k>0}$.

Because $x^k$ is determined by $z^k$ only and can be eliminated from the iteration, iteration (12) is essentially an operator on $(d, z)$. Note that we have $d^1 = \Lambda^{-1}(x^0 - z^1) - \nabla s(x^0) = 0$ from Algorithm 1. Therefore, from the update of $d^{k+1}$ in (12b), $d^k \in \mathrm{range}(I - W)$ for all $k$. In fact, any $z^1$ such that $d^1 \in \mathrm{range}(I - W)$ works for NIDS. The following two lemmas show the relation between fixed points of (12) and optimal solutions of (1). The proofs of all lemmas and propositions are included in the supplementary material.

Lemma 1 (Fixed point of (12)): $(d^*, z^*)$ is a fixed point of (12) if and only if there exists a subgradient $q^* \in \partial r(x^*)$ such that $z^* = x^* + \Lambda q^*$ and
$$d^* + \nabla s(x^*) + q^* = 0, \qquad (13a)$$
$$(I - W)x^* = 0. \qquad (13b)$$

Lemma 2 (Optimality condition): $x^*$ is consensual with $x^*_1 = x^*_2 = \cdots = x^*_n = x^*$ being an optimal solution of problem (1) if and only if there exist $p^*$ and a subgradient $q^* \in \partial r(x^*)$ such that
$$(I - W)p^* + \nabla s(x^*) + q^* = 0, \qquad (14a)$$
$$(I - W)x^* = 0. \qquad (14b)$$
In addition, $(d^* = (I - W)p^*,\ z^* = x^* + \Lambda q^*)$ is a fixed point of iteration (12).

Lemma 2 shows that we can find a fixed point of iteration (12) to obtain an optimal solution of problem (1). It also tells us that we need $d^* \in \mathrm{range}(I - W)$ to obtain an optimal solution of problem (1). Therefore, we need $d^1 \in \mathrm{range}(I - W)$.

Lemma 3 (Norm over the range space): For any symmetric positive semidefinite matrix $A \in \mathbb{R}^{n\times n}$ with rank $r \leq n$, let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_r > 0$ be its $r$ nonzero eigenvalues. Then $\mathrm{range}(A)$, defined in Section I-E, is an $rp$-dimensional subspace of $\mathbb{R}^{n\times p}$ and has a norm defined by $\|x\|^2_{A^{\dagger}} := \langle x, A^{\dagger}x\rangle$, where $A^{\dagger}$ is the pseudo-inverse of $A$. In addition, $\lambda_1^{-1}\|x\|^2 \leq \|x\|^2_{A^{\dagger}} \leq \lambda_r^{-1}\|x\|^2$ for all $x \in \mathrm{range}(A)$.
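Lemma 3 can be checked numerically on a small example; the 4-cycle mixing matrix used to build $A = I - W$ below is hypothetical.

```python
import numpy as np

# Numeric check of Lemma 3 for A = I - W (4-cycle mixing matrix):
# lam_1^{-1} ||x||^2 <= ||x||_{A^dagger}^2 <= lam_r^{-1} ||x||^2
# for every x in range(A).
W = np.array([[.50, .25, .00, .25],
              [.25, .50, .25, .00],
              [.00, .25, .50, .25],
              [.25, .00, .25, .50]])
A = np.eye(4) - W
A_pinv = np.linalg.pinv(A)

lam = np.sort(np.linalg.eigvalsh(A))     # ascending; one zero eigenvalue
lam1 = lam[-1]                           # largest eigenvalue
lamr = lam[lam > 1e-12][0]               # smallest nonzero eigenvalue

rng = np.random.default_rng(1)
x = A @ rng.standard_normal(4)           # a vector in range(A)
nA = x @ A_pinv @ x                      # ||x||_{A^dagger}^2
print((x @ x) / lam1, nA, (x @ x) / lamr)  # nA sits between the bounds
```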

Proposition 1: Let $M = c^{-1}(I - W)^{\dagger} - \Lambda$ with $I \succeq c\Lambda^{1/2}(I - W)\Lambda^{1/2} \succeq 0$. Then $\|\cdot\|_M$ is a norm defined on $\mathrm{range}(I - W)$.

The following lemma compares the distance to a fixedpoint of (12) for two consecutive iterates.

Lemma 4 (Fundamental inequality): Let $(d^*, z^*)$ be a fixed point of iteration (12) with $d^* \in \mathrm{range}(I - W)$. The update $(d^k, z^k) \Rightarrow (d^{k+1}, z^{k+1})$ in (12) satisfies
$$\begin{aligned}
&\|z^{k+1} - z^*\|^2_{\Lambda^{-1}} + \|d^{k+1} - d^*\|^2_M\\
\leq\ &\|z^k - z^*\|^2_{\Lambda^{-1}} + \|d^k - d^*\|^2_M - \|z^k - z^{k+1}\|^2_{\Lambda^{-1}} - \|d^k - d^{k+1}\|^2_M\\
&+ 2\langle\nabla s(x^k) - \nabla s(x^*),\ z^k - z^{k+1}\rangle - 2\langle x^k - x^*,\ \nabla s(x^k) - \nabla s(x^*)\rangle. \qquad (15)
\end{aligned}$$

Proof: From the update of $z^{k+1}$ in (12c), we have
$$\begin{aligned}
&\langle d^{k+1} - d^*,\ z^{k+1} - z^k + x^k - x^*\rangle\\
=\ &\langle d^{k+1} - d^*,\ 2x^k - z^k - \Lambda\nabla s(x^k) - \Lambda d^{k+1} - x^*\rangle\\
=\ &\langle d^{k+1} - d^*,\ c^{-1}(I - W)^{\dagger}(d^{k+1} - d^k) + \Lambda d^k - \Lambda d^{k+1}\rangle\\
=\ &\langle d^{k+1} - d^*,\ d^{k+1} - d^k\rangle_M, \qquad (16)
\end{aligned}$$
where the second equality comes from (12b), (14b), and $d^{k+1} - d^* \in \mathrm{range}(I - W)$. From (12a), we have
$$\langle x^k - x^*,\ z^k - x^k - z^* + x^*\rangle_{\Lambda^{-1}} \geq 0. \qquad (17)$$
Therefore, we have
$$\begin{aligned}
&\langle x^k - x^*,\ \nabla s(x^k) - \nabla s(x^*)\rangle\\
\leq\ &\langle x^k - x^*,\ \Lambda^{-1}(z^k - x^k - z^* + x^*) + \nabla s(x^k) - \nabla s(x^*)\rangle\\
=\ &\langle x^k - x^*,\ \Lambda^{-1}(z^k - z^{k+1}) - d^{k+1} + d^*\rangle\\
=\ &\langle x^k - x^*,\ z^k - z^{k+1}\rangle_{\Lambda^{-1}} + \langle d^{k+1} - d^*,\ z^{k+1} - z^k\rangle - \langle d^{k+1} - d^*,\ d^{k+1} - d^k\rangle_M\\
=\ &\langle \Lambda^{-1}(x^k - x^*) - d^{k+1} + d^*,\ z^k - z^{k+1}\rangle - \langle d^{k+1} - d^*,\ d^{k+1} - d^k\rangle_M\\
=\ &\langle \Lambda^{-1}(z^{k+1} - z^*) + \nabla s(x^k) - \nabla s(x^*),\ z^k - z^{k+1}\rangle - \langle d^{k+1} - d^*,\ d^{k+1} - d^k\rangle_M\\
=\ &\langle z^{k+1} - z^*,\ z^k - z^{k+1}\rangle_{\Lambda^{-1}} + \langle\nabla s(x^k) - \nabla s(x^*),\ z^k - z^{k+1}\rangle + \langle d^{k+1} - d^*,\ d^k - d^{k+1}\rangle_M.
\end{aligned}$$
The inequality and the second equality come from (17) and (16), respectively. The first and fourth equalities hold

Page 9: JOURNAL OF TSP LATEX CLASS FILES, VOL. X, NO. X, SEP 2018 1 … · 2019-06-19 · JOURNAL OF TSP LATEX CLASS FILES, VOL.X, NO. X, SEP 2018 1 A Decentralized Proximal-Gradient Method

JOURNAL OF TSP LATEX CLASS FILES, VOL. X, NO. X, SEP 2018 9

because of the update of $z^{k+1}$ in (12c). Using $2\langle a, b\rangle = \|a + b\|^2 - \|a\|^2 - \|b\|^2$ and rearranging the previous inequality give us
$$\begin{aligned}
&2\langle x^k - x^*,\ \nabla s(x^k) - \nabla s(x^*)\rangle - 2\langle\nabla s(x^k) - \nabla s(x^*),\ z^k - z^{k+1}\rangle\\
\leq\ &2\langle z^{k+1} - z^*,\ z^k - z^{k+1}\rangle_{\Lambda^{-1}} + 2\langle d^{k+1} - d^*,\ d^k - d^{k+1}\rangle_M\\
=\ &\|z^k - z^*\|^2_{\Lambda^{-1}} - \|z^{k+1} - z^*\|^2_{\Lambda^{-1}} - \|z^k - z^{k+1}\|^2_{\Lambda^{-1}}\\
&+ \|d^k - d^*\|^2_M - \|d^{k+1} - d^*\|^2_M - \|d^k - d^{k+1}\|^2_M.
\end{aligned}$$
Therefore, (15) is obtained.

A. Sublinear convergence of NIDS

As explained in Section II, NIDS is equivalent to the primal-dual algorithm [56] applied to the problem
$$\underset{x}{\text{minimize}}\ \ s(x) + r(x) + \iota\big((I - W)^{1/2}x\big), \qquad (18)$$
where $\iota(\cdot)$ is the indicator function, which returns $0$ at $0$ and $+\infty$ otherwise, with the metric matrix being
$$\begin{bmatrix} \Lambda^{-1} & 0\\ 0 & c^{-1}I - (I - W)^{1/2}\Lambda(I - W)^{1/2} \end{bmatrix}.$$

We apply [56, Theorem 1] and obtain the following sublinear convergence result.

Theorem 1 (Sublinear rate): Let $(d^k, z^k)$ be the sequence generated by NIDS in (12) with $\alpha_i < 2/L_i$ for all $i$ and $I \succeq c\Lambda^{1/2}(I - W)\Lambda^{1/2}$. We have
$$\|z^k - z^{k+1}\|^2_{\Lambda^{-1}} + \|d^k - d^{k+1}\|^2_M \leq \frac{\|z^1 - z^*\|^2_{\Lambda^{-1}} + \|d^1 - d^*\|^2_M}{k\big(1 - \max_i \frac{\alpha_i L_i}{2}\big)}, \qquad (19)$$
$$\|z^k - z^{k+1}\|^2_{\Lambda^{-1}} + \|d^k - d^{k+1}\|^2_M = o\Big(\frac{1}{k+1}\Big).$$
Furthermore, $(d^k, z^k)$ converges to a fixed point $(d, z)$ of iteration (12) with $d \in \mathrm{range}(I - W)$, if $I \succ c\Lambda^{1/2}(I - W)\Lambda^{1/2}$.

Remark 3: Note that the convergence in Theorem 1 is shown in $z$ and $d$. We will show the convergence in terms of (14). Recall that
$$z^{k+1} - z^k = x^k - \Lambda\nabla s(x^k) - \Lambda d^{k+1} - z^k = -\Lambda\big(d^{k+1} + \nabla s(x^k) + q^k\big),$$
where $q^k \in \partial r(x^k)$. Therefore, $\|z^{k+1} - z^k\|^2_{\Lambda^{-1}} \to 0$ implies the convergence in terms of (14a).

Combining (12b) and (12c), we have
$$d^{k+1} = d^k + c(I - W)(x^k - z^k + z^{k+1}) + c(I - W)\Lambda(d^{k+1} - d^k).$$
Rearranging it gives
$$\big(I - c(I - W)\Lambda\big)(d^{k+1} - d^k) = c(I - W)(x^k - z^k + z^{k+1}).$$
Then we have
$$\|c(I - W)(x^k - z^k + z^{k+1})\|^2 = \|c(I - W)M^{1/2}M^{1/2}(d^{k+1} - d^k)\|^2 \leq \|c(I - W)M^{1/2}\|^2\,\|d^{k+1} - d^k\|^2_M,$$
where the equality holds because $d^{k+1} - d^k \in \mathrm{range}(I - W)$. Thus $\|z^{k+1} - z^k\|^2_{\Lambda^{-1}} + \|d^{k+1} - d^k\|^2_M \to 0$ implies the convergence in terms of (14b).

B. Linear convergence for special cases

In this subsection, we provide the linear convergence rate for the case when $r(x) = 0$, i.e., $z^k = x^k$ in NIDS.

Theorem 2: If $\{s_i(x)\}_{i=1}^n$ are strongly convex with parameters $\{\mu_i\}_{i=1}^n$, then
$$\langle x - y,\ \nabla s(x) - \nabla s(y)\rangle \geq \|x - y\|^2_S, \qquad (20)$$
where $S = \mathrm{Diag}(\mu_1, \cdots, \mu_n) \in \mathbb{R}^{n\times n}$. Let $(d^k, x^k)$ be the sequence generated by NIDS with $\alpha_i < 2/L_i$ for all $i$ and $I \succeq c\Lambda^{1/2}(I - W)\Lambda^{1/2}$. We define
$$\rho = \max\Big(1 - \big(2 - \max_i(\alpha_i L_i)\big)\min_i(\mu_i\alpha_i),\ \ 1 - \frac{c}{\lambda_{\max}\big(\Lambda^{-1/2}(I - W)^{\dagger}\Lambda^{-1/2}\big)}\Big), \qquad (21)$$
and have
$$\|x^{k+1} - x^*\|^2_{\Lambda^{-1}} + \|d^{k+1} - d^*\|^2_{M+\Lambda} \leq \rho\big(\|x^k - x^*\|^2_{\Lambda^{-1}} + \|d^k - d^*\|^2_{M+\Lambda}\big). \qquad (22)$$

Proof: From (15), we have
$$\begin{aligned}
&\|x^{k+1} - x^*\|^2_{\Lambda^{-1}} + \|d^{k+1} - d^*\|^2_M\\
\leq\ &\|x^k - x^*\|^2_{\Lambda^{-1}} + \|d^k - d^*\|^2_M - \|x^k - x^{k+1}\|^2_{\Lambda^{-1}} - \|d^k - d^{k+1}\|^2_M \qquad (23)\\
&+ 2\langle\nabla s(x^k) - \nabla s(x^*),\ x^k - x^{k+1}\rangle - 2\langle x^k - x^*,\ \nabla s(x^k) - \nabla s(x^*)\rangle.
\end{aligned}$$


For the two inner product terms, we have
$$\begin{aligned}
&2\langle\nabla s(x^k) - \nabla s(x^*),\ x^k - x^{k+1}\rangle - 2\langle x^k - x^*,\ \nabla s(x^k) - \nabla s(x^*)\rangle\\
=\ &-\|x^k - x^{k+1} - \Lambda\nabla s(x^k) + \Lambda\nabla s(x^*)\|^2_{\Lambda^{-1}} + \|x^k - x^{k+1}\|^2_{\Lambda^{-1}} + \|\nabla s(x^k) - \nabla s(x^*)\|^2_{\Lambda}\\
&- 2\langle x^k - x^*,\ \nabla s(x^k) - \nabla s(x^*)\rangle\\
\leq\ &-\|d^{k+1} - d^*\|^2_{\Lambda} + \|x^k - x^{k+1}\|^2_{\Lambda^{-1}} + \|\nabla s(x^k) - \nabla s(x^*)\|^2_{\Lambda}\\
&- \max_i(\alpha_i L_i)\|\nabla s(x^k) - \nabla s(x^*)\|^2_{L^{-1}} - \big(2 - \max_i(\alpha_i L_i)\big)\|x^k - x^*\|^2_S\\
\leq\ &-\|d^{k+1} - d^*\|^2_{\Lambda} + \|x^k - x^{k+1}\|^2_{\Lambda^{-1}} - \big(2 - \max_i(\alpha_i L_i)\big)\min_i(\mu_i\alpha_i)\|x^k - x^*\|^2_{\Lambda^{-1}}. \qquad (24)
\end{aligned}$$
The first inequality comes from $x^{k+1} = x^k - \Lambda\nabla s(x^k) - \Lambda d^{k+1}$, $\nabla s(x^*) + d^* = 0$, (7), and (20).

Combining (23) and (24), we have
$$\begin{aligned}
&\|x^{k+1} - x^*\|^2_{\Lambda^{-1}} + \|d^{k+1} - d^*\|^2_M\\
\leq\ &\|x^k - x^*\|^2_{\Lambda^{-1}} + \|d^k - d^*\|^2_M - \|d^{k+1} - d^*\|^2_{\Lambda} - \big(2 - \max_i(\alpha_i L_i)\big)\min_i(\mu_i\alpha_i)\|x^k - x^*\|^2_{\Lambda^{-1}}.
\end{aligned}$$
Therefore,
$$\begin{aligned}
&\|x^{k+1} - x^*\|^2_{\Lambda^{-1}} + \|d^{k+1} - d^*\|^2_{M+\Lambda}\\
\leq\ &\Big(1 - \big(2 - \max_i(\alpha_i L_i)\big)\min_i(\mu_i\alpha_i)\Big)\|x^k - x^*\|^2_{\Lambda^{-1}} + \frac{\lambda_{\max}(\Lambda^{-1/2}M\Lambda^{-1/2})}{\lambda_{\max}(\Lambda^{-1/2}M\Lambda^{-1/2}) + 1}\|d^k - d^*\|^2_{M+\Lambda}.
\end{aligned}$$
Since
$$\frac{\lambda_{\max}(\Lambda^{-1/2}M\Lambda^{-1/2})}{\lambda_{\max}(\Lambda^{-1/2}M\Lambda^{-1/2}) + 1} = 1 - \frac{c}{\lambda_{\max}\big(\Lambda^{-1/2}(I - W)^{\dagger}\Lambda^{-1/2}\big)},$$
letting $\rho$ be defined as in (21) yields (22).

Remark 4: The condition $I \succeq c\Lambda^{1/2}(I - W)\Lambda^{1/2}$

implies that $c \leq \lambda_{n-1}\big(\Lambda^{-1/2}(I - W)^{\dagger}\Lambda^{-1/2}\big)$.

• If the agents across the whole network use an identical step-size $\alpha$, that is, $\Lambda = \alpha I$, then
$$\rho = \max\Big(1 - \big(2 - \alpha\max_i L_i\big)\alpha\min_i \mu_i,\ \ 1 - \frac{c\alpha}{\lambda_{\max}\big((I - W)^{\dagger}\big)}\Big).$$
A concise but informative expression of the rate, $\rho = \max\big(1 - \frac{\min_i \mu_i}{\max_i L_i},\ \frac{\lambda_2(W) - \lambda_n(W)}{1 - \lambda_n(W)}\big)$, can be obtained when we specifically choose $\alpha = \frac{1}{\max_i L_i}$ and $c = \frac{1}{(1 - \lambda_n(W))\alpha}$. When $\lambda_n(W)$ is not given, we choose $c = 1/(2\alpha)$ and obtain the scalability $\max\big(\frac{\max_i L_i}{\min_i \mu_i},\ \frac{2}{1 - \lambda_2(W)}\big)$. In this case, the network impact and the functional impact are decoupled.

• If we let $\Lambda = L^{-1}$ and $c = \lambda_{n-1}\big(L^{1/2}(I - W)^{\dagger}L^{1/2}\big)$, then the rate becomes
$$\rho = \max\Big(1 - \min_i \frac{\mu_i}{L_i},\ \ 1 - \frac{\lambda_{n-1}\big(L^{1/2}(I - W)^{\dagger}L^{1/2}\big)}{\lambda_{\max}\big(L^{1/2}(I - W)^{\dagger}L^{1/2}\big)}\Big).$$
When $\lambda_n(W)$ is not given, we choose $c = 1/(2\max_i \alpha_i) = \min_i L_i/2$ and obtain the scalability $\max\big(\max_i \frac{L_i}{\mu_i},\ \frac{\max_i L_i}{\min_i L_i}\cdot\frac{2}{1 - \lambda_2(W)}\big)$. In this case, the network impact is coupled with the function factors, i.e., the smoothness heterogeneity $\frac{\max_i L_i}{\min_i L_i}$ multiplies the network impact, while the other term depends on the functional condition numbers $\frac{L_i}{\mu_i}$ only.
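The agreement between the general rate (21) and the simplified identical-step-size expression can be verified numerically; the 4-cycle mixing matrix and the constants $L_i$, $\mu_i$ below are illustrative assumptions, not values from the experiments.

```python
import numpy as np

# Hypothetical mixing matrix for a 4-cycle and per-agent constants.
W = np.array([[.50, .25, .00, .25],
              [.25, .50, .25, .00],
              [.00, .25, .50, .25],
              [.25, .00, .25, .50]])
L  = np.array([1.0, 2.0, 1.5, 1.2])   # gradient Lipschitz constants
mu = np.array([0.5, 0.4, 0.3, 0.6])   # strong convexity constants

eig = np.sort(np.linalg.eigvalsh(W))[::-1]   # 1 = lam_1 > ... > lam_n > -1
lam2, lamn = eig[1], eig[-1]

alpha = 1 / L.max()                   # identical step-size, Remark 4
c = 1 / ((1 - lamn) * alpha)
n = len(L)

# General rate (21) with Lambda = alpha * I.
P = np.linalg.pinv(np.eye(n) - W)
H = P / alpha                         # Lambda^{-1/2} (I - W)^dagger Lambda^{-1/2}
rho = max(1 - (2 - (alpha * L).max()) * (mu * alpha).min(),
          1 - c / np.linalg.eigvalsh(H).max())

# Simplified expression from the first bullet.
rho_simple = max(1 - mu.min() / L.max(), (lam2 - lamn) / (1 - lamn))
print(rho, rho_simple)                # the two expressions agree
```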

Remark 5: Theorem 2 separates the dependence of the linear convergence rate on the functions and on the network structure. In the current scheme, all the agents perform one round of information exchange and one proximal-gradient step in each iteration. If the proximal-gradient step is expensive, this explicit rate formula can help us decide whether the so-called multi-step consensus can help reduce the computational time.

For the sake of simplicity, let us assume for the moment that all the agents have the same strong convexity constant $\mu$ and gradient Lipschitz continuity constant $L$. Suppose that the "$t$-step consensus" technique is employed, i.e., the mixing matrix $W$ in our algorithm is replaced by $W^t$, where $t$ is a positive integer. Then, to reach $\varepsilon$-accuracy, the number of iterations needed is
$$O\bigg(\max\Big(\frac{L}{\mu},\ \frac{1 - \lambda_n(W^t)}{1 - \lambda_2(W^t)}\Big)\bigg)\log\frac{1}{\varepsilon}.$$

When $L/\mu = 1$ and the step-sizes are chosen as $\Lambda = L^{-1}$, this says that we should let $t \to +\infty$ if the graph is not a complete graph. This theoretical result matches intuition: in this case, centralized gradient descent needs only one step to reach the optimum, so the bottleneck in decentralized optimization is the network.

Suppose $t_{\max}$ is a reasonable upper bound on $t$, which is set by the system designer. It is difficult to find an optimal $t$ explicitly. But with the above analysis as evidence, we suggest that one choose $t = \min\big(\big[\log_{\lambda_2(W)}\big(1 - \frac{\mu}{L}\big)\big],\ t_{\max}\big)$ if $1 - \frac{\mu}{L} < \lambda_2(W)$; otherwise $t = 1$. Here $[\cdot]$ gives the nearest integer.
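This heuristic can be sketched as follows, with the nearest-integer rounding clamped to $[1, t_{\max}]$; the clamp automatically yields $t = 1$ when the function term already dominates. The numeric inputs are hypothetical.

```python
import math

def suggest_t(lam2, mu_over_L, t_max):
    """Heuristic number of consensus steps per iteration: balance the
    network term lam2**t against the function term 1 - mu/L, then clamp
    to [1, t_max] (the clamp returns 1 when the log is below one)."""
    t = round(math.log(1 - mu_over_L) / math.log(lam2))
    return max(1, min(t, t_max))

print(suggest_t(0.99, 0.5, 100))   # poorly connected graph: t = 69
print(suggest_t(0.50, 0.1, 100))   # well connected graph: t = 1
```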


If the bottleneck is on the functions, we can introduce a mapping $x = By$ and change the unknown variable from $x$ to $y$. For example, if the function $s_i(x)$ is the composition of a convex function with a linear mapping, replacing $x$ by $y$ changes the linear mapping and hence the condition number $L/\mu$ of the function. When $B$ is diagonal, this is similar to the column normalization used in machine learning applications. There are other possible ways to reduce the condition number of the functions; they are beyond the scope of this work, and we leave them as future work.

V. NUMERICAL EXPERIMENTS

In this section, we compare the performance of NIDS with several state-of-the-art algorithms for decentralized optimization. These methods are:
• EXTRA/PG-EXTRA (see (10));
• DIGing-ATC [42]. For reference, the DIGing-ATC updates are
$$x^{k+1} = W(x^k - \alpha y^k),$$
$$y^{k+1} = W\big(y^k + \nabla f(x^{k+1}) - \nabla f(x^k)\big);$$
• The accelerated distributed Nesterov gradient descent (Acc-DNGD-SC in [52]);
• The (dual friendly) optimal algorithm (OA) for distributed optimization (equation (7) in [54]).

Note that there are two rounds of communication in each iteration of DIGing-ATC and Acc-DNGD-SC, while there is only one round in each iteration of EXTRA/NIDS/OA. For all the experiments, we first compute the exact solution $x^*$ of (1) using the centralized (proximal) gradient descent. All networks are randomly generated with connectivity ratio $\tau$, defined as the number of actual edges divided by the total number of possible edges $\frac{n(n-1)}{2}$. We report the specific $\tau$ used in each test. The mixing matrix $W$ is always chosen with the Metropolis rule (see [57] and [36, Section 2.4]).
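For reference, a minimal sketch of the Metropolis rule (edge weights $w_{ij} = 1/(1 + \max(d_i, d_j))$, with the diagonal filled so that rows sum to one); the 5-node graph below is hypothetical.

```python
import numpy as np

def metropolis(adj):
    """Metropolis mixing matrix for an undirected graph.
    adj: symmetric 0/1 adjacency matrix without self-loops."""
    n = len(adj)
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()   # make row i sum to one
    return W

# A hypothetical connected 5-node graph: a star plus one extra edge.
adj = np.zeros((5, 5), dtype=int)
for i, j in [(0, 1), (0, 2), (0, 3), (0, 4), (3, 4)]:
    adj[i, j] = adj[j, i] = 1
W = metropolis(adj)
eig = np.linalg.eigvalsh(W)
print(W.sum(axis=1))                 # rows sum to 1 (symmetric stochastic)
print(eig.min(), eig.max())          # eigenvalues lie in (-1, 1]
```

The resulting matrix satisfies Assumption 1: it is symmetric, its rows sum to one, and its eigenvalues lie in $(-1, 1]$ with eigenvalue $1$ simple on a connected graph.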

The experiments are carried out in MATLAB R2016b running on a laptop with an Intel i7 CPU @ 2.60 GHz, 16.0 GB of RAM, and the Windows 10 operating system. The source code for reproducing the numerical results can be accessed at https://github.com/mingyan08/NIDS.

A. The strongly convex case with $r(x) = 0$

Consider the decentralized problem that solves for an unknown signal $x \in \mathbb{R}^p$. Each agent $i \in \{1, \cdots, n\}$ takes its own measurement via $y_i = M_i x + e_i$, where $y_i \in \mathbb{R}^{m_i}$ is the measurement vector, $M_i \in \mathbb{R}^{m_i\times p}$ is the sensing matrix, and $e_i \in \mathbb{R}^{m_i}$ is independent and identically distributed noise. To estimate $x$ collaboratively, we apply the decentralized algorithms to solve
$$\underset{x}{\text{minimize}}\ \ \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\|M_i x - y_i\|^2.$$

To ensure that each function $\frac{1}{2}\|M_i x - y_i\|^2$ is strongly convex, we choose $m_i = 60$ and $p = 50$ and set the number of nodes to $n = 40$. For the first experiment, we choose $M_i$ such that the Lipschitz constant of $\nabla s_i$ satisfies $L_i = 1$ and the strong convexity constant satisfies $\mu_i = 0.5$ for all $i$. Based on Remark 4, we choose $\alpha = 1/(\max_i L_i) = 1$ and $c = 1/((1 - \lambda_n(W))\alpha)$ for NIDS. In addition, we choose $c = 1/2$ so that $\widetilde{W} = \frac{I + W}{2}$, which is the same mixing as that of EXTRA.

The comparison of these methods (NIDS with $c = 1/((1 - \lambda_n(W))\alpha)$, NIDS with $c = 1/2$, EXTRA, DIGing-ATC, Acc-DNGD-SC, and OA) is shown in Fig. 1 for two different networks with connectivity ratios $\tau = 0.35$ (top) and $\tau = 0.45$ (bottom), respectively. NIDS performs better with both choices of $c$ (corresponding to known and unknown $W$) than the other algorithms. NIDS with $c = 1/((1 - \lambda_n(W))\alpha)$ always takes less than half the number of iterations used by EXTRA to reach the same accuracy. In our experiments, DIGing-ATC appears to be sensitive to the network. Under a better connected network (see Fig. 1, bottom), DIGing-ATC can catch up with NIDS with $c = 1/(2\alpha)$. The theoretical step-size of Acc-DNGD-SC is too small due to a very small constant in the bound in [52], and the convergence of Acc-DNGD-SC under this theoretical step-size is slow and uncompetitive in our test. Thus we carefully tuned its step-size. With the hand-optimized step-size, Acc-DNGD-SC achieves performance comparable to NIDS with $c = 1/(2\alpha)$. In the plots, we observe that OA is fast in terms of the number of iterations. However, in this case, the per-iteration cost of OA is relatively high, since it requires solving a system of linear equations at each iteration (though factorization tricks may be used to save some computational time).

Next, we demonstrate the effect of uncoordinated/adaptive step-sizes. We construct the functions with $\mu_i = 0.02$ and $L_i = 1$ for each node $i$. Then we change the $L_i$ values by multiplying the functions by constants. We use the same mixing matrix for the following two experiments.

• We change half of the nodes. We randomly pick an even-numbered node and multiply its function by 4. For the remaining even-numbered nodes, we multiply their functions by a random integer, 2 or 3.


Fig. 1. Relative error $\frac{\|x - x^*\|}{\|x^*\|}$ against the number of iterations for two different networks (top: $\tau = 0.35$; bottom: $\tau = 0.45$). NIDS, EXTRA, and DIGing-ATC use the same step-size $\alpha = 1/L$, where $L = \max_i L_i$. The step-size for Acc-DNGD-SC is hand-optimized. We use the default step-sizes for OA as suggested by the authors.

Fig. 2. The relative error $\frac{\|x - x^*\|}{\|x^*\|}$ against the number of iterations. NIDS-1/L uses the same step-size $1/L$, where $L = \max_i L_i$, and NIDS-adaptive uses the step-size $1/L_i$ for each node. We assume that no graph information is available, thus $c = 1/(2\max_i \alpha_i)$. The connectivity ratio of the network is set as $\tau = 0.1$.

• We change a quarter of the nodes. We randomly pick a node not in $U = \{4, 8, 16, \ldots, 40\}$ and multiply its function by 10. Then, for the nodes in $U$, we multiply their functions by random integers between 2 and 9.

We compare NIDS with adaptive step-sizes ($1/L_i$ for node $i$) and NIDS with the same step-size $1/\max_i L_i$ in Fig. 2. We let $c = 1/(2\max_i \alpha_i)$, so no network information is needed. As shown in Fig. 2, NIDS with adaptive step-sizes converges faster than NIDS with the same step-size.

B. The case with nonsmooth function $r(x)$

In this subsection, we compare the performance of NIDS only with PG-EXTRA [1], since the other methods in Section V-A, such as DIGing, cannot be applied to this nonsmooth case. We consider a decentralized compressed sensing problem. Again, each agent $i \in \{1, \cdots, n\}$ takes its own measurement via $y_i = M_i x + e_i$, where $y_i \in \mathbb{R}^{m_i}$ is the measurement vector, $M_i \in \mathbb{R}^{m_i\times p}$ is the sensing matrix, and $e_i \in \mathbb{R}^{m_i}$ is independent and identically distributed noise. Here, $x$ is a sparse signal. The optimization problem is
$$\underset{x}{\text{minimize}}\ \ \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\|M_i x - y_i\|^2 + \frac{1}{n}\sum_{i=1}^n \lambda_i\|x\|_1,$$
where the connectivity ratio of the network is $\tau = 0.1$. We normalize the problem to make sure that the Lipschitz constant satisfies $L_i = 1$ for each node; we choose $m_i = 3$ and $p = 200$ and set the number of nodes to $n = 40$.
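For this choice of $r_i = \lambda_i\|\cdot\|_1$, the $x$-update (8b) reduces to coordinate-wise soft-thresholding; a minimal sketch:

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau*||.||_1:
    argmin_x tau*||x||_1 + 0.5*||x - z||^2, applied coordinate-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

# Entries with magnitude below tau are zeroed; the rest shrink toward 0.
z = np.array([1.5, -0.2, 0.7, -2.0])
print(soft_threshold(z, 0.5))
```

In iteration (8b), each agent $i$ would apply `soft_threshold` to $z_i^{k+1}$ with threshold $\alpha_i\lambda_i$.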

Fig. 3 shows that a larger step-size in NIDS leads to faster convergence. With step-size 1, NIDS and PG-EXTRA converge at the same speed. But if we keep increasing the step-size, PG-EXTRA diverges with step-size 1.4, while the step-size of NIDS can be increased to 1.9 while maintaining convergence at a faster speed.

C. An application to classification of healthcare data

We consider a decentralized sparse logistic regression problem to classify the colon-cancer data [58]. There are 62 samples, and each sample features 2,000 pieces of gene expression information (numericalized and normalized [58]) and a binary outcome. The outcome can be normal/negative (+1) or tumor/positive (−1), and the data set contains 22 normal and 40 tumor colon tissue samples. We store the gene information in $M_i \in \mathbb{R}^{1\times 2001}$ (one more dimension is augmented to take care of the linear offset/constant in the logit function model), and the outcome information is $y_i \in \{-1, 1\}$, $i \in S_1 \cup S_2$, where $S_1$ serves for training while $S_2$ serves for testing. Suppose we have a 50-node connected


Fig. 3. The relative error $\frac{\|x - x^*\|}{\|x^*\|}$ against the number of iterations. Different step-sizes for PG-EXTRA and NIDS are considered (NIDS-1/L, NIDS-1.5/L, NIDS-1.9/L, PGEXTRA-1/L, PGEXTRA-1.2/L, PGEXTRA-1.3/L, PGEXTRA-1.4/L). For instance, "NIDS-1/L" is NIDS using the same step-size $1/L$ across the network of agents, where $L = \max_i L_i$. The connectivity ratio of the network is $\tau = 0.4$.

network where each node $i$ holds one sample $(M_i, y_i)$ (the connected network is randomly generated with its connectivity ratio set to 0.08; the 50 in-network samples indexed by $S_1$ are randomly drawn from the 62 samples). The decentralized logistic regression problem
$$\underset{x}{\text{minimize}}\ \ \frac{1}{|S_1|}\sum_{i\in S_1} \ln\big(1 + \exp(-M_i x_i y_i)\big) + \frac{1}{|S_1|}\sum_{i\in S_1} \lambda_i\|x_i\|^2_2 + \frac{1}{|S_1|}\sum_{i\in S_1} \lambda_i\|x_i\|_1$$
is then solved over the network to train a sparse linear classifier $x^*$ for the outcome prediction of the remaining/future samples. In the optimization formulation, the $\ell_2$-norm term imposes strong convexity on $s$, while the $\ell_1$ term promotes sparsity of the solution. Aside from the 50 samples used for training, we randomly select 12 nodes from the 50 nodes to show the prediction performance on the remaining 12 samples in Fig. 4 (left). The middle and right plots in Fig. 4 show how the consensus error $\|x^k\|_{I-W}$ and the sparsity of the average solution vector $\frac{1}{50}\sum_{i=1}^{50} x_i^k$ drop, respectively.

VI. CONCLUSION

We proposed a novel decentralized consensus algorithm, NIDS, whose step-size does not depend on the network structure. In NIDS, the step-size depends only on the objective function, and it can be as large as $2/L$, where $L$ is the Lipschitz constant of the gradient of the smooth function. We showed that NIDS converges at an $o(1/k)$ rate in the general convex case and at a linear rate in the strongly convex case. For the strongly convex case, we separated the linear convergence rate's dependence on the objective function and on the network. The separated convergence rates match the typical rates for general gradient descent and for consensus averaging. Furthermore, every agent in the network can choose its own step-size independently based on its own objective function. Numerical experiments validated the theoretical results and demonstrated better performance of NIDS over state-of-the-art algorithms. Because the step-size of NIDS does not depend on the network structure, there are many possible future extensions. One extension is to apply NIDS to dynamic networks where nodes can join and drop off.

ACKNOWLEDGEMENT

We thank the anonymous reviewers for helpful com-ments and suggestions to improve the clarity of thispaper.

REFERENCES

[1] W. Shi, Q. Ling, G. Wu, and W. Yin, "A proximal gradient algorithm for decentralized composite optimization," IEEE Transactions on Signal Processing, vol. 63, no. 22, pp. 6013–6023, 2015.

[2] L. Xiao, S. Boyd, and S. Kim, "Distributed average consensus with least-mean-square deviation," Journal of Parallel and Distributed Computing, vol. 67, no. 1, pp. 33–46, 2007.

[3] K. Cai and H. Ishii, "Average consensus on arbitrary strongly connected digraphs with time-varying topologies," IEEE Transactions on Automatic Control, vol. 59, no. 4, pp. 1066–1071, 2014.

[4] A. Olshevsky, "Linear time average consensus and distributed optimization on fixed graphs," SIAM Journal on Control and Optimization, vol. 55, no. 6, pp. 3990–4014, 2017.

[5] J. Bazerque and G. Giannakis, "Distributed spectrum sensing for cognitive radio networks by exploiting sparsity," IEEE Transactions on Signal Processing, vol. 58, pp. 1847–1862, 2010.

[6] W. Ren, "Consensus based formation control strategies for multi-vehicle systems," in Proceedings of the American Control Conference, 2006, pp. 4237–4242.

[7] A. Olshevsky, "Efficient Information Aggregation Strategies for Distributed Control and Signal Processing," Ph.D. dissertation, Massachusetts Institute of Technology, 2010.

[8] S. Ram, V. Veeravalli, and A. Nedic, "Distributed non-autonomous power control through distributed convex optimization," in INFOCOM. IEEE, 2009, pp. 3001–3005.

[9] L. Gan, U. Topcu, and S. Low, "Optimal decentralized protocol for electric vehicle charging," IEEE Transactions on Power Systems, vol. 28, no. 2, pp. 940–951, 2013.

[10] M. Rabbat and R. Nowak, "Distributed optimization in sensor networks," in Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks. ACM, 2004, pp. 20–27.

[11] P. Forero, A. Cano, and G. Giannakis, "Consensus-based distributed support vector machines," Journal of Machine Learning Research, vol. 59, pp. 1663–1707, 2010.

[12] A. Nedic, A. Olshevsky, and C. A. Uribe, "Fast convergence rates for distributed non-Bayesian learning," IEEE Transactions on Automatic Control, vol. 62, no. 11, pp. 5538–5553, 2017.


Fig. 4. Performance of NIDS for sparse logistic regression. Left: number of correct predictions vs. iteration. Middle: consensus error $\|x^k\|_{I-W}$ vs. iteration. Right: number of non-zero elements in $\frac{1}{50}\sum_{i=1}^{50} x_i^k$ vs. iteration.

[13] D. Bertsekas, "Distributed asynchronous computation of fixed points," Mathematical Programming, vol. 27, no. 1, pp. 107–120, 1983.

[14] J. Tsitsiklis, D. Bertsekas, and M. Athans, "Distributed asynchronous deterministic and stochastic gradient optimization algorithms," IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803–812, 1986.

[15] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.

[16] V. Cevher, S. Becker, and M. Schmidt, "Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 32–43, 2014.

[17] A. Nedic and D. Bertsekas, "Convergence rate of incremental subgradient algorithms," in Stochastic Optimization: Algorithms and Applications. Springer, 2001, pp. 223–264.

[18] A. Nedic and D. Bertsekas, "Incremental subgradient methods for nondifferentiable optimization," SIAM Journal on Optimization, vol. 12, no. 1, pp. 109–138, 2001.

[19] A. Nedic, D. P. Bertsekas, and V. Borkar, "Distributed asynchronous incremental subgradient methods," in Proceedings of the March 2000 Haifa Workshop "Inherently Parallel Algorithms in Feasibility and Optimization and Their Applications". Elsevier, Amsterdam, 2001.

[20] S. Ram, A. Nedic, and V. Veeravalli, "Incremental stochastic subgradient algorithms for convex optimization," SIAM Journal on Optimization, vol. 20, no. 2, pp. 691–717, 2009.

[21] D. P. Bertsekas, "Incremental proximal methods for large scale convex optimization," Mathematical Programming, vol. 129, pp. 163–195, 2011.

[22] M. Wang and D. P. Bertsekas, "Incremental constraint projection-proximal methods for nonsmooth convex optimization," Lab. for Information and Decision Systems Report LIDS-P-2907, MIT, July 2013.

[23] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Transactions on Automatic Control, vol. 54, pp. 48–61, 2009.

[24] S. S. Ram, A. Nedic, and V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516–545, 2010.

[25] A. Nedic, "Asynchronous broadcast-based convex optimization over a network," IEEE Transactions on Automatic Control, vol. 56, no. 6, pp. 1337–1351, 2011.

[26] K. Yuan, Q. Ling, and W. Yin, "On the convergence of decentralized gradient descent," SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.

[27] H. Terelius, U. Topcu, and R. Murray, "Decentralized multi-agent optimization via dual decomposition," IFAC Proceedings Volumes, vol. 44, no. 1, pp. 11245–11251, 2011.

[28] D. P. Bertsekas and J. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, 2nd ed. Nashua: Athena Scientific, 1997.

[29] E. Wei and A. Ozdaglar, "On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers," in 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 2013, pp. 551–554.

[30] T.-H. Chang, M. Hong, and X. Wang, "Multi-agent distributed optimization via inexact consensus ADMM," IEEE Transactions on Signal Processing, vol. 63, no. 2, pp. 482–497, 2015.

[31] M. Hong and T.-H. Chang, "Stochastic proximal gradient consensus over random networks," IEEE Transactions on Signal Processing, vol. 65, no. 11, pp. 2933–2948, 2017.

[32] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, "On the linear convergence of the ADMM in decentralized consensus optimization," IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.

[33] A. Chen and A. Ozdaglar, "A fast distributed proximal-gradient method," in the 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2012, pp. 601–608.

[34] D. Jakovetic, J. Xavier, and J. Moura, "Fast distributed gradient methods," IEEE Transactions on Automatic Control, vol. 59, pp. 1131–1146, 2014.

[35] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013, vol. 87.

[36] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.

[37] M. Zhu and S. Martinez, "Discrete-time dynamic average consensus," Automatica, vol. 46, no. 2, pp. 322–329, 2010.

[38] J. Xu, S. Zhu, Y. Soh, and L. Xie, "Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes," in Proceedings of the 54th IEEE Conference on Decision and Control (CDC), 2015, pp. 2055–2060.

[39] P. Di Lorenzo and G. Scutari, "NEXT: In-network nonconvex optimization," IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.

[40] G. Qu and N. Li, "Harnessing smoothness to accelerate distributed optimization," in 2016 IEEE 55th Conference on Decision and Control (CDC). IEEE, 2016, pp. 159–166.

[41] A. Nedic, A. Olshevsky, and W. Shi, "Achieving geometric convergence for distributed optimization over time-varying graphs,"

Page 15: JOURNAL OF TSP LATEX CLASS FILES, VOL. X, NO. X, SEP 2018 1 … · 2019-06-19 · JOURNAL OF TSP LATEX CLASS FILES, VOL.X, NO. X, SEP 2018 1 A Decentralized Proximal-Gradient Method

JOURNAL OF TSP LATEX CLASS FILES, VOL. X, NO. X, SEP 2018 15

SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633,2017.

[42] A. Nedic, A. Olshevsky, W. Shi, and C. A. Uribe, “Geometrically convergent distributed optimization with uncoordinated step-sizes,” in American Control Conference (ACC), 2017. IEEE, 2017, pp. 3950–3955.

[43] A. Nedic and A. Olshevsky, “Distributed optimization over time-varying directed graphs,” in The 52nd IEEE Annual Conference on Decision and Control, 2013, pp. 6855–6860.

[44] C. Xi and U. Khan, “On the linear convergence of distributed optimization over directed graphs,” arXiv preprint arXiv:1510.02149, 2015.

[45] J. Zeng and W. Yin, “ExtraPush for convex smooth decentralized optimization over directed networks,” Journal of Computational Mathematics, Special Issue on Compressed Sensing, Optimization, and Structured Solutions, vol. 35, no. 4, pp. 381–394, 2017.

[46] Y. Sun, G. Scutari, and D. Palomar, “Distributed nonconvex multiagent optimization over time-varying networks,” in 2016 50th Asilomar Conference on Signals, Systems and Computers. IEEE, 2016, pp. 788–794.

[47] Z. Li and M. Yan, “A primal-dual algorithm with optimal step-sizes and its application in decentralized consensus optimization,” arXiv preprint arXiv:1711.06785, 2017.

[48] S. A. Alghunaim and A. H. Sayed, “Linear convergence of primal-dual gradient methods and their performance in distributed optimization,” arXiv e-prints, p. arXiv:1904.01196, Apr 2019.

[49] K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for distributed optimization and learning - Part I: Algorithm development,” IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 708–723, 2019.

[50] ——, “Exact diffusion for distributed optimization and learning - Part II: Convergence analysis,” IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 724–739, 2019.

[51] A. Nedic, A. Olshevsky, A. Ozdaglar, and J. N. Tsitsiklis, “On distributed averaging algorithms and quantization effects,” IEEE Transactions on Automatic Control, vol. 54, no. 11, pp. 2506–2517, 2009.

[52] G. Qu and N. Li, “Accelerated distributed Nesterov gradient descent for convex and smooth functions,” in Decision and Control (CDC), 2017 IEEE 56th Annual Conference on. IEEE, 2017, pp. 2260–2267.

[53] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié, “Optimal algorithms for smooth and strongly convex distributed optimization in networks,” in International Conference on Machine Learning, 2017, pp. 3027–3036.

[54] C. A. Uribe, S. Lee, A. Gasnikov, and A. Nedic, “Optimal algorithms for distributed optimization,” arXiv preprint arXiv:1712.00232, 2017.

[55] T. Wu, K. Yuan, Q. Ling, W. Yin, and A. H. Sayed, “Decentralized consensus optimization with asynchrony and delays,” IEEE Transactions on Signal and Information Processing over Networks, vol. 4, no. 2, pp. 293–307, 2018.

[56] M. Yan, “A new primal-dual algorithm for minimizing the sum of three functions with a linear operator,” Journal of Scientific Computing, to appear, 2018.

[57] S. Boyd, P. Diaconis, and L. Xiao, “Fastest mixing Markov chain on a graph,” SIAM Review, vol. 46, no. 4, pp. 667–689, 2004.

[58] J. Liu, J. Chen, and J. Ye, “Large-scale sparse logistic regression,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009, pp. 547–556.

[59] D. Davis and W. Yin, “Convergence rate analysis of several splitting schemes,” in Splitting Methods in Communication, Imaging, Science, and Engineering. Springer, 2016, pp. 115–163.

Zhi Li received the B.S. and M.S. degrees in Applied Mathematics from China University of Petroleum, Shandong, China, in 2007 and 2010, respectively. He then participated in a cooperative education program and received the M.S. degree in Applied Science from Saint Mary's University, Halifax, NS, Canada, in 2012. After being awarded the Hong Kong Ph.D. Fellowship, he went to Hong Kong Baptist University, HK, China, where he received his Ph.D. in Applied Mathematics in 2016; he was supported by the fellowship from 2013 to 2016. Since 2016 he has been a postdoctoral researcher with the Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI.


Wei Shi received the B.E. degree in automation and the Ph.D. degree in control science and engineering from the University of Science and Technology of China, Hefei, in 2010 and 2015, respectively. He was a Postdoctoral Researcher with the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL, USA, from 2015 to 2016, with Arizona State University, Tempe, AZ, USA, from 2016 to 2018, and with Princeton University from 2018 to 2019. His research interests span optimization, cyberphysical systems, and big data analytics. He was awarded the 2017 Young Author Best Paper Award from the IEEE Signal Processing Society.

Ming Yan is currently an Assistant Professor in the Department of Computational Mathematics, Science and Engineering and the Department of Mathematics, Michigan State University. He received the B.S. and M.S. degrees from the University of Science and Technology of China, and the Ph.D. degree from the University of California, Los Angeles (UCLA), in 2012. His research interests include optimization methods and their applications in sparse recovery and regularized inverse problems, variational methods for image processing, and parallel and distributed algorithms for solving big data problems.


L. ZHI, W. SHI, AND M. YAN 1

SUPPLEMENTARY MATERIAL FOR “A DECENTRALIZED PROXIMAL-GRADIENT METHOD WITH NETWORK INDEPENDENT STEP-SIZES AND SEPARATED CONVERGENCE RATES”

A. Proof of Lemma 1

Lemma 1 (Fixed point of (12)): (d^*, z^*) is a fixed point of (12) if and only if there exists a subgradient q^* ∈ ∂r(x^*) such that z^* = x^* + Λq^* and

d^* + ∇s(x^*) + q^* = 0,
(I − W)x^* = 0.

Proof: “⇒” If (d^*, z^*) is a fixed point of (12), we have

0 = c(I − W)(2x^* − z^* − Λ∇s(x^*) − Λd^*) = c(I − W)x^*,

where the two equalities come from (12b) and (12c), respectively. Combining (12c) and (12a) gives

0 = z^* − x^* + Λ∇s(x^*) + Λd^* = Λ(q^* + ∇s(x^*) + d^*),

where q^* ∈ ∂r(x^*).

“⇐” To show that (d^*, z^*) is a fixed point of iteration (12), we only need to verify that (d^{k+1}, z^{k+1}) = (d^*, z^*) whenever (d^k, z^k) = (d^*, z^*). From (12a), we have x^k = x^*; then

d^{k+1} = d^* + c(I − W)(2x^* − z^* − Λ∇s(x^*) − Λd^*) = d^* + c(I − W)x^* = d^*,
z^{k+1} = x^* − Λ∇s(x^*) − Λd^* = x^* + Λq^* = z^*.

Therefore, (d^*, z^*) is a fixed point of iteration (12).
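The fixed-point property can be illustrated numerically. The sketch below is not part of the original analysis: it assumes r ≡ 0 (so (12a) reduces to x^k = z^k and q^* = 0), hypothetical quadratic local objectives s_i(x) = (1/2)‖x − b_i‖^2, and a ring-graph Metropolis mixing matrix, with the updates transcribed from the proof above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 2

# Ring-graph Metropolis weights: W is symmetric and doubly stochastic,
# so I - W is positive semidefinite with null space spanned by the ones vector.
W = np.zeros((n, n))
for i in range(n):
    for j in ((i - 1) % n, (i + 1) % n):
        W[i, j] = 1.0 / 3.0
np.fill_diagonal(W, 1.0 - W.sum(axis=1))
I = np.eye(n)

# Hypothetical data: s_i(x) = 0.5*||x - b_i||^2, so row i of grad s(x) is x_i - b_i.
b = rng.standard_normal((n, p))
grad = lambda x: x - b

Lam = np.diag(rng.uniform(0.5, 1.0, n))   # uncoordinated step sizes Lambda
c = 0.9 / np.linalg.eigvalsh(np.sqrt(Lam) @ (I - W) @ np.sqrt(Lam)).max()

# Fixed point for r = 0: x* is the consensual average, q* = 0,
# d* = -grad s(x*) (its columns are orthogonal to the ones vector), z* = x*.
x_star = np.tile(b.mean(axis=0), (n, 1))
d_star, z_star = -grad(x_star), x_star

# One pass of iteration (12): x^k = z^k, then the d- and z-updates from the proof.
x = z_star
d_next = d_star + c * (I - W) @ (2 * x - z_star - Lam @ grad(x) - Lam @ d_star)
z_next = x - Lam @ grad(x) - Lam @ d_next

print(np.abs(d_next - d_star).max(), np.abs(z_next - z_star).max())  # both ~ 0
```

Both residuals vanish up to round-off, so (d^*, z^*) is left unchanged by one iteration, as the lemma asserts.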

B. Proof of Lemma 2

Lemma 2 (Optimality condition): x^* is consensual with x^*_1 = x^*_2 = · · · = x^*_n = x^* being an optimal solution of problem (1) if and only if there exist p^* and a subgradient q^* ∈ ∂r(x^*) such that

(I − W)p^* + ∇s(x^*) + q^* = 0,   (26a)
(I − W)x^* = 0.   (26b)

In addition, (d^* = (I − W)p^*, z^* = x^* + Λq^*) is a fixed point of iteration (12).

Proof: “⇒” Because x^* = 1_{n×1}(x^*)^⊤, we have (I − W)x^* = (I − W)1_{n×1}(x^*)^⊤ = 0_{n×1}(x^*)^⊤ = 0. The fact that x^* is an optimal solution of problem (1) means there exists q^* ∈ ∂r(x^*) such that (∇s(x^*) + q^*)^⊤ 1_{n×1} = 0; that is, all columns of ∇s(x^*) + q^* are orthogonal to 1_{n×1}. Therefore, Remark 1 shows the existence of p^* such that (I − W)p^* + ∇s(x^*) + q^* = 0.

“⇐” Equation (26b) shows that x^* is consensual because of item 3 of Assumption 1, i.e., x^* = 1_{n×1}(x^*)^⊤ for some x^*. From (26a), we have 0 = ((I − W)p^* + ∇s(x^*) + q^*)^⊤ 1_{n×1} = (p^*)^⊤(I − W)1_{n×1} + (∇s(x^*) + q^*)^⊤ 1_{n×1} = (∇s(x^*) + q^*)^⊤ 1_{n×1}. Thus, 0 ∈ ∑_{i=1}^{n} (∇s_i(x^*) + ∂r_i(x^*)) because x^* is consensual.

This completes the proof of the equivalence. Lemma 1 shows that (d^* = (I − W)p^*, z^* = x^* + Λq^*) is a fixed point of iteration (12).
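For the same hypothetical quadratic setup (r ≡ 0, so q^* = 0), both conditions in (26) can be checked directly: at the consensual average, the columns of ∇s(x^*) are orthogonal to 1_{n×1}, and a p^* solving (26a) is obtained from the pseudo-inverse of I − W. This is a sketch under those assumptions, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 2

# Ring-graph Metropolis weights (hypothetical network, as before).
W = np.zeros((n, n))
for i in range(n):
    for j in ((i - 1) % n, (i + 1) % n):
        W[i, j] = 1.0 / 3.0
np.fill_diagonal(W, 1.0 - W.sum(axis=1))
I = np.eye(n)

# Quadratic objectives s_i(x) = 0.5*||x - b_i||^2 with r = 0; the
# consensual optimum is the average of the b_i.
b = rng.standard_normal((n, p))
x_star = np.tile(b.mean(axis=0), (n, 1))
g = x_star - b                              # grad s(x*), with q* = 0

# Columns of grad s(x*) are orthogonal to the ones vector at the optimum.
col_sums = np.abs(g.sum(axis=0)).max()

# Since the columns of -g are orthogonal to the ones vector, they lie in
# range(I - W), and the pseudo-inverse yields a p* satisfying (26a).
p_star = -np.linalg.pinv(I - W) @ g
residual = np.abs((I - W) @ p_star + g).max()

print(col_sums, residual)  # both ~ 0
```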

C. Proof of Lemma 3

Lemma 3 (Norm over range space): For any symmetric positive semidefinite matrix A ∈ R^{n×n} with rank r ≤ n, let λ_1 ≥ λ_2 ≥ · · · ≥ λ_r > 0 be its r nonzero eigenvalues. Then range(A), defined in Section I-E, is an rp-dimensional subspace of R^{n×p} and has a norm defined by ‖x‖_{A^†}^2 := ⟨x, A^†x⟩, where A^† is the pseudo-inverse of A. In addition, λ_1^{-1}‖x‖^2 ≤ ‖x‖_{A^†}^2 ≤ λ_r^{-1}‖x‖^2 for all x ∈ range(A).

Proof: Let A = UΣU^⊤, where Σ = Diag(λ_1, λ_2, · · · , λ_r) and the columns of U are orthonormal eigenvectors for the corresponding eigenvalues, i.e., U ∈ R^{n×r} and U^⊤U = I_{r×r}. Then A^† = UΣ^{-1}U^⊤, where Σ^{-1} = Diag(λ_1^{-1}, λ_2^{-1}, · · · , λ_r^{-1}). Letting x = Ay, we have ‖x‖^2 = ⟨UΣU^⊤y, UΣU^⊤y⟩ = ⟨ΣU^⊤y, ΣU^⊤y⟩ = ‖ΣU^⊤y‖^2. In addition,

⟨x, A^†x⟩ = ⟨Ay, A^†Ay⟩ = ⟨UΣU^⊤y, UΣ^{-1}U^⊤UΣU^⊤y⟩ = ⟨ΣU^⊤y, Σ^{-1}ΣU^⊤y⟩.

Therefore,

λ_1^{-1}‖x‖^2 = λ_1^{-1}‖ΣU^⊤y‖^2 ≤ ⟨x, A^†x⟩ ≤ λ_r^{-1}‖ΣU^⊤y‖^2 = λ_r^{-1}‖x‖^2,   (27)

which means that ‖·‖_{A^†}^2 = ⟨·, A^†·⟩ defines a norm for range(A).

D. Proof of Proposition 1

Proposition 1: Let M = c^{-1}(I − W)^† − Λ with Λ being symmetric positive definite and I ⪰ cΛ^{1/2}(I − W)Λ^{1/2} ⪰ 0. Then ‖·‖_M is a norm defined for range(I − W).


Proof: Rewrite the matrix M as

M = c^{-1}(I − W)^† − Λ
  = Λ^{1/2}(c^{-1}Λ^{-1/2}(I − W)^†Λ^{-1/2} − I)Λ^{1/2}
  = Λ^{1/2}((cΛ^{1/2}(I − W)Λ^{1/2})^† − I)Λ^{1/2}.

For any x ∈ range(I − W), we can find y ∈ R^{n×p} such that x = (I − W)Λ^{1/2}y. Then

⟨x, Mx⟩ = ⟨(I − W)Λ^{1/2}y, Λ^{1/2}((cΛ^{1/2}(I − W)Λ^{1/2})^† − I)Λ^{1/2}(I − W)Λ^{1/2}y⟩
        = ⟨Λ^{1/2}(I − W)Λ^{1/2}y, ((cΛ^{1/2}(I − W)Λ^{1/2})^† − I)Λ^{1/2}(I − W)Λ^{1/2}y⟩.

We apply Lemma 3 to Λ^{1/2}(I − W)Λ^{1/2} and obtain the result.
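The conclusion can also be checked numerically: when c is chosen so that cΛ^{1/2}(I − W)Λ^{1/2} ⪯ I, the quadratic form ⟨x, Mx⟩ is nonnegative over range(I − W). The W and Λ below are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6

# Ring-graph Metropolis weights (symmetric, doubly stochastic).
W = np.zeros((n, n))
for i in range(n):
    for j in ((i - 1) % n, (i + 1) % n):
        W[i, j] = 1.0 / 3.0
np.fill_diagonal(W, 1.0 - W.sum(axis=1))
I = np.eye(n)

Lam = np.diag(rng.uniform(0.5, 1.5, n))
S = np.sqrt(Lam)                                      # Lambda^{1/2}
c = 0.99 / np.linalg.eigvalsh(S @ (I - W) @ S).max()  # enforces the condition

M = np.linalg.pinv(I - W) / c - Lam

# Sample directions in range(I - W) and evaluate the quadratic form.
vals = []
for _ in range(100):
    x = (I - W) @ rng.standard_normal(n)
    vals.append(float(x @ M @ x))

print(min(vals) >= -1e-9)  # True: M is positive semidefinite on range(I - W)
```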

E. Proof of Theorem 1

Before proving Theorem 1, we present two lemmas. The first lemma shows that the distance to a fixed point of (12) is nonincreasing, and the second one shows that the distance between two successive iterates is nonincreasing.

Lemma 4 (A key inequality of descent): Let (d^*, z^*) be a fixed point of (12) with d^* ∈ range(I − W). For the sequence (d^k, z^k) generated from NIDS in (12) with I ⪰ cΛ^{1/2}(I − W)Λ^{1/2}, we have

‖z^{k+1} − z^*‖_{Λ^{-1}}^2 + ‖d^{k+1} − d^*‖_M^2
≤ ‖z^k − z^*‖_{Λ^{-1}}^2 + ‖d^k − d^*‖_M^2   (28)
  − (1 − max_i α_i L_i/2)(‖z^k − z^{k+1}‖_{Λ^{-1}}^2 + ‖d^k − d^{k+1}‖_M^2).

Proof: Young's inequality and (7) give us

2⟨∇s(x^k) − ∇s(x^*), z^k − z^{k+1}⟩ − 2⟨x^k − x^*, ∇s(x^k) − ∇s(x^*)⟩
≤ (1/2)‖z^k − z^{k+1}‖_L^2 + 2‖∇s(x^k) − ∇s(x^*)‖_{L^{-1}}^2 − 2‖∇s(x^k) − ∇s(x^*)‖_{L^{-1}}^2
= (1/2)‖z^k − z^{k+1}‖_L^2.

Therefore, from (15), we have

‖z^{k+1} − z^*‖_{Λ^{-1}}^2 + ‖d^{k+1} − d^*‖_M^2
≤ ‖z^k − z^*‖_{Λ^{-1}}^2 + ‖d^k − d^*‖_M^2 − ‖z^k − z^{k+1}‖_{Λ^{-1}}^2 − ‖d^k − d^{k+1}‖_M^2 + (1/2)‖z^k − z^{k+1}‖_L^2
≤ ‖z^k − z^*‖_{Λ^{-1}}^2 + ‖d^k − d^*‖_M^2 − (1 − max_i α_i L_i/2)‖z^k − z^{k+1}‖_{Λ^{-1}}^2 − ‖d^k − d^{k+1}‖_M^2
≤ ‖z^k − z^*‖_{Λ^{-1}}^2 + ‖d^k − d^*‖_M^2 − (1 − max_i α_i L_i/2)(‖z^k − z^{k+1}‖_{Λ^{-1}}^2 + ‖d^k − d^{k+1}‖_M^2).

This completes the proof.

Lemma 5 (Monotonicity of successive difference in a special norm): Let (d^k, z^k) be the sequence generated from NIDS in (12) with α_i < 2/L_i for all i and I ⪰ cΛ^{1/2}(I − W)Λ^{1/2}. Then the sequence {‖z^{k+1} − z^k‖_{Λ^{-1}}^2 + ‖d^{k+1} − d^k‖_M^2}_{k≥0} is monotonically nonincreasing.

Proof: Similar to the proof of Lemma 4, we can show that

⟨d^{k+1} − d^k, z^{k+1} − z^k + x^k⟩ = ⟨d^{k+1} − d^k, d^{k+1} − d^k⟩_M,   (29)
⟨d^{k+1} − d^k, z^k − z^{k-1} + x^{k-1}⟩ = ⟨d^{k+1} − d^k, d^k − d^{k-1}⟩_M,   (30)
⟨x^k − x^{k-1}, z^k − x^k − z^{k-1} + x^{k-1}⟩_{Λ^{-1}} ≥ 0.   (31)

Subtracting (30) from (29) on both sides, we have

⟨d^{k+1} − d^k, x^k − x^{k-1}⟩
= ‖d^{k+1} − d^k‖_M^2 − ⟨d^{k+1} − d^k, d^k − d^{k-1}⟩_M + ⟨d^{k+1} − d^k, 2z^k − z^{k-1} − z^{k+1}⟩
≥ ‖d^{k+1} − d^k‖_M^2 − (1/2)‖d^{k+1} − d^k‖_M^2 − (1/2)‖d^k − d^{k-1}‖_M^2 + ⟨d^{k+1} − d^k, 2z^k − z^{k-1} − z^{k+1}⟩
= (1/2)‖d^{k+1} − d^k‖_M^2 − (1/2)‖d^k − d^{k-1}‖_M^2 + ⟨d^{k+1} − d^k, 2z^k − z^{k-1} − z^{k+1}⟩,   (32)

where the inequality comes from the Cauchy-Schwarz inequality. Then, the previous inequality, together with (31) and the Cauchy-Schwarz inequality, gives

⟨x^k − x^{k-1}, ∇s(x^k) − ∇s(x^{k-1})⟩
≤ ⟨x^k − x^{k-1}, Λ^{-1}(z^k − x^k − z^{k-1} + x^{k-1}) + ∇s(x^k) − ∇s(x^{k-1})⟩
= ⟨x^k − x^{k-1}, Λ^{-1}(z^k − z^{k+1} − z^{k-1} + z^k) − d^{k+1} + d^k⟩
≤ ⟨x^k − x^{k-1}, z^k − z^{k+1} − z^{k-1} + z^k⟩_{Λ^{-1}} − ⟨d^{k+1} − d^k, 2z^k − z^{k-1} − z^{k+1}⟩
  − (1/2)‖d^{k+1} − d^k‖_M^2 + (1/2)‖d^k − d^{k-1}‖_M^2
= ⟨Λ^{-1}(x^k − x^{k-1}) − d^{k+1} + d^k, z^k − z^{k+1} − z^{k-1} + z^k⟩
  − (1/2)‖d^{k+1} − d^k‖_M^2 + (1/2)‖d^k − d^{k-1}‖_M^2,

and consequently

⟨x^k − x^{k-1}, ∇s(x^k) − ∇s(x^{k-1})⟩
≤ ⟨Λ^{-1}(z^{k+1} − z^k) + ∇s(x^k) − ∇s(x^{k-1}), z^k − z^{k+1} − z^{k-1} + z^k⟩
  − (1/2)‖d^{k+1} − d^k‖_M^2 + (1/2)‖d^k − d^{k-1}‖_M^2
≤ ⟨z^{k+1} − z^k, z^k − z^{k+1} − z^{k-1} + z^k⟩_{Λ^{-1}} + (1/2)‖z^k − z^{k+1} − z^{k-1} + z^k‖_{Λ^{-1}}^2
  + (1/2)‖∇s(x^k) − ∇s(x^{k-1})‖_Λ^2 − (1/2)‖d^{k+1} − d^k‖_M^2 + (1/2)‖d^k − d^{k-1}‖_M^2
= (1/2)‖z^k − z^{k-1}‖_{Λ^{-1}}^2 − (1/2)‖z^{k+1} − z^k‖_{Λ^{-1}}^2
  + (1/2)‖∇s(x^k) − ∇s(x^{k-1})‖_Λ^2 − (1/2)‖d^{k+1} − d^k‖_M^2 + (1/2)‖d^k − d^{k-1}‖_M^2.

The three inequalities hold because of (31), (32), and the Cauchy-Schwarz inequality, respectively. The first and third equalities come from (12c). Rearranging the previous inequality, we obtain

‖z^{k+1} − z^k‖_{Λ^{-1}}^2 + ‖d^{k+1} − d^k‖_M^2
≤ ‖z^k − z^{k-1}‖_{Λ^{-1}}^2 + ‖d^k − d^{k-1}‖_M^2 + (1/2)‖∇s(x^k) − ∇s(x^{k-1})‖_Λ^2 − ⟨x^k − x^{k-1}, ∇s(x^k) − ∇s(x^{k-1})⟩
≤ ‖z^k − z^{k-1}‖_{Λ^{-1}}^2 + ‖d^k − d^{k-1}‖_M^2 + (1/2)‖∇s(x^k) − ∇s(x^{k-1})‖_{Λ−2L^{-1}}^2
≤ ‖z^k − z^{k-1}‖_{Λ^{-1}}^2 + ‖d^k − d^{k-1}‖_M^2,

where the second and last inequalities come from (7) and Λ ≺ 2L^{-1}, respectively. This completes the proof.

Theorem 1 (Sublinear rate): Let (d^k, z^k) be the sequence generated from NIDS in (12) with α_i < 2/L_i for all i and I ⪰ cΛ^{1/2}(I − W)Λ^{1/2}. We have

‖z^k − z^{k+1}‖_{Λ^{-1}}^2 + ‖d^k − d^{k+1}‖_M^2 ≤ (‖z^1 − z^*‖_{Λ^{-1}}^2 + ‖d^1 − d^*‖_M^2) / (k(1 − max_i α_i L_i/2)),   (33)

‖z^k − z^{k+1}‖_{Λ^{-1}}^2 + ‖d^k − d^{k+1}‖_M^2 = o(1/(k + 1)).

Furthermore, (d^k, z^k) converges to a fixed point (d, z) of iteration (12) with d ∈ range(I − W), if I ≻ cΛ^{1/2}(I − W)Λ^{1/2}.

Proof: Lemma 5 shows that {‖z^{k+1} − z^k‖_{Λ^{-1}}^2 + ‖d^{k+1} − d^k‖_M^2}_{k≥0} is monotonically nonincreasing. Summing up (28) from 1 to k, we have

∑_{j=1}^{k} (‖z^j − z^{j+1}‖_{Λ^{-1}}^2 + ‖d^j − d^{j+1}‖_M^2)
≤ (1/(1 − max_i α_i L_i/2)) (‖z^1 − z^*‖_{Λ^{-1}}^2 + ‖d^1 − d^*‖_M^2 − ‖z^{k+1} − z^*‖_{Λ^{-1}}^2 − ‖d^{k+1} − d^*‖_M^2).

Therefore, we have

‖z^k − z^{k+1}‖_{Λ^{-1}}^2 + ‖d^k − d^{k+1}‖_M^2
≤ (1/k) ∑_{j=1}^{k} (‖z^j − z^{j+1}‖_{Λ^{-1}}^2 + ‖d^j − d^{j+1}‖_M^2)
≤ (1/(k(1 − max_i α_i L_i/2))) (‖z^1 − z^*‖_{Λ^{-1}}^2 + ‖d^1 − d^*‖_M^2),

which is the first bound in (33), and [59, Lemma 1] gives the o(1/(k + 1)) rate in (33).

When I ≻ cΛ^{1/2}(I − W)Λ^{1/2}, inequality (28) shows that the sequence (d^k, z^k) is bounded, so there exists a convergent subsequence (d^{k_i}, z^{k_i}) → (d, z). Then (33) gives the convergence of (d^{k_i+1}, z^{k_i+1}); more specifically, (d^{k_i+1}, z^{k_i+1}) → (d, z). Therefore, (d, z) is a fixed point of iteration (12). In addition, because d^k ∈ range(I − W) for all k, we have d ∈ range(I − W). Finally, Lemma 4 implies the convergence of the whole sequence (d^k, z^k) to (d, z).
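Both the monotonicity of Lemma 5 and the convergence claimed in Theorem 1 can be observed on the same hypothetical quadratic instance (r ≡ 0, L_i = 1, ring-graph W; the iteration is transcribed from the proofs above, not from the main text): the iterates reach consensus at the average of the b_i, and the successive-difference quantity never increases.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 5, 2

# Ring-graph Metropolis weights.
W = np.zeros((n, n))
for i in range(n):
    for j in ((i - 1) % n, (i + 1) % n):
        W[i, j] = 1.0 / 3.0
np.fill_diagonal(W, 1.0 - W.sum(axis=1))
I = np.eye(n)

b = rng.standard_normal((n, p))
grad = lambda x: x - b                    # s_i(x) = 0.5*||x - b_i||^2, r = 0
alphas = rng.uniform(0.5, 1.0, n)         # alpha_i < 2/L_i = 2
Lam, Lam_inv = np.diag(alphas), np.diag(1.0 / alphas)
S = np.sqrt(Lam)
c = 0.9 / np.linalg.eigvalsh(S @ (I - W) @ S).max()   # strict version of the condition
M = np.linalg.pinv(I - W) / c - Lam

z = np.zeros((n, p))
d = np.zeros((n, p))                      # d^0 = 0 lies in range(I - W)
diffs = []
for _ in range(2000):
    x = z                                  # (12a) with r = 0
    d_new = d + c * (I - W) @ (2 * x - z - Lam @ grad(x) - Lam @ d)
    z_new = x - Lam @ grad(x) - Lam @ d_new
    # successive-difference quantity from Lemma 5
    diffs.append(np.sum((z_new - z) * (Lam_inv @ (z_new - z)))
                 + np.sum((d_new - d) * (M @ (d_new - d))))
    d, z = d_new, z_new

consensus_err = np.abs(z - b.mean(axis=0)).max()
monotone = all(diffs[k + 1] <= diffs[k] + 1e-12 for k in range(len(diffs) - 1))
print(consensus_err < 1e-6, monotone)  # True True
```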