
Lighter-Communication Distributed Machine Learning via Sufficient Factor Broadcasting

Pengtao Xie, Jin Kyu Kim, Yi Zhou?, Qirong Ho†, Abhimanu Kumar§, Yaoliang Yu, Eric Xing
School of Computer Science, Carnegie Mellon University; ?Department of EECS, Syracuse University;
†Institute for Infocomm Research, A*STAR, Singapore; §Groupon

Abstract

Matrix-parametrized models (MPMs) are widely used in machine learning (ML) applications. In large-scale ML problems, the parameter matrix of an MPM can grow at an unexpected rate, resulting in high communication and parameter synchronization costs. To address this issue, we offer two contributions: first, we develop a computation model for a large family of MPMs, which share the following property: the parameter update computed on each data sample is a rank-1 matrix, i.e. the outer product of two “sufficient factors” (SFs). Second, we implement a decentralized, peer-to-peer system, Sufficient Factor Broadcasting (SFB), which broadcasts the SFs among worker machines and reconstructs the update matrices locally at each worker. SFB takes advantage of small rank-1 matrix updates and efficient partial broadcasting strategies to dramatically improve communication efficiency. We propose a graph-optimization based partial broadcasting scheme, which minimizes the delay of information dissemination under the constraint that each machine only communicates with a subset, rather than all, of the machines. Furthermore, we provide theoretical analysis to show that SFB guarantees convergence of algorithms (under full broadcasting) without requiring a centralized synchronization mechanism. Experiments corroborate SFB's efficiency on four MPMs.

1 INTRODUCTION

Machine Learning (ML) provides a principled and effective mechanism for extracting latent structure and patterns from raw data and making automatic predictions and decisions. The growing prevalence of big data, such as billions of text pages on the web and hundreds of hours of video uploaded to video-sharing sites every minute^1, accompanied by an increasing need for big models, such as neural networks (Dean et al., 2012) and topic models (Yuan et al., 2015) with billions of parameters, has inspired the design and development of distributed machine learning systems (Dean and Ghemawat, 2008; Gonzalez et al., 2012; Zaharia et al., 2012; Li et al., 2014; Xing et al., 2015) running on research clusters, data centers and cloud platforms with 10s-1000s of machines.

^1 https://www.youtube.com/yt/press/statistics.html

For many machine learning (ML) models, such as multiclass logistic regression (MLR), neural networks (NN) (Chilimbi et al., 2014), distance metric learning (DML) (Xing et al., 2002) and sparse coding (SC) (Olshausen and Field, 1997), their parameters can be represented by a matrix W. For example, in MLR, rows of W represent the classification coefficient vectors corresponding to different classes, whereas in SC rows of W correspond to the basis vectors used for reconstructing the observed data. A learning algorithm, such as stochastic gradient descent (SGD), would iteratively compute an update ∆W from data, to be aggregated with the current version of W. We call such models matrix-parameterized models (MPMs).

Learning MPMs in large-scale ML problems is challenging: ML application scales have risen dramatically, a good example being the ImageNet (Deng et al., 2009) compendium with millions of images grouped into tens of thousands of classes. To ensure fast running times when scaling up MPMs to such large problems, it is desirable to turn to distributed computation; however, a unique challenge for MPMs is that the parameter matrix grows rapidly with problem size, causing straightforward parallelization strategies to perform less than ideally. Consider a data-parallel algorithm, in which every worker uses a subset of the data to update the parameters; a common paradigm is to synchronize the full parameter matrix and update matrices amongst all workers (Dean and Ghemawat, 2008; Dean et al., 2012; Li et al., 2015; Chilimbi et al., 2014; Sindhwani and Ghoting, 2012; Gopal and Yang, 2013). However, this synchronization can quickly become a bottleneck: take MLR for example, in which the parameter matrix W is of size J × D, where J is the number of classes and D is the feature dimensionality. In one application of MLR to Wikipedia (Partalas et al., 2015), J = 325k and D > 10,000, thus W contains several billion entries (tens of GBs of memory). Because typical computer cluster networks can only transfer a few GBs per second at most, inter-machine synchronization of W can dominate and bottleneck the actual algorithmic computation. In recent years, many distributed frameworks have been developed for large-scale machine learning, including Bulk Synchronous Parallel (BSP) systems such as Hadoop (Dean and Ghemawat, 2008) and Spark (Zaharia et al., 2012), graph computation frameworks such as GraphLab (Gonzalez et al., 2012), and bounded-asynchronous key-value stores such as DistBelief (Dean et al., 2012), Petuum-PS (Ho et al., 2013), Project Adam (Chilimbi et al., 2014) and (Li et al., 2014). When using these systems to learn MPMs, it is common to transmit the full parameter matrices W and/or matrix updates ∆W between machines, usually in a server-client style (Dean and Ghemawat, 2008; Dean et al., 2012; Sindhwani and Ghoting, 2012; Gopal and Yang, 2013; Chilimbi et al., 2014; Li et al., 2015). As the matrices become larger due to increasing problem sizes, so do communication costs and synchronization delays; hence, reducing such costs is a key priority when using these frameworks.

We begin by investigating the structure of matrix-parameterized models, in order to design efficient communication strategies. We focus on models with a common property: when the parameter matrix W of these models is optimized with stochastic gradient descent (SGD) (Dean et al., 2012; Ho et al., 2013; Chilimbi et al., 2014) or stochastic dual coordinate ascent (SDCA) (Hsieh et al., 2008; Shalev-Shwartz and Zhang, 2013), the update ∆W computed over one (or a few) data sample(s) is of low rank, e.g. it can be written as the outer product of two vectors u and v: ∆W = uv^⊤. The vectors u and v are sufficient factors (SFs, meaning that they are sufficient to reconstruct the update matrix ∆W). A rich set of models (Olshausen and Field, 1997; Lee and Seung, 1999; Xing et al., 2002; Chilimbi et al., 2014) fall into this family: for instance, when solving an MLR problem using SGD, the stochastic gradient is ∆W = uv^⊤, where u is the prediction probability vector and v is the feature vector. Similarly, when solving an ℓ2-regularized MLR problem using SDCA, the update matrix ∆W also admits such a structure, where u is the update vector of a dual variable and v is the feature vector. Other models include neural networks (Chilimbi et al., 2014), distance metric learning (Xing et al., 2002), sparse coding (Olshausen and Field, 1997), non-negative matrix factorization (Lee and Seung, 1999) and principal component analysis, to name a few.
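To make the MLR case concrete, here is a minimal NumPy sketch (ours, not taken from the paper; the toy dimensions and random data are hypothetical) that forms the stochastic gradient of the multiclass logistic loss for a single sample and checks that it is exactly the outer product of the prediction-probability vector (shifted by the one-hot label) and the feature vector:

```python
import numpy as np

J, D = 5, 8                      # toy sizes: 5 classes, 8 features (hypothetical)
rng = np.random.default_rng(0)
W = rng.normal(size=(J, D))      # parameter matrix
a = rng.normal(size=D)           # feature vector of one sample
b = 2                            # class label of the sample

# Forward pass: class probabilities via softmax of the scores Wa.
scores = W @ a
p = np.exp(scores - scores.max())
p /= p.sum()

# Sufficient factors of the stochastic gradient of the multiclass logistic loss:
# u is the prediction-probability vector minus the one-hot label, v is the feature vector.
u = p.copy()
u[b] -= 1.0
v = a

grad = np.outer(u, v)            # rank-1 update: delta_W = u v^T
assert np.linalg.matrix_rank(grad) <= 1
```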

Leveraging this property, we propose a system called Sufficient Factor Broadcasting (SFB), whose basic idea is to send sufficient factors (SFs) between workers, which then reconstruct the matrix updates ∆W locally, thus greatly reducing inter-machine parameter communication. This stands in contrast to the well-established parameter server idiom (Chilimbi et al., 2014; Li et al., 2014), a centralized design where workers maintain a “local” image of the parameters W, which is synchronized with a central parameter image W (stored on the “parameter servers”). In existing parameter server designs, the (small, low-rank) updates ∆W are accumulated into the central parameter server's W, and the low-rank structure of each update ∆W is lost in the process. Thus, the parameter server can only transmit the (large, full-rank) matrix W to the workers, inducing extra communication that could be avoided. We address this issue by designing SFB as a decentralized, peer-to-peer system, where each worker keeps its own image of the parameters W (either in memory or on local disk), and sends sufficient factors to only a subset of the other workers, via “partial broadcasting” strategies that avoid the usual O(P^2) peer-to-peer broadcast communication over P machines. SFB also exploits ML algorithm tolerance to bounded-asynchronous execution (Ho et al., 2013), as supported by both our experiments and a theoretical proof. SFB is highly communication-efficient; transmission costs are linear in the dimensions of the parameter matrix, and the resulting faster communication greatly reduces waiting time in synchronous systems (e.g. Hadoop and Spark), or improves parameter freshness in (bounded) asynchronous systems (e.g. GraphLab, Petuum-PS and (Li et al., 2014)). SFs have been used to speed up some (but not all) network communication in deep learning (Chilimbi et al., 2014); our work differs primarily in that we always transmit SFs, never full matrices.

The major contributions of this paper are as follows:

• We identify the sufficient factor property of a large family of matrix-parametrized models when solved with two popular algorithms: stochastic gradient descent and stochastic dual coordinate ascent.

• In light of the sufficient factor property, we propose a sufficient factor broadcasting (SFB) model of computation. Through a decentralized, peer-to-peer architecture with bounded-asynchronous partial broadcasting, SFB greatly reduces communication complexity while maintaining excellent empirical performance.

• To further reduce communication cost, we investigate a partial broadcasting scheme and propose a graph-optimization based approach to determine the topology of the communication network.

• We analyze the communication and computation costs of SFB and provide a convergence guarantee for the SFB-based minibatch SGD algorithm, under bulk synchronous and bounded-asynchronous executions.


• We empirically evaluate SFB on four popular models, and confirm the efficiency and low communication complexity of SFB.

The rest of the paper is organized as follows. In Sections 2 and 3, we introduce the sufficient factor property of matrix-parametrized models and propose the sufficient factor broadcasting computation model, respectively. Section 4 analyzes the costs and convergence behavior of SFB. Section 5 gives experimental results. Section 6 reviews related works and Section 7 concludes the paper.

2 SUFFICIENT FACTOR PROPERTY OF MATRIX-PARAMETRIZED MODELS

The core goal of Sufficient Factor Broadcasting (SFB) is to reduce network communication costs for matrix-parametrized models; specifically, those that follow an optimization formulation

(P)   \min_{\mathbf{W}} \; \frac{1}{N}\sum_{i=1}^{N} f_i(\mathbf{W}\mathbf{a}_i) + h(\mathbf{W})   (1)

where the model is parametrized by a matrix W ∈ R^{J×D}. The loss function f_i(·) is typically defined over a set of training samples {(a_i, b_i)}_{i=1}^N, with the dependence on b_i being suppressed. We allow f_i(·) to be either convex or nonconvex, smooth or nonsmooth (with subgradients everywhere); examples include the ℓ2 loss and the multiclass logistic loss, amongst others. The regularizer h(W) is assumed to admit an efficient proximal operator prox_h(·). For example, h(·) could be an indicator function of convex constraints, or an ℓ1-, ℓ2- or trace-norm regularizer, to name a few. The vectors a_i and b_i can represent observed features, supervised information (e.g., class labels in classification, response values in regression), or even unobserved auxiliary information (such as sparse codes in sparse coding (Olshausen and Field, 1997)) associated with data sample i. The key property we exploit below stems from the matrix-vector multiplication Wa_i. This optimization problem (P) can be used to represent a rich set of ML models (Olshausen and Field, 1997; Lee and Seung, 1999; Xing et al., 2002; Chilimbi et al., 2014), such as the following:

Distance Metric Learning (DML) (Xing et al., 2002) improves the performance of other ML algorithms by learning a new distance function that correctly represents similar and dissimilar pairs of data samples; this distance function is a matrix W that can have billions of parameters or more, depending on the data sample dimensionality. The vector a_i is the difference of the feature vectors in the i-th data pair, and f_i(·) can be either a quadratic function or a hinge loss function, depending on the similarity/dissimilarity label b_i of the data pair. In both cases, h(·) can be an ℓ1-, ℓ2- or trace-norm regularizer, or simply h(·) = 0 (no regularization).

Sparse Coding (SC) (Olshausen and Field, 1997) learns a dictionary of basis vectors from data, so that the data can be re-represented sparsely (and thus efficiently) in terms of the dictionary. In SC, W is the dictionary matrix, a_i are the sparse codes, b_i is the input feature vector and f_i(·) is a quadratic function (Olshausen and Field, 1997). To prevent the entries in W from becoming too large, each column W_k must satisfy ||W_k||_2 ≤ 1. In this case, h(W) is an indicator function which equals 0 if W satisfies the constraints and equals ∞ otherwise.

2.1 OPTIMIZATION VIA PROXIMAL SGD, SDCA

To solve the optimization problem (P), it is common to employ either (proximal) stochastic gradient descent (SGD) (Dean et al., 2012; Ho et al., 2013; Chilimbi et al., 2014; Li et al., 2015) or stochastic dual coordinate ascent (SDCA) (Hsieh et al., 2008; Shalev-Shwartz and Zhang, 2013), both of which are popular and well-established parallel optimization techniques.

Proximal SGD: In proximal SGD, a stochastic estimate of the gradient, ∆W, is first computed over one data sample (or a mini-batch of samples), in order to update W via W ← W − η∆W (where η is the learning rate). Following this, the proximal operator prox_{ηh}(·) is applied to W. Notably, the stochastic gradient ∆W in (P) can be written as the outer product of two vectors, ∆W = uv^⊤, where u = ∂f(Wa_i, b_i)/∂(Wa_i) and v = a_i, according to the chain rule. Later, we will show that this low-rank structure of ∆W can greatly reduce inter-worker communication.
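The proximal SGD step itself can be written against the SF pair alone; a minimal sketch follows (ours; the loss_grad callback, the squared loss and the ridge prox are illustrative choices, not part of any API described in the paper):

```python
import numpy as np

def prox_sgd_step(W, a, b, loss_grad, prox, eta):
    """One proximal SGD step in sufficient-factor form.

    loss_grad(scores, b) returns u = d f(W a, b) / d (W a);
    by the chain rule the full gradient is the rank-1 matrix u a^T.
    """
    u = loss_grad(W @ a, b)          # sufficient factor u
    v = a                            # sufficient factor v
    W = W - eta * np.outer(u, v)     # rank-1 gradient step
    return prox(W, eta)              # proximal operator of eta * h

# Example pieces (hypothetical): squared loss and a ridge (l2) proximal operator.
sq_loss_grad = lambda s, b: s - b                       # d/ds 0.5 * ||s - b||^2
l2_prox = lambda W, eta, lam=0.1: W / (1.0 + eta * lam)  # prox of (lam/2) ||W||_F^2
```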

Stochastic DCA: SDCA applies to problems (P) where f_i(·) is convex and h(·) is strongly convex (e.g. when h(·) contains the squared ℓ2 norm); it solves the dual problem of (P) via stochastic coordinate ascent on the dual variables. Introducing the dual matrix U = [u_1, ..., u_N] ∈ R^{J×N} and the data matrix A = [a_1, ..., a_N] ∈ R^{D×N}, the dual problem of (P) can be written as

(D)   \min_{\mathbf{U}} \; \frac{1}{N}\sum_{i=1}^{N} f_i^*(-\mathbf{u}_i) + h^*\!\left(\frac{1}{N}\mathbf{U}\mathbf{A}^\top\right)   (2)

where f_i^*(·) and h^*(·) are the Fenchel conjugate functions of f_i(·) and h(·), respectively. The primal-dual matrices W and U are connected by^2 W = ∇h^*(Z), where the auxiliary matrix Z := (1/N) UA^⊤. Algorithmically, we need to update the dual matrix U, the primal matrix W, and the auxiliary matrix Z: every iteration, we pick a random data sample i, and compute the stochastic update ∆u_i by minimizing (D) while holding {u_j}_{j≠i} fixed. The dual variable is updated via u_i ← u_i − ∆u_i, the auxiliary variable via Z ← Z − ∆u_i a_i^⊤, and the primal variable via W ← ∇h^*(Z). Similar to SGD, the update of Z is also the outer product of two vectors, ∆u_i and a_i, which can be exploited to reduce communication cost.

^2 The strong convexity of h is equivalent to the smoothness of the conjugate function h^*.
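For intuition on the primal-dual link, the following sketch (ours) assumes the specific regularizer h(W) = (λ/2)||W||_F^2, for which ∇h^*(Z) = Z/λ, and shows that Z only ever changes by an outer product while W is recovered by a cheap elementwise map; the stand-in dual update delta_u is hypothetical:

```python
import numpy as np

lam = 0.1                           # assumed l2 regularization strength
J, D, N = 4, 6, 20
rng = np.random.default_rng(1)
A = rng.normal(size=(D, N))         # data matrix, columns a_i
Z = np.zeros((J, D))                # auxiliary matrix Z = (1/N) U A^T, initially U = 0

i = rng.integers(N)                 # pick a random sample
delta_u = rng.normal(size=J)        # stand-in for the SDCA dual update on u_i

Z = Z - np.outer(delta_u, A[:, i])  # rank-1 update of Z from the SF pair (delta_u, a_i)
W = Z / lam                         # primal recovery W = grad h*(Z) for h = (lam/2)||W||_F^2
```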


Sufficient Factor Property in SGD and SDCA: In both SGD and SDCA, the parameter matrix update can be computed as the outer product of two vectors, which we call sufficient factors (SFs). This property can be leveraged to improve the communication efficiency of distributed ML systems: instead of communicating parameter/update matrices among machines, we can communicate the SFs and reconstruct the update matrices locally at each machine. Because the SFs are much smaller in size, synchronization costs can be dramatically reduced. See Section 4 below for a detailed analysis.

Low-rank Extensions: More generally, the update matrix ∆W may not be exactly rank-1, but still of very low rank. For example, when each machine uses a mini-batch of size K, ∆W is of rank at most K; in Restricted Boltzmann Machines, the update of the weight matrix is computed from four vectors u_1, v_1, u_2, v_2 as u_1v_1^⊤ − u_2v_2^⊤, i.e. rank-2; for the BFGS algorithm (Bertsekas, 1999), the update of the inverse Hessian is computed from two vectors u, v as αuu^⊤ − β(uv^⊤ + vu^⊤), i.e. rank-3. Even when the update matrix ∆W is not genuinely low-rank, to reduce communication cost, it might still make sense to send only a certain low-rank approximation. We intend to investigate these possibilities in future work.
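The low-rank claims above are straightforward to verify numerically; a small sketch (ours, with random stand-in vectors):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 10
u1, v1, u2, v2 = (rng.normal(size=d) for _ in range(4))
u, v = rng.normal(size=d), rng.normal(size=d)
alpha, beta = 0.5, 0.3

rbm_update = np.outer(u1, v1) - np.outer(u2, v2)          # rank at most 2
bfgs_update = alpha * np.outer(u, u) - beta * (np.outer(u, v) + np.outer(v, u))  # rank at most 3

assert np.linalg.matrix_rank(rbm_update) <= 2
assert np.linalg.matrix_rank(bfgs_update) <= 3
```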

3 SUFFICIENT FACTOR BROADCASTING

Leveraging the SF property of the update matrix in problems (P) and (D), we propose a Sufficient Factor Broadcasting (SFB) system that supports efficient (low-communication) distributed learning of the parameter matrix W. We assume a setting with P workers, each of which holds a data shard and a copy of the parameter matrix^3 W. Stochastic updates to W are generated via proximal SGD or SDCA, and communicated between machines to ensure parameter consistency. In proximal SGD, on every iteration, each worker p computes SFs (u_p, v_p), based on one data sample x_i = (a_i, b_i) in the worker's data shard. The worker then broadcasts (u_p, v_p) to all other workers; once all P workers have performed their broadcast (and have thus received all SFs), they reconstruct the P update matrices (one per data sample) from the P SFs, and apply them to update their local copy of W. Finally, each worker applies the proximal operator prox_h(·). When using SDCA, the above procedure is instead used to broadcast SFs for the auxiliary matrix Z, which is then used to obtain the primal matrix W = ∇h^*(Z). Figure 1 illustrates SFB operation: 4 workers compute their respective SFs (u_1, v_1), ..., (u_4, v_4), which are then broadcast to the other 3 workers. Each worker p uses all 4 SFs (u_1, v_1), ..., (u_4, v_4) to exactly reconstruct the update matrices ∆W_p = u_p v_p^⊤, and updates its local copy of the parameter matrix: W_p ← W_p − Σ_{q=1}^4 u_q v_q^⊤. While the above description reflects synchronous execution, it is easy to extend to (bounded) asynchronous execution.

^3 For simplicity, we assume each worker has enough memory to hold a full copy of the parameter matrix W. If W is too large, one can either partition it across multiple machines (Dean et al., 2012; Li et al., 2014), or use local disk storage (i.e. out-of-core operation). We plan to investigate these strategies as future work.

Figure 1: Sufficient Factor Broadcasting (SFB).
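A minimal sketch (ours) of the per-iteration logic each SFB worker runs under BSP and full broadcasting; compute_sf, broadcast_sfs and collect_sfs are hypothetical placeholders for the user's SF function and the system's communication layer:

```python
import numpy as np

def sfb_worker_iteration(W, local_sample, compute_sf, prox, eta,
                         broadcast_sfs, collect_sfs):
    """One BSP iteration of an SFB worker under full broadcasting.

    compute_sf maps (W, sample) to a sufficient-factor pair (u, v);
    broadcast_sfs / collect_sfs stand in for the communication layer.
    """
    u, v = compute_sf(W, local_sample)      # 1. compute this worker's SF pair
    broadcast_sfs((u, v))                   # 2. send it to the other P - 1 workers
    all_sfs = collect_sfs()                 # 3. receive the SF pairs of all P workers
    for (uq, vq) in all_sfs:                # 4. reconstruct and apply every rank-1 update
        W = W - eta * np.outer(uq, vq)
    return prox(W, eta)                     # 5. apply the proximal operator of h
```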

SFB vs Client-Server Architectures: The SFB peer-to-peer topology can be contrasted with a “full-matrix” client-server architecture for parameter synchronization, e.g. as used by Project Adam (Chilimbi et al., 2014) to learn neural networks: there, a centralized server maintains the global parameter matrix, and each client keeps a local copy. Clients compute sufficient factors and send them to the server, which uses the SFs to update the global parameter matrix; the server then sends the full, updated parameter matrix back to the clients. Although client-to-server costs are reduced (by sending SFs), server-to-client costs are still expensive because full parameter matrices need to be sent. In contrast, the peer-to-peer SFB topology never sends full matrices; only SFs are sent over the network. We also note that under SFB, the update matrices are reconstructed at each of the P machines, rather than once at a central server (for full-matrix architectures). Our experiments show that the time taken for update reconstruction is empirically negligible compared to communication and SF computation.

Partial Broadcasting: A naive peer-to-peer topology incurs a communication cost of O(P^2) (where P is the number of worker machines), which inhibits scalability to data-center-scale clusters where P can reach several thousand (Dean et al., 2012; Li et al., 2014). Hence, SFB adopts an (optional) partial broadcasting scheme where each machine connects with and sends messages to a subset of Q machines (rather than all other machines), thus reducing communication costs from O(P^2) to O(PQ). Figure 2 presents an example. In partial broadcasting, an update U_p^t generated by machine p at iteration t is sent only to machines that are directly connected with p (and the update U_p^t takes effect at iteration t + 1). The effect of U_p^t is indirectly and eventually transmitted to every other machine q, via the updates generated by machines sitting between p and q in the topology. This happens at iteration t + τ, for some delay τ > 1 that depends on Q and the location of p and q in the network topology. Consequently, the P machines will not have exactly the same parameter image W, even under bulk synchronous parallel execution; yet surprisingly, this does not empirically (Section 5) compromise algorithm accuracy as long as Q is not too small. We hypothesize that this property is related to the tolerance of ML algorithms to bounded-asynchronous execution and random error, and we defer a formal proof to future work.

Figure 2: An Example of Partial Broadcasting.

Determining the “best” topology of partial broadcasting, i.e., which subset of machines each machine should send messages to, is a challenging issue. Previous studies (Li et al., 2015) are mostly based on heuristics. While empirically effective, these heuristics lack a mathematically formalizable objective. To address this issue, we propose an optimization-oriented solution, with the goal of achieving fast dissemination of information: the effect of updates generated at each machine should be “seen” by all other machines as quickly as possible. Formally, we aim to reduce the delay τ. Consider a directed network with P nodes, where e_{pq} = 1 denotes that node p sends messages to node q and e_{pq} = 0 otherwise. Let c_{pq} denote the shortest directed path from node p to node q. It is easy to see that τ = |c_{pq}|, where |c_{pq}| is the length of c_{pq}. Letting E = {e_{pq}}, we can find a network topology that minimizes the total delay by solving this optimization problem:

\min_{E} \; \sum_{p=1}^{P}\sum_{q \neq p} |c_{pq}| \quad \text{s.t.} \;\; \sum_{q \neq p} e_{pq} = Q, \;\; \forall p   (3)

which can be efficiently solved by using a branch-and-bound (Land and Doig, 1960) search over the graph structure, in conjunction with an all-pairs shortest path algorithm for directed graphs (Williams, 2014).
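Problem (3) scores a candidate topology by the total pairwise shortest-path delay. The sketch below (ours) evaluates that objective with breadth-first search over the directed edges; it is only the evaluation step that a branch-and-bound search such as (Land and Doig, 1960) would call repeatedly:

```python
from collections import deque

def total_delay(out_edges):
    """Sum of shortest directed path lengths |c_pq| over all ordered pairs (p, q).

    out_edges[p] is the set of nodes that machine p sends messages to.
    Returns float('inf') if some machine is unreachable from another.
    """
    P = len(out_edges)
    total = 0
    for p in range(P):
        dist = {p: 0}
        queue = deque([p])
        while queue:                      # BFS from p over directed edges
            x = queue.popleft()
            for y in out_edges[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        if len(dist) < P:
            return float('inf')
        total += sum(dist[q] for q in range(P) if q != p)
    return total

# Example: a ring over P = 4 machines with Q = 1 outgoing edge per machine.
ring = {0: {1}, 1: {2}, 2: {3}, 3: {0}}
print(total_delay(ring))                  # 1 + 2 + 3 per node -> 24
```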

Mini-batch Proximal SGD/SDCA: SFB can also be used in mini-batch proximal SGD/SDCA; every iteration, each worker samples a mini-batch of K data points and computes K pairs of sufficient factors {(u_i, v_i)}_{i=1}^K. These K pairs are broadcast to all other workers, which reconstruct the originating worker's update matrix as ∆W = (1/K) Σ_{i=1}^K u_i v_i^⊤.
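Reconstructing a worker's mini-batch update from its K received SF pairs is a single low-rank matrix product; a one-line NumPy sketch (ours, with hypothetical sizes):

```python
import numpy as np

K, J, D = 100, 1000, 2000            # hypothetical mini-batch and matrix sizes
rng = np.random.default_rng(3)
U = rng.normal(size=(J, K))          # columns are the u_i received from one worker
V = rng.normal(size=(D, K))          # columns are the corresponding v_i

delta_W = (U @ V.T) / K              # delta_W = (1/K) * sum_i u_i v_i^T, rank <= K
```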

Consistency Models: SFB supports two consistency models, Bulk Synchronous Parallel (BSP-SFB) and Stale Synchronous Parallel (SSP-SFB), and we provide theoretical convergence guarantees in the next section.

• BSP-SFB: Under BSP (Dean and Ghemawat, 2008; Zaharia et al., 2012), an end-of-iteration global barrier ensures all workers have completed their work, and synchronized their parameter copies, before proceeding to the next iteration. BSP is a strong consistency model that guarantees the same computational outcome (and thus algorithm convergence) each time.

sfb_app mlr(int J, int D, int staleness)
// SF computation function
function compute_sv(sfb_app mlr):
  while (!converged):
    X = sample_minibatch()
    foreach xi in X:
      // sufficient factor u_i: the prediction vector
      pred = predict(mlr.para_mat, xi)
      mlr.sv_list[i].write_u(pred)
      // sufficient factor v_i: the feature vector
      mlr.sv_list[i].write_v(xi)
    // send out all SF pairs accumulated in this iteration
    commit()

Figure 3: Multiclass LR Pseudocode.

• SSP-SFB: BSP can be sensitive to stragglers (slow workers) (Ho et al., 2013), limiting the distributed system to the speed of the slowest worker. The Stale Synchronous Parallel (SSP) (Bertsekas and Tsitsiklis, 1989; Ho et al., 2013) communication model addresses this issue by allowing workers to advance at different rates, provided that the difference in iteration number between the slowest and fastest workers is no more than a user-provided staleness s. SSP alleviates the straggler issue while guaranteeing algorithm convergence (Ho et al., 2013). Under SSP-SFB, each worker p tracks the number of SF pairs computed by itself, t_p, versus the number τ_p^q(t_p) of SF pairs received from each worker q. If there exists a worker q such that t_p − τ_p^q(t_p) > s (i.e. some worker q is likely more than s iterations behind worker p), then worker p pauses until q is no longer s or more iterations behind. A sketch of this check is given below.
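The SSP-SFB rule amounts to a simple gate before each new local iteration; a sketch (ours, where the map tau from peer id to received-SF count is a hypothetical bookkeeping structure):

```python
def must_wait(t_p, tau, s):
    """Return True if worker p (at local clock t_p) must pause under SSP-SFB.

    tau[q] is the number of SF pairs worker p has received from peer q;
    s is the user-provided staleness bound.
    """
    return any(t_p - tau_q > s for tau_q in tau.values())

# Example: worker p is at iteration 12 with staleness s = 3.
print(must_wait(12, {"q1": 11, "q2": 8}, s=3))   # True: q2 is 4 iterations behind
```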

Programming Interface: The SFB programming interface is simple; users need to provide an SF computation function to specify how to compute the sufficient factors. To send out SF pairs (u, v), the user adds them to a buffer object sv_list via write_u(vec_u) and write_v(vec_v), which set the i-th SF u or v to vec_u or vec_v. All SF pairs are sent out at the end of an iteration, which is signaled by commit(). Finally, in order to choose between BSP and SSP consistency, users simply set staleness to an appropriate value (0 for BSP, > 0 for SSP). SFB automatically updates each worker's local parameter matrix using all SF pairs, including both locally computed SF pairs added to sv_list and SF pairs received from other workers. Figure 3 shows SFB pseudocode for multiclass logistic regression. For proximal SGD/SDCA algorithms, SFB requires users to write an additional function, prox(mat), which applies the proximal operator prox_h(·) (or the SDCA dual operator h^*(·)) to the parameter matrix mat.
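As an illustration of the user-supplied prox(mat) hook, here are sketches (ours, with hypothetical signatures) of two proximal operators mentioned earlier: the ℓ1 soft-threshold, and the per-column ℓ2-ball projection used as the sparse coding constraint:

```python
import numpy as np

def prox_l1(mat, step, lam=0.001):
    """Soft-thresholding: proximal operator of step * lam * ||W||_1."""
    return np.sign(mat) * np.maximum(np.abs(mat) - step * lam, 0.0)

def prox_l2_ball_columns(mat, step=None):
    """Project each column onto the unit l2 ball (the SC constraint ||W_k||_2 <= 1)."""
    norms = np.linalg.norm(mat, axis=0, keepdims=True)
    return mat / np.maximum(norms, 1.0)
```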

4 COST ANALYSIS AND THEORY

We now examine the costs and convergence behavior of SFB under synchronous and bounded-asynchronous (e.g. SSP (Bertsekas and Tsitsiklis, 1989; Ho et al., 2013)) consistency, and show that SFB can be preferable to full-matrix synchronization/communication schemes.

4.1 COST ANALYSIS

Figure 4 compares the communication, space and time (to apply updates to W) costs of peer-to-peer SFB against full matrix synchronization (FMS) under a client-server architecture (Chilimbi et al., 2014). For SFB with a full broadcasting scheme, in each minibatch, every worker broadcasts K SF pairs (u, v) to P − 1 other workers, i.e. O(P^2 K(J + D)) values are sent per iteration, which is linear in the matrix dimensions J, D and quadratic in P. For SFB with a partial broadcasting scheme, every worker communicates SF pairs with Q < P peers, hence the communication cost is reduced to O(PQK(J + D)). Because SF pairs cannot be aggregated before transmission, the cost has a dependency on K. In contrast, the communication cost in FMS is O(PJD), linear in P, quadratic in the matrix dimensions, and independent of K. For both SFB and FMS, the cost of storing W is O(JD) on every machine. As for the time taken to update W per iteration, FMS costs O(PJD) at the server (to aggregate P client update matrices) and O(PKJD) at the P clients (to aggregate K updates into one update matrix). By comparison, SFB bears a cost of O(P^2 KJD) under full broadcasting and O(PQKJD) under partial broadcasting, due to the additional overhead of reconstructing each update matrix P or Q times.

Compared with FMS, SFB achieves communication savings by paying an extra computation cost. In a number of practical scenarios, such a tradeoff is worthwhile. Consider large problem scales where min(J, D) ≥ 10000, and moderate minibatch sizes 1 ≤ K ≤ 100 (as studied in this paper); when using a moderate number of machines (around 10-100), the O(P^2 K(J + D)) communication cost of SFB is lower than the O(PJD) cost of FMS, and the relative benefit of SFB improves as the dimensions J, D of W grow. In data-center-scale computing environments with thousands of machines, we can adopt the partial broadcasting scheme. As for the time needed to apply updates to W, it turns out that the additional cost of reconstructing each update matrix P or Q times in SFB is negligible in practice: we have observed in our experiments that the time spent computing SFs, as well as communicating SFs over the network, greatly dominates the cost of reconstructing update matrices using SFs. Overall, the communication savings dominate the added computational overhead, which we validated in experiments.
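To see why this tradeoff favors SFB at the scales studied here, a back-of-the-envelope sketch (ours) plugs the Wikipedia MLR dimensions from Section 5 into the two leading communication terms; the counts ignore constants and sparsity:

```python
# Per-iteration values communicated (order-of-magnitude, constants ignored).
J, D = 325_000, 20_000     # classes and feature dimension (Wikipedia MLR)
P, K = 12, 100             # workers and mini-batch size

sfb_full = P**2 * K * (J + D)      # SFB, full broadcasting:   O(P^2 K (J + D))
fms      = P * J * D               # FMS, client-server:       O(P J D)

print(f"SFB ~ {sfb_full:.2e} values, FMS ~ {fms:.2e} values")
# SFB ~ 4.97e+09 values, FMS ~ 7.80e+10 values: roughly 16x fewer for SFB here.
```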

4.2 CONVERGENCE ANALYSIS

We study the convergence of minibatch SGD under full-broadcasting SFB (with extensions to proximal SGD and SDCA being a topic for future study). Since SFB is a peer-to-peer decentralized computation model, we need to show that parameter copies on different workers converge to the same limiting point without a centralized coordination, even under delays in communication due to bounded-asynchronous execution. In this respect, we differ from analyses of centralized parameter server systems (Ho et al., 2013), which instead show convergence of the global parameters on the central server.

We wish to solve the optimization problem \min_{\mathbf{W}} \sum_{m=1}^{M} f_m(\mathbf{W}), where M is the number of training data minibatches, and f_m corresponds to the loss function on the m-th minibatch. Assume the training data minibatches {1, ..., M} are divided into P disjoint subsets {S_1, ..., S_P}, with |S_p| denoting the number of minibatches in S_p. Denote F = \sum_{m=1}^{M} f_m as the total loss, and, for p = 1, ..., P, F_p := \sum_{j \in S_p} f_j as the loss on S_p (on the p-th machine).

Consider a distributed system with P machines. Each machine p keeps a local variable W_p and the training data in S_p. At each iteration, machine p draws one minibatch I_p uniformly at random from partition S_p, and computes the partial gradient \sum_{j \in I_p} \nabla f_j(W_p). Each machine updates its local variable by accumulating partial updates from all machines. Denote by η_c the learning rate at the c-th iteration on every machine. The partial update generated by machine p at its c-th iteration is denoted as

U_p(W_p^c, I_p^c) = -\eta_c |S_p| \sum_{j \in I_p^c} \nabla f_j(W_p^c).

Note that I_p^c is random and the factor |S_p| is there to restore unbiasedness in expectation. The local update rule of machine p is then

W_p^c = W^0 + \sum_{q=1}^{P} \sum_{t=0}^{\tau_p^q(c)} U_q(W_q^t, I_q^t), \qquad 0 \le (c-1) - \tau_p^q(c) \le s,   (4)

where W^0 is the common initializer for all P machines, and τ_p^q(c) is the number of iterations machine q has transmitted to machine p when machine p conducts its c-th iteration. Clearly, τ_p^p(c) = c. Note that we also require τ_p^q(c) ≤ c − 1, i.e., machine p will not use any partial updates of machine q that are too far ahead. This is to avoid correlation in the theoretical analysis. Hence, machine p (at its c-th iteration) accumulates updates generated by machine q up to iteration τ_p^q(c), which is restricted to be at most s iterations behind. This formulation, in which s is the maximum “staleness” allowed between any update and any worker, covers bulk synchronous parallel (BSP) full broadcasting (s = 0) and bounded-asynchronous full broadcasting (s > 0). The following standard assumptions are needed for our analysis:


Computational Model | Total comms, per iter | W storage per machine | W update time, per iter
SFB (peer-to-peer, full broadcasting) | O(P^2 K(J + D)) | O(JD) | O(P^2 KJD)
SFB (peer-to-peer, partial broadcasting) | O(PQK(J + D)) | O(JD) | O(PQKJD)
FMS (client-server (Chilimbi et al., 2014)) | O(PJD) | O(JD) | O(PJD) at server, O(PKJD) at clients

Figure 4: Cost of using SFB versus FMS. K is the minibatch size, J, D are the dimensions of W, and P is the number of workers.

Assumption 1. (1) For all j, f_j is continuously differentiable and F is bounded from below; (2) ∇F and ∇F_p are Lipschitz continuous with constants L_F and L_p, respectively, and let L = \sum_{p=1}^{P} L_p; (3) there exist B, σ^2 such that for all p and c, we have (almost surely) ||W_p^c|| ≤ B and

\mathbb{E}\,\Big\| \, |S_p| \sum_{j \in I_p} \nabla f_j(\mathbf{W}) - \nabla F_p(\mathbf{W}) \Big\|_2^2 \le \sigma^2.

Our analysis is based on the following auxiliary update:

W^c = W^0 + \sum_{q=1}^{P} \sum_{t=0}^{c-1} U_q(W_q^t, I_q^t).   (5)

Compared to the local update (4) on machine p, this auxiliary update essentially accumulates all c − 1 updates generated by all machines, instead of the τ_p^q(c) updates that machine p has access to. We show that all local machine parameter sequences are asymptotically consistent with this auxiliary sequence:

Theorem 1. Let {W_p^c}, p = 1, ..., P, and {W^c} be the local sequences and the auxiliary sequence generated by SFB for problem (P) (with h ≡ 0), respectively. Under Assumption 1, and setting the learning rate \eta_c = O\big(\sqrt{1/(L\sigma^2 P s\, c)}\big), we have:

• \liminf_{c \to \infty} \mathbb{E}\|\nabla F(W^c)\| = 0, hence there exists a subsequence of ∇F(W^c) that almost surely vanishes;

• \lim_{c \to \infty} \max_p \|W^c - W_p^c\| = 0, i.e. the maximal disagreement between all local sequences and the auxiliary sequence converges to 0 (almost surely);

• There exists a common subsequence of {W_p^c} and {W^c} that converges almost surely to a stationary point of F, with the rate \min_{c \le C} \mathbb{E}\big\|\sum_{p=1}^{P} \nabla F_p(W_p^c)\big\|_2^2 \le O\big(\sqrt{L\sigma^2 P s / C}\big).

Intuitively, Theorem 1 says that, given a properly chosen learning rate, all local worker parameters {W_p^c} eventually converge to stationary points (i.e. local minima) of the objective function F, despite the fact that SF transmission can be delayed by up to s iterations. Thus, SFB learning is robust even under bounded-asynchronous communication (such as SSP). Our analysis differs from (Bertsekas and Tsitsiklis, 1989) in two ways: (1) Bertsekas and Tsitsiklis (1989) explicitly maintain a consensus model, which would require transmitting the parameter matrix among worker machines, a communication bottleneck that we were able to avoid; (2) we allow subsampling in each worker machine. Accordingly, our theoretical guarantee is probabilistic, instead of the deterministic one in (Bertsekas and Tsitsiklis, 1989). In future work, we intend to extend the analysis to partial broadcasting under BSP and bounded-asynchronous execution. Partial broadcasting presents additional challenges, because updates are only sent to a subset of machines (rather than every machine).

5 EXPERIMENTS

We demonstrate how four popular models can be efficiently learnt using SFB: (1) multiclass logistic regression (MLR) and distance metric learning (DML)^4 based on SGD; (2) sparse coding (SC) based on proximal SGD; (3) ℓ2-regularized multiclass logistic regression (L2-MLR) based on SDCA. For baselines, we compared with (a) Spark (Zaharia et al., 2012) for MLR and L2-MLR, and (b) full matrix synchronization (FMS) implemented on open-source parameter servers (Ho et al., 2013; Li et al., 2014) for all four models. In certain experiments, we made a comparison with the distributed (L2-)MLR system proposed in (Gopal and Yang, 2013). In FMS, workers send update matrices to the central server, which then sends up-to-date parameter matrices to workers^5. Due to data sparsity, both the update matrices and the sufficient factors are sparse; we use this fact to reduce communication and computation costs. The experiments were performed on a cluster where each machine has 64 2.1GHz AMD cores, 128G memory, and a 10Gbps network interface. Unless otherwise noted, 12 machines were used. Some experiments were conducted on 28 machines.

^4 For DML, we use the parametrization proposed in (Weinberger et al., 2005), which is a linear projection matrix L ∈ R^{d×k}, where d is the feature dimension and k is the latent dimension.

^5 This has the same communication complexity as (Chilimbi et al., 2014), which sends SFs from workers to servers, but sends matrices from servers to workers; the latter matrix transmission dominates the total cost.

Datasets and Experimental Setup: We used two datasets for our experiments: (1) the ImageNet (Deng et al., 2009) ILSVRC2012 dataset, which contains 1.2 million images from 1000 categories; the images are represented with LLC features (Wang et al., 2010), whose dimensionality is 172k. (2) The Wikipedia (Partalas et al., 2015) dataset, which contains 2.4 million documents from 325k categories; documents are represented with term frequency, inverse document frequency (tf-idf), with a dimensionality of 20k. We ran MLR, DML, SC, L2-MLR on the Wikipedia, ImageNet, ImageNet, Wikipedia datasets respectively, and the parameter matrices contained up to 6.5b, 8.6b, 8.6b, 6.5b entries (the largest latent dimension for DML and the largest dictionary size for SC were both 50k). The tradeoff parameters in SC and L2-MLR were set to 0.001 and 0.1. We tuned the minibatch size, and found that K = 100 was near-ideal for all experiments. All experiments used the same constant learning rate (tuned in the range [10^-5, 1]). Unless otherwise stated, a full broadcasting scheme is used for most experiments.

Figure 5: Convergence time versus model size for MLR, DML, SC, L2-MLR (left to right), under BSP.

Figure 6: MLR objective vs runtime (left), #samples vs runtime (middle), objective vs #samples (right).

Convergence Speed and Quality: Figure 5 shows the time taken to reach a fixed objective value, for different model sizes, using BSP consistency (SSP results are in the supplement), on 12 machines. SFB converges faster than FMS, as well as Spark v1.3.1^6. This is because SFB has lower communication costs, hence a greater proportion of running time gets spent on computation rather than network waiting. This is shown in Figure 6, which plots data samples processed per second^7 (throughput) and algorithm progress per sample for MLR, under BSP consistency (SSP results are in the supplement) and varying minibatch sizes. The middle graph shows that SFB processes far more samples per second than FMS, while the rightmost graph shows that SFB and FMS produce exactly the same algorithm progress per sample under BSP. For this experiment, minibatch sizes between K = 10 and 100 performed the best, as indicated by the leftmost graph. We point out that larger model sizes should further improve SFB's advantage over FMS, because SFB has a communication cost linear in the matrix dimensions, whereas FMS has quadratic costs. Under a large model size (e.g., 325k classes in MLR), the communication cost becomes the bottleneck in FMS and causes prolonged network waiting time and parameter synchronization delays, while the cost is moderate in SFB.

^6 Spark is about 2x slower than the PS (Ho et al., 2013) based C++ implementation of FMS, due to JVM and RDD overheads.

^7 We use samples per second instead of iterations, so different minibatch sizes can be compared.

We also evaluated SFB on 28 machines, under BSP and full broadcasting (Q = 27). On MLR (325k classes), SFB took 2.46 hours to converge while FMS took 10.77 hours. On L2-MLR (325k classes), the convergence time of SFB is 2.14 hours while that of FMS is 9.31 hours.

We made a comparison with the distributed (L2-)MLR system proposed by (Gopal and Yang, 2013) on 12 machines. On MLR (325k classes), Gopal and Yang (2013) took 31.7 hours to converge while SFB took 4.5 hours. To converge on L2-MLR (325k classes), Gopal and Yang (2013) took 28.3 hours while SFB took 3.7 hours.

Scalability: In all experiments that follow, we set the number of (L2-)MLR classes, the DML latent dimension, and the SC dictionary size to 325k, 50k, 50k. Figure 7 shows SFB scalability with varying numbers of machines under BSP (SSP results are in the supplement), for MLR, DML, SC, L2-MLR, on up to 12 machines. In general, we observed close to linear (ideal) speedup, with a slight drop at 12 machines. On 28 machines, for MLR and L2-MLR, SFB achieved 17.4x and 15.7x speedup over one machine respectively.

Computation Time vs Network Waiting Time: Figure 8 shows the total computation and network time required for SFB and FMS to converge, across a range of SSP staleness values^8; in general, higher communication cost and lower staleness induce more network waiting. For all staleness values, SFB requires far less network waiting (because SFs are much smaller than the full matrices in FMS). Computation time for SFB is slightly longer than FMS because (1) update matrices must be reconstructed on each SFB worker, and (2) SFB requires a few more iterations for convergence, because peer-to-peer communication causes slightly more parameter inconsistency under staleness. Overall, the SFB reduction in network waiting time remains far greater than the added computation time, and SFB outperforms FMS in total time. For both FMS and SFB, the shortest convergence times are achieved at moderate staleness values, confirming the importance of bounded-asynchronous communication.

^8 The Spark implementation does not easily permit this time breakdown, so we omit it.

Figure 7: SFB scalability with varying machines under BSP, for MLR, DML, SC, L2-MLR (left to right).

Figure 8: Computation vs network waiting time for MLR, DML, SC, L2-MLR (left to right).

Figure 9: Convergence time versus Q in partial broadcasting for MLR (left) and L2-MLR (right), under BSP.

Partial Broadcasting: On 12 machines, we studied how the partial broadcasting parameter Q, the number of peers to which each machine sends messages, affects the convergence speed of SFB. Figure 9 shows the convergence time of SFB on MLR and L2-MLR versus varying Q, under BSP (SSP results are given in the supplement). First, we observed that SFB under partial broadcasting (PB) with Q < 11 converges to the same objective value as under full broadcasting (FB), where Q = 11. This provides empirical justification that PB preserves the correctness and convergence of the algorithms. Second, we noted that the convergence time of PB is affected by Q. As observed in this figure, a smaller Q incurs a longer convergence time. This is because a smaller Q is more likely to cause the parameter copies on different workers to fall out of synchronization and degrade iteration quality. However, as long as Q is not too small, the convergence speed of PB is comparable with FB. As shown in the figure, for Q ≥ 4, the convergence time of PB is very close to that of FB. This demonstrates that using PB, we can reduce the communication cost from O(P^2) to O(PQ) with a slight sacrifice in convergence speed.

6 RELATED WORKS

A number of system and algorithmic solutions have been proposed to reduce communication cost in distributed ML. On the system side, Dean et al. (2012) proposed to reduce communication overhead by reducing the frequency of parameter/gradient exchanges between workers and the server. Li et al. (2014) used filters to select “important” parameters/updates for transmission, to reduce the number of data entries to be communicated. On the algorithm side, Tsianos et al. (2012) studied the tradeoffs between communication and computation in distributed dual averaging and distributed stochastic dual coordinate ascent. Shamir et al. (2014) proposed an approximate Newton-type method to achieve communication efficiency in distributed optimization. SFB is orthogonal to these existing approaches and can potentially be combined with them to further reduce communication cost.

Peer-to-peer, decentralized architectures have been investigated in other distributed ML frameworks (Li et al., 2015). Our SFB system also adopts such an architecture, but with the specific purpose of supporting the SFB computation model, which is not explored by existing peer-to-peer ML frameworks.

7 CONCLUSIONS

In this paper, we identify the sufficient factor property of a large set of matrix-parametrized models: when these models are optimized with stochastic gradient descent or stochastic dual coordinate ascent, the update matrices are of low rank. Leveraging this property, we propose a sufficient factor broadcasting strategy to efficiently handle the learning of these models with low communication cost. A partial broadcasting scheme is investigated to alleviate the overhead of full broadcasting. We analyze the cost and convergence properties of SFB, whose communication efficiency is demonstrated in empirical evaluations.

Acknowledgements

P.X. and E.X. are supported by ONR N00014140684, DARPA XDATA FA8701220324, and NSF IIS1447676.


References

Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.

Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.

Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: building an efficient and scalable deep learning training system. In USENIX Symposium on Operating Systems Design and Implementation, 2014.

Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. CACM, 2008.

Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, et al. Large scale distributed deep networks. In NIPS, 2012.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. PowerGraph: distributed graph-parallel computation on natural graphs. In USENIX Symposium on Operating Systems Design and Implementation, 2012.

Siddharth Gopal and Yiming Yang. Distributed training of large-scale logistic models. In ICML, 2013.

Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B. Gibbons, Garth A. Gibson, Greg Ganger, and Eric Xing. More effective distributed ML via a stale synchronous parallel parameter server. In NIPS, 2013.

Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, 2008.

Ailsa H. Land and Alison G. Doig. An automatic method of solving discrete programming problems. Econometrica: Journal of the Econometric Society, 1960.

Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 1999.

Hao Li, Asim Kadav, Erik Kruus, and Cristian Ungureanu. MALT: distributed data-parallelism for existing ML applications. In Proceedings of the Tenth European Conference on Computer Systems, 2015.

Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In USENIX Symposium on Operating Systems Design and Implementation, 2014.

Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 1997.

Ioannis Partalas, Aris Kosmopoulos, Nicolas Baskiotis, Thierry Artieres, George Paliouras, Eric Gaussier, Ion Androutsopoulos, Massih-Reza Amini, and Patrick Gallinari. LSHTC: A benchmark for large-scale text classification. arXiv:1503.08581 [cs.IR], 2015.

Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. JMLR, 2013.

Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication efficient distributed optimization using an approximate Newton-type method. In ICML, 2014.

Vikas Sindhwani and Amol Ghoting. Large-scale distributed non-negative sparse coding and sparse dictionary learning. In KDD, 2012.

Konstantinos Tsianos, Sean Lawlor, and Michael G. Rabbat. Communication/computation tradeoffs in consensus-based distributed optimization. In NIPS, 2012.

Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, and Yihong Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.

Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2005.

Ryan Williams. Faster all-pairs shortest paths via circuit complexity. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, 2014.

Eric P. Xing, Michael I. Jordan, Stuart Russell, and Andrew Y. Ng. Distance metric learning with application to clustering with side-information. In NIPS, 2002.

Eric P. Xing, Qirong Ho, Wei Dai, Jin-Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. Petuum: A new platform for distributed machine learning on big data. In KDD, 2015.

Jinhui Yuan, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric Po Xing, Tie-Yan Liu, and Wei-Ying Ma. LightLDA: Big topic models on modest computer clusters. In Proceedings of the 24th International Conference on World Wide Web, 2015.

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX Symposium on Networked Systems Design and Implementation, 2012.


Lighter-Communication Distributed Machine Learning via Sufficient Factor Broadcasting
— Supplementary Material

Pengtao Xie, Jin Kyu Kim, Yi Zhou?, Qirong Ho†, Abhimanu Kumar§, Yaoliang Yu, Eric Xing
School of Computer Science, Carnegie Mellon University; ?Department of EECS, Syracuse University;
†Institute for Infocomm Research, A*STAR, Singapore; §Groupon

1 Proof of Convergence

Proof of Theorem 1

Proof. Let $\mathcal{F}_c := \sigma\{I_p^\tau : p = 1, \dots, P,\ \tau = 1, \dots, c\}$ be the filtration generated by the random samplings $I_p^\tau$ up to iteration counter $c$, i.e., the information up to iteration $c$. Note that for all $p$ and $c$, $W_p^c$ and $W^c$ are $\mathcal{F}_{c-1}$-measurable (since $\tau_p^q(c) \le c-1$ by assumption), and $I_p^c$ is independent of $\mathcal{F}_{c-1}$. Recall that the partial update generated by machine $p$ at its $c$-th iteration is
$$U_p(W_p^c, I_p^c) = -\frac{\eta_c}{|S_p|} \sum_{j \in I_p^c} \nabla f_j(W_p^c).$$
Then it holds that
$$U_p(W_p^c) = \mathbb{E}\big[U_p(W_p^c, I_p^c) \mid \mathcal{F}_{c-1}\big] = -\eta_c \nabla F_p(W_p^c).$$
(Note that we have suppressed the dependence of $U_p$ on the iteration counter $c$.) Then, we have
$$\mathbb{E}\Big[\sum_{p=1}^P U_p(W_p^c, I_p^c) \,\Big|\, \mathcal{F}_{c-1}\Big] = \sum_{p=1}^P \mathbb{E}\big[U_p(W_p^c, I_p^c) \mid \mathcal{F}_{c-1}\big] = \sum_{p=1}^P U_p(W_p^c). \quad (1)$$
Similarly, we have
$$\mathbb{E}\Big[\Big\|\sum_{p=1}^P U_p(W_p^c, I_p^c)\Big\|_2^2 \,\Big|\, \mathcal{F}_{c-1}\Big] = \sum_{p,q=1}^P \mathbb{E}\big[\langle U_p(W_p^c, I_p^c), U_q(W_q^c, I_q^c)\rangle \mid \mathcal{F}_{c-1}\big] = \sum_{p,q=1}^P \langle U_p(W_p^c), U_q(W_q^c)\rangle + \sum_{p=1}^P \mathbb{E}\big[\|U_p(W_p^c, I_p^c) - U_p(W_p^c)\|_2^2 \mid \mathcal{F}_{c-1}\big]. \quad (2)$$
The variance term in the above equality can be bounded as
$$\sum_{p=1}^P \mathbb{E}\big[\|U_p(W_p^c, I_p^c) - U_p(W_p^c)\|_2^2 \mid \mathcal{F}_{c-1}\big] = \eta_c^2 \sum_{p=1}^P \underbrace{\mathbb{E}\Big[\Big\|\frac{1}{|S_p|}\sum_{j \in I_p^c} \nabla f_j(W_p^c) - \nabla F_p(W_p^c)\Big\|_2^2 \,\Big|\, \mathcal{F}_{c-1}\Big]}_{\le\, \sigma^2} \le \eta_c^2 \sigma^2 P. \quad (3)$$
Now, using the update rule $W^{c+1} = W^c + \sum_{p=1}^P U_p(W_p^c, I_p^c)$ and the descent lemma [1], we have
$$F(W^{c+1}) - F(W^c) \le \langle W^{c+1} - W^c, \nabla F(W^c)\rangle + \frac{L_F}{2}\|W^{c+1} - W^c\|_2^2 = \Big\langle \sum_{p=1}^P U_p(W_p^c, I_p^c), \nabla F(W^c)\Big\rangle + \frac{L_F}{2}\Big\|\sum_{p=1}^P U_p(W_p^c, I_p^c)\Big\|_2^2. \quad (4)$$


Taking expectation on both sides, we obtain
$$\mathbb{E}\big[F(W^{c+1}) - F(W^c) \mid \mathcal{F}_{c-1}\big] \le \Big\langle \sum_{p=1}^P U_p(W_p^c), \nabla F(W^c)\Big\rangle + \frac{L_F \eta_c^2 \sigma^2 P}{2} + \frac{L_F}{2}\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|_2^2$$
$$= \Big(\frac{L_F}{2} - \eta_c^{-1}\Big)\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|_2^2 + \frac{L_F \eta_c^2 \sigma^2 P}{2} - \eta_c^{-1}\Big\langle \sum_{p=1}^P U_p(W_p^c), \sum_{p=1}^P \big[U_p(W^c) - U_p(W_p^c)\big]\Big\rangle$$
$$\le \Big(\frac{L_F}{2} - \eta_c^{-1}\Big)\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|_2^2 + \frac{L_F \eta_c^2 \sigma^2 P}{2} + \Big\|\sum_{p=1}^P U_p(W_p^c)\Big\| \sum_{p=1}^P L_p \|W^c - W_p^c\|. \quad (5)$$
Now, taking expectation with respect to all random variables, we obtain
$$\mathbb{E}\big[F(W^{c+1}) - F(W^c)\big] \le \Big(\frac{L_F}{2} - \eta_c^{-1}\Big)\mathbb{E}\Big[\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|_2^2\Big] + \sum_{p=1}^P L_p\, \mathbb{E}\Big[\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|\, \|W^c - W_p^c\|\Big] + \frac{L_F \eta_c^2 \sigma^2 P}{2}. \quad (6)$$
Next we proceed to bound the term $\mathbb{E}\big[\|\sum_{p=1}^P U_p(W_p^c)\|\, \|W^c - W_p^c\|\big]$. We list the auxiliary update rule and the local update rule here for convenience:
$$W^c = W^0 + \sum_{q=1}^P \sum_{t=0}^{c-1} U_q(W_q^t, I_q^t), \qquad W_p^c = W^0 + \sum_{q=1}^P \sum_{t=0}^{\tau_p^q(c)} U_q(W_q^t, I_q^t). \quad (7)$$
Subtracting the two and using the bounded delay assumption $0 \le (c-1) - \tau_p^q(c) \le s$, we obtain
$$\|W^c - W_p^c\| = \Big\|\sum_{q=1}^P \sum_{t=\tau_p^q(c)+1}^{c-1} U_q(W_q^t, I_q^t)\Big\| \le \Big\|\sum_{q=1}^P \sum_{t=c-s}^{c-1} U_q(W_q^t, I_q^t)\Big\| + \Big\|\sum_{q=1}^P \sum_{t=c-s}^{\tau_p^q(c)} U_q(W_q^t, I_q^t)\Big\| \le \sum_{t=c-s}^{c-1} \Big\|\sum_{q=1}^P U_q(W_q^t, I_q^t)\Big\| + \eta_{c-s} G, \quad (8)$$
where the last inequality follows from the facts that $\eta_c$ is strictly decreasing and that $\big\|\sum_{q=1}^P \sum_{t=c-s}^{\tau_p^q(c)} \frac{1}{|S_q|}\sum_{j \in I_q^t} \nabla f_j(W_q^t)\big\|$ is bounded by some constant $G$, since $\nabla F_q$ is continuous and all the sequences $W_p^c$ are bounded. Thus, by taking expectation, we obtain
$$\mathbb{E}\Big[\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|\, \|W^c - W_p^c\|\Big] \le \mathbb{E}\Big[\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|\Big(\sum_{t=c-s}^{c-1} \Big\|\sum_{q=1}^P U_q(W_q^t, I_q^t)\Big\| + \eta_{c-s} G\Big)\Big]$$
$$= \sum_{t=c-s}^{c-1} \mathbb{E}\Big[\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|\, \Big\|\sum_{q=1}^P U_q(W_q^t, I_q^t)\Big\|\Big] + \eta_{c-s} G\, \mathbb{E}\Big[\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|\Big]$$
$$\le \sum_{t=c-s}^{c-1} \mathbb{E}\Big[\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|_2^2 + \Big\|\sum_{q=1}^P U_q(W_q^t, I_q^t)\Big\|_2^2\Big] + \mathbb{E}\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|_2^2 + \eta_{c-s}^2 G^2$$
$$\le (s+1)\, \mathbb{E}\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|_2^2 + \eta_{c-s}^2 G^2 + \sum_{t=c-s}^{c-1} \Big[\mathbb{E}\Big\|\sum_{q=1}^P U_q(W_q^t)\Big\|_2^2 + \eta_t^2 \sigma^2 P\Big]. \quad (9)$$
Now, plugging this into the previous result (6) and writing $L := \sum_{p=1}^P L_p$, we get
$$\mathbb{E}F(W^{c+1}) - \mathbb{E}F(W^c) \le \Big(\frac{L_F}{2} - \eta_c^{-1}\Big)\mathbb{E}\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|_2^2 + (s+1)L\, \mathbb{E}\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|_2^2 + \eta_{c-s}^2 G^2 L + \sum_{t=c-s}^{c-1} \Big[L\, \mathbb{E}\Big\|\sum_{p=1}^P U_p(W_p^t)\Big\|_2^2 + \eta_t^2 L \sigma^2 P\Big] + \frac{L_F \eta_c^2 \sigma^2 P}{2}$$
$$= \Big(\frac{L_F}{2} + (s+1)L - \eta_c^{-1}\Big)\mathbb{E}\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|_2^2 + \eta_{c-s}^2 G^2 L + \sum_{t=c-s}^{c-1} \Big[L\, \mathbb{E}\Big\|\sum_{p=1}^P U_p(W_p^t)\Big\|_2^2 + \eta_t^2 L \sigma^2 P\Big] + \frac{L_F \eta_c^2 \sigma^2 P}{2}. \quad (10)$$
Summing both sides over $c = 0, \dots, C$:
$$\mathbb{E}F(W^{C+1}) - \mathbb{E}F(W^0) \le \sum_{c=0}^C \Big[\Big(\frac{L_F}{2} + (2s+1)L - \eta_c^{-1}\Big)\mathbb{E}\Big\|\sum_{p=1}^P U_p(W_p^c)\Big\|_2^2\Big] + \Big(L\sigma^2 P s + \frac{L_F \sigma^2 P}{2}\Big)\sum_{c=0}^C \eta_c^2 + G^2 L \sum_{c=0}^C \eta_{c-s}^2. \quad (11)$$


After rearranging terms, we finally obtain
$$\sum_{c=0}^C \Big[\eta_c^2\Big(\eta_c^{-1} - \frac{L_F}{2} - 2(s+1)L\Big)\mathbb{E}\Big\|\sum_{p=1}^P \nabla F_p(W_p^c)\Big\|_2^2\Big] \le \mathbb{E}F(W^0) - \mathbb{E}F(W^{C+1}) + \Big(L\sigma^2 P s + \frac{L_F \sigma^2 P}{2}\Big)\sum_{c=0}^C \eta_c^2 + G^2 L \sum_{c=0}^C \eta_{c-s}^2 \le F(W^0) - \inf F + \Big(L\sigma^2 P s + L G^2 + \frac{L_F \sigma^2 P}{2}\Big)\sum_{c=0}^C \eta_c^2. \quad (12)$$
It then follows that
$$\min_{c=1,\dots,C} \mathbb{E}\Big[\Big\|\sum_{p=1}^P \nabla F_p(W_p^c)\Big\|_2^2\Big] \le \frac{F(W^0) - \inf F + \big(L\sigma^2 P s + L G^2 + \frac{L_F \sigma^2 P}{2}\big)\sum_{c=0}^C \eta_c^2}{\sum_{c=0}^C \big[\eta_c - \big(\frac{L_F}{2} + 2L(s+1)\big)\eta_c^2\big]} \approx \frac{F(W^0) - \inf F + \big(L\sigma^2 P s + L G^2 + \frac{L_F \sigma^2 P}{2}\big)\sum_{c=0}^C \eta_c^2}{\sum_{c=0}^C \eta_c},$$
where we ignore the higher-order term $\big(\frac{L_F}{2} + 2L(s+1)\big)\eta_c^2$ in the last step for simplicity; this does not affect the order of the final estimate since we use a diminishing stepsize $\eta_c = O(1/\sqrt{c})$. Now we can apply [2, Theorem 4.2] to the last expression to conclude that
$$\min_{c=1,\dots,C} \mathbb{E}\Big[\Big\|\sum_{p=1}^P \nabla F_p(W_p^c)\Big\|_2^2\Big] \le \sqrt{\frac{\big(F(W^0) - \inf F\big)\big[2L(\sigma^2 P s + G^2) + L_F \sigma^2 P\big]}{2C}}$$
with the choice of stepsize
$$\eta_c = \sqrt{\frac{8\big(F(W^0) - \inf F\big)}{2L(\sigma^2 P s + G^2) + L_F \sigma^2 P}}\; \frac{1}{\sqrt{c}}.$$
Hence, we must have
$$\liminf_{c \to \infty} \mathbb{E}\Big\|\sum_{p=1}^P \nabla F_p(W_p^c)\Big\| = 0, \quad (13)$$
proving the first claim.

On the other hand, the bound on $\|W^c - W_p^c\|$ in (8) gives
$$\|W^c - W_p^c\| \le \sum_{t=c-s}^{c-1} \eta_t \Big\|\sum_{q=1}^P \frac{1}{|S_q|}\sum_{j \in I_q^t} \nabla f_j(W_q^t)\Big\| + \eta_{c-s} G. \quad (14)$$
By assumption the sequences $\{W_p^c\}_{p,c}$ and $\{W^c\}_c$ are bounded and the gradient of $f_j$ is continuous, hence $\nabla f_j(W_q^t)$ is bounded. Taking $c \to \infty$ in the above inequality and noting that $\lim_{c \to \infty} \eta_c = 0$, we have $\lim_{c \to \infty} \|W^c - W_p^c\| = 0$ almost surely, proving the second claim.

Lastly, the Lipschitz continuity of $\nabla F_p$, combined with the second claim, further implies
$$0 = \liminf_{c \to \infty} \mathbb{E}\Big\|\sum_{p=1}^P \nabla F_p(W_p^c)\Big\| \ge \liminf_{c \to \infty} \mathbb{E}\Big\|\sum_{p=1}^P \nabla F_p(W^c)\Big\| = \liminf_{c \to \infty} \mathbb{E}\|\nabla F(W^c)\| = 0. \quad (15)$$
Thus there exists a common limit point of $W^c, W_p^c$ that is a stationary point almost surely. The proof is now complete.

2 Sample Code for Sparse Coding

Figure 1 shows the sample code of implementing sparse coding in SFB. D is the feature dimensionality of the data and J is the dictionary size. Users need to write an SF computation function to specify how to compute the sufficient factors: for each data sample xi, we first compute its sparse code a based on the dictionary B stored in the parameter matrix sc.para_mat. Given a, the sufficient factor u can be computed as Ba − xi, and the sufficient factor v is simply a. In addition, users need to provide a proximal operator function to specify how to project B onto the ℓ2-ball constraint set.
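For concreteness, below is a minimal NumPy sketch of this per-sample SF computation and the local rank-1 reconstruction, assuming a D x J dictionary B and a placeholder sparse_code solver; the function names are illustrative, not the actual SFB API:

    import numpy as np

    def compute_sf(B, x, sparse_code):
        # One SF pair per sample: u = B a - x (reconstruction residual), v = a (sparse code).
        a = sparse_code(B, x)
        return B @ a - x, a

    def apply_sf(B, u, v, lr=0.01):
        # Locally reconstruct the rank-1 update u v^T and take a gradient step.
        return B - lr * np.outer(u, v)

    def prox_l2_columns(B):
        # Project every dictionary column onto the unit l2 ball.
        norms = np.maximum(np.linalg.norm(B, axis=0), 1.0)
        return B / norms

Broadcasting the pair (u, v) instead of the dense D x J update matrix is what reduces the per-sample communication payload from O(DJ) to O(D + J).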

3 Implementation Details


Figure 2: Implementation details on each worker in SFB.

    sfb_app sc(int D, int J, int staleness)

    // SF computation function
    function compute_sf(sfb_app sc):
        while (!converge):
            X = sample_minibatch()
            foreach xi in X:
                // compute sparse code
                a = compute_sparse_code(sc.para_mat, xi)
                // sufficient factor u_i
                sc.sf_list[i].write_u(sc.para_mat * a - xi)
                // sufficient factor v_i
                sc.sf_list[i].write_v(a)
            commit()

    // Proximal operator function
    function prox(sfb_app sc):
        foreach column bi in sc.para_mat:
            if ||bi||_2 > 1:
                bi = bi / ||bi||_2

Figure 1: Sample code of sparse coding in SFB.

Figure 2 shows the implementation details on each worker in SFB. Each worker maintains three threads: an SF computing thread, a parameter update thread and a communication thread. Each worker holds a local copy of the parameter matrix and a partition of the training data. It also maintains an input SF queue, which stores the sufficient factors computed locally and received remotely, and an output SF queue, which stores SFs to be sent to other workers. In each iteration, the SF computing thread checks the consistency policy detailed in the main paper. If permitted, this thread randomly chooses a minibatch of samples from the training data, computes the SFs and pushes them to the input and output SF queues. The parameter update thread fetches SFs from the input SF queue and uses them to update the parameter matrix. In proximal-SGD/SDCA, the proximal/dual operator function (provided by the user) is automatically called by this thread as a function pointer. The communication thread receives SFs from other workers and pushes them into the input SF queue, and broadcasts SFs in the output SF queue to other workers. One worker is in charge of measuring the objective value; once the algorithm converges, this worker notifies all other workers to terminate the job. We implemented SFB in C++. OpenMPI was used for communication between workers and OpenMP was used for multicore parallelization within each machine.
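To make this data flow concrete, here is a highly simplified, single-process Python sketch of the three threads and two queues described above. The queues and placeholder update rules stand in for the real OpenMPI-based communication and the user-supplied SF/proximal functions; this illustrates the structure only, not the actual C++ implementation:

    import threading
    import queue
    import numpy as np

    D, J, ITERS = 20, 10, 200
    W = np.zeros((D, J))                   # local copy of the parameter matrix
    data = np.random.randn(ITERS, D)       # this worker's partition of the training data
    input_q, output_q = queue.Queue(), queue.Queue()
    done = threading.Event()

    def sf_computing_thread():
        # Compute one (u, v) pair per sample; push to both queues.
        for x in data:
            a = 0.1 * np.random.randn(J)   # placeholder for the model-specific SF rule
            u, v = W @ a - x, a
            input_q.put((u, v))
            output_q.put((u, v))
        done.set()

    def parameter_update_thread():
        # Reconstruct rank-1 updates from SFs and apply them to the local matrix.
        while not (done.is_set() and input_q.empty()):
            try:
                u, v = input_q.get(timeout=0.1)
            except queue.Empty:
                continue
            W[:] -= 0.01 * np.outer(u, v)

    def communication_thread():
        # In the real system these SFs would be broadcast to peer workers via MPI;
        # here we simply drain the queue.
        while not (done.is_set() and output_q.empty()):
            try:
                output_q.get(timeout=0.1)
            except queue.Empty:
                continue

    threads = [threading.Thread(target=f) for f in
               (sf_computing_thread, parameter_update_thread, communication_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

The real implementation additionally enforces the staleness-based consistency policy before each minibatch and invokes the user-provided proximal/dual operator after each update.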

The decentralized architecture of SFB makes it robust to machine failures: if one worker fails, the rest of the workers can continue to compute and broadcast the sufficient factors among themselves. In addition, SFB possesses high elasticity [3]: new workers can be added and existing workers can be taken offline without restarting the running framework. A thorough study of fault tolerance and elasticity is left for future work.


Figure 3: Convergence time versus model size for MLR, DML, SC, L2-MLR (left to right), under SSP with staleness=20.

Figure 4: SFB scalability with varying machines, for MLR, DML, SC, L2-MLR (left to right), under SSP with staleness=20.

4 Additional Experiment Results

Figure 3 shows the convergence time versus model size for MLR, DML, SC and L2-MLR, under SSP with staleness=20. Figure 4 shows SFB scalability with varying numbers of machines under SSP with staleness=20, for MLR, DML, SC and L2-MLR. Figure 5 shows the iteration throughput (left) and iteration quality (right) for MLR under SSP (staleness=20); the minibatch size was set to 100 for both SFB and FMS. As can be seen from the right graph, SFB has a slightly worse iteration quality than FMS. We conjecture that this is because the centralized architecture of FMS is more robust and stable than the decentralized architecture of SFB. On the other hand, the iteration throughput of SFB is much higher than that of FMS, as shown in the left graph. Figure 6 shows the convergence time of MLR and L2-MLR versus varying Q in partial broadcasting, under SSP (staleness=20). Figure 7 shows the communication volume of the four models under BSP. As shown in the figure, the communication volume of SFB is significantly lower than that of FMS and Spark. Under the BSP consistency model, SFB and FMS share the same iteration quality, hence need the same number of iterations to converge. Within each iteration, SFB communicates vectors while FMS transmits matrices. As a result, the communication volume of SFB is much lower than that of FMS.
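To make the vectors-versus-matrices gap concrete, the following back-of-the-envelope sketch (with illustrative sizes chosen for exposition, not the exact experimental configuration) compares the per-minibatch payloads of sending a dense update matrix versus sending sufficient factor pairs:

    # Rough per-minibatch communication comparison (illustrative sizes only).
    J, D = 325_000, 20_000          # e.g. classes x feature dimension for a large MLR model
    minibatch = 100                 # samples per minibatch
    bytes_per_float = 4

    full_matrix = J * D * bytes_per_float               # one dense J x D update matrix
    sf_pairs = minibatch * (J + D) * bytes_per_float    # one (u, v) pair per sample

    print(f"dense update matrix: {full_matrix / 1e9:.1f} GB")   # ~26.0 GB
    print(f"SF pairs:            {sf_pairs / 1e6:.1f} MB")      # ~138.0 MB

Even after multiplying by the number of peers each worker broadcasts to, the SF payload remains far smaller than a dense update matrix of the same model size.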

References

[1] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.

[2] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

[3] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In USENIX Symposium on Operating Systems Design and Implementation, 2014.


Figure 5: Iteration throughput (left) and iteration quality (right) for MLR under SSP (staleness=20).

Figure 6: Convergence time versus Q in partial broadcasting for MLR (left) and L2-MLR (right), under SSP (staleness=20).

Figure 7: Communication volume for MLR, DML, SC, L2-MLR (left to right) under BSP.
