a communication-efﬁcient collaborative learning …1 introduction one critical challenge for...

A Communication-Efficient Collaborative LearningFramework for Distributed Features

Yang LiuWeBank

[email protected]

Yan KangWeBank

[email protected]

Xinwei ZhangUniversity of [email protected]

Liping LiUniversity of Science and Technology of China

[email protected]

Yong ChengWeBank

[email protected]

Tianjian ChenWeBank

[email protected]

Mingyi HongUniversity of Minnesota

[email protected]

Qiang YangWeBank

[email protected]

Abstract

We introduce a collaborative learning framework allowing multiple parties havingdifferent sets of attributes about the same user jointly build models without exposingtheir raw data or model parameters. In particular, we propose a Federated StochasticBlock Coordinate Descent (FedBCD) algorithm, in which each party conductsmultiple local updates before each communication to effectively reduce the numberof communication rounds among parties, a principal bottleneck for collaborativelearning problems. We analyze theoretically the impact of the number of localupdates, and show that when the batch size, sample size and the local iterationsare selected appropriately, within T iterations, the algorithm performs O(

√T )

communication rounds and achieves some O(1/√T ) accuracy (measured by the

average of the gradient norm squared). The approach is supported by our empiricalevaluations on a variety of tasks and datasets, demonstrating advantages overstochastic gradient descent (SGD) approaches.

1 Introduction

One critical challenge for applying today’s Artificial Intelligence (AI) technologies to real-worldapplications is the common existence of data silos across different organizations. Due to legal,privacy and other practical constraints, data from different organizations cannot be easily integrated.The implementation of user privacy laws such as GDPR [1] has made sharing data among differentorganizations more challenging. Collaborative learning has emerged to be an attractive solution tothe data silo and privacy problem. While distributed learning (DL) frameworks [2] originally aimsat parallelizing computing power and distributes data identically across multiple servers, federatedlearning (FL) [3] focuses on data locality, non-IID distribution and privacy. In most of the existingcollaborative learning frameworks, data are distributed by samples thus share the same set of attributes.However, a different scenario is cross-organizational collaborative learning problems where partiesshare the same users but have different set of features. For example, a local bank and a local retailcompany in the same city may have large overlap in user base and it is beneficial for these partiesto build collaborative learning models with their respective features. FL is further categorized into

2nd International Workshop on Federated Learning for Data Privacy and Confidentiality, in Conjunction withNeurIPS 2019 (FL-NeurIPS 19)

arX

iv:1

912.

1118

7v6

[cs

.LG

] 3

1 Ju

l 202

0

horizontal (sample-partitioned) FL, vertical (feature-partitioned) FL and federated transfer learning(FTL) in [4].

Feature-partitioned collaborative learning problems have been studied in the setting of both DL [5–7]and FL [4, 8–10]. However, existing architectures have not sufficiently addressed the communicationproblem especially in communication-sensitive scenarios where data are geographically distributedand data locality and privacy are of paramount significance (i.e., in a FL setting). In these approaches,per-iteration communication and computations are often required, since the update of algorithmparameters needs contributions from all parties. In addition, to prevent data leakage, privacy-preserving techniques, such as Homomorphic Encryption (HE) [11], Secure Multi-party Computation(SMPC) [12] are typically applied to transmitted data [4, 10, 9], adding expensive communicationoverhead to the architectures. Differential Privacy (DP) is also a commonly-adopted approach, butsuch approaches suffer from precision loss [6, 7]. In sample-partitioned FL[3], it is demonstratedexperimentally that multiple local updates can be performed with federated averaging (FedAvg),reducing the number of communication round effectively. Whether it is feasible to perform suchmultiple local update strategy over distributed features is not clear.

In this paper, we propose a collaborative learning framework for distributed features named Federatedstochastic block coordinate descent (FedBCD), where parties only share a single value per sampleinstead of model parameters or raw data for each communication, and can continuously performlocal model updates (in either a parallel or sequential manner) without per-iteration communication.In the proposed framework, all raw data and model parameters stay local, and each party doesnot learn other parties’ data or model parameters either before or after the training. There is noloss in performance of the collaborative model as compared to the model trained in a centralizedmanner. We demonstrate that the communication cost can be significantly reduced by adoptingFedBCD. Compared with the existing distributed (stochastic) coordinate descent methods [13–16],we show for the first time that when the number of local updates, mini-batch size and learning ratesare selected appropriately, the FedBCD converges to a O(1/

√T ) accuracy with O(

√T ) rounds of

communications despite performing multiple local updates using staled information. We then performcomprehensive evaluation of FedBCD against several alternative protocols, including a sequentiallocal update strategy, FedSeq with both the original Stochastic Gradient Descent (SGD) and Proximalgradient descent. For evaluation, we apply the proposed algorithms to multiple complex modelsinclude logistic regression and convolutional neural networks, and on multiple datasets, including theprivacy-sensitive medical dataset MIMIC-III [17], MNIST [18] and NUS-WIDE [19] dataset. Finally,we implemented the algorithm for federated transfer learning (FTL) [10] to tackle problems with fewlabeled data and insufficient user overlaps.

2 Related Work

Traditional distributed learning adopts a parameter server architecture [2] to enable a large amountof computation nodes to train a shared model by aggregating locally-computed updates. The issueof privacy in DL framework is considered in [20]. FL [3] adopted a FedAvg algorithm whichruns Stochastic Gradient Descent (SGD) for multiple local updates in parallel to achieve bettercommunication efficiency. The authors of [21] studied the FedAvg algorithm under the parallelrestarted SGD framework and analyzed the convergence rate and communication savings under IIDsettings. In [22], the convergence of the FedAvg algorithm under non-IID settings was investigated.All the work above consider the sample-partitioned scenario.

Feature-partitioned learning architectures have been developed for models including trees [9], linearand logistic regression [4, 8, 5, 7], and neural networks [10, 6]. Distributed Coordinate Descent[13] used balanced partitions and decoupled computation at the individual coordinate in each parti-tion; Distributed Block Coordinate Descent [14] assumes feature partitioning is given and performssynchronous block updates of variables which is suited for MapReduce systems with high communi-cation cost settings. These approaches require synchronization at every iteration. Asynchronous BCD[15] and Asynchronous ADMM algorithms [16] tries to tame various kinds of asynchronicity usingstrategies such as small stepsize, and careful update scheduling, and the design objective is to ensurethat the algorithm can still behave reasonably under non-ideal computing environment. Our approachtries to address the expensive communication overhead problem in FL scenario by systematicallyadopting BCD with sufficient number of local updates guided by theoretical convergence guarantees.

2

3 Problem Definition

Suppose K data parties collaboratively train a machine learning model based on N data samples{xi, yi}Ni=1 and the feature vector xi ∈ R1×d are distributed among K parties {xki ∈ R1×dk}Kk=1,where dk is the feature dimension of party k. Without loss of generality, we assume one partyholds the labels and it is party K. Let us denote the data set as Dki , {xki }, for k ∈ [K − 1],DKi , {xKi , yKi }, and Di , {Dki }Kk=1 (where [K − 1] denotes the set {1, · · · ,K − 1}). Then thecollaborative training problem can be formulated as

minΘL(Θ;D) ,

1

N

N∑i=1

f(θ1, . . . , θK ;Di) + λ

K∑k=1

γ(θk) (1)

where θk ∈ Rdk denotes the training parameters of the kth party; Θ = [θ1; . . . ; θK ]; f(·) and γ(·)denotes the loss function and regularizer and λ is the hyperparatemer; For a wide range of modelssuch as linear and logistic regression, and support vector machines, the loss function has the followingform:

f(θ1, . . . , θK ;Di) = f(

K∑k=1

xki θk, yKi ) (2)

The objective is for each party k to find its θk without sharing its data Dki or parameter θk to otherparties.

4 The FedSGD Approach

If a mini-batch S ⊂ [N ] of data is sampled, the stochastic partial gradient w.r.t. θk is given by

gk(Θ;S) , ∇kf(Θ;S) + λ∇γ(θk). (3)

Let Hki = xki θk and Hi =

∑Kk=1H

ki , then for the loss function in equation (2), we have

∇kf(Θ;S) =1

NS

∑i∈S

∂f(Hi, yKi )

∂Hi(xki )T (4)

,1

NS

∑i∈S

g(Hi, yKi )(xki )T (5)

Where NS is the number of samples in S. To compute∇kf(Θ;S) locally, each party k ∈ [K − 1]sends Hk = {Hk

i }i∈S to party K, who then calculates HK = {g(Hi, yKi )}i∈S and sends to other

parties, and finally all parties can compute gradient updates with equation (4). See Algorithm 1.

For an arbitrary loss function, let us define the collection of information needed to compute∇kf(Θ;S) as

HS−k := {H k

q (θq ,Sq)}q 6=k . (6)

where H kq (·) is a function summarizing the information required from party q to k, the stochastic

gradients (3) can be computed as the following:

gk(Θ;S) = ∇kf(HS−k, θk;S) + λ∇γ(θk)

, gk(H−k, θk;S). (7)

Therefore, the overall stochastic gradient is given as

g(Θ;S) , [g1(H−1, θ1;S); · · · ; gK(H−K , θK ;S)]. (8)

A direct approach to optimize (1) is to use the vanilla stochastic gradient descent (SGD) algorithmgiven below

θk ← θk − ηgk(H−k, θk;S), ∀ k. (9)

3

Algorithm 1 FedSGD implemented on K partiesInput: learning rate ηOutput: Model parameters θ1, θ2...θKParty 1,2..K initialize θ1, θ2, ...θK .for each iteration j=1,2... do

Randomly sample S ⊂ [N ]Exchange({1, 2...K});for each party k=1,2..K in parallel do

k computes gk with equation (7) and update θj+1k = θjk − ηgk;

endendExchange(U ): # U is the set of party IDsif equation (2) hold then

for each party k ∈ U and k 6= K in parallel dok computes Hk,and send Hk to party K;

endparty K computes HK and sends to all other parties in U ;

endelse

for each party k ∈ U in parallel dok computes H1

k ...HKk ,and send Hq

k to party q;

endend

The federated implementation of the above SGD iteration is given in Algorithm 1. Algorithm 1requires communication of intermediate results at every iteration. This could be very inefficient,especially when K is large or the task is communication heavy. For a task of form in equation(2), the number of communications per round is 2(K − 1), but for an arbitrary task, the number ofcommunications per round can be K2−K if it requires pair-wise communication. We note that sinceAlgorithm 1 has the same iteration as the vanilla SGD algorithm, it converges with a rate of O( 1√

T),

regardless of the choice of K [23]. Since each iteration requires one round of communication amongall the parties, T rounds of communication is required to achieve an error of O( 1√

T).

5 The Proposed FedBCD Algorithms

In the parallel version of our proposed algorithm, called FedBCD-p, at each iteration, each partyk ∈ [K] performs Q > 1 consecutive local gradient updates in parallel, before communicating theintermediate results among each other; see Algorithm 2. Such “multi-local-step" strategy is stronglymotivated by our practical implementation (to be shown in our Experiments Section), where we foundthat performing multiple local steps can significantly reduce overall communication cost. Further,such a strategy also resembles the FedAvg algorithm [3], where each party performs multiple localsteps before aggregation.

At each iteration the kth feature is updated using the direction (where S is a mini-batch of data points)

dk = −ηgk(H−k, θk;S). (10)

Because H−k is the intermediate information obtained from the most recent synchronization, it maycontain staled information so it may no longer be an unbiased estimate of the true partial gradient∇kL(Θ). On the other hand, during the Q local updates no inter-party communication is required.Therefore, one could expect that there will be some interesting tradeoff between communicationefficiency and computational efficiency. In the same spirit, a sequential version of the algorithmallows the parties to update their local θk’s sequentially, while each update consists of Q local updateswithout inter-party communication, termed FedBCD-s. (Algorithm 3).

4

Algorithm 2 FedBCD-p: Parallel Federated Stochastic Block Coordinate DescentInput: learning rate ηOutput: Model parameters θ1, θ2...θKParty 1,2..K initialize θ1, θ2, ...θK .for each outer iteration t=1,2... do

Randomly sample a mini-batch S ⊂ D;Exchange ({1, 2...K});for each party k ∈ [N ], in parallel do

for each local iteration j = 1, 2..., Q dok computes gk(H−k, θk;S) using (7);Update θk ← θk − ηgk(H−k, θk;S);

endend

end

Algorithm 3 Sequential FedBCD: Sequential Federated Stochastic Block Coordinate DescentInput: learning rate ηOutput: Model parameters θ1, θ2...θKParty 1,2..K initialize θ1, θ2, ...θK .Exchange({1, 2...K});for each iteration i=1,2... do

Randomly sample a mini-batch S ⊂ D;for each party k=1,2..K sequentially do

for each local iteration j=1,2...,Q dok computes gk using (7) and update θk ← θk − ηgk(H−k, θk;S);

endExchange({k,K});

endend

6 Security Analysis

Here we aim to find out whether one party can learn other party’s data from collections of messagesexchanged (Hk

q ) during training. Whereas previous research studied data leakage from exposingcomplete set of model parameters or gradients [24–26], in our protocol model parameters are keptprivate, and only the intermediate results (such as inner product of model parameters and feature) areexposed.

Security Definition Let Sj be the set of data point sampled at the jth iteration and ij denotes theith sample of the jth iteration. Hk

ijis the contribution of the ith sample to other parties. At the

(j + 1)th iteration, we update weight variables according to equation (4)

θj+1k = θjk − ηj(

1

NSj

∑ij∈Sj

g(Hij , yKij )(xkij )T + λθjk) (11)

The security definition is that for any party k with undisclosed dataset Dk and training parametersθk following FedBCD, there exists infinite solutions for {xkij}i∈Sj,j=0,1··· ,M that yield the same setof contributions {Hk

ij}j=0,1··· ,M . That is, one can not determine party k’s data uniquely from its

exchanged messages of {Hkj }j=0,1··· ,M regardless the value of M , the total number of iterations.

Such a security definition is inline with prior security definitions proposed in privacy-preservingmachine learning and secure multiple computation (SMC), such as [27–29, 4, 30]. Under thissecurity definition, when some prior knowledge about the data is known, an adversary may be ableto eliminate some alternative solutions or certain derived statistical information may be revealed

5

(a) (b)

Figure 1: Illustration of a 2-party collaborative learning framework (a) with neural network(NN)-based local model. (b) FedBCD-s and FedBCD-p algorithms

[28, 29] but it is still impossible to infer the exact raw data ("deep leakage"). However, this practicaland heuristic security model provides flexible tradeoff between privacy and efficiency and allowsmuch more efficient solutions. Our problem is particularly challenging in that the observations byother parties are serial outputs from FedBCD algorithm and are all correlated based on equation (11).Although it is easy to show security to send one round of Hk

ijdue to protection from the reduced

dimensionality, it is unclear whether raw data will be leaked after thousands or millions of rounds ofiterative communications.

Theorem 1 For K-party collaborative learning framework following (2) with K ≥ 2, the FedBCDAlgorithm is secured for party k if k’s feature dimension is larger than 2, i.e., dk ≥ 2.

The security proof can be readily extended to collaborative systems where parties have arbitrary localsub-models (such as neural networks) but connect at the final prediction layer with loss function (2)(see Figure 1(a)). Let Gk be the local transformation on xki and is unknown to parties other than k.We choose Gki to be the identity map, i.e. Gki (xki ) = (xki ), then the problem reduces to Theorem 1.

7 Convergence Analysis

In this section, we perform convergence analysis of the FedBCD algorithm. Our analysis will befocused on the parallel version Algorithm 2 and the sequential version can be analyzed use similartechniques.

To facilitate the proof, we define the following notations: Let r denote the iteration index, in whicheach iteration one round of local update is performed; Let r0 denote the latest iteration before r inwhich synchronization has been performed, and the intermediate information Hk’s are exchanged.Let yrk denote the “local vector" that node k uses to compute its local gradient at iteration r, that is

gk(yrk;S) = gk([Θr0−k, θ

rk];S]) (12)

where [v−k, w] denotes a vector v with its kth element replaced by w. Note that by Algorithm 2,each node k always updates the kth element of yrk, while the information about Θr0

−k is obtained bythe most recent synchronization step. Further, we use the “global" variable Θr to collect the mostupdated parameters at each iteration of each node, where yk,j denotes the jth element of yk:

Θr = [θr1; . . . ; θrK ] , [yr1,1; . . . ; yrK,K ].

Note that {Θr} is only a sequence of “virtual" variables, it is never explicitly formed in the algorithm.

Assumptions

A1: Lipschitz Gradient. Assume that the loss function satisfies the following:

‖∇L(Θ1)−∇L(Θ2)‖ ≤ L‖Θ1 −Θ2‖, ∀Θ1,Θ2

‖∇kL(Θ1)−∇kL(Θ2)‖ ≤ Lk‖Θ1 −Θ2‖, ∀Θ1,Θ2.

A2: Uniform Sampling. For simplicity, assume that the data sample is partitioned into B mini-batches S1, · · · ,SB , each with size S; at a given iteration, S is sampled uniformly from thesemini-batches.

Our main result is shown below. The detailed analysis is relegated to the Supplemental Material.

6

Theorem 2 Under assumption A1, A2, when the step size in Algorithm 2 satisfies 0 < η ≤min{

√2

2Q√∑K

j=1 L2j+3L2

k

, 1L}, then for all T ≥ 1, we have the following bound:

1

T

T−1∑r=0

E[‖∇L(Θr)‖2] ≤ 2

ηT(L(Θ(0))− L(Θ?)) (13)

+ 2η2(K + 3)Q2K∑k=1

L2k

σ2

S+ 2K

σ2

S. (14)

where L(Θ?) denotes the global minimum of problem (1).

Remark 1. It is non-trivial to find an unbiased estimator for the local stochastic gradient gk(yrk;S).This is because after each synchronization step, each agent k performs Q deterministic steps basedon the same data set S, while fixing all the rest of the variable blocks at Θr0

−k. This is significantlydifferent from FedAvg-type algorithms, where at each iteration a new mini-batch is sampled at eachnode.

Remark 2. If we pick η = 1√T

, and S = Q =√T , then with any fixed K the convergence speed of

the algorithm is O( 1√T

) (in terms of the speed of shrinking the averaged gradient norm squared).

To the best of our knowledge, it is the first time that such an O(1/√T ) rate has been proven for

any algorithms with multiple local steps designed for the feature-partitioned collaboratively learningproblem.

Remark 3. Compared with the existing distributed stochastic coordinate descent methods [13–16],our results are different. It shows that, despite using stochastic gradients and performing multiplelocal updates using staled information, only O(

√T ) communication rounds are requires (out of

total T iterations) to achieves O(1/√T ) rate. We are not aware of any of such guarantees for other

distributed coordinate descent methods.

Remark 4. If we consider the impact of the number of nodes K and pick η = 1√KT

, and S = Q =√TK, then the convergence speed of the algorithm is O(

√K√T

). This indicates that the proposedalgorithm has a slow down w.r.t the number of parties involved. However, in practice, this factoris mild assuming that the total number of parties involved in a feature-partitioned collaborativelylearning problem is usually not large.

8 Experiments

Datasets and Models

MIMIC-III. MIMIC-III (Medical Information Mart for Intensive Care) [17] is a large databasecomprising information related to patients admitted to critical care units at a large tertiary carehospital. Data includes vital signs, medications, survival data, and more. Following the dataprocessing procedures of [31], we compile a subset of the MIMIC-III database containing more than31 million clinical events that correspond to 17 clinical variables and get the final training and testsets of 17,903 and 3,236 ICU stays, respectively. For each variable we compute six different samplestatistic features on seven different subsequences of a given time series, obtaining 17× 7× 6 = 714features. We focus on the in-hospital mortality prediction task based on the first 48 hours of an ICUstay with area under the receiver operating characteristic (AUC-ROC) being the main metric. Wepartition each sample vertically by its clinical features. In a practical situation, clinical variables maycome from different hospitals or different departments in the same hospital and can not share theirdata due to the patients personal privacy. This task is refered to as MIMIC-LR.

MNIST. We partition each MNIST [18] image with shape 28 × 28 × 1 vertically into two parts(each part has shape 28× 14× 1). Each party uses a local CNN model to learn feature representationfrom raw image input. The local CNN model for each party consists of two 3x3 convolution layersthat each has 64 channels and are followed by a fully connected layer with 256 units. Then, thetwo feature representations produced by the two local CNN models respectively are fed into thelogistic regression model with 512 parameters for a binary classification task. We refer this task asMNIST-CNN.

7

0 500 1000 1500 2000 2500 3000 3500 4000communication rounds

0.80

0.81

0.82

0.83

0.84

0.85

Auc

MIMIC-LR

FedSGD Q=1FedBCD-p Q=5FedBCD-p Q=50FedBCD-s Q=1FedBCD-s Q=5FedBCD-s Q=50

(a)

0 1000 2000 3000 4000 5000communication rounds

0.63

0.64

0.65

0.66

0.67

0.68

0.69

train

ing

loss

MIMIC-LRFedSGD Q=1FedBCD-p Q=5FedBCD-p Q=50FedBCD-s Q=1FedBCD-s Q=5FedBCD-s Q=50

(b)

0 20 40 60 80 100 120communication rounds

0.975

0.980

0.985

0.990

0.995

1.000

AUC

MNIST-CNN

FedSGD Q=1FedBCD-p Q=3FedBCD-p Q=5FedBCD-s Q=1FedBCD-s Q=3FedBCD-s Q=5

(c)


0.68

0.70

0.72

0.74

0.76

0.78

0.80

Loss

MNIST-CNNFedSGD Q=1FedBCD-p Q=3FedBCD-p Q=5FedBCD-s Q=1FedBCD-s Q=3FedBCD-s Q=5

(d)

Figure 2: Comparison of AUC (Left) and training loss (Right) in MIMIC-III dataset with varyinglocal iterations, denoted by Q.

MIMIC-LR MNIST-CNNAUC 84% AUC 99.7%

Algo. Q rounds Q roundsFedSGD 1 334 1 46

FedBCD-p 5 71 3 1650 52 5 8

FedBCD-s 1 407 1 485 74 3 15

50 52 5 9Table 1: Number of communication rounds to reach a target AUC-ROC for FedBCD-p, FedBCD-sand FedSGD on MIMIC-LR and MNIST-CNN respectively.

NUS-WIDE. The NUS-WIDE dataset [19] consists of 634 low-level images features extracted fromFlickr images as well as their associated tags and ground truth labels. We put 634 low-level imagefeatures on party B and 1000 textual tag features with ground truth labels on party A. The objectiveis to perform a federated transfer learning (FTL) task studied in [10]. FTL aims to predict labelsto unlabeled images of party B through transfer learning form A to B. Each party utilizes a neuralnetwork having one hidden layer with 64 units to learn feature representation from their raw inputs.Then, the feature representations of both sides are fed into the final federated layer to performfederated transfer learning. This task is refered to as NUS-FTL.

Default-Credit. The Default-Credit consists of credit card records including user demographics,history of payments, and bill statements, etc., with a user’s default payments as labels. We separatethe features in a way that the demographic features are on one side, separated from the paymentand balance features. This segregation is common in industry applications such as retail and carrental leveraging banking data for user credibility prediction and customer segmentation. In ourexperiments, party A has labels and 18 features including six months of payment and bill balancedata, whereas party B has 15 features including user profile data such as education, marriage . Weperform a FTL task as described above but with homomorphic encryption applied. We refer to thistask as Credit-FTL.

For all experiments, we adopt a decay learning rate strategy with ηt = η0√t+1

, where η0 is opti-mized for each experiment. We fix the batch size to 64 and 256 for MIMIC-LR and MNIST-CNNrespectively.

Results and Discussion

FedBCD-p vs FedBCD-s. We first study the impact of varying local iterations on the communicationefficiency of both FedBCD-p and FedBCD-s algorithms based on MIMIC-LR and MNIST-CNN(Figure 2). We observe similar convergence for FedBCD-s and FedBCD-p for various values ofQ. However, for the same communication round, the running time of FedBCD-s doubles that ofFedBCD-p due to sequential execution. As the number of local iteration increases, we observe that therequired number of communication rounds reduce dramatically (Table 1). Therefore, by reasonablyincreasing the number of local iteration, we can take advantage of the parallelism on participantsand save the overall communication costs by reducing the number of total communication roundsrequired.

8

Impact of Q. Theorem 2 suggests that as Q grows the required number of communication roundsmay first decrease and then increase again, and eventually the algorithm may not converge to optimalsolution. To further investigate the relationship between the convergence rate and the local iterationQ, we evaluate FedBCD-p algorithm on NUS-FTL with a large range of Q. The results are shownin Figure 3 and Figure 4(a), which illustrate that FedBCD-p achieves the best AUC with the leastnumber of communication rounds when Q = 15. For each target AUC, there exists an optimal Q.This manifests that one needs to carefully select Q to achieve the best communication efficiency, assuggested by Theorem 2.

0 20 40 60 80 100 120 140communication rounds

0.74

0.76

0.78

0.80

0.82

0.84AU

C

NUS-FTL

FedBCD-p Q=1FedBCD-p Q=5FedBCD-p Q=15FedBCD-p Q=20


3000

4000

5000

6000

7000

Loss

NUS-FTL

FedBCD-p Q=1FedBCD-p Q=5FedBCD-p Q=15FedBCD-p Q=20

Figure 3: Comparison of AUC (Left) and training loss (Right) in NUS-WIDE dataset with varyinglocal iterations, denoted by L.

3 5 10 15 20 25Q

5

10

15

20

25

Com

mun

icat

ion

Roun

ds

NUS-FTLAUC=0.830AUC=0.835AUC=0.837

(a)

0 10 20 30 40communication rounds

0.810

0.815

0.820

0.825

0.830

0.835

0.840

AUC

NUS-FTL

FedBCD-p Q=25FedBCD-p Q=50FedBCD-p Q=100FedPBCD-p Q=25FedPBCD-p Q=50FedPBCD-p Q=100

(b)

Figure 4: The relationship between communication rounds and varying local iterations denoted by Qfor three target AUC (a), and the comparison between FedBCD-p and FedPBCD-p for large localiterations (b).

Figure 4(b) shows that for very large local iteration Q = 25, 50 and 100, the FedBCD-p cannotconverge to the AUC of 83.7%. This phenomenon is also supported by Theorem 2, where if Q is toolarge the right hand side of (13) may not go to zero. Next we further address this issue by making thealgorithm less sensitive in choosing Q.

Proximal Gradient Descent. [32] proposed adding a proximal term to the local objective functionto alleviate potential divergence when local iteration is large. Here, we explore this idea to ourscenario. We rewrite (12) as follows:

gk(yrk;Di) = gk([Θr0−k, θ

rk];Di]) + µ(θrk − θ

r0k ) (15)

where µ(θrk−θr0k ) is the gradient of the proximal term µ

2 ||θrk−θ

r0k ||2, which exploits the initial model

θr0k of party k to limit the impact of local updates by restricting the locally updated model to be closeto θr0k . We denote the proximal version of FedBCD-p as FedPBCD-p. We then apply FedPBCD-pwith µ = 0.1 to NUS-FTL for Q = 25, 50 and 100 respectively. Figure 4(b) illustrates that if Q istoo large, FedBCD-p fails to converge to optimal solutions whereas the FedPBCD-p converges fasterand is able to reach at a higher test AUC than corresponding FedBCD-p does.

Increasing number of Parties. In this section, we increase the number of parties to five andseventeen and conduct experiments for MIMIC-LR task. We partition data by clinical variables witheach party having all the related features of the same variable. We adopt a decay learning rate strategywith η0√

Tkaccording to Theorem 2. The results are shown in Figure 5. We can see that the proposed

method still performs well when we increase the local iterations for multiple parties. As we increase

9

the number of parties to five and seventeen, FedBCD-p is slightly slower than the two-party case, butthe impact of node K is very mild, which verifies the theoretical analysis in Remark 3.


0.80

0.81

0.82

0.83

0.84

0.85

Auc

MIMIC-LR

FedSGD Q=1 K=2FedBCD-p Q=5 K=2FedBCD-p Q=50 K=2FedBCD-p Q=5 K=5FedBCD-p Q=50 K=5FedBCD-p Q=5 K=17


0.80

0.81

0.82

0.83

0.84

0.85

Auc

MIMIC-LR

FedSGD Q=1 K=2FedBCD-s Q=5 K=2FedBCD-s Q=50 K=2FedBCD-s Q=5 K=5FedBCD-s Q=50 K=5FedBCD-s Q=5 K=17

Figure 5: Comparison of AUC in MIMIC-III dataset with varying local iterations (denoted by Q) andnumber of parties (denoted by K). (Left) FedBCD-p; (Right) FedBCD-s

Implementation with HE. In this section, we investigate the efficiency of FedBCD-p algorithmwith homomorphic encryption (HE) applied. Using HE to protect transmitted information ensureshigher security but it is extremely computationally expensive to perform computations on encrypteddata. In such a scenario, carefully selecting Q may reduce communication rounds but may alsointroduce computational overhead because the total number of local iterations may increase (Q×number of communication rounds). We integrated the FedBCD-p algorithm into the current FTLimplementation on FATE1 and simulate two-party learning on two machines with Intel Xeon Goldmodel with 20 cores, 80G memory and 1T hard disk. The experimental results are summarized inTable 2. It shows that FedBCD-p with larger Q costs less communication rounds and total trainingtime to reach a specific AUC with a mild increase in computation time but more than 70 percentsreduction in communication round from FedSGD to Q = 10.

Credit-FTLAUC Algo. Q R comp. comm. total

FedSGD 1 17 11.33 11.34 22.6770% FedBCD-p 5 4 13.40 2.94 16.34

10 2 10.87 2.74 13.61FedSGD 1 30 20.50 20.10 40.60

75% FedBCD-p 5 8 26.78 5.57 32.3510 4 23.73 2.93 26.66

FedSGD 1 46 32.20 30.69 62.8980% FedBCD-p 5 13 43.52 9.05 52.57

10 7 41.53 5.12 46.65Table 2: Number of communication rounds and training time to reach target AUC 70%, 75% and 80%respectively for FedSGD versus FedBCD-p. R, comp. and comm. denote communication rounds,computation time (mins), and communication time (mins) respectively.

9 Conclusions and Future Work

In this paper, we propose a new collaboratively learning framework for distributed features based onblock coordinate gradient descent in which parties perform more than one local update of gradientsbefore communication. Our approach significantly reduces the number of communication roundsand the total communication overhead. We theoretically prove that the algorithm achieves globalconvergence with a decay learning rate and proper choice of Q. The approach is supported by ourextensive experimental evaluations. We also show that adding proximal term can further enhanceconvergence at large value of Q. Future work may include investigating and further improving thecommunication efficiency of such approaches for more complex and asynchronized collaborativesystems.

1https://github.com/FederatedAI/FATE

10

References[1] EU. REGULATION (EU) 2016/679 OF THE EUROPEAN PARLIAMENT AND OF THE

COUNCIL on the protection of natural persons with regard to the processing of personaldata and on the free movement of such data, and repealing Directive 95/46/EC (General DataProtection Regulation). Available at: https://eur-lex. europa. eu/legal-content/EN/TXT, 2016.

[2] Dean, J., G. Corrado, R. Monga, et al. Large scale distributed deep networks. In Advances inneural information processing systems, pages 1223–1231. 2012.

[3] McMahan, H. B., E. Moore, D. Ramage, et al. Federated learning of deep networks using modelaveraging. CoRR, abs/1602.05629, 2016.

[4] Yang, Q., Y. Liu, T. Chen, et al. Federated machine learning: Concept and applications. CoRR,abs/1902.04885, 2019.

[5] Gratton, C., V. D., R. Arablouei, et al. Distributed ridge regression with feature partitioning.pages 1423–1427. 2018.

[6] Hu, Y., D. Niu, J. Yang, et al. Fdml: A collaborative machine learning framework for distributedfeatures. In KDD. 2019.

[7] Hu, Y., P. Liu, L. Kong, et al. Learning privately over distributed features: An ADMM sharingapproach. CoRR, abs/1907.07735, 2019.

[8] Hardy, S., W. Henecka, H. Ivey-Law, et al. Private federated learning on vertically partitioneddata via entity resolution and additively homomorphic encryption. CoRR, abs/1711.10677,2017.

[9] Cheng, K., T. Fan, Y. Jin, et al. Secureboost: A lossless federated learning framework. CoRR,abs/1901.08755, 2019.

[10] Liu, Y., T. Chen, Q. Yang. Secure federated transfer learning. CoRR, abs/1812.03337, 2018.

[11] Rivest, R. L., L. Adleman, M. L. Dertouzos. On data banks and privacy homomorphisms.Foundations of Secure Computation, Academia Press, pages 169–179, 1978.

[12] Yao, A. C. Protocols for secure computations. In SFCS, pages 160–164. 1982.

[13] Richtárik, P., M. Takác. Distributed coordinate descent method for learning with big data. J.Mach. Learn. Res., 17(1):2657–2681, 2016.

[14] Mahajan, D., S. S. Keerthi, S. Sundararajan. A distributed block coordinate descent method fortraining l1regularized linear classifiers. J. Mach. Learn. Res., 18(1):3167–3201, 2017.

[15] Peng, Z., Y. Xu, M. Yan, et al. Arock: an algorithmic framework for asynchronous parallelcoordinate updates. 2015. Preprint, arXiv:1506.02396.

[16] Niu, F., B. Recht, C. Re, et al. Hogwild!: A lock-free approach to parallelizing stochasticgradient dea distributed, asynchronous and incremental algorithm for nonconvex optimization:An admm based approachvia a scent. 2011. Online at arXiv:1106.57320v2.

[17] Johnson, A. E., T. J. Pollard, L. Shen, et al. Mimic-iii, a freely accessible critical care database.Scientific data, 3:160035, 2016.

[18] LeCun, Y., C. Cortes. MNIST handwritten digit database. 2010.

[19] Chua, T.-S., J. Tang, R. Hong, et al. NUS-WIDE: A real-world web image database fromNational University of Singapore. In CIVR. 2009.

[20] Shokri, R., V. Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22Nd ACMSIGSAC Conference on Computer and Communications Security, CCS ’15, pages 1310–1321.ACM, New York, NY, USA, 2015.

11

[21] Yu, H., S. Yang, S. Zhu. Parallel restarted sgd with faster convergence and less communication:Demystifying why model averaging works for deep learning. In Proceedings of the AAAIConference on Artificial Intelligence, vol. 33, pages 5693–5700. 2019.

[22] Li, X., K. Huang, W. Yang, et al. On the convergence of FedAvg on non-iid data. arXiv preprintarXiv:1907.02189, 2019.

[23] Ghadimi, S., G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochasticprogramming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[24] Zhu, L., Z. Liu, S. Han. Deep leakage from gradients. In Advances in Neural InformationProcessing Systems 32, pages 14774–14784. Curran Associates, Inc., 2019.

[25] Hitaj, B., G. Ateniese, F. Perez-Cruz. Deep models under the gan: Information leakage fromcollaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computerand Communications Security, CCS ’17, pages 603–618. ACM, New York, NY, USA, 2017.

[26] Melis, L., C. Song, E. D. Cristofaro, et al. Exploiting unintended feature leakage in collaborativelearning. 2019 IEEE Symposium on Security and Privacy (SP), pages 691–706, 2018.

[27] Li, Q., Z. Wen, B. He. Practical federated gradient boosting decision trees. In 34th AAAIConference on Artificial Intelligence (AAAI-20). 2020.

[28] Du, W., Y. Han, S. Chen. Privacy-preserving multivariate statistical analysis: Linear regressionand classification. In M. Berry, U. Dayal, C. Kamath, D. Skillicorn, eds., SIAM ProceedingsSeries, pages 222–233. 2004.

[29] Vaidya, J., C. Clifton. Privacy preserving association rule mining in vertically partitioned data.In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discoveryand Data Mining, KDD ’02, pages 639–644. ACM, New York, NY, USA, 2002.

[30] Mangasarian, O. L., E. W. Wild, G. M. Fung. Privacy-preserving classification of verticallypartitioned data via random kernels. ACM Trans. Knowl. Discov. Data, 2(3):12:1–12:16, 2008.

[31] Harutyunyan, H., H. Khachatrian, D. C. Kale, et al. Multitask learning and benchmarking withclinical time series data. arXiv preprint arXiv:1703.07771, 2017.

[32] Li, T., A. K. Sahu, M. Zaheer, et al. Federated optimization for heterogeneous networks. arXivpreprint arXiv:1812.06127, 2019.

12

Supplementary Materials

A Proof of Theorem 1

Lemma 1 For any vector θ0 ∈ Rdk . There exists infinite many of non-identity orthogonal matrixU ∈ Rdk×dk such that

UUT = I (16)Uθ0 = θ0 (17)

Proof. First we construct a orthogonal U1 satisfying (16) and (17) for

θ0 := e1 = (1, 0, · · · , 0)T ∈ Rdk (18)

Then we complete the proof by generalizing the construction for an arbitrary θ0 ∈ Rdk .

With θ0 = e1, we construct U1 in the following way

U1 :=

[1 00 V

](19)

where V ∈ R(dk−1)×(dk−1) is any non-identity orthogonal matrix with dk > 2, i.e.,

V V T = I. (20)

Condition (16) is satisfied since

U1UT1 =

[1 00 V

] [1 00 V T

](21)

=

[1 00 V V T

](22)

= I, (23)

and condition (17) is satisfied trivially, i.e.,

U1e1 = [ 1 0, · · · , 0 ]T

= e1 (24)

For any arbitrary θ0, we apply the Householder transformation to "rotate" it to the basis vector e1,i.e.,

θ0 = ‖θ0‖2Pe1 (25)where P is the Householder transformation operator such as

P = PT (26)PPT = PP = I (27)

Therefore from U1 defined in (19) we can construct U by

U = PU1P. (28)

Finally, we verifies that U satisfies condition (16)) and (17)):

UUT = PU1P (PUT1 P ) (29)

= PU1(PP )UT1 P (30)

= PU1UT1 P (31)

= PP = I (32)Uθ0 = PU1Pθ

0 (33)= ‖θ0‖2P (e1U1) (34)

= ‖θ0‖2Pe1 (35)

= θ0 (36)

where (34) holds since from (25) we have

Pθ0 = ‖θ0‖2e1 (37)

13

A.1 Proof of Theorem 1

Proof: We first show that the conclusion holds for the case when k < K. With initial weightθ0k ∈ Rdk , Lemma 1 shows that we can find infinite number of non-identity matrix U ∈ Rdk×dk suchthat

θ0k = UT θ0k. (38)Let xkij denotes the ith sample of the data set Sj sampled at jth iteration. We show that for any{xkij}i∈Sj ,j=0,1,··· that yields observations {Hk

ij}j=0,1,···, we can construct another set of data

xkij := xkijU (39)

where U is chosen to satisfy condition (38). Let {Hkij} be observations generated by {xkij}, and {θjk}

be weight variables withθ0k = UT θ0k. (40)

We show in the following that for j = 0, 1, · · ·Hkij = Hk

ij (41)

θjk = UT θjk. (42)

It is easy to verify (41) for j = 0, i.e.,

Hki0 = xki0UU

T θ0k

= (xki0U)(UT θ0k)

= xki0 θ0k

= Hki0 ,

From equation (11), we definegij := g(Hij , y

Ki ) (43)

Now assuming that condition (41) and (42) hold for j ≤ J . Then

gij = gij (44)

θjk = UT θjk (45)

We first show (42) holds for j = J + 1.

θJ+1k (46)

= θJk − η(1

NSJ

∑i∈SJ

giJ (xkiJ )T + λθJk ) (47)

= UT θJk − η(1

NSJ

∑i∈SJ

giJ (xkiJU)T + λUT θJk ) (48)

= UT (θJk − η(1

NSJ

∑i∈SJ

giJ (xkiJ )T + λθJk )) (49)

= UT θJ+1k (50)

where (48) follows from (44) and (45). Note if Q local updates are performed, locally we have

θj,q+1k − θj,qk = (1− ηλ)(θj,qk − θ

j,q−1k ) (51)

where θj,qk denotes the qth local update of jth iteration. it is thus easy to show that

θJ+1,qk = UT θJ+1,q

k (52)

Next we show (41) holds for j = J + 1.

HkiJ+1

= xki,J+1θJ+1k (53)

= xki,J+1UUT θJ+1

k (54)

= xki,J+1θk (55)

= HkiJ+1

(56)

14

Finally we complete the proof by showing the conclusion holds for k = K. Similarly for any{xKij }j=0,1,··· and {yKij }, we construct a different solution as follows:

xKij := xKijU (57)

yKij := yKij . (58)

where U ∈ Rdk×dk is a non-identity orthogonal matrix satisfying (39) for k = K. Therefore we have

HKij = g(Hij , y

Kij ) (59)

= g(Hij , yKij ) (60)

= HKij (61)

B Proof of Theorem 2

To begin with, let us first introduce some notations.

Notations. For notation simplicity, we divide the total data set [N ] into S1, . . . ,SB , each with|Si| = S. Let r denote the iteration index, where each iteration one round of local update isperformed among all the nodes; Let

Θr = [θr1; . . . ; θrK ] , [yr1,1; . . . ; yrK,K ]

denotes the updated parameters at each iteration of each node. Note that {Θr} is only a sequence of“virtual" variable that helps the proof, but it is never explicitly formed in the algorithm.

Let r0 denote the latest iteration before r with global communication, so Θr0 = yr0k , k = 1, . . . ,Kand

yrk = [Θr0−k, θ

rk].

Also we denote the full gradient of the loss function as∇L(Θ). Further denote the stochastic gradientas

G , [g1(y1;S); . . . ; gK(yK ;S)].

Using the above notation, the update rule becomes: Θr+1 = Θr − γGr.

We also need the following important definition. Let us denote the variables updated in the innerloops, using mini-batch Sl, l ∈ [B] as Θr

[l]. That is, we have the following:

Θr0+1[l],k , Θr0

k − ηgrk(Θr0 ;Sl),

Θr0+τ[l],k , Θr0+τ−1

[l],k − ηgrk(yr0+τ−1[l],k ;Sl),

whereyr[l],k , [Θr0

−k, θr[l],k].

Further define

yrk , [yr[1],k; · · · ; yr[B],k]. (62)

Note that if Si is sampled at r0, then the quantities Θr0+τ[l],k , l 6= i are not realized in the algorithm.

They are introduced only for analytical purposes.

By using these notations, Algorithm 2 can be written in the compact form as in Algorithm 4. Further,the uniform sampling assumption A2 implies the following:

A2’: Unbiased Gradient with Bounded Variance. Suppose Assumption A2 holds. Then we havethe following:

E[grk(yrk;Si) | {ytk}r0t=1] =

1

B

B∑`=1

grk(yr[`],k;S`) , ∇kL(yrk)

where the expectation is taken on the choice of i at iteration r0, and conditioning all the past historiesof the algorithm up to iteration r0; Further, the following holds:

E[‖grk(yrk;Si)−∇kL(yrk)‖2 | {ytk}r0t=1] ≤ σ2

S

15

Algorithm 4 Federated Stochastic BCD Algorithm

Input: Θ(0), η,QInitialize: y0

k , Θ(0), k = 1, . . . ,Kfor r = 0, 1, . . . do

if r mod Q = 0 thendo global communicationyrk , [yr1,1

T, . . . ,yrK,KT]

T

Randomly sample a mini-batch Si for nodes 1, . . . ,K

endCalculate grk(yrk;Si) at each node using Sifor k = 1, · · · ,K do

Local updates

yr+1k,j ,

{yrk,j , j = 1, . . . ,K, j 6= k

yrk,j − ηgrk(yrk;Si), j = k

endend

where the expectation is taken over the random selection of the data point at iteration r0; σ2 is thevariance introduced by sampling from a single data point.

To begin our proof, we first present a lemma that bounds the difference of the gradients evaluated atthe global vector Θr and the local vector yrk.

Lemma 2 Bounded difference of gradient. Under Assumption A1, A2, the difference between thepartial gradients evaluated at Θr and yrk is bounded by the following:

EK∑k=1

‖∇kL(Θr)− gk(yrk;S)‖2 ≤ (K + 3)η2K∑k=1

L2kQ

2 σ2

S

+ η2Q

K∑k=1

(

K∑j=1

L2j + 3L2

k)

r−1∑τ=r−Q

E‖∇kL(yτk)‖2 +Kσ2

S,

where σ is some constant related to the variances of the sampling process.

B.1 Proof of Lemma 2

Proof. First we take the expectation and apply assumption A1,A2

E[‖∇kL(Θr)− grk(yrk;Si)‖2]

= E[ESi [‖∇kL(Θr[i])− g

rk(yrk;Si)‖2 | Θr0 ]]

≤ 2E[ESi [‖∇kL(Θr[i])−∇kL(yrk)‖2 | Θr0 ]

+ ESi [∇kL(yrk)− grk(yrk;Si)‖2 | Θr0 ]]

≤ 2E[L2kESi [

1

B

B∑l=1

‖Θr[i] − yr[l],k‖

2 | Θr0 ]

+ ESi [‖∇kL(yrk)− grk(yrk;Si)‖2 | Θr0 ]]

≤ 2E

[L2kESi [

1

B

B∑l=1

‖Θr[i] − yr[l],k‖

2 | Θr0 ]

]+

2σ2

S

(63)

16

Note that

‖Θr[i] − yr[l],k‖2 =

K∑j=1

‖θr[i],j − yr[l],k,j‖2

=∑j 6=k

‖θr[i],j − θr0j ‖2 + ‖θr[i],k − θr[l],k‖2

=∑j 6=k

‖θr0j − ηr−1∑τ=r0

gτj (yτ[i],j ;Si)− θr0j ‖2

+ ‖ηr−1∑τ=r0

gτk(yτ[i],k;Si)− ηr−1∑τ=r0

gτk(yτ[l],k;Sl)‖2

≤ η2[∑j 6=k

‖r−1∑τ=r0

gτj (yτ[i],j ;Si)‖2

+ 2‖r−1∑τ=r0

gτk(yτ[l],k;Sl)‖2 + 2‖r−1∑τ=r0

gτk(yτ[l],k;Si)‖2]

≤ η2(r − r0 − 1)

r−1∑τ=r0

(

K∑j=1

‖gτj (yτ[i],j ;Si)‖2

+ 2‖gτk(yτ[l],k;Sl)‖2 + ‖gτk(yτ[i],k;Si)‖2).

(64)

Notice that r − r0 − 1 ≤ Q and substitute (64) to (63), by taking expectation over Si, we obtain

E[‖∇kL(Θr)− grk(yrk;Si)‖2]

≤ 2L2kη

2(r − r0 − 1)

r−1∑τ=r0

E[ESi[ 1

B

B∑l=1

(

K∑j=1


+ 2‖gτk(yτ[l],k;Sl)‖2 + ‖gτk(yτ[i],k;Si)‖2)]]

+2σ2

S

(i)

≤ 2L2kη

2Q

r−1∑τ=r0

E[ESi[(

K∑j=1


+ 2‖∇kL(yτk)‖2 +2σ2

S+ ‖gτk(yτ[i],k;Si)‖2)

]]+

2σ2

S

(ii)

≤ 2L2kη

2Q

r−1∑τ=r0

E[(

K∑j=1

(‖∇kL(yτk)‖2 +σ2

S)

+ 3‖∇kL(yτk)‖2 +3σ2

S)] +

2σ2

S

≤ 2L2kη

2Q

r−1∑τ=r−Q

E[(

K∑j=1

‖∇kL(yτk)‖2 +(K + 3)σ2

S

+ 3‖∇kL(yτk)‖2)] +2σ2

S

(65)

where in (i) we used the definition of ∇kL(yrk); in (ii) we use the fact that E[x2] − (E[x])2 =E[(x− E(x))2]; in the last inequality is true because summing from r0 to r − 1 is always less thatsumming from r −Q to r − 1. Next we sum over K, we have

EK∑k=1

‖∇kL(Θr)− gk(yrk;Si)‖2 ≤ 2(K + 3)η2K∑k=1

L2kQ

2 σ2

S

+ 2η2Q

K∑k=1

(

K∑j=1

L2j + 3L2

k)

r−1∑τ=r−Q

E‖∇kL(yτk)‖2 + 2Kσ2

S,

(66)

This completed the proof of this result.

17

B.2 Proof of Theorem 2

Proof. First apply the Lipschitz condition of L, we have

L(Θr+1) ≤L(Θr) + 〈∇L(Θr),Θr+1 −Θr〉

+L

2‖Θr+1 −Θr‖2.

(67)

Next we apply the update step in Algorithm 4 and take the expectation

E[L(Θr+1)− L(Θr)] ≤ −ηE〈∇L(Θr),Gr〉+L

2η2E‖Gr‖2

= −η2E‖∇L(Θr)‖2 − η

2E‖Gr‖2

+η

2E‖∇L(Θr)−Gr‖2 +

L

2η2E‖Gr‖2

= −η2E‖∇L(Θr)‖2 − η

2(1− Lη)E‖Gr‖2

+η

2E

K∑k=1

‖∇kL(Θr)− gk(yrk;Si)‖2

(68)

Also we have

E‖Gr‖2 = E‖E[Gr]‖2 + E‖Gr − E[Gr]‖2

= EK∑k=1

‖∇kL(yrk)‖2 + EK∑k=1

‖Gr −∇kL(yrk)‖2

≥ EK∑k=1

‖∇kL(yrk)‖2.

(69)

Substitute (69) into (68), choose 0 < η ≤ 1L and apply Lemma 2, we have

E[L(Θr+1)− L(Θr)]

≤ −η2E‖∇L(Θr)‖2 − η

2(1− Lη)E

K∑k=1

‖∇kL(yrk)‖2

+η

2E

K∑k=1

‖∇kL(Θr)− gk(yrk;Si)‖2

(66)≤ −η

2E‖∇L(Θr)‖2 − η

2(1− Lη)E

K∑k=1

‖∇kL(yrk)‖2

+ (K + 3)η3K∑k=1

L2kQ

2 σ2

S+ ηK

σ2

S

+ η3QK∑k=1

(

K∑j=1

L2j + 3L2

k)

r−1∑τ=r−Q

E‖∇kL(yτk)‖2

(70)

Average over T and reorganize the terms, let η satisfy

1− Lη − 2η2Q2(

K∑j=1

L2j + 3L2

k) ≥ 0,

18

(0 < η ≤√2

2Q√∑K

j=1 L2j+3L2

k

), then we obtain

1

T

T−1∑r=0

E‖∇L(Θr)‖2 ≤ 2

ηTE[L(Θ0)− L(ΘT+1)]

−K∑k=1

(1− Lη − 2η2Q2(

K∑j=1

L2j + 3L2

k))E‖∇kL(yrk)‖2

+ (K + 3)2η2K∑k=1

L2kQ

2 σ2

S+ 2K

σ2

S

≤ 2

ηTE[L(Θ0)− L(ΘT+1)]

+ (K + 3)2η2K∑k=1

L2kQ

2 σ2

S+ 2K

σ2

S.

(71)

The proof is completed.

19

a communication-efﬁcient collaborative learning …1 introduction one critical challenge for...

Documents