

Differentially Private Distributed Online Learning

Chencheng Li, Student Member, IEEE, Pan Zhou, Member, IEEE, Li Xiong, Qian Wang, Member, IEEE, and Ting Wang

Abstract—In the big data era, data generation presents some new characteristics, including wide distribution, high velocity, high dimensionality, and privacy concerns. To address these challenges for big data analytics, we develop a privacy-preserving distributed online learning framework for data collected from distributed sources. Specifically, each node (i.e., data source) is capable of learning a model from its local dataset and exchanges intermediate parameters with a random subset of its neighboring (logically connected) nodes. Hence, the topology of the communications in our distributed computing framework is not fixed in practice. Since online learning is often performed on sensitive data, we introduce the notion of differential privacy (DP) into our distributed online learning algorithm (DOLA) to protect data privacy during learning, preventing an adversary from inferring any significant sensitive information. Our model is of general value for big data analytics in the distributed setting, because it provides a rigorous and scalable privacy guarantee and has much lower computational complexity than classic schemes, e.g., secure multiparty computation (SMC). To tackle high-dimensional incoming data entries, we study a sparse version of the DOLA with novel DP techniques that saves computing resources and improves utility. Furthermore, we present two modified private DOLAs to meet the needs of practical applications: one converts the DOLA to distributed stochastic optimization in an offline setting; the other uses the mini-batch approach to reduce the amount of perturbation noise and improve utility. We conduct experiments on real datasets on a configured distributed platform. Numerical results validate the feasibility of our private DOLAs.

Index Terms—Differential privacy, distributed optimization, online learning, sparse, mini-batch, big data


1 INTRODUCTION

WITH the fast development of the Internet, the data from many application domains are large-scale and distributed. Centralized processing is no longer capable of efficiently computing today's Internet data. Moreover, not only has the scale of data become enormous, but the velocity of data generation has also increased dramatically. Hence, data should be processed in real time (online) to satisfy the need for fast responses to users. Besides wide distribution and high velocity, a great deal of data involves personal information that requires privacy protection. Data mining that does not guarantee secure computing and data privacy is never permitted. Therefore, a privacy-preserving mechanism needs to be used in the distributed setting. In a word, we intend to study a private and efficient distributed framework for big data applications. To motivate our ideas, we provide two real-world applications as follows:

Disease prevention and treatment. Hospitals have a large number of patient cases. A data analysis of these cases helps doctors make accurate diagnoses and propose early treatments. However, the information about the patients is extremely private, and a hasty survey may reveal sensitive information about them. Hence, a hospital cannot share its patient cases with research institutions or other hospitals. An important challenge is how to conduct medical studies separately, while the privacy of patients is preserved.

Online ads recommendation. A large number of people engage in online activities through apps and websites. Advertising revenue is a main source of income for some IT companies, such as Facebook and Google. Such Internet companies recommend personalized advertisements to each user. Since the preferences of users change over time, the sample data of the users are updated frequently. As a result, these companies need to handle petabytes of data every day. To solve this problem, distributed computing systems, e.g., Hadoop and Spark, have been used by Internet companies for years to process data at scale. Hadoop and Spark are master-slave distributed systems, where the raw data are frequently transmitted between disk and memory, which costs considerable resources and time. Against this kind of waste, we design an algorithm that optimizes the distributed functions by sharing intermediate parameters. It is also necessary to apply privacy mechanisms to the data transmission in the distributed system.

To solve the above private distributed problems, several methods have been studied. For instance, secure multiparty computation is a preferable method to optimize a function over distributed data sources while keeping these data private. SMC has been studied in many applications (e.g., [1], [2], [3], [4]). Although SMC can solve the distributed computing problem privately, it costs a large amount of computation resources. As a state-of-the-art privacy notion, differential privacy [5], [6] has been used in some distributed optimization models (e.g., [7], [8], [9]). DP can protect a learning algorithm by using a small amount of noise (drawn from a Gaussian or other distribution). However, achieving a good trade-off between the privacy and utility of a DP algorithm remains a problem. In this paper, we propose a faster and more private distributed learning algorithm, which can also be used in more applications such as high-dimensional data optimization.

• C. Li and P. Zhou are with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, Hubei 430073, China. E-mail: {lichencheng, panzhou}@hust.edu.cn.
• L. Xiong is with the Department of Mathematics and Computer Science and Biomedical Informatics, Emory University, Atlanta, GA 30322. E-mail: [email protected].
• Q. Wang is with the School of Computer Science, Wuhan University, Wuhan, Hubei 430072, P.R. China. E-mail: [email protected].
• T. Wang is with the Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015. E-mail: [email protected].

Manuscript received 18 Sept. 2016; revised 11 Dec. 2017; accepted 6 Jan. 2018. Date of publication 17 Jan. 2018; date of current version 5 July 2018. (Corresponding author: Pan Zhou.) Recommended for acceptance by E. Terzi. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TKDE.2018.2794384

1440 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 30, NO. 8, AUGUST 2018

1041-4347 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

First, we present the distributed online learning model and privacy concerns. Specifically, for the distributed setting, we assume all nodes (i.e., data sources) have independent online learning ability: each updates its local parameter using one (or a small batch of) data points from the local data source. To achieve the convergence of the DOLA, all nodes must exchange their learnable intermediate parameters with their own neighboring nodes. This approach saves much communication cost, but the design of the communication matrices among the nodes poses a great challenge. Our distributed online learning model differs from traditional distributed computing systems (e.g., Hadoop and Spark), which contain a master node and many slave nodes. The master node is in charge of the "MapReduce" scheduling and the slave nodes are responsible for computing data as requested. This kind of distributed system incurs much communication cost and easily leads to privacy breaches as a result of transmitting all the computed data to the master. For the privacy mechanism, we present a differentially private framework for distributed online learning on sensitive data. Differential privacy is a popular privacy mechanism based on noise perturbation and has been used in several machine learning applications [10], [11], [12]. In a nutshell, a differentially private machine learning algorithm guarantees that its output does not change much whether or not any one individual is in the training set. Thus, an adversary cannot infer any meaningful information from the output of the algorithm. DP differs from traditional encryption approaches and preserves the privacy of large and distributed datasets by adding a small amount of noise. How to apply DP to distributed online learning is another challenge.

Furthermore, some online data are high-dimensional. For instance, a single person has a variety of social activities in a social network, so the corresponding vector of his/her social information may be long and complex. When a data miner studies consumer behavior regarding one interest, some of the information in the vector may not be relevant; a person's height and age may not contribute to predicting his taste. Thus, high-dimensional data can increase the computational complexity of algorithms and weaken the utility of online learning models. To deal with this problem, we introduce a sparse solution and test it on real datasets. So far, there have been two classical effective methods for sparse online learning [13], [14], [15]. The first (e.g., [16]) introduces sparsity in the weights of online learning algorithms via a truncated gradient. The second follows the dual averaging algorithm [17]. In this paper, we exploit online mirror descent [18] and the Lasso (L1 norm) [19] to make the learnable parameter sparse and achieve better learning performance.

Finally, we propose two extensions of the private DOLA for practical applications. The first is that our differentially private DOLA can be converted to distributed stochastic optimization. Specifically, we show that the regret bound of our private DOLA can be used to obtain good convergence rates for the distributed stochastic optimization algorithm. This application contributes to offline learning related optimization problems when the data stream arrives in real time. The second is that we allow a slack in the number of samples computed in each iteration of online learning: each iteration can process more than one data instance, which is called mini-batch updating [20], [21]. By using mini-batches in the private DOLA, we not only process the high-velocity data stream much faster, but also reduce the amount of perturbation noise while the algorithm guarantees the same DP level.

Following are the main contributions of this paper:

• We present a distributed framework for online learning. We obtain the classical regret bounds O(√T) [22] and O(log T) [23] for convex and strongly convex objective functions, respectively, for the DOLA, which indicates that the utility of our private DOLA achieves almost the same performance as centralized online learning algorithms.

• We provide ε- and (ε, δ)-differential privacy guarantees for the DOLA. Interestingly, the private regret bounds have the same order, O(√T) and O(log T), as the non-private ones, which indicates that guaranteeing differential privacy in the DOLA does not significantly harm the original performance.

• Facing high-dimensional data, we propose a differentially private sparse DOLA and achieve an O(√T) regret bound, which shows that our private sparse DOLA works well.

• We convert our differentially private DOLA to private distributed stochastic optimization and obtain good convergence rates, i.e., O(1/√T) and O(log T / T).

• We use mini-batches to improve the performance of our differentially private DOLA. The proposed algorithms using mini-batches guarantee the same level of privacy with less noise, which naturally leads to better performance than the one-batch private DOLA.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 presents the problem statement for our work. We propose ε- and (ε, δ)-differentially private DOLAs in Section 4. Section 5 studies the private sparse DOLA aimed at processing high-dimensional data. In Section 6, we make extensions of our private DOLAs: Section 6.1 discusses the application of the private DOLA to stochastic optimization and Section 6.2 uses mini-batches to improve the privacy and utility performance of our private DOLA. In Section 7, we present simulation results of the proposed algorithms. Section 8 concludes the paper. Some lengthy proofs are left to the supplement, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2794384.



2 RELATED WORK

Jain et al. [11] studied differentially private centralized online learning. They provided a generic differentially private framework for online algorithms: using their framework, Implicit Gradient Descent and Generalized Infinitesimal Gradient Ascent can be transformed into differentially private online learning algorithms. Their work motivates our study of differentially private online learning in distributed scenarios.

Recently, growing research effort has been devoted to distributed online learning. Yan et al. [24] proposed a DOLA to handle decentralized data. A fixed network topology was used to conduct the communications among the learners in their system. Furthermore, they studied the privacy-preserving problem and showed that a non-fully connected communication network has intrinsic privacy-preserving properties. Unlike differential privacy, however, their privacy-preserving method cannot protect the privacy of all learners absolutely. Besides, Huang et al. [7] is closely related to our work: they presented a differentially private distributed optimization algorithm.

The methods used to solve distributed online learning were pioneered in distributed optimization (DO). Hazan studied online convex optimization in his book [25], proposing that the framework of convex online learning is closely tied to statistical learning theory and convex optimization. Duchi et al. [26] developed an efficient algorithm for DO based on dual averaging of subgradients. Nedic and Ozdaglar [27] considered a subgradient method for distributed convex optimization, where the functions are convex but not necessarily smooth. They demonstrated that time-variant communication can ensure the convergence of the DO algorithm. Ram et al. [28] analyzed the influence of stochastic subgradient errors on distributed convex optimization in a time-variant network topology. All these papers made great contributions to DO, but they did not consider the privacy-preserving issue.

Much research effort has been devoted to how differential privacy can be used in existing learning algorithms. For example, Chaudhuri et al. [10] presented the output perturbation and objective perturbation ideas for differential privacy in ERM classification. Rajkumar and Agarwal [8] extended the work of [10] to distributed multiparty classification. More importantly, they analyzed the sequential and parallel composability problems while the algorithm guarantees ε-differential privacy. Bassily et al. [29] proposed more efficient algorithms and tighter error bounds for ERM classification on the basis of [10].

For a combination of SMC and DP, Goryczka and Xiong [30] studied existing SMC solutions with differential privacy in the distributed setting and compared their complexities and security characteristics.

3 PROBLEM STATEMENT

3.1 System Setting

Consider the distributed network shown in Fig. 1, where m computing nodes are connected in a non-fully connected graph. Each node is connected only with its neighboring nodes; e.g., node i (the green one in Fig. 1) can communicate with nodes j, k, and p, which are located in the light green circular region. The circular regions in Fig. 1 denote the maximum range that the nodes can reach. Note that the lines denoting the links have dotted and solid representations. A dotted line indicates that the link between the neighboring nodes does not work at this time; when the link is connected at some time, its dotted line becomes solid. In a nutshell, each node exchanges information with a random subset of its neighbors in one iteration. This pattern further reduces the communication cost while ensuring that each node keeps exchanging enough information with its neighbors. Note that since our distributed setting does not consider the delays of the communications, the communication cost of our algorithm is not considered.

Our goal is to train all the nodes to be independent online classifiers f_i : X → Y. Each node receives a "question" x_t^i, taken from a convex set X, and should give a prediction (an "answer") denoted by p_t^i to this question. To measure how "accurate" the prediction is, we compare it with the correct answer y_t^i ∈ Y, and then compute the loss function ℓ(w_t^i, x_t^i, y_t^i) (e.g., ℓ(w_t^i, x_t^i, y_t^i) = |⟨w^i, x_t^i⟩ − y_t^i|). We let f_t^i(w) := ℓ(w, x_t^i, y_t^i) be a convex function. Different from ERM optimization, an online node minimizes the regret versus the best single response in hindsight:

  Regret = Σ_{t=1}^T f_t^i(w_t^i; ξ_t^i) − min_{w∈W} Σ_{t=1}^T f_t(w; ξ_t^i),   (1)

TABLE 1
Summary of Main Notations

  w_t^i     learnable parameter of the i-th node at time t
  A_t       communication matrix in iteration t
  n         dimensionality of vector parameters
  S(t)      sensitivity of each iteration
  f_t^i     loss function
  m         number of nodes
  α_t       learning rate or stepsize
  a_ij(t)   the (i, j)-th element of A_t
  G_i(t)    the set of neighbors of node i
  L         (sub)gradient bound of f_t^i
  g_t^i     (sub)gradient of f_t^i
  ε         privacy level of DP
  R_D       regret of the DOLA
  w̄_t       (1/m) Σ_{i=1}^m w_t^i
  ŵ_T       (1/T) Σ_{t=1}^T w̄_t
  ⟨·, ·⟩    inner product
  ‖·‖       Euclidean norm unless specially remarked

Fig. 1. Distributed network where nodes can only communicate with neighbors (localized in their respective circular regions).



where ξ_t^i denotes the sample (x_t^i, y_t^i) received in the t-th iteration and f is the loss function. After T iterations, the regret should be bounded, which indicates that the predictions become more accurate over time. In our distributed online learning, we need to define a new regret function.
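As a concrete illustration, the following is a minimal single-node sketch of this prediction loop, using the absolute loss ℓ(w, x, y) = |⟨w, x⟩ − y| from the example above with a plain online subgradient step and a fixed stepsize. The function name, stepsize, and synthetic data are our own illustrative choices; the paper's Algorithm 1 additionally involves neighbor communication and noise perturbation.

```python
import numpy as np

def online_node(samples, eta=0.1):
    """One node's online prediction loop with the absolute loss
    l(w, x, y) = |<w, x> - y| used as the running example above.
    (The names and fixed stepsize eta are illustrative choices.)"""
    w = np.zeros(samples[0][0].shape[0])
    losses = []
    for x, y in samples:
        pred = w @ x                    # answer the "question" x_t
        losses.append(abs(pred - y))    # loss against the true answer y_t
        g = np.sign(pred - y) * x       # a subgradient of the loss in w
        w = w - eta * g                 # online (sub)gradient step
    return w, losses

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
data = [(x, float(w_true @ x)) for x in rng.normal(size=(200, 2))]
w, losses = online_node(data)
# the average loss over the last rounds falls well below that of the first
# rounds, i.e., predictions improve over time, which is exactly what a
# sublinear regret bound formalizes
```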

Definition 1. In our distributed online learning algorithm, we assume there are m learning nodes, each with independent learning ability. Then, we define the regret of the algorithm as

  R_D = Σ_{t=1}^T Σ_{i=1}^m f_t^i(w_t^j) − min_{w∈W} Σ_{t=1}^T Σ_{i=1}^m f_t^i(w).   (2)

In a centralized online learning algorithm, N data points need T = N iterations to be processed, while the distributed algorithm can process m × N data points over the same time period. Note that R_D is computed with respect to an arbitrary node's parameter w_t^j. This states that any node of the system can measure the regret of the whole system based on its local parameter (e.g., w_t^j), even though the node does not handle the data of other nodes. In this paper, we bound R_D and use this regret bound to measure the utility of the DOLA.
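To make Definition 1 concrete, the hypothetical snippet below evaluates R_D on a toy problem, approximating the minimization over W by a grid search (the paper bounds R_D analytically rather than computing it this way; the names and the quadratic losses are illustrative assumptions).

```python
import numpy as np

def distributed_regret(F, W_j, w_grid):
    """Evaluate R_D from Eq. (2) at an arbitrary node j.

    F[t][i] is the loss function f_t^i; W_j[t] is node j's parameter w_t^j.
    The best fixed parameter is approximated by searching w_grid, a toy
    stand-in for the exact minimization over the set W."""
    T = len(F)
    online_loss = sum(F[t][i](W_j[t]) for t in range(T) for i in range(len(F[t])))
    best_fixed = min(
        sum(F[t][i](w) for t in range(T) for i in range(len(F[t])))
        for w in w_grid
    )
    return online_loss - best_fixed

# Toy instance: m = 2 nodes, T = 50 rounds, scalar losses f_t^i(w) = (w - c_t^i)^2.
rng = np.random.default_rng(1)
C = rng.normal(loc=0.5, size=(50, 2))
F = [[(lambda w, c=c: (w - c) ** 2) for c in row] for row in C]
W_j = [0.5] * 50                     # a node whose parameter sits near the optimum
R = distributed_regret(F, W_j, np.linspace(-1, 1, 201))
# R is small here because w_t^j is already close to the best fixed choice
```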

3.2 Distributed Online Learning Model

To solve the problem of minimizing (1), we exploit distributed convex optimization under the following assumption on the set W and the cost functions f_t^i.

Assumption 1. The set W and the cost functions f_t^i are such that

(1) The set W is a closed and convex subset of R^n. Let R := sup_{x,y∈W} ‖x − y‖ denote the diameter of W.

(2) The cost functions f_t^i are strongly convex with modulus λ ≥ 0. For all x, y ∈ W, we have

  ⟨∇f_t^i(x), y − x⟩ ≤ f_t^i(y) − f_t^i(x) − (λ/2) ‖y − x‖².   (3)

(3) The (sub)gradients of f_t^i are uniformly bounded, i.e., there exists L > 0 such that for all x ∈ W,

  ‖∇f_t^i(x)‖ ≤ L.   (4)

In Assumption 1, (1) guarantees that there exists an optimal solution in our algorithm, while (2) and (3) help us analyze the convergence of our algorithm.
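As a quick numerical sanity check on a toy instance of our own (not one of the paper's loss functions), the quadratic f(w) = ‖w − c‖² on the bounded set W = [−1, 1]² satisfies both conditions: it is strongly convex with modulus 2 and has bounded gradients.

```python
import numpy as np

# Quadratic f(w) = ||w - c||^2 on W = [-1, 1]^2: strongly convex with
# modulus lam = 2 and bounded gradients, so it satisfies Assumption 1.
# (A toy instance chosen for illustration.)
c = np.array([0.2, -0.4])
f = lambda w: float(np.sum((w - c) ** 2))
grad = lambda w: 2.0 * (w - c)
lam = 2.0
L_bound = 2 * np.linalg.norm(np.ones(2) + np.abs(c))  # |x_i - c_i| <= 1 + |c_i| on W

def check_assumption1(x, y):
    """Check the strong-convexity inequality (3) and gradient bound (4) at (x, y)."""
    ineq3 = grad(x) @ (y - x) <= f(y) - f(x) - (lam / 2) * np.sum((y - x) ** 2) + 1e-9
    ineq4 = np.linalg.norm(grad(x)) <= L_bound
    return ineq3 and ineq4

rng = np.random.default_rng(2)
ok = all(check_assumption1(*rng.uniform(-1, 1, size=(2, 2))) for _ in range(1000))
```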

To achieve the convergence of the DOLA, we specify (see Fig. 2) that the i-th node (i ∈ [1, m]) updates its local parameter w_t^i based on its local dataset {(x_t^i, y_t^i) ∈ X × Y}. Since the m nodes are distributed, each node exchanges its learnable parameter with neighbors to reach a globally optimal solution. In this paper, we use a time-variant matrix A_t to conduct the communications among nodes, which makes each node communicate with different subsets of its neighbors in different iterations. Each node first gets the exchanged parameters and computes their weighted average, then updates the parameter w_t^i, and finally broadcasts the new learnable parameter to its neighbors.

To recall, the nodes communicate with neighbors based on the matrix A_t, so each node directly or indirectly influences the other nodes. For a clear description, we denote the communication graph for a node i in the t-th iteration by

  G_i(t) = {(i, j) : a_ij(t) > 0}.   (5)

In our algorithm, each node computes a weighted average [28] of the m nodes' parameters. The weighted average should make each node have "equal" influence on the other nodes after many iterations.

Besides the basic assumptions on datasets and objective functions, how to conduct the communications among the distributed nodes is critical to solving the distributed convex optimization problem in our work. Since the nodes exchange information with neighbors while they update local parameters with subgradients, a time-variant m-by-m doubly stochastic matrix A_t is proposed to conduct the communications. It has a few properties: 1) all elements of A_t are non-negative and the sum of each row or column is one; 2) a_ij(t) > 0 means there exists a communication between the i-th and the j-th nodes in the t-th iteration, while a_ij(t) = 0 means there is no communication between them; 3) there exists a constant h, 0 < h < 1, such that a_ij(t) > 0 implies a_ij(t) > h.
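For illustration, one standard way to build such a doubly stochastic matrix from the currently active links is the Metropolis-weights construction sketched below. The paper does not prescribe this particular construction, so treat it as one possible choice satisfying the stated properties.

```python
import numpy as np

def metropolis_matrix(adj):
    """Build a doubly stochastic communication matrix from an undirected
    adjacency matrix using Metropolis weights: a_ij = 1/(1 + max(d_i, d_j))
    for each active link (i, j), with the remaining mass on the diagonal.
    One standard construction, not necessarily the paper's choice."""
    m = len(adj)
    deg = adj.sum(axis=1)
    A = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j and adj[i, j]:
                A[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        A[i, i] = 1.0 - A[i].sum()   # rows (and by symmetry, columns) sum to 1
    return A

# a small 4-node graph with the links active in the current iteration
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])
A = metropolis_matrix(adj)
```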

Then, we make the following assumption on the properties of A_t.

Assumption 2. For an arbitrary node i, there exists a minimal scalar h, 0 < h < 1, and a scalar N such that

(1) a_ij(t) > 0 for (i, j) ∈ G_i(t + 1),
(2) Σ_{j=1}^m a_ij(t) = 1 and Σ_{i=1}^m a_ij(t) = 1,
(3) a_ij(t) > 0 implies that a_ij(t) ≥ h,
(4) the graph ∪_{k=1,...,N} G_i(t + k) is strongly connected for all t.

In Assumption 2, (1) and (2) state that each node computes a weighted average of the parameters, as shown in Algorithm 1; (3) ensures that the influences among the nodes are significant; and (2) and (4) ensure that the m nodes are equally influential in the long run. Assumption 2 is crucial to bounding the regret in distributed scenarios.
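Condition (4) of Assumption 2 can be checked mechanically. The hypothetical sketch below (the schedule and names are ours) verifies that a union of per-iteration communication graphs is strongly connected even though no single iteration's graph is.

```python
from collections import deque

def strongly_connected(edges, m):
    """Check strong connectivity of a directed graph on nodes 0..m-1:
    every node must be reachable from node 0 both in the graph and in
    its reverse, which is equivalent to a single strongly connected
    component."""
    def reaches_all(adj):
        seen, queue = {0}, deque([0])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        return len(seen) == m

    fwd = {u: set() for u in range(m)}
    rev = {u: set() for u in range(m)}
    for u, v in edges:
        fwd[u].add(v)
        rev[v].add(u)
    return reaches_all(fwd) and reaches_all(rev)

# Hypothetical schedule: one link is active per iteration, so no single
# iteration's graph is strongly connected, but the union over N = 3
# consecutive iterations forms a directed cycle, which is.
schedule = [[(0, 1)], [(1, 2)], [(2, 0)]]
union = sorted({e for step in schedule for e in step})
```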

3.3 Adversary Model and Protection Limits

In recent years, researchers have made great progress in machine learning and distributed computing. However, some malicious algorithms have been proposed for performing backward deduction on a machine learning algorithm, inferring the input data by observing the outputs or the intermediate parameters. In order to prevent such "stealing" actions, we must protect every parameter of the algorithm that may be exposed.

In this section, we assume two attack modes. The first is that an adversary can invade any link between nodes in our network and is free to obtain the parameters being exchanged between two nodes. In this setting, our algorithm can protect the privacy of all the nodes, since the adversary is not able to infer sensitive information from data carrying perturbation noise. The second is that an adversary controls one of the nodes and is able to obtain all data in the local source as well as the parameters from its neighboring nodes. In this worst case, the privacy of all nodes except the invaded one can still be protected. In a word, no matter which attack mode our DP algorithms face, the privacy is preserved as long as the data are differentially private. Note that while differential privacy stops the adversary from inferring sensitive information about the data, it cannot resist an invasion of the database launched by malicious attackers.

Fig. 2. Process of iterations: one node first receives the exchanged parameters, then updates its local parameter, and finally broadcasts the new parameter to its neighbors.

3.4 Differential Privacy Model

Differential Privacy. Dwork [5] first proposed the definition of differential privacy. Differential privacy enables a data miner to release statistics of its database without revealing sensitive information about any particular value. In this paper, we use differential privacy to protect the privacy of the nodes and give the following definition.

Definition 2. Let $\mathcal{A}$ denote our differentially private DOLA. Let $X = \{x^i_1, x^i_2, \ldots, x^i_T\}$ be a sequence of questions taken from an arbitrary node's local data source. Let $W = \{w^i_1, w^i_2, \ldots, w^i_T\}$ be a sequence of $T$ outputs of the node, with $W = \mathcal{A}(X)$. Then our algorithm $\mathcal{A}$ is $\epsilon$-differentially private if, given any two adjacent question sequences $X$ and $X'$ that differ in one question entry, the following holds:

$$\Pr[\mathcal{A}(X) \in W] \le e^{\epsilon} \Pr[\mathcal{A}(X') \in W]. \tag{6}$$

This inequality guarantees that whether or not an individual participates in the database makes no significant difference to the output of our algorithm, so the adversary cannot gain useful information about the individual. Furthermore, $\mathcal{A}$ provides $(\epsilon, \delta)$-differential privacy if it satisfies

$$\Pr[\mathcal{A}(X) \in W] \le e^{\epsilon} \Pr[\mathcal{A}(X') \in W] + \delta,$$

which is a slightly weaker privacy guarantee.

Differential privacy aims to diminish the difference between $\mathcal{A}(X)$ and $\mathcal{A}(X')$ by adding random noise to the outputs of learning algorithms. Thus, to ensure differential privacy, we need to know how "sensitive" the algorithm $\mathcal{A}$ is. According to [5], the magnitude of the noise depends on the largest change that a single entry in the data source can have on the output of Algorithm 1; this quantity is referred to as the sensitivity of the algorithm, formally defined below:

Definition 3 (Sensitivity). Recalling Definition 1, for any $X$ and $X'$ that differ in exactly one entry, we define the sensitivity of Algorithm 1 at the $t$th iteration as

$$S(t) = \sup_{X, X'} \|\mathcal{A}(X) - \mathcal{A}(X')\|_1. \tag{7}$$

The above norm is the $L_1$-norm. From the notion of sensitivity, we know that higher sensitivity leads to more noise if the algorithm guarantees the same level of privacy. By bounding the sensitivity $S(t)$, we determine the magnitude of the random noise needed to guarantee differential privacy. We provide the bound on $S(t)$ in Lemma 1.

Algorithm 1. $\epsilon$-Differentially Private DOLA

1: Input: $f^i_t(w) := \ell(w, x^i_t, y^i_t)$, $i \in [1, m]$ and $t \in [0, T]$; initial points $w^1_0, \ldots, w^m_0$; doubly stochastic matrix $A_t = (a_{ij}(t)) \in \mathbb{R}^{m \times m}$; maximum iterations $T$.
2: for $t = 0, \ldots, T$ do
3:   for each node $i = 1, \ldots, m$ do
4:     $b^i_t = \sum_{j=1}^m a_{ij}(t+1)(w^j_t + \sigma^j_t)$, where $\sigma^j_t$ is a Laplace noise vector in $\mathbb{R}^n$;
5:     $g^i_t \leftarrow \nabla f^i_t(b^i_t, \xi^i_t)$;
6:     $w^i_{t+1} = \text{Pro}[b^i_t - \alpha_{t+1} \cdot g^i_t]$, where $\sigma^i_t \sim \text{Lap}(\mu)$; (projection onto $W$; $\text{Lap}(\mu)$ defined in (11))
7:     broadcast the output $(w^i_{t+1} + \sigma^i_{t+1})$ to $G(t+1)_i$;
8:   end for
9: end for
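As a concrete illustration, lines 4-7 of Algorithm 1 for a single node can be sketched as follows. This is a minimal Python sketch under our own naming; in particular, we assume the convex set $W$ is the $L_2$ ball of radius $R$, which is only one possible choice.

```python
import numpy as np

def laplace_noise(scale, n, rng):
    """Draw a Laplace(scale) noise vector in R^n, with scale = S(t)/epsilon."""
    return rng.laplace(loc=0.0, scale=scale, size=n)

def project(w, R):
    """Euclidean projection onto the ball {w : ||w||_2 <= R} (our stand-in for W)."""
    norm = np.linalg.norm(w)
    return w if norm <= R else w * (R / norm)

def private_update(noisy_neighbor_params, weights, grad, alpha, scale, R, rng):
    """One pass over lines 4-7 of Algorithm 1 for node i.

    noisy_neighbor_params[j] is (w_t^j + sigma_t^j) received from neighbor j;
    weights[j] = a_ij(t+1) is a row of the doubly stochastic matrix A_{t+1}.
    """
    b = sum(a * w for a, w in zip(weights, noisy_neighbor_params))  # line 4
    g = grad(b)                                                      # line 5
    w_new = project(b - alpha * g, R)                                # line 6
    broadcast = w_new + laplace_noise(scale, w_new.size, rng)        # line 7
    return w_new, broadcast

rng = np.random.default_rng(0)
n, R = 5, 1.0
received = [rng.normal(size=n) for _ in range(3)]
weights = [0.5, 0.3, 0.2]  # one row of a doubly stochastic matrix (sums to 1)
w_new, broadcast = private_update(received, weights, grad=lambda b: 2 * b,
                                  alpha=0.1, scale=0.05, R=R, rng=rng)
```

Note that, as in the paper, only the noisy copy `broadcast` ever leaves the node; the clean iterate `w_new` stays local.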

4 DIFFERENTIALLY PRIVATE DOLA

In this section, we first present the $\epsilon$-differentially private DOLA in Algorithm 1 and study its privacy and utility in Sections 4.1 and 4.2, respectively. We then extend $\epsilon$- to $(\epsilon, \delta)$-differential privacy in Section 4.3. In handling the composition problem of differential privacy (see [31] for more details), our $\epsilon$-differentially private DOLA behaves differently from the related work in [32]; the differences are described in this section. More importantly, we bound the regret of Algorithm 1 and analyze the tradeoff between privacy and utility.

4.1 $\epsilon$-Differential Privacy Concerns

Lemma 1. Under Assumption 1, if the $L_1$-sensitivity of the algorithm is computed as in (7), we obtain

$$S(t) \le 2\alpha_t \sqrt{n}\, L, \tag{8}$$

where $n$ denotes the dimensionality of the vectors and $L$ is the bound on the (sub)gradient (see (4)).

Proof. Recall from Definition 1 that $X$ and $X'$ are any two data vectors differing in one entry. $w^i_t$ is computed based on the dataset $X$, while $w'^i_t$ is computed based on the dataset $X'$. Certainly, we have $\|\mathcal{A}(X) - \mathcal{A}(X')\|_1 = \|w^i_t - w'^i_t\|_1$.

For datasets $X$ and $X'$ we have

$$w^i_t = \text{Pro}\big[b^i_{t-1} - \alpha_t g^i_{t-1}\big] \quad \text{and} \quad w'^i_t = \text{Pro}\big[b^i_{t-1} - \alpha_t g'^i_{t-1}\big].$$

Then, we have

$$\begin{aligned}
\|w^i_t - w'^i_t\|_1 &= \big\|\text{Pro}\big[b^i_{t-1} - \alpha_t g^i_{t-1}\big] - \text{Pro}\big[b^i_{t-1} - \alpha_t g'^i_{t-1}\big]\big\|_1 \\
&\le \big\|(b^i_{t-1} - \alpha_t g^i_{t-1}) - (b^i_{t-1} - \alpha_t g'^i_{t-1})\big\|_1 \\
&= \alpha_t \|g^i_{t-1} - g'^i_{t-1}\|_1 \le \alpha_t \big(\|g^i_{t-1}\|_1 + \|g'^i_{t-1}\|_1\big) \\
&\le \alpha_t \sqrt{n}\,\big(\|g^i_{t-1}\|_2 + \|g'^i_{t-1}\|_2\big) \le 2\alpha_t \sqrt{n}\, L.
\end{aligned} \tag{9}$$

By Definition 3, we know

$$S(t) \le \|w^i_t - w'^i_t\|_1. \tag{10}$$

Hence, combining (9) and (10), we obtain (8). $\square$

1444 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 30, NO. 8, AUGUST 2018


We next determine the magnitude of the added random noise from (8). In step 7 of Algorithm 1, we use $\sigma$ to denote the random noise. $\sigma \in \mathbb{R}^n$ is a Laplace random noise vector drawn independently according to the density function

$$\text{Lap}(x \mid \mu) = \frac{1}{2\mu} \exp\!\left(-\frac{|x|}{\mu}\right), \tag{11}$$

where $\mu = S(t)/\epsilon$ and $\text{Lap}(\mu)$ denotes the Laplace distribution. (8) and (11) show that the magnitude of the added random noise depends on the sensitivity parameters: $\epsilon$, the stepsize $\alpha_t$, the dimensionality $n$ of the vectors, and the subgradient bound $L$.
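The calibration of the Laplace scale to the sensitivity bound (8) and the density (11) can be sketched directly; the function names below are ours, not the paper's.

```python
import numpy as np

def laplace_scale(alpha_t, n, L, epsilon):
    """mu = S(t)/epsilon with S(t) <= 2*alpha_t*sqrt(n)*L (Lemma 1 and Eq. (11))."""
    return 2.0 * alpha_t * np.sqrt(n) * L / epsilon

def sample_noise(alpha_t, n, L, epsilon, rng):
    """Draw the per-iteration Laplace noise vector sigma in R^n."""
    return rng.laplace(0.0, laplace_scale(alpha_t, n, L, epsilon), size=n)

rng = np.random.default_rng(1)
mu = laplace_scale(alpha_t=0.01, n=100, L=1.0, epsilon=0.5)   # = 0.4 here
noise = sample_noise(0.01, 100, 1.0, 0.5, rng)
```

As the text notes, the scale grows as $\epsilon$ shrinks: halving $\epsilon$ doubles $\mu$ and hence the expected noise magnitude.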

Lemma 2. Under Assumptions 1 and 2, at the $t$th iteration, the output of the $i$th online learner in Algorithm 1, i.e., $(w^i_t + \sigma^i_t)$, is $\epsilon$-differentially private.

Proof. Let $\tilde{w}^i_t = w^i_t + \sigma^i_t$ and $\tilde{w}'^i_t = w'^i_t + \sigma^i_t$. Then, by the definition of differential privacy (see Definition 1), $\tilde{w}^i_t$ is $\epsilon$-differentially private if

$$\Pr[\tilde{w}^i_t \in W] \le e^{\epsilon} \Pr[\tilde{w}'^i_t \in W]. \tag{12}$$

For $w \in W$, we obtain

$$\begin{aligned}
\frac{\Pr[\tilde{w}^i_t = w]}{\Pr[\tilde{w}'^i_t = w]} &= \prod_{j=1}^n \frac{\exp\!\left(-\frac{\epsilon\, |w^i_t[j] - w[j]|}{S(t)}\right)}{\exp\!\left(-\frac{\epsilon\, |w'^i_t[j] - w[j]|}{S(t)}\right)} = \prod_{j=1}^n \exp\!\left(\frac{\epsilon \big(|w'^i_t[j] - w[j]| - |w^i_t[j] - w[j]|\big)}{S(t)}\right) \\
&\le \prod_{j=1}^n \exp\!\left(\frac{\epsilon\, |w'^i_t[j] - w^i_t[j]|}{S(t)}\right) = \exp\!\left(\frac{\epsilon \|w'^i_t - w^i_t\|_1}{S(t)}\right) \le \exp(\epsilon),
\end{aligned} \tag{13}$$

where the first inequality follows from the triangle inequality and the last inequality follows from (10). $\square$

Composition Theorem. In Lemma 2, we provide $\epsilon$-differential privacy for each iteration of Algorithm 1. By the composition theorem [11], [33], the privacy level of Algorithm 1 degrades after $T$-fold adaptive composition. Next, we study how much privacy is still guaranteed after $T$ iterations. The worst-case privacy guarantee is at most the "sum" of the differential privacy guarantees of the individual iterations.

Theorem 1 ([6], [31], [33]). For $\epsilon, \delta \ge 0$, the class of $(\epsilon, \delta)$-differentially private algorithms provides $(T\epsilon, T\delta)$-differential privacy under $T$-fold adaptive composition.

The worst privacy occurs when an adversary applies many queries to the same database. If our distributed learning algorithm runs in an offline way (i.e., each update depends on the gradients of all the data points over the $T$ iterations), the offline algorithm suffers the "sum" of the differential privacy described in Theorem 1. Dwork et al. (2010) [33] demonstrated that if a slightly larger value of $\delta$ is allowed, one can obtain a significantly better privacy level w.r.t. $\epsilon$.

Theorem 2 ([33]). For $\epsilon, \delta \ge 0$, the class of $(\epsilon, \delta)$-differentially private algorithms provides $(\epsilon', T\delta + \delta')$-differential privacy under $T$-fold adaptive composition, for

$$\epsilon' = T\epsilon(e^{\epsilon} - 1) + \epsilon\sqrt{2T\log(1/\delta')}. \tag{14}$$
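The gap between the basic composition of Theorem 1 and the advanced composition of (14) is easy to compute numerically; for small per-step $\epsilon$ the advanced bound is far tighter. A sketch:

```python
import math

def basic_composition(eps, T):
    """Theorem 1: T-fold composition of eps-DP steps costs T*eps."""
    return T * eps

def advanced_composition(eps, T, delta_prime):
    """Theorem 2 / Eq. (14):
    eps' = T*eps*(e^eps - 1) + eps*sqrt(2*T*log(1/delta'))."""
    return (T * eps * (math.exp(eps) - 1.0)
            + eps * math.sqrt(2.0 * T * math.log(1.0 / delta_prime)))

T, eps = 10_000, 0.01
naive = basic_composition(eps, T)            # 100
tight = advanced_composition(eps, T, 1e-5)   # roughly 5.8, much smaller
```

The price of the tighter $\epsilon'$ is the extra additive $\delta'$ in the privacy guarantee.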

Furthermore, if the queries are applied to disjoint subsets of the data, the privacy guarantee does not degrade across iterations [32]. This setting is the best case for iterative algorithms. However, a learning algorithm never obtains such good privacy guarantees, since it learns from the previous sample data and parameters. Hence, we have the following theorem.

Theorem 3. The class of our distributed $\epsilon$-differentially private online learning algorithms cannot guarantee privacy level $\epsilon$ under $T$-fold adaptive composition.

Proof. By the composition theorem in [11], [33], the privacy level of an iterative algorithm degrades after $T$-fold adaptive composition. $\square$

Our DOLA neither suffers the worst privacy level in Theorem 1 nor achieves the best privacy guarantees. Specifically, in lines 4 and 7 of Algorithm 1, we add $\sigma^i_{t+1}$ to mask the real value of $w^i_{t+1}$; in other words, we protect the privacy of the data point $\xi^i_t$ at the $t$th iteration. Using the same perturbation, we protect the privacy of the data point $\xi^i_{t+1}$ at the $(t+1)$th iteration. An important question is whether the privacy level of each data point degrades after several iterations. The following theorem answers: yes, but only slightly. Although every data point is used in only one iteration, it affects the subsequent updates, so an adversary may infer some information about used data points from later updates. Hence, the privacy level of each data point in the DOLA degrades slightly, and we obtain the following theorem.

Theorem 4. Under Assumptions 1 and 2, Algorithm 1 intuitively provides $T\epsilon$-differential privacy, or achieves the better privacy level $(\epsilon', \delta')$ for

$$\epsilon' = T\epsilon\epsilon_0 + \epsilon\sqrt{2T\log(1/\delta')},$$

where $\epsilon_0 = e^{\epsilon} - 1$ and $\delta' \in [0, 1]$.

Proof. This proof follows from the above analysis, the proof of Theorem 1 in [11], and Theorem III.3 in [33] with $\delta = 0$. $\square$

Above all, $\epsilon$-DP is guaranteed for a single node at every iteration of Algorithm 1, and $\epsilon'$-DP is guaranteed for Algorithm 1 after $T$ iterations. Recalling the analysis in Section 3.3, the real privacy of Algorithm 1 is tightly associated with the adversary's behavior. If an adversary invades one of the nodes at a certain moment, Algorithm 1 is $\epsilon$-DP at that moment. Furthermore, Algorithm 1 becomes $\epsilon'$-DP if the adversary keeps observing the node for $T$ iterations. The utility analysis of Algorithm 1 in Section 4.2 assumes the situation where Algorithm 1 is invaded from the beginning (at the first iteration). Hence, the utility in Section 4.2 and the $\epsilon'$-DP guarantee are the worst-case results in our paper.

4.2 Utility of Algorithm 1

In this section, we study the regret bounds of Algorithm 1 for general convex and strongly convex objective functions, respectively. We provide rigorous derivations of the final results: $O(\sqrt{T})$ for general convex objective functions and $O(\log T)$ for strongly convex objective functions. All proofs in this section are provided in the supplement, available online.

Based on the communication rules shown in Figs. 1 and 2, we have the following lemma.

Lemma 3 ([28]). Suppose that at each iteration $t$, the matrix $A_t$ satisfies the description in Assumption 2. Then:

(1) $\lim_{k \to \infty} \phi(k, s) = \frac{1}{m} e e^T$ for all $k, s \in \mathbb{Z}$ with $k \ge s$, where

$$\phi(k, s) = A(k) A(k-1) \cdots A(s+1). \tag{15}$$

(2) Further, the convergence is geometric and the rate of convergence is given by

$$\left| \phi(k, s)_{ij} - \frac{1}{m} \right| \le \theta \beta^{k-s}, \tag{16}$$

where

$$\theta = \left(1 - \frac{\eta}{4m^2}\right)^{-2}, \qquad \beta = \left(1 - \frac{\eta}{4m^2}\right)^{1/N}.$$

Lemma 3 is used repeatedly in the proofs of the following lemmas. Next, we study the convergence of Algorithm 1 in detail. We use the subgradient descent method to make $w^i_t$ move toward the theoretically optimal solution. To estimate the learning rate of the system, computing the norms $\|w^i_t - w^j_t\|$ does not help. Instead, we study the behavior of $\|\bar{w}_t - w^i_t\|$, where for all $t$, $\bar{w}_t$ is defined by

$$\bar{w}_t = \frac{1}{m} \sum_{i=1}^m w^i_t. \tag{17}$$

Lemma 4. Under Assumptions 1 and 2, for all $i \in \{1, \ldots, m\}$ and $t \in \{1, \ldots, T\}$, we have

$$\|\bar{w}_t - w^i_t\| \le mL\theta \sum_{k=1}^{t-1} \beta^{t-k} \alpha_k + \theta \sum_{k=1}^{t-1} \beta^{t-k} \sum_{i=1}^m \|\sigma^i_k\| + 2\alpha_t L. \tag{18}$$

Proof. See Appendix A in the supplement, available online. $\square$

Lemma 5. Under Assumptions 1 and 2, for any $w \in W$ and for all $t$, we have

$$\begin{aligned}
\|\bar{w}_{t+1} - w\| \le{}& \left(1 + 2\alpha_{t+1}L + 2L + \frac{2}{m}\sum_{i=1}^m \|\sigma^i_t\| - 2\xi\right)\|\bar{w}_t - w\| - \frac{2}{m}\big(f_t(\bar{w}_t) - f_t(w)\big) \\
&+ 4L\,\frac{1}{m}\sum_{i=1}^m \|\bar{w}_t - w^i_t\| + \left\|\frac{1}{m}\sum_{i=1}^m \big(\sigma^i_t + d^i_{t+1}\big)\right\|^2.
\end{aligned} \tag{19}$$

Proof. See Appendix B in the supplement, available online. $\square$

Lemmas 4 and 5 are used in the proof of Lemma 6. Lemma 6 gives the generic regret bound for our private DOLA (see Algorithm 1). More importantly, Lemma 6 indicates that the learnable parameters $w^1, w^2, \ldots, w^m$ approach $\bar{w}$; in other words, the values of the learnable parameters approach each other over the iterations.

Lemma 6. Let $w$ denote the optimal solution computed in hindsight. The regret $R_D$ of Algorithm 1 is given by

$$\begin{aligned}
\sum_{t=1}^T \big(f_t(w^i_t) - f_t(w)\big) \le{}& \left(mRL + \frac{3\beta\theta m^2 L^2}{1-\beta} + \frac{13}{2}mL^2\right)\sum_{t=1}^T \alpha_t \\
&+ \left(\frac{3\beta\theta m L}{1-\beta} + 2L + \frac{1}{2m}\right)\sum_{t=1}^T \sum_{i=1}^m \|\sigma^i_t\| + \frac{mR}{2}.
\end{aligned} \tag{20}$$

Proof. See Appendix C in the supplement, available online. $\square$

Observing (20), $m$, $R$, and $L$ directly influence the bound. A faster learning rate $\alpha_t$ reduces the convergence time but increases the regret bound. As expected, the perturbation noise contributes substantially to the bound, and a higher privacy level increases the regret bound. In short, we have the following observations: 1) a larger scale of distributed online learning (i.e., bigger $m$, $L$, and $R$) leads to a higher regret bound; 2) choosing an appropriate $\alpha_t$ is a tradeoff between the convergence rate and the regret bound; 3) the choice of $\epsilon$ is likewise a tradeoff between the privacy level and the regret bound.

Finally, we give two different regret bounds according tothe setting of objective functions in the following theorem.

Theorem 5. Based on Lemma 6, if $\xi > 0$ and we set $\alpha_t = \frac{1}{\xi t}$, then the expected regret of our DOLA satisfies

$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^T f_t(w^i_t)\right] - \sum_{t=1}^T f_t(w) \le{}& \frac{mL}{\xi}\left(R + \frac{3\beta\theta m L}{1-\beta} + \frac{13}{2}L\right)(1 + \log T) \\
&+ \left(\frac{3\beta\theta m L}{1-\beta} + 2L + \frac{1}{2m}\right)\frac{2\sqrt{2}\,mnL}{\xi\epsilon}(1 + \log T) + \frac{mR}{2},
\end{aligned} \tag{21}$$

and if $\xi = 0$ and we set $\alpha_t = \frac{1}{2\sqrt{t}}$, then

$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^T f_t(w^i_t)\right] - \sum_{t=1}^T f_t(w) \le{}& mL\left(R + \frac{3\beta\theta m L}{1-\beta} + \frac{13}{2}L\right)\left(\sqrt{T} - \frac{1}{2}\right) \\
&+ \left(\frac{3\beta\theta m L}{1-\beta} + 2L + \frac{1}{2m}\right)\frac{2\sqrt{2}\,mnL}{\epsilon}\left(\sqrt{T} - \frac{1}{2}\right) + \frac{mR}{2}.
\end{aligned} \tag{22}$$



Proof. See Appendix D in the supplement, available online. $\square$

Remark 1. Observing (21) and (22), the regret bound is closely related to $\theta$, $\beta$, $m$, and $\epsilon$. A higher privacy level (a lower $\epsilon$) makes the bound higher. Interestingly, the parameters $\theta$ and $\beta$ reflect the "density" of the network: the more connected the network is, the higher the density and the lower the regret bound. Note that a more connected network leads to a better utility of the DOLA, but it does not change the privacy level, since our private DOLA protects every node at each iteration; the privacy level is determined only by the noise and the algorithm's update rules. As shown in Algorithm 1, the Laplace noise is added to $w$ after the projection step, so $b^i_t$ may not lie in the convex domain, and the strong convexity and bounded-gradient properties may not hold. Indeed, this situation may occur when the privacy level is very high: the large noise that provides the high privacy level can push $b^i_t$ out of the convex domain so that the updated parameter $w$ is swallowed up by noise. Hence, we assume that the noise providing differential privacy is not large enough to push $b^i_t$ out of the convex domain. This assumption is tested in the experiments (see Fig. 3), which show that a reasonable privacy level does not destroy the convergence of the DOLA unless $\epsilon$ gets too small.

Algorithm 2. $(\epsilon, \delta)$-Differentially Private DOLA

1: Input: cost functions $f^i_t(w) := \ell(w, x^i_t, y^i_t)$, $i \in [1, m]$ and $t \in [0, T]$; initial points $w^1_0, \ldots, w^m_0$; doubly stochastic matrix $A_t = (a_{ij}(t)) \in \mathbb{R}^{m \times m}$; maximum iterations $T$.
2: Set $J^2 = \dfrac{4\alpha_t^2 n L^2 T \log(T/\delta) \log(1/\delta)}{\epsilon^2}$;
3: for $t = 0, \ldots, T$ do
4:   for each node $i = 1, \ldots, m$ do
5:     $b^i_t = \sum_{j=1}^m a_{ij}(t+1)\, \tilde{w}^j_t$;
6:     $g^i_t \leftarrow \nabla f^i_t(b^i_t, \xi^i_t)$;
7:     $\tilde{w}^i_{t+1} = \text{Pro}\big[b^i_t - \alpha_{t+1}(g^i_t + \sigma^i_{t+1})\big]$, where $\sigma^i_{t+1} \sim \mathcal{N}(0, J^2)$; (projection onto $W$)
8:     broadcast the output $\tilde{w}^i_{t+1}$ to $G(t)_i$;
9:   end for
10: end for
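Lines 2 and 7 of Algorithm 2 can be sketched as follows (our own naming; the convex set $W$ is again taken to be an $L_2$ ball purely for illustration).

```python
import numpy as np

def gaussian_sigma_sq(alpha_t, n, L, T, eps, delta):
    """J^2 from line 2 of Algorithm 2:
    J^2 = 4*alpha_t^2*n*L^2*T*log(T/delta)*log(1/delta) / eps^2."""
    return (4.0 * alpha_t**2 * n * L**2 * T
            * np.log(T / delta) * np.log(1.0 / delta) / eps**2)

def private_gradient_step(b, grad, alpha_next, J2, R, rng):
    """Line 7: perturb the gradient with N(0, J^2) noise, then project
    onto the L2 ball of radius R (our stand-in for W)."""
    sigma = rng.normal(0.0, np.sqrt(J2), size=b.size)
    w = b - alpha_next * (grad(b) + sigma)
    norm = np.linalg.norm(w)
    return w if norm <= R else w * (R / norm)

rng = np.random.default_rng(2)
J2 = gaussian_sigma_sq(alpha_t=0.01, n=10, L=1.0, T=1000, eps=1.0, delta=1e-5)
w_next = private_gradient_step(np.ones(10), lambda b: 2 * b, 0.01, J2, R=1.0, rng=rng)
```

Unlike Algorithm 1, the noise here enters before the projection, so the broadcast parameter is guaranteed to stay inside $W$.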

4.3 $(\epsilon, \delta)$-Differentially Private DOLA

In this section, we use Gaussian noise instead of Laplace noise to provide $(\epsilon, \delta)$-differential privacy guarantees for the DOLA. Several works (e.g., [11], [29]) have given solid approaches to guaranteeing $(\epsilon, \delta)$-differential privacy of their learning algorithms. Building on these works, we study a Gaussian perturbation in the DOLA and analyze its utility. According to [8], $(\epsilon, \delta)$-differential privacy provides a slightly weaker privacy guarantee than $\epsilon$-differential privacy, but is more robust to the number of nodes.

Theorem 6 (Privacy guarantee). Algorithm 2 guarantees $(\epsilon, \delta)$-differential privacy.

Proof. At any time step $t$, we have $G_t(X) = (g^i_t + \sigma^i_{t+1}) \sim \mathcal{N}(g^i_t, J^2)$, which is a random variable computed over the noise $\sigma^i_{t+1}$ and conditioned on $\tilde{w}^i_{t+1}$ (combining lines 5, 6, and 7 of Algorithm 2). Let $c_{G_t(X)}(v)$ denote the measure of $G_t(X)$ w.r.t. a variable $v$. To measure the privacy loss of the algorithm, we use the random variable $K_t = \left|\log \frac{c_{G_t(X)}(G_t(X))}{c_{G_t(X')}(G_t(X))}\right|$ (see [29], [33]). Now, according to [34], [35], the Gaussian noise perturbation guarantees that, with high probability $1 - \frac{\delta}{2}$ over the randomness of $\sigma$, $K_t \le \epsilon / \sqrt{T \log(1/\delta)}$. To conclude the proof, we use the composition result of [33], which shows that the class of $(\epsilon, \delta)$-differentially private algorithms satisfies $(\epsilon'', t\delta + \delta')$-differential privacy for $\epsilon'' = t\epsilon(e^{\epsilon} - 1) + \epsilon\sqrt{2t\log(1/\delta')}$. Hence, with probability at least $1 - \delta$, the privacy loss $K = \sum_{t=1}^T K_t$ is at most $\epsilon$, which concludes the proof. $\square$

Theorem 7 (Utility guarantee). Let $J^2 = O\!\left(\frac{4\alpha_t^2 n L^2 T \log(T/\delta)\log(1/\delta)}{\epsilon^2}\right)$. Our $(\epsilon, \delta)$-differentially private DOLA (Algorithm 2) obtains the following expected regret:

$$\mathbb{E}\left[\sum_{t=1}^T f_t(w^i_t)\right] - \sum_{t=1}^T f_t(w) \le \left(mRB + \frac{3\beta\theta m^2 B^2}{1-\beta} + \frac{13}{2}mB^2\right) O_T + \frac{mR}{2}, \tag{23}$$

where $B = L + \frac{2 O_T \sqrt{nT} \log^{1/2}(T/\delta)\log(1/\delta)}{\epsilon}$. For strongly convex objective functions, $\xi > 0$, $\alpha_t = \frac{1}{\xi t}$, and $O_T = 1 + \log T$; for general convex objective functions, $\xi = 0$, $\alpha_t = \frac{1}{2\sqrt{t}}$, and $O_T = \sqrt{T} - \frac{1}{2}$.

Proof. Comparing Algorithms 1 and 2, only the noise distributions differ, so the analysis of this bound is essentially the same. Due to space limitations, we omit the proof, which follows the proof of Theorem 5. $\square$

Fig. 3. (a) and (b) Regret versus privacy: average regret generated by Algorithms 1 and 2, tested on the Diabetes 130-US hospitals dataset and normalized by the number of iterations. (c) and (d) Regret versus topology: Algorithms 1 and 2 experimented on three different topologies.

Remark 2. In Algorithm 1 (see line 7), the noise $\sigma^i_{t+1}$ is directly added to the learnable parameter $w^i_{t+1}$, and the Laplace output perturbation $(w^i_{t+1} + \sigma^i_{t+1})$ is broadcast to the neighbors. In Algorithm 2 (see line 7), we instead put the Gaussian noise $\sigma^i_{t+1}$ inside the update of $w$ and obtain the differentially private learnable parameter $\tilde{w}^i_{t+1}$. Hence, the parameter $w$ cannot get out of the convex domain. In the proofs of differential privacy, Algorithms 1 and 2 both exploit the composition theorem. In fact, $(\epsilon, \delta)$-DP is a looser privacy requirement than $\epsilon$-DP. However, in this paper we provide entirely different methods to guarantee the $(\epsilon, \delta)$- and $\epsilon$-private DOLA, so it is not rigorous to compare the utility of Algorithms 1 and 2 based on their regret bounds. We can only say that $(\epsilon, \delta)$-DP needs a smaller amount of noise than $\epsilon$-DP at the same privacy level.

Discussion. In this section, we have provided a differentially private DOLA over a dynamic and random graph. A fixed communication graph can also be used in distributed online learning (e.g., [24]). Let $A$ denote the fixed communication matrix. As in Lemma 3, the limiting behavior of $A^k$ as $k \to \infty$ is the key point. Since J. S. Liu [36] gives the bound in Chapter 12, we directly use the following lemma for simplicity.

Lemma 7 ([36]). Let $A_{ij}$ be the $(i, j)$th element of $A$. For any $j$, we have

$$\sum_i \left| A^k_{ij} - \frac{1}{m} \right| \le C\gamma^k,$$

where $C = 2$ and $\gamma$ relates to the minimum nonzero values of $A$.

In view of the proofs of the utility of Algorithm 1 (see Appendices A-D, available online), we can rewrite the lemmas and theorems to give the regret bound for a fixed communication graph. Hence, our distributed differentially private method can also be used in a fixed network.

5 SPARSE PRIVATE DOLA

In the big data era, data is not only large-scale but often high-dimensional. Without applying sparsity to the data, the meaningless entries in some dimensions would negatively affect the utility of our private DOLA, so we propose a sparse version of the private DOLA in this section.

We also use the system model and parameters defined earlier in the article. Let $R_S$ denote the regret of Algorithm 3, which has the same form as (2) in Definition 1. The proposed private sparse DOLA is shown in Algorithm 3.

Algorithm 3. Sparse Private Distributed Online Learning

1: Input: $f^i_t(w) := \ell(w, x^i_t, y^i_t)$, $i \in [1, m]$ and $t \in [1, T]$; doubly stochastic matrix $A_t = (a_{ij}(t)) \in \mathbb{R}^{m \times m}$.
2: Initialization: $\vartheta^i_1 = 0$, $i \in [1, m]$;
3: for $t = 1, \ldots, T$ do
4:   for each node $i = 1, \ldots, m$ do
5:     receive $x^i_t \in \mathbb{R}^n$;
6:     $p^i_t = \nabla \varphi_t(\vartheta^i_t)$;
7:     $w^i_t = \arg\min_w \left\{ \frac{1}{2}\|p^i_t - w\|_2^2 + \rho \|w\|_1 \right\}$;
8:     predict $y^i_t$;
9:     receive $y^i_t$ and obtain $f^i_t(w^i_t) := \ell(w^i_t, x^i_t, y^i_t)$;
10:    broadcast to neighbors: $\tilde{\vartheta}^i_{t+1} = \vartheta^i_{t+1} + d^i_t$;
11:  end for
12: end for
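Line 7 of Algorithm 3 is the proximal operator of the $L_1$ norm and has a well-known closed-form soft-thresholding solution, which is where the sparsity comes from. A sketch:

```python
import numpy as np

def soft_threshold(p, rho):
    """Closed-form solution of line 7 of Algorithm 3:
    argmin_w (1/2)*||p - w||_2^2 + rho*||w||_1 = sign(p) * max(|p| - rho, 0)."""
    return np.sign(p) * np.maximum(np.abs(p) - rho, 0.0)

p = np.array([0.9, -0.05, 0.2, -1.4, 0.0])
w = soft_threshold(p, rho=0.1)          # -> [0.8, 0.0, 0.1, -1.3, 0.0]
sparsity = np.mean(w == 0.0)            # fraction of zeroed coordinates: 0.4
```

Raising `rho` zeroes more coordinates, which is how fine-tuning $\rho_t$ controls the sparsity-accuracy tradeoff discussed in Remark 3.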

According to lines 10 and 11 of Algorithm 3, the differential privacy mechanism of Algorithm 3 follows from Algorithm 1. Hence, the privacy analysis of the sparse private DOLA is nearly the same as that in Section 4.1. For Algorithm 3, we mainly study its utility performance and do not give a detailed privacy analysis.

To bound $R_S$, we first present a special lemma below.

Lemma 8. Let $\varphi_t$ be $\beta$-strongly convex functions with norms $\|\cdot\|_{\varphi_t}$ and dual norms $\|\cdot\|_{\varphi_t^*}$. While Algorithm 3 runs, the following inequality holds:

$$\sum_{t=1}^T \sum_{i=1}^m (\bar{w}_t - w)^T g_t \le \frac{m\varphi_T(w)}{\alpha_t} + \frac{1}{\alpha_t} \sum_{t=1}^T \sum_{i=1}^m \left[ \varphi_t(\bar{\vartheta}_t) - \varphi_{t+1}(\bar{\vartheta}_t) + \frac{\alpha_t^2}{2\beta}\|g_t\|_2^2 + \alpha_t \|d_t\|_2 + \rho_t \|g_t\|_1 \right]. \tag{24}$$

Proof. See Appendix E in the supplement, available online. $\square$

Based on Lemma 8, we readily obtain the regret bound of Algorithm 3.

Theorem 8. Set $\varphi_t(w) = \frac{1}{2}\|w\|_2^2$, which is 1-strongly convex, and let $\rho_t = \alpha_t \rho$. Then the regret bound is

$$R_S \le R\sqrt{(L + \rho)mTL} + \frac{2\sqrt{2}\,m^2 n L}{\epsilon}\left(\sqrt{T} - \frac{1}{2}\right). \tag{25}$$

Proof. See Appendix F in the supplement, available online. $\square$

Remark 3. According to Theorem 8, Algorithm 3 obtains the classical square-root regret $O(\sqrt{T})$; unfortunately, we cannot tighten the regret to $O(\log T)$. Due to the property of the Lasso algorithm, we can set the sparsity of the final result by fine-tuning $\rho_t$ in Algorithm 3. A proper sparsity improves the utility of a learning model; if the sparsity is too high or too low, the learned model predicts poorly. So the value of $\rho_t$ strongly influences the utility of Algorithm 3. We will run experiments on the sparsity-accuracy tradeoff and find the sparsity that yields the highest accuracy.

6 EXTENSION

We have proposed a private DOLA to solve the real-time learning problem in a distributed setting, and it can be exploited more widely. In this section, we convert our DOLA into a distributed stochastic optimization method and improve the privacy and utility of the algorithm via mini-batch updates [20].

6.1 Application to Stochastic Learning Optimization

Stochastic optimization for machine learning is an important tool in the big data era. Batch learning needs a large set of training samples to minimize the empirical risk function (e.g., $\frac{1}{n}\sum_{i=1}^n f_i(w; \xi_i) + \rho(w)$), which costs much time and many resources. The stochastic gradient descent (SGD) algorithm instead updates at each iteration using a single sample drawn randomly from the dataset; its outstanding advantage is that SGD substantially reduces the computing resources. Hence, we pay some attention to stochastic optimization in the distributed setting. Recently, Cesa-Bianchi et al. [37] showed that online learning and stochastic optimization are interchangeable.

Theorem 9 ([37]). An online algorithm with regret guarantee

$$\frac{1}{T}\sum_{t=1}^T f^i_t(w^i_t; \xi^i_t) - \min_{w \in W} \frac{1}{T}\sum_{t=1}^T f_t(w; \xi^i_t) \le R(T) \tag{26}$$

can be converted into a stochastic optimization algorithm, with high probability, by outputting the average of the iterates:

$$F(\bar{w}_T) - F(w) \le R(T),$$

where $\bar{w}_T = \frac{1}{T}\sum_{t=1}^T \bar{w}_t$ and $F(w) = \mathbb{E}[f(w; \xi)]$.

Exploiting Theorem 9, the regret bounds of our private DOLA can be converted into convergence rates and generalization bounds. Following (26), we should compute the bound on $R(T) = \frac{1}{T}\sum_{t=1}^T f_t(w^i_t) - \frac{1}{T}\sum_{t=1}^T f_t(w)$, where $f_t$ is short for $\sum_{i=1}^m f^i_t$. Recall that our private DOLA achieves an $O(\log T)$ regret bound for strongly convex functions and $O(\sqrt{T})$ for convex functions. Hence, we have Lemma 9.

Lemma 9. Based on Theorems 5 and 7, our private DOLA satisfies the following inequalities with high probability $1 - \delta$:

if $f$ is general convex,

$$\frac{1}{T}\sum_{t=1}^T f_t(w^i_t) - \frac{1}{T}\sum_{t=1}^T f_t(w) \le O\!\left(\sqrt{\frac{\log(1/\delta)}{T}}\right);$$

if $f$ is strongly convex,

$$\frac{1}{T}\sum_{t=1}^T f_t(w^i_t) - \frac{1}{T}\sum_{t=1}^T f_t(w) \le O\!\left(\frac{\log T \cdot \log(1/\delta)}{T}\right).$$

Intuitively, the average regret bound $R(T)$ is $O(1/\sqrt{T})$ or $O(\log T / T)$. By Theorem 9, our private DOLA can thus be converted into a distributed stochastic learning algorithm that also abides by the communication rule (Assumption 2) and the assumptions on the objective functions (Assumption 1).

Theorem 10. If the problem defined in Fig. 1 must be solved by a stochastic method, our private DOLA can be converted into a corresponding stochastic algorithm whose rate of convergence satisfies

$$\mathbb{E}[f(w; \xi)] - f(w) \le R(T),$$

where $R(T) = O(1/\sqrt{T})$ for general convex functions and $R(T) = O(\log T / T)$ for strongly convex functions.

In view of related stochastic optimization research [38], [39], [40], the stochastic convergence rate obtained by converting our DOLA is reasonable. However, the following experiment (see Fig. 6a) shows that this application suffers from random noise and therefore does not achieve good convergence accuracy.

6.2 Private DOLA Using Mini-Batches

We have provided $\epsilon$- and $(\epsilon, \delta)$-differential privacy for the DOLA. Although the DOLA handles data in real time, it may not be fast enough for an extremely fast incoming data stream. Motivated by [41], we use the mini-batch method to process multiple samples in the same iteration. Recall that Algorithms 1 and 2 compute the (sub)gradient at each iteration over a single sample, $g^i_t \leftarrow \nabla f^i_t(b^i_t, \xi^i_t)$. With mini-batch updates, the learning algorithm at each step computes the (sub)gradient over a small set $H_t$ of data instead of a single sample. We give the details of this improvement in Algorithm 4.

Algorithm 4. Private DOLA Using Mini-Batches

1: Input: $f^i_t(w) := \ell(w, x^i_t, y^i_t)$, $i \in [1, m]$ and $t \in [1, T]$; doubly stochastic matrix $A_t = (a_{ij}(t)) \in \mathbb{R}^{m \times m}$.
2: for $t = 0, \ldots, T$ do
3:   for each node $i = 1, \ldots, m$ do
4:     $b^i_t = \sum_{j=1}^m a_{ij}(t+1)(w^j_t + \sigma^j_t)$, where $\sigma^j_t$ is a Laplace noise vector in $\mathbb{R}^n$;
5:     $g^i_k \leftarrow \nabla f^i_k(b^i_t)$, computed on the examples $(x_k, y_k) \in H_t$;
6:     $w^i_{t+1} = \text{Pro}\big[b^i_t - \alpha_{t+1} \cdot \frac{1}{h}\sum_{(x_k, y_k) \in H_t} g^i_k\big]$; (projection onto $W$)
7:     broadcast the output $(w^i_{t+1} + \sigma^i_{t+1})$ to $G(t)_i$;
8:   end for
9: end for
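The effect of mini-batching on the sensitivity (Lemma 1 versus Lemma 10) can be sketched as follows; the function names are ours.

```python
import numpy as np

def minibatch_gradient(b, batch, subgrad):
    """Lines 5-6 of Algorithm 4: average the (sub)gradients over the
    mini-batch H_t instead of using a single sample."""
    return np.mean([subgrad(b, x, y) for x, y in batch], axis=0)

def sensitivity(alpha_t, n, L, h=1):
    """Lemma 1 gives S(t) <= 2*alpha_t*sqrt(n)*L for h = 1; Lemma 10
    divides it by the batch size h, so the same epsilon needs h-times
    less noise."""
    return 2.0 * alpha_t * np.sqrt(n) * L / h

S_single = sensitivity(0.01, 100, 1.0, h=1)
S_batch = sensitivity(0.01, 100, 1.0, h=50)   # 50x smaller sensitivity

batch = [(np.ones(3), 1.0), (np.zeros(3), -1.0)]
g = minibatch_gradient(np.zeros(3), batch, lambda b, x, y: -y * x)
```

A single sample changes only one of the $h$ averaged terms, which is exactly why the sensitivity, and hence the Laplace scale, shrinks by the factor $1/h$.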

In line 6 of Algorithm 4, the learnable parameter $w$ is updated on an average of (sub)gradients, $\frac{1}{h}\sum g^i_k$, where $h$ denotes the number of samples in $H_t$. Most importantly, the mini-batch has a great advantage for guaranteeing differential privacy in Algorithm 4. Revisiting the proof of Lemma 1, we find that the factor $1/h$ reduces the sensitivity of the algorithm, as shown in the following lemma.

Lemma 10 (Sensitivity of Algorithm 4). With mini-batch updates, the sensitivity of Algorithm 4 is

$$S_2(t) \le \frac{2\alpha_t \sqrt{n}\, L}{h}. \tag{27}$$

Proof. This proof follows from the proof of Lemma 1. $\square$

Intuitively, Algorithm 4 has a much smaller sensitivity than Algorithms 1 and 2. According to the analysis of differential privacy, providing the same privacy level for Algorithm 4 therefore needs less noise. The disadvantage of Algorithm 4 is that it takes slightly more time to collect $h$ data entries at each iteration in an online system. As with Algorithms 1 and 2, Algorithm 4 can guarantee $\epsilon$- or $(\epsilon, \delta)$-differential privacy; hence we omit detailed descriptions of the privacy guarantees. As for the utility of Algorithm 4, we carry out related experiments in the next section.

7 EXPERIMENTS

In this section, we conduct numerical experiments to test the theoretical analyses of Sections 4, 5, and 6. Specifically, we first study the privacy-regret tradeoff of Algorithms 1 and 2. Then, we study the influence of the topology of the communication graph on the utility. Further, we test the utilities of Algorithms 1 and 2 on three real-world datasets of different scales. To verify the feasibility of converting the DOLA to distributed stochastic optimization, we output the average (see Fig. 5a) of the $t$ online updated parameters, i.e., $\frac{1}{t}\sum_{l=1}^t w^i_l$. Finally, we study the improvement in privacy and utility from using mini-batches.



7.1 Differentially Private Support Vector Machines

In the experiments, we develop a series of differentially private SVM algorithms. We use the hinge loss function $f^i_t(w) = \max\{0,\, 1 - y^i_t \langle w, x^i_t \rangle\}$, where $(x^i_t, y^i_t) \in \mathbb{R}^n \times \{-1, 1\}$ are the data available only to the $i$th learner. For fast convergence, we set the learning rate $\alpha_t = \frac{1}{\xi t}$. The dataset is divided into $m$ subsets. Each node updates its parameter based on its own subset and exchanges its parameter with its neighbors in a timely manner. Note that at iteration $t$, the $i$th learner must exchange the parameter $w^i_t$ in strict accordance with Assumption 2. To better evaluate our method, we plot the normalized error bounds (i.e., the "Regret" on the y-axis) in all experiment figures.

7.2 Datasets and Pre-Processing

First, we chose real datasets from UCI machine learningRepository (see Table 2). The main dataset we use for testingis Diabetes 130-US hospitals. This dataset contains 10 years(1999-2008) of clinical care from 130 US hospitals and inte-grated delivery networks. It includes over 50 features repre-senting patient, which contains such attributes as patientnumber, race, gender, age, admission type, HbA1c testresult, diagnosis and so on. In this dataset, we want to pre-dict whether one patient will experience emergency visits inthe year before the hospitalization. If we succeed in predict-ing this specific condition, medical help can be offered intime when needed. Besides the medical data, Covertype andAdult from UCI are used for comparison. Covertype is usedfor predicting forest cover type from cartographic variablesand the classification task is to predict whether a personmakes over 50K per year. Then, we use two high-dimensional datasets Real-sim and RCV1 from LIBSVM,which will be specifically tested on Algorithm 3.

For pre-processing, we follow [10]. We remove all entries with missing data points and convert the categorical attributes to binary vectors. For non-numerical attributes such as gender and nationality, we replace the values by their occurrence frequency in their category. More importantly, we first normalize each column so that its maximum value is 1, and then scale each data example so that its norm is at most 1.
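The two normalization steps can be sketched as follows, assuming a dense feature matrix with one example per row (the helper name is ours):

```python
import numpy as np

def normalize(X):
    """Scale each column so its maximum absolute value is 1, then
    scale each row so its Euclidean norm is at most 1."""
    col_max = np.abs(X).max(axis=0)
    col_max[col_max == 0] = 1.0           # leave all-zero columns alone
    X = X / col_max
    row_norm = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(row_norm, 1.0)  # shrink only rows with norm > 1

X = normalize(np.array([[2.0, 0.0], [4.0, 3.0]]))
```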

7.3 Implementation

Many distributed computing platforms, such as Hadoop, Spark, and OpenMPI, are able to deploy distributed and parallel computing algorithms. Since the nodes of our DOLA need to randomly exchange parameters with some neighboring nodes, we must be able to conduct the communications among the nodes without restrictions, so we use OpenMPI to schedule the communications. We run the experiment program on Ubuntu 14.04 equipped with 64 CPUs (8 cores) and 128 GB of memory; 2 GB of memory is assigned to each CPU. The program is implemented in C. Specifically, we deploy our experiments as follows:

In step 1, each CPU is regarded as one node in our private DOLA. We equally distribute the dataset to the $m$ nodes. Most importantly, we define the logical neighboring relations among the nodes, which aims at simulating physical neighboring relations. After doing this, we have a topology of this distributed system, as shown in Fig. 1. For a solid and simple analysis, we design three different topologies, respectively defined as Topology 1, 2, and 3. The distribution of the nodes in Topology 1 is sparse, which means that each node has a small number of neighbors. The nodes in Topology 2 are more closely linked than those in Topology 1, while Topology 3 has the most densely connected nodes.

In step 2, we generate the communication matrix $A_t$ in MATLAB. Although $A_t$ changes with the time $t$, it must abide by Assumption 2 and the topology in the current experiment. Generating a series of communication matrices is tedious work. We can regard $A_t$ as a map for the communications; this map is given to all nodes. Because we do not consider delay in our algorithms, one iteration ends only when all the nodes have finished computing and exchanging the learnable parameters (as shown in Fig. 2). This process has a high time complexity; hence, we do not take the communication cost into consideration.
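One standard way to build such a matrix, assuming Assumption 2 requires $A_t$ to be doubly stochastic and to respect the current topology, is the Metropolis weighting rule; this is a sketch of that rule, not necessarily the exact generator used in the experiments:

```python
import numpy as np

def metropolis_matrix(adj):
    """Build a doubly stochastic communication matrix from a symmetric
    0/1 adjacency matrix using Metropolis weights."""
    m = adj.shape[0]
    deg = adj.sum(axis=1)
    A = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j and adj[i, j]:
                A[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        A[i, i] = 1.0 - A[i].sum()   # self-weight absorbs the remainder
    return A

# A 3-node ring topology (here also a complete graph).
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
A = metropolis_matrix(adj)
```

Row and column sums are both 1, so repeated mixing with such matrices drives the nodes' parameters toward agreement.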

In step 3, when we test one of the proposed algorithms, every output (e.g., $w^i_t$) of the nodes in every iteration is recorded. Meanwhile, the regret value (due to (2) in Definition 2) in each iteration is also computed and preserved.
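The recorded regret (normalized by the number of iterations) can be computed as in this sketch; the fixed comparator `w_star` stands in for the best-in-hindsight minimizer, which in practice must be found separately:

```python
import numpy as np

def hinge(w, x, y):
    """Hinge loss of parameter w on example (x, y)."""
    return max(1.0 - y * np.dot(w, x), 0.0)

def normalized_regret(ws, data, w_star):
    """Average online loss of the iterates minus the average loss of
    the comparator w_star over the same data stream."""
    T = len(data)
    online = sum(hinge(w, x, y) for w, (x, y) in zip(ws, data)) / T
    best = sum(hinge(w_star, x, y) for x, y in data) / T
    return online - best

data = [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1)]
ws = [np.zeros(2), np.zeros(2)]       # zero iterates incur loss 1 each
r = normalized_regret(ws, data, w_star=np.array([1.0, -1.0]))
# -> 1.0, since w_star classifies both examples with zero hinge loss
```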

7.4 Results and Analysis

Note that the nodes learn the model from their own data resources; however, the final values of the learnable

TABLE 2
Summary of Datasets

Dataset                    Number of samples  Dimensionality
Diabetes 130-US hospitals  100,000            55
Covertype                  581,012            54
Adult                      48,842             14
Real-sim                   72,309             20,958
RCV1                       20,242             47,236

Fig. 4. Regret versus Dataset: Regret (normalized by the number of iterations) incurred by Algorithm 1 and Algorithm 2 on three datasets: Diabetes 130-US hospitals, Covertype, and Adult.

1450 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 30, NO. 8, AUGUST 2018


parameters $w^1_T, \ldots, w^m_T$ approach each other after $T$ iterations. This analysis is confirmed by the experiments. It therefore makes sense to measure the regret of the DOLA with respect to an arbitrary node, so we plot the regret based on the outputs of an arbitrary node.
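The mechanism behind this agreement is that each node repeatedly replaces its parameter with a weighted average of its neighbors' parameters through the communication matrix; a minimal sketch (variable names are ours):

```python
import numpy as np

def mix(A, W):
    """One consensus step: node i's new parameter is the A-weighted
    average of the parameters it receives. W has one row per node."""
    return A @ W

# Three nodes that start far apart, mixing over a complete graph
# with uniform weights (a doubly stochastic matrix).
A = np.full((3, 3), 1.0 / 3.0)
W = np.array([[0.0], [3.0], [6.0]])
for _ in range(5):
    W = mix(A, W)
# All rows converge to the common average, 3.0.
```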

Figs. 3a and 3b show the average regret (normalized by the number of iterations) incurred by our DOLA for $\epsilon$- and $(\epsilon, \delta)$-differential privacy on the Diabetes 130-US hospitals dataset. The regret obtained by the non-private algorithm is the lowest, as expected. More significantly, the regret gets closer to the non-private regret as the privacy preservation becomes weaker. However, when we set $\epsilon = 0.01$, the average regret is swallowed up in the noise; hence, the privacy level cannot be too high. Figs. 3c and 3d show the average regret on different topologies. In Topology 1, the distribution of the nodes is sparse, so its nodes cannot conduct effective data interchange, and the DOLA in Topology 1 obtains poor convergence (red curves in (c) and (d)). Since Topology 2 and Topology 3 have more densely connected nodes, they achieve better utility (blue and green curves) than Topology 1.

In Fig. 4, we test Algorithm 1 and Algorithm 2 on three different UCI datasets: Diabetes 130-US hospitals, Covertype, and Adult. The three datasets have different scales and dimensions (see Table 2). From Fig. 4, datasets of the same order of magnitude do not show significant differences in privacy and utility performance. Based on these results, it is hard to enhance the utility of the private DOLA by merely adding a small amount of data; the dataset would have to grow by several orders of magnitude, which costs too many resources.

In Fig. 5, we show the accuracy-sparsity tradeoff in (a). The three topologies are randomly generated and tested on the same dataset, Real-sim, which is of high dimension (see Table 2). This experiment indicates that an appropriate sparsity yields the best performance (the three curves achieve the highest accuracy at a sparsity of 43.6%), while lower or higher sparsity leads to worse performance. To be convincing, we also test Algorithm 3 on two high-dimensional real datasets in (b). We find that the two datasets have different appropriate sparsity levels; therefore, we need to fine-tune the sparsity parameter $r$ according to the features of the data.

In Fig. 6, we conduct experiments on Theorem 9 and the mini-batch algorithm (Algorithm 4). Fig. 6a shows that the stochastic algorithm converted from our DOLA has fast convergence rates, which demonstrates that converting the DOLA to distributed stochastic optimization makes sense. We test the mini-batch algorithm in Fig. 6b. Comparing Fig. 6b with Figs. 3a and 3b, we find that the curves in Fig. 6b are relatively smooth at the same privacy level, which indicates that mini-batching weakens the influence of the noise on the regret. The achievable privacy level is also improved by using mini-batches, since the 0.01-DP algorithm converges (red curve in Fig. 6b); this is because mini-batches reduce the sensitivity of the DOLA. In Fig. 6c, we use $\sum_{i=1}^{m} \|w^i_t - \bar{w}_t\|^2$ to denote the distance among all the learnable parameters. This experiment shows that the nodes "approach" each other over the iterations, which is consistent with the theoretical results.
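The sensitivity reduction from mini-batching can be sketched as follows: averaging $b$ per-sample gradients shrinks the influence of any single example by a factor of $b$, so proportionally less noise is needed for the same $\epsilon$. The noise scale below is illustrative, not the paper's calibrated value:

```python
import numpy as np

def dp_minibatch_grad(grads, epsilon, rng):
    """Average a mini-batch of per-sample gradients (each assumed to
    have norm at most 1) and add Laplace noise calibrated to the
    reduced sensitivity 2/(b*epsilon) instead of 2/epsilon."""
    b = len(grads)
    avg = np.mean(grads, axis=0)
    noise = rng.laplace(scale=2.0 / (b * epsilon), size=avg.shape)
    return avg + noise

rng = np.random.default_rng(0)
grads = [np.array([1.0, 0.0]),
         np.array([0.0, 1.0]),
         np.array([1.0, 1.0]) / np.sqrt(2)]
g = dp_minibatch_grad(grads, epsilon=0.1, rng=rng)
```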

As we know, the hinge loss $\ell(w) = \max\{1 - y w^T x, 0\}$ leads to the data mining algorithm SVM. To be more persuasive, we train a differentially private distributed SVM based on Theorem 10 using the RCV1 dataset. Three-quarters of the data are used to train the learnable model and the rest are used for testing. Table 3 shows the accuracy for different

Fig. 6. (a) Convergence versus Privacy, $\hat{w} = \frac{1}{t}\sum_{k=1}^{t} w_k$. (b) Regret incurred by using mini-batches. (c) The Euclidean distance is generated by $\sum_{i=1}^{m} \|w^i_t - \bar{w}_t\|^2$, where $\bar{w}_t = \frac{1}{m}\sum_{i=1}^{m} w^i_t$.

Fig. 5. (a) and (b) Regret versus Sparsity on different topologies and datasets.



levels of privacy and different numbers of nodes. Intuitively, the centralized non-private model has the highest accuracy, 82.51%, while the model with 64 nodes at the high privacy level $\epsilon = 0.01$ has the lowest accuracy, 50.36%. We conclude that the accuracy gets higher as the privacy level is lower or the number of nodes is smaller.

8 CONCLUSION AND FUTURE WORK

We have proposed a novel differentially private distributed online learning framework. Both $\epsilon$- and $(\epsilon, \delta)$-differential privacy are provided for our DOLA. Furthermore, we discussed two extensions of our algorithms. One is the conversion of the differentially private DOLA to distributed stochastic optimization. The other is the use of the mini-batch technique to weaken the influence of the added noise. According to [42], the utility of our algorithms can potentially be improved. We will study these improvements in the future.

ACKNOWLEDGMENTS

This work was supported in part by the National Natural Science Foundation of China under Grant 61401169, the Patient-Centered Outcomes Research Institute (PCORI) under contract ME-1310-07058, and the National Institutes of Health (NIH) under award numbers R01GM114612 and R01GM118609.

REFERENCES

[1] J. Feigenbaum, Y. Ishai, T. Malkin, K. Nissim, M. J. Strauss, and R. N. Wright, "Secure multiparty computation of approximations," ACM Trans. Algorithms, vol. 2, no. 3, pp. 435–472, 2006.

[2] C. Orlandi, “Is multiparty computation any good in practice?” inProc. Int. Conf. Acoust. Speech Signal Process, 2011, pp. 5848–5851.

[3] Y. Lindell and B. Pinkas, “Secure multiparty computation forprivacy-preserving data mining,” J. Privacy Confidentiality, vol. 1,no. 1, 2009, Art. no. 5.

[4] W. Du, Y. S. Han, and S. Chen, “Privacy-preserving multivariatestatistical analysis: Linear regression and classification,” in Proc.7th SIAM Int. Conf. Data Mining, 2004, vol. 4, pp. 222–233.

[5] C. Dwork, “Differential privacy,” in Proc. 33rd Int. Conf. AutomataLanguages Program.-Vol. Part II, 2006, pp. 1–12.

[6] C. Dwork, F. McSherry, K. Nissim, and A. Smith, "Calibrating noise to sensitivity in private data analysis," in Theory of Cryptography. Berlin, Germany: Springer-Verlag, 2006, pp. 265–284.

[7] Z. Huang, S. Mitra, and N. Vaidya, “Differentially private distrib-uted optimization,” in Proc. Int. Conf. Distrib. Comput. Netw., 2015,Art. no. 4.

[8] A. Rajkumar and S. Agarwal, “A differentially private stochasticgradient descent algorithm for multiparty classification,” in Proc.14th Int. Conf. Artif. Intell. Statist., 2012, pp. 933–941.

[9] Q. Wang, Y. Zhang, X. Lu, Z. Wang, Z. Qin, and K. Ren, “Real-time and spatio-temporal crowd-sourced social network datapublishing with differential privacy,” IEEE Trans. DependableSecure Comput., to be published, doi: 10.1109/TDSC.2016.2599873.

[10] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, “Differentiallyprivate empirical risk minimization,” J. Mach. Learn. Res., vol. 12,pp. 1069–1109, 2011.

[11] P. Jain, P. Kothari, and A. Thakurta, “Differentially private onlinelearning,” in Proc. 16th Annu. Conf. Comput. Learn. Theory 7th Ker-nel Workshop, 2012, pp. 24.1–24.34.

[12] M. J. Kusner, J. R. Gardner, R. Garnett, and K. Q. Weinberger,“Differentially private Bayesian optimization,” in Proc. 29th Int.Conf. Mach. Learn., 2015, pp. 918–927

[13] H. Wang, A. Banerjee, C.-J. Hsieh, P. K. Ravikumar, and I. S. Dhillon, "Large scale distributed sparse precision estimation," in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2013, pp. 584–592.

[14] D. Wang, P. Wu, P. Zhao, and S. C. Hoi, “A framework of sparseonline learning and its applications,” arXiv:1507.07146, 2015.

[15] S. Shalev-Shwartz and A. Tewari, “Stochastic methods for l 1-reg-ularized loss minimization,” J. Mach. Learn. Res., vol. 12, pp. 1865–1892, 2011.

[16] J. Langford, L. Li, and T. Zhang, “Sparse online learning via trun-cated gradient,” in Proc. 28th Int. Conf. Neural Inf. Process. Syst.,2009, pp. 905–912.

[17] L. Xiao, “Dual averaging method for regularized stochastic learn-ing and online optimization,” in Proc. 28th Int. Conf. Neural Inf.Process. Syst., 2009, pp. 2116–2124.

[18] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari,“Composite objective mirror descent,” in Proc. 16th Annu. Conf.Comput. Learn. Theory 7th Kernel Workshop, 2010, pp. 14–26.

[19] R. Tibshirani, “Regression shrinkage and selection via the lasso,”J. Roy. Statist. Soc. Series BMethodological, vol. 58, pp. 267–288, 1996.

[20] S. Song, K. Chaudhuri, and A. D. Sarwate, “Stochastic gradientdescent with differentially private updates,” in Proc. IEEE GlobalConf. Signal Inf. Processing, 2013, pp. 245–248.

[21] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, “Optimaldistributed online prediction using mini-batches,” J. Mach. Learn.Res., vol. 13, no. 1, pp. 165–202, 2012.

[22] M. Zinkevich, “Online convex programming and generalizedinfinitesimal gradient ascent,” Proc. 29th Int. Conf. Mach. Learn.,2003, pp. 928–936.

[23] E. Hazan, A. Agarwal, and S. Kale, “Logarithmic regret algo-rithms for online convex optimization,” Mach. Learn., vol. 69,no. 2–3, pp. 169–192, 2007.

[24] F. Yan, S. Sundaram, S. Vishwanathan, and Y. Qi, “Distributedautonomous online learning: Regrets and intrinsic privacy-preserving properties,” IEEE Trans. Knowl. Data Eng., vol. 25,no. 11, pp. 2483–2493, Nov. 2013.

[25] E. Hazan, "Online convex optimization," 2015.

[26] J. C. Duchi, A. Agarwal, and M. J. Wainwright, "Dual averaging for distributed optimization," in Proc. Allerton Conf. Commun. Control Comput., 2012, pp. 1564–1565.

[27] A. Nedic and A. Ozdaglar, “Distributed subgradient methods formulti-agent optimization,” IEEE Trans. Autom. Control, vol. 54,no. 1, pp. 48–61, 2009.

[28] S. Ram, A. Nedi�c, and V. Veeravalli, “Distributed stochastic sub-gradient projection algorithms for convex optimization,” J. Optim.Theory Appl., vol. 147, no. 3, pp. 516–545, 2010.

[29] R. Bassily, A. Smith, and A. Thakurta, “Private empirical risk min-imization: Efficient algorithms and tight error bounds,” in Proc.47th Annu. IEEE Symp. Found. Comput. Sci., Oct. 2014, pp. 464–473.

[30] S. Goryczka and L. Xiong, “A comprehensive comparison of mul-tiparty secure additions with differential privacy,” IEEE Trans.Dependable Secure Comput., 2015.

[31] C. Dwork and J. Lei, “Differential privacy and robust statistics,” inProc. 13th Annu. ACM Symp. Theory Comput., 2009, pp. 371–380.

[32] F. D. McSherry, “Privacy integrated queries: An extensible plat-form for privacy-preserving data analysis,” in Proc. ACM SIG-MOD Int. Conf. Manag. Data., 2009, pp. 19–30.

[33] C. Dwork, G. N. Rothblum, and S. Vadhan, “Boosting and differ-ential privacy,” in Proc. 47th Annu. IEEE Symp. Found. Comput.Sci., 2010, pp. 51–60.

[34] D. Kifer, A. Smith, and A. Thakurta, “Private convex empiricalrisk minimization and high-dimensional regression,” J. Mach.Learn. Res., vol. 1, 2012, Art. no. 41.

[35] A. Nikolov, K. Talwar, and L. Zhang, “The geometry of differen-tial privacy: The sparse and approximate cases,” in Proc. 13thAnnu. ACM Symp. Theory Comput., 2013, pp. 351–360.

TABLE 3
Accuracy for Different Levels of Privacy and Different Numbers of Nodes

Method               Nodes  Accuracy
Non-private          1      82.51%
Non-private          4      74.64%
Non-private          64     65.72%
Private (ε = 1)      1      82.51%
Private (ε = 1)      4      74.64%
Private (ε = 1)      64     65.72%
Private (ε = 0.1)    1      80.17%
Private (ε = 0.1)    4      70.86%
Private (ε = 0.1)    64     62.34%
Private (ε = 0.01)   1      75.69%
Private (ε = 0.01)   4      64.81%
Private (ε = 0.01)   64     50.36%



[36] J. S. Liu, Monte Carlo Strategies in Scientific Computing. Berlin,Germany: Springer Science & Business Media, 2008.

[37] N. Cesa-Bianchi, A. Conconi, and C. Gentile, “On the generaliza-tion ability of on-line learning algorithms,” IEEE Trans. Inf. Theory,vol. 50, no. 9, pp. 2050–2057, Sep. 2004.

[38] M. N. Broadie, D. M. Cicek, and A. Zeevi, “General bounds andfinite-time improvement for stochastic approximation algo-rithms,” Columbia University, New York, NY, USA: 2009.

[39] H. J. Kushner and G. G. Yin, Stochastic Approximation and RecursiveAlgorithms and Applications. Berlin, Germany: Springer-Verlag, 2003.

[40] B. T. Polyak and A. B. Juditsky, “Acceleration of stochasticapproximation by averaging,” SIAM J. Control Optim., vol. 30,no. 4, pp. 838–855, 1992.

[41] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, “Optimaldistributed online prediction using mini-batches,” J. Mach. Learn.Res., vol. 13, pp. 165–202, 2012.

[42] A. G. Thakurta, “(Nearly) optimal algorithms for private onlinelearning in full-information and bandit settings,” Proc. 28th Int.Conf. Neural Inf. Process. Syst., pp. 2733–2741, 2013.

Chencheng Li (S'15) received the BS degree from the Huazhong University of Science and Technology, Wuhan, P. R. China, in 2014, and is currently working toward the PhD degree in the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, P. R. China. His current research interests include online learning in big data analytics and differential privacy. He is a student member of the IEEE.

Pan Zhou (S'07–M'14) received the BS degree in the advanced class from the Huazhong University of Science and Technology, the MS degree from the Department of Electronics and Information Engineering, Huazhong University of Science and Technology, and the PhD degree from the School of Electrical and Computer Engineering, Georgia Institute of Technology (Georgia Tech), Atlanta, in 2011. He is currently an associate professor in the School of Electronic Information and Communications, Huazhong University of Science and Technology (HUST), Wuhan, P. R. China. His current research interests include machine learning and big data, security and privacy, and information networks. He is a member of the IEEE.

Li Xiong received the BS degree from the University of Science and Technology of China, the MS degree from Johns Hopkins University, and the PhD degree from the Georgia Institute of Technology, all in computer science. She is a Winship distinguished research associate professor of mathematics and computer science, Emory University, where she directs the Assured Information Management and Sharing (AIMS) Research Group. Her areas of research are in data privacy and security, distributed and spatio-temporal data management, and biomedical informatics. She is a recent recipient of the Career Enhancement Fellowship from the Woodrow Wilson Foundation.

Qian Wang received the PhD degree in electrical engineering from the Illinois Institute of Technology, in 2012. He is a professor in the School of Computer Science, Wuhan University. His research interests include AI security, cloud security and privacy, wireless systems security, etc. He is an expert under the National "1000 Young Talents Program" of China. He is a recipient of the IEEE Asia-Pacific Outstanding Young Researcher Award in 2016. He serves as an associate editor of the IEEE Transactions on Dependable and Secure Computing and the IEEE Transactions on Information Forensics and Security. He is a member of the IEEE and a member of the ACM.

Ting Wang received the PhD degree from the Georgia Institute of Technology. He is an assistant professor with Lehigh University. Prior to joining Lehigh, he was a research staff member with the IBM Thomas J. Watson Research Center. His research interests span both theoretical foundations and real-world applications of large-scale data mining and management. His current work focuses on data analytics for privacy and security.

