
TOWARDS FEDERATED LEARNING AT SCALE: SYSTEM DESIGN

Keith Bonawitz 1 Hubert Eichner 1 Wolfgang Grieskamp 1 Dzmitry Huba 1 Alex Ingerman 1 Vladimir Ivanov 1 Chloé Kiddon 1 Jakub Konečný 1 Stefano Mazzocchi 1 H. Brendan McMahan 1 Timon Van Overveldt 1 David Petrou 1 Daniel Ramage 1 Jason Roselander 1

ABSTRACT

Federated Learning is a distributed machine learning approach which enables model training on a large corpus of decentralized data. We have built a scalable production system for Federated Learning in the domain of mobile devices, based on TensorFlow. In this paper, we describe the resulting high-level design, sketch some of the challenges and their solutions, and touch upon the open problems and future directions.

1 INTRODUCTION

Federated Learning (FL) (McMahan et al., 2017) is a distributed machine learning approach which enables training on a large corpus of decentralized data residing on devices like mobile phones. FL is one instance of the more general approach of “bringing the code to the data, instead of the data to the code” and addresses the fundamental problems of privacy, ownership, and locality of data. The general description of FL has been given by McMahan & Ramage (2017), and its theory has been explored in Konečný et al. (2016a); McMahan et al. (2017; 2018).

A basic design decision for a Federated Learning infrastructure is whether to focus on asynchronous or synchronous training algorithms. While much successful work on deep learning has used asynchronous training, e.g., Dean et al. (2012), recently there has been a consistent trend towards synchronous large batch training, even in the data center (Goyal et al., 2017; Smith et al., 2018). The Federated Averaging algorithm of McMahan et al. (2017) takes a similar approach. Further, several approaches to enhancing privacy guarantees for FL, including differential privacy (McMahan et al., 2018) and Secure Aggregation (Bonawitz et al., 2017), essentially require some notion of synchronization on a fixed set of devices, so that the server side of the learning algorithm only consumes a simple aggregate of the updates from many users. For all these reasons, we chose to focus on support for synchronous rounds, while mitigating potential synchronization overhead via several techniques we describe subsequently. Our system is thus amenable to running large-batch SGD-style algorithms as well as Federated Averaging, the primary algorithm we run in production; pseudo-code is given in Appendix B for completeness.

1 Google Inc., Mountain View, CA, USA. Correspondence to: Wolfgang Grieskamp <[email protected]>, Vladimir Ivanov <[email protected]>, Brendan McMahan <[email protected]>.

Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019. Copyright 2019 by the author(s).

In this paper, we report on a system design for such algorithms in the domain of mobile phones (Android). This work is still in an early stage, and we do not have all problems solved, nor are we able to give a comprehensive discussion of all required components. Rather, we attempt to sketch the major components of the system, describe the challenges, and identify the open issues, in the hope that this will be useful to spark further systems research.

Our system enables one to train a deep neural network, using TensorFlow (Abadi et al., 2016), on data stored on the phone which will never leave the device. The weights are combined in the cloud with Federated Averaging, constructing a global model which is pushed back to phones for inference. An implementation of Secure Aggregation (Bonawitz et al., 2017) ensures that on a global level individual updates from phones are uninspectable. The system has been applied in large scale applications, for instance in the realm of a phone keyboard.

Our work addresses numerous practical issues: device availability that correlates with the local data distribution in complex ways (e.g., time zone dependency); unreliable device connectivity and interrupted execution; orchestration of lock-step execution across devices with varying availability; and limited device storage and compute resources. These issues are addressed at the communication protocol, device, and server levels. We have reached a state of maturity sufficient to deploy the system in production and solve applied learning problems over tens of millions of real-world devices; we anticipate uses where the number of devices reaches billions.


Figure 1: Federated Learning Protocol. In each round: (1) devices check in with the FL server, and rejected ones are told to come back later; (2) the server reads the model checkpoint from persistent storage; (3) the model and configuration are sent to selected devices; (4) on-device training is performed, and the model update is reported back; (5) the server aggregates updates into the global model as they arrive; (6) the server writes the global model checkpoint into persistent storage. Devices that suffer a device or network failure during a round are simply ignored.

2 PROTOCOL

To understand the system architecture, it is best to start from the network protocol.

2.1 Basic Notions

The participants in the protocol are devices (currently Android phones) and the FL server, which is a cloud-based distributed service. Devices announce to the server that they are ready to run an FL task for a given FL population. An FL population is specified by a globally unique name which identifies the learning problem, or application, which is worked upon. An FL task is a specific computation for an FL population, such as training to be performed with given hyperparameters, or evaluation of trained models on local device data.

From the potential tens of thousands of devices announcing availability to the server during a certain time window, the server selects a subset of typically a few hundred which are invited to work on a specific FL task (we discuss the reason for this subsetting in Sec. 2.2). We call this rendezvous between devices and server a round. Devices stay connected to the server for the duration of the round.

The server tells the selected devices what computation to run with an FL plan, a data structure that includes a TensorFlow graph and instructions for how to execute it. Once a round is established, the server next sends to each participant the current global model parameters and any other necessary state as an FL checkpoint (essentially the serialized state of a TensorFlow session). Each participant then performs a local computation based on the global state and its local dataset, and sends an update in the form of an FL checkpoint back to the server. The server incorporates these updates into its global state, and the process repeats.
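To make the message flow concrete, here is a minimal Python sketch of one such round from the server's perspective. It is illustrative only: federated_averaging is the example-count-weighted average of McMahan et al. (2017), while checked_in_devices, local_train, and the other names are hypothetical stand-ins for the real selection, plan-execution, and checkpoint machinery.

```python
import random

def federated_averaging(updates):
    """Example-count-weighted average of (num_examples, weights) updates,
    as in the Federated Averaging algorithm of McMahan et al. (2017)."""
    total = sum(n for n, _ in updates)
    dim = len(updates[0][1])
    return [sum(n * w[i] for n, w in updates) / total for i in range(dim)]

def run_round(checked_in_devices, global_model, goal_count, min_reports):
    """One round: Selection, then Configuration, then Reporting (sketch)."""
    # Selection: invite a subset of the devices that checked in.
    participants = random.sample(checked_in_devices,
                                 min(goal_count, len(checked_in_devices)))
    # Configuration: each participant receives the plan and the global
    # checkpoint; local_train stands in for on-device plan execution.
    updates = [u for u in (d.local_train(global_model) for d in participants)
               if u is not None]  # devices that drop out report nothing
    # Reporting: with enough updates the global model advances;
    # otherwise the round is abandoned.
    if len(updates) >= min_reports:
        return federated_averaging(updates)
    return global_model
```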

2.2 Phases

The communication protocol enables devices to advance the global, singleton model of an FL population between rounds, where each round consists of the three phases shown in Fig. 1. For simplicity, the description below does not include Secure Aggregation, which is described in Sec. 6. Note that even in the absence of Secure Aggregation, all network traffic is encrypted on the wire.

Selection Periodically, devices that meet the eligibility criteria (e.g., charging and connected to an unmetered network; see Sec. 3) check in to the server by opening a bidirectional stream. The stream is used to track liveness and orchestrate multi-step communication. The server selects a subset of connected devices based on certain goals like the optimal number of participating devices (typically a few hundred devices participate in each round). If a device is not selected for participation, the server responds with instructions to reconnect at a later point in time.1

1 In the current implementation, selection is done by simple reservoir sampling, but the protocol is amenable to more sophisticated methods which address selection bias.
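For reference, reservoir sampling selects a uniformly random subset of fixed size from a stream of check-ins without knowing the stream's length in advance. A standard sketch of the algorithm (not the production code):

```python
import random

def reservoir_sample(checkins, k):
    """Uniformly sample k items from a stream of unknown length."""
    reservoir = []
    for i, device in enumerate(checkins):
        if i < k:
            reservoir.append(device)
        else:
            # Replace an existing element with decreasing probability,
            # so every device seen so far is kept with probability k/(i+1).
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = device
    return reservoir
```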


Configuration The server is configured based on the aggregation mechanism selected (e.g., simple or Secure Aggregation) for the selected devices. The server sends the FL plan and an FL checkpoint with the global model to each of the devices.

Reporting The server waits for the participating devices to report updates. As updates are received, the server aggregates them using Federated Averaging and instructs the reporting devices when to reconnect (see also Sec. 2.3). If enough devices report in time, the round will be successfully completed and the server will update its global model; otherwise the round is abandoned.

As seen in Fig. 1, straggling devices which do not report back in time or do not react to configuration by the server will simply be ignored. The protocol has a certain tolerance for such drop-outs, which is configurable per FL task.

The selection and reporting phases are specified by a set of parameters which spawn flexible time windows. For example, for the selection phase the server considers a device participant goal count, a timeout, and a minimal percentage of the goal count which is required to run the round. The selection phase lasts until the goal count is reached or a timeout occurs; in the latter case, the round will be started or abandoned depending on whether the minimal goal count has been reached.
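A small sketch of that decision logic under the parameters just described (the names are hypothetical, not our configuration schema):

```python
def selection_outcome(selected, goal_count, min_fraction, timed_out):
    """Outcome of the selection window: the phase ends when the goal
    count is reached or the timeout fires; on timeout, the round starts
    only if a minimal fraction of the goal count was reached."""
    if selected >= goal_count:
        return "start_round"
    if timed_out:
        return ("start_round" if selected >= min_fraction * goal_count
                else "abandon_round")
    return "keep_selecting"
```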

2.3 Pace Steering

Pace steering is a flow control mechanism regulating the pattern of device connections. It enables the FL server both to scale down to handle small FL populations as well as to scale up to very large FL populations.

Pace steering is based on the simple mechanism of the server suggesting to the device the optimum time window to reconnect. The device attempts to respect this, modulo its eligibility.

In the case of small FL populations, pace steering is used to ensure that a sufficient number of devices connect to the server simultaneously. This is important both for the rate of task progress and for the security properties of the Secure Aggregation protocol. The server uses a stateless probabilistic algorithm requiring no additional device/server communication to suggest reconnection times to rejected devices so that subsequent check-ins are likely to arrive contemporaneously.

For large FL populations, pace steering is used to randomize device check-in times, avoiding the “thundering herd” problem, and to instruct devices to connect as frequently as needed to run all scheduled FL tasks, but not more.

Pace steering also takes into account the diurnal oscillation in the number of active devices, and is able to adjust the time window accordingly, avoiding excessive activity during peak hours without hurting FL performance during other times of the day.
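The paper does not spell out the pace steering algorithm itself, so the following toy sketch only illustrates the idea of sizing the reconnection window to the population: a large population spreads check-ins over a long horizon, while a small one concentrates them so enough devices arrive together.

```python
import random

def suggest_reconnect_delay(active_devices, needed_per_round, round_period_s):
    """Toy sketch: suggest how long a rejected device should wait.

    With many devices relative to the per-round need, check-ins are
    spread over a long horizon (avoiding a thundering herd); with few
    devices, the window shrinks so check-ins arrive contemporaneously.
    Illustrative only, not the production algorithm.
    """
    horizon = round_period_s * max(1.0, active_devices / max(1, needed_per_round))
    return random.uniform(0.0, horizon)
```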

3 DEVICE

Figure 2: Device Architecture. (An app process on the device hosts the application's data, its example store, and the FL runtime with its configuration, separated by a process boundary (inter- or intra-app); the FL runtime downloads the model and (training) plan from the FL server and reports the model update back.)

This section describes the software architecture running on a device participating in FL. This describes our Android implementation, but note that the architectural choices made here are not particularly platform-specific.

The device's first responsibility in on-device learning is to maintain a repository of locally collected data for model training and evaluation. Applications are responsible for making their data available to the FL runtime as an example store by implementing an API we provide. An application's example store might, for example, be an SQLite database recording action suggestions shown to the user and whether or not those suggestions were accepted. We recommend that applications limit the total storage footprint of their example stores, and automatically remove old data after a pre-designated expiration time, where appropriate. We provide utilities to make these tasks easy. Data stored on devices may be vulnerable to threats like malware or physical disassembly of the phone, so we recommend that applications follow the best practices for on-device data security, including ensuring that data is encrypted at rest in the platform-recommended manner.
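The actual example store API is an Android interface provided by our runtime; purely as an illustration of the concept, here is a hypothetical Python/SQLite sketch of a store recording suggestion impressions, with expiration of old data (all names invented):

```python
import sqlite3
import time

class SuggestionExampleStore:
    """Hypothetical example store: records suggestions shown to the
    user and whether they were accepted, and expires old rows."""

    def __init__(self, path, max_age_days=60):
        self.db = sqlite3.connect(path)
        self.max_age_s = max_age_days * 86400
        self.db.execute("CREATE TABLE IF NOT EXISTS examples "
                        "(ts REAL, suggestion TEXT, accepted INTEGER)")

    def add_example(self, suggestion, accepted):
        self.db.execute("INSERT INTO examples VALUES (?, ?, ?)",
                        (time.time(), suggestion, int(accepted)))
        self.db.commit()

    def iter_examples(self):
        """Called by the FL runtime to stream training examples."""
        self.expire_old()
        return self.db.execute("SELECT suggestion, accepted FROM examples")

    def expire_old(self):
        # Enforce the pre-designated expiration time mentioned above.
        cutoff = time.time() - self.max_age_s
        self.db.execute("DELETE FROM examples WHERE ts < ?", (cutoff,))
        self.db.commit()
```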

The FL runtime, when provided a task by the FL server, accesses an appropriate example store to compute model updates, or to evaluate model quality on held-out data. Fig. 2 shows the relationship between the example store and the FL runtime. Control flow consists of the following steps:

Programmatic Configuration An application configures the FL runtime by providing an FL population name and registering its example stores. This schedules a periodic FL runtime job using Android's JobScheduler. Possibly the most important requirement for training machine learning (ML) models on end users' devices is to avoid any negative impact on the user experience, data usage, or battery life. The FL runtime requests that the job scheduler only invoke the job when the phone is idle, charging, and connected to an unmetered network such as WiFi. Once started, the FL runtime will abort, freeing the allocated resources, if these conditions are no longer met.

Job Invocation Upon invocation by the job scheduler in a separate process, the FL runtime contacts the FL server to announce that it is ready to run tasks for the given FL population. The server decides whether any FL tasks are available for the population and will either return an FL plan or a suggested time to check in later.

Task Execution If the device has been selected, the FL runtime receives the FL plan, queries the app's example store for data requested by the plan, and computes plan-determined model updates and metrics.

Reporting After FL plan execution, the FL runtime reports computed updates and metrics to the server and cleans up any temporary resources.

As already mentioned, FL plans are not specialized to training, but can also encode evaluation tasks: computing quality metrics from held-out data that wasn't used for training, analogous to the validation step in data center training.

Our design enables the FL runtime to either run within the application that configured it or in a centralized service hosted in another app. Choosing between these two requires minimal code changes. Communication between the application, the FL runtime, and the application's example store as depicted in Fig. 2 is implemented via Android's AIDL IPC mechanism, which works both within a single app and across apps.

Multi-Tenancy Our implementation provides a multi-tenant architecture, supporting training of multiple FL populations in the same app (or service). This allows for coordination between multiple training activities, avoiding the device being overloaded by many simultaneous training sessions.

Attestation We want devices to participate in FL anonymously, which excludes the possibility of authenticating them via a user identity. Without verifying user identity, we need to protect against attacks to influence the FL result from non-genuine devices. We do so by using Android's remote attestation mechanism (Android Documentation), which helps to ensure that only genuine devices and applications participate in FL, and gives us some protection against data poisoning (Bagdasaryan et al., 2018) via compromised devices. Other forms of model manipulation – such as content farms using uncompromised phones to steer a model – are also potential areas of concern that we do not address in the scope of this paper.

4 SERVER

The design of the FL server is driven by the necessity to operate over many orders of magnitude of population sizes and other dimensions. The server must work with FL populations whose sizes range from tens of devices (during development) to hundreds of millions, and be able to process rounds with participant counts ranging from tens of devices to tens of thousands. Also, the updates collected and communicated during each round can range in size from kilobytes to tens of megabytes. Finally, the amount of traffic coming into or out of any given geographic region can vary dramatically over a day based on when devices are idle and charging. This section details the design of the FL server infrastructure given these requirements.

4.1 Actor Model

The FL server is designed around the Actor Programming Model (Hewitt et al., 1973). Actors are universal primitives of concurrent computation which use message passing as the sole communication mechanism.

Each actor handles a stream of messages/events strictly sequentially, leading to a simple programming model. Running multiple instances of actors of the same type allows a natural scaling to a large number of processors/machines. In response to a message, an actor can make local decisions, send messages to other actors, or create more actors dynamically. Depending on the function and scalability requirements, actor instances can be co-located on the same process/machine or distributed across data centers in multiple geographic regions, using either explicit or automatic configuration mechanisms. Creating and placing fine-grained ephemeral instances of actors just for the duration of a given FL task enables dynamic resource management and load-balancing decisions.
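For readers unfamiliar with the pattern, a minimal Python actor sketch illustrating the sequential message-handling contract (the production server is a distributed service, not this toy):

```python
import queue
import threading

class Actor:
    """Minimal actor: a mailbox drained strictly sequentially."""

    def __init__(self):
        self.mailbox = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def send(self, message):
        self.mailbox.put(message)  # message passing is the sole channel

    def _loop(self):
        while True:
            self.receive(self.mailbox.get())  # one message at a time

    def receive(self, message):
        raise NotImplementedError

class RoundCounter(Actor):
    """Example actor: counts completed rounds. No locks are needed,
    because each message is handled to completion before the next."""

    def __init__(self):
        super().__init__()
        self.completed = 0

    def receive(self, message):
        if message == "round_done":
            self.completed += 1
```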

4.2 Architecture

The main actors in the system are shown in Fig. 3.

Figure 3: Actors in the FL Server Architecture. (Coordinators are persistent, long-lived actors; Master Aggregators and Aggregators are ephemeral, short-lived actors created per FL task. The Coordinator coordinates the Selectors, which accept connections from devices and forward subsets of them, and creates Master Aggregators, which in turn create Aggregators.)

Coordinators are the top-level actors which enable global synchronization and advancing rounds in lockstep. There are multiple Coordinators, and each one is responsible for an FL population of devices. A Coordinator registers its address and the FL population it manages in a shared locking service, so there is always a single owner for every FL population which is reachable by other actors in the system, notably the Selectors. The Coordinator receives information about how many devices are connected to each Selector and instructs them how many devices to accept for participation, based on which FL tasks are scheduled. Coordinators spawn Master Aggregators to manage the rounds of each FL task.

Selectors are responsible for accepting and forwarding device connections. They periodically receive information from the Coordinator about how many devices are needed for each FL population, which they use to make local decisions about whether or not to accept each device. After the Master Aggregator and set of Aggregators are spawned, the Coordinator instructs the Selectors to forward a subset of their connected devices to the Aggregators, allowing the Coordinator to efficiently allocate devices to FL tasks regardless of how many devices are available. The approach also allows the Selectors to be globally distributed (close to devices) and to limit communication with the remote Coordinator.

Master Aggregators manage the rounds of each FL task. In order to scale with the number of devices and update size, they make dynamic decisions to spawn one or more Aggregators to which work is delegated.

No information for a round is written to persistent storage until it is fully aggregated by the Master Aggregator. Specifically, all actors keep their state in memory and are ephemeral. Ephemeral actors improve scalability by removing the latency normally incurred by distributed storage. In-memory aggregation also removes the possibility of attacks within the data center that target persistent logs of per-device updates, because no such logs exist.

4.3 Pipelining

While the Selection, Configuration and Reporting phases of a round (Sec. 2) are sequential, the Selection phase doesn't depend on any input from a previous round. This enables latency optimization by running the Selection phase of the next round of the protocol in parallel with the Configuration/Reporting phases of a previous round. Our system architecture enables such pipelining without adding extra complexity, as parallelism is achieved simply by virtue of Selector actors running the selection process continuously.

4.4 Failure Modes

In all failure cases the system will continue to make progress, either by completing the current round or restarting from the results of the previously committed round. In many cases, the loss of an actor will not prevent the round from succeeding. For example, if an Aggregator or Selector crashes, only the devices connected to that actor will be lost. If the Master Aggregator fails, the current round of the FL task it manages will fail, but will then be restarted by the Coordinator. Finally, if the Coordinator dies, the Selector layer will detect this and respawn it. Because the Coordinators are registered in a shared locking service, this will happen exactly once.

5 ANALYTICS

There are many factors and failsafes in the interaction between devices and servers. Moreover, much of the platform activity happens on devices that we neither control nor have access to.

For this reason, we rely on analytics to understand what is actually going on in the field, and monitor devices' health statistics. On the device side we perform computation-intensive operations, and must avoid wasting the phone's battery or bandwidth, or degrading the performance of the phone. To ensure this, we log several activity and health parameters to the cloud. For example: the device state in which training was activated, how often and how long it ran, how much memory it used, which errors were detected, which phone model / OS / FL runtime version was used, and so on. These log entries do not contain any personally identifiable information (PII). They are aggregated and presented in dashboards to be analyzed, and fed into automatic time-series monitors that trigger alerts on substantial deviations.

We also log an event for every state in a training round, and use these logs to generate ASCII visualizations of the sequence of state transitions happening across all devices (see Table 1 in the appendix). We chart counts of these sequence visualizations in our dashboards, which allows us to quickly distinguish between different types of issues.


For example, the sequence “checking in, downloaded plan, started training, ended training, starting upload, error” is visualized as “-v[]+*”, while the shorter sequence “checking in, downloaded plan, started training, error” is “-v[*”. The first indicates that a model trained successfully but the results upload failed (a network issue), whereas the second indicates that a training round failed right after loading the model (a model issue).
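A sketch of such an encoding, using the symbols from the example above (the helper names are hypothetical, and the production symbol table covers more states):

```python
from collections import Counter

# Symbols taken from the example above.
STATE_SYMBOLS = {
    "checking in": "-",
    "downloaded plan": "v",
    "started training": "[",
    "ended training": "]",
    "starting upload": "+",
    "error": "*",
}

def visualize(events):
    """Collapse one device's event sequence into a compact string."""
    return "".join(STATE_SYMBOLS[e] for e in events)

def sequence_counts(per_device_events):
    """Count identical sequences across devices for dashboard charting."""
    return Counter(visualize(events) for events in per_device_events)

# visualize(["checking in", "downloaded plan", "started training",
#            "ended training", "starting upload", "error"]) == "-v[]+*"
```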

Server side, we similarly collect information such as how many devices were accepted and rejected per training round, the timing of the various phases of the round, throughput in terms of uploaded and downloaded data, errors, and so on.

Since the platform's deployment, we have relied on the analytics layer repeatedly to discover issues and verify that they were resolved. Some of the incidents we discovered were device health related, for example discovering that training was happening when it shouldn't have, while others were functional, for example discovering that the drop out rates of training participants were much higher than expected.

Federated training does not impact the user experience, so both device and server functional failures do not have an immediate negative impact. But failures to operate properly could have secondary consequences leading to utility degradation of the device. Device utility to the user is mission critical, and degradations are difficult to pinpoint and easy to wrongly diagnose. Using accurate analytics to prevent federated training from negatively impacting the device's utility to the user accounts for a substantial part of our engineering and risk mitigation costs.

6 SECURE AGGREGATION

Bonawitz et al. (2017) introduced Secure Aggregation, a Secure Multi-Party Computation protocol that uses encryption to make individual devices' updates uninspectable by a server, instead only revealing the sum after a sufficient number of updates have been received. We can deploy Secure Aggregation as a privacy enhancement to the FL service that protects against additional threats within the data center by ensuring that individual devices' updates remain encrypted even in-memory. Formally, Secure Aggregation protects from “honest but curious” attackers that may have access to the memory of Aggregator instances. Importantly, the only aggregates needed for model evaluation, SGD, or Federated Averaging are sums (e.g., w_t and n_t in Appendix B).2

2 It is important to note that the goal of our system is to provide the tools to build privacy preserving applications. Privacy is enhanced by the ephemeral and focused nature of the FL updates, and can be further augmented with Secure Aggregation and/or differential privacy — e.g., the techniques of McMahan et al. (2018) are currently implemented. However, while the platform is designed to support a variety of privacy-enhancing technologies, stating specific privacy guarantees depends on the details of the application and the details of how these technologies are used; such a discussion is beyond the scope of the current work.

Secure Aggregation is a four-round interactive protocol optionally enabled during the reporting phase of a given FL round. In each protocol round, the server gathers messages from all devices in the FL round, then uses the set of device messages to compute an independent response to return to each device. The protocol is designed to be robust to a significant fraction of devices dropping out before the protocol is complete. The first two rounds constitute a Prepare phase, in which shared secrets are established and during which devices who drop out will not have their updates included in the final aggregation. The third round constitutes a Commit phase, during which devices upload cryptographically masked model updates and the server accumulates a sum of the masked updates. All devices who complete this round will have their model update included in the protocol's final aggregate update, or else the entire aggregation will fail. The last round of the protocol constitutes a Finalization phase, during which devices reveal sufficient cryptographic secrets to allow the server to unmask the aggregated model update. Not all committed devices are required to complete this round; so long as a sufficient number of the devices who started the protocol survive through the Finalization phase, the entire protocol succeeds.
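The core idea behind the masking can be shown in a few lines: each pair of devices derives a shared pseudorandom mask, which one of them adds and the other subtracts, so all masks cancel in the sum while individual updates stay hidden. The sketch below (names hypothetical) omits the secret sharing that lets the real protocol of Bonawitz et al. (2017) recover from dropouts:

```python
import random

def pair_mask(seed, length):
    """Pseudorandom mask derived from a seed both devices in a pair share."""
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(length)]

def mask_update(device_id, update, all_ids, pair_seeds):
    """Mask one device's update so only the sum over devices is revealed."""
    masked = list(update)
    for other in all_ids:
        if other == device_id:
            continue
        mask = pair_mask(pair_seeds[frozenset((device_id, other))], len(update))
        sign = 1 if device_id < other else -1  # one adds, the other subtracts
        masked = [m + sign * x for m, x in zip(masked, mask)]
    return masked

# Summing the masked updates of all devices yields the sum of the raw
# updates: every pairwise mask is added exactly once and subtracted once.
```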

Several costs for Secure Aggregation grow quadratically with the number of users, most notably the computational cost for the server. In practice, this limits the maximum size of a Secure Aggregation to hundreds of users. So as not to constrain the number of users that may participate in each round of federated computation, we run an instance of Secure Aggregation on each Aggregator actor (see Fig. 3) to aggregate inputs from that Aggregator's devices into an intermediate sum; FL tasks define a parameter k so that all updates are securely aggregated over groups of size at least k. The Master Aggregator then further aggregates the intermediate Aggregators' results into a final aggregate for the round, without Secure Aggregation.
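A sketch of this two-level scheme; secure_sum stands in for a full Secure Aggregation instance over one Aggregator's devices, and k is the FL task's minimum group size (all names hypothetical):

```python
def plain_sum(vectors):
    """Elementwise sum of a list of update vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def hierarchical_aggregate(device_updates, k, secure_sum=plain_sum):
    """Secure Aggregation within groups of size >= k, then a plain sum
    of the per-group intermediate sums at the Master Aggregator."""
    # Partition devices into groups of at least k.
    groups = [device_updates[i:i + k]
              for i in range(0, len(device_updates), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())  # keep every group at size >= k
    # Each Aggregator runs one Secure Aggregation instance over its group...
    partial_sums = [secure_sum(g) for g in groups]
    # ...and the Master Aggregator adds the intermediate sums in the clear.
    return plain_sum(partial_sums)
```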

7 TOOLS AND WORKFLOW

Compared to the standard model engineer workflows on centrally collected data, on-device training poses multiple novel challenges. First, individual training examples are not directly inspectable, requiring tooling to work with proxy data in testing and simulation (Sec. 7.1). Second, models cannot be run interactively but must instead be compiled into FL plans to be deployed via the FL server (Sec. 7.2). Finally, because FL plans run on real devices, model resource consumption and runtime compatibility must be verified automatically by the infrastructure (Sec. 7.3).

The primary developer surface of model engineers working with the FL system is a set of Python interfaces and tools to define, test, and deploy TensorFlow-based FL tasks to the fleet of mobile devices via the FL server. The workflow of a model engineer for FL is depicted in Fig. 4 and described below.

Figure 4: Model Engineer Workflow. (In the development environment, a TensorFlow model program is used to simulate and to generate an FL plan, which is deployed to the FL server in the production environment; devices download the plan and model and upload the model and metrics.)

7.1 Modeling and Simulation

Model engineers begin by defining the FL tasks that they would like to run on a given FL population in Python. Our library enables model engineers to declare Federated Learning and evaluation tasks using engineer-provided TensorFlow functions. The role of these functions is to map input tensors to output metrics like loss or accuracy. During development, model engineers may use sample test data or other proxy data as inputs. When deployed, the inputs will be provided from the on-device example store via the FL runtime.

The role of the modeling infrastructure is to enable model engineers to focus on their model, using our libraries to build and test the corresponding FL tasks. FL tasks are validated against engineer-provided test data and expectations, similar in nature to unit tests. FL task tests are ultimately required in order to deploy a model as described below in Sec. 7.3.

The configuration of tasks is also written in Python and includes runtime parameters such as the optimal number of devices in a round as well as model hyperparameters like learning rate. FL tasks may be defined in groups: for example, to evaluate a grid search over learning rates. When more than one FL task is deployed in an FL population, the FL service chooses among them using a dynamic strategy that allows alternating between training and evaluation of a single model or A/B comparisons between models.
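As an illustration of the flavor of such declarations (the names and structure below are hypothetical, not our actual API): an engineer-provided TensorFlow function maps input tensors to output metrics, and a Python configuration defines a group of tasks spanning a learning rate grid.

```python
import tensorflow as tf

def loss_and_accuracy(batch, weights):
    """Engineer-provided TensorFlow function mapping input tensors to
    output metrics (labels are assumed to be int64 class ids)."""
    logits = tf.matmul(batch["features"], weights)
    loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=batch["labels"], logits=logits))
    accuracy = tf.reduce_mean(tf.cast(
        tf.equal(tf.argmax(logits, axis=1), batch["labels"]), tf.float32))
    return {"loss": loss, "accuracy": accuracy}

# A hypothetical task group: a grid search over learning rates, each
# entry becoming one FL task deployed to the same FL population.
TASKS = [
    {"population": "keyboard/suggestion_ranking",
     "metrics_fn": loss_and_accuracy,
     "devices_per_round": 400,
     "learning_rate": lr}
    for lr in (0.1, 0.3, 1.0)
]
```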

Initial hyperparameter exploration is sometimes done in simulation using proxy data. Proxy data is similar in shape to the on-device data but drawn from a different distribution – for example, text from Wikipedia may be viewed as proxy data for text typed on a mobile keyboard. Our modeling tools allow deployment of FL tasks to a simulated FL server and a fleet of cloud jobs emulating devices on a large proxy dataset. The simulation executes the same code as we run on device and communicates with the server using simulated FL populations. Simulation can scale to a large number of devices and is sometimes used to pre-train models on proxy data before they are refined by FL in the field.

7.2 Plan Generation

Each FL task is associated with an FL plan. Plans are automatically generated from the combination of model and configuration supplied by the model engineer. Typically, in data center training, the information which is encoded in the FL plan would be represented by a Python program which orchestrates a TensorFlow graph. However, we do not execute Python directly on the server or devices. The FL plan's purpose is to describe the desired orchestration independent of Python.

An FL plan consists of two parts: one for the device and one for the server. The device portion of the FL plan contains, among other things: the TensorFlow graph itself, selection criteria for training data in the example store, instructions on how to batch data and how many epochs to run on the device, labels for the nodes in the graph which represent certain computations like loading and saving weights, and so on. The server part contains the aggregation logic, which is encoded in a similar way. Our libraries automatically split the part of a provided model's computation which runs on device from the part that runs on the server (the aggregation).
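A rough sketch of the information an FL plan carries, as plain Python data structures (all field names hypothetical; the real plan serializes an actual TensorFlow graph rather than these toy fields):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DevicePlan:
    graph: bytes            # serialized TensorFlow graph
    example_selection: str  # criteria for training data in the example store
    batch_size: int
    epochs: int
    # Labels of graph nodes for operations like loading/saving weights.
    op_labels: Dict[str, str] = field(default_factory=dict)

@dataclass
class ServerPlan:
    graph: bytes            # aggregation logic, encoded the same way
    aggregation_op: str

@dataclass
class FLPlan:
    device: DevicePlan
    server: ServerPlan
    tensorflow_version: str  # supports the versioning described in Sec. 7.3
```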

7.3 Versioning, Testing, and Deployment

Model engineers working in the federated system are able to work productively and safely, launching or ending multiple experiments per day. But because each FL task may potentially be RAM-hogging or incompatible with version(s) of TensorFlow running on the fleet, engineers rely on the FL system's versioning, testing, and deployment infrastructure for automated safety checks.

An FL task that has been translated into an FL plan is not accepted by the server for deployment unless certain conditions are met. First, it must have been built from auditable, peer reviewed code. Second, it must have bundled test predicates for each FL task that pass in simulation. Third, the resources consumed during testing must be within a safe range of expected resources for the target population. And finally, the FL task tests must pass on every version of the TensorFlow runtime that the FL task claims to support, as verified by testing the FL task's plan in an Android emulator.

Versioning is a specific challenge for on-device machine learning. In contrast to data-center training, where the TensorFlow runtime and graphs can generally be rebuilt as needed, devices may be running a version of the TensorFlow runtime that is many months older than what is required by the FL plan generated by modelers today. For example, the old runtime may be missing a particular TensorFlow operator, or the signature of an operator may have changed in an incompatible way. The FL infrastructure deals with this problem by generating versioned FL plans for each task. Each versioned FL plan is derived from the default (unversioned) FL plan by transforming its computation graph to achieve compatibility with a deployed TensorFlow version. Versioned and unversioned plans must pass the same release tests, and are therefore treated as semantically equivalent. We encounter about one incompatible change that can be fixed with a graph transformation every three months, and a slightly smaller number that cannot be fixed without complex workarounds.

7.4 Metrics

As soon as an FL task has been accepted for deployment, devices checking in may be served the appropriate (versioned) plan. As soon as an FL round closes, that round's aggregated model parameters and metrics are written to the server storage location chosen by the model engineer.

Materialized model metrics are annotated with additional data, including metadata like the source FL task's name, FL round number within the FL task, and other basic operational data. The metrics themselves are summaries of device reports within the round via approximate order statistics and moments like the mean. The FL system provides analysis tools for model engineers to load these metrics into standard Python numerical data science packages for visualization and exploration.

8 APPLICATIONS

Federated Learning applies best in situations where the on-device data is more relevant than the data that exists on servers (e.g., the devices generate the data in the first place), is privacy-sensitive, or is otherwise undesirable or infeasible to transmit to servers. Current applications of Federated Learning are for supervised learning tasks, typically using labels inferred from user activity (e.g., clicks or typed words).

On-device item ranking A common use of machine learning in mobile applications is selecting and ranking items from an on-device inventory. For example, apps may expose a search mechanism for information retrieval or in-app navigation, for example settings search on Google Pixel devices (ai.google, 2018). By ranking these results on-device, expensive calls to the server (in, e.g., latency, bandwidth or power consumption dimensions) are eliminated, and any potentially private information from the search query and user selection remains on the device. Each user interaction with the ranking feature can become a labeled data point, since it's possible to observe the user's interaction with the preferred item in the context of the full ranked list.

Content suggestions for on-device keyboards On-device keyboard implementations can add value to users by suggesting relevant content – for example, search queries that are related to the input text. Federated Learning can be used to train ML models for triggering the suggestion feature, as well as ranking the items that can be suggested in the current context. This approach has been taken by Google's Gboard mobile keyboard team, using our FL system (Yang et al., 2018).

Next word prediction Gboard also used our FL platform to train a recurrent neural network (RNN) for next-word prediction (Hard et al., 2018). This model, which has about 1.4 million parameters, converges in 3000 FL rounds after processing 6e8 sentences from 1.5e6 users over 5 days of training (so each round takes about 2–3 minutes).3 It improves top-1 recall over a baseline n-gram model from 13.0% to 16.4%, and matches the performance of a server-trained RNN which required 1.2e8 SGD steps. In live A/B experiments, the FL model outperforms both the n-gram and the server-trained RNN models.

9 OPERATIONAL PROFILE

In this section we provide a brief overview of some key operational metrics of the deployed FL system, running production workloads for over a year; Appendix A provides additional details. These numbers are examples only, since we have not yet applied FL to a diverse enough set of applications to provide a complete characterization. Further, all data was collected in the process of operating a production system, rather than under controlled conditions explicitly for the purpose of measurement. Many of the performance metrics here depend on the device and network speed (which can vary by region); FL plan, global model and update sizes (which vary per application); and the number of samples per round and computational complexity per sample.

3 This is roughly 7× slower than in comparable data center training of the same model. However, we do not believe this type of comparison is the primary one – our main goal is to enable training on data that is not available in the data center. In fact, for the model mentioned, different proxy data was used for data center training. Nevertheless, fast wall-clock convergence time is important for enabling model engineers to iterate rapidly, and hence we are continuing to optimize both our system and algorithms to decrease convergence times.

We designed the FL system to elastically scale with the number and sizes of the FL populations, potentially up into the billions. Currently the system is handling a cumulative FL population size of approximately 10M daily active devices, spanning several different applications.

As discussed before, at any point in time only a subset of devices connect to the server, due to device eligibility and pace steering. Given this, in practice we observe that up to 10k devices are participating simultaneously. It is worth noting that the number of participating devices depends on the (local) time of day (see Fig. 5). Devices are more likely to be idle and charging at night, and hence more likely to participate. We have observed a 4× difference between low and high numbers of participating devices over a 24 hour period for a US-centric population.

Figure 5: Round Completion Rate

Based on the previous work of McMahan et al. (2017) and experiments we have conducted on production FL populations, for most models receiving updates from a few hundred devices per FL round is sufficient (that is, we see diminishing improvements in the convergence rate from training on larger numbers of devices). We also observe that on average the portion of devices that drop out due to computation errors, network failures, or changes in eligibility varies between 6% and 10%. Therefore, in order to compensate for device drop out as well as to allow stragglers to be discarded, the server typically selects 130% of the target number of devices to initially participate. This parameter can be tuned based on the empirical distribution of device reporting times and the target number of stragglers to ignore.
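The over-selection arithmetic is straightforward; a small sketch using the numbers from this section (function and parameter names are, of course, illustrative):

```python
def devices_to_select(target_reports, overselect_pct=30):
    """Select extra devices so the target number of reports survives
    drop-outs (6-10% observed) and lets stragglers be discarded."""
    return target_reports * (100 + overselect_pct) // 100

# E.g., for a target of 400 reporting devices per round, the server
# initially selects devices_to_select(400) == 520 devices (130%).
```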

10 RELATED WORK

Alternative Approaches To the best of our knowledge, the system we described is the first production-level Federated Learning implementation, focusing primarily on the Federated Averaging algorithm running on mobile phones. Nevertheless, there are other ways to learn from data stored on mobile phones, and other settings in which FL as a concept could be relevant.

In particular, Pihur et al. (2018) propose an algorithm that learns from users' data without performing aggregation on the server and with additional formal privacy guarantees. However, their work focuses on generalized linear models, and argues that their approach is highly scalable due to its avoidance of synchronization and to not requiring updates from devices to be stored. Our server design, described in Sec. 4, rebuts the concerns about the scalability of the synchronous approach we are using, and in particular shows that updates can be processed online as they are received without a need to store them. Alternative proposals for FL algorithms include Smith et al. (2017); Kamp et al. (2018), which would be at a high level compatible with the system design described here.

In addition, Federated Learning has already been proposed in the context of vehicle-to-vehicle communication (Samarakoon et al., 2018) and medical applications (Brisimi et al., 2018). While the system described in this work does not, as a whole, directly apply to these scenarios, many aspects of it would likely be relevant for production applications.

Nishio & Yonetani (2018) focus on applying FL in different environmental conditions, namely where the server can reach any subset of heterogeneous devices to initiate a round, but receives updates sequentially due to cellular bandwidth limits. The work offers a resource-aware selection algorithm maximizing the number of participants in a round, which is implementable within our system.

Distributed ML There has been significant work on distributed machine learning, and large-scale cloud-based systems have been described and are used in practice. Many systems support multiple distribution schemes, including model parallelism and data parallelism, e.g., Dean et al. (2012) and Low et al. (2012). Our system imposes a more structured approach fitting to the domain of mobile devices, which have much lower bandwidth and reliability compared to datacenter nodes. We do not allow for arbitrary distributed computation but rather focus on a synchronous FL protocol. This domain specialization allows us, from the system viewpoint, to optimize for the specific use case.

A particularly common approach in the datacenter is the parameter server, e.g., Li et al. (2014); Dean et al. (2012); Abadi et al. (2016), which allows a large number of workers to collaborate on a shared global model, the parameter vector. The focus in that line of work is on an efficient server architecture for dealing with vectors of size 10^9 to 10^12. The parameter server provides global state which workers access and update asynchronously. Our approach inherently cannot work with such a global state, because we require a specific rendezvous between a set of devices and the FL server to perform a synchronous update with Secure Aggregation.

MapReduce For datacenter applications, it is now commonly accepted that MapReduce (Dean & Ghemawat, 2008) is not the right framework for ML training. For the problem space of FL, MapReduce is a close relative. One can interpret the FL server as the Reducer, and FL devices as Mappers. However, there are also fundamental technical differences compared to a generic MapReduce framework. In our system, FL devices own the data on which they are working. They are fully self-controlled actors which attend and leave computation rounds at will. In turn, the FL server actively scans for available FL devices, and brings only selected subsets of them together for a round of computation. The server needs to work with the fact that many devices drop out during computation, and that the availability of FL devices varies drastically over time. These very specific requirements are better dealt with by a domain specific framework than a generic MapReduce.

11 FUTURE WORK

Bias The Federated Averaging (McMahan et al., 2017) protocol assumes that all devices are equally likely to participate and complete each round. In practice, our system potentially introduces bias by the fact that devices only train when they are on an unmetered network and charging. In some countries the majority of people rarely have access to an unmetered network. Also, we limit the deployment of our device code only to certain phones, currently with recent Android versions and at least 2 GB of memory, another source of potential bias.

We address this possibility in the current system as follows: During FL training, the models are not used to make user-visible predictions; instead, once a model is trained, it is evaluated in live A/B experiments using multiple application-specific metrics (just as with a datacenter model). If bias in device participation or other issues lead to an inferior model, it will be detected at this point. So far, we have not observed this to be an issue in practice, but this is likely application and population dependent. Further quantification of these possible effects across a wider set of applications, and if needed, algorithmic or systems approaches to mitigate them, are important directions for future work.

Convergence Time We noted in Sec. 8 that we currently observe a slower convergence time for Federated Learning compared to ML on centralized data where training is backed by the power of a data center. Current FL algorithms such as Federated Averaging can only efficiently utilize 100s of devices in parallel, but many more are available; FL would greatly benefit from new algorithms that can utilize increased parallelism.

On the operational side, there is also more that can be done. For example, the time windows to select devices for training and wait for their reporting are currently configured statically per FL population. They should be dynamically adjusted to reduce the drop out rate and increase round frequency. We should ideally use online ML for tuning this and other parameters of the protocol configuration, bringing in, e.g., the time of day as context.

Device Scheduling Currently, our multi-tenant on-device scheduler uses a simple worker queue for determining which training session to run next (we avoid running training sessions on-device in parallel because of their high resource consumption). This approach is blind to aspects like which apps the user has been frequently using. It's possible for us to end up repeatedly training on older data (up to the expiration date) for some apps, while also neglecting training on newer data for the apps the user is frequently using. Any optimization here, though, has to be carefully evaluated against the biases it may introduce.

Bandwidth When working with certain types of models, for example recurrent networks for language modeling, even small amounts of raw data can result in large amounts of information (weight updates) being communicated. In particular, this might be more than if we simply uploaded the raw data. While this could be viewed as a tradeoff for better privacy, there is also much that can be improved. To reduce the bandwidth necessary, we implement compression techniques such as those of Konečný et al. (2016b) and Caldas et al. (2018). In addition, we can modify the training algorithms to obtain models in a quantized representation (Jacob et al., 2017), which will have a synergistic effect with bandwidth savings and be important for efficient deployment for inference.

Federated Computation We believe there are more applications besides ML for the general device/server architecture we have described in this paper. This is also apparent from the fact that this paper contains no explicit mention of any ML logic. Instead, we refer abstractly to 'plans', 'models', 'updates' and so on.

We aim to generalize our system from Federated Learning to Federated Computation, which follows the same basic principles as described in this paper, but does not restrict computation to ML with TensorFlow, instead allowing general MapReduce-like workloads. One application area we are seeing is in Federated Analytics, which would allow us to monitor aggregate device statistics without logging raw device data to the cloud.

ACKNOWLEDGEMENT

Like most production scale systems, there are many more contributors than the authors of this paper. The following people, at least, have directly contributed to design and implementation: Galen Andrew, Blaise Aguera y Arcas, Sean Augenstein, Dave Bacon, Francoise Beaufays, Amlan Chakraborty, Arlie Davis, Stefan Dierauf, Randy Dodgen, Emily Glanz, Shiyu Hu, Ben Kreuter, Eric Loewenthal, Antonio Marcedone, Jason Hunter, Krzysztof Ostrowski, Sarvar Patel, Peter Kairouz, Kanishka Rao, Michael Reneer, Aaron Segal, Karn Seth, Wei Huang, Nicholas Kong, Haicheng Sun, Vivian Lee, Tim Yang, Yuanbo Zhang.


REFERENCES

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems. In OSDI, volume 16, pp. 265–283, 2016.

ai.google. Under the hood of the Pixel 2: How AI is supercharging hardware, 2018. URL https://ai.google/stories/ai-in-hardware/. Retrieved Nov 2018.

Android Documentation. SafetyNet Attestation API. URL https://developer.android.com/training/safetynet/attestation.

Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., and Shmatikov, V. How to backdoor federated learning. arXiv preprint arXiv:1807.00459, 2018.

Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., Ramage, D., Segal, A., and Seth, K. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1175–1191. ACM, 2017.

Brisimi, T. S., Chen, R., Mela, T., Olshevsky, A., Paschalidis, I. C., and Shi, W. Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics, 112:59–67, 2018.

Caldas, S., Konecny, J., McMahan, H. B., and Talwalkar, A. Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210, 2018.

Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pp. 1223–1231, 2012.

Goyal, P., Dollar, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Hard, A., Rao, K., Mathews, R., Beaufays, F., Augenstein, S., Eichner, H., Kiddon, C., and Ramage, D. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.

Hewitt, C., Bishop, P. B., and Steiger, R. A universal modular ACTOR formalism for artificial intelligence. In Proceedings of the 3rd International Joint Conference on Artificial Intelligence, Stanford, CA, USA, August 20–23, 1973, pp. 235–245, 1973.

Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. arXiv preprint arXiv:1712.05877, 2017.

Kamp, M., Adilova, L., Sicking, J., Huger, F., Schlicht, P., Wirtz, T., and Wrobel, S. Efficient decentralized deep learning by dynamic model averaging. arXiv preprint arXiv:1807.03210, 2018.

Konecny, J., McMahan, H. B., Ramage, D., and Richtarik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016a.

Konecny, J., McMahan, H. B., Yu, F. X., Richtarik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016b.

Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 583–598. USENIX Association, 2014.

Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., and Hellerstein, J. M. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow., 5(8):716–727, April 2012.

McMahan, H. B. and Ramage, D. Federated learning: Collaborative machine learning without centralized training data, April 2017. URL https://ai.googleblog.com/2017/04/federated-learning-collaborative.html. Google AI Blog.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 1273–1282, 2017.


McMahan, H. B., Ramage, D., Talwar, K., and Zhang, L. Learning differentially private recurrent language models. In International Conference on Learning Representations (ICLR), 2018.

Nishio, T. and Yonetani, R. Client selection for federated learning with heterogeneous resources in mobile edge. arXiv preprint arXiv:1804.08333, 2018.

Pihur, V., Korolova, A., Liu, F., Sankuratripati, S., Yung, M., Huang, D., and Zeng, R. Differentially-private “draw and discard” machine learning. arXiv preprint arXiv:1807.04369, 2018.

Samarakoon, S., Bennis, M., Saad, W., and Debbah, M. Federated learning for ultra-reliable low-latency V2V communications. arXiv preprint arXiv:1805.09253, 2018.

Smith, S., Kindermans, P.-J., Ying, C., and Le, Q. V. Don't decay the learning rate, increase the batch size. In International Conference on Learning Representations (ICLR), 2018.

Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. S. Federated multi-task learning. In Advances in Neural Information Processing Systems, pp. 4424–4434, 2017.

Yang, T., Andrew, G., Eichner, H., Sun, H., Li, W., Kong, N., Ramage, D., and Beaufays, F. Applied federated learning: Improving Google keyboard query suggestions. arXiv preprint arXiv:1812.02903, 2018.


A OPERATIONAL PROFILE DATA

In this section we present operational profile data for one of the FL populations that are currently active in the deployed FL system, augmenting the discussion in Sec. 9. The subject FL population primarily comes from the same time zone.

Fig. 6 illustrates how device availability varies through the day and how it affects the round completion rate. Because the FL server schedules an FL task for execution only once a desired number of devices are available and selected, the round completion rate oscillates in sync with device availability.
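A minimal sketch of that gating, assuming hypothetical device check-in records (the eligibility flags shown are illustrative):

def maybe_start_round(checked_in, goal_count):
    # checked_in: hypothetical device records such as
    #   {"id": 17, "idle": True, "charging": True, "unmetered": True}
    eligible = [d for d in checked_in
                if d["idle"] and d["charging"] and d["unmetered"]]
    if len(eligible) < goal_count:
        return None  # too few devices at this hour; the round rate dips
    return eligible[:goal_count]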

Figure 6: A subset of the connected devices over three days (top) in states “participating” (blue) and “waiting” (purple). Other states (“closing” and “attesting”) are too rare to be visible in this graph. The rate of successful round completions (green, bottom) is also shown, along with the rate of other outcomes (“failure”, “retry”, and “abort”) plotted on the same graph but too low to be visible.

Fig. 7 illustrates the average number of devices participating in an FL task round and the outcomes of their participation. Note that in each round the FL server selects more devices than are desired to complete, to offset the devices that drop out during execution. Therefore, in each round some devices are aborted after the desired number of devices successfully complete. Another noteworthy aspect is the correlation of the drop-out rate with the time of day: the drop-out rate is higher during the daytime than at night. This is explained by the higher probability that a device's eligibility criteria change due to user interaction with the device.
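As a hedged sketch of this over-selection policy (the 1.3x factor matches the selection step in Algorithm 1 of Appendix B; the drop-out model and function names are assumptions for illustration):

import random

SELECTION_FACTOR = 1.3  # over-selection factor, as in Algorithm 1 (Appendix B)

def run_round(available_devices, k_desired, dropout_prob):
    # Select more devices than needed so drop-outs cannot stall the round.
    n_select = min(len(available_devices), int(SELECTION_FACTOR * k_desired))
    selected = random.sample(available_devices, n_select)

    completed, dropped, aborted = [], [], []
    for device in selected:
        if len(completed) >= k_desired:
            aborted.append(device)    # enough results already: abort the rest
        elif random.random() < dropout_prob:
            dropped.append(device)    # e.g. eligibility lost to user interaction
        else:
            completed.append(device)
    return completed, dropped, aborted

# Daytime-like behavior: a higher drop-out probability leaves fewer
# devices to abort and risks under-filling the round.
done, dropped, aborted = run_round(list(range(2000)), 1000, dropout_prob=0.10)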

Fig. 8 shows the distribution of round run times and device participation times. There are two noteworthy observations. First, the round run time is roughly equal to the majority of device participation times, which is explained by the fact that the FL server selects more devices than needed for participation and stops execution when enough devices complete. Second, device participation time is capped.

Figure 7: Average number of devices completed, aborted and dropped out from round execution

This cap is a mechanism the FL server uses to deal with straggler devices; i.e., the round run time is capped by the server.

Figure 8: Round execution and device participation time
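The participation-time cap can be pictured as a server-side reporting deadline; a minimal sketch, with the deadline value assumed purely for illustration:

import time

ROUND_DEADLINE_SEC = 600  # assumed value for illustration only

def accept_upload(round_start_ts, now=None):
    # Updates arriving after the reporting window closes are rejected
    # (the '#' outcome in Tab. 1); this bounds both device participation
    # time and the round run time despite straggler devices.
    now = time.time() if now is None else now
    return (now - round_start_ts) < ROUND_DEADLINE_SEC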

Fig. 9 illustrates the asymmetry in server network traffic, specifically that download from the server dominates upload to it. Several aspects contribute to this: each device downloads both an FL task plan and the current global model (the plan size is comparable to that of the global model), whereas it uploads only updates to the global model; moreover, the model updates are inherently more compressible than the global model.

Figure 9: Server network traffic
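For a hedged back-of-envelope example (all sizes are assumed for illustration, not measurements from Fig. 9): with a 10 MB plan, a 10 MB global model, and updates that compress to 2.5 MB, each participating device downloads 10 + 10 = 20 MB but uploads only 2.5 MB, a download-to-upload ratio of roughly 8:1.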

Tab. 1 shows the training round session shape visualizations generated from the clients' training state event logs. As shown, 75% of clients complete their training rounds successfully, 22% of clients complete their training rounds but have their results rejected by the server (these are the devices which report back after the reporting window has already closed), and 2% of clients are interrupted before being able to complete their round (e.g., because the device exited the idle state).


Session Shape    Count        Percent
-v[]+ˆ           1,116,401    75%
-v[]+#           327,478      22%
-v[!             29,771       2%

Table 1: Distribution of on-device training round sessions. Legend: - = FL server checkin, v = downloaded plan, [ = training started, ] = training completed, + = upload started, ˆ = upload completed, # = upload rejected, ! = interrupted.

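A minimal sketch of how such session shapes could be tallied from per-client event strings (the symbol encoding follows the legend of Tab. 1, with '^' standing in for ˆ; the log format itself is an assumption):

import collections

# Session-shape symbols per the Tab. 1 legend.
SUCCESS, REJECTED, INTERRUPTED = "-v[]+^", "-v[]+#", "-v[!"

def tally_sessions(session_shapes):
    # session_shapes: one symbol string per on-device training session.
    counts = collections.Counter(session_shapes)
    total = sum(counts.values())
    return {shape: (n, n / total) for shape, n in counts.items()}

# Tiny usage example with hypothetical log lines:
logs = [SUCCESS, SUCCESS, SUCCESS, REJECTED, INTERRUPTED]
for shape, (n, frac) in tally_sessions(logs).items():
    print(f"{shape}: {n} ({frac:.0%})")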


B FEDERATED AVERAGING

In this section, we show the Federated Averaging algorithm from McMahan et al. (2017) for the interested reader.

Algorithm 1 FederatedAveraging, targeting updates from K clients per round.

Server executes:
  initialize w_0
  for each round t = 1, 2, ... do
    Select 1.3 K eligible clients to compute updates
    Wait for updates from K clients (indexed 1, ..., K):
      (Δ^k, n^k) = ClientUpdate(w) from client k ∈ [K]
    w̄_t = Σ_k Δ^k            // Sum of weighted updates
    n̄_t = Σ_k n^k            // Sum of weights
    Δ̄_t = w̄_t / n̄_t          // Average update
    w_{t+1} ← w_t + Δ̄_t

ClientUpdate(w):
  B ← (local data divided into minibatches)
  n ← |B|                    // Update weight
  w_init ← w
  for batch b ∈ B do
    w ← w − η ∇ℓ(w; b)
  Δ ← n · (w − w_init)       // Weighted update
  // Note: Δ is more amenable to compression than w
  return (Δ, n) to server
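For concreteness, here is a minimal NumPy sketch of the algorithm above, run as an in-memory simulation rather than the production on-device implementation; the squared loss, learning rate, and client sampling are illustrative assumptions, and unlike production no over-selection is needed since simulated clients never drop out.

import numpy as np

def client_update(w, batches, lr=0.1):
    # ClientUpdate(w): local SGD over the device's minibatches.
    w_init, w = w.copy(), w.copy()
    for X, y in batches:
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared loss
        w -= lr * grad
    n = len(batches)              # update weight n = |B|
    delta = n * (w - w_init)      # weighted update Δ = n · (w − w_init)
    return delta, n

def federated_averaging(clients, w0, rounds=10, k=5):
    # Server loop: sample k clients per round and apply the averaged update.
    w = w0.copy()
    for _ in range(rounds):
        chosen = np.random.choice(len(clients), size=k, replace=False)
        updates = [client_update(w, clients[i]) for i in chosen]
        sum_delta = sum(d for d, _ in updates)   # Σ_k Δ^k
        sum_n = sum(n for _, n in updates)       # Σ_k n^k
        w += sum_delta / sum_n                   # w ← w + Δ̄
    return w

# Example: 10 simulated clients, each holding 4 minibatches of
# synthetic linear-regression data.
rng = np.random.default_rng(0)
clients = [[(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(4)]
           for _ in range(10)]
w = federated_averaging(clients, np.zeros(3), rounds=5, k=5)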