
Privacy Accounting and Quality Control in the Sage Differentially Private ML Platform

Mathias Lécuyer, Riley Spahn, Kiran Vodrahalli, Roxana Geambasu, and Daniel Hsu
Columbia University

Abstract

Companies increasingly expose machine learning (ML) models trained over sensitive user data to untrusted domains, such as end-user devices and wide-access model stores. This creates a need to control the data's leakage through these models. We present Sage, a differentially private (DP) ML platform that bounds the cumulative leakage of training data through models. Sage builds upon the rich literature on DP ML algorithms and contributes pragmatic solutions to two of the most pressing systems challenges of global DP: running out of privacy budget and the privacy-utility tradeoff. To address the former, we develop block composition, a new privacy loss accounting method that leverages the growing database regime of ML workloads to keep training models endlessly on a sensitive data stream while enforcing a global DP guarantee for the stream. To address the latter, we develop privacy-adaptive training, a process that trains a model on growing amounts of data and/or with increasing privacy parameters until, with high probability, the model meets developer-configured quality criteria. Sage's methods are designed to integrate with TensorFlow-Extended, Google's open-source ML platform. They illustrate how a systems focus on characteristics of ML workloads enables pragmatic solutions that are not apparent when one focuses on individual algorithms, as most DP ML literature does.

1 Introduction

Machine learning (ML) introduces a dangerous double standard in how companies protect user data. In traditional applications, sensitive data is siloed and its accesses are carefully controlled. Consider a messaging application: you do not expect everyone to have access to your messages. Indeed, there is access control logic programmed within that application that mediates all accesses to the messages and determines who should see which messages. But in an ML-driven application, the messages are not just used for the main functionality

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SOSP '19, October 27–30, 2019, Huntsville, ON, Canada
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6873-5/19/10. . . $15.00
https://doi.org/10.1145/3341301.3359639

but also to train various models, such as an auto-complete model, a chat bot, or a recommendation model. These models can leak what you wrote [8] and despite that, they issue predictions to everyone in the world, are often shipped to end-user devices for faster predictions [5, 24, 33, 44], and sometimes are shared across teams within the company more liberally than the data [24, 33, 48].

There is perhaps a sense that because ML models aggregate data from multiple users, they obfuscate individuals' data and therefore warrant weaker protection than the data itself. However, increasing evidence, both theoretical and empirical, suggests that ML models can indeed leak specifics about individual entries in their training sets. Language models trained over users' emails for auto-complete have been shown to encode not only commonly used phrases but also social security numbers and credit card numbers that users may include in their private communications [8]. Collaborative filtering models, as used in recommenders, have been shown to leak specific information across users [7]. Membership in a training set was shown to be inferable even when the attacker only has access to a model's external predictions [49]. Finally, it has long been established theoretically that access to too many accurate linear statistics from a dataset – as an adversary might have by observing periodic releases of models, which often incorporate statistics used for featurization – fundamentally allows reconstruction of the dataset [13].

As companies continue to disseminate many versions of models into untrusted domains, controlling the risk of data exposure becomes critical. We present Sage, an ML platform based on Google's TensorFlow-Extended (TFX) that uses differential privacy (DP) to bound the cumulative exposure of individual entries in a company's sensitive data streams through all the models released from those streams. DP randomizes a computation over a dataset (e.g. training one model) to bound the leakage of individual entries in the dataset through the output of the computation (the model). Each new DP computation increases the bound over data leakage, and can be seen as consuming part of a privacy budget that should not be exceeded; Sage makes the process that generates all models and statistics preserve a global DP guarantee.

Sage builds upon the rich literature on DP ML algorithms (e.g., [2, 35, 38]) and contributes pragmatic solutions to two of the most pressing systems challenges of global DP: (1) running out of privacy budget and (2) the privacy-utility tradeoff. Sage expects to be given training pipelines explicitly programmed to satisfy a parameterized DP guarantee. It acts as a


new access control layer in TFX that mediates data accesses by these pipelines and accounts for the cumulative privacy loss from them to enforce the global DP guarantee against the stream. At the same time, Sage provides the developers with: control over the quality of the models produced by the DP training pipelines (addresses challenge (2)); and the ability to release models endlessly without running out of privacy budget for the stream (addresses challenge (1)).

The key to addressing both challenges is the realization that ML workloads operate on growing databases, a model of interaction that has been explored very little in DP, and only with a purely theoretical and far from practical approach [12]. Most DP literature, largely focused on individual algorithms, assumes either static databases (which do not incorporate new data) or online streaming (where computations do not revisit old data). In contrast, ML workloads – which consist of many algorithms invoked periodically – operate on growing databases. Across invocations of different training algorithms, the workload both incorporates new data and reuses old data, oftentimes adaptively. It is in that adaptive reuse of old data coupled with new data that our design of Sage finds the opportunity to address the preceding two challenges in ways that are practical and integrate well with TFX-like platforms.

To address the running out of privacy budget challenge, we develop block composition, the first privacy accounting method that both allows efficient training on growing databases and avoids running out of privacy budget as long as the database grows fast enough. Block composition splits the data stream into blocks, for example by time (e.g., one day's worth of data goes into one block) or by users (e.g., each user's data goes into one block), depending on the unit of protection (event- or user-level privacy). Block composition lets training pipelines combine available blocks into larger datasets to train models effectively, but accounts for the privacy loss of releasing a model at the level of the specific blocks used to train that model. When the privacy loss for a given block reaches a pre-configured ceiling, the block is retired and will not be used again. However, new blocks from the stream arrive with zero privacy loss and can be used to train future models. Thus, as long as the database adds new blocks fast enough relative to the rate at which models arrive, Sage will never run out of privacy budget for the stream. Finally, block composition allows adaptivity in the choice of training computation, privacy parameters, and blocks to execute on, thereby modeling the most comprehensive form of adaptivity in the DP literature, including [45], which only considers the first two choices.

To address the privacy-utility tradeoff we develop privacy-adaptive training, a training procedure that controls the utility of DP-trained models by repeatedly and adaptively training them on growing data and/or DP parameters available from the stream. Models retrain until, with high probability, they meet programmer-specified quality criteria (e.g. an accuracy target). Privacy-adaptive training uses block composition's

support for adaptivity and integrates well with TFX's design, which includes a model validation stage in training pipelines.

Fig. 1. Typical Architecture of an ML Platform: a Growing Database with Stream-level ACLs feeds Training Pipelines (Pre-processing, Training, Validation), whose model+features are pushed to a wide-access Model & Feature Store and to world-wide Serving Infrastructure.

2 Background

Our effort builds upon an opportunity we observe in today's companies: the rise of ML platforms, trusted infrastructures that provide key services for ML workloads in production, plus strong library support for their development. They can be thought of as operating systems for ML workloads. Google has TensorFlow-Extended (TFX) [5]; Facebook has FBLearner [24]; Uber has Michelangelo [33]; and Twitter has DeepBird [32]. The opportunity is to incorporate DP into these platforms as a new type of access control that constrains data leakage through the models a company disseminates.

2.1 ML Platforms

Fig. 1 shows our view of an ML platform; it is based on [5, 24, 33]. The platform has several components: Training Pipelines (one for each model pushed into production), Serving Infrastructure, and a shared data store, which we call the Growing Database because it accumulates data from the company's data streams. The access control policies on the Growing Database are exercised through Stream-level ACLs and are typically restrictive for sensitive streams.

The Training Pipeline trains a model on data from the Growing Database and verifies that it meets specific quality criteria before it is deployed for serving or shared with other teams. It is launched periodically (e.g., daily or weekly) on datasets containing samples from a representative time window (e.g., logs over the past month). It has three customizable modules: (1) Pre-processing loads the dataset from the Growing Database, transforms it into a format suitable for training and inference by applying feature transformation operators, and splits the transformed dataset into a training set and a testing set; (2) Training trains the model on a training set; and (3) Validation evaluates one or more quality metrics – such as accuracy for classification or mean squared error (MSE) for regression – on the testing set. Validation checks that the metrics reach specific quality targets to warrant the model's release. The targets can be fixed by developers or can be values achieved by a previous model. If the model meets all quality criteria, it is bundled with its feature transformation operators (a.k.a. features) and pushed into serving.
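To make the flow of these three modules concrete, the following is a minimal sketch; it is illustrative only, and the names growing_db, preprocess, train, evaluate, push_to_serving, and notify_developer are hypothetical rather than the TFX API.

def run_training_pipeline(growing_db, window, quality_target):
    # (1) Pre-processing: load a data window, apply feature transforms, split.
    raw = growing_db.load(window)
    train_set, test_set = preprocess(raw)
    # (2) Training: fit the model on the training set.
    model = train(train_set)
    # (3) Validation: release only if the quality metric meets its target.
    if evaluate(model, test_set) <= quality_target:   # e.g., an MSE target
        push_to_serving(model)                        # model + feature operators
    else:
        notify_developer("validation failed")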

The Serving Infrastructure manages the online aspects of the model. It distributes the model+features to inference servers around the world and to end-user devices and continuously evaluates and partially updates it on new data. The


model+features bundle is also often pushed into a company-wide Model & Feature Store, from where other teams within the company can discover it and integrate it into their own models. Twitter and Uber report sharing embedding models [48] and tens of thousands of summary statistics [33] across teams through their Feature Stores. To enable such wide sharing, companies sometimes enforce more permissive access control policies on the Model & Feature Store than on the data itself.

2.2 Threat Model

We are concerned with the increase in sensitive data exposure that is caused by applying looser access controls to models+features than are typically applied to data. This includes placing models+features in company-wide Model & Feature Stores, where they can be accessed by developers not authorized to access the data. It includes pushing models+features to end-user devices and prediction servers that could be compromised by hackers or oppressive governments. And it includes releasing predictions from these models to the world – either as part of applications or through prediction APIs – which can be used to infer specifics about training data [49, 52]. Our goal is to "neutralize" the wider exposure of models+features by making the process of generating them DP across all models+features ever released from a stream.

We assume the following components are trusted and implemented correctly: Growing Database; Stream-level ACLs; the ML platform code running a Training Pipeline. We also trust the developer that instantiates the modules in each pipeline as long as the developer is authorized by Stream-level ACLs to access the data stream(s) used by the pipeline. However, we do not trust the wide-access Model & Feature Store or the locations to which the serving infrastructure disseminates the model+features or their predictions. Once a model/feature is pushed to those components, it is considered released to the untrusted domain and accessible to adversaries.

We focus on two classes of attacks against models and statistics (see Dwork [22]): (1) membership inference, in which the adversary infers whether a particular entry is in the training set based on either white-box or black-box access to the model, features, and/or predictions [3, 23, 26, 49]; and (2) reconstruction attacks, in which the adversary infers unknown sensitive attributes about entries in the training set based on similar white-box or black-box access [8, 13, 22].

2.3 Differential Privacy

DP is concerned with whether the output of a computation over a dataset – such as training an ML model – can reveal information about individual entries in the dataset. To prevent such information leakage, randomness is introduced into the computation to hide details of individual entries.

Definition 2.1 (Differential Privacy (DP) [20]). A randomized algorithm Q: D → V is (ϵ, δ)-DP if for any D, D′ with |D ⊕ D′| ≤ 1 and for any S ⊆ V, we have: P(Q(D) ∈ S) ≤ e^ϵ P(Q(D′) ∈ S) + δ.

The ϵ > 0 and δ ∈ [0, 1] parameters quantify the strength of the privacy guarantee: small values imply that one draw from such an algorithm's output gives little information about whether it ran on D or D′. The privacy budget ϵ upper bounds an (ϵ, δ)-DP computation's privacy loss with probability (1 − δ). ⊕ is a dataset distance (e.g. the symmetric difference [39]). If |D ⊕ D′| ≤ 1, D and D′ are neighboring datasets.

Multiple mechanisms exist to make a computation DP. They add noise to the computation scaled by its sensitivity s, the maximum change in the computation's output when run on any two neighboring datasets. Adding noise from a Laplace distribution with mean zero and scale s/ϵ (denoted laplace(0, s/ϵ)) gives (ϵ, 0)-DP. Adding noise from a Gaussian distribution scaled by (s/ϵ)√(2 ln(1.25/δ)) gives (ϵ, δ)-DP.
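As a concrete illustration (not Sage code), a minimal numpy sketch of the Laplace mechanism applied to a counting query, whose sensitivity is 1:

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    # Laplace noise with mean 0 and scale s/epsilon gives (epsilon, 0)-DP.
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

# A count changes by at most 1 when one entry is added or removed, so its
# sensitivity is 1 and this release is (0.1, 0)-DP.
dp_count = laplace_mechanism(true_value=1832, sensitivity=1.0, epsilon=0.1)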

DP is known to address the attacks in our threat model [8, 22, 27, 49]. At a high level, membership and reconstruction attacks work by finding data points that make the observed model more likely: if those points were in the training set, the likelihood of the observed output increases. DP prevents these attacks, as no specific data point can drastically increase the likelihood of the model outputted by the training procedure.

DP literature is very rich and mature, including in ML. DP versions exist for almost every popular ML algorithm, including: stochastic gradient descent (SGD) [2, 55]; various regressions [10, 31, 41, 51, 57]; collaborative filtering [38]; language models [37]; feature selection [11]; model selection [50]; evaluation [6]; and statistics, e.g. contingency tables [4] and histograms [54]. The privacy module in TensorFlow v2 implements several SGD-based algorithms [35].
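To convey the flavor of these algorithms, below is a minimal numpy sketch of one DP-SGD step in the style of [2] (per-example gradient clipping plus Gaussian noise). It is not the TensorFlow privacy module's implementation, and all names are illustrative.

import numpy as np

def dp_sgd_step(params, per_example_grads, l2_clip, noise_multiplier, lr, rng):
    # Clip each example's gradient so one example changes the sum by at most l2_clip.
    clipped = [g * min(1.0, l2_clip / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    # Add Gaussian noise calibrated to the clipping norm, then average and step.
    noisy_sum = np.sum(clipped, axis=0) + \
        rng.normal(0.0, noise_multiplier * l2_clip, size=params.shape)
    return params - lr * noisy_sum / len(per_example_grads)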

A key strength of DP is its composition property, which in its basic form states that the process of running an (ϵ1, δ1)-DP and an (ϵ2, δ2)-DP computation on the same dataset is (ϵ1 + ϵ2, δ1 + δ2)-DP. Composition enables the development of complex DP computations – such as DP Training Pipelines – from piecemeal DP components, such as DP ML algorithms. Composition also lets one account for the privacy loss resulting from a sequence of DP-computed outputs, such as the release of multiple models+features.
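In its basic form, this accounting is simple arithmetic; a minimal sketch:

def compose_basic(budgets):
    # budgets: (epsilon, delta) pairs of DP computations run over the same dataset.
    return sum(e for e, _ in budgets), sum(d for _, d in budgets)

# Releasing a (1.0, 1e-6)-DP model and a (0.5, 1e-6)-DP statistic from the
# same dataset is, by basic composition, (1.5, 2e-6)-DP overall.
assert compose_basic([(1.0, 1e-6), (0.5, 1e-6)]) == (1.5, 2e-06)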

A distinction exists between user-level and event-level privacy. User-level privacy enforces DP on all data points contributed by a user toward a computation. Event-level privacy enforces DP on individual data points (e.g., individual clicks). User-level privacy is more meaningful than event-level privacy, but much more challenging to sustain on streams. Although Sage's design can in theory be applied to user-level privacy (§4.4), we focus most of the paper on event-level privacy, which we deem practical enough to be deployed in big companies. §7 discusses the limitations of this semantic.

3 Sage Architecture

The Sage training platform enforces a global (ϵg, δg)-DP semantic over all models+features that have been, or will ever be, released from each sensitive data stream. The highlighted portions in Fig. 2 show the changes Sage brings to a typical


ML platform. First, each Training Pipeline must be made to individually satisfy (ϵ, δ)-DP for some privacy parameters given by Sage at runtime (box (ϵ, δ)-DP Training Pipeline, §3.1). The developer is responsible for making this switch to DP, and while research is needed to ease DP programming, this paper leaves that challenge aside.

Fig. 2. Sage DP ML Platform. Highlights changes from non-DP version: an (ϵ, δ)-DP Training Pipeline whose budget is split across (ϵ/3, 0)-DP Pre-processing, (ϵ/3, δ)-DP Training, and (ϵ/3, 0)-DP SLAed Validation; Sage Access Control enforcing (ϵg, δg)-DP across all pipelines; and per-block privacy budgets (used and available) tracked over the Growing Database.

Second, Sage introduces an additional layer of access control beyond traditional stream-level ACLs (box Sage Access Control, §3.2). The new layer splits the data stream into blocks and accounts for the privacy loss of releasing a model+features bundle at the level of the specific blocks that were used to train that bundle. In theory, blocks can be defined by any insensitive attribute, with two attributes particularly relevant here: time (e.g., one day's worth of data goes into one block) and user ID (e.g., all of a user's data goes into one block). Defining blocks by time provides event-level privacy; defining them by user ID accommodates user-level privacy. Because of our focus on the former semantic, this section assumes that blocks are defined by time; §4.4 discusses sharding by user ID and other attributes.
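For concreteness, a minimal sketch of how an arriving record might be mapped to a block under either semantic (the record field names are illustrative):

from datetime import datetime, timezone

def block_id(record, unit="time"):
    if unit == "time":
        # Event-level privacy: one block per calendar day of arrival.
        day = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
        return day.strftime("%Y-%m-%d")
    # User-level privacy: all of a user's data goes into one block.
    return record["user_id"]

block_id({"timestamp": 1567296000, "user_id": 42})          # -> '2019-09-01'
block_id({"timestamp": 1567296000, "user_id": 42}, "user")  # -> 42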

When the privacy loss for a block reaches the (ϵg, δg) ceiling, the block is retired (blocks D1, D2, D7 are retired in Fig. 2). However, new blocks arrive with a clean budget and can be used to train future models. Thus, as long as the database grows fast enough in new blocks, the system will never run out of privacy budget for the stream. Perhaps surprisingly, this privacy loss accounting method, which we call block composition, is the first practical approach to avoid running out of privacy budget while enabling effective training of ML models on growing databases. §3.2 gives the intuition of block composition; §4 formalizes it and proves it (ϵg, δg)-DP.

Third, Sage provides developers with control over the quality of models produced by the DP Training Pipelines. Such pipelines can produce less accurate models that fail to meet

 1 def preprocessing_fn(inputs, epsilon):
 2   dist_01 = tft.scale_to_0_1(inputs["distance"], 0, 100)
 3   speed_01 = tft.scale_to_0_1(inputs["speed"], 0, 100)
 4   hour_of_day_speed = \
 5     sage.dp_group_by_mean(
 6       inputs["hour_of_day"], speed_01, 24, epsilon, 1.0)
 7   return {"dist_scaled": dist_01,
 8           "hour_of_day": inputs["hour_of_day"],
 9           "hour_of_day_speed": hour_of_day_speed,
10           "duration": inputs["duration"]}
11
12 def trainer_fn(hparams, schema, epsilon, delta): [...]
13   feature_columns = [numeric_column("dist_scaled"),
14                      numeric_column("hour_of_day_speed"),
15                      categorical_column("hour_of_day", num_buckets=24)]
16   estimator = \
17     sage.DPDNNRegressor(
18       config=run_config,
19       feature_columns=feature_columns,
20       dnn_hidden_units=hparams.hidden_units,
21       privacy_budget=(epsilon, delta))
22   return tfx.executors.TrainingSpec(estimator, ...)
23
24 def validator_fn(epsilon):
25   model_validator = \
26     sage.DPModelValidator(
27       examples=examples_gen.outputs.output,
28       model=trainer.outputs.output,
29       metric_fn=_MSE_FN, target=_MSE_TARGET,
30       epsilon=epsilon, confidence=0.95, B=1)
31   return model_validator
32
33 def dp_group_by_mean(key_tensor, value_tensor, nkeys,
34                      epsilon, value_range):
35   key_tensor = tf.dtypes.cast(key_tensor, tf.int64)
36   ones = tf.fill(tf.shape(key_tensor), 1.0)
37   dp_counts = group_by_sum(key_tensor, ones, nkeys) \
38               + laplace(0.0, 2/epsilon, nkeys)
39   dp_sums = group_by_sum(
40               key_tensor, value_tensor, nkeys) \
41             + laplace(0.0, value_range * 2/epsilon, nkeys)
42   return tf.gather(dp_sums/dp_counts, key_tensor)

List. 1. Example Training Pipeline (DP Sage version). In the original listing, the replaced non-DP TFX calls (group_by_mean on line 4, tf.estimator.DNNRegressor on line 17, and tfx.components.ModelValidator on line 26) appear struck through next to their DP Sage replacements. TFX API simplified for exposition.

their quality targets more often than without DP. DP pipelines can also push into production low-quality models whose validations succeed by mere chance. Both situations lead to operational headaches: the former gives more frequent notifications of failed training, the latter gives dissatisfied users. The issue is often referred to as the privacy-utility tradeoff of running under a DP regime. Sage addresses this challenge by wrapping the (ϵ, δ)-DP Training Pipeline into an adaptive process that invokes training pipelines repeatedly on increasing amounts of data and/or privacy budgets to reduce the effects of DP randomness until, with high probability, models reach their quality criteria (box Privacy-Adaptive Training, §3.3).

3.1 Example (ϵ, δ)-DP Training Pipeline

Sage expects each pipeline submitted by the ML developer to satisfy a parameterized (ϵ, δ)-DP. Acknowledging that DP programming abstractions warrant further research, List. 1 illustrates the changes a developer would have to make at present to convert a non-DP training pipeline written for TFX to a DP training pipeline suitable for Sage; the non-DP TFX calls that are removed or replaced are noted in the listing's caption. The pipeline processes New York City Yellow Cab data [42] and trains a model to predict the duration of a ride.


Fig. 3. Characteristics of Data Interaction in ML.

To integrate with TFX (non-DP version), the developer implements three TFX callbacks. (1) preprocessing_fn uses the dataset to compute aggregate features and make user-specified feature transformations. The example model has three features: the distance of the ride; the hour of the day; and an aggregate feature representing the average speed of cab rides each hour of the day. (2) trainer_fn specifies the model: it configures the columns to be modeled, defines hyperparameters, and specifies the dataset. The example model trains with a neural network regressor. (3) validator_fn validates the model by comparing test set MSE to a constant.

To integrate with Sage (DP version), the developer: (a) switches library calls to DP versions of the functions (which ideally would be available in the DP ML platform) and (b) splits the (ϵ, δ) parameters, which are assigned by Sage at runtime, across the DP function calls. (1) preprocessing_fn replaces one call with a DP version from Sage: the mean speed per hour of the day uses Sage's dp_group_by_mean. This function (lines 33-42) computes the number of times each key appears and the sum of the values associated with each key. It makes both DP by adding draws from appropriately-scaled Laplace distributions to each count. Each data point has exactly one key value, so the privacy budget usage composes in parallel across keys [39]. The privacy budget is split across the sum and count queries. We envision common functions like this being available in the DP ML platform. (2) trainer_fn replaces the call to the non-private regressor with the DP implementation, which in Sage is a simple wrapper around TensorFlow's DP SGD-based optimizer. (3) validator_fn invokes Sage's DP model validator (§3.3).

3.2 Sage Access Control

Sage uses the composition property of DP to rigorously account for (and bound) the cumulative leakage of data from sensitive data streams across multiple releases of models+features learned from these streams. Specifically, for each sensitive stream, Sage maintains a pre-specified, event-level (ϵg, δg)-DP guarantee across all uses of the stream. Unfortunately, traditional DP composition theory considers either: (a) a static database regime, whose adaptation to growing databases leads to wasteful privacy loss accounting; or (b) purely online streaming, which is inefficient for many ML workloads, including deep neural network training. We thus developed our own composition theory, called block composition, which leverages characteristics of ML workloads running on growing databases to permit both efficient privacy loss accounting and efficient learning. §4 formalizes the new theory. This section describes the limitations of existing DP composition for ML and gives the intuition for block composition and how Sage uses it as a new form of access control in ML platforms.

Characteristics of Data Interaction in ML. Fig. 3 shows an example of a typical workload as seen by an ML platform. Each training pipeline, or query in DP parlance, is denoted Qi. We note three characteristics. First, a typical ML workload consists of multiple training pipelines, training over time on overlapping data subsets of various sizes from an ever-growing database. For instance, Q2 may train a large deep neural network requiring massive amounts of data to reach good performance, while Q3 may train a linear model with smaller data requirements, or even a simple statistic like the mean of a feature over the past day. The pipelines are typically updated or retrained as new data is collected, with old data eventually deemed irrelevant and ignored.

Second, the data given to a training pipeline – and for a DP model, its DP parameters – are typically chosen adaptively. Suppose the model trained in Q1 on data from D1,2 with budget ϵ1 gives unsatisfactory performance. After a new block D3 is collected, a developer may decide to retrain the same model in query Q2 on data from D1,2,3, and with a higher DP budget ϵ2. Adaptivity can also happen indirectly through the data. Suppose Q2 successfully trained a recommendation model. Then, future data collected from the users (e.g., in D4) may depend on the recommendations. Any subsequent query, such as Q3, is potentially influenced by Q2's output.

Third, ML workloads are dynamic and continuous. New models are introduced and expected to be trained within a reasonable amount of time; some models are run periodically with new data, others are removed from production; and this dynamic workload continues as long as new data is collected.

These characteristics imply three requirements for a composition theory suitable for ML. It must support:

R1 Queries on overlapping data subsets of diverse sizes.
R2 Adaptivity in the choice of: queries, DP parameters, and data subsets the queries process.
R3 Endless execution on a growing database.

Limitations of Existing Composition Theory for ML. No previous DP composition theory supports all three requirements. DP has mostly been studied for static databases, where (adaptively chosen) queries are assumed to compute over the entire database. Consequently, composition accounting is typically made at query level: each query consumes part of the total available privacy budget for the database. Query-level accounting has carried over even in extensions to DP theory that handle streaming databases [19] and partitioned queries [39]. There are multiple ways to apply query-level accounting to ML, but each trades off at least one requirement.

First, one can query overlapping data subsets (R1) and allow adaptivity across these queries (R2) by accounting for composition at the query level against the entire stream. In Fig. 3, after executing queries Q1-Q3, the total privacy loss for the stream would be ϵ1 + ϵ2 + ϵ3. This approach wastes privacy


budget and leads to the problem of "running out of budget." Once ϵg = ϵ1 + ϵ2 + ϵ3, enforcing a global leakage bound of ϵg means that one must stop using the stream after query Q3. This is true even though (1) not all queries run on all the data and (2) there will be new data in the future (e.g., D5). This violates requirement (R3) of endless execution on streams.

Second, one can restructure the queries to enable finer granularity with query-level accounting. The data stream is partitioned in blocks, as in Fig. 3. Each query is split into multiple sub-queries, each running DP on an individual block. The DP results are then aggregated, for instance by averaging model updates as in federated learning [37]. Since each block is a separate dataset, traditional composition can account for privacy loss at the block level. This approach supports adaptivity (R2) and endless execution on streams (R3) as new data blocks incur no privacy loss from past queries. However, it violates requirement (R1), resulting in unnecessarily noisy learning [14, 15]. Consider computing a feature average. DP requires adding noise once, after summing all values on the combined blocks. But with independent queries over each block, we must add the same amount of noise to the sum over each block, yielding a more noisy total. As another example, several DP training algorithms [2, 37] fundamentally rely on sampling small training batches from large datasets to amplify privacy, which cannot be done without combining blocks.
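A small numeric sketch of that effect for a sum of sensitivity 1 split across 30 blocks (illustrative values only):

import numpy as np

epsilon, sensitivity, n_blocks = 0.1, 1.0, 30
rng = np.random.default_rng(0)

# One DP sum over the combined blocks: a single Laplace draw of scale s/eps.
noise_combined = rng.laplace(0.0, sensitivity / epsilon)

# One DP sum per block, then aggregated: 30 independent draws of the same
# scale; the standard deviation of the total grows with sqrt(n_blocks)
# (sqrt(2)*s/eps ≈ 14 combined vs. sqrt(2*n_blocks)*s/eps ≈ 77 per-block).
noise_per_block = rng.laplace(0.0, sensitivity / epsilon, size=n_blocks).sum()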

Third, one can consume the data stream online using streaming DP. A new data point is allocated to one of the waiting queries, which consumes its entire privacy budget. Because each point is used by one query and discarded, DP holds over the entire stream. New data can be adaptively assigned to any query (R2) and arrives with a fresh budget (R3). However, queries cannot use past data or overlapping subsets, violating R1 and rendering the approach impractical for large models.

Block Composition. Our new composition theory meets all three requirements. It splits the data stream into disjoint blocks (e.g., one day's worth of data for event-level privacy), forming a growing database on which queries can run on overlapping and adaptively chosen sets of blocks (R1, R2). This lets pipelines combine blocks with available privacy budgets to assemble large datasets. Despite running on overlapping data sets, our theoretical analysis (§4) shows that we can still account for the privacy loss at the level of individual blocks, namely that each query only impacts the blocks it actually uses, not the entire data stream. In Fig. 3, the first three blocks each incur a privacy loss of ϵ1 + ϵ2 while the last block has ϵ2 + ϵ3. The privacy loss of these three queries over the entire data stream will only be the maximum of these two values. Moreover, when the database grows (e.g. block D5 arrives), the new blocks' privacy loss is zero. The system can thus run endlessly by training new models on new data (R3).

Sage Access Control. With block composition, Sage controls data leakage from a stream by enforcing DP on its blocks. The company configures a desirable (ϵg, δg) global policy

 1 class DPLossValidator(sage.DPModelValidator):
 2   def validate(loss_fn, target, epsilon, conf, B):
 3     if _ACCEPT_test(..., epsilon, (1-conf)/2, B):
 4       return ACCEPT
 5     if _REJECT_test(..., epsilon, (1-conf)/2, B):
 6       return REJECT
 7     return RETRY
 8
 9 def _ACCEPT_test(test_labels, dp_test_predictions,
10                  loss_fn, target, epsilon, eta, B):
11   n_test = dp_test_predictions.size()
12   n_test_dp = n_test + laplace(2/epsilon)
13   n_test_dp_min = n_test_dp - \
14     2*log(3/(2*eta))/epsilon
15   dp_test_loss = clip_by_value(loss_fn(test_labels,
16     dp_test_predictions), 0, B) + laplace(2*B/epsilon)
17   corrected_dp_test_loss = dp_test_loss + \
18     2*B*log(3/(2*eta))/epsilon
19   return bernstein_upper_bound(
20     corrected_dp_test_loss / n_test_dp_min,
21     n_test_dp_min, eta/3, B) <= target
22
23 def bernstein_upper_bound(loss, n, eta, B):
24   return loss + sqrt(2*B*loss*log(1/eta)/n) + \
25     4*log(1/eta)/n

List. 2. Implementation of sage.DPLossValidator.

for each sensitive stream. The Sage Access Control component tracks the available privacy budget for each data block. It allows access to a block until it runs out of budget, after which ML access to the block will forever be denied. When the Sage Iterator (described in §3.3) for a pipeline requests data, Sage Access Control only offers blocks with available privacy budget. The Iterator then determines the (ϵ, δ) privacy parameters it would like to try for its next iteration and requests that budget from Sage Access Control, which deducts (ϵ, δ) from the available privacy budgets of those blocks (assuming they are still available). Finally, the Iterator invokes the developer-supplied DP Training Pipeline, trusting it to enforce the chosen (ϵ, δ) privacy parameters. §4 proves that this access control policy enforces (ϵg, δg)-DP for the stream.
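A minimal sketch of this bookkeeping, with illustrative class and method names (this is not Sage's implementation):

class BlockBudgetAccountant:
    def __init__(self, eps_g, delta_g):
        self.eps_g, self.delta_g = eps_g, delta_g
        self.spent = {}                      # block_id -> [eps_spent, delta_spent]

    def add_block(self, block_id):
        self.spent[block_id] = [0.0, 0.0]    # new blocks arrive with zero privacy loss

    def offer_blocks(self):
        # Blocks that have reached the (eps_g, delta_g) ceiling are retired.
        return [b for b, (e, d) in self.spent.items()
                if e < self.eps_g and d < self.delta_g]

    def try_charge(self, block_ids, eps, delta):
        # Grant the request only if no requested block would exceed the ceiling.
        if any(self.spent[b][0] + eps > self.eps_g or
               self.spent[b][1] + delta > self.delta_g for b in block_ids):
            return False
        for b in block_ids:
            self.spent[b][0] += eps
            self.spent[b][1] += delta
        return True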

The preceding operation is a DP-informed retention policy, but one can use block composition to define other access control policies. Suppose the company is willing to assume that its developers (or user devices and prediction servers in distinct geographies) will not collude to violate its customers' privacy. Then the company could enforce a separate (ϵg, δg) guarantee for each context (developer or geography) by maintaining separate lists of per-block available budgets.

3.3 Privacy-Adaptive Training

Sage gives developers control over the quality of the models it pushes into production, which can be affected by DP randomness. We describe two techniques: (1) SLAed validation accounts for the effect of randomness in the validation process to ensure a high-probability guarantee of correct assessment (akin to a quality service level agreement, or SLA); and (2) privacy-adaptive training launches the (ϵ, δ)-DP Training Pipeline on increasing amounts of data from the stream, and/or with increased privacy parameters, to improve the model's quality adaptively until validation succeeds. Privacy-adaptive training thus leverages adaptivity support in block composition to address DP's privacy-utility tradeoff.


SLAed DP Validation. Fig. 2 shows the three possible outcomes of SLAed validation: ACCEPT, REJECT/timeout, and RETRY. If SLAed validation returns ACCEPT, then with high probability (e.g. 95%) the model reached its configured quality targets for prediction on new data from the same distribution. In some cases, it is also possible to give statistical guarantees that the model will never reach a target irrespective of sample size and privacy parameters, in which case SLAed validation returns REJECT. Sage also supports timing out a training procedure if it has run for too long. Finally, if SLAed validation returns RETRY, it signals that more data is needed for an assessment. Here we focus on the ACCEPT and RETRY outcomes and refer the reader to Appendix B for a discussion of REJECT tests.

We have implemented SLAed validators for three classes of metrics: loss metrics (e.g. MSE, log loss), accuracy, and absolute errors of sum-based statistics such as mean and variance. All validators follow the same logic. First, we compute a DP version of the test quantity (e.g. MSE) on a testing set. Second, we compute the worst-case impact of DP noise on that quantity for a given confidence probability; we call this a correction for DP impact. For example, if we add Laplace noise with parameter 1/ϵ to the sum of squared errors on n data points, assuming that the loss is in [0, 1], we know that with probability (1 − η) the sum is deflated by less than −(1/ϵ) ln(1/(2η)), because a draw from this Laplace distribution has just an η probability to be more negative than this value. Third, we use known statistical concentration inequalities, also made DP and corrected for worst case noise impact, to upper bound with high probability the loss on the entire distribution. We next detail the loss SLAed validator; §B describes the others.

Example: Loss SLAed Validator. A loss function is a measure of erroneous predictions on a dataset (so lower is better). Examples: mean squared error for regression, log loss for classification, and minus log likelihood for Bayesian generative models. List. 2 shows our loss validator and details the implementation of its ACCEPT test.

Denote the DP-trained model f^dp, the loss function range [0, B], and the target loss τ_loss. Lines 11-14 compute a DP estimate of the number of samples in the test set, corrected for the impact of DP noise to be a lower bound on the true value with probability (1 − η/3). Lines 15-18 compute a DP estimate of the loss sum, corrected for DP impact to be an upper bound on the true value with probability (1 − η/3). Lines 19-21 ACCEPT the model if the upper bound is at most τ_loss. The bounds are based on a standard concentration inequality (Bernstein's inequality, Lines 24-25), which holds under very general conditions [47]. We show in §B.1 that the Loss ACCEPT Test satisfies (ϵ, 0)-DP and enjoys the following guarantee:

Proposition 3.1 (Loss ACCEPT Test). With probability at least (1 − η), the ACCEPT test returns true only if the expected loss of f^dp is at most τ_loss.
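As a small numeric sketch of the correction terms used above (illustrative values only):

from math import log

epsilon, eta = 1.0, 0.05
# A Laplace(1/epsilon) draw falls below -(1/epsilon)*ln(1/(2*eta)) with
# probability eta, so widening the DP estimate by this margin yields a
# one-sided bound that holds with probability 1 - eta.
correction = (1.0 / epsilon) * log(1.0 / (2.0 * eta))   # ≈ 2.30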

Privacy-Adaptive Training. Sage attempts to improve the quality of the model and its validation by supplying them with more data or privacy budgets so the SLAed validator can either ACCEPT or REJECT the model. Several ways exist to improve a DP model's quality. First, we can increase the dataset's size: at least in theory, it has been proven that one can compensate for the loss in accuracy due to any (ϵ, δ)-DP guarantee by increasing the training set size [29]. Second, we can increase the privacy budget (ϵ, δ) to decrease the noise added to the computation: this must be done within the available budgets of the blocks involved in the training and not too aggressively, because wasting privacy budget on one pipeline can prevent other pipelines from using those blocks.

Privacy-adaptive training searches for a configuration that can be either ACCEPTed or REJECTed by the SLAed validator. We have investigated several strategies for this search. Those that conserve privacy budget have proven the most efficient. Every time a new block is created, its budget is divided evenly across the ML pipelines currently waiting in the system. Allocated DP budget is reserved for the pipeline that received it, but privacy-adaptive training will not use all of it right away. It will try to ACCEPT using as little of the budget as possible. When a pipeline is ACCEPTed, its remaining budget is reallocated evenly across the models still waiting in Sage.

To conserve privacy budget, each pipeline will first train and test using a small configurable budget (ϵ0, δ0) and a minimum window size for the model's training. On RETRY from the validator, the pipeline will be retrained, doubling either the privacy budget, if enough allocation is available to the Training Pipeline, or the number of samples available to the Training Pipeline, by accepting new data from the stream. This doubling of resources ensures that when a model is ACCEPTed, the sum of budgets used by all failed iterations is at most equal to the budget used by the final, accepted iteration. This final budget also overshoots the best possible budget by at most a factor of two, since the model with half this final budget had a RETRY. Overall, the resources used by this DP budget search are thus at most four times the budget of the final model. Evaluation §5.4 shows that this conservative strategy improves performance when multiple Training Pipelines contend for the same blocks.
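A rough sketch of this search loop, reusing the illustrative accountant interface from the earlier sketch; the pipeline methods are hypothetical names, not Sage's code:

def privacy_adaptive_training(pipeline, accountant, eps0, delta0, min_window):
    eps, window = eps0, min_window
    while True:
        blocks = accountant.offer_blocks()[-window:]         # newest eligible blocks
        if not accountant.try_charge(blocks, eps, delta0):
            return "timeout"                                 # wait for new blocks or budget
        outcome = pipeline.train_and_validate(blocks, eps, delta0)  # SLAed validation
        if outcome in ("ACCEPT", "REJECT"):
            return outcome
        # On RETRY, double either the privacy budget (if the pipeline's
        # allocation allows) or the amount of data, so that all failed
        # iterations together cost at most as much as the accepted one.
        if pipeline.allocation_allows(2 * eps):
            eps *= 2
        else:
            window *= 2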

4 Block Composition Theory

This section provides the theoretical backing for block composition, which we invent for Sage but which we believe has broader applications (§4.4). To analyze composition, one formalizes permissible interactions with the sensitive data in a protocol that facilitates the proof of the DP guarantee. This interaction protocol makes explicit the worst-case decisions that can be made by modeling them through an adversary. In the standard protocol (detailed shortly), an adversary A picks the neighboring data sets and supplies the DP queries that will compute over one of these data sets; the choice between the two data sets is exogenous to the interaction. To prove


(a) QueryCompose(A, b, r, (ϵi, δi) for i = 1..r):
1: for i in 1, ..., r do    ▷ (A depends on V_1^b, ..., V_{i-1}^b in iter. i)
2:   A gives neighboring datasets D_{i,0} and D_{i,1}
3:   A gives an (ϵi, δi)-DP query Qi
4:   A receives V_i^b = Qi(D_{i,b})
return V^b = (V_1^b, ..., V_r^b)

(a) Traditional Query-level Accounting.

(b) BlockCompose(A, b, r, (ϵi, δi) for i = 1..r, (blocks_i) for i = 1..r):
1: A gives two neighboring block datasets D^0 and D^1
2: for i in 1, ..., r do    ▷ (A depends on V_1^b, ..., V_{i-1}^b in iter. i)
3:   A gives an (ϵi, δi)-DP query Qi
4:   A receives V_i^b = Qi(∪_{j∈blocks_i} D_j^b)
return V^b = (V_1^b, ..., V_r^b)

(b) Block Composition for Static Datasets. Changes from query-level accounting are highlighted in the original figure.

(c) AdaptiveStreamBlockCompose(A, b, r, ϵg, δg, W):
1: A gives k, the index of the block with the adversarially chosen change
2: for i in 1, ..., r do    ▷ (A depends on V_1^b, ..., V_{i-1}^b in iter. i)
3:   if create new block l and l == k then
4:     A gives neighboring blocks D_k^0 and D_k^1
5:   else if create new block l and l ≠ k then
6:     D_l^b = D(W, V_1^b, ..., V_{i-1}^b)
7:   A gives blocks_i, (ϵi, δi), and an (ϵi, δi)-DP query Qi
8:   if ∧_{j∈blocks_i} AccessControl^j_{ϵg,δg}(ϵ_1^j, δ_1^j, ..., ϵ_i^j, δ_i^j, 0, ...) then
9:     A receives V_i^b = Qi(∪_{j∈blocks_i} D_j^b)
10:  else A receives a no-op V_i^b = ⊥
return V^b = (V_1^b, ..., V_r^b)

(c) Sage Block Composition. Adds support for streams (lines 1-6) and adaptive choice of blocks and privacy parameters (lines 7-8).

Fig. 4. Interaction Protocols for Composition Analysis. A is an algorithm defining the adversary's power; b ∈ {0, 1} denotes two hypotheses the adversary aims to distinguish; r is the number of rounds; (ϵi, δi) for i = 1..r are the DP parameters used at each round; (blocks_i) for i = 1..r are the blocks used at each round. AccessControl^j_{ϵg,δg} returns true if running the (ϵi, δi)-DP query Qi on block j ensures that with probability ≥ (1 − δg) the privacy loss for block j is ≤ ϵg.

that the interaction satisfies DP, one must show that given the results of the protocol, it is impossible to determine with high confidence which of the two neighboring data sets was used.

Fig. 4 describes three different interaction protocols of increasing sophistication. Alg. (4a) is the basic DP composition protocol. Alg. (4b) is a block-level protocol we propose for static databases. Alg. (4c) is the protocol adopted in Sage; it extends Alg. (4b) by allowing a streaming database and adaptive choices of blocks and privacy parameters. Highlighted are changes made to the preceding protocol.

4.1 Traditional Query-Level Accounting

QueryCompose (Alg. (4a)) is the interaction protocol assumed in most analyses of composition of several DP interactions with a database. There are three important characteristics. First, the number of queries r and the DP parameters (ϵi, δi) for i = 1..r are fixed in advance. However, the DP queries Qi can be chosen adaptively. Second, the adversary adaptively chooses neighboring datasets D_{i,0} and D_{i,1} for each query. This flexibility lets the protocol readily support adaptively evolving data (such as with data streams) where future data collected may be impacted by the adversary's change to past data. Third, the adversary receives the results V^b of running the DP queries Qi on D_{i,b}; here, b ∈ {0, 1} is the exogenous choice of which database to use and is unknown to A. DP is guaranteed if A cannot confidently learn b given V^b.

A common tool to analyze DP protocols is privacy loss:

Definition 4.1 (Privacy Loss). Fix any outcome v = (v1, ..., vr) and denote v_{<i} = (v1, ..., v_{i-1}). The privacy loss of an algorithm Compose(A, b, r, ·) is:

Loss(v) = ln( P(V^0 = v) / P(V^1 = v) ) = ln( ∏_{i=1}^{r} P(V_i^0 = vi | v_{<i}) / P(V_i^1 = vi | v_{<i}) )

Bounding the privacy loss for any adversary A with high probability implies DP [30]. Suppose that for any A, with probability ≥ (1 − δ) over draws from v ∼ V^0, we have: |Loss(v)| ≤ ϵ. Then Compose(A, b, r, ·) is (ϵ, δ)-DP. This way, privacy loss and DP are defined in terms of distinguishing between two hypotheses indexed by b ∈ {0, 1}.

Previous composition theorems (e.g. basic composition [17], strong composition [21], and variations thereof [28]) analyze Alg. (4a) to derive various arithmetics for computing the overall DP semantic of interactions adhering to that protocol. In particular, the basic composition theorem [17] proves that QueryCompose(A, b, r, (ϵi, δi) for i = 1..r) is (∑_{i=1}^{r} ϵi, ∑_{i=1}^{r} δi)-DP. These theorems form the basis of most ML DP work. However, because composition is accounted for at the query level, imposing a fixed global privacy budget means that one will "run out" of it and stop training models even on new data.

4.2 Block Composition for Static Datasets

Block composition improves privacy accounting for workloads where interaction consists of queries that run on overlapping data subsets of diverse sizes. This is one of the characteristics we posit for ML workloads (requirement R1 in §3.2). Alg. (4b), BlockCompose, formalizes this type of interaction for a static dataset setting as a springboard to formalizing the full ML interaction. We make two changes to QueryCompose. First (line 1), the neighboring datasets are defined once and for all before interacting. This way, training pipelines accessing non-overlapping parts of the dataset cannot all be impacted by one entry's change. Second (line 4), the data is split in blocks, and each DP query runs on a subset of the blocks.

We prove that the privacy loss over the entire dataset is the same as the maximum privacy loss on each block, accounting only for queries using this block:

Theorem 4.2 (Reduction to Block-level Composition). The privacy loss of BlockCompose(A, b, r, (ϵi, δi) for i = 1..r, (blocks_i) for i = 1..r) is upper-bounded by the maximum privacy loss for any block:

|Loss(v)| ≤ max_k | ln( ∏_{i=1..r, k∈blocks_i} P(V_i^0 = vi | v_{<i}) / P(V_i^1 = vi | v_{<i}) ) |.

Proof. Let D^0 and D^1 be the neighboring datasets picked by adversary A, and let k be the block index s.t. D_l^0 = D_l^1 for all l ≠ k, and |D_k^0 ⊕ D_k^1| ≤ 1. For any result v of Alg. (4b):

|Loss(v)| = | ln( ∏_{i=1..r} P(V_i^0 = vi | v_{<i}) / P(V_i^1 = vi | v_{<i}) ) |
          = | ln( ∏_{i=1..r, k∈blocks_i} P(V_i^0 = vi | v_{<i}) / P(V_i^1 = vi | v_{<i}) ) + ln( ∏_{i=1..r, k∉blocks_i} P(V_i^0 = vi | v_{<i}) / P(V_i^1 = vi | v_{<i}) ) |
          ≤ max_k | ln( ∏_{i=1..r, k∈blocks_i} P(V_i^0 = vi | v_{<i}) / P(V_i^1 = vi | v_{<i}) ) |.

The second term is zero because if k ∉ blocks_i, then ∪_{j∈blocks_i} D_j^0 = ∪_{j∈blocks_i} D_j^1, hence P(V_i^0 = vi | v_{<i}) / P(V_i^1 = vi | v_{<i}) = 1. □

Hence, unused data blocks allow training of other (adaptively chosen) ML models, and exhausting the DP budget of a block means we retire that block of data, and not the entire dataset. This result, which can be extended to strong composition (see Appendix A.3), can be used to do tighter accounting than query-level accounting when the workload consists of queries on overlapping sets of data blocks (requirement R1). However, it does not support adaptivity in block choice or a streaming setting, violating R2 and R3.

4.3 Sage Block Composition

Alg. (4c), AdaptiveStreamBlockCompose, addresses the preceding limitations with two changes. First, supporting streams requires that datasets not be fixed before interacting, because future data depends on prior models trained and pushed into production. The highlighted portions of lines 1-10 in Alg. (4c) formalize the dynamic nature of data collection by having new data blocks explicitly depend on previously trained models, which are chosen by the adversary, in addition to other mechanisms of the world W that are not impacted by the adversary. Fortunately, Theorem 4.2 still applies, because model training can only use blocks that existed at the time of training, which in turn only depend on prior blocks through DP trained models. Therefore, new data blocks can train new ML models, enabling endless operation on streams (R3).

Second, interestingly, supporting adaptive choices in the data blocks implies supporting adaptive choices in the queries' DP budgets. For a given block, one can express query i's choice to use block j as using a privacy budget of either (ϵi, δi) or (0, 0). Lines 7-8 in Alg. (4c) formalize the adaptive choice of both privacy budgets and blocks (requirement R2). In supporting both, we leverage recent work on DP composition under adaptive DP budgets [45]. At each round, A requests access to a group of blocks blocks_i, on which to run an (ϵi, δi)-DP query. Sage's Access Control permits the query only if the privacy loss of each block in blocks_i will remain below (ϵg, δg). Applying our Theorem 4.2 and [45]'s Theorem 3.3, we prove the following result (proof in §A.2):

Theorem 4.3 (Basic Composition for Sage Block Composition). AdaptiveStreamBlockCompose(A, b, r, ϵ_g, δ_g, W) is (ϵ_g, δ_g)-DP if for all k, AccessControl^k_{ϵ_g, δ_g} enforces:

\[
\Big( \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \epsilon_i(v_{<i}) \Big) \le \epsilon_g
\quad \text{and} \quad
\Big( \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \delta_i(v_{<i}) \Big) \le \delta_g .
\]

The implication of the theorem is that under the access control scheme described in §3.2, Sage achieves event-level (ϵ_g, δ_g)-DP over the sensitive data stream. In §A.3 we further analyze strong composition for block-level accounting.
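In practice, the access control condition of Theorem 4.3 amounts to per-block bookkeeping of the adaptively chosen budgets. The following minimal Python sketch illustrates that bookkeeping under basic composition; the class and method names are ours and the snippet is only an illustration of the rule, not Sage's implementation:

```python
from collections import defaultdict

class BlockAccessControl:
    """Toy per-block accountant enforcing the condition of Theorem 4.3
    (basic composition): for every block k, the sum of the epsilons (and
    deltas) of the queries that used k must stay below the global budget."""

    def __init__(self, eps_g: float, delta_g: float):
        self.eps_g, self.delta_g = eps_g, delta_g
        self.spent = defaultdict(lambda: [0.0, 0.0])  # block id -> [eps, delta] spent

    def request(self, blocks, eps, delta) -> bool:
        # Grant the query only if every requested block stays within budget.
        ok = all(self.spent[b][0] + eps <= self.eps_g and
                 self.spent[b][1] + delta <= self.delta_g for b in blocks)
        if ok:
            for b in blocks:
                self.spent[b][0] += eps
                self.spent[b][1] += delta
        return ok

# Two pipelines training on overlapping subsets of stream blocks:
ac = BlockAccessControl(eps_g=1.0, delta_g=1e-6)
print(ac.request(blocks=[0, 1], eps=0.6, delta=1e-7))  # True
print(ac.request(blocks=[1, 2], eps=0.6, delta=1e-7))  # False: block 1 would exceed eps_g
print(ac.request(blocks=[2, 3], eps=0.6, delta=1e-7))  # True: blocks 2 and 3 are still fresh
```

A block whose budget is exhausted is simply retired, while newly arrived blocks can continue to serve incoming pipelines.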

4.4 Defining Blocks by User ID and Other Attributes

Block composition theory can be extended to accommodate user-level privacy and other use cases. The theory shows that one can split a static dataset (Theorem 4.2) or a data stream (Theorem 4.3) into disjoint blocks, and run DP queries adaptively on overlapping subsets of the blocks while accounting for privacy at the block level. The theorems are focused on time splits, but the same theorems can be written for splits based on any attribute whose possible values can be made public, such as geography, demographics, or user IDs. Consider a workload on a static dataset in which queries combine data from diverse and overlapping subsets of countries, e.g., they compute average salary in each country separately, but also at the level of continents and GDP-based country groups. For such a workload, block composition gives tighter privacy accounting across these queries than traditional composition, though the system will still run out of privacy budget eventually because no new blocks appear in the static database.

As another example, splitting a stream by user ID enables querying or ignoring all observations from a given user, adding support for user-level privacy. Splitting data over user ID requires extra care. If existing user IDs are not known, each query might select user IDs that do not exist yet, spending their DP budget without adding data. However, making user IDs public can leak information. One approach is to use incrementing user IDs (with this fact public), and periodically run a DP query computing the maximum user ID in use. This would ensure DP, while giving an estimate of the range of user IDs that can be queried. In such a setting, block composition enables fine-grained DP accounting over queries on any subset of the users. While our block theory supports this use case, it suffers from a major practical challenge: new blocks are now created only when new users join the system, so new users must be added at a high rate relative to the model release rate to avoid running out of budget. This is unlikely to happen for mature companies, but may be possible for emerging startups or hospitals, where the stream of incoming users/patients may be high enough to sustain modest workloads.
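As an illustration of the incrementing-ID approach, here is a toy sketch of the periodic DP max-user-ID query (the function is hypothetical, not part of Sage; with sequential IDs the maximum changes by at most one when a user is added or removed, so the Laplace mechanism applies with sensitivity 1):

```python
import numpy as np

def dp_max_user_id(current_max_id: int, epsilon: float, seed=None) -> float:
    """Release an (epsilon, 0)-DP estimate of the largest user ID in use.
    Assumes publicly-known incrementing IDs, so one user more or less moves
    the maximum by at most 1 (sensitivity 1)."""
    rng = np.random.default_rng(seed)
    return current_max_id + rng.laplace(scale=1.0 / epsilon)

# Queries can then target user-ID blocks up to (roughly) this DP estimate.
print(dp_max_user_id(current_max_id=120_000, epsilon=0.1))
```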


Tab. 1. Experimental Training Pipelines.

Taxi Regression Task — pipelines and configuration:
  LinearRegression (LR):
    DP Alg.: AdaSSP from [53], (ϵ, δ)-DP
    Config.: regularization param ρ: 0.1
    Budgets: (ϵ, δ) ∈ {(1.0, 10⁻⁶), (0.05, 10⁻⁶)}
    Targets: MSE ∈ [2.4 × 10⁻³, 7 × 10⁻³]
  NeuralNetwork (NN):
    DP Alg.: DP SGD from [2], (ϵ, δ)-DP; ReLU, 2 hidden layers (5000/100 nodes)
    Config.: learning rate: 0.01, epochs: 3, batch: 1024, momentum: 0.9
    Budgets: (ϵ, δ) ∈ {(1.0, 10⁻⁶), (0.5, 10⁻⁶)}
    Targets: MSE ∈ [2 × 10⁻³, 7 × 10⁻³]
  Avg. Speed ×3*:
    Targets: absolute error ∈ {1, 5, 7.5, 10, 15} km/h

Criteo Classification Task — pipelines and configuration:
  LogisticRegression (LG):
    DP Alg.: DP SGD from [36], (ϵ, δ)-DP
    Config.: learning rate: 0.1, epochs: 3, batch: 512
    Budgets: (ϵ, δ) ∈ {(1.0, 10⁻⁶), (0.25, 10⁻⁶)}
    Targets: accuracy ∈ [0.74, 0.78]
  NeuralNetwork (NN):
    DP Alg.: DP SGD from [36], (ϵ, δ)-DP; ReLU, 2 hidden layers (1024/32 nodes)
    Config.: learning rate: 0.01, epochs: 5, batch: 1024
    Budgets: (ϵ, δ) ∈ {(1.0, 10⁻⁶), (0.25, 10⁻⁶)}
    Targets: accuracy ∈ [0.74, 0.78]
  Counts ×26**:
    Targets: absolute error ∈ {0.01, 0.05, 0.10}

*Three time granularities: hour of day, day of week, week of month. **Histogram of each categorical feature.

5 Evaluation

We ask four questions: (Q1) Does DP impact Training Pipeline reliability? (Q2) Does privacy-adaptive training increase DP Training Pipeline reliability? (Q3) Does block composition help over traditional composition? (Q4) How do ML workloads perform under Sage's (ϵ_g, δ_g)-DP regime?

Methodology. We consider two datasets: 37M samples from three months of NYC taxi rides [42] and 45M ad impressions from Criteo [1]. On the Taxi dataset we define a regression task to predict the duration of each ride using 61 binary features derived from 10 contextual features. We implement pipelines for a linear regression (LR), a neural network (NN), and three statistics (average speeds at three time granularities). On the Criteo dataset we formulate a binary classification task predicting ad clicks from 13 numeric and 26 categorical features. We implement a logistic regression (LG), a neural network (NN), and histogram pipelines. Tab. 1 shows details.

Training: We make each pipeline DP using known algorithms, shown in Tab. 1. Validation: We use the loss, accuracy, and absolute-error SLAed validators on the regression, classification, and statistics pipelines, respectively. Experiments: Each model is assigned a quality target from a range of possible values, chosen between the best achievable model and the performance of a naïve model (predicting the label mean on Taxi, with MSE 0.0069, and the most common label on Criteo, with accuracy 74.3%). Most of the evaluation uses privacy-adaptive training, so privacy budgets are chosen by Sage, with an upper bound of ϵ = 1. While no consensus exists on what a reasonable DP budget is, this value is in line with practical prior work [2, 37]. Where DP budgets must be fixed, we use the values indicated in Tab. 1, which correspond to a large budget (ϵ = 1) and a small budget that varies across tasks and models. Other defaults: 90%::10% train::test ratio; η = 0.05; δ = 10⁻⁶. Comparisons: We compare Sage's performance to existing DP composition approaches described in §3.2. We ignore the first alternative, which violates the endless execution requirement R3 and cannot support ML workloads. We compare with the second and third alternatives, which we call query composition and streaming composition, respectively.

Tab. 2. Target Violation Rate of ACCEPTed Models. Violations are across all models separately trained with privacy-adaptive training.

  Dataset   η     No SLA   NP SLA   UC DP SLA   Sage SLA
  Taxi      0.01  0.379    0.0019   0.0172      0.0027
  Taxi      0.05  0.379    0.0034   0.0224      0.0051
  Criteo    0.01  0.2515   0.0052   0.0544      0.0018
  Criteo    0.05  0.2515   0.0065   0.0556      0.0023

5.1 Unreliability of DP Training Pipelines in TFX (Q1)

We first evaluate DP's impact on model training. Fig. 5 shows the loss or accuracy of each model when trained on increasing amounts of data and evaluated on 100K held-out samples from their respective datasets. Three versions are shown for each model: the non-DP version (NP), a large DP budget version (ϵ = 1), and a small DP budget configuration with ϵ values that vary across the model and task. For both tasks, the NN requires the most data but outperforms the linear model in the private and non-private settings. The DP LRs catch up to the non-DP version with the full dataset, but the other models, trained with SGD, require more data. Thus, model quality is impacted by DP, but the impact diminishes with more training data. This motivates privacy-adaptive training.

To evaluate DP's impact on validation, we train and validate our models for both tasks, with and without DP. We use TFX's vanilla validators, which compare the model's performance on a test set to the quality metric (MSE for Taxi, accuracy for Criteo). We then re-evaluate the models' quality metrics on a separate, 100K-sample held-out set and measure the fraction of models accepted by TFX that violate their targets on the re-evaluation set. With non-DP pipelines (non-DP training and validation), the false acceptance rate is 5.1% and 8.2% for the Taxi and Criteo tasks, respectively. With DP pipelines (DP training, DP validation), false acceptance rates hike to 37.9% and 25.2%, motivating SLAed validation.

5.2 Reliability of DP Training Pipelines in Sage (Q2)

Sage’s privacy-adaptive training and SLAed validation aredesigned to add reliability to DP model training and valida-tion. However, they may come at a cost of increased datarequirements over a non-DP test. We evaluate reliability andsample complexity for the SLAed validation ACCEPT test.

Fig. 5. Impacts on TFX Training Pipelines. Impact of DP on the overall performance of training pipelines. Panels (a) Taxi LR MSE and (b) Taxi NN MSE show the MSE loss on the Taxi regression task (lower is better); panels (c) Criteo LG Accuracy and (d) Criteo NN Accuracy show the accuracy on the Criteo classification task (higher is better). X-axis: training samples; curves: non-private (NP) and DP versions at large and small ϵ; dotted lines: naïve model performance.

Fig. 6. Number of Samples Required to ACCEPT models at achievable quality targets. Panels: (a) Taxi LR ACCEPT, (b) Taxi NN ACCEPT (MSE targets: smaller targets are harder to achieve and require more samples); (c) Criteo LG ACCEPT, (d) Criteo NN ACCEPT (accuracy targets: higher targets are harder and require more samples). Curves: No SLA, NP SLA, UC DP SLA, Sage SLA.

Tab. 2 shows the fraction of ACCEPTed models that violate their quality targets when re-evaluated on the 100K-sample held-out set. For two confidences η, we show: (1) No SLA, the vanilla TFX validation with no statistical rigor, but where a model's quality is computed with DP. (2) NP SLA, a non-DP but statistically rigorous validation; this is the best we can achieve with statistical confidence. (3) UC DP SLA, a DP SLAed validation without the correction for DP impact. (4) Sage SLA, our DP SLAed validator, with correction. We make three observations. First, the NP SLA violation rates are much lower than the configured η values because we use conservative statistical tests. Second, Sage's DP-corrected validation accepts models with violation rates close to the NP SLA: slightly higher for the loss SLA and slightly lower for the accuracy SLA, but well below the configured error rates. Third, removing the correction increases the violation rate by 5x for the loss SLA and 20x for the accuracy SLA, violating the confidence thresholds in both cases, at least for low η. These results confirm that Sage's SLAed validation is reliable, and that correction for DP is critical to this reliability.

The increased reliability of SLAed validation comes at a cost: SLAed validation requires more data than a non-DP test. This new data is supplied by Sage's privacy-adaptive training. Fig. 6a and 6b show the amount of train+test data required to ACCEPT a model under various loss targets for the Taxi regression task; Fig. 6c and 6d show the same for accuracy targets on the Criteo classification task. We make three observations. First, unsurprisingly, non-rigorous validation (No SLA) requires the least data but has a high failure rate because it erroneously accepts models on small sample sizes. Second, the best models accepted by Sage's SLAed validation are close to the best models accepted by No SLA. We observe the largest difference on Taxi LR, where No SLA accepts MSE targets of 0.0025 while the Sage SLA accepts targets only as low as 0.0027. The best achievable model is thus only slightly impacted by DP, although more data is required. Third, adding a statistical guarantee but no privacy to the validation (NP SLA) already substantially increases sample complexity; adding DP to the statistical guarantee and applying the DP correction incurs limited additional overhead. The distinction between Sage and NP SLA is barely visible for all but the Taxi LR. For Taxi LR, adding DP accounts for half of the increase over No SLA, requiring twice as much data (one data growth step in privacy-adaptive training). Thus, privacy-adaptive training increases the reliability of DP training pipelines for a reasonable increase in sample complexity.

5.3 Benefit of Block Composition (Q3)

Block composition lets us combine multiple blocks into a dataset, such that each DP query runs over all used blocks with only one noise draw. Without block composition, a DP query is split into multiple queries, each operating on a single block and receiving independent noise; the combined results are more noisy.

Fig. 7. Block-level vs. Query-level Accounting. Block-level accounting provides benefits to model quality and validation. Panels: (a) Taxi LR MSE, (b) Taxi LR ACCEPT, (c) Taxi DNN MSE, (d) Taxi DNN ACCEPT; curves compare Block Comp. against Query Comp. with 100K/500K blocks (LR) and 5M blocks (NN).

Fig. 7a and 7c show the model quality of the LR and NN models on the Taxi dataset when operating on blocks of different sizes: 100K and 500K for the LR, and 5M for the NN. Fig. 7b and 7d show the SLAed validation sample complexity of the same models. We compare these configurations against Sage's combined-block training, which allows ML training and validation to operate on their full relevance windows. We can see that block composition helps both the training and validation stages. While LR training (Fig. 7a) performs nearly identically for Sage and block sizes of 100K or 500K (6h of data to a bit more than a day), validation is significantly impacted. The LR cannot be validated to any MSE better than 0.003 with block sizes of 500K, and 0.0044 for blocks of size 100K. Additionally, those targets that can be validated require significantly more data without Sage's block composition: 10x for blocks of size 500K, and almost 100x for blocks of 100K. The NN is more affected at training time. With blocks smaller than 1M points, it cannot even be trained. Even with an enormous block size of 5M, more than ten days of data (Fig. 7c), the partitioned model performs 8% worse than with Sage's block composition. Although validation itself is not much affected on such large blocks, the worse performance means that models can only be validated down to an MSE target of 0.0025 (against Sage's 0.0023) and require twice as much data as with block composition.

5.4 Multi-pipeline Workload Performance (Q4)

Last is an end-to-end evaluation of Sage with a workload consisting of a data stream and ML pipelines arriving over discrete time steps. At each step, a number of new data points corresponding approximately to 1 hour of data arrives (16K for Taxi, 267K for Criteo). The time between new pipelines is drawn from a Gamma distribution. When a new pipeline arrives, its sample complexity (the number of data points required to reach the target) is drawn from a power law distribution, and a pipeline with the relevant sample complexity is chosen uniformly among our configurations and targets (Tab. 1). Under this workload, we compare the average model release time in steady state for four different strategies. The first two leverage Query Composition and Streaming Composition from prior work, as explained in the methodology and §3.3. The other two take advantage of Sage's Block Composition. Both strategies uniformly divide the privacy budget of new blocks among all incomplete pipelines, but differ in how each pipeline uses its budget. Block/Aggressive uses as much privacy budget as is available when a pipeline is invoked. Block/Conserve (Sage) uses the privacy-adaptive training strategy defined in §3.3.

Fig. 8. Average Model Release Time Under Load. Panels: (a) Taxi dataset, (b) Criteo dataset. X-axis: pipeline arrival rate; Y-axis: average model release time; curves: Streaming Composition, Query Composition, Block/Aggressive, Block/Conserve (Sage).

Fig. 8 shows each strategy's average model release time under increasing load (higher model arrival rate), as the system enforces (ϵ_g, δ_g) = (1.0, 10⁻⁶)-DP over the entire stream. We make two observations. First, Sage's block composition is crucial. Query Composition and Streaming Composition quickly degrade to off-the-charts release times: supporting more than one model every two hours is not possible and yields release times above three days. On the other hand, strategies leveraging Sage's block composition both provide lower release times and can support up to 0.7 model arrivals per hour (more than 15 new models per day) while releasing them within a day. Second, we observe consistently lower release times under the privacy-budget-conserving strategy. At higher rates, such as 0.7 new models per hour, the difference starts to grow: Block/Conserve has a release time 4x and 2x smaller than Block/Aggressive on Taxi (Fig. 8a) and Criteo (Fig. 8b), respectively. Privacy budget conservation reduces the amount of budget consumed by an individual pipeline, thus allowing new pipelines to use the remaining budget when they arrive.

6 Related Work

Sage's main contribution – block composition – is related to DP composition theory. Basic [17] and strong [21, 28] composition theorems give the DP guarantee for multiple queries with adaptively chosen computation. McSherry [39] and Zhang et al. [56] show that non-adaptive queries over non-overlapping subsets of data can share the DP budget. Rogers et al. [45] analyze composition under adaptive DP parameters, which is crucial to our block composition. These works all consider fixed datasets and query-level accounting.

Compared to all these works, our main contribution is to formalize the new block-level DP interaction model, which supports ML workloads on growing databases while enforcing a global DP semantic without running out of budget. This model sits between traditional DP interaction with static data and streaming DP working only on current data. In proving our interaction model DP, we leverage prior theoretical results and analysis methods. However, the most comprehensive prior interaction model [45] did not support all our requirements, such as interactions with adaptively chosen data subsets, or future data being impacted by previous queries.


Streaming DP [9, 16, 18, 19] extends DP to data streams but is restrictive for ML. Data is consumed once and assumed to never be used again. This enables stronger guarantees, as data need not even be kept internally. However, training ML models often requires multiple passes over the data.

Cummings et al. [12] consider DP over growing databases. They focus on theoretical analysis and study two setups. In the first setup, they also run DP workloads on exponentially growing data sizes. However, their approach only supports linear queries, with a runtime exponential in the data dimension, and is hence impractical. In the second setup, they focus on training a single convex ML model and show that it can use new data to keep improving. Supporting ML workloads would require splitting the privacy budget for the whole stream among models, re-creating the challenge of running out of privacy budget.

A few DP systems exist, but none focuses on streams or ML. PINQ [39] and its generalization wPINQ [43] give a SQL-like interface to perform DP queries. They introduce the partition operator allowing parallel composition, which resembles Sage's block composition. However, this operator only supports non-adaptive parallel computations on non-overlapping partitions, which is insufficient for ML. Airavat [46] provides a MapReduce interface and supports a strong threat model against actively malicious developers. It adopts a perspective similar to ours, integrating DP with access control. GUPT [40] supports automatic privacy budget allocation and lets programmers specify accuracy targets for arbitrary DP programs with a real-valued output; it is hence applicable to computing summary statistics but not to training ML models. All these works focus on static datasets and adopt a generic, query-level accounting approach that applies to any workload. Query-level accounting would force them to run out of privacy budget even when unused data is available. Block-level accounting avoids this limitation but applies to workloads with specific data interaction characteristics (§3.2).

7 Summary and Future Work

As companies disseminate ML models trained over sensitive data to untrusted domains, it is crucial to start controlling data leakage through these models. We presented Sage, the first ML platform that enforces a global DP guarantee across all models released from sensitive data streams. Its main contributions are its block-level accounting, which permits endless operation on streams, and its privacy-adaptive training, which lets developers control DP model quality. The key enabler of both techniques is our systems focus on ML training workloads rather than DP ML's typical focus on individual training algorithms. While individual algorithms see either a static dataset or an online training regime, workloads interact with growing databases. Across executions of multiple algorithms, new data becomes available (helping to renew privacy budgets and allow endless operation) and old data is reused (allowing training of models on increasingly large datasets to lessen the effect of DP noise on model quality).

We believe that this systems perspective on DP ML presents other opportunities worth pursuing in the future. One of them is how to allocate the multiple resources in a DP ML system: data, privacy budgets, and compute resources. Sage proposes a specific heuristic for allocating the first two resources (§3.3) but leaves unexplored tradeoffs between data and compute resources. To conserve budgets, we use as much data as is available in the database when a model is invoked, with the lowest privacy budget. This gives the best utilization of the privacy resource, but training on more data consumes more compute resources. Identifying principled approaches to perform these allocations is an open problem that systems researchers are uniquely positioned to address, given the rich resource allocation literature developed by this community.

A key limitation of this work is the focus on event-level privacy, a semantic that is insufficient when groups of correlated observations can reveal sensitive information. The best known example of such correlation happens when a user contributes multiple observations, but other examples include repeated measurements of a phenomenon over time, or users and their friends on a social network. In such cases, observations are all correlated and can reveal sensitive information, such as a user's demographic attributes, despite event-level DP. To increase protection, an exciting area of future work is to add support for and evaluate user-level privacy. Our block accounting theory is amenable to this semantic (§4.4), but finding settings where the semantic can be sustained without running out of budget is an open challenge.

8 Acknowledgements

We thank our shepherd, Thomas Ristenpart, and the anonymous reviewers for the valuable comments. This work was funded through NSF CNS-1351089, CNS-1514437, and CCF-1740833, two Sloan Faculty Fellowships, a Microsoft Faculty Fellowship, a Google Ph.D. Fellowship, and funds from Schmidt Futures and the Columbia Data Science Institute.

References

[1] Kaggle display advertising challenge dataset. https://www.kaggle.com/c/criteo-display-ad-challenge, 2014.
[2] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proc. of the ACM Conference on Computer and Communications Security (CCS), 2016.
[3] M. Backes, P. Berrang, M. Humbert, and P. Manoharan. Membership privacy in microRNA-based studies. In Proc. of the ACM Conference on Computer and Communications Security (CCS), 2016.
[4] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In Proc. of the ACM SIGMOD International Conference on Management of Data, 2007.
[5] D. Baylor, E. Breck, H.-T. Cheng, N. Fiedel, C. Y. Foo, Z. Haque, S. Haykal, M. Ispir, V. Jain, L. Koc, C. Y. Koo, L. Lew, C. Mewald, A. N. Modi, N. Polyzotis, S. Ramesh, S. Roy, S. E. Whang, M. Wicke, J. Wilkiewicz, X. Zhang, and M. Zinkevich. TFX: A TensorFlow-based production-scale machine learning platform. In Proc. of the International Conference on Knowledge Discovery and Data Mining (KDD), 2017.
[6] K. Boyd, E. Lantz, and D. Page. Differential privacy for classifier evaluation. In Proc. of the ACM Workshop on Artificial Intelligence and Security, 2015.
[7] J. A. Calandrino, A. Kilzer, A. Narayanan, E. W. Felten, and V. Shmatikov. "You Might Also Like:" Privacy risks of collaborative filtering. In Proc. of IEEE Symposium on Security and Privacy (S&P), 2011.
[8] N. Carlini, C. Liu, U. Erlingsson, J. Kos, and D. Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. arXiv:1802.08232, 2018.
[9] T.-H. H. Chan, E. Shi, and D. Song. Private and continual release of statistics. ACM Transactions on Information Systems Security, 2011.
[10] K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In Proc. of the Conference on Neural Information Processing Systems (NeurIPS), 2008.
[11] K. Chaudhuri, A. D. Sarwate, and K. Sinha. A near-optimal algorithm for differentially-private principal components. Journal of Machine Learning Research (JMLR), 14, 2013.
[12] R. Cummings, S. Krehbiel, K. A. Lai, and U. Tantipongpipat. Differential privacy for growing databases. In Proc. of the Conference on Neural Information Processing Systems (NeurIPS), 2018.
[13] I. Dinur and K. Nissim. Revealing information while preserving privacy. In Proc. of the International Conference on Principles of Database Systems (PODS), 2003.
[14] J. Duchi and R. Rogers. Lower bounds for locally private estimation via communication complexity. arXiv:1902.00582, 2019.
[15] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 2018.
[16] C. Dwork. Differential privacy in new settings. In Proc. of the ACM Symposium on Discrete Algorithms (SODA), 2010.
[17] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proc. of the Conference on Theory of Cryptography (TCC), 2006.
[18] C. Dwork, M. Naor, T. Pitassi, and S. Yekhanin. Pan-private streaming algorithms. In Proc. of the Symposium on Innovations in Computer Science, 2010.
[19] C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum. Differential privacy under continual observation. In Proc. of the ACM Symposium on Theory of Computing (STOC), 2010.
[20] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 2014.
[21] C. Dwork, G. N. Rothblum, and S. Vadhan. Boosting and differential privacy. In Proc. of the IEEE Symposium on Foundations of Computer Science (FOCS), 2010.
[22] C. Dwork, A. Smith, T. Steinke, and J. Ullman. Exposed! A survey of attacks on private data. Annual Review of Statistics and Its Application, 2017.
[23] C. Dwork, A. Smith, T. Steinke, J. Ullman, and S. Vadhan. Robust traceability from trace amounts. In Proc. of the IEEE Symposium on Foundations of Computer Science (FOCS), 2015.
[24] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang. Applied machine learning at Facebook: A datacenter infrastructure perspective. In Proc. of the International Symposium on High-Performance Computer Architecture (HPCA), 2018.
[25] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 1963.
[26] N. Homer, S. Szelinger, M. Redman, D. Duggan, W. Tembe, J. Muehling, J. V. Pearson, D. A. Stephan, S. F. Nelson, and D. W. Craig. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics, 2008.
[27] B. Jayaraman and D. Evans. Evaluating differentially private machine learning in practice. In Proc. of USENIX Security, 2019.
[28] P. Kairouz, S. Oh, and P. Viswanath. The composition theorem for differential privacy. In International Conference on Machine Learning (ICML), 2015.
[29] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? SIAM Journal on Computing, 2011.
[30] S. P. Kasiviswanathan and A. Smith. On the 'semantics' of differential privacy: A Bayesian formulation. Journal of Privacy and Confidentiality, 2014.
[31] D. Kifer, A. Smith, and A. Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Proc. of the ACM Conference on Learning Theory (COLT), 2012.
[32] N. Leonard and C. M. Halasz. Twitter meets TensorFlow. https://blog.twitter.com/engineering/en_us/topics/insights/2018/twittertensorflow.html, 2018.
[33] L. E. Li, E. Chen, J. Hermann, P. Zhang, and L. Wang. Scaling machine learning as a service. In Proc. of the International Conference on Predictive Applications and APIs, 2017.
[34] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. 2009.
[35] H. B. McMahan and G. Andrew. A general approach to adding differential privacy to iterative training procedures. arXiv:1812.06210, 2018.
[36] H. B. McMahan and G. Andrew. A general approach to adding differential privacy to iterative training procedures. arXiv:1812.06210, 2018.
[37] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private recurrent language models. In Proc. of the International Conference on Learning Representations (ICLR), 2018.
[38] F. McSherry and I. Mironov. Differentially private recommender systems: Building privacy into the Netflix prize contenders. In Proc. of the International Conference on Knowledge Discovery and Data Mining (KDD), 2009.
[39] F. D. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In Proc. of the ACM SIGMOD International Conference on Management of Data, 2009.
[40] P. Mohan, A. Thakurta, E. Shi, D. Song, and D. Culler. GUPT: Privacy preserving data analysis made easy. In Proc. of the ACM SIGMOD International Conference on Management of Data, 2012.
[41] V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft. Privacy-preserving ridge regression on hundreds of millions of records. In Proc. of IEEE Symposium on Security and Privacy (S&P), 2013.
[42] NYC Taxi & Limousine Commission - trip record data. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml, 2018.
[43] D. Proserpio, S. Goldberg, and F. McSherry. Calibrating data to sensitivity in private data analysis: a platform for differentially-private analysis of weighted datasets. Proc. of the International Conference on Very Large Data Bases (VLDB), 2014.
[44] S. Ravi. On-device machine intelligence. https://ai.googleblog.com/2017/02/on-device-machine-intelligence.html, 2017.
[45] R. M. Rogers, A. Roth, J. Ullman, and S. Vadhan. Privacy odometers and filters: Pay-as-you-go composition. In Proc. of the Conference on Neural Information Processing Systems (NeurIPS), 2016.
[46] I. Roy, S. T. Setty, A. Kilzer, V. Shmatikov, and E. Witchel. Airavat: Security and privacy for MapReduce. In Proc. of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.
[47] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Appendix B. Cambridge University Press, New York, NY, USA, 2014.
[48] D. Shiebler and A. Tayal. Making machine learning easy with embeddings. SysML, http://www.sysml.cc/doc/115.pdf, 2010.
[49] R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In Proc. of IEEE Symposium on Security and Privacy (S&P), 2017.
[50] A. Smith and A. Thakurta. Differentially private model selection via stability arguments and the robustness of lasso. Journal of Machine Learning Research, 2013.
[51] K. Talwar, A. Thakurta, and L. Zhang. Nearly-optimal private LASSO. In Proc. of the Conference on Neural Information Processing Systems (NeurIPS), 2015.
[52] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart. Stealing machine learning models via prediction APIs. In Proc. of USENIX Security, 2016.
[53] Y.-X. Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. arXiv:1803.02596, 2018.
[54] J. Xu, Z. Zhang, X. Xiao, Y. Yang, G. Yu, and M. Winslett. Differentially private histogram publication. In Proc. of the IEEE International Conference on Data Engineering (ICDE), 2012.
[55] L. Yu, L. Liu, C. Pu, M. E. Gursoy, and S. Truex. Differentially private model publishing for deep learning. In Proc. of IEEE Symposium on Security and Privacy (S&P), 2019.
[56] D. Zhang, R. McKenna, I. Kotsogiannis, M. Hay, A. Machanavajjhala, and G. Miklau. Ektelo: A framework for defining differentially-private computations. In Proc. of the ACM SIGMOD International Conference on Management of Data, 2018.
[57] J. Zhang, Z. Zhang, X. Xiao, Y. Yang, and M. Winslett. Functional mechanism: Regression analysis under differential privacy. In Proc. of the International Conference on Very Large Data Bases (VLDB), 2012.

A Block Composition

This section makes several clarifications and gives more precise statements for the block composition theory in §4. It then ports prior strong composition results to our block accounting model, both for fixed and adaptive choices of blocks and DP parameters.

A.1 Clarifications and Precisions

Neighboring Datasets. We measure the distance between datasets using the symmetric difference: viewing a dataset as a multiset, two datasets D and D′ are neighboring if their disjunctive union (the elements that are in one of the sets but not in their intersection) has size at most one. We write |D ⊕ D′| for the distance between D and D′. This definition is not the most standard: most DP work uses the Hamming distance, which counts the number of records one must change to go from D to D′. Intuitively, under the symmetric difference an attacker can add or remove a record in the dataset. Changing a record corresponds to a symmetric difference of size 2, but a Hamming distance of 1. Changing a record can still be supported under the symmetric difference using group privacy [20]: the definition is thus slightly weaker but as general.

We chose to use the symmetric difference following PINQ [39], as this definition is better suited to the analysis of DP composition over splits of the dataset (our blocks): changing one record can affect two blocks (e.g., if its timestamp is changed), while adding or removing a record affects only one. An illustrative check of this neighboring relation is sketched below.
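For concreteness, here is a small illustrative helper (not part of Sage) that checks the neighboring relation on multisets:

```python
from collections import Counter

def symmetric_difference_size(d0, d1):
    """Size of the multiset symmetric difference |D0 ⊕ D1| between two
    datasets given as lists of hashable records."""
    c0, c1 = Counter(d0), Counter(d1)
    return sum(abs(c0[r] - c1[r]) for r in set(c0) | set(c1))

def are_neighbors(d0, d1):
    # Neighboring under the symmetric-difference definition: the adversary
    # may add or remove at most one record.
    return symmetric_difference_size(d0, d1) <= 1

# Adding one record gives distance 1; changing a record gives distance 2.
print(are_neighbors(["a", "b"], ["a", "b", "c"]))  # True
print(are_neighbors(["a", "b"], ["a", "c"]))       # False (distance 2)
```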

Neighboring Streams. This notion of neighboring datasets extends directly to streams of data [9, 16, 18, 19]. Two streams D and D′ indexed by t are neighboring if there exists an index T such that for t < T the streams are identical (i.e., |D_{t<T} ⊕ D′_{t<T}| = 0), and the streams from T onward form neighboring datasets (i.e., |D_{t≥T} ⊕ D′_{t≥T}| ≤ 1). This is equivalent to our Algorithm (4b), where the data, though unknown, is fixed in advance with only one record changed between the two streams.

This definition, however, is restrictive, because a change in a stream's data will typically affect future data. This is especially true in an ML context, where a record changed in the stream will change the targeting or recommendation models that are trained on this data, which in turn will impact the data collected in the future. Because of this, D and D′ will probably differ in a large number of observations following the adversary's change of one record. Interactions described in Algorithm (4c) model these dependencies. We show in Theorem 4.3 that if the data change impacts future data only through DP results (e.g., the targeting and recommendation models are DP) and through mechanisms outside of the adversary's control (our "world" variable W), composition results are not affected.

Privacy Loss Semantics. Recall that bounding the privacy loss (Definition 4.1) with high probability implies DP [30]: if with probability (1 − δ) over draws from v ∼ V^0 (or v ∼ V^1) we have |Loss(v)| ≤ ϵ, then the interaction generating V^b is (ϵ, δ)-DP.

In this paper, we implicitly treated DP and bounded privacy loss as equivalent, by declaring the Q_i as (ϵ, δ)-DP but proving composition results using a bound on Q_i's privacy loss. This is not exactly true in general, as (ϵ, δ)-DP implies a bound on privacy loss only with weaker parameters, namely that with probability at least (1 − 2δ/(ϵ e^ϵ)) the loss is bounded by 2ϵ. In practice, this small difference is not crucial, as the typical Laplace and Gaussian DP mechanisms (and those we use in Sage) do have the same (ϵ, δ)-DP parameters for their bounds on privacy loss [20]: the Laplace mechanism for (ϵ, 0)-DP implies that the privacy loss is bounded by ϵ, and achieving (ϵ, δ)-DP with the Gaussian mechanism implies that the privacy loss is bounded by ϵ with probability at least (1 − δ).

A.2 Proof for Basic Composition

The body of the paper omits the proof of basic composition to save space. Here we re-state the result and spell out the proof:

Theorem (Theorem 4.3: Basic Sage Block Composition). AdaptiveStreamBlockCompose(A, b, r, ϵ_g, δ_g, W) is (ϵ_g, δ_g)-DP if for all k, AccessControl^k_{ϵ_g, δ_g} enforces:

\[
\Big( \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \epsilon_i(v_{<i}) \Big) \le \epsilon_g
\quad \text{and} \quad
\Big( \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \delta_i(v_{<i}) \Big) \le \delta_g .
\]

Proof. Denote by l_i the highest block index that existed when query i was run, and by D^b_{≤l_i} the data blocks that existed at that time. Recall that v_{<i} denotes the results from all queries released previous to i. Query i depends on both v_{<i} and D^b_{≤l_i}; the latter is a random variable that is fully determined by v_{<i}. Hence, the privacy loss for Alg. (4c) is:

\[
\mathrm{Loss}(v)
= \ln\Big( \prod_{i=1}^{r} \frac{P(V_i^0 = v_i \mid v_{<i}, D^0_{\le l_i})}{P(V_i^1 = v_i \mid v_{<i}, D^1_{\le l_i})} \Big)
= \ln\Big( \prod_{i=1}^{r} \frac{P(V_i^0 = v_i \mid v_{<i})}{P(V_i^1 = v_i \mid v_{<i})} \Big).
\]

After applying Theorem 4.2, we must show that for all k,

\[
\Big| \ln \Big( \prod_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \frac{P(V_i^0 = v_i \mid v_{<i})}{P(V_i^1 = v_i \mid v_{<i})} \Big) \Big| \le \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \epsilon_i
\quad \text{with probability} \ge \Big(1 - \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \delta_i \Big).
\]

The justification follows:

\[
\Big| \ln \Big( \prod_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \frac{P(V_i^0 = v_i \mid v_{<i})}{P(V_i^1 = v_i \mid v_{<i})} \Big) \Big|
\le \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \Big| \ln \Big( \frac{P(V_i^0 = v_i \mid v_{<i})}{P(V_i^1 = v_i \mid v_{<i})} \Big) \Big| .
\]

Since Q_i is (ϵ_i(v_{<i}), δ_i(v_{<i}))-DP, we have |ln(P(V_i^0 = v_i | v_{<i}) / P(V_i^1 = v_i | v_{<i}))| ≤ ϵ_i(v_{<i}) with probability ≥ 1 − δ_i(v_{<i}). Applying a union bound over all queries for which k ∈ blocks_i concludes the proof. □

A.3 Strong Composition with Block-Level Accounting

We now prove strong composition results for Algorithms (4b) and (4c).

Fixed Blocks and DP Parameters: We now show how to use advanced composition results (e.g., [21]) in the context of block composition. This approach requires that both the blocks used by each query and the DP parameters be fixed in advance, and corresponds to Algorithm (4b).

Theorem A.1 (Strong Block Composition – Fixed DP Parameters). BlockCompose(A, b, r, (ϵ_i, δ_i)_{i=1}^r, (blocks_i)_{i=1}^r) is (ϵ_g, δ_g)-DP, with:

\[
\epsilon_g = \max_k \Bigg( \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} (e^{\epsilon_i} - 1)\epsilon_i + \sqrt{ \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} 2\epsilon_i^2 \log\Big(\frac{1}{\tilde\delta}\Big) } \Bigg),
\qquad
\delta_g = \tilde\delta + \max_k \Bigg( \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \delta_i \Bigg).
\]

Proof. After applying Theorem 4.2, what remains to be shown is that for all k,

\[
\Big| \ln \Big( \prod_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \frac{P(V_i^0 = v_i \mid v_{<i})}{P(V_i^1 = v_i \mid v_{<i})} \Big) \Big|
\le \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} (e^{\epsilon_i} - 1)\epsilon_i + \sqrt{ \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} 2\epsilon_i^2 \log\Big(\frac{1}{\tilde\delta}\Big) } ,
\]

with probability at least \( \big(1 - \tilde\delta - \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \delta_i \big) \). Using the fact that Q_i's privacy loss is bounded by ϵ_i with probability at least (1 − δ_i), we know that there exist events E_i and E'_i with joint probability at least (1 − δ_i) such that for all v_i, \( \big| \ln \big( \frac{P(V_i^0 = v_i \mid E_i)}{P(V_i^1 = v_i \mid E'_i)} \big) \big| \le \epsilon_i \). We can now condition the analysis on the E_i and E'_i, and use Theorem 3.20 of [20] to get that, with probability at least (1 − \tilde\delta), the conditioned privacy loss is at most ϵ_k, where

\[
\epsilon_k = \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} (e^{\epsilon_i} - 1)\epsilon_i + \sqrt{ \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} 2\epsilon_i^2 \log\Big(\frac{1}{\tilde\delta}\Big) } .
\]

A union bound over the E_i and E'_i for all {i : k ∈ blocks_i} completes the proof. □

The proof directly extends to the stream setting (the highlighted parts of Alg. (4b)) in the same way as in the proof of Theorem 4.3.

Adaptive Blocks and DP Parameters: Recall that with either adaptive blocks or DP parameters (or both), the DP parameters (ϵ_i(v_{<i}), δ_i(v_{<i})) depend on history. Traditional composition theorems do not apply to this setting. For basic composition, the DP parameters still "sum" under composition. However, as shown in [45], strong composition yields a different formula: while the privacy loss still scales as the square root of the number of queries, the constant is worse than with parameters fixed in advance.

Theorem A.2 (Strong Adaptive Stream Block Composition). AdaptiveStreamBlockCompose(A, b, r, ϵ_g, δ_g, W) is (ϵ_g, δ_g)-DP if:

\[
\max_k \Bigg( \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \frac{(e^{\epsilon_i}-1)\epsilon_i}{2}
+ \sqrt{ 2\Big( \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \epsilon_i^2 + \frac{\epsilon_g^2}{28.04 \log(1/\tilde\delta)} \Big) }
\cdot \sqrt{ \Big(1 + \frac{1}{2}\log\Big( \frac{28.04 \log(1/\tilde\delta) \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \epsilon_i^2}{\epsilon_g^2} + 1 \Big)\Big) \log\Big(\frac{1}{\tilde\delta}\Big) } \Bigg)
\le \epsilon_g ,
\]
\[
\tilde\delta + \max_k \Bigg( \sum_{\substack{i=1 \\ k \in \mathrm{blocks}_i}}^{r} \delta_i \Bigg) \le \delta_g .
\]

Proof. Similarly to the proof of Theorem 4.3, we apply Theorem 4.2, and bound the privacy loss of any block k using Theorem 5.1 of [45]. □
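To make these bounds concrete, the following Python sketch evaluates the per-block ϵ of Theorem A.1 (the fixed-parameter case); the function name is ours and the snippet is illustrative only, not part of Sage:

```python
import math

def strong_block_epsilon(per_block_epsilons, delta_tilde):
    """Strong composition bound of Theorem A.1 for one block: given the
    epsilons of the queries that used this block, return the epsilon the
    block has accumulated. The corresponding delta_g is delta_tilde plus
    the sum of the per-query deltas."""
    s1 = sum((math.exp(e) - 1.0) * e for e in per_block_epsilons)
    s2 = math.sqrt(2.0 * sum(e * e for e in per_block_epsilons)
                   * math.log(1.0 / delta_tilde))
    return s1 + s2

# Example: a block touched by 100 queries with epsilon = 0.01 each.
eps_strong = strong_block_epsilon([0.01] * 100, delta_tilde=1e-9)
eps_basic = sum([0.01] * 100)
print(round(eps_strong, 3), eps_basic)  # ~0.65 vs. 1.0: strong composition wins
```

As the example suggests, the square-root term dominates, so the advantage over the basic sum appears when a block is reused by many small-budget queries.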

B Validation Tests

Sage has built-in validators for three classes of metrics. §3.3 describes the high-level approach and the specific functionality and properties of the loss-based validator. This section details all three validators and proves their statistical and DP guarantees.

B.1 SLAed Validator for Loss Metrics

Denote by l(f, x, y) a loss function with range [0, B] measuring the quality of a prediction f(x) against label y (lower is better), and by τ_loss a target loss on the data distribution D.

ACCEPT Test: Given the DP-trained model f^dp, we want to release f^dp only if L_D(f^dp) ≜ E_{(x,y)∼D} l(f^dp, x, y) ≤ τ_loss. The test works as follows. First, compute a DP estimate of the number of samples in the test set, corrected for the impact of DP noise to be a lower bound on the true value with probability (1 − η/3) (Lines 11-13 List. 2):

\[
n^{dp}_{te} = n_{te} + \mathrm{Laplace}\Big(\frac{2}{\epsilon}\Big) - \frac{2}{\epsilon}\ln\Big(\frac{3}{2\eta}\Big).
\]

Then, compute a DP estimate of the loss, corrected for DP impact (Lines 14-17 List. 2) to be an upper bound on the true value L_{te}(f^dp) ≜ (1/n_{te}) Σ_{te} l(f^dp, x, y) with probability (1 − η/3):

\[
L^{dp}_{te}(f^{dp}) = \frac{1}{n^{dp}_{te}} \Big( \sum_{te} l(f^{dp}, x, y) + \mathrm{Laplace}\Big(\frac{2B}{\epsilon}\Big) + \frac{2B}{\epsilon}\ln\Big(\frac{3}{2\eta}\Big) \Big).
\]

Lines 18-20 show that Sage will ACCEPT when:

\[
L^{dp}_{te}(f^{dp}) + \sqrt{\frac{2B\,L^{dp}_{te}(f^{dp}) \ln(3/\eta)}{n^{dp}_{te}}} + \frac{4B\ln(3/\eta)}{n^{dp}_{te}} \le \tau_{loss}.
\]

This test gives the following guarantee:

Proposition B.1 (Loss ACCEPT Test (same as Proposition 3.1)). With probability at least (1 − η), the ACCEPT test returns true only if L_D(f^dp) ≤ τ_loss.

Proof. The corrections for DP noise imply that P(n^{dp}_{te} > n_{te}) ≤ η/3 and P(L^{dp}_{te}(f^{dp}) < L_{te}(f^{dp})) ≤ η/3 (i.e., the DP count is a lower bound and the DP loss an upper bound, each with probability at least 1 − η/3). Define

\[
UB^{dp} ≜ L^{dp}_{te}(f^{dp}) + \sqrt{\frac{2B\,L^{dp}_{te}(f^{dp}) \ln(3/\eta)}{n^{dp}_{te}}} + \frac{4B\ln(3/\eta)}{n^{dp}_{te}},
\qquad
UB ≜ L_{te}(f^{dp}) + \sqrt{\frac{2B\,L_{te}(f^{dp}) \ln(3/\eta)}{n_{te}}} + \frac{4B\ln(3/\eta)}{n_{te}}.
\]

Applying Bernstein's inequality [47] yields P(L_D(f^{dp}) > UB) ≤ η/3. A union bound over these three inequalities gives that with probability at least (1 − η), L_D(f^{dp}) ≤ UB ≤ UB^{dp}. The test ACCEPTs when UB^{dp} ≤ τ_loss. □

This test uses Bernstein's concentration inequality to compute an upper bound for the loss over the entire distribution [47], which gives good bounds if the loss is small. If instead one expects the variance to be small, one can use empirical Bernstein bounds [34] as a drop-in replacement. Otherwise, one can fall back to Hoeffding's inequality [25].
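For concreteness, here is a compact Python sketch of the ACCEPT computation above; function and argument names are ours and the code illustrates the formulas rather than reproducing Sage's implementation:

```python
import math
import numpy as np

def accept_loss_model(losses, epsilon, eta, B, tau_loss, seed=None):
    """Sketch of the loss ACCEPT test: `losses` are per-example losses
    l(f_dp, x, y) in [0, B] on the held-out test split; the test is
    (epsilon, 0)-DP and errs with probability at most eta."""
    rng = np.random.default_rng(seed)
    n_te = len(losses)

    # DP count of test samples, shifted down to be a lower bound w.p. 1 - eta/3.
    n_dp = (n_te + rng.laplace(scale=2.0 / epsilon)
            - (2.0 / epsilon) * math.log(3.0 / (2.0 * eta)))
    if n_dp <= 0:
        return False  # cannot certify anything with so little test data

    # DP average loss, shifted up to be an upper bound w.p. 1 - eta/3.
    loss_dp = (sum(losses) + rng.laplace(scale=2.0 * B / epsilon)
               + (2.0 * B / epsilon) * math.log(3.0 / (2.0 * eta))) / n_dp
    loss_dp = max(loss_dp, 0.0)  # guard against an unlucky negative noise draw

    # Bernstein-style upper bound on the true (distribution) loss, w.p. 1 - eta/3.
    ub = (loss_dp
          + math.sqrt(2.0 * B * loss_dp * math.log(3.0 / eta) / n_dp)
          + 4.0 * B * math.log(3.0 / eta) / n_dp)
    return ub <= tau_loss
```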

REJECT Test: The REJECT test terminates privacy-adaptive training of a model when no model of the considered class F can hope to achieve the desired τ_loss performance. Denoting the best possible model on the data distribution f⋆ ≜ argmin_{f∈F} L_D(f), we want to reject when L_D(f⋆) > τ_loss. To do this, we consider the best model in F on the training set, f̂ ≜ argmin_{f∈F} L_tr(f), and proceed as follows. First, we compute DP upper and lower bounds for n_tr, holding together with probability at least (1 − η/3) (Lines 24-27 List. 2):

\[
n^{dp}_{tr} = n_{tr} + \mathrm{Laplace}\Big(\frac{2}{\epsilon}\Big),
\qquad
\underline{n}^{dp}_{tr} = n^{dp}_{tr} - \frac{2}{\epsilon}\ln\Big(\frac{3}{\eta}\Big),
\qquad
\overline{n}^{dp}_{tr} = n^{dp}_{tr} + \frac{2}{\epsilon}\ln\Big(\frac{3}{\eta}\Big).
\]

Then, we compute a DP estimate of the loss, corrected for DP impact (Lines 14-17 List. 2) to be a lower bound on the true value with probability (1 − η/3):

\[
L^{dp}_{tr}(\hat f) = \frac{1}{\overline{n}^{dp}_{tr}} \Big( \sum_{tr} l(\hat f, x, y) + \mathrm{Laplace}\Big(\frac{2B}{\epsilon}\Big) - \frac{2B}{\epsilon}\ln\Big(\frac{3}{2\eta}\Big) \Big).
\]

Finally, Sage will REJECT when:

\[
L^{dp}_{tr}(\hat f) - B\sqrt{\frac{\log(3/\eta)}{\underline{n}^{dp}_{tr}}} > \tau_{loss}.
\]

This test gives the following guarantee:

Proposition B.2 (Loss REJECT Test). With probability at least (1 − η), f^dp (or more accurately the class F) is rejected only if L_D(f⋆) > τ_loss.

Proof. By definition, L_tr(f̂) ≤ L_tr(f⋆). Applying Hoeffding's inequality [25] to f⋆ gives P(L_D(f⋆) < L_tr(f⋆) − B√(log(3/η)/n_tr)) ≤ η/3. Similarly to the proof of Proposition B.1, applying a union bound over this inequality and the DP corrections gives that with probability at least (1 − η),

\[
L_D(f^\star) \ge L_{tr}(f^\star) - B\sqrt{\frac{\log(3/\eta)}{n_{tr}}} \ge L^{dp}_{tr}(\hat f) - B\sqrt{\frac{\log(3/\eta)}{\underline{n}^{dp}_{tr}}} > \tau_{loss},
\]

concluding the proof. □

Here we leverage Hoeffding's inequality [25] because we need to bound L_D(f⋆): since we do not know f⋆, we cannot compute its variance or loss on the training set to use (empirical) Bernstein bounds.

DP Guarantee: We now prove that the ACCEPT and REJECT tests are each (ϵ, 0)-DP. Since they each use a disjoint split of the dataset (training and testing), running both tests is (ϵ, 0)-DP over the entire dataset.

Proposition B.3 (DP ACCEPT). ACCEPT is (ϵ, 0)-DP.

Proof. The only data interactions of ACCEPT are the computations of n^{dp}_{te} and L^{dp}_{te}(f^{dp}). The former has sensitivity 1 and is made ϵ/2-DP with the Laplace mechanism. The latter uses data only in the inner sum, which has sensitivity B and is made ϵ/2-DP with the Laplace mechanism. By basic composition, the test is (ϵ, 0)-DP. □

The proof for REJECT is a bit more complicated, as part of the procedure involves computing f̂, which is not DP.

Proposition B.4 (DP REJECT). REJECT is (ϵ, 0)-DP.

Proof. To apply the same argument as for Prop. B.3, we need to show that computing L_tr(f̂), which includes computing f̂, has sensitivity B. Recall that f̂ minimizes the training loss: f̂_tr = argmin_{f∈F} Σ_{tr} l(f, x, y). If we add a data point d, then f̂_tr has a loss of at worst B on the new point. Because the best model with the new data point, f̂_{tr+d}, is at least as good as f̂_tr (otherwise it wouldn't be the argmin model), L_{tr+d}(f̂_{tr+d}) ≤ L_tr(f̂_tr) + B. Similarly, f̂_{tr+d} cannot be better than f̂_tr on the original training set, so L_{tr+d}(f̂_{tr+d}) ≥ L_tr(f̂_tr). Hence the sensitivity of computing L_tr(f̂) is at most B. □


We highlight that REJECT relies on computing f̂ for both its statistical and DP guarantees. However, f̂ may not always be available: it can be computed for convex problems, for instance, but not for NNs.

B.2 SLAed Validator for Accuracy

The accuracy metric applies to classification problems, where a model predicts one out of many classes. The target τ_acc is the proportion of correct predictions to reach on the data distribution. The accuracy validator follows the same logic as the loss validator, with two small changes. First, the upper and lower bounds are reversed, as losses are minimized while accuracy is maximized. Second, the result of classification predictions follows a binomial distribution, which yields tighter confidence intervals. Denote by 1{predicate} the indicator function with value 1 if the predicate is true and 0 otherwise, and by \(\overline{\mathrm{Bin}}(k, n, \eta)\) and \(\underline{\mathrm{Bin}}(k, n, \eta)\) the upper and lower bounds on the probability parameter p of a binomial distribution such that k successes out of n independent draws happen with probability at least η (both can be conservatively approximated using a Clopper-Pearson interval).

We compute k^{dp}_{te} = Σ_{n_te} 1{f(x) = y} + Laplace(2/ϵ) and n^{dp}_{te} = n_{te} + Laplace(2/ϵ). Similarly, for the training set tr, k^{dp}_{tr} = Σ_{n_tr} 1{f̂(x) = y} + Laplace(2/ϵ) and n^{dp}_{tr} = n_{tr} + Laplace(2/ϵ). Sage will ACCEPT when:

\[
\underline{\mathrm{Bin}}\Big(k^{dp}_{te} - \frac{2}{\epsilon}\ln\Big(\frac{3}{\eta}\Big),\; n^{dp}_{te} + \frac{2}{\epsilon}\ln\Big(\frac{3}{\eta}\Big),\; \frac{\eta}{3}\Big) \ge \tau_{acc}.
\]

And REJECT when:

\[
\overline{\mathrm{Bin}}\Big(k^{dp}_{tr} + \frac{2}{\epsilon}\ln\Big(\frac{3}{\eta}\Big),\; n^{dp}_{tr} - \frac{2}{\epsilon}\ln\Big(\frac{3}{\eta}\Big),\; \frac{\eta}{3}\Big) < \tau_{acc}.
\]

Using the same reasoning as before, we can prove the equivalents of the four Propositions (B.1, B.2, B.3, B.4) for the accuracy validator. However, finding f̂ for accuracy is computationally hard.
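As an illustration, the accuracy ACCEPT check can be sketched as follows, approximating the \(\underline{\mathrm{Bin}}\) lower bound with a one-sided Clopper-Pearson interval via scipy; the names are ours and the snippet is illustrative only, not Sage's code:

```python
import math
import numpy as np
from scipy.stats import beta

def dp_accuracy_lower_bound(correct, n_te, epsilon, eta, seed=None):
    """Return a DP lower bound on the model's accuracy over the distribution,
    following the accuracy ACCEPT formula above."""
    rng = np.random.default_rng(seed)

    # DP counts of correct predictions and of test samples, each eps/2-DP,
    # then shifted to pessimistic values (fewer successes, more trials).
    k_dp = correct + rng.laplace(scale=2.0 / epsilon) - (2.0 / epsilon) * math.log(3.0 / eta)
    n_dp = n_te + rng.laplace(scale=2.0 / epsilon) + (2.0 / epsilon) * math.log(3.0 / eta)
    if n_dp <= 0:
        return 0.0
    k_dp = max(0.0, min(k_dp, n_dp))
    if k_dp == 0.0:
        return 0.0

    # One-sided Clopper-Pearson lower bound at confidence 1 - eta/3.
    return beta.ppf(eta / 3.0, k_dp, n_dp - k_dp + 1.0)

# ACCEPT the model if the certified lower bound reaches the target accuracy:
# dp_accuracy_lower_bound(correct=76_000, n_te=100_000, epsilon=1.0, eta=0.05) >= 0.74
```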

B.3 SLAed Validator for Sum-based Statistics

This validator applies to sum-based statistics (e.g., mean, variance). The target is defined as the maximum size of the absolute (additive) error τ_err for these summary statistics on the data distribution. This error can be computed directly on the training set, so there is no need for a test set. Moreover, because of the law of large numbers, we can always reach the target precision, so there is no rejection test. We next show the test using Hoeffding's inequality [25] (if we expect the variance to be small, empirical Bernstein bounds [34] are better and directly applicable).

Compute:

\[
n^{dp}_{tr} = n_{tr} + \mathrm{Laplace}\Big(\frac{2}{\epsilon}\Big) - \frac{2}{\epsilon}\ln\Big(\frac{2}{\eta}\Big).
\]

Sage ACCEPTs if:

\[
\frac{1}{n^{dp}_{tr}} \cdot \frac{2}{\epsilon}\ln\Big(\frac{2}{\eta}\Big) + B\sqrt{\frac{\ln(2/\eta)}{n^{dp}_{tr}}} \le \tau_{err},
\]

in which case the absolute error is below τ_err with probability at least (1 − η), accounting for the statistical error as well as the impact of DP noise on both the sum-based summary statistic and the ACCEPT test. Once again, the same reasoning allows us to prove propositions equivalent to B.1 and B.3.

C Data Cleaning

The NYC taxi data has a lot of outliers. We filtered the dataset based on the following criteria: prices outside of the range [$0, $1000]; durations outside of [0, 2.5] hours; malformed date strings; and points falling outside of the box bounded by (40.923, −74.27) in the northwest and (40.4, −73.65) in the southeast, which encompasses all of New York City and much of the surrounding metropolitan area. Filtering points is acceptable within differential privacy but requires that we also account for privacy loss on the filtered points, which we do. However, Sage does not address the data exploration phase that would be required for more complex data filtering. Exploration is outside of Sage's scope, and may be better addressed by systems such as PINQ [39].
