

Chorus: Differential Privacy via Query Rewriting

Noah Johnson
Oasis Labs
University of California, Berkeley

Joseph P. Near
University of Vermont*

Joseph M. Hellerstein
University of California, Berkeley

Dawn Song
Oasis Labs
University of California, Berkeley

Abstract

We present CHORUS, a system with a novel architecture for providing differential privacy for statistical SQL queries. The key to our approach is to embed a differential privacy mechanism into the query before execution so the query automatically enforces differential privacy on its output. CHORUS is compatible with any SQL database that supports standard math functions, requires no user modifications to the database or queries, and simultaneously supports many differential privacy mechanisms. To the best of our knowledge, no existing system provides these capabilities.

We demonstrate our approach using four general-purpose differential privacy mechanisms. In the first evaluation of its kind, we use CHORUS to evaluate these four mechanisms on real-world queries and data. The results demonstrate that our approach supports 93.9% of statistical queries in our corpus, integrates with a production DBMS without any modifications, and scales to hundreds of millions of records.

CHORUS is currently being deployed at Uber for its internal analytics tasks. CHORUS represents a significant part of the company's GDPR compliance efforts, and can provide both differential privacy and access control enforcement. In this capacity, CHORUS processes more than 10,000 queries per day.

1 Introduction

Organizations are collecting more and more sensitive information about individuals. As this data is highly valuable for a broad range of business interests, organizations are motivated to provide analysts with flexible access to the data. At the same time, the public is increasingly concerned about privacy protection. There is a growing and urgent need for technology solutions that balance these interests by supporting general-purpose analytics while guaranteeing privacy protection.

* Work done while at the University of California, Berkeley.

Differential privacy [16, 22] is widely recognized by experts as the most rigorous theoretical solution to this problem. Differential privacy provides a formal guarantee of privacy for individuals while allowing general statistical analysis of the data. In short, it states that the presence or absence of any single individual's data should not have a large effect on the results of a query. This allows precise answers to questions about populations in the data while guaranteeing the results reveal little about any individual. Unlike alternative approaches such as anonymization and k-anonymity, differential privacy protects against a wide range of attacks, including attacks using auxiliary information [53, 42, 44, 15].

Current research on differential privacy focuses on development of new algorithms, called mechanisms, to achieve differential privacy for a particular class of queries. Researchers have developed dozens of mechanisms covering a broad range of use cases, from general-purpose statistical queries [19, 43, 38, 39, 46, 40, 10] to special-purpose analytics tasks such as graph analysis [29, 49, 32, 33, 12], range queries [28, 35, 36, 37, 56, 54, 13, 47, 8, 55], and analysis of data streams [20, 50]. Each mechanism works well for specific tasks and not as well, or not at all, on other tasks.

Despite extensive academic research and an abundant supply of mechanisms, differential privacy has not been widely adopted in practice. Existing applications of differential privacy in practice are limited to specialized use cases such as web browsing statistics [24] and keyboard and emoji use [2].

There are two major challenges for practical adoption of differential privacy. The first is seamless integration into real-world data environments. These environments include highly customized data architectures and industrial-grade database engines carefully tuned for performance and reliability. Previous differential privacy systems require changes to the data pipeline [31] or replacement of the database with a custom engine [40, 46, 39] and hence do not integrate easily into these environments.

The second challenge is simultaneously supporting different mechanisms. Current evidence suggests that there is no single "best mechanism" that performs optimally for all queries. Rather, the best mechanism depends on both the query and the dataset. In fact, as demonstrated by Hay et al. [30], the best mechanism can also vary with the size of the dataset even for a single query. A practical solution must provide flexibility for mechanism selection and easy integration of new mechanisms. Previous differential privacy systems [31, 40, 46] implement only a single mechanism.

The CHORUS system. This paper describes CHORUS, a system with a novel architecture for providing differential privacy for statistical SQL queries. The key insight of our approach is to embed the differential privacy mechanism into the SQL query before execution, so the query automatically enforces differential privacy on its own output. We define a SQL query with this property as an intrinsically private query. CHORUS automatically converts untrusted input queries into intrinsically private queries. This approach enables a new architecture in which queries are rewritten, then submitted to an unmodified database management system (DBMS).

This new architecture addresses the two major challenges outlined above. First, CHORUS is compatible with any SQL database that supports standard math functions (rand, ln, etc.), thus avoiding the need for a custom runtime or modifications to the database engine. By using a standard SQL database engine instead of a custom runtime, CHORUS can leverage the reliability, scalability and performance of modern databases, which are built on decades of research and engineering experience.

Second, CHORUS enables the modular implementation and adoption of many different mechanisms, supporting a significantly higher percentage of queries than any single mechanism and providing increased flexibility for both general and specialized use cases. CHORUS automatically selects a mechanism for each input query based on an extensible set of selection rules.

To the best of our knowledge, no existing system provides these capabilities. CHORUS also protects against an untrusted analyst: even if the submitted query is malicious, our transformation rules ensure that the executed query returns only differentially private results. The results can thus be returned directly to the analyst without post-processing. This enables easy integration into existing data environments via a single pre-processing step.

We demonstrate the CHORUS approach with four existing general-purpose differential privacy mechanisms: Elastic Sensitivity [31], Restricted Sensitivity [10], Weighted PINQ [19] and Sample & Aggregate [40]. These mechanisms support a range of SQL features and analytic tasks. CHORUS contains query transformation rules for each mechanism which convert untrusted (non-intrinsically private) queries into intrinsically private queries. We also describe how additional mechanisms can be added to CHORUS.

Deployment. CHORUS is currently being deployed at Uber for its internal analytics tasks. CHORUS represents a significant part of the company's General Data Protection Regulation (GDPR) [4] compliance efforts, and provides both differential privacy and access control enforcement. We have made CHORUS available as open source [3] to enable additional deployments elsewhere.

Evaluation. In the first evaluation of its kind, we use CHORUS to evaluate the utility and performance of the selected mechanisms on real data. Our dataset includes 18,774 real queries written by analysts at Uber.

Contributions. In summary, we make the following contributions:

1. We present CHORUS, representing a novel architecture for enforcing differential privacy on SQL queries that simultaneously supports a wide variety of mechanisms and runs on any standard SQL database (§ 3).

2. We describe and formalize the novel use of rule-based query rewriting to automatically transform an untrusted SQL query into an intrinsically private query using four example general-purpose mechanisms. We describe how other mechanisms can be supported using the same approach (§ 4, 5).

3. We report on our experience deploying CHORUS to enforce differential privacy at Uber, where it processes more than 10,000 queries per day (§ 6).

4. We use CHORUS to conduct the first large-scale empirical evaluation of the utility and performance of multiple general-purpose differential privacy mechanisms on real queries and data (§ 7).

2 Background

Differential privacy provides a formal guarantee of indistinguishability. This guarantee is defined in terms of a privacy budget ε—the smaller the budget, the stronger the guarantee. The formal definition of differential privacy is written in terms of the distance d(x, y) between two databases, i.e. the number of entries on which they differ: d(x, y) = |{i : x_i ≠ y_i}|. Two databases x and y are neighbors if d(x, y) = 1. A randomized mechanism K : D^n → R preserves (ε, δ)-differential privacy if for any pair of neighboring databases x, y ∈ D^n and set S of possible outputs:

Pr[K(x) ∈ S] ≤ e^ε Pr[K(y) ∈ S] + δ

Mechanism                    | General-Purpose | Strengths               | Supported Constructs       | Sensitivity Measure | Algorithmic Requirements
Elastic Sensitivity [31]     | ✓               | General analytics       | Counting queries w/ joins  | Local               | Laplace noise
Restricted Sensitivity [10]  | ✓               | Graph analysis          | Counting queries w/ joins  | Global              | Laplace noise
WPINQ [46]                   | ✓               | Synthetic data gen.     | Counting queries w/ joins  | Global              | Weighted dataset operations, Laplace noise
Sample & Aggregate [51]      | ✓               | Statistical estimators  | Single-table aggregations  | Local               | Subsampling, Widened Winsorized mean, Laplace noise

Table 1: Differential privacy mechanisms

Differential privacy can be enforced by adding noise to the non-private results of a query. The scale of this noise depends on the sensitivity of the query. The literature considers two different measures of sensitivity: global [19] and local [43]. For more on differential privacy, see Dwork and Roth [22].
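For reference, the standard instance of this idea is the Laplace mechanism (textbook statement, not notation specific to this paper): for a numeric query f with sensitivity s, releasing

K(x) = f(x) + Lap(s/ε), where Lap(b) has density (1/2b) · e^(−|z|/b),

preserves ε-differential privacy. Section 4.1 uses exactly this form.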

Statistical queries. Differential privacy aims to protect the privacy of individuals in the context of statistical queries. In SQL terms, these are queries using standard aggregation operators (COUNT, AVG, etc.) as well as histograms created via the GROUP BY operator in which aggregations are applied to records within each group. Differential privacy is not suitable for queries that return raw data (e.g. rows in the database) since such queries are inherently privacy-violating. We formalize the targeted class of queries in Section 5.1. In Section 5.5 we discuss how our approach supports histogram queries, which require special care to avoid leaking information via the presence or absence of groups.
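For concreteness, a typical query in this class—a histogram over the trips table used in later examples—looks like the following (the alias trip_count is illustrative):

-- A statistical (histogram) query: one aggregate value per group, no raw rows.
SELECT city_id, COUNT(*) AS trip_count
FROM trips
GROUP BY city_id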

Mechanism design. Research on differential privacy has produced a large and growing number of differential privacy mechanisms. Some mechanisms are designed to provide broad support for many types of queries [19, 43, 38, 39, 46, 40, 10], while others are designed to produce maximal utility for a particular application [29, 49, 32, 33, 12, 28, 35, 37, 56, 54, 13, 47, 8, 55, 24, 30].

While mechanisms adopt unique strategies for enforcing differential privacy in their target domain, they generally share a common set of design choices and algorithmic components. For example, many mechanisms require addition of Laplace noise to the result of the query.

Our approach is motivated by the observation that a wide range of distinct mechanisms can be supported with a common set of algorithmic building blocks. In this paper we formalize several example building blocks via transformation rules that describe how each algorithmic component can be embedded within a SQL query. We demonstrate the flexibility of this design by showing that each mechanism can be implemented simply by composing these transformation rules according to the mechanism's definition.

General-purpose mechanisms. The four mechanisms in Table 1 are general-purpose because they support a broad range of queries, including commonly used SQL constructs such as join. This paper focuses on these four mechanisms. Many more specialized mechanisms have substantially similar algorithmic requirements and can be supported as intrinsically private queries using variations of the transformation rules introduced in this paper. Section 7.5 discusses this subject in more detail.

3 The CHORUS Architecture

This section presents the system architecture and advantages of CHORUS, and compares it against existing architectures for differentially private analytics. We first describe the design goals motivating the CHORUS architecture. Then, in Section 3.1, we describe the limitations of existing architectures preventing previous work from attaining these goals. Finally, Section 3.2 describes the novel architecture of CHORUS and provides an overview of our approach.

Design Goals. The design of CHORUS is motivated by the desire to enforce differential privacy for general-purpose analytics in a practical setting. To that end, CHORUS has the following design goals:

1. Usability for non-experts
2. Support for a wide variety of analytics queries
3. Easy integration with existing data environments

As we will demonstrate in the next section, achieving these goals is challenging, and no existing system manages to achieve all three. To achieve the first goal, a system should work with standard query languages (e.g. SQL). To achieve the second goal, a system should be able to leverage the many differential privacy mechanisms listed in Table 1, and should select a mechanism for a given query automatically. To achieve the third goal, a system should be deployable in the context of an existing database engine.

3.1 Existing Architectures

Existing systems for enforcing differential privacy for data analytics tasks adopt one of two architecture types: they are either deeply integrated systems or post processing systems. These architectures are summarized in Figure 1(a) and Figure 1(b). PINQ [39], Weighted PINQ [46], GUPT [40], and Airavat [48] follow the deep integration architecture: each one provides its own specialized DBMS, and cannot be used with a standard DBMS.

Elastic sensitivity [31] uses the post processing architecture, in which the original query is run on the database and noise is added to the final result. This approach supports mechanisms that do not modify the semantics of the original query (PINQ and Restricted sensitivity [10] could also be implemented this way), and has the advantage that it is compatible with existing DBMSs. However, the post processing architecture is fundamentally incompatible with mechanisms that change how the original query executes—including WPINQ and Sample & Aggregate, listed in Table 1.

The deeply integrated and post processing architectures in Figure 1(a) and (b) therefore both fail to address two major challenges in implementing a practical system for differentially private data analytics:

• Custom DBMSs are unlikely to achieve parity with mature DBMSs for a wide range of features including rich SQL support, broad query optimization, high-performance transaction support, recoverability, scalability and distribution.

• Neither architecture supports the simultaneous application of a large number of different mechanisms. The deeply integrated architecture requires building a new DBMS for each mechanism, while the post processing architecture is inherently incompatible with some mechanisms.

3.2 The CHORUS Architecture

In CHORUS, we propose a novel alternative to the deeply integrated and post processing architectures used by existing systems. As shown in Figure 1(c), CHORUS transforms the input query into an intrinsically private query, which is a standard SQL query whose results are guaranteed to be differentially private.

An intrinsically private query provides this guarantee by embedding a differential privacy mechanism in the query itself. When executed on an unmodified SQL database, the embedded privacy mechanism ensures that the query's results preserve differential privacy. The approach has three key advantages over previous work:

• CHORUS is DBMS-independent (unlike the deeply integrated approach): it requires neither modifying the database nor switching to purpose-built database engines. Our approach can therefore leverage existing high-performance DBMSs to scale to big data.

• CHORUS can implement a wide variety of privacy-preserving techniques. Unlike the post processing approach, CHORUS is compatible with all of the mechanisms listed in Table 1, and many more.

• CHORUS eliminates the need for post-processing, allowing easy integration in existing data processing pipelines. Our approach enables a single data processing pipeline for all mechanisms.

By adopting this novel architecture, CHORUS achieves all three design goals listed earlier. Since input queries are considered untrusted and the rewriting engine uses program analysis techniques, CHORUS preserves differential privacy even in the face of malicious analysts.

CHORUS's architecture is specifically designed to be easily integrated into existing data environments. We report on the deployment of CHORUS at Uber in Section 6.

Constructing intrinsically private queries. The primary challenge in implementing this architecture is transforming untrusted queries into intrinsically private queries. This process must be flexible enough to support a wide variety of privacy mechanisms and also general enough to support ad-hoc SQL queries.

As described earlier, constructing intrinsically private queries automatically has additional advantages. This approach protects against malicious analysts by guaranteeing differentially private results by construction. It is also transparent to the analyst since it does not require input from the analyst to preserve privacy or select a privacy mechanism.

As shown in Figure 1(c), CHORUS constructs intrinsically private queries in two steps:

1. The Mechanism Selection component automatically selects an appropriate differential privacy mechanism to apply.

2. The Query Rewriting component embeds the selected mechanism in the input query, transforming it into an intrinsically private query.

To select a mechanism, CHORUS leverages an extensible mechanism selection engine, discussed next. The query rewriting step then transforms the input query to produce an intrinsically private query embedding the selected mechanism. For this step, we employ a novel use of rule-based query rewriting, which has been studied extensively for query optimization but, to our knowledge, never applied to differential privacy. We introduce our solution by example in Section 4 and formalize it in Section 5. This paper focuses on differential privacy, but the same approach could be used to enforce other types of privacy guarantees or security policies [52].

[Figure 1: Existing architectures (a) and (b), and the architecture of CHORUS (c). (a) Deep integration—the mechanism is built into a specialized database engine; pros: broad mechanism support; cons: higher software complexity, new system required for each mechanism. (b) Post processing—a sensitivity mechanism with output perturbation applied to sensitive results from the original database; pros: DBMS-independent; cons: supports only post-processing mechanisms. (c) CHORUS—Mechanism Selection (driven by selection rules) and Query Rewriting produce an intrinsically private query that runs on the original database and returns differentially private results; pros: DBMS-independent, broad mechanism support.]

Mechanism selection. CHORUS implements an extensible mechanism selection engine that automatically selects a differential privacy mechanism for each input query. This engine can be extended based on available mechanisms, performance and utility goals, and to support custom mechanism selection techniques. For example, Hay et al. [34] demonstrate that a machine learning-based approach can leverage properties of the data to select a mechanism most likely to yield high utility. CHORUS is designed to support any such approach. We describe an example of a syntax-based mechanism selection in Section 6.

Privacy budget management. CHORUS does not prescribe a specific privacy budget management strategy, as the best way to manage the available privacy budget in practice will depend on the deployment scenario and threat model. CHORUS provides flexibility in how the budget is managed: the sole requirement is that rewriters are supplied with the ε value apportioned to each query.¹

¹ For simplicity we consider approaches where CHORUS stores the budget directly. However, our query rewriting approach could allow the DBMS to assist with budget accounting, for example by storing ε values in a separate table and referencing and updating the values within the rewritten query.

For the case of a single global budget, where CHORUS is deployed as the sole interface to the database, CHORUS can track the remaining budget according to the standard composition theorem for differential privacy [19]. When a new query is submitted, CHORUS subtracts from the remaining budget the ε value allocated to that query, and refuses to process new queries when the budget is exhausted. In Section 7.5, we discuss more sophisticated methods that may yield better results for typical deployments.
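As a minimal sketch of this bookkeeping—and of the table-based variant mentioned in the footnote above—the SQL below tracks a single global budget. The table and column names (privacy_budget, dataset, remaining) and the ε value 0.1 are hypothetical, not part of CHORUS:

-- Hypothetical bookkeeping table; names and values are illustrative only.
CREATE TABLE privacy_budget (
  dataset   VARCHAR(64) PRIMARY KEY,
  remaining DOUBLE PRECISION NOT NULL
);

-- Before rewriting a query allocated epsilon = 0.1, check and deduct in one statement.
-- If no row is updated, the budget is exhausted and the query is refused.
UPDATE privacy_budget
SET remaining = remaining - 0.1
WHERE dataset = 'trips'
  AND remaining >= 0.1;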

4 Query Rewriting

This section demonstrates our approach by example, using the four general-purpose differential privacy mechanisms listed in Table 1. For each mechanism, we briefly review the algorithm used. Then, we describe, using simple example queries, how an input SQL query can be systematically transformed into an intrinsically private query embedding that algorithm, and give an argument for the correctness of each transformation.

4.1 Sensitivity-Based Mechanisms

We first consider two mechanisms that add noise to the final result of the query: Elastic Sensitivity [31] and Restricted Sensitivity [10]. Elastic Sensitivity is a bound on the local sensitivity of a query, while Restricted Sensitivity is based on global sensitivity. Both approaches add Laplace noise to the query's result.

For a query with sensitivity s returning value v, the Laplace mechanism releases v + Lap(s/ε), where ε is the privacy budget allocated to the query. Given a random variable U sampled from the uniform distribution on (−1/2, 1/2), we can compute the value of a random variable X ∼ Lap(s/ε):

X = −(s/ε) · sign(U) · ln(1 − 2|U|)
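As a quick sanity check (our own derivation, not from the paper): with U uniform on (−1/2, 1/2), |U| is uniform on (0, 1/2), so for any t ≥ 0,

Pr[|X| ≥ t] = Pr[ln(1 − 2|U|) ≤ −εt/s] = Pr[|U| ≥ (1 − e^(−εt/s))/2] = e^(−εt/s),

which is exactly the tail of Lap(s/ε); the sign of X is symmetric because the sign of U is.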

In SQL, we can sample from the uniform distribution using RANDOM(). Consider the following query, which returns a (non-differentially private) count of trips in the database. This query can be transformed into an intrinsically private query as follows:

SELECT COUNT(*) AS count FROM trips

⇓

WITH orig AS (SELECT COUNT(*) AS count FROM trips),
     uniform AS (SELECT *, RANDOM() - 0.5 AS u FROM orig)

⇓

WITH orig AS (SELECT COUNT(*) AS count FROM trips),
     uniform AS (SELECT *, RANDOM() - 0.5 AS u FROM orig)
SELECT count - (s/ε) * SIGN(u) * LN(1 - 2 * ABS(u)) AS count
FROM uniform

The first step defines U using RANDOM(), and the second uses U to compute the corresponding Laplace noise.

The correctness of this approach follows from the definition of the Laplace mechanism. The two transformation steps combined clearly generate Laplace noise with the correct scale, and add it to the sensitive query results.
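For concreteness, plugging in illustrative values—a sensitivity of s = 2 and a per-query budget of ε = 0.1, so s/ε = 20 (these numbers are ours, not from the paper)—yields the final intrinsically private query:

-- Final rewritten query with s = 2 and epsilon = 0.1 (s/epsilon = 20).
WITH orig AS (SELECT COUNT(*) AS count FROM trips),
     uniform AS (SELECT *, RANDOM() - 0.5 AS u FROM orig)
SELECT count - 20 * SIGN(u) * LN(1 - 2 * ABS(u)) AS count
FROM uniform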

CHORUS can calculate either Elastic Sensitivity with smoothing via smooth sensitivity [43]² or Restricted Sensitivity via a dataflow analysis of the query. Such an analysis is described in [31].

We formalize the construction of intrinsically private queries via sensitivity-based approaches using the Laplace Noise transformation, defined in Section 5.

4.2 Weighted PINQ

Weighted PINQ (WPINQ) enforces differential privacy for counting queries with equijoins. A key distinction of this mechanism is that it produces a differentially private metric (called a weight), rather than a count. These weights are suitable for use in a workflow that generates differentially private synthetic data, from which counts are easily derived. The workflow described in [46] uses weights as input to a Markov chain Monte Carlo (MCMC) process.

CHORUS's implementation of WPINQ computes noisy weights for a given counting query according to the mechanism's definition [46]. Since the weights are differentially private, they can be released to the analyst for use with any desired workflow.

² Smooth sensitivity guarantees (ε, δ)-differential privacy, and incorporates the setting of δ into the smoothed sensitivity value.

The WPINQ mechanism adds a weight to each row of the database, updates the weights as the query executes to ensure that the query has a sensitivity of 1, and uses the Laplace mechanism to add noise to the weighted query result. WPINQ has been implemented as a standalone data processing engine with a specialized query language, but has not been integrated into any SQL DBMS.

Where a standard database is a collection of tuples in D^n, a weighted database (as defined in Proserpio et al. [46]) is a function from a tuple to its weight (D → R). In this setting, counting the number of tuples with a particular property is analogous to summing the weights of all such tuples. Counting queries can therefore be performed using sum.

In fact, summing weights in a weighted dataset produces exactly the same result as the corresponding counting query on the original dataset, when the query does not contain joins. When the query does contain joins, WPINQ scales the weight of each row of the join's output to maintain a sensitivity of 1. Proserpio et al. [46] define the weight of each row in a join as follows:

A ⋈ B = Σ_k (A_k × B_k^T) / (‖A_k‖ + ‖B_k‖)    (1)

Since the scaled weights ensure a sensitivity of 1, Laplace noise scaled to 1/ε is sufficient to enforce differential privacy. WPINQ adds noise with this scale to the results of the weighted query.

In SQL, we can accomplish the first task (adding weights) by adding a column to each relation. Consider transforming our previous example query:

SELECT COUNT(*) FROM trips

⇓

SELECT SUM(weight)
FROM (SELECT *, 1 AS weight FROM trips)

This transformation adds a weight of 1 to each row in the table, and changes the COUNT aggregation function into a SUM of the rows' weights. The correctness of this transformation is easy to see: as required by WPINQ [46], the transformed query adds a weight to each row, and uses SUM in place of COUNT.

We can accomplish the second task (scaling weights for joins) by first calculating the norms ‖A_k‖ and ‖B_k‖ for each key k, then the new weights for each row using A_k × B_k^T. For a join between the trips and drivers tables, for example, we can compute the norms for each key:

WITH tnorms AS (SELECT driver_id, SUM(weight) AS norm
                FROM trips
                GROUP BY driver_id),
     dnorms AS (SELECT id, SUM(weight) AS norm
                FROM drivers
                GROUP BY id)

Then, we join the norms relations with the original results and scale the weight for each row:

SELECT ..., (t.weight * d.weight) / (tn.norm + dn.norm) AS weight
FROM trips t, drivers d, tnorms tn, dnorms dn
WHERE t.driver_id = d.id
  AND t.driver_id = tn.driver_id
  AND d.id = dn.id

The correctness of this transformation follows from equation (1). The relation tnorms corresponds to ‖A_k‖, and dnorms to ‖B_k‖. For each key, t.weight corresponds to A_k, and d.weight to B_k.

Finally, we can accomplish the third task (adding Laplace noise scaled to 1/ε) as described earlier.
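Putting these pieces together, a sketch of the full weighted counting query for a join between trips and drivers might look as follows. This is our own assembly of the paper's fragments, not a query emitted verbatim by CHORUS; the Laplace step with scale 1/ε would then be applied exactly as in Section 4.1:

-- Sketch: weighted count for
--   SELECT COUNT(*) FROM trips JOIN drivers ON trips.driver_id = drivers.id
-- Each base table gets weight 1; join weights are rescaled per equation (1),
-- and the COUNT is replaced by a SUM of the rescaled weights.
WITH wtrips   AS (SELECT *, 1 AS weight FROM trips),
     wdrivers AS (SELECT *, 1 AS weight FROM drivers),
     tnorms   AS (SELECT driver_id, SUM(weight) AS norm FROM wtrips GROUP BY driver_id),
     dnorms   AS (SELECT id, SUM(weight) AS norm FROM wdrivers GROUP BY id)
SELECT SUM((t.weight * d.weight) / (tn.norm + dn.norm)) AS weighted_count
FROM wtrips t, wdrivers d, tnorms tn, dnorms dn
WHERE t.driver_id = d.id
  AND t.driver_id = tn.driver_id
  AND d.id = dn.id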

We formalize the construction of intrinsically private queries via WPINQ using three transformations: the Metadata Propagation transformation to add weights to each row, the Replace Aggregation Function transformation to replace counts with sums of weights, and the Laplace Noise transformation to add noise to the results. All three are defined in Section 5.

4.3 Sample & Aggregate

The Sample & Aggregate [43, 51] mechanism works for all statistical estimators, but does not support joins. Sample & Aggregate has been implemented in GUPT [40], a standalone data processing engine that operates on Python programs, but has never been integrated into a practical database. As defined by Smith [51], the mechanism has three steps:

1. Split the database into disjoint subsamples
2. Run the query on each subsample independently
3. Aggregate the results using a differentially-private algorithm

For differentially-private aggregation, Smith [51] suggests Widened Winsorized mean. Intuitively, Widened Winsorized mean first calculates the interquartile range—the difference between the 25th and 75th percentile of the subsampled results. Next, the algorithm widens this range to include slightly more data points, then clamps the subsampled results to lie within the widened range. This step eliminates outliers, which is important for enforcing differential privacy. Finally, the algorithm takes the average of the clamped results, and adds Laplace noise scaled to the size of the range (i.e. the effective sensitivity) divided by ε.

In SQL, we can accomplish tasks 1 and 2 by adding a GROUP BY clause to the original query. Consider a query that computes the average of trip lengths:

SELECT AVG(trip_distance) FROM trips

⇓

SELECT AVG(trip_distance), ROW_NUM() MOD n AS samp
FROM trips
GROUP BY samp

This transformation generates n subsamples and runs the original query on each one. The correctness of tasks 1 and 2 follows from the definition of subsampling: the GROUP BY ensures that the subsamples are disjoint, and that the query runs on each subsample independently. To accomplish task 3 (differentially private aggregation), we can use a straightforward implementation of Widened Winsorized mean in SQL, since the algorithm itself is the same for each original query.
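To illustrate the shape of task 3 without reproducing the full Widened Winsorized mean, the sketch below uses a simpler differentially private aggregation: it clamps each subsample estimate to an assumed fixed range [0, 100], averages the clamped estimates, and adds Laplace noise scaled to (range size)/(n · ε). The bounds, n = 10, and ε = 0.1 are illustrative assumptions, and this is a simplification of, not a substitute for, the aggregation CHORUS actually uses:

-- Simplified sketch of task 3 (not the paper's Widened Winsorized mean).
WITH samples AS (
  -- subsampled query from the transformation above, with n = 10
  SELECT AVG(trip_distance) AS est, ROW_NUM() MOD 10 AS samp
  FROM trips
  GROUP BY samp
),
clamped AS (
  -- clamp each subsample estimate to the assumed range [0, 100]
  SELECT LEAST(GREATEST(est, 0), 100) AS est FROM samples
),
agg AS (
  SELECT AVG(est) AS mean_est, RANDOM() - 0.5 AS u FROM clamped
)
-- one individual affects at most one subsample, so the mean moves by at most
-- (100 - 0) / 10 = 10; the Laplace scale is therefore 10 / 0.1 = 100
SELECT mean_est - 100 * SIGN(u) * LN(1 - 2 * ABS(u)) AS private_avg
FROM agg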

We formalize the construction of intrinsically private queries via Sample & Aggregate using the Subsampling transformation, defined in Section 5.

5 Formalization & Correctness

This section formalizes the construction of intrinsically private queries as introduced in Section 4. We begin by introducing notation (Section 5.1). We then define reusable transformation rules (Section 5.2) that can be composed to construct mechanisms. Next, we formalize the four mechanisms described earlier in terms of these rules (Section 5.3). Finally, we prove a correctness property: our transformations do not modify the semantics of the input query (Section 5.4).

By formalizing the transformation rules separately from the individual mechanisms, we allow the rules to be re-used in defining new mechanisms—taking advantage of the common algorithmic requirements demonstrated in Table 1. An additional benefit of this approach is the ability to prove correctness properties of the rules themselves, so that these properties extend to all mechanisms implemented using the rules.

5.1 Preliminaries

Core relational algebra. We formalize our transformations as rewriting rules on a core relational algebra that represents general statistical queries. We define the core relational algebra in Figure 2. This algebra includes the most commonly-used features of query languages like SQL: selection (σ), projection (Π), equijoins (⋈), and counting with and without aggregation. We also include several features specifically necessary for our implementations: constant values, random numbers, and the arithmetic functions ln, abs, and sign.

We use standard notation for relational algebra with a few exceptions. We extend the projection operator Π to attribute expressions, which allows projection to add new named columns to a relation. For example, if relation r has schema U, then the expression Π_{U ∪ weight:1} r adds a new column named "weight" with the default value 1 for each row to the existing columns in r. In addition, we combine aggregation and grouping, writing Count_{a1..an} to indicate a counting aggregation with grouping by columns a1..an. We write Sum^a_{a1..an} to indicate summation of column a, grouping by columns a1..an.

Notation for rewriting rules. Each transformation is defined as a set of inference rules that rewrites a relational algebra expression. A particular rule allows rewriting an expression as specified in its conclusion (below the line) if the conditions specified in its antecedent (above the line) are satisfied (either through syntactic properties of the query or by applying another inference rule).

Our approach relies on the ability to analyze and rewrite SQL queries. This rewriting can be achieved by a rule-based query optimization engine [1, 5].

Most of the conditions specified in our rewriting rules use standard notation. One exception is conditions of the form Q : U, which we use to denote that the query Q results in a relation with schema U. We extend this notation to denote the schemas of database tables (e.g. t : U) and relational expressions (e.g. r1 : U).

Some of our rewriting rules have global parameters, which we denote by setting them above the arrow indicating the rewriting rule itself. For example, r →^x Π_x r allows rewriting r to project only the column named x, where the value of x is provided as a parameter. Most parameters are values, but parameters can also be functions mapping a relational algebra expression to a new expression. For example, r →^f f(r) indicates rewriting r to the result of f(r).

5.2 Transformation Rules

Laplace Noise. All four mechanisms in Table 1 require generating noise according to the Laplace distribution. We accomplish this task using the Laplace Noise transformation, defined in Figure 3. This transformation has one parameter: γ, which defines the scale of the noise to be added to the query's result. For a query with sensitivity 1 and privacy budget ε, for example, γ = 1/ε suffices to enforce differential privacy.

The Laplace Noise transformation defines a single inference rule. This rule allows rewriting a top-level query with schema U according to the Lap function. Lap wraps the query in two projection operations; the first (defined in Unif) samples the uniform distribution for each value in the result, and the second (defined in Lap) uses this value to add noise drawn from the Laplace distribution.

Attribute expressions
  a : attribute name
  e ::= a | a : v
  v ::= a | n ∈ ℕ | v1 + v2 | v1 ∗ v2 | v1 / v2
      | rand() | randInt(n) | ln(v) | abs(v) | sign(v)

Relational transformations
  R ::= t | R1 ⋈_{x=y} R2 | Π_{e1,...,en} R | σ_ϕ R
      | Count(R) | Count_{a1..an}(R)

Selection predicates
  ϕ ::= e1 θ e2 | e θ v
  θ ::= < | ≤ | = | ≠ | ≥ | >

Queries
  Q ::= Count(R) | Count_{a1..an}(R) | Sum^a(R) | Sum^a_{a1..an}(R)

Figure 2: Syntax of core relational algebra

Q →^γ Q

       Q : U
  ────────────────
   Q →^γ Lap(Q)

where
  Unif(Q) = Π_{U ∪ {u_x : rand() − 0.5 | x ∈ U}}(Q)
  Lap(Q)  = Π_{x : x + (−γ) sign(u_x) ln(1 − 2 abs(u_x)) | x ∈ U}(Unif(Q))

Figure 3: Laplace Noise Transformation


[Figure 4: Metadata Propagation Transformation — inference rules for the rewriting relation R →^{i,j,c} R: a base rule adds a metadata column m (initialized by i) to each table t; projection and selection rules propagate m unchanged; the join rule applies j to the join's output; the rules for counting subqueries apply c to compute the new value of m; and the top-level rules for counting queries eliminate m to preserve the query's output schema.]

Metadata Propagation. Many mechanisms require tracking metadata about each row as the query executes. To accomplish this, we define the Metadata Propagation transformation in Figure 4. The Metadata Propagation transformation adds a column to each referenced table and initializes its value for each row in that table, then uses the specified composition functions to compose the values of the metadata column for each resulting row of a join or an aggregation.

The Metadata Propagation transformation has three parameters: i, a function defining the initial value of the metadata attribute for each row in a database table; j, a function specifying how to update the metadata column for a join of two relations; and c, a function specifying how to update the metadata column for subqueries.

The inference rule for a table t uses projection to add a new attribute to t's schema to hold the metadata, and initializes that attribute to the value of i(). The rules for projection and selection simply propagate the new attribute. The rule for joins applies the j function to perform a localized update of the metadata column. The rules for counting subqueries invoke the update function c to determine the new value for the metadata attribute. Finally, the rules for counting queries eliminate the metadata attribute to preserve the top-level schema of the query.

Replacing Aggregation Functions. The Replace Aggregation Function transformation, defined in Figure 5, allows replacing one aggregation function (Γ) with another (Γ′). To produce syntactically valid output, Γ and Γ′ must be drawn from the set of available aggregation functions. The antecedent is empty in the rewriting rules for this transformation because the rules operate only on the outermost operation of the query.

Q →^{Γ,Γ′} Q

  ─────────────────────          ─────────────────────────────────────
  Γ(r) →^{Γ,Γ′} Γ′(r)            Γ_{a1..an}(r) →^{Γ,Γ′} Γ′_{a1..an}(r)

Figure 5: Replace Aggregation Function Transformation

Subsampling. The Subsampling transformation, defined in Figure 6, splits the database into disjoint subsamples, runs the original query on each subsample, and aggregates the results according to a provided function. The Subsampling transformation is defined in terms of the Metadata Propagation transformation, and can be used to implement partition-based differential privacy mechanisms like Sample & Aggregate.

The parameters for the Subsampling transformation are A, a function that aggregates the subsampled results, and n, the number of disjoint subsamples to use during subsampling. Both inference rules defined by the transformation rewrite the queried relation using the Metadata Propagation transformation (r →^{i,j,c} r′). The parameters for Metadata Propagation initialize the metadata attribute with a random subsample number between zero and n, and propagate the subsample number over counting subqueries. The update functions for joins and counting subqueries are undefined, because subsampling is incompatible with queries containing these features.³

In order to satisfy the semantics preservation property, the aggregation function A must transform the query results on each subsample into a final result with the same shape as the original query. Many aggregation functions satisfy this property (e.g. the mean of all subsamples), but not all of them provide differential privacy.

5.3 Mechanism Definitions

We now formally define the mechanisms described earlier in terms of the transformations defined in Section 5.2. We describe each mechanism as a series of one or more transformations, and define the parameters for each transformation to obtain the correct semantics for the mechanism.

³ Subsampling changes the semantics of joins, since join keys in separate partitions will be prevented from matching. It also changes the semantics of aggregations in subqueries, since the aggregation is computed within a single subsample.


Q →^A Q

          r →^{i,j,c} r′                              r →^{i,j,c} r′
  ─────────────────────────────      ─────────────────────────────────────────────
  Count(r) →^A A(Count_m(r′))        Count_{G1..Gn}(r) →^A A(Count_{G1..Gn,m}(r′))

where i = randInt(n), j = ⊥, c(r) = ⊥, and A aggregates over m

Figure 6: Subsampling Transformation

Elastic Sensitivity
Let s be the Elastic Sensitivity of Q. Let γ = s/ε. If Q →^γ Q′, then Q′ is an intrinsically private query via Elastic Sensitivity.

Restricted Sensitivity
Let s be the Restricted Sensitivity of Q. Let γ = s/ε. If Q →^γ Q′, then Q′ is an intrinsically private query via Restricted Sensitivity.

Weighted PINQ
Let i = 1 and let j scale weights as described in Section 4.2. Let γ = 1/ε. If Q →^{i,j,c} Q′ →^{Γ,Γ′} Q′′ →^γ Q′′′, then Q′′′ is an intrinsically private query via Weighted PINQ.

Sample & Aggregate
Let A implement private Winsorized mean [51] for a desired ε. If Q →^A Q′, then Q′ is an intrinsically private query via Sample & Aggregate.

5.4 Correctness

The correctness criterion for a traditional query rewriting system is straightforward: the rewritten query should have the same semantics (i.e. return the same results) as the original query. For intrinsically private queries, however, the definition of correctness is less clear: enforcing differential privacy requires modifying the results to preserve privacy.

In this section, we define semantic preservation, which formalizes the intuitive notion that our transformations should perturb the result attributes of the query, as required for enforcing differential privacy, but change nothing else about its semantics. Semantic preservation holds when the transformation in question preserves the size and shape of the query's output, and additionally preserves its logical attributes. Logical attributes are those which are used as join keys or to perform filtering (i.e. the query makes decisions based on these attributes, instead of simply outputting them).

Each of our transformations preserves this property. Furthermore, semantic preservation is preserved over composition of transformations, so semantic preservation holds for any mechanism defined using our transformations—including those defined in Section 5.3.

Definition 1 (Logical Attribute). An attribute a is a logical attribute if it appears as a join key in a join expression, in the filter condition ϕ of a filter expression, or in the set of grouping attributes of a Count or Sum expression.

Definition 2 (Semantic Preservation). A transformation (→) satisfies semantic preservation if for all queries Q and Q′, if Q → Q′, then (1) the schema is preserved: Q : U ⇒ Q′ : U; (2) the number of output rows is preserved: |Q| = |Q′|; and (3) logical attributes are preserved (see Definition 3).

Definition 3 (Logical Attribute Preservation). Consider a transformation Q : U → Q′ : U. Split U into two sets of attributes {U_k, U_a}, such that U_k contains all of the attributes from U used as logical attributes in Q and U_a contains all of the other attributes. Now construct Q′_r by renaming each attribute k ∈ U_k in Q′ to k′. Then → preserves logical attributes if there exists a one-to-one relation E ⊆ Q × Q′_r such that for all e ∈ E and k ∈ U_k, e_k = e_{k′}.

Theorem 1 (Composition). If two transformations (→_a and →_b) both satisfy semantic preservation, then their composition (→_c) satisfies semantic preservation: for all queries Q, Q′, and Q′′, if Q →_a Q′ →_b Q′′ implies that Q →_c Q′′, and both →_a and →_b preserve semantics, then →_c preserves semantics.

Proof. Assume that →_a and →_b preserve semantics, and Q →_a Q′ →_b Q′′. We have that: (1) Q : U, Q′ : U, and Q′′ : U; (2) |Q| = |Q′| = |Q′′|; and (3) Q, Q′, and Q′′ contain the same logical attributes. Thus by Definition 2, →_c preserves semantics.

Theorem 2. The Laplace Noise transformation (Figure 3) satisfies the semantic preservation property.

Proof. The rules in Figure 3 define only one transformation, at the top level of the query. The Unif function adds a column u_x for each original column x; the Lap function consumes this column, replacing the original column x with its original value plus a value sampled from the Laplace distribution. The outermost projection produces exactly the same set of attributes as the input query Q, preserving the schema; the transformation adds only projection nodes, so the number of output rows is preserved; no join or select can occur at the top level of a query, so no logical attributes are modified.

Theorem 3. The Metadata Propagation transformation (Figure 4) satisfies the semantic preservation property.

Proof. The only rules that make major modifications to the original query are those for joins and counts. The other rules add a new attribute m and propagate it through the query, but do not change the number or contents of the rows of any relation. At the top level of the query (i.e. q ∈ Q), the transformation eliminates the attribute m. For queries without joins or subquery aggregations, the Metadata Propagation transformation is the identity transformation, so it satisfies semantic preservation.

We argue the remaining cases by induction on the structure of Q.
Case r1 ⋈_{A=B} r2. Let r = r1 ⋈_{A=B} r2. If j does not change the query's semantics, except to update the attribute m (i.e. r : U ⇒ r = Π_{U−m} j(r)), then the semantics of r are preserved.
Case Count(r) and Count_{G1..Gn}(r). The rule modifies the m attribute, but does not modify any other attribute or change the size of the relation, so semantics are preserved.

Theorem 4. The Replace Aggregation Function transformation (Figure 5) satisfies the semantic preservation property.

Proof. Aggregation function replacement has the potential to modify the values of the query's results, but not its shape or logical attributes. The transformation's only rule allows changing one function to another, but preserves the grouping columns and number of functions. The schema, number of rows, and logical attributes are therefore preserved.

Theorem 5 (Subsampling preserves semantics). If the aggregation function A aggregates over the m attribute, then our Subsampling transformation (Figure 6) satisfies the semantic preservation property.

Proof. We know that Q has the form Count(r) or Count_{G1..Gn}(r). By Theorem 3, we know that in either case, if r →^{i,j,c} r′, then r has the same semantics as r′. We proceed by cases.
Case Q = Count(r). Let q1 = Count_m(r′). By the definition of the transformation, Q′ = A(q1). The query q1 has exactly one row per unique value of m. Since A aggregates over m, A(q1) has exactly one row, and therefore preserves the semantics of Q.
Case Q = Count_{G1..Gn}(r). Let q1 = Count_{G1..Gn,m}(r′). By the definition of the transformation, Q′ = A(q1). The query q1 has exactly one row per unique tuple (G1..Gn, m). Since A aggregates over m, A(q1) has exactly one row per unique tuple (G1..Gn), and therefore preserves the semantics of Q.

5.5 Handling Histogram Queries

SQL queries can use the GROUP BY operator to return a relation representing a histogram, as in the following query which counts the number of trips greater than 100 miles in each city:

SELECT city_id, COUNT(*) AS count FROM trips
WHERE distance > 100
GROUP BY city_id

This type of query presents a problem for differential privacy because the presence or absence of a particular city in the results reveals whether the count for that city was zero.

The general solution to this problem is to require the analyst to explicitly provide a set of desired bins, and return a (noisy) count of zero for absent bins. Such an approach is used, for example, in PINQ [39], Weighted PINQ [46], and FLEX [31]. Unfortunately, this approach imposes an unfamiliar user experience and is burdensome for the analyst, who is never allowed to view the results directly.

For bins with a finite domain, CHORUS provides a superior solution by enumerating missing histogram bins automatically in the rewritten query. The missing bins are populated with empty aggregation values (e.g., 0 for counts) before mechanism-specific rewriting, at which point they are handled identically to non-empty bins. This allows the full results of histogram queries to be returned to the analyst without post-processing or interposition. If the domain cannot be enumerated (e.g., because it is unbounded), CHORUS falls back to the approach described above and does not release results directly to the analyst.

This feature requires the operator to define a mapping from columns that may be used in a GROUP BY clause to the database field containing the full set of values from that column's domain. This information may be defined manually or extracted from the database schema (e.g., via primary and foreign key constraints), and is provided during initial deployment.

In this example, suppose the full set of city ids is stored in column city_id of table cities. Using this information, CHORUS generates the following intermediate query:

WITH orig AS (
  SELECT city_id, COUNT(*) AS count FROM trips
  WHERE distance > 100
  GROUP BY city_id
)
SELECT cities.city_id,
       (CASE WHEN orig.count IS NULL THEN 0
             ELSE orig.count END) AS count
FROM orig RIGHT JOIN cities
  ON orig.city_id = cities.city_id

The RIGHT JOIN ensures that exactly one row exists in the output relation for each city_id in cities, and the CASE expression outputs a zero for each missing city in the original query's results. The query thus contains every city_id value regardless of the semantics of the original query. This intermediate query is then sent to the mechanism rewriter, which adds noise to each bin as normal.

6 Implementation & Deployment

This section describes our implementation of CHORUS and our experience deploying it to enforce differential privacy at Uber.

6.1 Implementation

Our implementation of CHORUS automatically transforms an input SQL query into an intrinsically private query. CHORUS currently supports the four differential privacy mechanisms discussed here, and is designed for easy extension to new mechanisms. We have released CHORUS as an open source project [3].

Our implementation is built on Apache Calcite [1], a generic query optimization framework that transforms input queries into a relational algebra tree and provides facilities for transforming the tree and emitting a new SQL query. We built a custom dataflow analysis and rewriting framework on Calcite to support intrinsically private queries. The framework, mechanism-specific analyses, and rewriting rules are implemented in 5,096 lines of Java and Scala code.

The approach could also be implemented with other query optimization frameworks or rule-based query rewriters such as Starburst [45], ORCA [5], and Cascades [26].

6.2 Privacy Budget Management

We have designed CHORUS to be flexible in its handling of the privacy budget, since the best approach in a given setting is likely to depend on the domain and the kinds of queries posed. A complete study of approaches for managing the privacy budget is beyond the scope of this work, but we outline some possible strategies here. We describe the specific method used in our deployment in the next subsection.

As described earlier, a simple approach to budget management involves standard composition. More sophisticated methods for privacy budget accounting include advanced composition [23] and parallel composition [19], both of which are directly applicable in our setting. For some mechanisms, the moments accountant [7] could be used to further reduce privacy cost.
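To make the bookkeeping concrete, the sketch below contrasts the two composition bounds for k queries that each satisfy ε-differential privacy; the advanced composition bound follows Dwork, Rothblum, and Vadhan [23]. The function names are ours, not CHORUS's.

    // Standard composition: k queries at eps each cost k * eps.
    def standardComposition(eps: Double, k: Int): Double = k * eps

    // Advanced composition [23]: k eps-DP queries jointly satisfy
    // (epsPrime, k * delta + deltaPrime)-DP, where
    //   epsPrime = sqrt(2k ln(1/deltaPrime)) * eps + k * eps * (exp(eps) - 1)
    def advancedComposition(eps: Double, k: Int, deltaPrime: Double): Double =
      math.sqrt(2 * k * math.log(1 / deltaPrime)) * eps +
        k * eps * (math.exp(eps) - 1)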

Support for other mechanisms. Mechanisms themselves can also have a positive effect on the privacy budget, and many mechanisms have been designed to provide improved accuracy for a workload of similar queries. Many of these mechanisms are implemented in terms of lower-level mechanisms (such as the Laplace mechanism) that CHORUS already supports, and therefore could be easily integrated in CHORUS.

The sparse vector technique [21] answers a sequence of queries, but only gives answers for queries whose results lie above a given threshold. The technique is implemented in terms of the Laplace mechanism.

The Matrix Mechanism [36] and MWEM [28] algorithms both answer a query workload by posing carefully chosen queries on the private database using a lower-level mechanism (e.g., the Laplace mechanism), then using the results to build a representation that can answer the queries in the workload.

The Exponential Mechanism [38] enforces differential privacy for queries that return categorical (rather than numeric) data, by picking from the possible outputs with probability generated from an analyst-provided scoring function. This technique can be implemented as an intrinsically private query if the scoring function is given in SQL; the transformed query can run the function on each possible output and then pick from the possibilities according to the generated distribution.
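The sampling step that the rewritten query would have to emulate is shown below in Scala for clarity (the scores themselves would be computed in SQL in an intrinsically private query); this is a sketch of the mechanism, not CHORUS code.

    // Exponential mechanism: choose output o with probability proportional
    // to exp(eps * score(o) / (2 * sensitivity)), where sensitivity bounds
    // how much the score can change with one individual's data.
    def exponentialMechanism[A](outputs: Seq[A], score: A => Double,
                                sensitivity: Double, eps: Double): A = {
      val weights = outputs.map(o => math.exp(eps * score(o) / (2 * sensitivity)))
      val r = scala.util.Random.nextDouble() * weights.sum
      // Walk the cumulative distribution and return the first output whose
      // cumulative weight reaches r.
      outputs.zip(weights.scanLeft(0.0)(_ + _).tail).find(_._2 >= r).get._1
    }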

6.3 Deployment

CHORUS is currently being deployed to enforce differential privacy for analysts that query customer data at Uber. The primary goals of this deployment are to protect the privacy of customers from insider attacks, and to ensure compliance with the requirements of Europe's General Data Protection Regulation (GDPR) [4]. In the current deployment, CHORUS processes more than 10,000 queries per day.

Data environment & architecture. The data environment into which CHORUS is deployed consists of several DBMSs (three primary databases, plus several more for specific applications), and a single central query interface through which all queries are submitted. The query interface is implemented as a microservice that performs query processing and then submits the query to the appropriate DBMS and returns the results.

Our deployment involves a minimal wrapper around the CHORUS library to expose its rewriting functionality as a microservice. The only required change to the data environment was a single modification to the query interface, to submit queries to the CHORUS microservice for rewriting before execution. The wrapper around CHORUS also queries a policy microservice to determine the security and privacy policy for the user submitting the query. This policy informs which rewriter is used: by default, differential privacy is required, but for some privileged users performing specific business tasks, differential privacy is only used for older data.

A major challenge of this deployment has been supporting the variety of SQL dialects used by the various DBMSs. The Calcite framework is intended to provide support for multiple dialects, but this support is incomplete and we have had to make changes to Calcite in order to support custom SQL dialects such as Vertica.

Privacy budget. The privacy budget is managed by the microservice wrapper around CHORUS. The microservice maintains a small amount of state to keep track of the current cumulative privacy cost of all queries submitted so far, and updates this state when a new query is submitted.

The current design of the CHORUS microservice maintains a single global budget, and uses advanced composition [22] to track the total budget used for the queries submitted so far.
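One way this bookkeeping could be implemented is sketched below (class and method names are hypothetical, and the actual deployment's accounting may differ): the wrapper records each query's cost, recomputes the cumulative cost under a conservative application of advanced composition, and rejects queries that would exceed the global budget.

    // Sketch of global budget tracking in the microservice wrapper.
    class GlobalBudget(totalEps: Double, deltaPrime: Double) {
      private var perQueryEps = List.empty[Double]

      // Records the query's cost and returns true if the cumulative cost
      // (advanced composition bound, conservatively using the largest
      // per-query epsilon) stays within the global budget.
      def tryCharge(queryEps: Double): Boolean = synchronized {
        val eps   = queryEps :: perQueryEps
        val k     = eps.length
        val worst = eps.max
        val cumulative =
          math.sqrt(2 * k * math.log(1 / deltaPrime)) * worst +
            k * worst * (math.exp(worst) - 1)
        if (cumulative > totalEps) false
        else { perQueryEps = eps; true }
      }
    }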

As we gain experience with the deployment, we are beginning to consider more sophisticated budget management approaches that take advantage of properties of the data and the query workload. For example, new data is added to the database continuously in this particular use case, so recent work leveraging the growth of the database to answer an unbounded number of queries [14] may be directly applicable.

Mechanism selection. Our deployment of CHORUS leverages a syntax-based selection procedure which aims to optimize for utility (low error). As we show in Section 7.4, this approach performs well for this deployment. For different query workloads, other approaches may work significantly better, and CHORUS is designed to support extension to these cases.

The syntax-based approach uses a set of rules that map SQL constructs supported by each mechanism to a heuristic scoring function indicating how likely queries using that construct are to yield high utility. For example, Restricted sensitivity [10] supports counting queries with joins, but does not support many-to-many joins. Elastic sensitivity [31] supports a wider set of equijoins, including many-to-many joins, but generally provides slightly lower utility than Restricted sensitivity due to smoothing. Sample & aggregate [51] does not support joins, but does support additional aggregation functions (including "average" and "median").

When a query is submitted, the mechanism selection engine analyzes the query to determine its syntactic properties, including how many joins it has, of what type, and what aggregation functions it uses. It then applies the rules to determine which mechanisms can support the query and selects the mechanism with the highest score. Note that this process does not depend on the data and hence does not consume privacy budget.
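The sketch below illustrates the shape of such rules; the feature set, scores, and mechanism list (WPINQ is omitted for brevity) are illustrative only and are not the deployed rule set.

    // Syntactic features extracted from the query.
    case class QueryFeatures(joins: Int, hasManyToManyJoin: Boolean,
                             aggregations: Set[String])

    // Score is None when the mechanism cannot support the query at all.
    def score(mechanism: String, q: QueryFeatures): Option[Double] = mechanism match {
      case "RestrictedSensitivity"
          if q.aggregations == Set("COUNT") && !q.hasManyToManyJoin => Some(1.0)
      case "ElasticSensitivity"
          if q.aggregations == Set("COUNT")                         => Some(0.9)
      case "SampleAndAggregate"
          if q.joins == 0                                           => Some(0.8)
      case _                                                        => None
    }

    // Pick the highest-scoring mechanism that supports the query, if any.
    def selectMechanism(q: QueryFeatures): Option[String] = {
      val mechanisms = Seq("ElasticSensitivity", "RestrictedSensitivity", "SampleAndAggregate")
      val scored = mechanisms.flatMap(m => score(m, q).map(m -> _))
      if (scored.isEmpty) None else Some(scored.maxBy(_._2)._1)
    }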

This approach represents a simple but effective strategy for automatic mechanism selection. In Section 7.4, we demonstrate that our rules are effective for selecting the best mechanism on a real-world query workload. This approach is also easily extended when a new mechanism is added: the mechanism designer simply adds new rules for SQL constructs supported by the mechanism. Moreover, the scoring function can be tuned for other objectives, for example favoring mechanisms achieving low performance overhead rather than highest utility.

7 Evaluation

This section reports results of the following experiments:

• We report the percentage of queries that can be supported by each mechanism as intrinsically private queries using a corpus of real-world queries, demonstrating that a combination of the four evaluated mechanisms covers 93.9% of these queries.

• We use CHORUS to conduct the first empirical study of several differential privacy mechanisms on a real-world SQL workload. We report the performance overhead and utility of each mechanism across its supported class of queries.

• We demonstrate that our rule-based approach for automatic mechanism selection is effective at selecting the best mechanism for each input query. Using the simple set of rules presented earlier, our approach selects the optimal mechanism for nearly 90% of the queries in our corpus.

Corpus. We use a corpus of 18,774 real-world queries containing all statistical queries executed by data analysts at Uber during October 2016.

The corpus includes queries written for several use cases including fraud detection, marketing, business intelligence and general data exploration. It is therefore highly diverse and representative of SQL data analytics queries generally.


Figure 7: Size and relationship of query sets supported by each evaluated mechanism.

The queries were executed on a database of data sampled from the production database.

7.1 Mechanism Support for Queries

Each mechanism evaluated supports a different subset of queries. This is due to the unique limitations and supported constructs of each mechanism, as summarized in Section 3. We measured the percentage of queries from our corpus supported by each mechanism to assess that mechanism's coverage on a real-world workload. Figure 7 depicts the relative size and overlap of each set of supported queries for the evaluated mechanisms. Elastic Sensitivity is the most general mechanism and can support 71.4% of the queries in our corpus, followed by Restricted Sensitivity (57.6%), WPINQ (30.3%) and Sample & Aggregate (45.4%).

Elastic Sensitivity and Restricted Sensitivity support largely the same class of queries, and WPINQ supports a subset of the queries supported by these two mechanisms. In Section 7.5 we discuss limitations preventing the use of WPINQ for certain classes of queries supported by Elastic Sensitivity and Restricted Sensitivity.

Sample & Aggregate supports some queries supported by other mechanisms (counting queries that do not use join), as well as a class of queries using statistical estimators (such as sum and average) that are not supported by the other mechanisms. In total, 93.9% of queries are supported by at least one of the four mechanisms.

The results highlight a key advantage of our approach: different classes of queries can be simultaneously supported via selection of one or more specialized mechanisms. This ensures robust support across a wide range of general and specialized use cases, and allows incremental adoption of future state-of-the-art mechanisms.

7.2 Performance Overhead

We conduct a performance evaluation demonstrating the performance overhead of each mechanism when implemented as an intrinsically private query.

                              Overhead (%)            Primary cause
                              Mean       Median       of overhead
    Elastic Sensitivity       2.8        1.7          Random noise gen.
    Restricted Sensitivity    3.2        1.6          Random noise gen.
    WPINQ                     50.9       21.6         Additional joins
    Sample & Aggregate        587        394          Grouping/aggregation

Table 2: Performance overhead of evaluated differential privacy mechanisms.

Figure 8: Performance overhead of differential privacy mechanisms by execution time of original query.

Experiment Setup. We used a single HP Vertica 7.2.3 [6] node containing 300 million records including trips, rider and driver information, and other associated data stored across 8 tables. We submitted the queries locally and ran queries sequentially to avoid any effects from network latency and concurrent workloads.

To establish a baseline we ran each original query 10 times and recorded the average after dropping the lowest and highest times to control for outliers. For every mechanism, we used CHORUS to transform each of the mechanism's supported queries into an intrinsically private query. We executed each intrinsically private query 10 times and recorded the average execution time, again dropping the fastest and slowest times. We calculate the overhead for each query by comparing the average runtime of the intrinsically private query against its baseline.⁴
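The overhead computation just described is simple enough to state directly; the sketch below is a literal transcription of it (function names are ours).

    // Average 10 runs after dropping the fastest and slowest times.
    def trimmedMean(timesMs: Seq[Double]): Double = {
      val kept = timesMs.sorted.drop(1).dropRight(1)
      kept.sum / kept.size
    }

    // Percentage overhead of the rewritten query relative to its baseline.
    def overheadPercent(originalRuns: Seq[Double], rewrittenRuns: Seq[Double]): Double =
      100.0 * (trimmedMean(rewrittenRuns) - trimmedMean(originalRuns)) / trimmedMean(originalRuns)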

Results. The results are presented in Table 2. The average overhead and median overhead for Elastic Sensitivity are 2.82% and 1.7%, for Restricted Sensitivity these are 3.2% and 1.6%, for WPINQ 50.9% and 21.58%, and for Sample & Aggregate 587% and 394%.

Figure 8 shows the distribution of overhead as a function of original query execution time. This distribution shows that the percentage overhead is highest when the original query was very fast (less than 100ms). This is because even a small incremental performance cost is fractionally larger for these queries. The values reported in Table 2 are therefore a conservative estimate of the overhead apparent to the analyst.

⁴ Transformation time is negligible and therefore not included in the overhead calculation. The transformation time averages a few milliseconds, compared with an average query execution time of 1.5 seconds.

WPINQ and Sample & Aggregate significantly alter the way the query executes (see Section 4) and these changes increase query execution time. In the case of WPINQ, the query transformation adds a new join to the query each time weights are rescaled (i.e., one new join for each join in the original query), and these new joins result in the additional overhead. Sample & Aggregate requires additional grouping and aggregation steps. We hypothesize that these transformations are difficult for the database to optimize during execution. Figure 8 shows that, in both cases, the performance impact is amortized over higher query execution times, resulting in a lower relative overhead for more expensive queries.

7.3 Utility of Selected Mechanisms

CHORUS enables the first empirical study of the utility of many differential privacy mechanisms on a real-world query workload. This experiment reveals innate trends of each mechanism on a common database and query workload. For each differential privacy mechanism, this experiment reports the relative magnitude of error added to results of its supported query set. We present the results as a function of query sample size, discussed below.

Experiment Setup. We use the same setup described in the previous section to evaluate the utility of Elastic Sensitivity, Restricted Sensitivity, and Sample & Aggregate. As described in Section 4.2, WPINQ's output is a differentially private statistic used as input to a post-processing step, rather than an answer to the query posed, so we do not measure WPINQ's utility.

For each query, we set the privacy budget ε = 0.1 for all mechanisms. For Elastic Sensitivity, we set δ = n^(−ε ln n) (where n is the database size), following Dwork and Lei [18]. For Sample & Aggregate, we set the number of subsamples ℓ = n^0.4, following Mohan et al. [40].
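For concreteness, plugging the roughly 300 million records of the evaluation database into these formulas gives the following values (a quick sketch; approximations are ours).

    val epsilon    = 0.1
    val n          = 300e6                                   // ~300 million records
    val delta      = math.pow(n, -epsilon * math.log(n))     // n^(-eps * ln n), on the order of 1e-17 here
    val subsamples = math.pow(n, 0.4)                        // n^0.4, roughly 2,500 subsamples here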

We ran each intrinsically private query 10 times on the database and report the median relative error across these executions. For each run we report the relative error as the percentage difference between the differentially private result and the original non-private result. Consistent with previous evaluations of differential privacy [30], we report error as a proxy for utility since data analysts are primarily concerned with accuracy of results.

If a query returns multiple rows (e.g., histogram queries) we calculate the mean error across all histogram bins. If the query returns multiple columns we treat each output column independently since noise is applied separately to every column.

Query Sample Size. Our corpus includes queries covering a broad spectrum of use cases, from highly selective analytics (e.g., trips in San Francisco completed in the past hour) to statistics of large populations (e.g., all trips in the US). Differential privacy generally requires the addition of more noise to highly selective queries than to queries over large populations, since the influence of any individual's data diminishes as population size increases. Consequently, a query's selectivity is important for interpreting the relative error introduced by differential privacy. To measure the selectivity we calculate the sample size of every aggregation function in the original query, which represents the number of input records to which the function was applied.

Results. Figures 9 and 10 show the results of this experiment. All three mechanisms exhibit the expected inverse relationship between sample size and error; moreover, this trend is apparent for queries with and without joins.

Where the other three mechanisms support only counting queries, Sample & Aggregate supports all statistical estimators. Figure 10 shows the utility results for Sample & Aggregate, highlighting the aggregation function used. These results indicate that Sample & Aggregate can provide high utility (<10% error) for each of its supported aggregation functions on approximately half of the queries.

7.4 Automatic Mechanism Selection

We evaluated the effectiveness of the syntax-based automatic mechanism selection approach described in Section 6. For each query in our corpus, this experiment compares the utility achieved by the mechanism selected by our rule-based approach to the best possible utility achievable by any mechanism implemented in CHORUS.

Experiment Setup. We used the same corpus of queries and the same database of trips as in the other experiments. For each query, we ran all of the mechanisms that support the query and recorded the relative error (i.e., utility) of each one. We defined the oracle utility for each query to be the minimum error achieved by any of the four implemented mechanisms for that query. The oracle utility is intended to represent the utility that could be obtained if a perfect oracle for mechanism selection were available. We used our syntax-based mechanism selection method to select a single mechanism, and determined the utility of that mechanism. Finally, we computed the difference between the oracle utility and the utility achieved by our selected mechanism.
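The per-query comparison amounts to the small computation sketched below (names are ours; `runs` holds one entry per mechanism that supports the query, and `selected` must be among them).

    case class MechanismRun(mechanism: String, relativeError: Double)

    // Oracle utility is the minimum error over all supporting mechanisms;
    // the gap measures how far the syntax-selected mechanism falls short.
    def oracleGap(runs: Seq[MechanismRun], selected: String): Double = {
      val oracleError   = runs.map(_.relativeError).min
      val selectedError = runs.find(_.mechanism == selected).get.relativeError
      selectedError - oracleError  // 0 means the rules matched the oracle
    }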


Figure 9: Utility of elastic sensitivity and restricted sensitivity, by presence of joins.

Figure 10: Utility of Sample & Aggregate, by aggregation function.

Figure 11: Effectiveness of automatic mechanism selection. Of 12,773 queries, 11,217 (88%) achieve utility equal to the oracle utility, 952 (7%) are within 10% of the oracle utility, and 604 (5%) are more than 10% worse than the oracle utility.

Results. We present the results in Figure 11. For 88% of the queries in our corpus, the automatic mechanism selection rules select the best mechanism, and therefore provide the same utility as the oracle utility. Of the remaining queries, the selected mechanism often provides nearly oracle utility: for 7% of queries, the selected mechanism is within 10% error of the oracle utility.

The remaining queries (5%) represent opportunities for improving the approach, perhaps through the use of a prediction model trained on features of the query and data. Previous work [30, 34] uses such a machine learning-based approach; for range queries, these results suggest that a learning-based approach can be very effective, though the approach has not been evaluated on other types of queries.

7.5 Discussion and Key Takeaways

Strengths & weaknesses of differential privacy. The mechanisms we studied generally worked best for statistical queries over large populations. None of the mechanisms was able to provide accurate results (e.g., within 1% error) for a significant number of queries over populations smaller than 1,000. These results confirm the existing wisdom that differential privacy is ill-suited for queries with small sample sizes. For large populations (e.g., more than 10,000), on the other hand, multiple mechanisms were able to provide accurate results. In addition, a large set of such queries exists in our corpus.

Mechanism performance. Our performance evaluation highlights the variability in computation costs of differential privacy mechanisms. Approaches such as Elastic Sensitivity or Restricted Sensitivity incur little overhead, suggesting these mechanisms are ideal for performance-critical applications such as real-time analytics. Given their higher performance cost, mechanisms such as WPINQ and Sample & Aggregate may be most appropriate for specialized applications where performance is less important than suitability of the mechanism for a particular problem domain. For example, WPINQ is the only evaluated mechanism that supports synthetic data generation, a task known to be highly computation-intensive.

The performance of intrinsically private queries can depend on the database engine and transformations applied to the query. In this work we do not attempt to optimize the rewritten queries for performance.

Unsupported queries. The current implementation of CHORUS applies a single mechanism to an entire input query. As a result, every aggregation function used by the input query must be supported by the selected mechanism, or the transformation fails. For example, consider a query with joins that outputs both a count and an average. Neither Elastic Sensitivity (which does not support average) nor Sample & Aggregate (which does not support join) can fully support this query.

This issue disproportionately affects WPINQ, since our implementation of WPINQ does not support COUNT(DISTINCT ...) queries. It is not obvious how to do so: the weights of any record in the database only reflect the number of duplicate rows until a join is performed (and weights are rescaled).

It is possible to leverage multiple mechanisms in a single intrinsically private query by treating each output column separately. This approach would provide improved support for queries like the example above, which use several different aggregation functions. We leave such an extension to future work.

8 Related Work

Differential Privacy. Differential privacy was originally proposed by Dwork [16, 19, 17]. The reference by Dwork and Roth [22] provides an overview of the field.

Much recent work has focused on task-specific mechanisms for graph analysis [29, 49, 32, 33, 12], range queries [28, 35, 36, 37, 56, 54, 13, 47, 8, 55], and analysis of data streams [20, 50]. As described in Section 7.5, such mechanisms are complementary to our approach, and could be implemented on top of CHORUS to provide more efficient use of the privacy budget.

Differential Privacy Systems. A number of systems for enforcing differential privacy have been developed. PINQ [39] supports a LINQ-based query language, and implements the Laplace mechanism with a measure of global sensitivity. Weighted PINQ [46] extends PINQ to weighted datasets, and implements a specialized mechanism for that setting.

Airavat [48] enforces differential privacy for MapReduce programs using the Laplace mechanism. Fuzz [25, 27] enforces differential privacy for functional programs, using the Laplace mechanism in an approach similar to PINQ. DJoin [41] enforces differential privacy for queries over distributed datasets. Due to the additional restrictions associated with this setting, DJoin requires the use of special cryptographic functions during query execution and is therefore incompatible with existing databases. GUPT [40] implements the Sample & Aggregate framework for Python programs.

In contrast to our approach, each of these systems supports only a single mechanism and, with the exception of Airavat, each implements a custom database engine.

Security & Privacy via Query Rewriting. To the best of our knowledge, ours is the first work on using query transformations to implement differential privacy mechanisms. However, this approach has been used in previous work to implement access control. Stonebraker and Wong [52] presented the first approach. Barker and Rosenthal [9] extended the approach to role-based access control by first constructing a view that encodes the access control policy, then rewriting input queries to add WHERE clauses that query the view. Byun and Li [11] use a similar approach to enforce purpose-based access control: purposes are attached to data in the database, then queries are modified to enforce purpose restrictions drawn from a policy.

9 Conclusion

This paper presents CHORUS, a system with a novel architecture for enforcing differential privacy for SQL queries on an unmodified database. CHORUS works by automatically transforming input queries into intrinsically private queries. We have described the deployment of CHORUS at Uber to provide differential privacy, where it processes more than 10,000 queries per day. We make CHORUS available as open source [3].

We used CHORUS to perform the first empirical evaluation of various mechanisms on real-world queries and data. The results demonstrate that our approach supports 93.9% of statistical queries in our corpus, integrates with a production DBMS without any modifications, and scales to hundreds of millions of records.

Acknowledgments

The authors would like to thank Om Thakkar for his helpful comments. This work was supported by the Center for Long-Term Cybersecurity, and DARPA & SPAWAR under contract N66001-15-C-4066. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

References

[1] Apache Calcite. http://calcite.apache.org. Accessed: 2017-04-27.

[2] Apple previews iOS 10, the biggest iOS release ever. http://www.apple.com/newsroom/2016/06/apple-previews-ios-10-biggest-ios-release-ever.html. Accessed: 2016-08-14.

[3] Chorus Source Code. https://github.com/uber/sql-differential-privacy.

[4] General Data Protection Regulation - Wikipedia. https://en.wikipedia.org/wiki/General_Data_Protection_Regulation.

[5] GPORCA. http://engineering.pivotal.io/post/gporca-open-source/. Accessed: 2017-04-27.

[6] Vertica Advanced Analytics. http://www8.hp.com/us/en/software-solutions/advanced-sql-big-data-analytics/. Accessed: 2017-01-20.

[7] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.

[8] Gergely Acs, Claude Castelluccia, and Rui Chen. Differentially private histogram publishing through lossy compression. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 1–10. IEEE, 2012.

[9] Steve Barker and Arnon Rosenthal. Flexible security policies in SQL. In Database and Application Security XV, pages 167–180. Springer, 2002.

[10] Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. Differentially private data analysis of social networks via restricted sensitivity. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, ITCS '13, pages 87–96, New York, NY, USA, 2013. ACM.

[11] Ji-Won Byun and Ninghui Li. Purpose based access control for privacy protection in relational database systems. The VLDB Journal, 17(4):603–619, 2008.

[12] Shixi Chen and Shuigeng Zhou. Recursive mechanism: Towards node differential privacy and unrestricted joins. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 653–664, New York, NY, USA, 2013. ACM.

[13] Graham Cormode, Cecilia Procopiuc, Divesh Srivastava, Entong Shen, and Ting Yu. Differentially private spatial decompositions. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pages 20–31. IEEE, 2012.

[14] Rachel Cummings, Sara Krehbiel, Kevin A Lai, and Uthaipon Tantipongpipat. Differential privacy for growing databases. arXiv preprint arXiv:1803.06416, 2018.

[15] Yves-Alexandre de Montjoye, Cesar A Hidalgo, Michel Verleysen, and Vincent D Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientific Reports, 3, 2013.

[16] Cynthia Dwork. Differential privacy. In Michele Bugliesi, Bart Preneel, Vladimiro Sassone, and Ingo Wegener, editors, Automata, Languages and Programming, volume 4052 of Lecture Notes in Computer Science, pages 1–12. Springer Berlin Heidelberg, 2006.

[17] Cynthia Dwork. Differential privacy: A survey of results. In Theory and Applications of Models of Computation, pages 1–19. Springer, 2008.

[18] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 371–380. ACM, 2009.

[19] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

[20] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N Rothblum. Differential privacy under continual observation. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pages 715–724. ACM, 2010.

[21] Cynthia Dwork, Moni Naor, Omer Reingold, Guy N Rothblum, and Salil Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 381–390. ACM, 2009.


[22] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.

[23] Cynthia Dwork, Guy N Rothblum, and Salil Vadhan. Boosting and differential privacy. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 51–60. IEEE, 2010.

[24] Ulfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067. ACM, 2014.

[25] Marco Gaboardi, Andreas Haeberlen, Justin Hsu, Arjun Narayan, and Benjamin C Pierce. Linear dependent types for differential privacy. In ACM SIGPLAN Notices, volume 48, pages 357–370. ACM, 2013.

[26] Goetz Graefe. The Cascades framework for query optimization. IEEE Data Engineering Bulletin, 18(3), 1995.

[27] Andreas Haeberlen, Benjamin C Pierce, and Arjun Narayan. Differential privacy under fire. In USENIX Security Symposium, 2011.

[28] Moritz Hardt, Katrina Ligett, and Frank McSherry. A simple and practical algorithm for differentially private data release. In Advances in Neural Information Processing Systems, pages 2339–2347, 2012.

[29] Michael Hay, Chao Li, Gerome Miklau, and David Jensen. Accurate estimation of the degree distribution of private networks. In Data Mining, 2009. ICDM'09. Ninth IEEE International Conference on, pages 169–178. IEEE, 2009.

[30] Michael Hay, Ashwin Machanavajjhala, Gerome Miklau, Yan Chen, and Dan Zhang. Principled evaluation of differentially private algorithms using DPBench. In Fatma Ozcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 139–154. ACM, 2016.

[31] Noah Johnson, Joe Near, and Dawn Song. Practical differential privacy for SQL queries using elastic sensitivity. arXiv:1706.09479, 2017.

[32] Vishesh Karwa, Sofya Raskhodnikova, Adam Smith, and Grigory Yaroslavtsev. Private analysis of graph structure. Proceedings of the VLDB Endowment, 4(11):1146–1157, 2011.

[33] Shiva Prasad Kasiviswanathan, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Analyzing graphs with node differential privacy. In Theory of Cryptography, pages 457–476. Springer, 2013.

[34] Ios Kotsogiannis, Ashwin Machanavajjhala, Michael Hay, and Gerome Miklau. Pythia: Data dependent differentially private algorithm selection. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1323–1337. ACM, 2017.

[35] Chao Li, Michael Hay, Gerome Miklau, and Yue Wang. A data- and workload-aware algorithm for range queries under differential privacy. Proceedings of the VLDB Endowment, 7(5):341–352, 2014.

[36] Chao Li, Michael Hay, Vibhor Rastogi, Gerome Miklau, and Andrew McGregor. Optimizing linear counting queries under differential privacy. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 123–134. ACM, 2010.

[37] Chao Li, Gerome Miklau, Michael Hay, Andrew McGregor, and Vibhor Rastogi. The matrix mechanism: optimizing linear counting queries under differential privacy. The VLDB Journal, 24(6):757–781, 2015.

[38] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Foundations of Computer Science, 2007. FOCS'07. 48th Annual IEEE Symposium on, pages 94–103. IEEE, 2007.

[39] Frank D McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 19–30. ACM, 2009.

[40] Prashanth Mohan, Abhradeep Thakurta, Elaine Shi, Dawn Song, and David Culler. GUPT: privacy preserving data analysis made easy. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 349–360. ACM, 2012.


[41] Arjun Narayan and Andreas Haeberlen. DJoin: differentially private join queries over distributed databases. In Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 149–162, 2012.

[42] Arvind Narayanan and Vitaly Shmatikov. How to break anonymity of the Netflix prize dataset. CoRR, abs/cs/0610105, 2006.

[43] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pages 75–84. ACM, 2007.

[44] Vijay Pandurangan. On taxis and rainbows: Lessons from NYC's improperly anonymized taxi logs. https://medium.com/@vijayp/of-taxis-and-rainbows-f6bc289679a1. Accessed: 2015-11-09.

[45] Hamid Pirahesh, Joseph M. Hellerstein, and Waqar Hasan. Extensible/rule based query rewrite optimization in Starburst. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, SIGMOD '92, pages 39–48, New York, NY, USA, 1992. ACM.

[46] Davide Proserpio, Sharon Goldberg, and Frank McSherry. Calibrating data to sensitivity in private data analysis: A platform for differentially-private analysis of weighted datasets. Proceedings of the VLDB Endowment, 7(8):637–648, 2014.

[47] Wahbeh Qardaji, Weining Yang, and Ninghui Li. Differentially private grids for geospatial data. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 757–768. IEEE, 2013.

[48] Indrajit Roy, Srinath TV Setty, Ann Kilzer, Vitaly Shmatikov, and Emmett Witchel. Airavat: Security and privacy for MapReduce. In NSDI, volume 10, pages 297–312, 2010.

[49] Alessandra Sala, Xiaohan Zhao, Christo Wilson, Haitao Zheng, and Ben Y Zhao. Sharing graphs using differentially private graph models. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, pages 81–98. ACM, 2011.

[50] Elaine Shi, HTH Chan, Eleanor Rieffel, Richard Chow, and Dawn Song. Privacy-preserving aggregation of time-series data. In Annual Network & Distributed System Security Symposium (NDSS). Internet Society, 2011.

[51] Adam Smith. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 813–822. ACM, 2011.

[52] Michael Stonebraker and Eugene Wong. Access control in a relational data base management system by query modification. In Proceedings of the 1974 Annual Conference - Volume 1, pages 180–186. ACM, 1974.

[53] Latanya Sweeney. Weaving technology and policy together to maintain confidentiality. The Journal of Law, Medicine & Ethics, 25(2-3):98–110, 1997.

[54] Yonghui Xiao, Li Xiong, Liyue Fan, and Slawomir Goryczka. DPCube: differentially private histogram release through multidimensional partitioning. arXiv preprint arXiv:1202.5358, 2012.

[55] Jia Xu, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, Ge Yu, and Marianne Winslett. Differentially private histogram publication. The VLDB Journal, 22(6):797–822, 2013.

[56] Xiaojian Zhang, Rui Chen, Jianliang Xu, Xiaofeng Meng, and Yingtao Xie. Towards accurate histogram publication under differential privacy. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 587–595. SIAM, 2014.
