a practical model for collaborative databases: securely...

47
A Practical Model for Collaborative Databases: Securely Mixing, Searching and Computing Rachit Garg 1 , Nishant Kumar 2 , Shweta Agrawal 1 , Manoj Prabhakaran 3 1 IIT Madras 2 Microsoft Research India 3 IIT Bombay Abstract. We introduce the notion of a Functionally Encrypted Data- store which collects data from multiple data-owners, stores them en- crypted on an untrusted server, and allows untrusted clients to make select-and-compute queries on the collected data. Little coordination and no communication is required among the data-owners or the clients. Our security and performance profile is similar to that of conventional search- able encryption systems, while the functionality we offer is significantly richer. The client specifies a query as a pair (Q, f ) where Q is a filtering pred- icate which selects some subset of the dataset and f is a function on some computable values associated with the selected data. We provide efficient protocols for various functionalities of practical relevance. We demonstrate the utility, efficiency and scalability of our protocols via extensive experimentation. In particular, we use our protocols to model computations relevant to the Genome Wide Association Studies such as Minor Allele Frequency (MAF), Chi-square analysis and Hamming Distance. Keywords: Searchable encryption · Computing on encrypted cloud stor- age · Genome wide association studies 1 Introduction Given the importance of cloud computing today, enabling controlled computa- tion on large encrypted cloud storage is of much practical value in various privacy sensitive situations. Over the last several years, several tools have emerged that offer a variety of approaches towards this problem, offering different trade-offs among security, efficiency and generality. While theoretical schemes based on modern cryptographic tools like secure multi-party computation (MPC) [72,26], fully homomorphic encryption (FHE) [25] or functional encryption (FE) [62,6] can provide strong security guarantees, their computational and communication requirements are incompatible with most of the realistic applications today. At the other end are efficient tools like CryptDB [58], Monomi [67], Seabed [53] and Arx [56], which add a lightweight encryption layer under the hood of conven- tional database queries, but, as we discuss later, offer limited security guarantees and do not support collaborative databases. While there also exist tools which seek to strike a different balance by trading off some efficiency for more robust 1

Upload: others

Post on 28-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

A Practical Model for Collaborative Databases:Securely Mixing, Searching and Computing

Rachit Garg1, Nishant Kumar2, Shweta Agrawal1, Manoj Prabhakaran3

1IIT Madras 2Microsoft Research India 3IIT Bombay

Abstract. We introduce the notion of a Functionally Encrypted Data-store which collects data from multiple data-owners, stores them en-crypted on an untrusted server, and allows untrusted clients to makeselect-and-compute queries on the collected data. Little coordination andno communication is required among the data-owners or the clients. Oursecurity and performance profile is similar to that of conventional search-able encryption systems, while the functionality we offer is significantlyricher.The client specifies a query as a pair (Q, f) where Q is a filtering pred-icate which selects some subset of the dataset and f is a function onsome computable values associated with the selected data. We provideefficient protocols for various functionalities of practical relevance. Wedemonstrate the utility, efficiency and scalability of our protocols viaextensive experimentation. In particular, we use our protocols to modelcomputations relevant to the Genome Wide Association Studies suchas Minor Allele Frequency (MAF), Chi-square analysis and HammingDistance.

Keywords: Searchable encryption · Computing on encrypted cloud stor-age · Genome wide association studies

1 Introduction

Given the importance of cloud computing today, enabling controlled computa-tion on large encrypted cloud storage is of much practical value in various privacysensitive situations. Over the last several years, several tools have emerged thatoffer a variety of approaches towards this problem, offering different trade-offsamong security, efficiency and generality. While theoretical schemes based onmodern cryptographic tools like secure multi-party computation (MPC) [72,26],fully homomorphic encryption (FHE) [25] or functional encryption (FE) [62,6]can provide strong security guarantees, their computational and communicationrequirements are incompatible with most of the realistic applications today. Atthe other end are efficient tools like CryptDB [58], Monomi [67], Seabed [53] andArx [56], which add a lightweight encryption layer under the hood of conven-tional database queries, but, as we discuss later, offer limited security guaranteesand do not support collaborative databases. While there also exist tools whichseek to strike a different balance by trading off some efficiency for more robust

1

Page 2: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

security guarantees and better support for collaboration – like Searchable En-cryption (starting with [65,21]), and Controlled Functional Encryption [50] –they offer limited functionality.

In this work, we propose a new solution – called Functionally EncryptedDatastore (FED) – that can be used to implement a secure cloud-based data storethat collects data from multiple data-owners, stores it encrypted on an untrustedserver, and allows untrusted clients to make select-and-compute queries on thecollected data. Little coordination and no communication is required among thedata-owners or the clients. Our security and performance profile is similar tothat of conventional searchable encryption systems, while the functionality weoffer is significantly richer.

Some Motivating Scenarios. There are many scenarios today when data is col-lected from many users in a centralized database and made available to otherusers for querying. In such situations, the individuals whose data has been col-lected should be considered the actual data-owners, who are providing their datawith an expectation that it will be used only for certain well-specified purposes.One such example is that of census data: governments use census data for ad-ministrative purposes, and also provide restricted access to select researchers forselect purposes [68,74]. Another scenario involves private corporations who cancollect large amounts of information about their customers, which could be legit-imately useful in improving their customer service. A third example, and the onethat we use in our experimental analysis, is offered by genomic studies. GenomeWide Association Studies (GWAS) look into entire genomes across different indi-viduals to discover associations between genetic variants and particular diseasesor traits [9,46,48].

However, in all such scenarios, individuals’ privacy is vulnerable and the ab-sence of fine grained control can lead to significant problems. This was illustratedin an incident involving Ipsos Mori, a British marketing research firm, who of-fered to share data pertaining to 27 million mobile phone users with the police[40]. While they defended themselves by insisting that they will only release safeaggregated data which their users had acquiesced to [66], the deal was shelvedunder public pressure. Similarly, GWAS studies involve sharing highly sensitiveinformation about the individuals with several researchers in different parts ofthe world, and the individuals may have little control over how their data will beused in future. Indeed, as evidenced by the case of the Havasupai tribe againstthe Arizona State University [32,35], the researchers from the university collectedgenetic data for studying links to diabetes and later used it for other sensitivesubjects like schizophrenia, migration and inbreeding, without the consent of theindividuals or the community who contributed the data.

These are a few of the many scenarios that motivate the need to implementefficient, fine-grained control over computation on large scale data collected frommultiple data-owners.

Addressing the Challenge. Guided by the above scenarios, we set our goals. Werequire a datastore that provides strong security guarantees while offering flexible

2

Page 3: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

functionality — multiple data owners, and support for expressive computationqueries — and practical efficiency. We briefly discuss what this entails.

Firstly, accommodating multiple data owners in a flexible and scalable man-ner calls for several features:

– No coordination among data-owners. The data-owners should not need totrust, or even be aware of each other.

– Untrusted clients. The clients should not need to hold any state or key material(except as may be needed to authenticate themselves to the servers, in ahigher-level application). This allows opening up the datastore to a large setof clients without requiring to trust them.

– Data-owners oblivious of the clients. The data-owners should not need to beaware of the clients who will query the datastore in the future, and in partic-ular, cannot provide them with any keys. Further, the data-owners need notbe online when the queries arrive from the clients.

– Anonymously collaborative database. The association of the encrypted dataitems with their data-owners should be hidden from all the parties in thesystem.

A Trust Assumption. This set of requirements creates a tension with the ba-sic premise of untrusted servers in a secure datastore: if the data-owners areoblivious of the clients and are not online when clients access the datastore,then the datastore must be fully operable by the servers; but this allows theservers to freely access the collected data (e.g., by emulating the role of clientsto make as many accesses as they wish). Resolving this tension necessitates atrust assumption: we require that there are multiple non-colluding servers. Thisis often a reasonable assumption in practice, given the growing availability ofcloud computing services from competing service providers. We seek a solutionthat uses multiple servers minimally – there should be only one server with largestorage (who stores all the encrypted data), and a single auxiliary server (whomay store some key material). Either server by itself can be corrupt, but theywill be assumed not to collude with each other 1. We remark that the modelof non-colluding servers has been used quite successfully in the literature in thepast, including multi-prover interactive proof systems [4], multi-server PrivateInformation Retrieval [19], Oblivious RAM [47], CFE [50] 2 and in SearchableEncryption itself [52,57].

Our Solution. To support expressive queries beyond search, we seek efficientimplementation of typical (relational) database operations on a single table.Towards this we seek to support a select-and-compute functionality. We considerqueries to be specified as a pair (Q, f) where Q is a filtering predicate which

1 We observe that in the setting of single data owner, the auxiliary server may beemulated by the data owner itself. However, this requires the data owner to stayonline even during the query phase, something we wish to avoid.

2 the storage server is implicit in their constructions, data owner does storage andretrieval and no collusion between data-owner and key-server is assumed.

3

Page 4: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

selects some subset of the dataset and f is a function on the selected data. Akey feature we seek is that the computation overheads for a select-and-computequery should not scale with the entire database size, but only with the number ofselected records. This would offer substantial efficiency gains in most practicalapplications: for instance, in a census-like database for a country, a selectionquery for a zip-code would result in a small fraction of all the records to bechosen.

Finally, we seek a level of cryptographic security and efficiency that is typicalof existing searchable encryption works (e.g., [51,43]). Here, the security guaran-tees are formulated in the form of an ideal functionality (following the UniversallyComposable security framework) and well-defined leakage functions specified aspart of this functionality. These solutions are highly scalable and capable of han-dling large databases [39,51,11,21], with the query-time complexity scaling withthe number of selected rows (as discussed above). However, we note that theseworks are restricted to a single data owner and only search queries (rather thansearch-and-compute). For FED, we shall set a similar security-efficiency combi-nation as the goal, while requiring a much richer functionality, namely, multipledata owners and computation on selected data.

1.1 Our Results

As discussed above, we introduce the notion of a Functionally Encrypted Data-store (FED), which permits a data-owner to securely outsource its data to astorage server, such that, with the help of an auxiliary server, clients can effi-ciently carry out select-and-compute queries. We emphasize that our databaseis anonymously collaborative in the sense that it contains data belonging to mul-tiple data owners but hides the association of the (encrypted) data items withtheir owners.

Apart from introducing the notion of FED our contributions include:

– A general framework for instantiating FED schemes. The framework is mod-ular, and consists of two components, which may be of independent interest(when only the corresponding functionality is desired) – namely, SearchablyEncrypted Datastore (SED) and Computably Encrypted Datastore (CED).

– We provide instantiations of this framework for evaluating arbitrary functions,as well as important special classes of functions, the latter more efficiently thanthe former. As part of this, we instantiate a multiple data-owner version ofsearchable encryption, which is of independent interest.

– We demonstrate the utility and practicality of our protocols based on exten-sive experimentation, involving realistic statistical analysis tasks for genomicstudies.

Along the way we obtain several other results that are of independent interest.

– As a starting point for our constructions, we give constructions for “singledata-owner” versions of FED, SED and CED (denoted by sFED, sSED andsCED, respectively), which are simpler and more efficient, albeit offering nosupport for multiple data-owners.

4

Page 5: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

– We introduce a versatile collection of techniques under the umbrella of “onionsecret-sharing” by leveraging “onion encryption” which was originally usedin anonymous routing and mixed nets [61,18]. In our setting, we use these todistribute secret shares of multiple messages to multiple receivers, maintainingthe link across the shares of the same message while delinking the messagesfrom their origin. This yields an anonymously collaborative database as dis-cussed above.

1.2 Overview of Constructions

We present several modular constructions of FED (and the single data-ownerversion sFED), which can be instantiated in multiple ways, by plugging in dif-ferent implementations of its components. As discussed above, we identify twosimpler primitives, Searchably Encrypted Datastore (SED) and Computably En-crypted Datastore (CED), and show how they can be securely dovetailed into aconstruction of FED. Following is a roadmap to the constructions in this work.

sFEDSection 4.3

(see Figure 2)

FEDSection 5

(see Figure 4a)

SEDCED

Section 5.4 Section 5.2(see Figure 4b)

sCED

‣ Value Retrieval‣ Summation‣ Summation (alt)‣ General

Section 4.2 sSED

Section 4.1

map &merge-map

‣ OPRF-based‣ SFE-based

Section 5.3

OnionSecret-Sharing

Section 5.1

The starting point of our constructions are single data-owner versions sSEDand sCED, shown at the bottom of the above map. We show that these com-ponents can be implemented by leveraging constructions from the literature,namely, the Multi-Client Symmetric Searchable Encryption (MC-SSE) schemedue to Jarecki et al. [36] and Controlled Functional Encryption (CFE) due toNaveed et al. [50]. The search query family supported by our sSED constructionsare the same as in [36]. For sCED, we support a few specialized functions, as wellas a general function family. The primitives sSED and sCED are of independentinterest, and also they can be combined to yield a single data-owner version ofFED (called sFED).

To upgrade sSED and sCED constructions into full-fledged (multi data-owner) SED and CED schemes, we require several new ideas. One challenge

5

Page 6: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

in this setting is to be able to hide the association of the (encrypted) data itemswith their data-owners. Our approach is to first securely merge the data fromthe different data-owners in a way that removes this association, and then usethe single data-owner constructions on the merged data set. For this, both SEDand CED constructions rely on onion secret-sharing techniques (see Section 5.1for an overview). In the case of CED, merging essentially consists of a multi-set union. But in the case of sSED, merging entails merging “search indices”on individual data sets into a search index for the combined data set. Sincethe data-owners do not trust (or are even aware of) each other, this operationmust be implemented via the servers in an “oblivious” fashion, and onion secret-sharing techniques alone are not adequate. We propose two approaches to mergethe indices – one in the random oracle model using an Oblivious PseudorandomFunction (OPRF) protocol, and another one with comparable efficiency in thestandard model, relying on 2-party Secure Function Evaluation (SFE).

1.3 Related Work

Below, we contrast our notion of Functionally Encrypted Datastores with priorwork in terms of various features. A detailed summary appears in Table 2 inAppendix B.

– Multiple data owners: We allow encrypted data to be collected from mul-tiple data-owners. Prior works that allow this include multiparty computation[72,26], multi-input functional encryption [27], multi-key fully homomorphic en-cryption [45] and controlled functional encryption (CFE) [50], all of which suffercosts proportional to the entire dataset. A different line of work based on multi-key/multi-user Searchable Encryption [60,59,30,69] allows data to be shared frommultiple data owners to clients for searching of single keywords, but unlike thiswork, does not hide the association of the data to its owner, nor allows dataowners to be oblivious of the clients.

– Rich functionality with strong security : Symmetric searchable encryption(SSE) [21,12,11,54,51,23,7] and its extensions (including Structured Encryption[14,38], Queryable Encryption [16] etc.) support keyword searches and have per-formance sublinear in the size of the dataset, but do not allow computationon the search results. In contrast, tools such as CryptDB [58], Monomi [67],Seabed [53] and Arx [56] do allow full-fledged search-and -compute (searches areattribute based rather than keyword queries). But they incur higher leakage tothe server and argue security in the “snapshot attacker model” which has beencriticized for being unrealistically weak [28]. Moreover, their security analysisassumes fully trusted clients who do not collude with the server(s). In contrast,we offer a stronger security model where clients may be malicious, either servercan collude with a subset of data owners and/or clients and the attacker is per-sistent, i.e. has access to the view of the corrupt parties throughout the lifetimeof the system. Leakage to servers is limited to the leakage typical in SSE, andthere is no leakage to data owners or clients.

6

Page 7: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

– Computationally light clients: Our clients are very efficient and only per-form work proportional to the size of their queries and outputs, typically inde-pendent of database size. In comparison, the CryptDB family of constructionsdo not execute all the computational operations fully at the server, but insteadrequire the clients to download certain intermediate results, decrypt them andperform the remaining computation. Similarly, the CFE scheme of [50] requiresclients to evaluate a garbled circuit on the entire dataset. While multi client SSEhas more lightweight clients than the above, the clients still do work proportionalto the size of the filtered data.

– Deployability : We allow clients to join the system dynamically – indeed,clients and data owners can be oblivious of each others’ identity (or even number)in contrast with prior constructions. However, we do rely on the availability oftwo non-colluding servers, which, as mentioned earlier, is a necessary assumption.

2 Preliminaries

In this section, we define the preliminaries that we will use in our work. Adetailed index of notation appears in Table 1 in Appendix A.

Data Records. Throughout this paper, we consider databases and queries of thefollowing form.

– Data: Each data record in the database is of the form (w, x) ∈ W×X , wherew forms the searchable attributes and x the computable values. A databaseis a multi-set of such records.

– Search Queries: Q consists of the set of supported search queries of the formQ :W → {0, 1}. For a multi-set Z of data records, Q[Z] denotes the multi-setobtained by using Q to select records from Z and projecting them to the Xcoordinate. Formally, writing µS(y) to denote the multiplicity of an element

y in a multi-set S, we have µQ[Z](x) =∑w∈W:Q(w)=1

µZ(w, x).

– Computation Queries: F consists of the set of supported computationqueries. Each f ∈ F takes as input a multi-set consisting of elements in X .

– Leakage Function: A leakage function L maps data sets and (optionally)queries to a pair of messages (LS, LA), to be sent to a pair of servers.

Composite Queries. We shall also consider an extension of the definition ofQ[Z]: We allow Q = (Q1, · · · , Qd), where each Qi : W → {0, 1} and f =(f0, f1, · · · , fd), where for i > 0, fi are functions on multi-sets of values, and f0is a function on a d-tuple; we define

f(Q[Z]) := f0(f1(Q1[Z], · · · , fd(Qd[Z])).

We note that this could be further generalized to recursively allow Qi and fi tohave the same general structure.

7

Page 8: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

UC Security. We shall use the framework of Universally Composable security[10] to define our primitives. Towards this we shall present an ideal functional-ity, a corruption model (called non-colluding semi-honest servers model, below),and communication pattern restrictions on a protocol that securely realizes thefunctionality. The corruption model allows active corruption of some parties and(only) passive corruption of the others. Jumping ahead, we mention that in ourmodular constructions, the sub-protocols will be invoked using roles that areconsistent with the corruption model, so that a corruption pattern admissible ina protocol will lead to corruption patterns admissible in the sub-protocols too;this lets one use the universal composition theorem [10].

Oblivious PRF. An Oblivious PRF (OPRF) scheme [49] is a two-party protocolto securely evaluate a PRF FK(τ), in which the first party inputs τ , the secondinputs K, the first learns the value of FK(τ) (and the second learns nothing). Weuse the UC-secure OPRF protocol introduced in [37] in our constructions, whichallows active corruption of the first party (receiver) and passive corruption ofthe second party (one with the key).

Keyword Search Queries. The major search query families that have receivedattention in the searchable encryption literature – and also of interest to thiswork – are “keyword queries.”3 A keyword query is either a predicate aboutthe presence of a single keyword in a record (document), or a boolean formulaover such predicates. In terms of the notation above, the searchable attribute foreach record is a set of keywords, w ⊆ K where K is a given keyword space. Thatis, W = P(K), the power set of K. We define the class of keyword occurrencequeries, QK = {Qτ : τ ∈ K}, where Qτ is defined as Qτ (w) = 1 iff τ ∈ w. Amore complex search query can be specified as a boolean formula over severalsuch keyword occurrence predicates. We denote this query family by Q∗K.

We point out that keyword searches can be extended to more general databasequeries as follows: Each row in a database table is interpreted as a document,consisting of “keywords” of the form (attribute, value), one for each attribute(column) in the table. Then a search query that is specified as a formula overattribute-value equality predicates can be encoded as a formula over keyword oc-currence predicates. Range queries can be modeled as a disjunction over severalpredicates corresponding to a priori fixed ranges of varying sizes.

3 Defining FED

To define an FED scheme, we present an ideal functionaly (called FED), a cor-ruption model (called non-colluding semi-honest servers model, below), and com-

3 We remark that the concept of searchable encryption has been generalized to moreexpressive forms of search, such as pattern matching, range queries and searchingover structured data [63,15,67,17,44,22]. While our constructions do not focus onsuch search query families, our general framework applies to all these notions aswell.

8

Page 9: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

munication pattern restrictions on a protocol for it to be an FED scheme. Afterwe present these elements, an FED scheme is defined in Definition 1.

The Functionality. FED is formulated as a two stage functionality, involving aninitialization stage and a query stage. Figure 1 describes this functionality andalso gives a schematic representation. The parties involved are data-owners Di,a storage server S, an auxiliary server A, and clients, each one denoted asC.

As shown in Figure 1, FED functionality allows a client to send (Q, f) andlearn f(Q[Z]), where Z is the multi-set union of the data from all the data-owners. (The notation Q[Z] is defined in Section 2.)

(Q,F ,L)-FED functionality, supporting a search query family Q and a computationfunction family F , with leakage given by the function L:

– Initialization phase: When a data-owner Di sendsZi ⊆ W × X , generate (LS, LA) ← L(Zi) and sendthem to S and A respectively.

– Transition to the next phase on command from Sor A (notifying the other). Define the multiset unionZ =

⋃i Zi.

– Query phase: Each time a client C sends a query(Q, f) ∈ Q × F , generate (LS, LA) ← L(Q, f, Z) andsend them to S and A respectively; send f(Q[Z]) to C.

SEDCED

map &merge-map

OPRF-basedSFE-based

sFEDSection II

(see Figure 2)

FED

Section IV (see Figure 3)

Section IV-D Section IV-B(see Figure 4)

sCED

Value RetrievalSummationSummation (alt)General

Section III-A sSED

Section III-B

Section IV-C

OnionSecret-Sharing

Section IV-A

sFEDD C

A

S

Z Q, f

f(Q[Z])

sCED

D CSA

X

Q

f(δT[X])

sSED

W

f

Q[W ]

T

Z Q, f

f(Q[Z])

CED

CSA

Xm

Q

f(δT[X])

SED

Wm

f

Q[W ]

T

Q, f

f(Q[Z])FED C

A

S

Z1

Q, f

f(Q[Z])

Dm

D1

Zm

..

. … DmD1ZmZ1

X1

W1

A S

QW1 Wm

Q[W ]

C

W Q

K"

Q[W ]map

SED

DmD1

K#

merge-map

Fig. 1 The FED functionality. The dotted lines indicate leakage from functionality.Note that we do not allow any leakage to the data owners or the clients.

Non-Colluding Semi-Honest Servers Corruption Model. We define the non-colludingsemi-honest servers corruption model in which the adversary can corrupt anysubset of players, with the following restrictions:

– Only the clients can be actively corrupt. The other corrupt parties (S, A andthe data-owners) should remain honest-but-curious.

– The storage server S and the auxiliary server A cannot simultaneously becorrupt (i.e., they do not collude).

While our protocols can be slightly modified to accommodate actively corruptdata-owners as well, we do not do so. This is because, even given the idealFED functionality, one actively corrupt data-owner can choose to supply arbi-trary data to the database, potentially invalidating the results from the entiredatabase. To adequately handle active corruption of the data-owners, the FEDfunctionality should be augmented with the ability for the servers to enforcepolicies on the data being collected from the data-owners. We do not considerthis extension in this work.

Two-Phase One-Pass Protocol. As mentioned before, to be deployable in a real-istic setting, an FED scheme should have a restricted interaction pattern. Halevi

9

Page 10: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

et al. defined a server-based one-pass protocol [29] to model “secure computa-tion on the web”, where the clients join and leave a protocol one-by-one, while aserver remains online throughout. We adapt this model for the FED frameworkas follows. A protocol among data-owners, servers and clients is said to be atwo-phase one-pass protocol if it satisfies the following:

– The protocol consists of sequentially executing several sessions of two proto-cols, in two phases; the first protocol (executed in the first phase) involvesonly the servers and a single data-owner, while the second protocol (executedin the second phase) involves only the servers and a single client.

– The servers can maintain their state across the sessions. Every other partyparticipates only in one of the sessions.

– Each data owner and client is given the servers’ public-keys (but not eachothers’). The servers obtain the public-key for a data-owner or a client whenit joins the protocol.

– Only one of the servers has a “large” storage between the sessions – i.e., storagesize depending on the input size across the sessions – while the amount ofstorage for the other server(s) is upper bounded by a function of the securityparameter (independent of the inputs). (In our setting with two servers S andA, we define S as the one with large storage.)

Definition 1. A (Q,F ,L)-FED scheme is a two-phase one-pass protocol thatUC-securely implements the (Q,F ,L)-FED functionality (Figure 1) in the non-colluding semi-honest servers corruption model.

The above definition readily generalizes to other intermediate primitives we de-fine, by replacing the functionality FED by other functionalities that will bespecified later (namely, SED, CED, sFED, sSED and sCED).

Note that for a scheme to be fully specified, we need to describe not justQ and F , but also the leakage function L. In each of our constructions in thesequel, we shall do this explicitly.

4 Single Data-Owner Protocols

In this section we introduce a single data-owner version of FED, denoted bysFED, and also construct a sFED scheme. The single data-owner setting issimpler as it avoids having to “mix” data records from different data-owners.Our sFED scheme relies on two other new functionalities we introduce, namely,(single data-owner versions of) Searchably Encrypted Datastore (sSED) andComputably Encrypted Datastore (sCED). We begin by presenting these newfunctionalities.

4.1 Searchably Encrypted Datastore

Recall that in an FED or sFED scheme, a query has two components – a searchquery Q and a computation function f . The Searchably Encrypted Datastore

10

Page 11: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

functionality (SED or sSED) has a similar structure, but supports only thesearch query; all the records that match the search query are revealed to thestorage server S. Jumping ahead, the choice of S to be the party receiving theoutput rather than the client C is dictated by our plan to use this functionalityin protocols for FED and sFED.

The functionality sSED is depicted in Figure 2: There is a single data-ownerD with input W ⊆ W×I, where each element in W has a unique identifier id ∈ Ias its second coordinate; the output that S receives when a client C inputs Q isthe set of identities Q[W ] ⊆ I.

In Section 4.5, we shall see that a multi-client version of Symmetric Search-able Encryption (MC-SSE) from [36] can be used to construct an sSED scheme.The main limitations of MC-SSE compared to sSED are that (1) in the formerthe data-owner D remains online during the query phase whereas in the latterD can be online only during the initialization phase, and (2) in the former theoutput is delivered to both S and C whereas in the latter it must be deliveredonly to S. In our construction in Section 4.5, we shall leverage the auxiliaryserver A to meet these additional requirements.

4.2 Computably Encrypted Datastore

The second functionality we introduce – CED, or its single data-owner variantsCED– helps us securely carry out a computation on an already filtered dataset. The complexity of this computation will be related to the size of the filtereddata rather than the entire contents of the data set.

In sCED, as shown in Figure 2, a single data-owner D (who stays online onlyduring the initialization phase) has an input in the form of X ⊆ I × X . Later,during the query phase, clients can compute functions on a subset of data. Moreprecisely, a client C can specify a function f from a pre-determined functionfamily, and the storage server S specifies a set T ⊆ I, and C receives f(δT [X])where we define δT (id, x) = x iff id ∈ T , and δT [X] is the multiset of x valuesobtained by applying δT (id, x) to all elements of X.

In Section 4.4, we present protocols for sCED, for various specialized functionfamilies, as well as for a general function family.

4.3 sFED Protocol Template

Protocol sFED-templsSED,sCED: This protocol is illustrated in Figure 2. Dur-ing the initialization phase, D maps its input Z to a pair (W,X), where W ⊆W × I and X ⊆ I × X such that (w, x) 7→ ((w, id), (id, x)) where id is ran-domly drawn from (a sufficiently large set) I. Then, in the initialization phaseof sFED, the parties D, S and A invoke the initialization phase of sSED, andthat of sCED (possibly in parallel). During the query phase of sFED, S, A andC first invoke the query phase of sSED, so that S obtains T = Q[W ] as theoutput; then they invoke the query phase of sCED and C obtains f(δT [X]); note

11

Page 12: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

SEDCED

map &merge-map

OPRF-basedSFE-based

sFEDSection II

(see Figure 2)

FED

Section IV (see Figure 3)

Section IV-D Section IV-B(see Figure 4)

sCED

Value RetrievalSummationSummation (alt)General

Section III-A sSED

Section III-B

Section IV-C

OnionSecret-Sharing

Section IV-A

sFEDD C

A

S

Z Q, f

f(Q[Z])

sCED

D CSA

X

Q

f(δT[X])

sSED

W

f

Q[W ]

T

Z Q, f

f(Q[Z])

CED

CSA

Xm

Q

f(δT[X])

SED

Wm

f

Q[W ]

T

Q, f

f(Q[Z])FED C

A

S

Z1

Q, f

f(Q[Z])

Dm

D1

Zm

..

. … DmD1ZmZ1

X1

W1

A S

QW1 Wm

Q[W ]

C

W Q

K"

Q[W ]map

SED

DmD1

K#

merge-map

Fig. 2 sFED functionality (left) and the protocol sFED-templsSED,sCED. The dottedlines indicate leakage. In sFED-templ, the parties communicate to each other only byinvoking the functionalities sSED and sCED.

that δT [X] = Q[Z] if there are no collisions when elements are drawn from I toconstruct (W,X) from Z.

Leakage, LsFED-templ: The protocol sFED-templsSED,sCED leaks, for every query

(Q, f) from C, the set T = Q[W ] to S, and in addition provides S and A withthe leakage provided by the sSED and sCED functionalities (which depends onhow they are instantiated). Note that D chooses ids at random to define W andX and the leakage functions of sSED and sCED are applied to these sets. Also,as ids are random, leaking T amounts to only leaking its pattern over multiplequeries: specifically, T1, · · · , Tn contains only the information provided by theintersection sizes of various combinations of these sets. Formally, this leakage isgiven by

pattern(T1, · · · , Tn) := {∣∣ ⋂i∈S

Ti∣∣}S⊆{1...n}. (1)

The following theorem is a consequence of universal composition, applied tothe non-colluding semi-honest servers corruption model.

Theorem 1. Protocol sFED-templsSED,sCED, when instantiated using a (Q,LsSED)-sSED scheme and an (F ,LsCED)-sCED scheme, is a (Q,F ,LsFED-templ)-sFEDscheme, where LsFED-templ is as defined above.

Hence, to construct sFED, we need only instantiate appropriate sSED andsCED schemes. We proceed to do this next, starting with sCED.

4.4 sCED Protocols

We present protocols for a few different computation function families. We re-mark that sCED somewhat resembles a primitive called Controlled FunctionalEncryption [50], and our sCED protocols involve similar ideas as in that work.

In each of the protocols below, D has an input X ⊆ I × X during theinitialization phase. It will be convenient to define the set J ⊆ I as J ={id|∃x s.t. (id, x) ∈ X}. During each query, S has an input T ⊆ J and C has aninput f from the computation function family. Below, all our constructions usea PRF F .

12

Page 13: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

Value Retrieval: This is the functionality associated with standard SSE, wherethe selected values, or documents, are retrieved without any further computationon them. There is a single function in the corresponding computation functionfamily FValRet, given by f(δT [X]) = δT [X]. When the client C and the dataowner D are the same party (as is the case in the simplest version of SSE), thiscan be implemented in a straightforward fashion using a PRF. Below we givea simple scheme which relies on A to extend this to a setting with (multiple)clients who do not communicate directly with D.

Protocol ValRet

– Initialization Phase: D picks a PRF key K, and defines βid := xid⊕FK(id)and sends {(id, βid)}id∈X to S and K to A, who store them.

– Computation Phase: S sends T (randomly permuted) and a fresh PRF keyK1 to A; it also sends {βid ⊕FK1

(id) | id ∈ T} (under the same permutation)to C. A sends {FK(id) ⊕ FK1

(id) | id ∈ T} to C. C outputs {ai ⊕ bi}i, where{ai}i and {b}i are the messages it received from S and A (in the same order).

– Leakage, LValRet : On initialization, the ID-set J is leaked to S. On eachquery, T is leaked to A.

Recall that, in the overall sFED protocol template, ids will be chosen ran-domly. Hence leaking J amounts to leaking only its size |X| (the data can bepadded with dummy entries so that instead of |X|, only an upper bound on it isleaked), and leaking T amounts to only leaking its pattern over multiple queries(see Equation 1).

Theorem 2. Protocol ValRet is a (FValRet,LValRet)-sCED scheme, assumingthe security of the PRF scheme used.

Proof sketch: We briefly sketch the elements in the protocol that help it achievesecurity. In the initialization phase D secret-shares its data between the twonon-colluding servers, so that an adversary corrupting either one learns no in-formation about the data. In the computation phase, C receives freshly ran-domized secret-shares (using the key K1) of the answer to its query, with theelements randomly permuted. This is because, if C does not collude with oneof the servers it should receive no information other than the multi-set of re-trieved values, f(δT [X]). In particular, it may not learn whether an id selectedby one query gets selected again under another query. The permutation and freshsecret-sharing ensures that its view can be completely simulated just based onthe multi-set of retrieved values, f(δT [X]) = {xid|id ∈ T}. Note that if C col-ludes with one of the servers, this rerandomization has no effect, but also, inthat case, it is allowed to learn T and there is no need for rerandomization. �

Summation: The family FSum consists of the single function f such thatf(S) =

∑x∈S x (where the summation is in a given abelian group which the

domain of values is identified with). The following simple and efficient protocolyields a sCED scheme for summation, with A learning only the size and the“pattern” information about the input of S.

Protocol Sum

13

Page 14: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

– Initialization Phase: D picks a PRF key K, and defines βid := xid +FK(id)and sends {(id, βid)}id to S and K to A, who store them.

– Computation Phase: S defines the set R := {id | id ∈ T} and a randomvalue ρ; it sends (ρ,R) to A and γ := ρ +

∑id∈T βid to C. A sends δ :=

ρ+∑α∈R FK(α) to C. C outputs γ − δ.

– Leakage, LSum : On initialization, the ID-set J is leaked to S. On each query,T is leaked to A.

This protocol is a natural extension of the ValRet scheme above, with thesame initialization phase. The client C receives a fresh additive secret-sharing ofthe single output value it seeks. The security argument is similar to before; inparticular, if C does not collude with either server, its view can be completelysimulated from its output without any leakage.

Theorem 3. Protocol Sum is a (FSum,LSum)-sCED scheme, assuming the se-curity of the PRF scheme used.

In Appendix D.1 we present an alternate protocol for summation, using ad-ditive homomorphic encryption, which has lower communication complexity andalso avoids the leakage of the filtered set T to A. We also extend the value re-trieval and summation protocols to the setting when each value x is a vector(x1, · · · , xm), and the function f acts on a subset of attributes.

General Functions: Next we present a sCED scheme for the family Fckt

consisting of general functions represented using boolean circuits. Our schemeis similar to the CFE scheme of [50] for general functions although there aresome important differences between the two models. In particular, the modelof CFE conflates the client C and the storage server S; also, it does not allowthe data-owner D to directly communicate with the auxiliary sever A, resultingin the use of public-key encryption in [50], which we avoid, resulting in greaterefficiency.

A client C who wishes to evaluate a function f sends a circuit representationof f to A. The inputs to this circuit are the values {xid}id∈T which none of theparticipants in the query phase (C, A nor S) knows. At a high-level, the idea isthat A will construct a garbled circuit for f and sends it to S. For each input bitfor this circuit, there are two labels, which will both be encrypted by A using keysthat are derived from a master key that the data-owner D gives it (as describedbelow). All these encrypted labels are sent along with the garbled circuit. Toevaluate the garbled circuit, S needs to know how to decrypt one out of the twolabels corresponding to each input position. To enable the evaluation, D wouldhave provided S with the decryption key for the labels corresponding to each bitof xid, for each id (during the initialization phase). A detailed description of thisscheme is given in Figure 3.

The leakage function for this construction is as follows:Leakage, Lckt : On initialization, the ID-set J is leaked to S. On each query,

T and f are leaked to A, and the circuit structure of f (as revealed by the garbledcircuit) is leaked to S.

14

Page 15: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

Protocol CktEval: A sCED scheme for general functions:The values computed over (i.e., elements in X ) are represented as bit strings of the samesize, say |x| = t. We shall refer to a PRF F and a Symmetric Key Encryption scheme(SKE) with algorithms (SKE.Gen, SKE.Enc, SKE.Dec) (denoting Key Generation, En-cryption and Decryption algorithms respectively), which can be based on F .

– Initialization Phase: The data-owner D picks a PRF key K. For each (id, xid) ∈ X,where xid = (xid1 , · · · , xidt ), it computes {ωid,i, λ

0id,i, λ

1id,i}ti=1, where ωid,i = xid,i⊕νid,i,

with νid,i being a pseudorandom bit, and λbid,i (for b = 0, 1) are two encryption keysgenerated by SKE.Gen, all computed using FK(id, i) as the source of randomness. D

sends {id, {λxidi

id,i, ωid,i}ti=1}id∈X to S, and it sends K to A, for storing.– Computation Phase: C’s input is a function f defined on a multi-set (order of the

inputs not being important). We assume that given a number n, a circuit represen-tation of f can be efficiently computed corresponding the input being a multi-set ofsize n. The storage server S’ input is a set T ⊆ X.

1. S sends T to A (randomly permuted). Then, for each id ∈ T , A computes{νid,i, λ0

id,i, λ1id,i}ti=1 using {FK(id, i)}ti=1.

2. C sends f to A. Then, A constructs a garbled circuit for f specialized for |T | inputs.The input wires of this circuit are indexed by {(id, i)}id∈T,i∈[t]. Let (τ0id,i, τ

1id,i) be

the pair of wire-labels used for each such input wire.Then, for each (id, i), and b ∈ {0, 1}, it defines the ciphertext

cbid,i ← SKE.Encλb⊕νid,iid,i

(τb⊕νid,iid,i ).

Note that cωid,i

id,i is the encryption of τxidiid,i using the key λ

xidiid,i.

A sends to S {(c0id,i, c1id,i)}id∈T,i∈[t], along with the garbled circuit except the outputdecoding map. It sends the output decoding map to C.

3. For each (id, i), S decrypts cωid,i

id,i using λxidiid,i, to get τ

xidiid,i . It uses these labels for the

input wires to evaluate the circuit, and obtains the output labels. It sends theseoutput labels to C.

4. C uses the output decoding map from A, and output labels from S to calculatethe output.

Fig. 3 A sCED scheme for general functions.

The leakage can be further reduced by allowing C to specify any part ofthe function as a private input to the circuit computing f (rather than beinghardwired into the circuit). For each of the wires corresponding to this privateinput, the two labels will be provided by C to A, and C will send one of thesetwo labels to S, so that neither S nor A by itself learns this input. The theorembelow can be proven based on the security of the garbled circuits (similar to theargument in [50]):

Theorem 4. Protocol CktEval (Figure 3) is a (Fckt,Lckt)-sCED scheme, as-suming the security of the PRF scheme used.

Composite Queries: Recall that a composite query consists ofQ = (Q1, · · · , Qd)and f = (f0, f1, · · · , fd) such that f(Q[Z]) := f0(f1(Q1[Z]), · · · , fd(Qd[Z])).

15

Page 16: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

The sSED protocol for non-composite queries directly generalizes to compositequeries, by simply running d instances of the original sSED functionality to letS learn Ti = Qi[W ] for each i. But we need to adapt the sCED protocol to avoidrevealing each fi(Qi[Z]) to C. We discuss how to do so in Appendix D.2.

4.5 sSED Protocols

In this section, we discuss how to instantiate the building block of searchableencryption sSED by adapting constructions of symmetric searchable encryption(SSE) from the literature (see, e.g., [12] and references therein). We focus on thefamily of keyword search queries, Q∗K, which can in turn be used to capture moregeneral database queries (see Section 2). We shall adapt the SSE constructionof Jarecki et al. [36] into an sSED scheme, though the high-level ideas below areapplicable to adapting other constructions as well.

In SSE, there are only two parties: a client and a server. To store and searcha database with an SSE scheme, the client (who is the data owner) uploads anencrypted database to the server. Later, the client can carry out search querieson this database by interacting with the server, while providing the server withminimal information (patterns in the queries and responses). Jarecki et al. [36]extended SSE to a multi-client setting in which the data owner is separate from(and does not trust) the clients. Here, the data owner must remain online whenthe clients query the server, as the untrusted server should not be in a positionto answer queries by itself.

Recall that in sSED, we further separate the roles of the data owner and thequery-phase assistant, by introducing an untrusted auxiliary server. This allowsthe data owner to be present only at the initial phase when the database isconstructed. More importantly, in the multiple data owner setting (in which noone data owner can be trusted with access to all the data), it is crucial to avoidrelying on any one data owner to control clients’ access to the entire database.Another difference between sSED and SSE is that the latter reveals the searchoutcome (the IDs of the rows that match the search) to the client; in sSED, weseek to reveal this to S, but it cannot be revealed to the client (as the clientsare not provided with any leakage). Finally, in sSED, we seek to handle activecorruption of clients.

In Appendix C, we sketch how to adapt the MC-OXT protocol of [36] toaccount for the above requirements of sSED. We also present multiple security-efficiency tradeoffs. As a lightweight modification, we can simply let the auxiliaryserver A play the role of (1) the data owner during the query phase, and (2)the client. This incurs some leakage to A, namely the search queries and thesearch outcome. Note that when used in the sFED protocol template (Figure 2),since sSED is instantiated with random IDs, only the pattern of the searchoutcomes, as in Equation 1, rather than the search outcomes themselves arerevealed to A, as well as to S. Also since many of the sCED protocols alreadyleak this information to A, this provides an appropriate level of security for sSEDprotocols to be used in the sFED protocol template. For more details, please seeAppendix C.

16

Page 17: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

5 FED Protocols

Our FED protocol template is identical to that of the sFED protocol, exceptthat the functionalities sSED and sCED are replaced by the analogous multipledata-owner versions, SED and CED (see Figure 4a).

SEDCED

map &merge-map

OPRF-basedSFE-based

sFEDSection II

(see Figure 2)

FED

Section IV (see Figure 3)

Section IV-D Section IV-B(see Figure 4)

sCED

Value RetrievalSummationSummation (alt)General

Section III-A sSED

Section III-B

Section IV-C

OnionSecret-Sharing

Section IV-A

sFEDD C

A

S

Z Q, f

f(Q[Z])

sCED

D CSA

X

Q

f(δT[X])

sSED

W

f

Q[W ]

T

Z Q, f

f(Q[Z])

CED

CSA

Xm

Q

f(δT[X])

SED

Wm

f

Q[W ]

T

Q, f

f(Q[Z])FED C

A

S

Z1

Q, f

f(Q[Z])

Dm

D1

Zm

..

. … DmD1ZmZ1

X1

W1

A S

QW1 Wm

Q[W ]

C

W Q

K"

Q[W ]map

SED

DmD1

K#

merge-map

(a)

SEDCED

map &merge-map

‣ OPRF-based‣ SFE-based

sFEDSection 3.1

(see Figure 2)

FED

Section 4 (see Figure 3)

Section 4.4 Section 4.2(see Figure 4)

sCED

‣ Value Retrieval‣ Summation‣ Summation (alt)‣ General

Section 3.2 sSED

Section 3.3

Section 4.3

OnionSecret-Sharing

Section 4.1

A S

QW1 Wm

Q[W ]

C

W Q

K"

Q[W ]map

sSED

DmD1

K#

merge-map

(b)

Fig. 4 (a) FED protocol template. Each D splits its input data Zi as Wi and Xi, andinputs to SED and CED respectively. W denotes the combined data-set

⋃iWi.

(b) SED protocol template using three functionalities sSED, merge-map and map. Al-though not shown, all three functionalities used may specify leakage to S and A. Inaccessing sSED, A plays the role of both the (single data-owner) D and A.

Protocol FED-templSED,CED: Each data-owner Di maps its input Zi to a pair(Wi, Xi), where Wi ⊆ W×I and Xi ⊆ I×X such that (w, x) 7→ ((w, id), (id, x))where id is randomly drawn from (a sufficiently large set) I.4 After that, all theparties proceed exactly as in sFED-templ, but with the parties accessing SEDand CED instead of sSED and sCED, and with each data-owner using (Wi, Xi)as its input.

Leakage LFED-templ: The leakage is similar to LsFED-templ defined earlier, butwith leakage from sSED and sCED schemes replaced by those from SED andCED. Specifically, on a client query (Q, f), the leakage consists of the set (orequivalently, the pattern information of) T = Q[W ] to S, where W =

⋃iWi;

also the leakages from SED and CED are provided to S and A.

As before, we can state the security guarantee of this construction:

Theorem 5. Protocol FED-templSED,CED, when instantiated using a (Q,LSED)-SED scheme and an (F ,LCED)-CED scheme, is a (Q,F ,LFED-templ)-FED scheme.

The main challenge then is in implementing SED and CED schemes, forreasonable leakage functions. Below, we present our protocols for realizing thesefunctionalities.

4 I should be large enough so that we may assume that each (honest) data-owner willuse a unique id for each record (w, x), disjoint from the set of IDs used by the others,except with negligible probability.

17

Page 18: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

5.1 Onion Secret-Sharing

In going from single data-owner schemes to multi data-owner schemes, we seekto make the collection of data-owners behave like a single entity (without inter-acting with each other), so that they can communicate their collective data tothe two servers in the form in which the underlying single data-owner schemecommunicates it. Note that from this collective data, neither server should beable to link the records that are selected by a search query back to the indi-vidual data owners from whom it originates. In principle, this problem can besolved generically using secure multi-party computation techniques. However,for efficiency reasons, we develop a suite of techniques under the name of onionsecret-sharing, that carefully combines secret-sharing and public-key encryptionto achieve this.

Onion secret-sharing is a non-trivial generalization of the traditional mix-nets [18]. In a mix-net, a set of senders D1, · · · ,Dm want to send their messagesM1, · · · ,Mm to a server S, with the help of an auxiliary server A (who doesnot collude with S), so that neither S nor A learns the association between themessages and the senders (except for the senders whom they collude with). Thisis easily achieved as follows: each Di sends JMiKPKS

to A where PKS is thepublic-key of S for a semantically secure public-key encryption scheme, and thenotation JMKPK denotes encryption of M using a public-key PK. A collects allsuch ciphertexts, sorts them lexicographically (or randomly permutes them), andforwards them to S; S decrypts them to obtain the multiset {M1, · · · ,Mm}.

Now consider the following task. Each sender Di wants to share its messageMi between two servers S and A; that is, it sets Mi = σi ⊕ ρi, and wants tosend σi to S and ρi to A. While the senders want their messages to get randomlypermuted, the association between σi and ρi needs to be retained. Onion secret-sharing provides a solution to this problem, as follows:

Each Di sends J(ρi, ζi)KPKSwhere ζi is of the form JσiKPKA

. A mixes theseciphertexts and forwards them to S, who decrypts them to recover pairs of theform (ρi, ζi). Now, S reshuffles (or sorts) these pairs, stores ρi and sends ζi (inthe new order); A recovers σi from ζi (in the same order as ρi are maintainedby S).

This can be taken further to incorporate additional functionality. As an ex-ample of relevance to us, suppose A wants to add a short, private tag to themessages being secret-shared so that the tag persists even after random permu-tation. Among the messages which were assigned the same tag, A should notbe able to link the shares it receives after the permutation to the ones it orig-inally received; S should obtain no information about the tags. One solution isfor A to add encrypted tags to the data items, and then while permuting thedata items, S would rerandomize the ciphertexts holding the tags. We presentan alternate approach, which does not require additional functionality from thepublic-key encryption scheme, but instead augments onion secret-sharing withextra functionality:

– Di creates a 3-way additive secret sharing of 0 (the all 0’s string), as αi⊕βi⊕γi = 0, and sends (αi, Jβi, ρi, Jγi, σiKPKA

KPKS) to A.

18

Page 19: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

– A assigns tags τi for each of them, and sends (in sorted order)(τi ⊕ αi, Jβi, ρi, Jγi, σiKPKA

KPKS) to S.

– S sends (τi ⊕ αi ⊕ βi, Jγi, σiKPKA) to A, in sorted order; it stores ρi (in the

same sorted order).– A recovers (τi ⊕ αi ⊕ βi ⊕ γi, σi) = (τi, σi).

This allows S and A to receive all the shares (in the same permuted order); Slearns nothing about the tags; A cannot associate which shares originated fromwhich Di, except for what is revealed by the tag. (Even if S or A collude withsome Di, the unlinkability is retained for the remaining Di.)

In Section 5.3, we use a variant of the above scheme to let A tag entries invarious lists with a pseudonym (multiple lists may get the same pseudonym),before the lists are unpacked and shuffled again, destroying the linkage to thelists, but retaining the tags. 5

5.2 Protocol Template for SED

We describe a general protocol template to realize the SED functionality, usingaccess to the sSED functionality. The high-level plan is to let A create a mergeddatabase so that it can play the role of D for sSED. However, since we requireprivacy against A, the merged database should appear shorn of all information(except statistics that we are willing to leak). Hence, during the initializationphase, we not only merge the databases, but also replace the keywords withpseudonyms and keep other associated data encrypted. We use pseudonyms forkeywords (rather than encryptions) to support queries: During the query phase,the actual keywords will be mapped to these pseudonyms and revealed to A.These two tasks at the initialization and query phases are formulated as a pairof sub-functionalities — merge-map and map— collectively referred to as thefunctionality mmap, as described in Figure 5.

The protocol SED-templsSED,mmap is shown in Figure 4b. In this protocol,sSED is invoked with A playing the role of the data-owner as well. The invocationof merge-map is part of the initialization phase and that of map is part of thequery phase. As shown in the figure, S uses DKS

(decoding function) to compute

its output Q[W ] from Q(W ).Note that A does not store any additional information between the two phases

(other than what the implementation of sSED requires).

Leakage: Since this protocol delivers the merge-mapped data W to A, it leakscertain statistical information about the merged data W (not individual datasetsWi) to A. The exact nature of the leakage depends on the mapping-function M .6

5 For simplicity, here we consider the tagging to be arbitrary, whereas in Section 5.3,it is done based on equality checks. Here we allow A to add tags while it knows thelink between data-items and Di; in our application, this link is broken by an extraround of mixing.

6 Such leakage could be avoided by relying on secure two-party computation of acertain function between A and S during initialization, but with high communicationcosts.

19

Page 20: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

Functionality Pair mmap = (merge-map, map)

– Functionality merge-map takes as inputs Wi from each Di; it generates a pair “map-ping key” Kmap = (KS,KA) (independent of its inputs), and creates a “merged-and-mapped” input set, W . Merging simply refers to computing the (multi-set)union W =

⋃iWi; mapping computes the multi-set W = {(w, id)|(w, id) ∈ W, w =

MKmap(w), and id = MKmap(id)}, where the mapping function M is to be specified

when instantiating this template.a It outputs KS to S and (W ,KA) to A. (Since KA

will be stored by A, we require that it be short, independent of the data size.)– Functionality map takes KS and KA as inputs from S and A respectively, and a queryQ from a client C; then it outputs a new query Q to C. We shall require that thereis a decoding function D such that Q[W ] = {DKS(id)|id ∈ Q(W )}, where W is asdescribed above.

These functionalities may specify leakages to S and/or A, but not to the data-ownersor clients. Note that since map gives an output Q to C, we shall require that it can besimulated from Q.

a For notational simplicity, we have specified the mapping using a single function MoverW∪I. Typically, this function acts differently onW and I, using different partsof the key Kmap.

Fig. 5 Functionality mmap used by the protocol SED-templ (See Figure 4b for anillustration).

In addition, leakages from merge-map, map and sSED (the latter on the mergeddataset) will be included in the leakage of this protocol.

5.3 Instantiating SED Protocol Template

To instantiate the above template, we need protocols for merge-map and map, inaddition to a protocol for the single data-owner sSED functionality. We considerthe family of keyword search queries, Q∗K (see Section 2). An implementation ofsSED for Q∗K was already presented in Section 4.5. Below we present two im-plementations of mmap, for Q∗K, called mmap-OPRF and mmap-SFE, with differentefficiency-security trade-offs.

Recall that in this setting the searchable attribute for each record is a setof keywords, w ⊆ K. For concreteness, we shall require that every keyword isrepresented using a bit string of the same length, so that K ⊆ {0, 1}λ. Eachdata-owner Di has an input Wi ⊆ W × I. In both the solutions, Di shall usea representation of Wi as a set Wi ⊆ K × I such that (w, id) ∈ Wi iff w =

{τ |(τ, id) ∈ Wi}. In the two solutions below, M maps the keywords differently;

but in both the solutions, an identity id that occurs in Wi is mapped to ζ idi ←JidKPKS

(an encryption of id under S’s public-key). In one scheme, we use anoblivious pseudorandom function (OPRF) protocol to calculate M , while theother relies on a secure function evaluation for equality. While an OPRF protocolis more complex than secure equality evaluation, access to OPRF allows usto construct a simpler protocol. On the other hand, suitably efficient OPRF

20

Page 21: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

protocols are known only in the heuristic Random Oracle model [37], whereasefficient secure evaluation of equality is possible in the standard model.

Construction mmap-OPRF. The pair of protocols for the functionalities merge-mapand map, collectively called mmap-OPRF, is shown in Figure 6. In this solution,the mapping key KA is empty, and KS consists of a PRF key K, in addition to asecret-key for a PKE scheme, SKS. M maps each keyword τ using a pseudoran-dom function with the PRF key K. This is implemented using an oblivious PRFexecution with S, in both merge-map and map protocols. That is, for w ⊆ K wehave MKmap(w) = {FK(τ)|τ ∈ w}, where F is a PRF.

Note that mmap-OPRF uses two a priori bounds (or, includes such bounds

as part of leakage) for each data-owner, Ni and Li such that |Wi| ≤ Ni and

|Ki| ≤ Li, where Ki = {τ |∃id s.t. (τ, id) ∈ Wi} is the set of keywords in Di’s

data (recall that Wi is a representation of Di’s data as keyword-identity pairs).

Protocol mmap-OPRF

merge-map:

– Keys. S generates a keypair for a CPA-secure PKE scheme, (PKS, SKS), and similarlyA generates (PKA, SKA). The keys PKS and PKA are published. S also generates aPRF key K.

– Oblivious Mapping. Each data-owner, for each τ ∈ Ki, Di engages in an ObliviousPRF (OPRF) evaluation protocol with S, in which Di inputs τ , S inputs K and Direceives as output τ := FK(τ). Di carries out Li− |Ki| more OPRF executions (withan arbitrary τ ∈ Ki as its input) with S.

– Shuffling. Each data-owner Di computes ζ idi ← JidKPKS , for each id such that there

is a pair of the form (τ, id) ∈ Wi. All the data-owners use the help of S to route theset of all pairs (τ , ζ idi ) to A, as follows:• Each Di, for each (τ, id) computes ξidi,τ ← J(τ , ζ idi )KPKA , and sends it to S. Further

Di computes Ni − |Wi| more ciphertexts of the form J⊥KPKA .• S collects all such ciphertexts (Ni of them from Di, for all i), and lexicographically

sorts them (or randomly permutes them). The resulting list is sent to A.• A decrypts each item in the received list, and discards elements of the form ⊥ to

obtain a set consisting of pairs of the form (τ , id), where id = ζ idi . W is defined asthe set of pairs of the form (w, id) where w contains all τ values for each id.

– Outputs. A’s outputs are W and an empty KA; S’s output is KS = (SKS,K).

map: A client C has an input Q which is specified using one or more keywords. Foreach keyword τ appearing in Q, C engages in an OPRF execution with S, who inputsthe key K that is part of KS. The resulting query is output as τ .

Fig. 6 Protocol mmap-OPRF implementing functionality mmap.

We point out the need for OPRF evaluations. If we allowed S to simply givethe keyK to every data-owner to carry out the PRF evaluations locally, then, if Acolludes with even one data-owner Di, it will be able to learn the actual keywordsin the entire data-set. As described below, using an OPRF considerably improveson this.

21

Page 22: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

Leakage, LOPRFmmap : S learns the bounds Ni and Li for each i, and A learns

∑iNi;

from the map phase, S learns the number of keywords in the query.We also include in LOPRF

mmap what the output W to A reveals (being an output,this is not “leakage” for mmap-OPRF; but it manifests as leakage in the SED-templprotocol that uses this functionality). W reveals an anonymized version of thekeyword-identity incidence matrix of the merged data,7where the actual labelsof the rows and columns (i.e., the keywords and the ids) are absent. If A colludeswith some data-owners, it learns the actual keyword labels of all the correspond-ing keywords held by the colluding data-owners.

Protocol OPRF: Since we require security against actively corrupt clients, we needa UC secure protocol for OPRF that is secure against active corruption of thereceiver (and passive corruption of the party with the key). The OPRF protocolof [37], which is secure in the random oracle model, based on the one-more Diffie-Hellman assumption, meets our requirements (see Appendix E.1). We then havethe following theorem.

Theorem 6. Protocol SED-templsSED,mmap, instantiated using a (Q∗K,LsSED)-sSED scheme and the protocol mmap-OPRF, where the latter is in turn instantiatedusing protocol OPRF, is a (Q∗K, (LsSED,LOPRF

mmap ))-SED scheme in the random or-acle model, under the one-more Diffie-Hellman assumption, and assuming thesecurity of the PRF and PKE schemes used.

Please see Appendix E.5 for the proof.

Construction mmap-SFE. Our next protocol avoids OPRF evaluations. The ideahere is to allow each data owner to send secret-shared keywords between S andA, and rely on secure function evaluation (SFE) to associate the keywords withpseudonymous handles, so that the same keyword (from different data owners)gets assigned the same handle (but beyond that, the handles are uninformative).Further, neither server should be able to link the keyword shares with the dataowner from which it originated, necessitating a shuffle of the outputs. Due tothe complexity of the above task, a standard application of SFE will be veryexpensive. Instead, we present a much more efficient protocol that relies onsecure evaluation only for equality checks, with the more complex computationscarried out locally by the servers; as a trade-off, we shall incur leakage similar tothat of the OPRF-based protocol above. Below, we sketch the ideas behind theprotocol, with the formal description in Appendix E.2, Figure 11 and Figure 12.

Consider two data owners who have shared keywords τ1 and τ2 among A andS, so that A has (α1, α2) and S has (β1, β2), where τ1 = α1⊕β1 and τ2 = α2⊕β2.Note that τ1 = τ2 ⇔ α1 ⊕ β1 = α2 ⊕ β2 ⇔ α1 ⊕α2 = β1 ⊕ β2. So, for A to checkif τ1 = τ2, A and S can locally compute α1 ⊕ α2 and β1 ⊕ β2 and use a secureevaluation of the equality function to compare them (with only A learning theresult). By comparing all pairs of keywords in this manner, A can identify items

7 This is a 0-1 matrix with 1 in the position indexed by (τ, id) iff (τ, id) ∈⋃i Wi. The

rows and columns may be considered to be ordered randomly.

22

Page 23: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

with identical keywords and assign them pseudonyms (e.g., small integers), whileholding only one share of the keyword.

We note that using securely pre-computed correlations, secure evaluation ofthe equality function can be carried out very efficiently, simply by exchanginga pair of values in a sufficiently large field. Please refer to Appendix E.2 for adescription of the protocol.

The data shared by the data owners to the servers includes not just keywords,but also the records (document identifiers) associated with each keyword. Tominimize the number of comparisons needed, each data owner packs all therecords corresponding to a keyword τ into a single list that is secret-sharedalong with the keyword (rather than share separate (τ, id) pairs). However, afterhandles have been assigned to the keywords, such lists should be unpacked andshuffled to erase the link between multiple records that came from a single list(and hence the same data owner). Each entry in each list can be tagged by Ausing the keyword handle assigned to the list, but then the entries need to beshuffled while retaining the tag itself. How this can be accomplished using theonion secret-sharing technique was already discussed in Section 5.1.

The full protocol involves a few other details: the initial secret sharing of thekeywords are delivered to the two servers using onion secret-sharing; the numberof entries in individual lists for each keyword, and the total number of lists fromeach data owner are padded up; the dummy entries in a list are culled jointly bythe two servers. We refer the reader to the detailed description of the protocolin Figure 11 and Figure 12.

We note that this construction uses O((∑i Li)

2) secure equality checks,where Li is an upper bound on the total number of keywords in Di’s data.In comparison, the previous construction needed O(

∑i Li) OPRF evaluations.

Even though secure equality checks have a low online cost, when the number ofkeywords involved is large, the quadratic complexity can offset this advantage.Hence, in Appendix E.3, we provide a mechanism to improve the efficiency ofthis construction, using a partition of the keyword space into bins.

Leakage, LSFEmmap : The leakage includes LOPRF

mmap above, with the following addi-tions: During the merge-map phase S learns an upper bound L ≥ |

⋃iKi|; from

the map phase, A learns the pattern of the keywords in a query. In Appendix E.4,we describe how the leakage may be further reduced.

We have the following theorem.

Theorem 7. Protocol SED-templsSED,mmap, instantiated using a (Q∗K,LsSED)-sSED scheme and the protocol mmap-SFE, is a (Q∗K, (LsSED,LSFE

mmap))-SED schemein the standard model, assuming the security of the PRF, PKE and SFE schemesused.

Please see Appendix E.5 for the proof.

5.4 CED Protocols

Upgrading an sCED protocol to a CED protocol is simpler than the sSED toSED transformation: Since there are no indices to be computed as part of a

23

Page 24: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

sCED protocol, we do not need to reconstruct a merge-mapped dataset. Indeed,instead of using sCED functionality as a blackbox, we can directly modify theinitialization phase of the sCED protocol. Since the sCED protocols tend to keepall the data records secret-shared between S and A, we need only have the dataowners route their data shares to the two servers using onion secret-sharing. Weshow how the sCED schemes in Section 4.4 are modified in this manner.

Value Retrieval and Summation: Below we describe how the sCED schemein Section 4.4 for Value Retrieval is adapted into a CED scheme. Adaptation ofthe Summation scheme is similar.

In the sCED scheme for Value Retrieval, every value xid was secret-sharedbetween A and S as xid = αid ⊕ βid, where αid = FK(id), with K being a PRFkey that the data-owner and A shared (and S did not have access to). However,this is not viable in a multi data-owner setting, because if S colludes with anyone data-owner it will learn this key; alternatively, if each data-owner uses adifferent key, this would let A collect the different data-items as coming fromdifferent data-owners, revealing additional statistics about that data as well asabout the answer to queries, beyond what a merged data-set reveals. To preventsuch leakage, we delink the key to the data owner and link it to each record,i.e. we generate a new key for each record (xid) that Di sends. We present themodified protocol below.

Protocol ValRet-mod

– Modified Initialization Phase:1. S generates key-pairs (PKS, SKS) for a CPA-secure PKE scheme, and publishes

the public-keys.2. For each (id, x) ∈ Xi, Di picks a PRF key Kid and calculates βid = FKid(id) ⊕

xid (where F is a PRF with appropriate input and output lengths). Di sends{JδidKPKA}id to S, where δid = (JKidKPKS , (id, βid)). (This is possibly padded withrandom entries, where δid = ⊥).

3. S collects all the data from different data owners, sorts them lexicographically andsends them to A.

4. A receives the data, decrypts it to calculate {JKidKPKS , (id, βid)}id. It then picksa PRF key K. Let γid = βid ⊕ FK(id). It sends {(JKidKPKS , (id, γid))}id to S.

5. S, for each id, decrypts to calculate Kid. It defines θid ← γid ⊕ FKid(id). (Notethat θid = xid ⊕ FK(id)).

6. S stores {(id, θid)}id while A stores K.

The computation phase proceeds exactly as before.

The modified initialization phase leaks (an upper bound on) |Xi| for eachdata-owner to S and the combined ID-set J to both A and S. (We remark that,in the FED protocol which uses this functionality, the IDs are chosen randomlyand hence leaking the ID-set J only amounts to leaking the size |X| of thecombined data set.) Let L′ValRet be the leakage, LValRet of the Value RetrievalsCED protocol, changed according to the modified initialization phase as statedabove. Then we have the following theorem.

24

Page 25: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

Theorem 8. Protocol ValRet-mod is a (FValRet,L′ValRet)-CED scheme, assum-ing the security of the PRF and PKE schemes used.

General Functions: The sCED scheme for general functions (discussed inSection 4.4) can be easily adapted to the multi data-owner setting. We defer thedetails to Appendix D.3.

6 Implementation and Experimental Results

In this section, we report details of our implementation and the experimentswe performed to demonstrate the scalability and applicability of our protocols.Please see Appendix F.1 for details about our implementation.

To test the feasibility of using our sFED and FED schemes for realisticallylarge-scale computations, we measured their performance on tasks representativeof Genome Wide Association Studies (GWAS). The specific functions we choosewere adapted from the iDash competition [34,41] for secure outsourcing of ge-nomic data. Each record in a GWAS database corresponds to an individual, withfields corresponding to demographic and phenotypic attributes (like sex, race,diseases, etc.), as well as genetic attributes. The genetic attributes of interest– called Single Nucleotide Polymorphisms (SNPs) – describe, for each of sev-eral loci (i.e., positions) in the genome, the “genotype” at that locus (typically,one of a small number of possibilities). In a typical query, the demographic andphenotypic attributes are used for filtering (e.g., South Asian males with type2 diabetes [42,13]), and statistics are computed over the genetic information. Inour experiments, the following statistics were computed.

– Minor Allele Frequency (MAF): In what is called a biallelic setting, each locusof interest has one of three genotypes – say AA, BB or AB (corresponding totwo “alleles” A and B, and two chromosomes). Given a set of N individuals(selected using a search filter) and a locus of interest, the total allele countsare defined as nA = 2nAA + nAB and nB = 2nBB + nAB , where nAA, nBBand nAB are the number of individuals of the three genotypes. Then, MAFfor that locus is defined [41] as the real number

min(nA, nB)

nA + nB=min(nA, 2N − nA)

2N.

– Chi-square analysis (ChiSq): To quantify the association between an alleleand a disease, often a chi-square (χ2) statistic defined below is employed [20].Two groups of individuals – case and control groups – are selected using twosearch filters Q and Q′ (with and without a disease). Defining (nA, nB) and(n′A, n

′B) as above, but for Q and Q′ respectively, the χ2 test statistic [41,1]

can be formulated as

χ2 =(N +N ′)(nAN

′ − n′AN)2

NN ′(nA + n′A)((N +N ′)− (nA + n′A))

where N = (nA + nB) and N ′ = (n′A + n′B).

25

Page 26: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

– Average Hamming Distance (Hamming): The Hamming distance between 2genome sequences is often used as a metric for genome comparison [71,55].Here, we use the entire genotype vector x for each individual (rather than the

genotype at a given locus). The function fx∗(Q[Z]) =∑x∈Q[Z]∆(x,x∗)

|Q[Z]| , where

∆ stands for the hamming distance of two strings.– Genome retrieval (GenomeRet): This is a retrieval task where we retrieve the

genome data at some locus for the set of individuals selected by the searchfilter.

20000 60000 1000000

200

400

600

800

Com

m (M

B)

Init Communication Size

3000 6000 9000 12000

2

4

6

8

10Query Communication Size

20000 60000 100000#(input records)

0

100

200

300

400

Tim

e (s

ec)

Init Gross Computation Time

3000 6000 9000 12000#(filtered records)

2

4

6

8

10

Query Gross Computation Time

sFEDsSEDFEDOPRFSEDOPRFFEDSFESEDSFE

(a)

20000 60000 1000000

200

400

600

800

1000

Com

m (M

B)

Init Communication Size

3000 6000 9000 12000

2

4

6

8

Query Communication Size

20000 60000 100000#(input records)

0

100

200

300

400

Tim

e (s

ec)

Init Gross Computation Time

3000 6000 9000 12000#(filtered records)

2

4

6

8

10

Query Gross Computation Time

sFEDsSEDFEDOPRFSEDOPRFFEDSFESEDSFE

(b)

20000 60000 1000000

200

400

600

800

1000

1200

1400

Com

m (M

B)

Init Communication Size

3000 6000 9000 12000

2

4

6

8

10

12

14

16Query Communication Size

20000 60000 100000#(input records)

0

100

200

300

400

500

Tim

e (s

ec)

Init Gross Computation Time

3000 6000 9000 12000#(filtered records)

5

10

15

20

Query Gross Computation Time

sFEDsSEDFEDOPRFSEDOPRFFEDSFESEDSFE

(c)

5000 15000 25000

0

500

1000

1500

2000

2500

3000

Com

m (M

B)

Init Communication Size

800 1600 2400 3200

0

1000

2000

3000

4000

5000

Query Communication Size

5000 15000 25000#(input records)

0

20

40

60

80

100

Tim

e (s

ec)

Init Gross Computation Time

800 1600 2400 3200#(filtered records)

0

5

10

15

20

25

30Query Gross Computation Time

sFEDsSEDFEDOPRFSEDOPRFFEDSFESEDSFE

(d)

Fig. 7 Total communication cost and computation time in the initialization and queryphases for single data owner and multiple data owner setting (latter using bothmmap-OPRF and mmap-SFE), for the following applications: (a) Genome Retrieval, (b)MAF, (c) Chi-square analysis, (d) Average Hamming.

Results. We performed experiments in both the single (sFED) and multiple dataowner (FED) setting. Please see Appendix F.2 for details about our experiments.Our protocols turn out to be quite practical, even for fairly large databases. Ageneral feature of all our protocols is that most heavy computation is performedat the two servers while the client is very efficient: In all the experiments shownhere, the client’s computation time never exceeds 3 milliseconds.

The total costs (across all parties) for various applications are summarizedin Figure 7. The initialization costs are plotted against the total number of

26

Page 27: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

input records in the initial database, while the query costs are plotted againstthe number of filtered records. In each case, we plot the cost of the full sFEDprotocol, as well as its sSED component (dashed lines), as the latter is a proxyfor searchable encryption costs from the literature and serves as a baseline. In theinitialization phase, the sFED protocols require only a small constant overheadover the sSED component, whereas in the query phase the overhead is mostlynegligible. An exception to the latter trend is the case of the Average Hammingfunctionality (Figure 7 (d)) where a large function is computed using a garbledcircuit (the query complexity is still dependent on the filtered records; not onthe entire dataset). Please see Appendix F.1 for more details.

References

1. Baker, J.: Comparing two populations. http://grows.ups.edu/curriculum/

Comparing%20Two%20Population1.htm

2. Bellare, M., Hoang, V.T., Rogaway, P.: Foundations of Garbled Circuits. In: CCS(2012)

3. Bellare, M., Hoang, V.T., Keelveedhi, S., Rogaway, P.: Efficient garbling from afixed-key blockcipher. Cryptology ePrint Archive, Report 2013/426 (2013)

4. Ben-Or, M., Goldwasser, S., Kilian, J., Wigderson, A.: Multi-prover interactiveproofs: How to remove intractability assumptions. In: STOC (1988)

5. Bethencourt, J.: Paillier library (2006), http://acsc.cs.utexas.edu/

libpaillier/

6. Boneh, D., Sahai, A., Waters, B.: Functional encryption: Definitions and challenges.In: TCC. pp. 253–273 (2011)

7. Bost, R.: Σoϕoς: Forward secure searchable encryption. In: Proceedings of the2016 ACM SIGSAC Conference on Computer and Communications Security. pp.1143–1154. ACM (2016)

8. Brakerski, Z., Vaikuntanathan, V.: Efficient fully homomorphic encryption from(standard) LWE. In: FOCS (2011)

9. Bush, W.S., Moore, J.H.: Genome-wide association studies. PLoS computationalbiology 8(12), e1002822 (2012)

10. Canetti, R.: Universally composable security: A new paradigm for cryptographicprotocols. In: FOCS (2001)

11. Cash, D., Jaeger, J., Jarecki, S., Jutla, C.S., Krawczyk, H., Rosu, M., Steiner, M.:Dynamic searchable encryption in very-large databases: Data structures and imple-mentation. In: 21st Annual Network and Distributed System Security Symposium,NDSS (2014)

12. Cash, D., Jarecki, S., Jutla, C.S., Krawczyk, H., Rosu, M., Steiner, M.: Highly-scalable searchable symmetric encryption with support for boolean queries. In:CRYPTO (2013)

13. Chambers, J.C., Abbott, J., Zhang, W., Turro, E., Scott, W.R., Tan, S.T., Afzal,U., Afaq, S., Loh, M., Lehne, B., et al.: The south asian genome. PLoS One 9(8),e102645 (2014)

14. Chase, M., Kamara, S.: Structured encryption and controlled disclosure. In: Inter-national Conference on the Theory and Application of Cryptology and InformationSecurity. pp. 577–594. Springer (2010)

15. Chase, M., Kamara, S.: Structured encryption and controlled disclosure. In: ASI-ACRYPT (2010)

27

Page 28: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

16. Chase, M., Shen, E.: Substring-searchable symmetric encryption. Proceedings onPrivacy Enhancing Technologies 2015(2), 263–281 (2015)

17. Chase, M., Shen, E.: Substring-searchable symmetric encryption. PoPETs 2015(2015)

18. Chaum, D.L.: Untraceable electronic mail, return addresses, and digitalpseudonyms. Commun. ACM 24(2), 84–90 (Feb 1981)

19. Chor, B., Goldreich, O., Kushilevitz, E., Sudan, M.: Private information retrieval.In: FOCS (1995)

20. Clarke, G.M., Anderson, C.A., Pettersson, F.H., Cardon, L.R., Morris, A.P., Zon-dervan, K.T.: Basic statistical analysis in genetic case-control studies. Nature pro-tocols 6(2), 121 (2011)

21. Curtmola, R., Garay, J., Kamara, S., Ostrovsky, R.: Searchable symmetric encryp-tion: improved definitions and efficient constructions. In: ACM CCS. pp. 79–88(2006)

22. Demertzis, I., Papadopoulos, S., Papapetrou, O., Deligiannakis, A., Garofalakis,M.: Practical private range search revisited. In: ACM SIGMOD (2016)

23. Drazen, W.A., Ekwedike, E., Gennaro, R.: Highly scalable verifiable encryptedsearch. In: 2015 IEEE Conference on Communications and Network Security, CNS.pp. 497–505 (2015)

24. Fisch, B.A., Vo, B., Krell, F., Kumarasubramanian, A., Kolesnikov, V., Malkin, T.,Bellovin, S.M.: Malicious-client security in blind seer: a scalable private dbms. In:Security and Privacy (SP), 2015 IEEE Symposium on. pp. 395–410. IEEE (2015)

25. Gentry, C.: Fully homomorphic encryption using ideal lattices. In: STOC. pp. 169–178 (2009)

26. Goldreich, O., Micali, S., Wigderson, A.: How to play any mental game. In: STOC(1987)

27. Goldwasser, S., Gordon, S.D., Goyal, V., Jain, A., Katz, J., Liu, F., Sahai, A., Shi,E., Zhou, H.: Multi-input functional encryption. In: EUROCRYPT. pp. 578–602(2014)

28. Grubbs, P., Ristenpart, T., Shmatikov, V.: Why your encrypted database is notsecure. In: Proceedings of the 16th Workshop on Hot Topics in Operating Systems.pp. 162–168. ACM (2017)

29. Halevi, S., Lindell, Y., Pinkas, B.: Secure computation on the web: Computingwithout simultaneous interaction. In: Advances in Cryptology - CRYPTO 2011,Proceedings. pp. 132–150 (2011)

30. Hamlin, A., Shelat, A., Weiss, M., Wichs, D.: Multi-key searchable encryption,revisited. In: IACR International Workshop on Public Key Cryptography. pp. 95–124. Springer (2018)

31. van der Harst, P., Verweij, N.: The identification of 64 novel genetic loci provides anexpanded view on the genetic architecture of coronary artery disease. Circulationresearch pp. CIRCRESAHA–117 (2017)

32. Havasupai tribe and the lawsuit settlement aftermath. http://genetics.ncai.

org/case-study/havasupai-Tribe.cfm

33. Hirokawa, M., Morita, H., Tajima, T., Takahashi, A., Ashikawa, K., Miya, F.,Shigemizu, D., Ozaki, K., Sakata, Y., Nakatani, D., et al.: A genome-wide asso-ciation study identifies plcl2 and ap3d1-dot1l-sf3a2 as new susceptibility loci formyocardial infarction in japanese. European Journal of Human Genetics 23(3),374 (2015)

34. IDASH PRIVACY & SECURITY WORKSHOP 2015. http://www.

humangenomeprivacy.org/2015/competition-tasks.html (2015)

28

Page 29: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

35. Indian tribe wins fight to limit research of its dna. https://www.nytimes.com/

2010/04/22/us/22dna.html?pagewanted=all&_r=1&36. Jarecki, S., Jutla, C.S., Krawczyk, H., Rosu, M., Steiner, M.: Outsourced symmetric

private information retrieval. In: ACM CCS (2013)37. Jarecki, S., Kiayias, A., Krawczyk, H., Xu, J.: Highly-efficient and composable

password-protected secret sharing (or: How to protect your bitcoin wallet online).In: 2016 IEEE European Symposium on Security and Privacy (EuroS&P). pp.276–291. IEEE (2016)

38. Kamara, S., Moataz, T.: Sql on structurally-encrypted databases. In: InternationalConference on the Theory and Application of Cryptology and Information Security.pp. 149–180. Springer (2018)

39. Kamara, S., Papamanthou, C., Roeder, T.: Dynamic searchable symmetric encryp-tion. In: Proceedings of the 2012 ACM conference on Computer and communica-tions security. pp. 965–976. ACM (2012)

40. Kerbaj, R., Ungoed-Thomas, J.: http://www.thesundaytimes.co.uk/sto/news/uk_news/National/article1258476.ece?CMP=OTH-gnws-standard-2013_05_11

41. Kim, M., Lauter, K.: Private genome analysis through homomorphic encryption.BMC medical informatics and decision making 15(5), S3 (2015)

42. Kooner, J.S., Saleheen, D., Sim, X., Sehmi, J., Zhang, W., Frossard, P., Been, L.F.,Chia, K.S., Dimas, A.S., Hassanali, N., et al.: Genome-wide association study inpeople of south asian ancestry identifies six novel susceptibility loci for type 2diabetes. Nature genetics 43(10), 984 (2011)

43. Kurosawa, K., Ohtaki, Y.: Uc-secure searchable symmetric encryption. In: Finan-cial Cryptography and Data Security, pp. 285–298. Springer (2012)

44. Li, R., Liu, A.X., Wang, A.L., Bruhadeshwar, B.: Fast range query processing withstrong privacy protection for cloud computing. Proc. VLDB Endow. 7(14) (2014)

45. Lopez-Alt, A., Tromer, E., Vaikuntanathan, V.: On-the-fly multiparty computationon the cloud via multikey fully homomorphic encryption. In: STOC. pp. 1219–1234(2012)

46. Low, S.K., Takahashi, A., Mushiroda, T., Kubo, M.: Genome-wide associationstudy: a useful tool to identify common genetic variants associated with drug tox-icity and efficacy in cancer pharmacogenomics (2014)

47. Lu, S., Ostrovsky, R.: Distributed oblivious ram for secure two-party computation.In: Theory of Cryptography Conference. pp. 377–396. Springer (2013)

48. MacArthur, J., Bowler, E., Cerezo, M., Gil, L., Hall, P., Hastings, E., Junkins,H., McMahon, A., Milano, A., Morales, J., et al.: The new nhgri-ebi catalog ofpublished genome-wide association studies (gwas catalog). Nucleic acids research45(D1), D896–D901 (2016)

49. Naor, M., Reingold, O.: Number-theoretic constructions of efficient pseudo-randomfunctions. J. ACM (2004)

50. Naveed, M., Agrawal, S., Prabhakaran, M., Wang, X., Ayday, E., Hubaux, J.P.,Gunter, C.: Controlled functional encryption. In: CCS (2014)

51. Naveed, M., Prabhakaran, M., Gunter, C.A.: Dynamic searchable encryption viablind storage. In: Security and Privacy (SP), 2014 IEEE Symposium on. pp. 639–654. IEEE (2014)

52. Orencik, C., Selcuk, A., Savas, E., Kantarcioglu, M.: Multi-keyword search overencrypted data with scoring and search pattern obfuscation. International Journalof Information Security 15(3), 251–269 (2016)

53. Papadimitriou, A., Bhagwan, R., Chandran, N., Ramjee, R., Haeberlen, A., Singh,H., Modi, A., Badrinarayanan, S.: Big data analytics over encrypted datasets withseabed. In: OSDI. pp. 587–602 (2016)

29

Page 30: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

54. Pappas, V., Krell, F., Vo, B., Kolesnikov, V., Malkin, T., Choi, S.G., George, W.,Keromytis, A., Bellovin, S.: Blind seer: A scalable private dbms. In: Security andPrivacy (SP), 2014 IEEE Symposium on. pp. 359–374. IEEE (2014)

55. Pilcher, C.D., Wong, J.K., Pillai, S.K.: Inferring hiv transmission dynamics fromphylogenetic sequence relationships. PLoS medicine 5(3), e69 (2008)

56. Poddar, R., Boelter, T., Popa, R.A.: Arx: A strongly encrypted database system.IACR Cryptology ePrint Archive 2016, 591 (2016)

57. Poh, G.S., Mohamad, M.S., Chin, J.J.: Searchable symmetric encryption over mul-tiple servers. Cryptography and Communications 10(1), 139–158 (2018)

58. Popa, R.A., Redfield, C., Zeldovich, N., Balakrishnan, H.: Cryptdb: protectingconfidentiality with encrypted query processing. In: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles. pp. 85–100. ACM (2011)

59. Popa, R.A., Stark, E., Valdez, S., Helfer, J., Zeldovich, N., Balakrishnan, H.: Build-ing web applications on top of encrypted data using mylar. In: 11th USENIX Sym-posium on Networked Systems Design and Implementation (NSDI 14). pp. 157–172(2014)

60. Popa, R.A., Zeldovich, N.: Multi-key searchable encryption. Cryptology ePrintArchive, Report 2013/508 (2013), https://eprint.iacr.org/2013/508

61. Reed, M.G., Syverson, P.F., Goldschlag, D.M.: Anonymous connections and onionrouting. IEEE J.Sel. A. Commun. 16(4), 482–494 (Sep 2006)

62. Sahai, A., Waters, B.: Fuzzy identity-based encryption. In: EUROCRYPT. pp.457–473 (2005)

63. Shi, E., Bethencourt, J., Chan, T.H., Song, D., Perrig, A.: Multi-dimensional rangequery over encrypted data. In: Security and Privacy, 2007. SP’07. IEEE Symposiumon. pp. 350–364. IEEE (2007)

64. Shoup, V.: NTL: A library for doing number theory. https://www.shoup.net/ntl/65. Song, D.X., Wagner, D., Perrig, A.: Practical techniques for searches on encrypted

data. In: Security and Privacy, 2000. S&P 2000. Proceedings. 2000 IEEE Sympo-sium on. pp. 44–55. IEEE (2000)

66. official statement, I.M.: https://www.ipsos-mori.com/newsevents/latestnews/1390/Ipsos-MORI-response-to-the-Sunday-Times.aspx

67. Tu, S., Kaashoek, M.F., Madden, S., Zeldovich, N.: Processing analytical queriesover encrypted data. In: Proceedings of the 39th international conference on VeryLarge Data Bases. pp. 289–300. PVLDB’13, VLDB Endowment (2013), http:

//dl.acm.org/citation.cfm?id=2488335.2488336

68. US Government: http://www.census.gov/ces/dataproducts/index.html69. Van Rompay, C., Molva, R., Onen, M.: Secure and scalable multi-user searchable

encryption. IACR Cryptology ePrint Archive 2018, 90 (2018)70. Voight, B.F., Scott, L.J., Steinthorsdottir, V., Morris, A.P., Dina, C., Welch, R.P.,

Zeggini, E., Huth, C., Aulchenko, Y.S., Thorleifsson, G., et al.: Twelve type 2 dia-betes susceptibility loci identified through large-scale association analysis. Naturegenetics 42(7), 579 (2010)

71. Wang, C., Kao, W.H., Hsiao, C.K.: Using hamming distance as information for snp-sets clustering and testing in disease association studies. PloS one 10(8), e0135918(2015)

72. Yao, A.C.C.: How to generate and exchange secrets. In: FOCS (1986)73. Zahur, S., Evans, D.: Obliv-c: A language for extensible data-oblivious computa-

tion. IACR Cryptology ePrint Archive 2015 (2015)74. Zelin, A.: https://www.mrs.org.uk/pdf/Andrew_Zelin_presentation.pdf

30

Page 31: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

SUPPLEMENTARY MATERIAL

A Notation and Preliminaries

We rely on some standard mathematical notation, like [n] to denote the set{1, ...n}. Below we summarize additional notation and symbols used across thepaper.

Table 1 Notation index

Notation Meaning

D Data Owner

S Storage Server

A Auxilliary Server

C Client

FED Functionally Encrypted Datastore

SED Searchably Encrypted Datastore

CED Computably Encrypted Datastore

sFED Single Data-Owner Functionally Encrypted Datastore

sSED Single Data-Owner Searchably Encrypted Datastore

sCED Single Data-Owner Computably Encrypted Datastore

W Universe of search-attributes in the data records

X Universe of compute-attributes in the data records

I Universe of identifiers used for the data records

Q Universe of supported search queries; Q ∈ QF Universe of supported computation queries; f ∈ FL Leakage function

K Universe of keywords

QK Query family for single keyword-occurrence queries; here W = P(K)

Q∗K Query family for boolean formula over keyword-occurrence queries

Z Input of D to FED functionality; Z ⊆ W ×XW Input of D to SED functionality; W ⊆ W × IX Input of D to CED functionality; X ⊆ I × X

(Q, f) Input of C to FED; Q ∈ Q and f ∈ FT Set of id’s selected by a query; T ⊆ I; T = Q[W ]

J The set of unique identifiers for each record in the CED database;J ⊆ I; J = {id|∃x s.t. (id, x) ∈ X}

JMKPK Public key encryption of M using a public-key PK

〈〈M〉〉PK Homomorphic encryption of M using a public-key PK

Some of our constructions make use of additive homomorphic encryption andgarbled circuits. We refer the reader to [8] and [2] for formal definitions.

31

Page 32: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

B Related Work

In Table 2, we give a comparison of our work with different related works. As isevident from the table, our work represents a different trade-off between func-tionality, security and query-phase efficiency.

Table 2 Comparison of related works

Functionality SecurityQuery-PhaseEfficiency∗∗

Search-and-Compute

SearchsupportsBooleanformulas

Computesupportsgeneralfunctions

MultiData-Owner

MultiClient

Corruption LevelNotes/Additional Assumptions Client Server

Server(s) Client(s)Data-Owner(s)

CryptDB[58]

Y Y Y (SQL) N N Passive Passive PassiveTrusted application server and pas-sive DBMS server. No server-clientcollusion. Only snapshot attacker.

O(1) O(t)

Seabed[53]

Y Y Y (OLAP) N Y Passive Passive PassiveNo server-client collusion. Clientsneed a key given by the data owner.Only snapshot attacker.

O(1) O(n)

Arx [56] Y Y N N N Passive Passive PassiveTrusted application server and pas-sive DBMS server. No server-clientcollusion. Only snapshot attacker.

O(1) O(t logn)

OXT [12] N Y NA N N Passive Passive Passive - O(t1k) O(t1k)

OSPIR[36]

N Y NA N Y Passive Malicious MaliciousNo server-client collusion. Also pro-vides query privacy to clients assum-ing no server-data owner collusion.

O(t1k) O(t1k)

BlindSeer[54,24]

N Y NA N Y Passive Malicious PassiveNo server-client collusion. Also pro-vides query privacy to clients assum-ing no server-data owner collusion.

O(t1 logn) O(t1 logn)

CFE [50] N NA Y (circuits) Y Y Passive Malicious PassiveNo authority-client or authority-storage collusion. Can provide queryprivacy to clients.

O(n) O(n)

Multi-key/userSE [30,69]

N N N Y Y Passive Passive Malicious

File sharing setting. Server allowedto collude with subset of data ownersor clients. DOs aware of clients (fixedat start of protocol). Shared data canbe traced back to original DO.

O(1) O(n)

Thiswork

Y Y Y (circuits) Y Y Passive Malicious Passive

Two non-colluding servers. Eitherserver can collude with subset ofclients/data owners. Can providesearch/function query privacy toclients, anonymity to data owners.DOs oblivious of clients (can dynam-ically join system).

O(1) O(t1k)

∗∗Based on a query of the form SELECT SUM(COL) WHERE COL1 = α1 AND COL2 = α2 · · · COLk = αk, whereCOL and COLi are column names. If computation is not supported, SUM is omitted; if search is not supportedWHERE clause is omitted. n denotes the total number of records in the database and t denotes the number ofrecords matching the search condition. WLOG, assume COL1 = α1 filters the least number of records and t1 denotesthe number of records satisfying it. In each case, assume that all supported pre-processing has been done prior to thequery.

32

Page 33: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

C Modifying existing SSE schemes in literature

Many existing SSE schemes and their multi-client versions in literature can bemodified and used in our setting of SED. We describe one such instantiationby using the MC-OXT scheme of [36] for the family of keyword search queries,Q∗K (see Section 2). We provide the core ideas of their scheme in Appendix C.1,explain our modifications in Figure 8 and use this for our implementation. Ourmodifications do not require prior knowledge of the scheme and can be under-stood by a high level idea of the entities involved and the functionality expected.Our techniques for modifying MC-OXT scheme can be extended to other SSEschemes as well. Please see Figure 8 for a high-level idea of the protocol flow inMC-OXT, our modifications as well as security-efficiency tradeoffs.

C.1 Brief Description of MC-OXT

We give a brief overview of the multi-client SSE scheme (MC-OXT) of [36]. In thissetting, we have a data owner D who processes a large database DB to producean encrypted database EDB and a master key MSK, of which it sends EDB tothe server S. The model allows for multiple clients C, who may request D fortokens corresponding to certain queries Qj , which D constructs using its MSK. Itprovides the client with a key corresponding to Qj , along with a signature so thatS may verify that C is authorized to make this query. The client transforms thereceived token into search tokens by doing computation and sends these searchtokens with their corresponding signatures to S, who verifies the signature, andexecutes a search protocol. Finally, it obtains encrypted id values from the serverwho is in possession of the decryption key.

The MC-OXT protocol of [36] is designed to support arbitrary Booleanqueries but for ease of exposition here, we will consider the special case where theclient makes conjunctive queries of the form w1∧w2∧ . . .∧wn where ∀ i, wi ∈ K(wi are keywords in the document, and conjunction means that we wish toquery the document with all the keywords w1, . . . , wn). The least fequent word,w1 (say) is known as s-word, while the remaining words w2, . . . , wn are calledx-words. It is assumed that the data owner maintains information that lets itcompute the s-word for any query of a client.

Setup of MC-OXT: Pick PRFs Fτ and Fp with the appropriate ranges, andkey KS for Fτ and two keys KX ,KI for Fp. The keys KS and KX are usedto generate pseudorandom handles of keywords w ∈ K, denoted by strap =Fτ (KS , w) and xtrap = gFp(KX ,w) respectively. The key KI is used to generatea pseudorandom handle of document indices id ∈ DB as xind = Fp(KI , id).Next, the protocol computes xtag = gFp(KX ,w)·xind for all “valid” (w, id) pairs,i.e. pairs such that the document id contains the keyword w, and adds xtag toconstruct a set called XSet. The idea of constructing a set such as XSet is thatif we can query for all documents containing the word w1 (our s-word) usinga list Tw1

. Then we can check ∀ id ∈ Tw1if id contains the keyword w2 by

computing if gFp(KX ,w2)·Fp(KI ,id) ∈ XSet. To construct the list Tw1securely, we

33

Page 34: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

construct another set called TSet. More formally, for w ∈ K, the data ownergenerates a fresh pair of keys (Kz,Ke) using strap(w) and encrypts list of id’scontaining keyword w by e = Enc(Ke, id). Next, for c ∈ [|Tw|], it blinds the|Tw| representatives xind1, . . . , xindTw as follows: set zc = Fp(Kz, c) and y =xind · zc−1. It stores (e, y) in TSet(w). Note that (e, y) are constructed usingkeys that are specific to s-word w. It then chooses a key KM for AuthEnc - anauthenticated encryption scheme, and this key is shared with the server. Thekeys (KS ,KX ,KI ,KT ,KM ) are retained by D as the master secret.

Query of MC-OXT: When the client requests a token for the query w1 ∧w2 ∧ . . . ∧ wn, the data owner computes the s-word, computes stag using KT ,strap using KS as well as xtrap2, . . . , xtrapn corresponding to w2, . . . , wn usingKX . Next, it blinds the xtrap values as bxtrapi = gFp(KX ,wi)·ρi where ρi arechosen randomly from Z∗p. Then it creates an authenticated encryption env ←AuthEnc(KM , (stag, ρ2, . . . , ρn)) and returns SK = (env, strap, bxtrap2, . . . , bxtokenn)to the client.

The client uses SK to construct search tokens as follows. It uses strap to com-pute Kz and for c ∈ [Tw] it computes zc = Fp(Kz, c). Now, it sets bxtoken[c, i] =bxtrapzci and sends env along with all the bxtoken[c, i] to the server. The serververifies the authenticity of env using KM , and decrypts it to find stag andρ2, . . . , ρn. It uses stag to obtain TSet(w1) and a list of {(e, y)}. Then fori ∈ {2, . . . , n} and ∀y ∈ TSet(w1), it checks if bxtoken[c, i]y/ρi ∈ XSet (Thuschecking if document id ∈ Tw1 contains the keyword w2 or not). If the member-ship succeeds, it retrieves the corresponding encrypted document index e andsends it to the client who decrypts it and requests the corresponding documentfrom the server.

C.2 Security-Efficiency Tradeoffs

We also present multiple security-efficiency tradeoffs relative to the above pro-tocol (see Figure 9 for a detailed description of the tradeoffs):

– We can avoid leaking the search outcome to A, but only leak an upper boundon the size of the outcome, by using an additive homomorphic encryption in-stead of plain public-key encryption used in the SSE scheme of [36] to encryptthe ids. This solution could be seen as an instance of onion secret-sharing (seeSection 5.1) that supports reusability.

– We can retain the original efficiency of the SSE scheme, but slightly increasethe leakage to S (only if the keyword query involves a boolean condition overmultiple keyword predicates) and avoid leaking the search outcome to A bysimply omitting the above mentioned layer of encryption.

– Further, if we start with the OSPIR-OXT protocol in [36] that handles activelycorrupt clients, then we can avoid leaking the search queries to A.

34

Page 35: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

High Level Overview of query phase of MC-OXT in [36]:

1. C inputs a conjunctive query and sends this to D.2. D performs computation and sends tokens as authorization to C.3. C performs additional computation over these tokens and sends them to S.4. S uses the tokens by C to perform search and returns the encrypted set of document

indices to C.5. C locally decrypts these indices to obtain output of the MC-OXT protocol.6. C requests the corresponding documents from S.

Modification to an sSED protocol:

– Initialization phase:The initialization phase proceeds exactly as in the MC-OXT protocol, finally, thekeys for authorizing a query are now sent to A instead of being with D.

– Query phase:1. C inputs a conjunctive query and sends this to A.2. A performs the computation intended by D and C ( above in steps 2,3) and sends

tokens to S.3. S uses the tokens by A to perform search and returns the encrypted set of document

indices to A.4. A decrypts the indices to obtain output of the MC-OXT protocol and sends this

to S.

Leakage:

– Leakage to A is equivalent to leakage to D during query phase of MC-OXT. A has anadditional leakage of the filtered set of id’s.

– C does not learn the filtered set of id’s, the size of the documents matching to leastfrequent keyword or the pattern of the least frequent keyword. In other words, C doesnot incur any leakage in this modification.

– Leakage to S stays same as in MC-OXT.

Fig. 8 Modification of MC-OXT in [36] to our sSED setting.

35

Page 36: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

Modification of MC-OXT using Additive Homomorphic Encryption:Notation 〈〈M〉〉PK denotes additive homomorphic encryption of M using a public-keyPK.

– Initialization Phase:1. A and S create a public/secret key-pair (PKA, SKA) and (PKS, SKS) for the ho-

momorphic encryption scheme and publish PKA and PKS respectively.2. D proceeds exactly similar as in MC-OXT. Whenever D encrypts the database

in MC-OXT. It stores an additive homomorphic encryption of id instead of asymmetric key encryption, i.e. Enc(K, id) is replaced by 〈〈id〉〉PKA .

– Query Phase:1. C inputs a conjunctive query and sends this to A.2. A performs the computation intended by D and C ( above in steps 2,3) and sends

tokens to S.3. S uses the tokens by A to perform search and returns the encrypted set of document

indices to A, i.e. S receives {cid}id∈T where cid = 〈〈id〉〉PKA . For each id, it chooses arandom value and adds the random value to the ciphertext. i.e. γid = 〈〈rid〉〉PKA +cidand δid = 〈〈−rid〉〉PKS . S sends {(γid, δid)}id∈T to A.

4. A decrypts using SKA to reveal {(id + rid, δid)}id∈T . A encrypts using PKS andcalculates {ηid}id∈T where ηid = 〈〈id + rid〉〉PKS + δid. It permutes this set andsends it to S. (Note that: ηid = 〈〈id + rid〉〉PKS + 〈〈−rid〉〉PKS = 〈〈id〉〉PKS )

5. S decrypts using SKS to reveal {id}id∈T .We note that there is a need for permuting the plaintext id’s from the ciphertexts tostay consistent with the exact leakage to S in MC-OXT. Otherwise, S learns statisticsabout each id that was decrypted due to the intersections with other keywords thatit calculated.

Leakage:

– Leakage to A is equivalent to leakage to D during query phase of MC-OXT. A addi-tionally has knowledge of the size of the filtered set of id’s.

– C does not incur any leakage in this modification.– Leakage to S stays same as in MC-OXT.

Modification by Removing Encryption:

– Initialization Phase:1. D proceeds exactly as in MC-OXT, it stores the plaintext in clear, i.e. Enc(K, id)

is replaced by id.– Query Phase:

1. C inputs a conjunctive query and sends this to A.2. A performs the computation intended by D and C ( above in steps 2,3) and sends

tokens to S.3. S uses the tokens by A to perform search and learns the set of filtered indices i.e.{id}id∈T

Leakage:

– Leakage to A is equivalent to leakage to D during query phase of MC-OXT. A has noadditional leakage.

– C does not incur any leakage in this modification.– Leakage to S increases when compared to the leakage of S in MC-OXT. S learns

each id in the least frequent keyword and is also able to learn intersection patternsbetween each id in the least frequent keyword and other queried keywords. However,if performing single keyword search, the leakage in this protocol exactly matchesleakage in MC-OXT.

Fig. 9 Security-Efficiency tradeoffs in modified MC-OXT in [36].

36

Page 37: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

D CED Protocols: Missing Details from Section 4.4 andSection 5.4

In this section, we provide additional protocols from section Section 4.4 andSection 5.4.

D.1 Alternate Protocol for Summation in Section 4.4

Our protocol for summation in Section 4.4 uses additive secret sharing to com-pute FSum. In this section we describe an alternative protocol that uses an ad-ditive homomorphic scheme to achieve the same functionality. The advantage ofusing such a scheme is that we can prevent the leakage of the filtered set T to Aand reduce the asymptotic communication overheads. However, there is tradeoffin efficiency; the initialization phase is slower as there are additive homomorphicencryptions that need to be computed.

The scheme relies on an additive homomorphic encryption scheme (like Pail-lier’s scheme), which supports rerandomization of ciphertexts (or more specifi-cally, homomorphically adding a random ciphertext (of a random value) to anygiven ciphertext results in a random ciphertext). The domain of the values X isidentified with a finite group that forms the message space of the homomorphicencryption scheme.

Protocol Alt-Sum

– Initialization Phase: D creates a public/secret key-pair (PK,SK) for thehomomorphic encryption scheme. It then sends SK to A and (PK, {cid}id) toS, where cid ← HomEncPK(xid). A and S store the values they receive.

– Computation Phase: S first picks a random value y ∈ X and computesρ ← HomEncPK(y); it then computes c = ρ +

∑id∈T cid, where the summa-

tion notations (+ and∑

) represent the homomorphic addition of ciphertextssupported by the encryption scheme. It sends y to C and c to A. A computesz ← HomDecSK(c) and sends z to C, who outputs z − y.

– Leakage:, LAlt−Sum On initialization, the ID-set J is leaked to S.

The main advantage of the scheme is that this prevents the leakage of theset T to A, as S can aggregate the values (under a layer of encryption) withoutinvolving A. However, this comes at the cost of efficiency. In particular, theinitialization phase here is much slower as it involves public key encryptions foreach element in X.

Theorem 9. Protocol Alt-Sum is a (FSum,LAlt−Sum)-sCED scheme, assumingthe security of the additive homomorphic encryption scheme used.

Proof sketch: Security against S relies on the semantic security of the homomor-phic encryption scheme. In particular, the view of S (during the initializationphase) merely involves ciphertexts. Security against A and C (if not colludingwith S) are information-theoretic: during the computation phase A sees onlya random ciphertext of a random value, and the view of C consists of a freshsecret-sharing of the final output. �

37

Page 38: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

We adapt the protocol above to the multi data-owner setting. The modifiedscheme has the same computation phase as before, but the initialization phasechanges as follows:

Protocol Alt-Sum-mod

– Modified Initialization Phase: A creates a public/secret key-pair (PK,SK)for the homomorphic encryption scheme, and publishes PK. Each Di, for eachid in its data-set sends {J(id, cid)KPKS

}id to A, where cid ← HomEncPK(xid)(possibly padded with dummy entries). A gathers these sets from all Di, sortsthem, and sends them to S. S recovers the set {id, cid}id.

– Leakage, L′Alt−Sum The modified initialization phase leaks (an upper boundon) |Xi| for each data-owner to A. (Leakage to S remains the same, namelythe ID-set J .)

The protocol relies on the same security analysis as the single data-ownerversion. Secure routing through A ensures that S does not know about the com-position of the data from each data-owner.

Theorem 10. Protocol Alt-Sum-mod is a (FSum,L′Alt−Sum)-sCED scheme, as-suming the security of the AHE and PKE schemes used.

Value Retrieval and Summation for Vector Values: The above protocolsreadily extend to the setting when each value x is a vector (x1, · · · , xm), and thefunction f acts on a subset of attributes. C will send the coordinates of interest toA, but not to S, or to both the servers (for added efficiency). The protocol couldbe seen as parallel executions of the original protocol, one for each coordinate ofinterest; if S does not have the information regarding the coordinates of interest,the execution is carried out for all coordinates, but at the last step, A sends toC only the shares for coordinates included in f .

D.2 Composite Queries protocol in Section 4.4

Towards adapting the above non-composite queries sCED protocols, note thatin all of them, the last step has S and A sending secret shares for the outputto C (in the case of value retrieval and summation protocols this is an additivesharing over an appropriate group, and in case of general functions, for eachoutput bit, one share (from A) is a map from a pair of labels to {0, 1}, and theother share (from S) is a label). The reconstruction operation to calculate thefinal output from these secret shares is simple and efficient. We use this commontheme in our protocols to allow calculation of composite queries.

We modify these protocols as follows: Instead of S and A sending the sharesto C, they carry out a secure 2-party computation of f0 with each input beingsecret-shared as above. This can be implemented for a general f0 (e.g., findingthe maximum) using Yao’s garbled circuits and oblivious transfer, or for specialfunctions like linear combinations using simple protocols (e.g., if f0 is addition,and the inputs are additively secret-shared in the same group, S and A each can

38

Page 39: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

simply add the shares they have locally, and send the result to C). In case f0 isevaluated using Yao’s garbled circuits, the circuit structure used to evaluate f0is leaked to S.

D.3 CED for General Functions

In Figure 10 we present the modifications required to convert the sCED schemefor general functions discussed in Section 4.4 into an CED scheme, as discussedin Section 5.4.

Protocol CktEval-mod

– Modified Initialization Phase: Each data-owner Dj , for each (id, xid) ∈ Xj , gen-erates νid,i and λbid,i (for b = 0, 1) randomly (instead of using a key shared with A).

It computes ζid = {J(νid,i, λ0id,i, λ

1id,i)KPKA}

ti=1, and γid = (id, ζid, {λ

xidiid,i, ωid,i}ti=1),

where ωid,i = xid,i ⊕ νid,i as before. It sends {JγidKPKS}id∈Jj to A who collects themfrom all the data-owners, sorts them lexicographically, and sends them to S. S de-crypts them to obtain {γid}id∈J , where J =

⋃j Jj .

– Modification of the Computation Phase: Step (1) is modified as follows:1. S sends {ζid}id∈T to A (randomly permuted), and A decrypts them to obtain, for

each id ∈ T , {νid,i, λ0id,i, λ

1id,i}ti=1.

The rest of the computation phase proceeds as above.

Fig. 10 Modifications required to convert sCED into CED.

The leakage function for this construction is as follows:Leakage, L′ckt : On initialization, the ID-set J =

⋃j Jj is leaked to S, and

the sizes {|Jj |}j are leaked to A. On each query, T (or rather, its pattern) and fare leaked to A, and the circuit structure of f (as revealed by the garbled circuit)is leaked to S.

The security of the resulting protocol is summarized below.

Theorem 11. Protocol CktEval-mod (Figure 10) is a (Fckt,L′ckt)-CED scheme,assuming the security of the PRF and PKE schemes used.

E SED protocols: Missing Details

E.1 Actively secure OPRF

In this section, we describe the UC-secure OPRF protocol of [37] which we use inour constructions and which allows active corrution of the receiver and passivecorruption of the party with the key.

Let G a cyclic group of prime order m and g be a generator of the group.The receiver, whose input is x, samples r ∈ [m] uniformly and sends a = H0(x)r

to the key-holder, where H0 is a random oracle that maps into the group. Thekey-holder replies with b = aK , where K ∈ [m] is the key. The receiver out-puts H1(x, b1/r), where H1 is another random oracle. Formally, then the PRF isdescribed as fK(x) = H1(x,H0(x)K).

39

Page 40: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

The protocol described above UC-securely implements the ideal OPRF func-tionality assuming that the (N,Q)-one more Diffie-Hellman Assumption holdsfor the group. For completeness, we reproduce the assumption (as described in[37]) here, but refer the the reader to Theorem 1 of [37] for further details.

(N,Q)-one more Diffie Hellman Assumption. For any poly time adversary A,

Prk←RZm,gi←RG[A(·)k,DDH (·)(g, gk, g1, g2, ...gN ) = S]

is negligible, where S = {(gjs , gkjs)|s = 1, ..., Q + 1}, Q is the number of A’s

queries to the (·)k oracle, and js ∈ [N ] for s ∈ [Q+ 1].

E.2 Protocol for computing equality checks

We use securely pre-computed correlations for a secure evaluation of an equalitycheck between two parties. The strings to check for equality will be mappedusing a collision resistant hash to, say 64 bits, and interpreted as elements ina field, say F = GF (264). The protocol uses a pre-computed correlation: Aliceholds (a, p) ∈ F2 and Bob holds (b, q) ∈ F2 such that a + b = pq and p, q andone of a, b are uniformly random. Alice sends u := x− p to Bob, where x is herinput. Bob replies with v := q(u − y) + b, where y is his input. Alice concludesx = y iff a = v. Please see Figure 11 and Figure 12 for the complete description.

E.3 Improving Efficiency

Suppose the keyword space partitions as K = B1 ∪ · · · ∪ B`, such that thereis an upperbound Lij on |Ki ∩ Bj |, for each data-owner Di and each bin Bj .Given appropriate statistics on the data domain, such upper bounds can oftenbe derived for easy to compute partitions (e.g., a bin may contain keywords inthe English vocabulary that start with a certain set of letters or short prefixes).Then, the protocol can be easily modified so that the equality-checks step (Step 3in Figure 11 can be repeated for each bin separately, without having to comparekeywords across bins. The total number of equality checks needed, becomesO(∑j(∑i Lij)

2); when ` bins are used and if each Lij = O(Li/`), this becomes

O((∑i Li)

2/`), providing a saving of up to factor ` over the original construction.(The assumption that we can take Lij = O(Li/`) would hold only when ` is nottoo large, placing a limit on the efficiency gain possible using binning.)

E.4 Reducing SED Leakage Further

The template used in Section 5.3 reveals the merge-mapped data-set W to A.The mapping ensures that the search results Q[W ] are not linked to W , and canbe simulated independently given only their pattern information. Nevertheless,W itself reveals the entire keyword-identity incidence matrix, as mentioned inthe leakage description in Section 5.3. This may be acceptable if the statistics

40

Page 41: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

merge-map:

1. Keys. S generates a keypair for PKE, (PKS, SKS), and similarly A generates (PKA, SKA).A generates a PRF key K. The keys PKS and PKA are published.

2. Shuffled Sharing. At the end of this phase, for each (i, τ) such that τ ∈ Ki:- S holds (αi,τ , tagi,τ ), where αi,τ ∈ {0, 1}λ is a random string (as long as a keyword), and

tagi,τ is a unique (random) index for each (i, τ);- A holds (βi,τ , Γi,τ , Θi,τ , tagi,τ ), where:

• βi,τ := τ ⊕ αi,τ ; Γi,τ := {Γ idi,τ |(τ, id) ∈ Wi} (padded as explained below). Here Γ id

i,τ :=(ξidi,τ , γ

idi,τ , Jηidi,τ , Jµid

i,τ KPKAKPKS) with:◦ ξidi,τ ← JJζ idi KPKAKPKS (fresh encryptions for each (i, id, τ)), where ζ idi ← JidKPKS (a

single encryption per (i, id)),◦ γid

i,τ , ηidi,τ , µ

idi,τ being a random additive secret-sharing of 0λ (i.e., γid

i,τ⊕ηidi,τ⊕µidi,τ = 0λ).

Γi,τ is padded with additional entries of the form Γ⊥i,τ – which is defined identically asΓ idi,τ except that ζ idi ← J⊥KPKS – so that |Γi,τ | is an a priori fixed value.

• Θi,τ := (θi,τ , Jφi,τ KPKS) for a random φi,τ ; and θi,τ := τ ⊕ φi,τ .This phase proceeds as follows:

– Each Di, for each τ ∈ Ki picks two random masks αi,τ and φi,τ and computes πi,τ :=(βi,τ , Γi,τ , Θi,τ ), as defined above. Then it sends (αi,τ , Jπi,τ KPKA) to S. (The number ofelements sent by each Di is padded up to an a priori bound, with dummy entries of theform (α, J⊥KPKA).)

– S collects entries of the form (αi,τ , ρi,τ ) from all Di, and randomly permutes them.Let tagi,τ denote the serial number of these elements in the permuted order. S stores(αi,τ , tagi,τ ) and communicates (ρi,τ , tagi,τ ) to A.

– A decrypts the entries to obtain (πi,τ , tagi,τ ). (It will discard the entries where π = ⊥.)

3. Grouping Keywords Using Equality Checks. At the end of this phase, there is a(hidden) injective mapping h :

⋃iKi → [L] (where L equals |

⋃iKi|, or an upper-bound

thereof). For each (i, τ) with τ ∈ Ki, A will hold a tuple (h(τ), Γi,τ , Θi,τ ).

– Below we index the entries held by S and A as αt, πt etc., t being the tag value. Forevery two tags, t, t′, S and A engage in a 2-party secure equality check protocol to checkif αt ⊕ αt′ = βt ⊕ βt′ . Note that this equality holds iff τt = τt′ . A learns the results.(If πt = ⊥, A uses an arbitrary string instead of βt in the equality check protocol.)

– A partitions the set of tags into equivalence classes T1, · · · , TL, such that for all k ∈ [L],and t, t′ ∈ Tk, the equality test above was positive. (Hence there is some keyword τkcorresponding to all the elements in Tk. This implicitly defines the mapping h : τk 7→ k.)

4. Culling Dummy Entries. At the end of this phase, for each (i, τ, id) with (τ, id) ∈ Wi, Aobtains a tuple (h(τ), ζ idi ).

– A sends to S (in random order) entries of the form (h(τ)⊕ γidi,τ , Jηidi,τ , Jµid

i,τ KPKAKPKS , ξidi,τ )

where some of the entries may have id = ⊥ (in which case, ξidi,τ = J⊥KPKS).– From each such triple, where the last two items are ciphertexts under PKS, S extracts

(ηidi,τ , Jµidi,τ KPKA) from the first ciphertext and either Jζ idi KPKA or ⊥ from the second one.

It discards all entries where ⊥ is obtained from the second ciphertext. It then permutesand sends back entries of the form (h(τ)⊕ γid

i,τ ⊕ ηidi,τ , Jµidi,τ KPKA , Jζ

idi KPKA) to A.

– From each tuple received, A decrypts the second item to obtain µidi,τ , and the third item

to receive ζ idi ; combining former with first item, A recovers h(τ)⊕γidi,τ ⊕ηidi,τ ⊕µid

i,τ = h(τ).

5. Sharing the Mapped Keywords. At the end of this phase:- For each (i, τ, id) with (τ, id) ∈ Wi, A obtains a tuple (h(τ), τ) where τ = FK(h(τ)), and- S obtains {(k, σk)|k ∈ L} where, when k = h(τ), σk = τ ⊕ τ .

– For each k ∈ [L], A sets τk = FK(k) Then, it chooses a representative t∗k ∈ Tk and sends(θt∗

k⊕ τk, Jφt∗

kKPKS) to S (sorted by k). (If there are fewer than L equivalence classes,

randomly generated θ and φ values are used to generate dummy messages.)– S receives an ordered list of L pairs (δk, JφkKPKS), and it computes σk = δk ⊕ φk.

6. Outputs. A outputs KA = K and W consisting of elements (τ , ζ idi ). S outputs KS =(Z, SKS), where Z is the set {(k, σk)|k ∈ [L]} = {(h(τ), σh(τ))|τ ∈

⋃iKi}.

Fig. 11 Protocol implementing merge-map functionality for instantiating the templatefor SED protocols, based on SFE.

41

Page 42: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

map: A client C has an input Q which is specified using one or more keywords. Foreach keyword τ appearing in Q, C engages in a protocol with S and A to compute τ .The protocol is as follows:

– C sends additive secret-shares γ, δ to S and A respectively, where γ ⊕ δ = τ .– For k = 1, · · · , L, S and A carry out an equality check protocol to check if (γ⊕σk) =

(δ ⊕ FK(k)). The output is revealed to A. There will be at most one value k∗ suchthat the equality holds.

– A sends FK(k∗) to C, who takes it as τ . (If no such k∗ exists, A sends a random value,or alternately, if permitted, may reveal to C that there is no match.)

Fig. 12 Protocol implementing map functionality for instantiating the template forSED protocols, based on SFE.

of the overall matrix are not private; further, by adding noise in the form ofdummy keywords and identities one can further limit the accessible informationin W .

However, there may be scenarios where it would be desirable to almost com-pletely eliminate the leakage from W to A. We present a modification of theearlier two constructions which achieves this, as long as the search queries areindividual keyword membership queries (rather than predicates over such mem-bership queries). Observe that in the case of individual keyword queries, we may

map an id used by a data-owner Di to multiple values idτ , for each τ such that(τ, id) ∈ Wi. Then, on a keyword query, at most one such value will be recovered(corresponding to the queries keyword); we shall require that on decoding anysuch value is mapped back to the same id. Then, instead of the (unlabelled)keyword-identity incidence matrix, A learns only the total number of keywordsand their “degree distribution” (i.e., the histogram showing, for each n, the num-ber of keywords that appear in n documents in the merged dataset). A does noteven learn the total number of ids.

For the constructions in Section 5.3, this is easily implemented by replacingthe ciphertext ζ idi by ζ idi,τ ← JidKPKS

(i.e., fresh encryptions for each (τ, id) ∈ Wi).

E.5 Security Analysis of SED protocols

In this section, we provide a detailed security analysis of our SED protocols.We focus on the instantiation of the SED protocol template (Section 5.2)

using the components developed in Section 5.3. The analysis of the full SEDprotocol breaks down into the analysis of the template protocol (Section 5.2) andthe protocols for merge-map, map and sSED separately. Note that to enable thismodular analysis, it is crucial that these three functionalities are fully specified,including their leakages.

First we analyze merge-map, in Figure 6. Instead of directly using the randomoracle model, we assume an ideal OPRF functionality (which is then UC-securelyrealized in the random oracle model). Given the ideal OPRF functionality, theview of S consists of Li invocations of OPRF be data-owner Di, followed by Niciphertexts encrypted under PKA. These can be simulated knowing Ni, Li which

42

Page 43: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

are part of the leakage, assuming semantic-security of the public-key encryptionscheme. (Note that we assume that all the keywords are encoded into bit-stringsof the same length.) The view of A consists of a set of ciphertexts (permutedto disassociate with the data-owners who created it) which can be perfectlysimulated knowing only

∑iNi (which is part of the leakage) and W (which is

part of the output). For this, we rely on the semantic security of the public-keyencryption scheme.

The protocol for map in Figure 6 is easily seen to be a UC-secure realizationof the corresponding functionality as it directly uses the OPRF functionality. Asdescribed in Section 5.3, we UC-securely realize this functionality against activecorruption of the receiver, in the random oracle model, by using the OPRFprotocol of [37].

Finally, we turn to the analysis of the SED template protocol (see Figure 4b).This relies on the fact that the leakages from SED to A and S are defined toinclude the corresponding leakages from merge-map, map and sSED, as wellas the information required to simulate the intermediate output (namely W )that A receives, namely the keyword-identity incidence matrix of the mergeddata. (Here we rely on the pseudorandomness guarantee of the OPRF to ensurethat by learning W , A learns nothing more than the unlabeled keyword-identityincidence matrix of the merged data, except for the labeling of keywords held bycorrupt data-owners.) Also, KS which is output to S is sampled independentlyof the inputs (and hence perfectly simulated), and the information revealed to Sby Q[W ] can be perfectly simulated from Q[W ] and KS. In this protocol, if thefunctionality sSED and merge-map are implemented to handle actively corruptclients, the overall protocol can also be seen to be admit active corruption ofclients. merge-map, as mentioned above, relies on the OPRF protocol for this, andas discussed in Section 4.5, we can modify the searchable encryption protocolsfrom the literature to handle actively corrupt clients.

The analysis of the merge-map and map protocols in Figure 11 and Figure 12is more tedious, but the simulation is not much more complicated than above.In particular, the ciphertexts received by the servers S and A throughout theprotocol (encrypted using the other server’s public-key) can be simulated onlyknowing the number of ciphertexts (by relying on the semantic secuity of public-key encryption). These protocols also rely on a 2-party equality-check protocol,which can be replaced for the purposes of analysis using an ideal equality-checkfunctionality. By carefully inspecting the protocol one can verify that the view ofeither server (colluding with data-owners or clients, but not with each other) canbe entirely simulated using the leakage information specified in Section 5.3. Asthis analysis is not particularly enlightening, we omit the details and point thereader to the discussion on onion secret-sharing (Section 5.1) for the intuitionunderlying the construction and the analysis.

43

Page 44: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

F Implementation and Experiments: Additional Details

In this section, we provide details about our implementation and experimentsperformed.

F.1 Implementation: Additional Details

Our implementation used C++11 with STL support. Various opensource cryp-tographic libraries and implementations (including OpenSSL, libPaillier, NTL,JustGarbled and Obliv-C) were used to implement the various components inour protocol. The MC-OXT protocol and OPRF protocol of [37] (see Section 5.3)were reimplemented for use in our SED protocols. We conducted our experimentson synthetic data generated randomly.

We use the following opensource libraries and codebases in our implementa-tion.

– We used the OpenSSL library (version 1.0.2) for the following. PRFs andSymmetric Key Encryption (SKE) were based on AES-256, while public keyencryption (PKE) was implemented using RSA-OAEP, using hybrid encryp-tion where appropriate. For group operations in the MC-OXT protocol of [36],we used NIST-224p elliptic curves.

– We re-implemented the MC-OXT scheme of [36] and ensured that the perfor-mance is comparable to that reported in [36].

– Additive Homomorphic scheme was implemented using the Paillier Cryptosys-tem using the Libpaillier library [5].

– JustGarble [3] was used to implement the CED scheme for general functionevaluation.

– Two-party computation for CED for composite queries was implemented us-ing Obliv-C [73]. This provided an implementation of Oblivious Transfer andYao’s Garbled Circuit.8

– NTL (Number Theory Library) [64] version 11.3.0 was used to implement thefield operations in our construction using secure equality checks

F.2 Experiments: Additional Details

Dataset and Queries. We conducted our experiments on synthetic data gen-erated randomly. Our dataset sizes are typically between 10000−100000 recordsand 200 SNPs, inspired by real-world applications [33,31,70]. For the Hammingfunctionality, we used 5000−25000 records and 200 SNPs. For searching on thesedatabases, we carefully chose our parameters (search keywords) so that the num-ber of filtered records ranged between 600−3500 for the Hamming functionalityand between 2000 − 12000 for other functionalities. All the reported values areaveraged over the same 10 queries to account for experimental variation.

8 Obliv-C provides a high-level interface for implementing a secure 2-party computa-tion protocol. But it is not suited for generating garbled circuits outside the frame-work of a protocol, for which we relied on JustGarble.

44

Page 45: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

Protocols used for our Functionalities.

MAF. Computing MAF at a given locus can be implemented as a special in-stance of a composite query of the form f0(f1(Q[Z]), f2(Q[Z]), where Q isthe demographic/phenotypic filter, and f1 computes N = |Q[Z]|, f2 com-putes nA, and f0 the MAF itself (upto a finite precision) from nA and N .Note that if we encode the genotypes BB,AB,AA as 0, 1, 2 respectively,then nA is simply the sum of the encoded genotype values. Hence we shallbe able to use our summation (and alternate summation) sCED and CEDprotocols.

Chi-square analysis (ChiSq). Computing ChiSq corresponds to a compositequery f0(f1(Q[Z]), f2(Q[Z]), f3(Q′[Z]), f4(Q′[Z]), where f1, f2, f3, f4 com-pute nA, N, n

′A, N

′ respectively, and f0 implements the above formula (upto finite precision).

Average Hamming. We implement this using our general functions protocol.

Genome retrieval (GenomeRet). We implement this using our value re-trieval protocol.

Experiments Analysis. For the FED setting, we consider both protocols, i.e.based on oblivious PRF (FED OPRF) and secure function evaluation on equalityfunction (FED SFE). Results are reported over two metrics - total communica-tion cost and gross computation time across all entities. The computation timeof any entity excludes time taken to read/write to disk and also time taken totransfer data over the network. These metrics are contrasted between FED andSED and are reported for the initialization and query phases. The relevant en-tities were simulated using TCP clients and servers running on localhost. Allexperiments were performed on a Linux machine with 64 GB RAM and IntelCore i7-7700 CPU with processing frequency of 3.60 GHz. All the databaseswere pre-loaded into the RAM.

A few more observations about our experiments are summarized here.

Initialization Phase. In all our protocols, the initialization costs are linear in thenumber of input records. Here, the speed-up of sFED over FED OPRF rangesbetween 6 to 10 for different protocols. This is explained by the extensive useof public key operations in the latter setting. However, our obtained numbersare still efficient in practice. Additionally, the SFE protocol is slower than theOPRF protocol by approximately a factor of 1.5 for the FED component andfactor 2 for the SED component. This is due to the level of nesting of public keyoperations and onion secret sharing in the SED component. The communicationcosts for the initialization phase also reflect the same trends.

Query Phase. In all our protocols, the query time is linearly proportional tothe number of filtered records. This is partly a consequence of our dataset con-struction, where the frequency of any keyword is proportional to the size of the

45

Page 46: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

filtered dataset.9 This restriction helps us simplify the comparison of the FEDprotocols with their SED components. Also we note that the differences betweenthe two FED protocols (with OPRF and SFE) and the sFED protocol are neg-ligible, showing that the map protocols add little overhead relative to the queryphase of sFED.

More details of our experimentation are summarized below.

– Genome Retrieval: As seen in Figure 7a, sSED is the heavier part of thecomputation in sFED. Computation on filtered inputs of size 12000 took closeto 11 and 3.7 sec for FED and sFED.

– MAF: Computation on filtered inputs of size 12000 took close to 11 secondsfor FED and close to 4 sec for sFED (See Figure 7b). This competes withGenomeRet.

– Chi-square Analysis: The Chi-square formula was represented as a gar-bled circuit with 38000 gates. This required 0.23 sec for computation usingthe Obliv-C framework [73]. Note that this computation stayed invariant ofthe number of records filtered. Computation on filtered inputs of size 12000took close to 24 sec for FED and close to 14 sec for sFED. Time taken wasmore than twice when compared to MAF or Genome Retrieval as there weretwo summation queries computed in the inner protocol and an additionalpolynomial computed in the outer protocol. Please see Figure 7c for details.

– Average Hamming distance: In this case, the computation is heavy inthe sCED stage and all trends observed due to the changes in sSED arehidden, as seen in Figure 7d. On a population of size 25000, being filteredto 3100 individuals, the circuit being computed had approximately 80 milliongates and 60 million input bits. Hence, the two functionalities sFED andFED (both protocols i.e. OPRF and SFE) perform similarly. Computation onfiltered inputs of size 3100 took about 30 sec.

– Comparing the two summation protocols: Our two summation proto-cols are compared in Figure 13. During initialization, while summation basedon additive secret sharing took about 6 and 0.6 sec in the FED and sFED func-tionalities respectively, the alternate summation protocol took correspond-ingly 207 and 202 sec for the same dataset. Comparing FED and sFED, wenote that during initialization the computation and hence performance aresame. During the query phase, the two functionalities perform equally well.

9 Note that on an arbitrary dataset, our CED protocols will still run in time linearin the filtered dataset size, but SED query time is linear in frequency of the query’sleast frequent keyword – which may not equal the size of the filtered dataset ingeneral.

46

Page 47: A Practical Model for Collaborative Databases: Securely ...shwetaag/papers/Search_and_compute.p… · A Practical Model for Collaborative Databases: Securely Mixing, Searching and

500 1500 25000

10

20

30

40

50

60

70

Com

m (M

B)

Setup Communication Size

80 160 240

0.05

0.10

0.15

0.20

Query Communication Size

500 1500 2500#(input records)

0

50

100

150

200

Tim

e (s

ec)

Setup Gross Computation Time

80 160 240#(filtered records)

0.05

0.10

0.15

0.20

0.25

0.30Query Gross Computation Time

sFEDAddSSsFEDAddHomFEDOPRFAddSSFEDOPRFAddHom

Fig. 13 Two CED protocols are contrasted here, one using the additive secret-sharingbased summation (AddSS) and the other using the additive homomorphic encryptionbased summation (AddHom).

47