toward securing cloud-based data analytics: a discussion ... · with cloud-based technologies and...

20
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000. Digital Object Identifier 10.1109/ACCESS.2017.DOI Toward Securing Cloud-Based Data Analytics: A Discussion on Current Solutions and Open Issues SOMAYEH SOBATI-M. 1 , AMJAD FAYOUMI 2 (MEMBER, IEEE) 1 Department of Electrical and Computer Engineering, Hakim Sabzevari University, Sabzevar, Iran (e-mail: [email protected]) 2 Management Science, Lancaster University Management School, Lancaster, UK, LA1 4YX (e-mail: [email protected]) Corresponding author: Somayeh Sobati-M. (e-mail: s.sobati@ hsu.ac.ir). ABSTRACT In the last few years, organizations and business professionals have realized the value of data analytics in supporting decision-making. Where several activities are performed on on-line data by different stakehold- ers, such as cleansing, aggregation, analysis and visualization, cloud-based data analytics has become a favored choice for business professionals due to the elasticity, availability, scalability, and pay-as-you-go features offered by cloud computing. However, large amounts of data stored on the cloud are very sensitive (e.g., innovation, financial, legal, customers’ data), and so data privacy remains one of the top concerns for many reasons; mainly those relating to legal or competition issues. In this paper, we review the security and cryptographic mechanisms which aim to make data analytics secure in a cloud environment, and discuss current research challenges. INDEX TERMS Cloud computing , Data analytics, Data privacy, Query processing. I. INTRODUCTION With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information System (IS) solutions, particularly the multi-hybrid cloud deployment model, where IS services are outsourced and shared with several cloud service providers and integrated with the organization’s on-premises systems. Cloud comput- ing appeals to businesses and organizations because of the wide variety of benefits it offers. The pay-per-use model of cloud computing is very attractive, especially for small and medium enterprises, because it reduces start-up costs [101]. Some cloud outsourcing models go as far as outsourcing a complete business process to a third party Cloud Service Provider (CSP), and they share part of their IS assets through the cloud system [4]. The growth of cloud-based services has made data ana- lytics a feasible reality for organizations [101]. One closely related outsourcing model that has recently emerged is data analytics outsourcing [5], where organizations offer a value- added third party agent access to their heterogeneous, large datasets stored on the cloud in order to perform some analyt- ical tasks and deliver insight to the organization [92]. Data analytics helps businesses and organizations analyze data to gain a fuller understanding of the situation confronting them and make better decisions [7]. Data analytics outsourcing services, or Data Analytics-as-a-Service (DAaaS), entails storing sensitive data on a remote CSP’s infrastructure, and querying and retrieving data on demand [8]. Data Analytics- as-a-Service is a customizable analytical platform which uses the software-as-a-service (SaaS) cloud-based service delivery model. Different customizable data analytics tools are typically used by companies which are collaborating in an integrated value chain. The companies use DAaaS with the intention of integrating cross-organizational business processes to maximize data-driven value and enable rapid innovation. The security solutions and challenges we discuss in this paper relate to securing data which is stored on a CSP’s infrastructure and is subject to limited access by multiple organizations. These organizations are collaborating with each other, and data analytics is crucial for them collectively. Although cloud data outsourcing provides many benefits, it also raises security and privacy concerns. Ensuring data privacy is not a trivial matter, because data owners lose control of their data in cloud environments. Threats to privacy come from both outside and inside, so data privacy must be guaranteed not only against external adversaries breaking into the system, but also malicious insiders and so-called VOLUME 4, 2016 1

Upload: others

Post on 11-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.

Digital Object Identifier 10.1109/ACCESS.2017.DOI

Toward Securing Cloud-Based DataAnalytics: A Discussion on CurrentSolutions and Open IssuesSOMAYEH SOBATI-M.1, AMJAD FAYOUMI2(MEMBER, IEEE)1Department of Electrical and Computer Engineering, Hakim Sabzevari University, Sabzevar, Iran (e-mail: [email protected])2Management Science, Lancaster University Management School, Lancaster, UK, LA1 4YX (e-mail: [email protected])

Corresponding author: Somayeh Sobati-M. (e-mail: s.sobati@ hsu.ac.ir).

ABSTRACTIn the last few years, organizations and business professionals have realized the value of data analytics insupporting decision-making. Where several activities are performed on on-line data by different stakehold-ers, such as cleansing, aggregation, analysis and visualization, cloud-based data analytics has become afavored choice for business professionals due to the elasticity, availability, scalability, and pay-as-you-gofeatures offered by cloud computing. However, large amounts of data stored on the cloud are very sensitive(e.g., innovation, financial, legal, customers’ data), and so data privacy remains one of the top concerns formany reasons; mainly those relating to legal or competition issues. In this paper, we review the security andcryptographic mechanisms which aim to make data analytics secure in a cloud environment, and discusscurrent research challenges.

INDEX TERMS Cloud computing , Data analytics, Data privacy, Query processing.

I. INTRODUCTIONWith cloud-based technologies and services flourishing,many organizations have started adopting hybrid InformationSystem (IS) solutions, particularly the multi-hybrid clouddeployment model, where IS services are outsourced andshared with several cloud service providers and integratedwith the organization’s on-premises systems. Cloud comput-ing appeals to businesses and organizations because of thewide variety of benefits it offers. The pay-per-use model ofcloud computing is very attractive, especially for small andmedium enterprises, because it reduces start-up costs [101].Some cloud outsourcing models go as far as outsourcing acomplete business process to a third party Cloud ServiceProvider (CSP), and they share part of their IS assets throughthe cloud system [4].

The growth of cloud-based services has made data ana-lytics a feasible reality for organizations [101]. One closelyrelated outsourcing model that has recently emerged is dataanalytics outsourcing [5], where organizations offer a value-added third party agent access to their heterogeneous, largedatasets stored on the cloud in order to perform some analyt-ical tasks and deliver insight to the organization [92]. Dataanalytics helps businesses and organizations analyze data togain a fuller understanding of the situation confronting them

and make better decisions [7]. Data analytics outsourcingservices, or Data Analytics-as-a-Service (DAaaS), entailsstoring sensitive data on a remote CSP’s infrastructure, andquerying and retrieving data on demand [8]. Data Analytics-as-a-Service is a customizable analytical platform whichuses the software-as-a-service (SaaS) cloud-based servicedelivery model. Different customizable data analytics toolsare typically used by companies which are collaborating inan integrated value chain. The companies use DAaaS withthe intention of integrating cross-organizational businessprocesses to maximize data-driven value and enable rapidinnovation.

The security solutions and challenges we discuss in thispaper relate to securing data which is stored on a CSP’sinfrastructure and is subject to limited access by multipleorganizations. These organizations are collaborating witheach other, and data analytics is crucial for them collectively.

Although cloud data outsourcing provides many benefits,it also raises security and privacy concerns. Ensuring dataprivacy is not a trivial matter, because data owners losecontrol of their data in cloud environments. Threats to privacycome from both outside and inside, so data privacy mustbe guaranteed not only against external adversaries breakinginto the system, but also malicious insiders and so-called

VOLUME 4, 2016 1

Page 2: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

honest but curious CSPs [9].

A straightforward solution to the problem of data privacyis to encrypt sensitive data locally in a trusted environmentbefore sending it to a CSP. Encryption protects sensitive dataeven if the server is compromised. However, encrypting dataincurs additional storage costs, and query processing overencrypted data is both non-trivial and costly in terms ofcomputing power in the cloud, and thus money. Moreover, inaddition to overheads, the levels of security and functionalityare other issues which must be addressed properly. Typically,existing solutions from the literature [10], [11], [12], [13]cannot provide a seamless and complete solution which guar-antees secure cloud data analytics, because they fail to dealadequately with at least one of these issues [14]. SearchableEncryption (SE) allows keyword searches over encrypteddocuments [93], [94]. However, SE schemes involve a trade-off between security and efficiency.

A common limitation of SE schemes is their prohibitivecomputational cost. In the context of cloud outsourcing,access control can be achieved by using attribute-based en-cryption (ABE), which enables the decryption of data whena user has certain attributes which satisfy the access structure[95]. However, the main disadvantages of ABE schemes aredeficiency and functionality issues. Consequently, choosingan efficient cryptographic scheme with adequate levels of se-curity and functionality for computing analytical queries overencrypted data, which would thus be an effective solution forsecure cloud-based data analytics, is still a challenge.

In this paper, we give a comprehensive overview of securecloud-based data analytics which provides an easy entrypoint for researchers with no cryptographic background. Weintroduce a general architecture and system model which willaccommodate a wide range of uses of cloud data analytics.This architecture incorporates a specific adversary and se-curity model, and we provide an overview of cryptographictools for ensuring data privacy with an emphasis on function-ality and efficiency in terms of overheads. We then presentand compare practical secure solutions from the literature andidentify various security and efficiency issues. A comparativeanalysis of security, overhead and functionality issues ispresented at the end of the article in order to provide aseries of guidelines and insights to assist in designing anddeveloping secured cloud-based data analytics with a level offunctionality and efficiency which meets user requirements.The rest of the paper is organized as follows: we presentour reference system architecture, adversary and securitymodel for cloud data analytics in Section II. We classify thevarious cryptographic schemes which can be implementedin the reference system for preserving privacy in Sectionn III. We then review, discuss and compare practical solutionsin Section IV, while limitations and unresolved issues arediscussed in Section V. Finally, conclusions are drawn inSection VI.

II. ASSUMPTIONS AND REFERENCE MODELIn this section, we introduce the main actors and theiractivities in relation to data analytics in the cloud. Then,we describe the ways in which the security of cloud-baseddata analytics can be compromised. Since we review crypto-graphic approaches for preserving privacy, we define a stan-dard security model to specify the level of security guaranteesdesired in cloud-based data analytics. Taking an adversaryand security model as our basis, we elaborate the systemgoals which must be achieved in the cloud context

A. ACTORS AND ACTIVITIES IN CLOUD DATAANALYTICSA cloud-based data analytics system allows a server to storeand query encrypted data on behalf of a user without gaininginformation about the underlying data.

Figure 1 depicts the basic system model and architectureof cloud-based data analytics, which consists of three mainentities: the Data Owner, the Cloud Service Provider andAuthorized Users.The Data Owner (DO) is an entity which has a large amountof data to be outsourced in the cloud, and can be an individualuser, or a large, small or medium business or enterprise.The Cloud Service Provider (CSP) provides data storageservices and computational resources.Authorized Users (AU): the DO allows the authorized users,which could be other enterprises or individual users, to usethe outsourced data. AU can query encrypted data and re-trieve encrypted results and decrypt them to get correspond-ing plaintext.

In a secure cloud-based architecture, a trusted component(TC) must execute confidential tasks such as key manage-ment functions, query rewriting and post-processing, anddecryption. The DO and the AU are assumed the TC in somesettings (TC-DO and TC-AU in Figure 1). In this setting, theTC must be installed and maintained on each AU and DOseparately. To avoid installing the TC in several machines, theTC is set between the client (i.e., the DO or the AU) and theCSP [12], which is called the proxy server (TC-PS in Figure1). In this setting, the proxy server intercepts client/servercommunications and performs data/query encryption anddecryption on behalf of the user. The proxy server maintainssome metadata and the private keys for data encryption andquery transformation. Finally, tamper-proof hardware can beembedded at the CSP (TC-TPH in Figure 1), which performsencryption/decryption and query rewriting [15]. The AU alsoneeds to perform a few security tasks like query rewriting anddecryption. Typically, the interactions in this architecture areas follows:1) The DO outsources data to a CSP (or multiple CSPs [16],[17], which are non-colluding with each other or with otherexternal adversaries) in encrypted form while still maintain-ing the capability to query the data efficiently.2) An AU encrypts a query and sends the encrypted query tothe CSP.3) The CSP executes the query over encrypted data and

2 VOLUME 4, 2016

Page 3: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

FIGURE 1. System architecture for securing cloud-based data analytics

returns encrypted results to the AU.4) Finally, the AU decrypts the results.All interactions between the DO/AU and the CSP can beexecuted through the proxy server, TC-PS, when the TC-PSis the TC.

B. ADVERSARY MODEL

An adversary model specifies an adversary’s ability tothreaten security. In cloud outsourcing scenarios, the adver-sary can be either honest but curious (passive), or malicious(active). Honest but curious is a widely-used adversary modelfor cloud outsourcing scenarios [18], [19], [20], [10], [11],[21]. An honest but curious CSP or insider faithfully com-plies with any service-level agreement (SLA) and stores data,runs computations and queries, and provides results withoutalteration. However, such CSP may access data and gaininformation by inferring from queries and results.

A malicious adversary can manipulate data and queryresults, and even delete stored data, which compromisesintegrity and availability. Since current known CSPs are well-established companies such as Google or Amazon, it is hardto see the possibility of them behaving maliciously, as thiswould damage their reputation and have a negative impact ontheir revenues [22]. Nevertheless, the malicious adversary istaken into consideration in some settings [13], [23]. In thispaper, we consider the honest but curious adversary model,i.e., the CSP or insider.

C. REFERENCE SECURITY MODELSecurity model is used to define the security guarantees ofan encryption scheme. Basically, an encryption scheme ischaracterized by Π=(M, C,K, Enc,Dec), where M is allpossible plaintexts, C is a set of all possible ciphertexts, andK a set of all possible keys belonging to a key space. Enc isa randomized algorithm where Enc : K ×M −→ C andDec is a deterministic algorithm Dec : K × C −→ M.For all possible plaintexts, the following property should besatisfied [24].

∀m ∈M, ∀k ∈ K : Dec(k,Enc(k,m)) = m.

There are two fundamental types of encryption schemes:symmetric-key and public-key encryption. In symmetric-keyencryption, the encryption and decryption keys are identicalwhile in public-key encryption there is a key for encryptionand another key for decryption. In public-key schemes, theencryption key is publicly available for encryption, but onlythe authorized users who have access to the decryption keycan decrypt ciphertexts. We use PPT to denote the class ofalgorithms that are in probabilistic polynomial time.

To define the security model of an encryption scheme,the first step is defining a security goal. Then, it must beexamined whether this goal can be achieved by an adversary.To this end, an experiment is defined between the adversaryand a challenger and the probability that the adversary winsthe experiment is studied. This experiment is called security

VOLUME 4, 2016 3

Page 4: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

model. An adversary’s advantage is a measure of how muchmore successful it is at winning the experiment compared tothe random guess.

For an encryption scheme Π , the following game is de-fined between an adversary A, which tends to break thesystem and a challenger, C, who receives as input variableb where b ∈ {0, 1}. First, the adversary chooses two equal-length plaintexts m0 , m1 ∈ M. The challenger picks oneof two plaintexts, encrypts it and sends the result to theadversary. Then, the adversary should not be able to guesswhich plaintext is encrypted by the challenger. This modelis known as IND-CPA (Indistinguishability under chosen-plaintext attack), which is formalized in the following defi-nition.

An encryption scheme, Π , is IND-CPA if for any PPTadversary A, the probability that A "wins" the followinggame is negligible.

1) Challenger C takes a random key k2) Adversary A chooses m0, m1 ∈ M such that |m0| =|m1| and sends them to C

3) Challenger C chooses b ∈ {0, 1} uniformly at randomand sends cb = Enc(k,mb) to A

4) A outputs b′ ∈ {0, 1}, and is said to win if b′ = b

In other words, if we define the advantage of the adversaryA as follows.

AdvcpaA,Π = Pr[b′ = b]− 1/2

then, the encryption scheme Π is IND-CPA secure ifAdvcpaA,Π is negligible, i.e., A cannot guess b except withprobability close to 1/2.Note that in the above game, the adversary can issue severalqueries adaptively one after the other in order to simulate theability of distinguishing multiple ciphertexts encrypted underthe same key.

IND-CPA implies that the knowledge of ciphertexts pro-vides no information about the underlying plaintexts for anadversary, i.e., ciphertexts leak no information about theplaintexts.

D. OBJECTIVES OF CLOUD-BASED DATA ANALYTICSSecure cloud-based data analytics must target the followinggoals.• Privacy: Data privacy must be preserved in cloud-

based data analytics. Encryption is a promising solutionbecause the adversary and the CSP can access onlyencrypted data, but ciphertexts may leak some informa-tion and compromise data privacy. Hence, IND-CPA isdesired to achieve this in cloud-based data analytics.

• Functionality: Proposed solutions should efficientlysupport different kinds of analytical workload, i.e., exactmatch queries (equality checking, GROUP BY,JOIN,DISTINCT), range queries (inequality checking,SORT, ORDER BY), and aggregation queries (SUM,AVG, MIN, and MAX) should be supported efficiently

and effectively [17]. Moreover, computations over mul-tiple columns in a table in a database must be handled.For instance, multiplication of two columns, C = A ×B.

• Efficiency: A secure cloud-based solution should be ef-ficient in terms of computational, storage, and commu-nication overhead. Computational and storage overheadshould be minimized for both the user and the CSP.Basically, in data outsourcing scenarios, it is consideredthat the user has limited storage and computationalresources, thus, that overhead must be diminished as faras possible. At the CSP, the pay-per-use model of cloudcomputing is a good reason to minimize the total over-head of the system. Communication overhead impliesthe number of intermediate results that is transferred tothe user for query post-processing.

III. CRYPTOGRAPHIC METHODSThere are two main cryptographic approaches that enableprocessing over encrypted data without decryption. We de-scribe these cryptographic schemes, which can be imple-mented in practical systems for preserving data privacy inthis section.

A. HOMOMORPHIC ENCRYPTIONHomomorphic Encryption (HE) allows performing arbitraryarithmetic operations over encrypted data without decryp-tion [25]. HE provides IND-CPA security. Typically, an HEscheme is a scheme with an additional evaluation algorithm,Eval, to process over encrypted data [26]. The evaluation al-gorithm takes a public key, Pk, a function, f , two ciphertexts,c1 and c2 as inputs and outputs ciphertext c∗ where c∗ =Eval(pk, f, c1, c2). In other words, Eval manipulates theencryption of two plaintexts m1 and m2, c1 = Enc(pk,m1)and c2 = Enc(pk,m2), and outputs Enc(pk, f(m1,m2))(Figure 2).

If an HE scheme allows performing arbitrary computation,then the scheme is called Fully Homomorphic Encryption(FHE) [25]. FHE allows performing arithmetic operations(+,−,×,÷) over encrypted data without decryption. FHEis a powerful scheme and offers the highest level of securityand certainly has a role to play in privacy preserving queryprocessing [27]. FHE scheme is first introduced in [25],followed by other researches to improve the performance ofthe original scheme [28], [29], [30], [31], [32], [13], [33],[34], [35]. Even though many improvements e.g., reducingencryption key size or eliminating some complicated phases,FHE is prohibitively slow and requires so much computingpower that it cannot be used in practice. As a result, buildingan efficient and usable FHE is still a great challenge.

Partially Homomorphic Encryption (PHE) for specific op-erations is efficient and can be used in practice. PHE allowseither addition or multiplication over encrypted data, but notboth. PHE offers the same security guarantees, IND-CPA,

4 VOLUME 4, 2016

Page 5: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

𝑚1, 𝑚2

𝑐1 = 𝐸𝑛𝑐(𝑝𝑘,𝑚1)𝑐2 = 𝐸𝑛𝑐(𝑝𝑘,𝑚2)

𝐸𝑛𝑐(𝑝𝑘, 𝑓(𝑚1, 𝑚2))

𝑓(𝑚1, 𝑚2)

𝑓

𝐸𝑣𝑎𝑙(𝑓)

𝐷𝑒𝑐𝑟𝑦𝑝𝑡𝐸𝑛𝑐𝑟𝑦𝑝𝑡

FIGURE 2. Homomorphic encryption

while being more efficient and closer to practical implemen-tation. If a PHE scheme allows addition or multiplication,it is referred to as additive homomorphic and multiplicativehomomorphic, respectively.

Paillier’s scheme [36] is an example of additive homo-morphic scheme and is currently the most efficient. WithPaillier’s scheme, multiplying the encryption of two valuesis equal to the encryption of the sum of the values, i.e.,Enc(k,m1) × Enc(k,m2) = Enc(k,m1 + m2), wheremultiplication is performed modulo some public-key k [12].Paillier’s scheme is used in practical solutions [21], [12] tocompute SUM and AVG aggregation queries over encrypteddata without decryption.

B. PROPERTY-PRESERVING ENCRYPTIONProperty-Preserving Encryption (PPE) preserves certainproperties of plain- texts on the corresponding ciphertexts,which enables the CSP to compute over encrypted data.Typically, in PPE ideal security is relaxed to provide moreefficient solutions [37]. Order-preserving encryption (OPE)and deterministic encryption (DET) are examples of PPEschemes.

1) Deterministic EncryptionDeterministic (DET) encryption encrypts the same plain-text into identical ciphertexts, thus the equality property ispreserved. DET allows equality checking by encrypting aplaintext into the same ciphertexts when using the same key.Thus:

∀k ∈ K, ∀m1,m2 ∈M, Enc(k,m1) = Enc(k,m2), iffm1 = m2.

DET allows performing SELECT with equality predicates,equality JOIN, GROUP BY, COUNT, and DISTINCTqueries [38]. DET is not IND-CPA secure. The followinggame shows that a DET encryption scheme, ΠDET , is notIND-CPA secure.

1) Challenger C chooses b ∈ {0, 1} uniformly at random

2) The adversary A chooses m0,m1 ∈ M so that m0 =m1 = m∗

3) C sends c∗ = Enc(k,m∗) to A4) A chooses m′0 = m∗, m′1 6= m∗5) C sends c = Enc(k,m′b) to A6) A outputs 0 if c = c∗ and 1 otherwise

Hence, AdvcpaA,Π=1/2, i.e., non-negligible.As a result, the adversary can learn data duplicates, whichleads to information leakage. In order to define the securityof DET schemes, the IND-CPA security game is replaced bya new security game, IND-DCPA ( Indistinguishability un-der Distinct Chosen Plaintext Attacks), where the adversaryis restricted to pick distinct plaintexts [39]. An encryptionscheme, Π , is IND-DCPA if for any PPT adversary A, theprobability that A "wins" in the following game is negligible[40].

1) Challenger C takes a random key2) Adversary A chooses m0 and m1 where m0 and m1

are distinct3) Challenger C chooses b ∈ {0, 1} uniformly at random

and sends cb = Enc(k,mb) to A4) A outputs b′ ∈ {0, 1} and is said to win if b = b′

In the other word, considering the advantage of the adver-sary as follows

AdvdcpaA,Π = Pr[b′ = b]− 1/2

the encryption schemeΠ is IND-DCPA secure if AdvdcpaA,Π isnegligible.

IND-DCPA is weaker than IND-CPA, but they are equiv-alent when the domain of plaintexts contains unique values[41].

2) Order-Preserving EncryptionOrder-Preserving Encryption (OPE) is a deterministic en-cryption scheme that preserves the order of plaintexts inciphertexts [42], i.e., for any key k,

VOLUME 4, 2016 5

Page 6: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

k ∈ K, ∀m1,m2 ∈M, Enc(k,m1) ≤ Enc(k,m2), iffm1 ≤ m2.

OPE allows performing range queries when given encryptedconstraints Enc(k, c1) and Enc(k, c2) corresponding torange [c1, c2]. Aggregation queries MIN, MAX, ORDERBY, and SORT can also be computed directly over encrypteddata.Since OPE is deterministic it is not IND-CPA secure. Thesecurity definition of OPE is defined by Boldyreva et al. in[42] which is called IND-OCPA (Indistinguishability underOrdered Chosen Plaintext Attacks). An IND-OCPA securescheme hides all information about the plaintext values ex-cept the order, which is a minimum requirement for order-preserving property. An encryption scheme, Π , is IND-OCPA if for any PPT adversary A, the probability that A"wins" in the following game is negligible [42].

1) Challenger C takes a random keyFor i = 1, . . . , q

a) Adversary A chooses mi,0 and mi,1 where|mi,0| = |mi,1|

b) Challenger C chooses b ∈ {0, 1} uniformly atrandom and sends cb = Enc(k,mi,b) to Awherem1,0, . . . ,mq,0 are distinct andm1,1, . . . ,mq,1

are distinctand m`,0 < mj,0 ↔ m`,1 < mj,1, 1 ≤ `, j ≤q

c) A outputs b′ ∈ {0, 1} and is said to win if b = b′

In the other word, considering the advantage of the adver-sary A as follows

AdvocpaA,Π = Pr[b′ = b]− 1/2

the encryption scheme is IND-OCPA secure if AdvocpaA,Π isnegligible. OPE is a weaker encryption scheme than DETbecause in addition to leak data duplicates, it reveals the orderof plaintexts.

Different OPE schemes are proposed in both the databaseand cryptography community. OPE is introduced in thedatabase community by Agrawal et al. as a tool to supportefficient range queries over encrypted data [43], which mapseach value of the plaintext domain to one value in theciphertext domain. This scheme bears weak privacy pro-tection because the original data values can statistically beestimated by an adversary who has access to ciphertexts[44]. Damiani et al. propose an order-preserving indexingmethod, which preserves the order of plaintexts over in-dexes [45]. A B+tree index is built over plaintexts and isstored at the CSP in encrypted format. In order to queryprocessing, the encrypted tree must be travelled by the CSP.However, the user should perform a sequence of queries toretrieve nodes, which induces communication overhead. Theproposed scheme provides IND-OCPA security, but incursheavy communication and computation overhead. Sobati-Met al. introduce an order-preserving indexing method, whichdestroys the frequency distribution of plaintexts [37]. Whilst

the proposed indexing method provides IND-OCPA security,it is efficient in terms of communication and computationoverhead. Boldyreva et al. introduce a new OPE with arandom mapping that preserves order [42]. The proposedscheme cannot provide IND-OCPA because it leaks at leasthalf of the plaintext bits (i.e., more information than OPE)[46].

Popa et al. introduce the first practical IND-OCPA schemein the cryptography community, mutable order-preservingencoding (mOPE) [46]. mOPE requires an interactive pro-tocol for query processing. Additionally, mOPE relies onuser-defined functions (UDFs) for query processing, whichmakes it unsuitable for cloud outsourcing. Liu et al. introducean OPE scheme that randomly splits the original plaintextdomain into successive intervals with different lengths [44].Then, an extended ciphertext domain is selected and splitinto the same number of intervals. Finally, nonlinear mappingfunctions map the original plaintexts into ciphertexts in theextended domain. The proposed method partially destroysthe distribution of original data and reveals some informationabout underlying values, which breaks IND-OCPA security.

Table 1 shows the comparison between existing OPEschemes. Among all OPE approaches, only Boldyreva’sscheme is implemented in some practical solutions.

TABLE 1. Comparison of OPE schemes

Scheme Interactive IND-OCPAAgrawal 8 8Damiani 4 4

Sobati 8 4Boldyreva 8 8

Popa 4 4Liu 8 8

IV. ENCRYPTION-BASED PRACTICAL SOLUTIONSBuilding a secure cloud-based analytics system has beendiscussed briefly in the literature with limited practical con-tribution. Some solutions have nonetheless been proposed.The most important challenge of secure solutions is weaksecurity guarantees or heavy overhead in the systems wereview in this section, to the best of our knowledge.

For each practical solution, we explain the main idea,the architecture, how data are organized at the CSP, andquery processing. Then, we analyze the advantages and thedeficiencies of each solution in terms of security and perfor-mance.

A. BUCKETIZATIONBucketization is introduced as a privacy-preserving methodin [10], [11] that allows partial execution of a query atthe CSP with the help of indexes 3. Bucketization supportsqueries over encrypted data without decryption. Queries areevaluated in an approximate manner, thus the returned results

6 VOLUME 4, 2016

Page 7: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

(𝑎)

(𝑏)

(𝑐)

FIGURE 3. (a) An example of a relation R, (b) partition function used for attribute eid, and (c) the encryption relation RS stored at the CSP [10]

may contain some false-positives. The final result is found bydecrypting the data and executing query post-processing atthe users.

Bucketization divides data into buckets and provides ex-plicit labels for each bucket [47]. The domain of plaintexts ispartitioned into a set of non-overlapping buckets (subsets),with the same size (or maybe different size). A label isdefined for each bucket and may or may not preserve theorder of values in the original domain. Then, bucket labels arestored along with encrypted values at the CSP. These labelsallow equality, range (if preserving order) and join queriesat the CSP without decryption. Bucketization-based indexingusually returns false-positives in the result of a query. Thus,query post-processing is needed at the users to filter out false-positives [48].

Any relationR in a database is encrypted in tuple-level andauxiliary indexes, i.e., buckets’s labels, are used by the CSPfor query processing. Thus, the relation R(A1, A2, ..., An) isstored at the CSP as: RS(etuple, AS1 , A

S2 , ..., A

Sn), where the

attribute etuple corresponds to an encrypted tuple, and eachASi corresponds to the index for the attribute Ai. An exampleis shown in Figure 3.

The DO must maintain metadata such as the label of bucketfor transforming queries to the appropriate representation onthe CSP and performing query post-processing.

In Bucketization architecture, queries are transformed bya query translator and a query executor, which both residewith the DO. For query processing, the AU poses the queryto the query translator.Each query is transformed into the server-side and user-sidesub-queries, Qs and Qc, respectively (Figure 4). Qs is exe-cuted by the CSP over encrypted data using corresponding

indexes (ASi s).The result of Qs is sent back to the query executer, which

decrypts the result of Qs and executes Qc to filter false-positives and retrieve the final result (post-processing).

Query execution in bucketization operates as follows.– Step1.

The AU sends a query Q to the query translator.The query translator rewrites Q into Qs and Qc. Tothis end, the query translator uses some metadata(e.g.,the buckets boundaries) and replaces the plain-text values in the query with bucket boundaries.Then, the query translator sends Qs to the CSP.

– Step2. The CSP executes Qs and returns all en-crypted records back, whose bucket numbers sat-isfy the query constrains. The encrypted results aresent to the query executor.

– Step3. The query executor decrypts the results andexecutes Qc to eliminate false-positives and sendsthe final results to the AU.

Bucketization provides efficient query processing whilekeeping the information disclosure to a minimum. Further-more, query evaluation is often much simpler than crypto-graphic schemes [49].However, there is a trade-off between security and efficiency.The smaller number of buckets leaks less information andincreases security, but induces more false-positives in queryresults and more computational overhead at the AU.

When the labels are order-preserving, IND-CPA securityis not guaranteed. Note that here the encrypted data is in the

VOLUME 4, 2016 7

Page 8: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

𝐀𝐔

𝑸𝑺

𝑸𝒄

𝐂𝐒𝐏

FIGURE 4. Bucketization query workload [10]

form of (ciphertext)||(bucketlabel), e.g., in Figure 3 the ci-phertexts of attribute eid are in the form of etuple||eidS . Ad-ditionally, bucketization-based indexing reduces data gran-ularity. All values in a row are encrypted together, whichmeans that all encrypted rows must be shipped back to theAU inducing communication overhead.

B. CRYPTDBCryptDB is a secure DataBase Management System (DBMS)developed at MIT [12] with both academic [20], [50], [18],[21], [51], [52], [53], [54], [55], [56] and industrial [57],[58], [59], [60], [61] impacts [38], e.g., Google [57] andSAP [58] produced their own CryptDB-inspired solutions.CryptDB aims at providing data privacy guarantees in theface of a compromised server and a honest but curious CSPby data encryption.CryptDB uses a set of encryption schemesbased on the queries issued by the AU [38] and adoptsdifferent kinds of encryption schemes, i.e., PPE and PHE,which are dynamically adjusted depending on the queries[26]. Encryption in CryptDB is like onion layers that storemultiple ciphertexts within each other (Figure 5).

For instance, given two encryption schemes Enc1 andEnc2, the encryption of a value m is defined as:

c = Enc1(k1, (Enc2(k2,m))).

The outermost onion layers provide the highest level ofsecurity, IND-CPA security, whereas the inner layers, provide

more functionality and less security guarantees, e.g., IND-DCPA for DET layer.

Each value in a relation is encrypted independently.Numeric values are maintained in three different onions,OnionEq, OnionOrd, and OnionAdd, which are used forequality checking, range queries, and aggregations SUM andAVG, respectively. In other words, for each value CryptDBstores three encrypted values at the CSP. At the CSP, queryprocessing is computed using UDFs.

In CryptDB architecture, a proxy server intercepts com-munications between the user and the CSP and appliesen/decryption of queries and results (Figure 6).

OPE: order comparison

FIGURE 5. An example of onion encryption layers in CryptDB [12]

To create an encrypted database from a database, the proxyserver generates a master secret key and uses it to encrypteach relation in the database [62]. The proxy server stores the

8 VOLUME 4, 2016

Page 9: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

FIGURE 6. CryptDB architecture [12]

master key and the scheme of the database and uses them forquery rewriting. When a query is issued, the proxy dynami-cally peels off onion layers down to a layer corresponding tothe given computation. Onion layers are adjusted by sendingthe related key and updating the columns. Then, the proxyserver anonymizes each table and column name and encryptseach constant in the issued query with an encryption schemecorresponding to the requested operation.

Query execution in CryptDB operates as follows.

– Step1. The AU sends a query Q to the proxy. Theproxy rewrites Q into Qs operating at the CSP.To this end, the proxy encrypts all constants in Qadopting the encryption scheme that best suits theoperation to be computed [26].

– Step2. The proxy checks if the CSP should be givenkeys to remove some encryption layers. In this case,the proxy issues an UPDATE query that removesspecific layers of encryption, and then sends Qs tothe CSP.

– Step3. The CSP computes Qs and returns the en-crypted results to the proxy.

– Step4. The proxy decrypts the encrypted results andsends the final results to the AU.

For instance, in order to evaluate a range query, each value ofan attribute is encrypted using an OPE scheme (first encryp-tion layer). Then, the resulting OPE ciphertexts are encryptedwith a randomized encryption (RND) scheme (second en-cryption layer). A randomized encryption scheme encryptsthe same values into different ciphertexts using the samekey and allows no computation whilst providing the highestlevel of security, IND-CPA. When a range query is issued,the proxy server sends the decryption key for the secondencryption layer to the CSP. The CSP then decrypts therandomized ciphertexts and gets access to order-preservingciphertexts. Hence, the CSP learns the order of values andevaluates the range query [63] and sends the encrypted valuesto the proxy server.

CryptDB supports standard SQL queries over encrypteddata and needs no change to existing applications. Basi-

cally applications can transparently run on top of CryptDB.However, CryptDB is targeted for transactional workloads.It is not feasible to run analytical workloads over CryptDB[64]. As a result, CryptDB supports only 2 queries out of 22queries from the TPC-H benchmark.

Some computations are not supported, for instance, com-puting both summation and comparison on encrypted values.Thus, to execute the following query.Q1 :

SELECT SUM(price) AS totalFROM orders

GROUP BY order_idHAVING total >100

CryptDB cannot check if total >100 at the CSP, be-cause total is encrypted with a PHE scheme, i.e., thePaillier scheme, which does not support order comparison[21].

Complex queries over multiple columns in a relation, forinstance, the following queryQ2 :

SELECT SUM(t1.B ∗ t2.A) FROM T as t1 WHEREt1.A=t2.B.

cannot be evaluated because multiplication of two encryptedvalues is not supported in CryptDB.

Moreover, onion layers become an overhead as well aspeeling off a layer, especially in the case of big tables.CryptDB performs almost all queries at the CSP with arelatively small overhead in terms of query throughput [12].However, the throughput is worst for queries that use PHE,such as summations because, in order to compute summa-tions, CryptDB implements modular multiplications at theCSP. Improving the performance of aggregations is importantbecause aggregation is common in data analytics.

Although onions offer multiple levels of security but revealdifferent information about the data [22]. It is easy to seethat security decreases over time when the outer layersare removed. Because adjustable encryption architecture in

VOLUME 4, 2016 9

Page 10: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

CryptDB is unidirectional, i.e., once a column is decryptedto a weaker scheme like DET, it never returns to a higherencryption level [65].

C. MONOMIMONOMI extends CryptDB’s functionality to support ana-lytical queries [21]. In contrast to CryptDB that focuses ontransactional workloads, MONOMI mainly targets analyticalworkloads [64].Instead of using the trusted proxy server, MONOMI splitsquery processing between the CSP and the user. Usinguser/server query splitting, MONOMI executes as much ofthe query as is practical over encrypted data at the CSPand executes remaining computations by the user, whichdecrypts data and processes queries further [66]. To this end,MONOMI introduces an optimized designer and a planner.The designer chooses an appropriate database design basedon the target workload. Like CryptDB, MONOMI imple-ments different cryptographic schemes, but unlike CryptDB,there is no onion of encryption. To choose a set of encryptionschemes, MONOMI uses an optimizing designer similar tophysical designers used by other databases, which takes thekind of computations that are likely to appear in futurequeries, e.g., SUM for attribute Salary (Figure 7). In otherwords, the designer can be considered similar to automatedindex selection and materialized view selection tools.

MONOMI designer is invoked when a database is loadedby the DO. The DO must also provide a query workloadQ1, Q2, . . . , Qn to represent the operations that the AUwill perform over data. Given the user’s inputs, MONOMIdesigner provides a physical design consisting a set of en-cryption schemes for each relation. For each query Qi i =1, . . . , n, the designer determines a set of required encryptionschemes and invokes the planner to determine how to bestexecuteQi given the encryption schemes. MONOMI plannerdetermines different plans by determining for each plan whatparts ofQi would be executed by the AU and what parts at theCSP. Then, a cost model is used by the planner to estimate thecost of each plan (e.g., execution times). The planner choosesthe fastest plan for Qi and denotes the corresponding subsetof encryption schemes.

Once the best plan is determined for each query, thedesigner takes the union of the required encryption schemesand uses them for physical design. Hence, a plaintext valueis encrypted using different cryptographic schemes. Further,the planner selects the query execution path for each querygiven a particular physical design.

In order to execute queries that cannot be computed at theCSP alone, MONOMI partitions query execution across theCSP, which has access only to encrypted data, and the AU,who has access to the decryption keys.However, query splitting technique cannot be optimal inall situations and depends on data. To choose an optimalsplitting, MONOMI models the cost of query execution foreach query plan as the sum of execution time at the CSP,

transfer time, and post-processing time at the AU’s side.Then, the lowest-cost plan is chosen.

Query execution in MONOMI consists the following steps.

– Step1. The AU sends a query Q to MONOMIplanner. The planner computes different plansPQ1 , P

Q2 , . . . , P

Q` , where PQi i = 1, . . . , ` con-

sists of three sub queries Qsi , Qei , and Qci . The sub-

query Qsi is performed by the CSP over encrypteddata, Qei asks for retrieving some encrypted dataRei = {e1, e2, . . . , en} from the CSP, and Qciis performed by the AU over retrieved encryptedvalues, Rei , after decryption.For each plan PQi i = 1, . . . , `, the plannercomputes a costCQi = t1+t2+t3 where t1, t2, andt3 are execution times at the CSP, transferring time,and post-processing time by the AU, respectively.Then, a plan with the lowest-cost is selected by theplanner. We show the selected plan as PQselect. Theplanner sends Qsselect and Qeselect to the CSP.

– Step2. The CSP computes Qsselect over encrypteddata and returns encrypted results, Rs, to the plan-ner. The CSP also retrieves encrypted values, Re,corresponding to Qeselect and sends them to theplanner, too.

– Step3. The planner sends the results Rs and Re tothe AU. The planner also sends Qcselect to the AU.

– Step4. The AU decrypts encrypted values inRe andexecutes Qcselect over unencrypted values. The AUalso decrypts Rs to obtain the final results.

Like CryptDB, MONOMI implements PHE [36] for com-puting aggregation queries SUM and AVG, which is compu-tationally intensive and induces a large ciphertext size. How-ever, large ciphertext size imposes storage cost and affectsquery processing. Additionally, PHE requires modular multi-plications, which is computationally expensive. To overcomethese drawbacks, MONOMI introduces some optimizationtechniques. To avoid performing separate modular multipli-cations, MONOMI concatenates several plaintext values intoa single larger plaintext that will be encrypted as a singlevalue. In this way, MONOMI reduces storage overhead andsimultaneously decreases the number of modular multiplica-tions.

Query splitting enables executing more queries with op-timal costs. However, it requires that queries are declaredahead of time by the user, which is not possible for allscenarios, especially for ad-hoc analytical workloads. More-over, it is essential to cut down the bandwidth required totransfer intermediate results and computation and storageresources for the AU for query processing [21], becauseresources usage at the AU must be minimum for maintainingthe benefits of outsourcing.

Optimization techniques used by MONOMI improve per-formance, but also induce some limitations. Packing valuesinto a single value reduces storage overhead and also amor-tizes computation overhead, but makes impossible partial

10 VOLUME 4, 2016

Page 11: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

DO

AU

MONOMI Planner

FIGURE 7. MONOMI architecture [21]

operation or partial query processing. For example, whenpacking multiple values in a column into a single plaintext,a query that updates only a part of the column cannot beexecuted.

D. SDBSecureDB (SDB) is a secure query processing system, whichallows a wide range of complex queries to be computedby the CSP [67]. SDB introduces a new encryption schemebased on so called secret sharing. In secret sharing, eachplaintext value is split into several shares that are stored atn users’ [68]. One single user has no means to reconstructthe secret, but a subset of k ≤ n users can reconstruct thesecret [69].

Each plaintext value, mi, is split into two shares, onekept at the DO, which is referred as item key, mk, andanother at the CSP, which is regarded as ciphertext, me. Inorder to reduce data storage at the DO’s side, item keys isconsidered to be the same for all values. In order to encryptthe plaintext values in an attribute, first a random secret key,k, is defined. Then, for each row of attribute a random row-id, ri, is defined, too. Then, the row-ids are encrypted by thesecret key k, mk = Enc(k, ri) and mi is encrypted usingencrypted row-id, mk, which is me = Enc′(mk,m). Thevalues of row-ids are encrypted and stored along with theciphertexts at the CSP (Figure 8).

The proposed scheme simulates FHE, which leads to a

wide range of queries supported by the CSP. The proposedencryption scheme supports data interoperability , i.e., theoutput of an operator is used as the input of another operator.SDB computes complex analytical queries over multiplecolumns in a table using data interoperability. For instance,computing "A + B > 1000k" is possible in SDB, butthe same computation is not supported by the other securesystems like CryptDB.

SDB architecture is similar to CryptDB’s in which a proxyserver is set between the DO/AU and the CSP.

Query execution in SDB operates as follows.– Step1. The AU sends a query,Q, to the proxy server.

The proxy rewritesQ into sub queriesQs operatingat the CSP and Qc which is computed at the proxyserver.

– Step2. The proxy server executes Qc and producesnew secret keys.

– Step3. The proxy server sends the new secret keysand Qs to the CSP. Note that the new keys gen-erated by the proxy are needed for computing thesub-query Qs at the CSP.

– Step4. The CSP computes Qs and returns the en-crypted results to the proxy.

– Step5. The proxy decrypts the encrypted results andsends the final results to the AU.

SDB sends an UPDATE query to the CSP for some oper-ations. For instance, consider two columns A and B, whichare encrypted with two different keys, ak and ab. To compute

VOLUME 4, 2016 11

Page 12: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

𝑚 𝑚 𝑚𝑘 𝑚𝑒

𝑚𝑒

𝑚 𝐸𝑛𝑐 (𝑘, 𝑟𝑖 )

DO

FIGURE 8. Encryption of attribute A in SDB [67]

C = A + B, the proxy server generates a new key ck usingak and ab and sends ck to the CPS. The CSP executes theUPDATE query to encrypt A and B with the new key, ck, andgets new columns A′ and B′, and then adds the encryptedvalues of A′ and B′.

SDB requires modular multiplication for decryption,which induces heavy computational overhead at the proxyserver. Computations at the CSP also consist of modularexponential, which is expensive and affects query processing.While the proposed scheme supports a wide range of queries,there are still some queries that are not supported, e.g.,keyword search over encrypted strings.

Security guarantees in SDB is the highest, i.e., IND-CPA security. Yet, IND-CPA reveals no information aboutthe underlying plaintexts, which makes impossible usingoptimization techniques, e.g., B+tree indexing. As a result,query processing needs scanning the whole database, leadingto poor performance. Like CryptDB and MONOMI, SDBstrongly relies on UDFs at the CSP for query processing,which makes it unsuitable for cloud-based scenarios becausethe DO has no permission to create UDFs, e.g., small enter-prises deploy their secure web-based system using the renteddatabase [70].

E. TRUSTEDDBTrustedDB is an outsourced database prototype that im-plements tamper-proof trusted hardware for secure queryprocessing in the cloud [71]. A secure co-processor (SCPU)such as the IBM 4764/5 [72] is deployed at the CSP’s sideto compute query processing over encrypted data. The SCPUprovides a secure computation environment. However, SC-PUs are constrained in both computation ability and memorycapacity. Hence, data are classified as being either privatesensitive or insensitive public [73]. The former are encrypted

and operations are performed inside the SCPU, and the latterare stored unencrypted at the CSP. All sensitive data areencrypted by the DO before uploading to the CSP. The entiredatabase is stored outside the SCPU. Data are encryptedusing a symmetric encryption scheme such as AES. Somerandomness are also added to ensure IND-CPA security.However, such randomized encryption schemes allow nocomputation over encrypted data. Sensitive encrypted datathat need to be accessed by the SCPU for query processingare pulled in by the SCPU.An issued query is rewritten by the AU into two sub-queries,which are executed by the CSP and the SCPU [15]. TheSCPU in TrustedDB consists of a Request Handler, a QueryParser, and a Query Dispatcher to handle query rewriting(Figure 10).

Query execution in TrustedDB operates as follows.– Step1. The AU encrypts a query using the public

key used by the SCPU and sends the encryptedquery to the CSP.

– Step2. The CSP forwards the encrypted query tothe SCPU. The Request Handler (RH) receives theencrypted query.

– Step3. The RH decrypts the query and sends theunencrypted query to the Query Parser (QP).

– Step4. The QP rewrites the query into three sub-queries, Qs, Qe, and QTC , and sends them to theQuery Dispatcher (QD).

– Step5. The QD sends Qs and Qe to the CSP.– Step6. The CSP executes Qe and retrieves re-

quested encrypted data, Re, and sends them to theQD. The CSP also executes Qs over unencrypteddata and sends the results back again to the QD.

– Step7. The QD decrypts data in Re and sendsdecrypted values along with QTC to the DB engine

12 VOLUME 4, 2016

Page 13: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

DO AU

𝐑𝐞𝐪𝐮𝐞𝐬𝐭 𝐇𝐚𝐧𝐝𝐥𝐞𝐫

𝐄𝐧

𝐜𝐫𝐲

𝐩𝐭𝐞

𝐝

𝐐𝐮

𝐞𝐫𝐲

𝐝𝐞𝐜𝐫𝐲𝐩𝐭𝐞𝐝 𝐐𝐮𝐞𝐫𝐲

𝐐𝐮𝐞𝐫𝐲 𝐏𝐚𝐫𝐬𝐞𝐫

𝐐𝐮𝐞𝐫𝐲 𝐃𝐢𝐬𝐩𝐚𝐭𝐜𝐡𝐞𝐫

𝑸𝒔 𝑸𝑻

𝐏𝐫𝐢𝐯𝐚𝐭𝐞 𝐃𝐚𝐭𝐚

𝐏𝐮𝐛𝐥𝐢𝐜 𝐃𝐚𝐭𝐚 𝐃𝐚𝐭𝐚𝐁𝐚𝐬𝐞

𝐄𝐧𝐠𝐢𝐧𝐞

𝐃𝐚

𝐭𝐚𝐁

𝐚𝐬𝐞

𝐒𝐜𝐡

𝐞𝐦

𝐞

𝐄𝐧𝐜𝐫𝐲𝐩𝐭𝐞𝐝 𝐑𝐞𝐬𝐮𝐥𝐭

CSP SCPU

𝑸𝒔 𝑸𝒆

𝑸𝑻𝑪

𝑸𝒔 𝑸𝒆

𝑸𝑻𝑪

𝐄𝐧

𝐜𝐫𝐲

𝐩𝐭𝐞

𝐝

𝐑𝐞

𝐬𝐮𝐥𝐭

FIGURE 9. TrustedDB architecture and query processing workload

residing inside the CSPU. After executingQTC , thefinal results are re-encrypted by the QD and sent tothe RH. Finally, the RH sends the encrypted resultsto the AU.

Note that the execution of private queries, i.e.,QTC , dependson the output of public queries, i.e., Qs, and vice-e-versa.

Using the SCPU and a randomized encryption scheme,TrustedDB simulates FHE, i.e., the security guarantees ofTrustedDB is IND-CPA. However, leaving some data unen-crypted at the CSP may compromise the privacy of encryptedsensitive data by linking between them. Moreover, sendingencrypted data to the SCPU incurs communication overhead,which affects query processing and results in poor perfor-mance. Sending the final results to the QD, which encryptsthe results imposes computation overhead.

F. CIPHERBASECipherBase is another solution based on tamper-proof trustedhardware to preserve the confidentiality of sensitive data [20].A FPGA (Field Programmable Gate Array) is set at the CSPas a trusted component, TC, which computes some computa-tions over sensitive data on behalf of the CSP. Computationsare decomposed between the trusted TC and the CSP. LikeTrustedDB, the sensitivity of data must be defined by theDO on the scheme definition. Insensitive public data arestored unencrypted at the CSP. Highly sensitive data, e.g.,Patient.ID, are encrypted using a strong encryption scheme,i.e., an IND-CPA secure scheme. Less sensitive data, e.g.,Patient.Age are encrypted by a weaker encryption scheme,e.g., a PPE scheme.

Highly sensitive data are shipped to the TC, which aredecrypted and processed. In fact, CipherBase simulates FHE

by integrating non-homomorphic encryption schemes (e.g.,AES in CBC mode) with trusted hardwares to compute anyoperation.Query processing in CipherBase consists of several round tripbetween the TC and the CSP. A query planner, which residesat the AU’s, is responsible for query re-writing.

The query planner rewrites an issued query into threesub-queries. Among them, two sub-queries are executed bythe CSP and the third sub-query is sent to the TC. Queryexecution in CipherBase operates as follows.

– Step1. The query planner rewrites an issued query,Q, into three sub-queries , Qs, Qe, and QTC andsends them to the CSP.

– Step2. The CSP executes Qe and retrieves en-crypted tuples, Re, and sends them along withQTC to the TC. The CSP also computes Qs overunencrypted data.

– Step3. The TC decrypts all tuples in Re and exe-cutes QTC over decrypted data.

– Step4. The TC encrypts the results and ships themback to the CSP.

– Step5. The CSP sends all results consisting en-crypted and unencrypted results to the AU.

Security guarantees in CipherBase is specified by theDO. The highest level of security achieves when all dataare highly sensitive. However, security comes at a cost ofpoor performance. Specifying some data as less sensitiveand insensitive by the DO degrades security guarantees andincreases performance. In this case, CipherBase provides thesame level of security as CryptDB.Query processing in CipherBase involves shipping encrypteddata from the CSP to the TC, decrypting, query processing,

VOLUME 4, 2016 13

Page 14: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

CipherBase Planner

𝐋𝐞𝐬𝐬𝐒𝐞𝐧𝐬𝐢𝐭𝐢𝐯𝐞𝐃𝐚𝐭𝐚

𝐏𝐮𝐛𝐥𝐢𝐜𝐃𝐚𝐭𝐚

𝐇𝐢𝐠𝐡𝐒𝐞𝐧𝐬𝐢𝐭𝐢𝐯𝐞𝐃𝐚𝐭𝐚

AU

𝐃𝐚𝐭𝐚𝐁𝐚𝐬𝐞Scheme

𝑺𝒆𝒏𝒔𝒊𝒕𝒊𝒗𝒆Data

DO

𝑸𝒔 𝑸𝑻𝑪 𝑸𝒆

TC

𝐃𝐞𝐜𝐫𝐲𝐩𝐭𝐢𝐨𝐧

𝐐𝐮𝐞𝐫𝐲𝐢𝐧𝐠

𝐑𝐞 − 𝐞𝐧𝐜𝐫𝐲𝐩𝐭𝐢𝐨𝐧

𝑹𝒆 (High Sensitive Data)𝑸𝑻𝑪

CSP

𝐅𝐢𝐧𝐚𝐥 𝐑𝐞𝐬𝐮𝐥𝐭

𝑸

En

crypted

𝐑𝐞𝐬𝐮𝐥𝐭

FIGURE 10. CipherBase architecture and query processing workload

and re-encrypting data in TC, and shipping results back to theCSP. Moving encrypted data between the CSP and the TCincurs communication overhead. Moreover, decryption andre-encryption are also costly in terms of computation. Whilstsome optimization techniques are considered to minimizeround trips, but the overhead of decryption/re-encryption isstill a bottleneck.

G. COMPARISON OF EXISTING SOLUTIONSIn this section, we compare the existing solutions presentedin Section IV regarding to overhead, security, trusted com-ponent, and query support features. Table 2 summarizes thefeatures of all solutions, which we discuss below.

1) Communication OverheadCommunication overhead is defined as the number of en-crypted records sent as the result of an issued query. Considera query Q, the encrypted results that satisfy the query isshown as |Q(e)|. Bucketization induces a communicationoverhead of |Q(e)| + |Q(f)| where |Q(f)| indicates thenumber of false-positives. Since MONOMI planner requestsretrieving some encrypted values, Rs, the communicationoverhead of MONOMI is |Q(e)|+ |Q(Rs)|.

Like MONOMI, TrustedDB and CipherBase retrieve someencrypted values, Rs, and send them to the TC. Hence, thecommunication overhead of TrustedDB and CipherBase is

|Q(Rs)|, which is imposed at the CSP’s side when data aremoved to the TC.

2) Computation Overhead

Computation overhead at the user’s side is defined as the timeof decryption, te. Bucketization requires also auxiliary time,tf , to eliminate false positives after decryption. MONOMIinduces an extra time, td, for decrypting encrypted values inRs and also tpost for query processing over decrypted values.Hence, MONOMI has the largest computation overhead atthe user’s side.

At the CSP, computation overhead consists of the requiredtime for query processing over encrypted values, tquery.CryptDB and SDB sends also an UPDATE query, whichinduces tquery+tUPDATE computation overhead at the CSP.MONOMI also requires tretrieve for retrieving encryptedvalues. Nevertheless, the computation overhead of MONOMIis less than CryptDB and SDB because the UPDATE queryconsists of the re-encryption of encrypted values and con-tains expensive modular operations such as modular multi-plication. Computation overhead for TrustedDB consists oftp for querying public unencrypted data, tretrieve for re-trieving encrypted private values, tdec for decrypting privateencrypted data, and tTC for querying data after decryptioninside the SCPU. CipherBase needs also tquery for query-ing less sensitive data at the CSP. However, comparing to

14 VOLUME 4, 2016

Page 15: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

TrustedDB, CipherBase incurs less computation overheadbecause CipherBase needs less data decryption because somecomputations are computed over less sensitive data withoutdecryption.

3) SecuritySecurity guarantees depend directly on the cryptographicschemes which are used in a solution. Bucketization usesPHE, but cannot provide IND-CPA because the bucket’slabels reveal some information about the encrypted values.Bucketization introduces a trade-off between security andefficiency; for example, when the labels are order-preserving,query processing is more efficient, but at the cost of a lowerlevel of security. In fact, the security level of order-preservinglabeling is the same as that of PPE [92].

On the other hand, having a small number of bucketsincreases security, but triggers more false positives in queryresults, leading to poor performance and less efficiency. PHEis implemented in CryptDB and MONOMI to provide IND-CPA security; however, security is relaxed to improve theefficiency of query processing [69]. CryptDB and MONOMIimplement PPE to support more queries, which disclosesIND-CPA security. IND-CPA is guaranteed when no query isexecuted and the outermost layer of encryption, randomizedencryption, has been kept. Once the randomized layer isremoved, however, security is degraded to the level of PPE.

The proposed encryption scheme in SDB provides IND-CPA security. However, IND-CPA is no longer guaranteedwhen a query is executed. After an updating query, the valuesof a column are transformed in such a way that the querycan be executed by the CSP. For instance, in order to executea GROUP BY for a column, after updating, the encryptionvalues are equal for the same plaintexts, and so equality leaks,which degrades the level of security to the level of PPE .

Implementing randomized encryption schemes along withthe SCPU in TrustedDB guarantees IND-CPA, so TrustedDBoffers the same level of security as FHE , but greater effi-ciency. At the same time, leaving less sensitive data unen-crypted at the CSP raises concerns about linking attacks. Inlinking attacks, the attacker tries to find some informationabout sensitive data using public unencrypted data, whichmay compromise data privacy [91]. More importantly, secu-rity is guaranteed as long as the SCPU is not in danger [67].Similarly, the TC in CipherBase simulates FHE by executingcomputations inside the TC, and security is guaranteed aslong as the TC is not compromised [102]. Yet the levelof security required depends on the sensitivity of the data,i.e. highly sensitive data should be encrypted with a strongencryption scheme so that no information will be revealed.Even though CipherBase provides a reasonable security levelfor highly sensitive data, using PPE schemes for less sensitivedata degrades security guarantees.

The vulnerability of PPE to inference attacks is studied in[1]. The authors describe an inference attack directly over en-crypted data. Even if a non-deterministic encryption schemeis used to encrypt data, such an attack can be executed

successfully. The results convincingly illustrate the trade-offsbetween security, performance and functionality for queryprocessing in a TC.

4) Trusted Component (TC)The TC is set at the user’s side (the AU and the DO’s side)in Bucketization and MONOMI. CryptDB and SDB relyon TC-PS as the TC. Considering the DO and the AU asa TC eliminates the need of extra components. However,the TC must be installed and maintained on each AU andDO separately. Setting a proxy server adds a new pointof attack and induces computational and storage cost, buteliminates the need of installing the TC for each DO and AU.The TC is configured at the CSP (TC-TPH in Figure 1) inTrustedDB and CipherBase. Whilst installing the TC at theCSP minimizes computational burdens for the AU, it needsan agreement with the CSP, which is not always feasible.Moreover, using tamper-proof hardware is still expensive andit is not affordable for small/medium businesses. More im-portantly, installing the TC at the CSP protects data privacyagainst attackers that compromise the CSP, but not insidersand the CSP who has access to the TC.

5) Query SupportEquality checking, inequality, and aggregation queries aresupported in all systems. All those queries are executedon only one column in a table, but SDB handles opera-tions between separated columns, e.g., addition between twocolumns A and B, A+B.

Regarding to standard benchmarks, SDB is the onlysystem that supports all TPC-H queries. Bucketization,CryptDB, and MONOMI handle only 2, 4, and 19 out of 22TPC-H queries, respectively. TrustedDB supports only 4 non-nested queries from TPC-H because nested queries are notsupported by TrustedDB. CipherBase is designed for OLTPworkloads; hence query processing for TPC-H workloads isnot discussed.

V. LIMITATIONS AND OPEN ISSUESA. PRIVACY VS. FUNCTIONALITYDespite the highest level of security provided by FHE, cur-rent FHE schemes are inefficient for practical data analyticsdue to computation overhead [62]. Contrary, PHE schemesare more efficient and closer to practical solutions. However,PHE schemes provide less functionality due to supportinglimited operations only. PPE schemes allow operating overencrypted data in the same way as they would operateon plaintext. PPE are crucial for practical secure solutionsbecause of their functionality. However, such PPE schemesprovide lower security guarantees, e.g., IND-DCPA secu-rity for DET encryption, leading to the leakage of a non-trivial amount of information, which makes such schemesvulnerable to some attacks (e.g., frequency analysis attacks).Recent works [74], [62], [75], [1], [2] apply and developdifferent attacks using the leakage of PPE schemes. Forinstance, Naveed et al. demonstrate that a large fraction of

VOLUME 4, 2016 15

Page 16: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

TAB

LE

2:Fe

atur

esof

prac

tical

solu

tions

Que

ries

Com

puta

tion

Com

mun

icat

ion

Cry

ptog

raph

icSc

hem

esR

EA

M-C

Serv

erU

ser

Serv

erU

ser

FHE

PHE

PPE

TC

Secu

rity

IND

-CPA

TPC

-H

Buc

ketiz

atio

nt q

uery

t e+

t f−

|Q(e)|+|Q

(f)|

84

8T

C-D

OT

C-A

U8

24

44

8

Cry

ptD

Bt q

uery

+tUPDATE

t e−

|Q(e)|

84

4T

C-P

S8

44

44

8

MO

NO

MI

t query

+tretr

ieve

t e+

t d+

t post

−|Q

(e)|+|Q

(Rs)|

84

4T

C-D

OT

C-A

U8

194

44

8

SDB

t query

+tUPDATE

t e−

|Q(e)|

48

8T

C-P

S4

All

44

44

Approaches

Trus

tedD

Bt p

+tretr

ieve//+

t dec+t

TC

t e|Q

(Rs)|

|Q(e)|

48

8T

C-T

PH4

44

44

8

Cip

herB

ase

t p+t

query

+tretr

ieve+

t dec+t

TC

t e|Q

(Rs)|

|Q(e)|

84

4T

C-T

PH8

−4

44

8

R:R

ange

quer

ies,

E:E

qual

itych

ecki

ng,A

:Agg

rega

tion,

M-C

:Mul

tiC

olum

ns

16 VOLUME 4, 2016

Page 17: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

the records from the DET encrypted columns in a databasecan be decrypted when the adversary possesses statisticalinformation about the plaintexts [62]. Hence, secure systemssuch as CryptDB and MONOMI sacrifice data privacy forenhancing functionality, i.e., there is a trade-off between ahigher level of functionality and less data privacy.

B. PRIVACY VS. EFFICIENCYProcessing over encrypted data is computationally more ex-pensive than original plaintext [76]. The first reason is dataexpansion. For instance, a 32-bit plaintext is expanded to256 bits of ciphertext using classic AES+CBC (Cipher-BlockChaining) encryption [18] and 1,024 or more bits long usingPHE. Such enormous data expansion induces both storageand computation overhead. The second reason is the natureof operations. In particular, operations at the CSP shouldnot involve any expensive modular arithmetic operation likemodular multiplication or exponentiation [77]. Whilst FHEand PHE offer the highest security guarantees, but necessitatemodular operation for query processing. Security guaranteesare relaxed in PPE to provide more efficiency.

Setting a trusted server like a trusted proxy server inCryptDB induces extra cost, which does not fit cloud dataoutsourcing. Moreover, since the trusted server has access tothe encryption keys and plaintext information, it becomes anappealing target for attackers.

Splitting queries in [10], [21] to be executed partly at theCSP and partly by the AU incurs computational overhead atthe end user. In data outsourcing the goal is fully outsourcingcomputations because the resources for the AU are limited.Moreover, transferring intermediate results to the user in-duces communication overhead. Hence, it is essential to cutdown the bandwidth required to transfer intermediate resultsto the AU and reduce computational overhead at the AU’sside.

Tamper-proof hardware used in TrustedDB and Ci-pherBase is significantly constrained in both computationability and memory capacity, hence setting such componentsat the CSP faces major efficiency challenges. A trade-offshould be defined between more efficient untrusted mainCPU computation and less efficient trusted computationsinside the tamper-proof component [71]. Moreover, leavingsome data unencrypted at the CSP for the sake of memorycapacity limitation may disclose the privacy of encryptedsensitive data.

C. ACCESS PATTERN PRIVACYData protection methods, discussed before, guarantee theconfidentiality and privacy of the data stored at the CSP.Another security issue in outsourcing scenarios arises whenquerying data whilst preserving the privacy of access pat-terns. In fact, by observing enough query results the adver-sary could infer about data and data privacy could be com-promised by correlating prior information with frequentlyqueried data [77]. Private access patterns have recently raisedthe attention of researchers [48]. New cryptographic schemes

are needed that allow the CSP to send items in response aquery without knowing which item is being sent. Private In-formation Retrieval (PIR) [78], [79], [80], [81], theoreticallyenables accessing data items while preventing the CSP tolearn anything about query access patterns [82]. The most ofcurrent PIR solutions aim at very strong security bounds andremain unsuitable for practical purposes [83]. Deployment ofexisting PIR protocols would have been orders of magnitudemore time consuming than transferring the entire database tothe user [84].

Oblivious RAM (ORAM) [85] provides access patternprivacy, too [84]. ORAM allows reading and writing tomemory without revealing access pattern to the CSP. ORAMcontinuously shuffles and re-encrypts data as they are beingaccessed, thereby completely hiding access pattern [86].

While the idea of relying on ORAM and PIR to enableaccess pattern privacy is promising, but a key challenge isthe efficiency of such schemes.Another solution could be sending frequently fake querieswithout taking care with the results [87], [88]. The goal isto mislead the adversary of making valid inference aboutcorrelation between hot frequently queried data and otherdata. Submitting such queries requires more investigation onthe way of generating that looks realistic.

D. DATA INTEGRITYWhile a passive model is by far the most widely assumedadversary model in the literature, in some scenarios the activemodel should be considered. Thus, there is also the needto address an active adversary model. In an active model,the adversary has malicious intension and may change ormodify the data, results of queries or even in some casesinterrupt or cause denial of service. Such threats are aimedat data integrity and availability. Protecting against suchan adversary needs more effort by the user to ensure thecorrectness of data and query results. The results must bedemonstrably authentic to ensure that the data has not beentampered with (integrity). The proof of completeness mustbe carried by the results that allow the user to verify that theCSP has not omitted any valid tuples that match the queriespredicate [89] (completeness). Such proof assures that thequery is executed with completeness over their entire targetdata set [77]. For instance, when executing a JOIN query,the user should be able to verify that the CSP returned allmatching values.

Authentication and integrity checking along encryptionbecome important in such scenarios. Typically, signatureor MAC (Message Authentication Code) is used to allowthe user to check the integrity of returned items [90], [23].However, existing works introduce mechanisms for efficientintegrity and authentication checking only for simple queriesand they are limited to the queries of some specific kind.

E. HIGH DIMENSIONAL DATACloud-based EHRs (Electronic Health Record systems) areanother case of cloud-based data analytics. In EHRs, a pa-

VOLUME 4, 2016 17

Page 18: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

tient’s sensitive data (e.g. data extracted from disease images)is stored at the CSP. The patient’s data is always stored inthe form of high-dimensional vectors [97]. The objective ofcloud-based EHRs is to find approximate k Nearest Neigh-bors (KNN) in a privacy-preserving manner, so approximatekNN queries are executed over vectors without any leakageof information about the underlying vectors [96]. DifferentialPrivacy techniques [99], [100] add a great amount of noiseto query results, leading to low accuracy and making queryresults useless. However, encryption cannot be implementedfor EHRs due to the heavy computational overhead forhigh-dimensional vectors [98]. Existing solutions focus onlow-dimensional datasets and are unable to handle high-dimensional data [97].

VI. CONCLUSIONIn this paper, we discuss data security issues which emergewhen organizations outsource both data and data analytics toCloud Service Providers. We review the security mechanismswhich can be used for the deployment of secure cloud-based data analytics services. We focus particularly on cryp-tographic schemes and practical systems which enable theexecution of queries over encrypted data without decryption.We highlight the benefits and drawbacks of existing solutionsin a cloud computing context and suggest practical solutions.

Building practical secure systems is a challenging taskbecause there is a trade-off between privacy and functional-ity/efficiency. Using FHE in practical solutions enables com-puting arbitrary operations , i.e., combining high functional-ity with the highest level of security, which is promising forcloud-based data analytics. However, the level of efficiencyof FHE is still a major problem. PHE schemes on the otherhand are more efficient and therefore closer to being practicalsolutions, but they support partial computations only andcannot provide a completely practical solution with all thedesired functionality. In PPE schemes privacy is relaxed toprovide greater efficiency, and whilst they provide the samefunctionality as computing over unencrypted data, their poorlevel of security is a great challenge. Implementing tamper-proof hardware for secure query processing simulates FHEwith greater efficiency, although this is affected by the limitedmemory capacity of such trusted hardware.

Cryptography cannot simultaneously provide the highestlevels of security, efficiency and functionality. It is thus es-sential to specify clearly the objectives of any deployment ofcloud-based data analytics and adopt cryptographic schemeswhich are tailored to those objectives. In future work, weplan to analyze the efficiency of various practical solutionsby carrying out different tests with various datasets.

REFERENCES[1] P. Antonopoulos, A. Arasu, K. Eguro, J. Hammer, R. Kaushik, D. Koss-

mann, R. Ramamurthy, and J. Szymaszek. Pushing the limits of encrypteddatabases with secure hardware. CoRR, abs/1809.02631, 2018.

[2] F. Pallas and M. Grambow. Three tales of disillusion: Benchmarkingproperty preserving encryption schemes. In S. Furnell, H. Mouratidis, and

G. Pernul, editors, Trust, Privacy and Security in Digital Business, pages39–54, Cham, 2018. Springer International Publishing.

[3] D. Naous, J. Schwarz, and C. Legner. Analytics as a service: Cloudcomputing and the transformation of business analytics business modelsand ecosystems. In Proceedings of the 25th European Conference onInformation Systems (ECIS), Portugal, June 5-10, 2017.

[4] S. Schulte, C. Janiesch, S. Venugopal, I. Weber, and P. Hoenisch, “Elasticbusiness process management: State of the art and open challenges forBPM in the cloud,” Fut. Gen. Comp. Syst., vol. 46, pp. 36–50, 2015.

[5] E. Damiani, S. D. C. di Vimercati, S. Foresti, S. Jajodia, S. Paraboschi,and P. Samarati, “Key management for multi-user encrypted databases,” inProc. StorageSS, Fairfax, VA, USA, Nov. 2005, pp. 74–83.

[6] E. Shi, J. Bethencourt, H. T. Chan, D. X. Song, and A. Perrig, “Multi-dimensional range query over encrypted data,” in Proc. IEEE (S&P 2007),Oakland, California, USA, May 2007, pp. 350–364.

[7] M. Golfarelli, S. Rizzi, and I. Cella, “Beyond data warehousing: what’snext in business intelligence?” in Proc. DOLAP, Washington, DC, USA,Nov., 2004, pp. 1–6.

[8] G. Amanatidis, A. Boldyreva, and A. O’Neill, “Provably-secure schemesfor basic query support in outsourced databases,” in Proc. WG 11.3 DAS,Redondo Beach, CA, USA, 2007, pp. 14–30.

[9] J. L. D. Jr. and C. V. Ravishankar, “Security limitations of using secretsharing for data outsourcing,” in Proc. DBSec, Paris, France, Jul. 2012,pp. 145–160.

[10] H. Hacigümüs, B. R. Iyer, C. Li, and S. Mehrotra, “Executing SQL overencrypted data in the database-service-provider model,” in Proc. ACMSIGMOD, Madison, Wisconsin, Jun. 2002, pp. 216–227.

[11] H. Hacigümüs, B. R. Iyer, and S. Mehrotra, “Efficient execution of ag-gregation queries over encrypted relational databases,” in Proc. DASFAA,Jeju Island, Korea, 2004, pp. 125–136.

[12] R. A. Popa, C. M. S. Redfield, N. Zeldovich, and H. Balakrishnan,“Cryptdb: protecting confidentiality with encrypted query processing,” inProc. SOSP, Cascais, Portugal, Oct. 2011, pp. 85–100.

[13] M. Van Dijk, C. Gentry, S. Halevi, and V. Vaikuntanathan, “Fully ho-momorphic encryption over the integers,” in Proc. EUROCRYPT FrenchRiviera, Jun. 2010, pp. 24–43.

[14] R. A. Popa, “Building practical systems that compute on encrypted data,”Ph.D. dissertation, MIT, 2014.

[15] S. Bajaj and R. Sion, “Trusteddb: a trusted hardware based database withprivacy and data confidentiality,” in Proc. SIGMOD, Athens, Greece, Jun.,2011, pp. 205–216.

[16] V. Attasena, N. Harbi, and J. Darmont, “fvss: A new secure andcost-efficient scheme for cloud data warehouses,” arXiv preprintarXiv:1412.3538, 2014.

[17] M. A. Hadavi and R. Jalili, “Secure data outsourcing based on thresholdsecret sharing; towards a more practical solution,” in VLDB 2010 PhDWorkshop, Singapore, 2010, pp. 54–59.

[18] A. Arasu, K. Eguro, M. Joglekar, R. Kaushik, D. Kossmann, and R. Rama-murthy, “Transaction processing on confidential data using cipherbase,” inProc. IEEE ICDE, Seoul, South Korea, Apr., 2015, pp. 435–446.

[19] T. Ge and S. B. Zdonik, “Answering aggregation queries in a secure systemmodel,” in Proc. VLDB, Austria, Sep. 2007, pp. 519–530.

[20] A. Arasu, S. Blanas, K. Eguro, R. Kaushik, D. Kossmann, R. Ramamurthy,and R. Venkatesan, “Orthogonal security with cipherbase,” in CIDR 2013,CA, USA, Jan., 2013.

[21] S. Tu, M. F. Kaashoek, S. Madden, and N. Zeldovich, “Processing ana-lytical queries over encrypted data,” PVLDB, vol. 6, no. 5, pp. 289–300,2013.

[22] B. K. Samanthula, W. Jiang, and E. Bertino, “Privacy-preserving complexquery evaluation over semantically secure encrypted data,” in Proc. ES-ORICS, Wroclaw, Poland, Sep. 2014, pp. 400–418.

[23] M. A. Hadavi, M. Noferesti, R. Jalili, and E. Damiani, “Database as aservice: Towards a unified solution for security requirements,” in Proc.IEEE COMPSAC, Izmir, Turkey, Jul., 2012, pp. 415–420.

[24] C. Yildizli, T. B. Pedersen, Y. Saygin, E. Savas, and A. Levi, “Distributedprivacy preserving clustering via homomorphic secret sharing and itsapplication to (vertically) partitioned spatio-temporal data,” IJDWM ,vol. 7, no. 1, pp. 46–66, 2011.

[25] C. Gentry, “A fully homomorphic encryption scheme,” Ph.D. dissertation,Stanford University, 2009.

[26] F. H. Liu, “Computation over encrypted data,” in Cloud ComputingSecurity: Foundations and Challenges. CRC Press, 2016, pp. 305–320.

[27] M. D. Ryan, “Cloud computing security: The scientific challenge, and asurvey of solutions,” J. S. and S., vol. 86, no. 9, pp. 2263–2268, 2013.

18 VOLUME 4, 2016

Page 19: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

[28] M. van Dijk, C. Gentry, S. Halevi, and V. Vaikuntanathan, “Fully ho-momorphic encryption over the integers,” in Proc. EUROCRYPT, FrenchRiviera, 2010, pp. 24–43.

[29] Z. Brakerski and V. Vaikuntanathan, “Fully homomorphic encryption fromring-lwe and security for key dependent messages,” in Proc. CRYPTO,Santa Barbara, CA, USA, 2011, pp. 505–524.

[30] C. Gentry, A. Sahai, and B. Waters, “Homomorphic encryption from learn-ing with errors: Conceptually-simpler, asymptotically-faster, attribute-based,” in Proc. CRYPTO, Santa Barbara, CA, USA, 2013, pp. 75–92.

[31] D. Boneh, C. Gentry, S. Halevi, F. Wang, and D. J. Wu, “Private databasequeries using somewhat homomorphic encryption,” in Proc. ACNS, Banff,AB, Canada, Jun. 2013, pp. 102–118.

[32] N. P. Smart and F. Vercauteren, “Fully homomorphic encryption withrelatively small key and ciphertext sizes,” in Proc. PKC, Paris, France,May. 2010, pp. 420–443.

[33] C. Gentry and S. Halevi, “A working implementation of fully homomor-phic encryption,” presented at EUROCRYPT, Nice, French, May. 2010.

[34] J.-S. Coron, A. Mandal, D. Naccache, and M. Tibouchi, “Fully homo-morphic encryption over the integers with shorter public keys,” in Proc.CRYPTO, Santa Barbara, CA, Aug. 2011, pp. 487–504.

[35] Z. Brakerski, C. Gentry, and V. Vaikuntanathan, “(leveled) fully homo-morphic encryption without bootstrapping,” in Proc. ITCS, Cambridge,Massachusetts, Jan. 2012, pp. 309–325.

[36] P. Paillier, “Public-key cryptosystems based on composite degree residu-osity classes,” in Proc. EURO-CRYPT’99, Prague, Czech Republic, May1999, pp. 223–238.

[37] S. Sobati-Moghadam, G. Gavin, and J. Darmont, “A secure order-preserving indexing scheme for outsourced data,” in Proc. ICCST’16,Orlando, USA, Oct. 2016, pp. 297–303.

[38] R. A. Popa, “Building practical systems that compute on encrypted data,”Ph.D. dissertation, Massachusetts Institute of Technology, 2014.

[39] M. Bellare, A. Boldyreva, and A. O’Neill, “Deterministic and efficientlysearchable encryption,” in Proc. CRYPTO, Santa Barbara, CA, USA, Aug.2007, pp. 535–552.

[40] M. Bellare, M. Fischlin, A. O’Neill, and T. Ristenpart, “Deterministicencryption: Definitional equivalences and constructions without randomoracles,” in Proc. CRYPTO, Santa Barbara, CA, USA, Aug. 2008, pp 360–378.

[41] A. Boldyreva, N. Chenette, Y. Lee, and A. O’Neill, “Order-preservingsymmetric encryption,” IACR Cryptology ePrint Archive, vol. 2012, p.624, 2012.

[42] A. Boldyreva, N. Chenette, Y. Lee, and A. O’neill, “Order-preservingsymmetric encryption,” in Proc. EUROCRYPT, Cologne, Germany, Apr.2009, pp. 224–241.

[43] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, “Order-preserving encryp-tion for numeric data,” in Proc. the ACM SIGMOD, Paris, France, Jun.2004, pp. 563–574.

[44] Z. Liu, X. Chen, J. Yang, C. Jia, and I. You, “New order preservingencryption model for outsourced databases in cloud environments,” J.NETW COMPUT APPL, vol. 59, pp. 198–207, 2016.

[45] E. Damiani, S. D. C. di Vimercati, S. Jajodia, S. Paraboschi, and P. Sama-rati, “Balancing confidentiality and efficiency in untrusted relationaldbmss,” in Proc. ACM CCS, Washington, DC, USA, Oct. 2003, pp. 93–102.

[46] R. A. Popa, F. H. Li, and N. Zeldovich, “An ideal-security protocol fororder-preserving encoding,” in Proc. IEEE (SP), 2013, pp. 463–477.

[47] E. Mykletun and G. Tsudik, “Aggregation queries in the database-as-a-service model,” in Proc. IFIP WG, Sophia Antipolis, France, Jul., 2006,pp. 89–103.

[48] P. Samarati and S. D. C. di Vimercati, “Data protection in outsourcingscenarios: issues and directions,” in Proc. ACM ASIACCS , Beijing,China, 2010, pp. 1–14.

[49] B. Hore, S. Mehrotra, M. Canim, and M. Kantarcioglu, “Secure multidi-mensional range queries over outsourced data,” VLDB J., vol. 21, no. 3,pp. 333–358, 2012.

[50] A. Arasu, K. Eguro, R. Kaushik, D. Kossmann, R. Ramamurthy, andR. Venkatesan, “A secure coprocessor for database applications,” in Proc.FPL , Porto, Portugal, Sep., 2013, pp. 1–8.

[51] S. D. Tetali, M. Lesani, R. Majumdar, and T. D. Millstein, “MrCrypt: staticanalysis for secure cloud computations,” in Proc. ACM SIGPLAN , part ofSPLASH 2013, Indianapolis, IN, USA, Oct. 2013, pp. 271–286.

[52] J. Kepner, V. Gadepally, P. Michaleas, N. Schear, M. Varia, A. Yerukhi-movich, and R. K. Cunningham, “Computing on masked data: a high

performance method for improving big data veracity,” in Proc. IEEEHPEC, Waltham, MA, USA, Sep. 2014, pp. 1–6.

[53] J. J. Stephen, S. Savvides, R. Seidel, and P. Eugster, “Practical confiden-tiality preserving big data analysis,” in 6th USENIX Workshop on HotTopics in Cloud Computing, HotCloud ’14, Philadelphia, PA, USA, Jun.2014.

[54] H. Shafagh, “Toward computing over encrypted data in IoT systems,”ACM Crossroads, vol. 22, no. 2, pp. 48–52, 2015. [Online]. Available:http://doi.acm.org/10.1145/2845157

[55] H. Shafagh, A. Hithnawi, A. Droescher, S. Duquennoy, and W. Hu,“Poster: Towards encrypted query processing for the internet of things,”in Proc. MobiCom 2015, Paris, France, Sep. 2015, pp. 251–253.

[56] H. Shafagh, L. Burkhalter, and A. Hithnawi, “Talos a platform for process-ing encrypted IoT data: Demo abstract,” in Proc. ACM SenSys, Stanford,CA, USA, Nov. 2016, pp. 308–309.

[57] Google, “Encrypted Big Query,” [Online]. Available: https://github.com/google/encrypted-bigquery-client, Accessed on: Jun. 5, 2018

[58] P. Grofig, I. Hang, M. Härterich, F. Kerschbaum, M. Kohler, A. Schaad,A. Schröpfer, and W. Tighzert, “Privacy by encrypted databases,” in Proc.APF, Athens, Greece, May, 2014, pp. 56–69.

[59] “Always encrypted,” [Online]. Available: https://msdn.microsoft.com/enus/library/mt163865(v=sql.130).aspx., Accessed on: Jul. 2018.

[60] Dotissi, [Online]. Available:“CryptonorDB,” http://www.cryptonordb.com/, Accessed on Jul. 2018.

[61] [Online]. Available: “Lincoln Laboratory,” http://www.ll.mit.edu/index.html, Accessed on Jul. 2018.

[62] M. Naveed, S. Kamara, and C. V. Wright, “Inference attacks on property-preserving encrypted databases,” in Proc. SIGSAC, Denver, Colorado,USA âAT Oct. 2015, pp. 644–655.

[63] J. Köhler, Tunable Security for Deployable Data Outsourcing. KITScientific Publishing, Ph.D. dissertation, Dep. of Comp. Sce., 2015.

[64] E. Saleh, “Processing over encrypted data: Between theory and practice,”in the 8th Ph. D. Retreat of the HPI Research School on Service-orientedSystems Engineering, 2015.

[65] F. Kerschbaum, P. Grofig, I. Hang, M. Härterich, M. Kohler, A. Schaad,A. Schröpfer, and W. Tighzert, “Adjustably encrypted in-memory column-store,” in Proc. ACM SIGSAC, Berlin, Germany, Nov., 2013, pp. 1325–1328.

[66] Q. Cai, J. Lin, F. Li, and Q. Wang, “SEDB: building secure databaseservices for sensitive data,” in Proc. ICICS, Hong Kong, China, Dec. ,2014, pp. 16–30.

[67] Z. He, W. K. Wong, B. Kao, D. W. Cheung, R. Li, S. Yiu, and E. Lo, “SDB:A secure query processing system with data interoperability,” PVLDB,vol. 8, no. 12, pp. 1876–1879, 2015.

[68] A. Shamir, “How to share a secret,” Communications of the ACM, vol. 22,no. 11, pp. 612–613, 1979.

[69] S. S. Moghadam, J. Darmont, and G. Gavin, “Enforcing privacy in clouddatabases,” in Proc. DaWaK 2017, Lyon, France, Aug., ser. LNCS, L. Bel-latreche and S. Chakravarthy, Eds., vol. 10440. Springer, 2017, pp. 53–73.

[70] Z. Liu, X. Chen, J. Yang, C. Jia, and I. You, “New order preservingencryption model for outsourced databases in cloud environments,” J.NETW COMPUT APPL. , 2014.

[71] S. Bajaj and R. Sion, “Trusteddb: A trusted hardware-based database withprivacy and data confidentiality,” IEEE TKDE, vol. 26, no. 3, pp. 752–765,2014.

[72] “IBM 4765 PCIe cryptographic coprocessor.” http://www-03.ibm.com/security/cryptocards/pciecc/overview.shtml, 2018.

[73] S. Bajaj and R. Sion, “Trusteddb: A trusted hardware based outsourceddatabase engine,” PVLDB, vol. 4, no. 12, pp. 1359–1362, 2011.

[74] F. B. Durak, T. M. DuBuisson, and D. Cash, “What else is revealed byorder-revealing encryption?” in Proc. t ACM SIGSAC, Vienna, Austria,Oct. 2016, pp. 1155–1166.

[75] G. Kellaris, G. Kollios, K. Nissim, and A. O’Neill, “Generic attacks onsecure outsourced databases,” in Proc. ACM SIGSAC, Vienna, Austria,Oct. , 2016, pp. 1329–1340.

[76] R. L. Lagendijk, Z. Erkin, and M. Barni, “Encrypted signal processing forprivacy protection: Conveying the utility of homomorphic encryption andmultiparty computation,” IEEE Signal Process. Mag., vol. 30, no. 1, pp.82–105, 2013.

[77] R. Sion, “Towards secure data outsourcing,” in Gertz M., Jajodia S. (eds)Handbook of Database Security. Springer, Boston, MA, 2008.

[78] W. Lueks and I. Goldberg, “Sublinear scaling for multi-client privateinformation retrieval,” in Proc.FC 2015, San Juan, Puerto Rico, RevisedSelected Papers, 2015, pp. 168–186.

VOLUME 4, 2016 19

Page 20: Toward Securing Cloud-Based Data Analytics: A Discussion ... · With cloud-based technologies and services flourishing, many organizations have started adopting hybrid Information

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

[79] C. Cachin, S. Micali, and M. Stadler, “Computationally private infor-mation retrieval with polylogarithmic communication,” in Proc. EURO-CRYPT ’99, Prague, Czech Republic, 1999, pp. 402–414.

[80] Y. Chang, “Single database private information retrieval with logarithmiccommunication,” in Proc.I ACISP Sydney, Australia, 2004, pp. 50–61.

[81] E. Kushilevitz and R. Ostrovsky, “One-way trapdoor permutations aresufficient for non-trivial single-server private information retrieval,” inProc. EUROCRYPT, Bruges, Belgium, 2000, pp. 104–121.

[82] B. Chor, E. Kushilevitz, O. Goldreich, and M. Sudan, “Private informationretrieval,” J. ACM, vol. 45, no. 6, pp. 965–981, 1998.

[83] E. Mykletun, M. Narasimha, and G. Tsudik, “Signature bouquets: Im-mutability for aggregated/condensed signatures,” in Proc. ESORICS,Sophia Antipolis, France, Sep. 2004, pp. 160–176.

[84] P. Williams and R. Sion, “Usable PIR,” in Proc. NDSS 2008, San Diego,California, USA, Feb. , 2008.

[85] O. Goldreich and R. Ostrovsky, “Software protection and simulation onoblivious rams,” J. ACM, vol. 43, no. 3, pp. 431–473, 1996.

[86] E. Stefanov, M. van Dijk, E. Shi, C. W. Fletcher, L. Ren, X. Yu, andS. Devadas, “Path ORAM: an extremely simple oblivious RAM protocol,”in Proc. ACM SIGSAC , CCS’13, Berlin, Germany, 2013, pp. 299–310.

[87] C. Mavroforakis, N. Chenette, A. O’Neill, G. Kollios, and R. Canetti,“Modular order-preserving encryption, revisited,” in Proc. the 2015 ACMSIGMOD , Melbourne, Victoria, Australia, 2015, pp. 763–777.

[88] H. Pang, X. Xiao, and J. Shen, “Obfuscating the topical intention inenterprise text search,” in Proc. IEEE ICDE 2012, Washington, DC, USA, 2012, pp. 1168–1179.

[89] M. Narasimha and G. Tsudik, “DSAC: integrity for outsourced databaseswith signature aggregation and chaining,” in Proc. the 2005 ACM CIKM ,Bremen, Germany, October 31 - November 5, 2005, 2005, pp. 235–236.

[90] V. Attasena, N. Harbi, and J. Darmont, “A novel multi-secret sharingapproach for secure data warehousing and on-line analysis processing inthe cloud,” IJDWM, vol. 11, no. 2, pp. 22–43, 2015.

[91] A. Harmanci and M. Gerstein. Quantification of private informationleakage from phenotype-genotype data: linking attacks. Nature methods,vol. 13, no. 3, pp. 251–256, 2016.

[92] E. Shi, J. Bethencourt, H. T. Chan, D. X. Song, and A. Perrig. Multi-dimensional range query over encrypted data. In 2007 IEEE Symposiumon Security and Privacy (S&P 2007), 20-23 May 2007, Oakland, Califor-nia, USA, pages 350–364, 2007.

[93] C. Guo, X. Chen, Y. Jie, F. Zhangjie, M. Li, and B. Feng. Dynamicmulti-phrase ranked search over encrypted data with symmetric searchableencryption. IEEE Transactions on Services Computing, pp. 1–12, 2017.

[94] Z. Shen, J. Shu, and W. Xue. Preferred search over encrypted data.Frontiers of Computer Science, vol. 12, no. 3, pp. 593–607, 2018.

[95] C. Guo, R. Zhuang, Y. Jie, Y. Ren, T. Wu, and K. R. Choo. Fine-graineddatabase field search using attribute-based encryption for e-healthcareclouds. J. Medical Systems, vol. 40, no. 11, pp. 235–243, 2016.

[96] W. Wang, L. Chen, and Q. Zhang. Outsourcing high-dimensional health-care data to cloud with personalized privacy preservation. ComputerNetworks, vol. 88, pp. 136–148, 2015.

[97] W. C. Xue, H. Li, Y. Peng, J. Cui, and Y. Shi. Secure k nearest neighborsquery for high-dimensional vectors in outsourced environments. IEEETrans. Big Data, vol. 4, pp. 586–599, 2018.

[98] M. Li, S. Yu, Y. Zheng, K. Ren, and W. Lou. Scalable and securesharing of personal health records in cloud computing using attribute-based encryption. IEEE Transactions on Parallel and Distributed Systems,vol. 24, no. 1, pp. 131–143, 2013.

[99] C. Xu, J. Ren, Y. Zhang, Z. Qin, and K. Ren. Dppro: Differentially privatehigh-dimensional data release via random projection. IEEE Transactionson Information Forensics and Security, vol. 12, no. 12 , pp. 3081–3093,2017.

[100] S. Su, P. Tang, X. Cheng, R. Chen, and Z. Wu. Differentially privatemulti-party high-dimensional data publishing. In 2016 IEEE 32nd Interna-tional Conference on Data Engineering (ICDE), May 2016, pp. 205–216.

[101] D. Naous, J. Schwarz, and C. Legner. Analytics as a service: Cloudcomputing and the transformation of business analytics business modelsand ecosystems. In Proceedings of the 25th European Conference onInformation Systems (ECIS), Portugal, Jun. 2017, pp. 487–501.

[102] D. Kossmann. Confidentiality à la carte with cipherbase. InDatenbanksysteme für Business, Technologie und Web (BTW 2017),Gesellschaft für Informatik, Bonn, 2017, pp. 23–24.

SOMAYEH SOBATI M. received the master de-gree from INSA de Lyon, France, and the Ph.D.degree from Lyon2 University, France in 2016. Dr.Sobati M. is currently an assistant professor withthe Electrical and Computer Engineering Faculty,Hakim Sabzevari University, Iran. Her current re-search includes privacy, security issues in cloudcomputing, decision support systems, data mining,IoT, and business information systems. She hasauthored several journal and conference papers.

AMJAD FAYOUMI is a Lecturer in Informa-tion Systems (IS) within the Management Sci-ence Department at Lancaster University Manage-ment School. He is researching in the informa-tion systems area focusing on enterprise modelingand simulation. Dr. Fayoumi was awarded hisPh.D. degree in the information systems area fromLoughborough University, for his thesis focusingon the use of enterprise modeling and simulationfor analyzing and designing business strategy and

operation. His recent research interests lie in the field of architecting digitalenterprises with a particular focus on the use of emergent intelligent systemsin organizations and societies. Dr. Fayoumi is currently a member of anumber of professional associations and he has taken roles in reviewing andediting research papers for several international conferences and journals.

20 VOLUME 4, 2016