

2168-7161 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. Seehttp://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI10.1109/TCC.2015.2481389, IEEE Transactions on Cloud Computing

JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2015 1

Catch You if You Misbehave: Ranked Keyword Search Results Verification in Cloud Computing

Wei Zhang, Student Member, IEEE, and Yaping Lin, Member, IEEE

Abstract—With the advent of cloud computing, more and more people tend to outsource their data to the cloud. As a fundamental form of data utilization, secure keyword search over encrypted cloud data has recently attracted the interest of many researchers. However, most existing research is based on the ideal assumption that the cloud server is “curious but honest”, so the search results are not verified. In this paper, we consider a more challenging model, where the cloud server would probably behave dishonestly. Based on this model, we explore the problem of result verification for secure ranked keyword search. Different from previous data verification schemes, we propose a novel deterrent-based scheme. With our carefully devised verification data, the cloud server cannot know which data owners, or how many data owners, exchange anchor data which will be used for verifying the cloud server’s misbehavior. With our systematically designed verification construction, the cloud server cannot know which data owners’ data are embedded in the verification data buffer, or how many data owners’ verification data are actually used for verification. All the cloud server knows is that, once he behaves dishonestly, he would be discovered with a high probability, and punished seriously once discovered. Furthermore, we propose to optimize the values of the parameters used in the construction of the secret verification data buffer. Finally, with thorough analysis and extensive experiments, we confirm the efficacy and efficiency of our proposed schemes.

Index Terms—Cloud computing, dishonest cloud server, data verification, deterrent


1 INTRODUCTION

With the advent of cloud computing, more and more people tend to outsource their data to the cloud. Cloud computing provides tremendous benefits including easy access, decreased costs, quick deployment, and flexible resource management [1], [2]. Enterprises of all sizes can leverage the cloud to increase innovation and collaboration.

Although cloud computing brings many benefits, privacy concerns make individuals and enterprise users reluctant to outsource their sensitive data, including private photos, personal health records, and commercial confidential documents, to the cloud, because once sensitive data are outsourced to a remote cloud, the corresponding data owner directly loses control of these data. The 2014 leak of celebrity photos from Apple’s iCloud [3] has furthered our concern regarding the cloud’s data security. Encrypting sensitive data before outsourcing is one way to preserve data privacy against adversaries. However, data encryption becomes an obstacle to the use of traditional applications, e.g., plaintext-based keyword search.

To achieve efficient data retrieval from encrypted data, many researchers have recently put effort into secure keyword search over encrypted data [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18].

• W. Zhang and Y. P. Lin are with the College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China, and with the Hunan Provincial Key Laboratory of Dependable Systems and Networks, Changsha 410082, China. Yaping Lin is the corresponding author. E-mail: {zhangweidoc, yplin}@hnu.edu.cn.

However, all these schemes are based on the ideal assumption that the cloud server is “curious but honest”. Unfortunately, in practical applications, the cloud server may be compromised and behave dishonestly [19], [20]. A compromised cloud server may return false search results to data users for various reasons:

1) The cloud server may return forged search results. For example, the cloud may rank an advertisement higher than other results, since the cloud can profit from it, or the cloud may return random large files to earn money, since the cloud adopts the ‘pay as you consume’ model.

2) The cloud server may return incomplete search results in peak hours to avoid suffering from performance bottlenecks.

Some research has focused on search results verification [21], [22], [23], [24], [25], [26], [27], [28], [29]. However, these methods cannot be applied to verify the top-k ranked search results in the cloud computing environment, where numerous data owners are involved. We illustrate the reason from two aspects.

1) Existing schemes share a common assumption, i.e., that data owners foresee the order of search results. However, in practical applications, numerous data owners are involved; each data owner only knows its own partial order. Without knowing the total order, these data owners cannot use the conventional schemes to verify the search results.

2) For a top-k ranked keyword search (e.g., k = 10), only a few data owners will have matching files in the search results. Traditional methods need to return a lot of data to verify whether the huge number of absent data owners have matching search results. However, the top-k ranked keyword search is, to some extent,


proposed to save communication cost; returning too much verification data would make the top-k ranked search meaningless. Additionally, in the ‘pay as you consume’ cloud computing environment, returning too much data would cause considerable expense for data users, which would make cloud computing lose its attractiveness.

In this paper, we consider a more challenging model, where multiple data owners are involved, and the cloud server would probably behave dishonestly. Based on this model, we explore the problem of result verification for the secure ranked keyword search. Different from previous data verification schemes, we propose a novel deterrent-based scheme. With our carefully devised verification data, the cloud server cannot know which data owners, or how many data owners, exchange anchor data which will be used for verifying the cloud server’s misbehavior. With our systematically designed verification construction, the cloud server cannot know which data owners’ data are embedded in the verification data buffer, or how many data owners’ verification data are actually used for verification. All the cloud server knows is that, once he behaves dishonestly, he would be discovered with a high probability, and punished seriously once discovered. Additionally, when any suspicious action is detected, data owners can dynamically update the verification data stored on the cloud server. Furthermore, we propose to optimize the values of the parameters used in the construction of the secret verification data buffer. Finally, with thorough analysis and extensive experiments, we confirm the efficacy and efficiency of our proposed schemes.

The main contributions of this paper are:

• We formalize the ranked keyword search result verification problem where multiple data owners are involved and the cloud server would probably behave dishonestly.

• We propose a novel secure and efficient deterrent-based verification scheme for secure ranked keyword search.

• We propose to optimize the values of the parameters used in the construction of the verification data buffer.

• We give a thorough analysis and conduct extensive performance experiments to show the efficacy and efficiency of our proposed scheme.

The rest of this paper is organized as follows. Section 2 reviews the related works. Section 3 formulates the problem and introduces notations used in later discussions. Section 4 describes the preliminary techniques that will be used in this paper. In Section 5, we illustrate how to efficiently and securely verify the ranked search results. In Section 6, we show how to optimize the values of the parameters. In Sections 7 and 8, we present analysis and performance evaluation for our proposed schemes, respectively. Finally, we conclude the paper in Section 9.

2 RELATED WORK

2.1 Secure Keyword Search in Cloud Computing

Recently, there have been many research works concerned with secure keyword search in cloud computing. The first secure ranked keyword search over encrypted data was proposed by Wang et al. [4]. Cao et al. [5] and Wen et al. [6] further strengthened ranked keyword search by constructing schemes for privacy-preserving multi-keyword ranked search. In [7], Xu et al. proposed a multi-keyword ranked query scheme on encrypted data, which enables a dynamic keyword dictionary and avoids the problem in which the rank order is perturbed by several high-frequency keywords. Based on information retrieval systems and cryptographic approaches, Ibrahim et al. [8] proposed a ranked searchable encryption scheme for multi-keyword search over a cloud server. Hore et al. [9] further proposed using a set of colors to encode the presence of keywords and creating an index to accelerate the search process. To enrich search functionality, Li et al. [10], Chuah et al. [11], Xu et al. [12], and Wang et al. [13] each proposed fuzzy keyword search over encrypted cloud data. Meanwhile, Wang et al. [14] proposed a privacy-preserving similarity search mechanism over cloud data. To support secure searches in systems where multiple data owners are involved, Sun et al. [15] and Zheng et al. [16] proposed secure attribute-based keyword search schemes. In our previous work [17], [30], we proposed a secure ranked multi-keyword search scheme to support multiple data owners. In [18], we proposed a secure and efficient keyword search protocol in the geo-distributed cloud environment.

All these schemes assume that the cloud server is “curious but honest”. However, in practical applications, the cloud server may be compromised and behave dishonestly. Different from these works, we consider that the cloud server would probably be compromised and behave dishonestly. Based on this consideration, we propose a deterrent-based scheme which deters the cloud server from behaving dishonestly. Additionally, once the cloud server behaves dishonestly, our proposed scheme will detect it with a high probability.

2.2 Verification for Search Results

There is a lot of research concerned with search results verification. Methods used in these works can be classified into two categories, i.e., linked signature chaining and the Merkle hash tree.

The linked signature chaining schemes [21], [22] assume all original data are ordered; the data owner then signs consecutive data items. These signatures finally form a linked chain, and any data tampering or data deletion will be easily detected since it will break the completeness of the signature chain. However, linked signature chaining leads to very high computational costs, storage overhead, and user-side verification time [23].
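The chaining idea described above can be sketched as follows. This is a minimal illustration, not the scheme of [21], [22]: an HMAC stands in for the owner's digital signature, and all names are illustrative. Each item is signed together with its neighbours, so deleting or reordering an item breaks the chain.

```python
import hashlib
import hmac

KEY = b"owner-signing-key"  # stand-in for the owner's private signing key

def chain_signatures(items):
    """Sign each item together with its neighbours so that deleting or
    reordering any item breaks the chain. Boundary items are paired with
    sentinel markers."""
    padded = [b"-INF"] + [i.encode() for i in items] + [b"+INF"]
    sigs = []
    for prev, cur, nxt in zip(padded, padded[1:], padded[2:]):
        sigs.append(hmac.new(KEY, prev + cur + nxt, hashlib.sha256).hexdigest())
    return sigs

def verify_chain(items, sigs):
    return sigs == chain_signatures(items)

items = ["d1", "d2", "d3", "d4"]
sigs = chain_signatures(items)
assert verify_chain(items, sigs)
# Deleting d3 (and its signature) is detected: neighbour bindings no longer match.
assert not verify_chain(["d1", "d2", "d4"], [sigs[0], sigs[1], sigs[3]])
```

The cost criticism in [23] is visible even here: one signature per item must be stored, and the verifier recomputes the whole chain.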


Fig. 1: Architecture of verifying secure ranked keyword search results in cloud computing. (Data owners send encrypted files & indexes & verification data to the dishonest cloud server, and file decryption keys to data users; data users submit trapdoors to the cloud server and receive ranked results & a verification data buffer.)

The Merkle hash tree [24], [25], [26], [27], [28], [29] was proposed to verify the integrity of a very large data set. Data owners first rank all data items, then place the ordered data items in the leaf nodes; further, they construct a Merkle hash tree from the leaf nodes recursively until they get a root node. Finally, data owners sign the root node. For each query, the server has to return all necessary items to reconstruct the root node. Data users first reconstruct the root node, then decrypt the signed root. Finally, they compare whether the two root nodes match. Any data tampering or deletion will lead to an inconsistency in the comparison.
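The build-and-compare process above can be sketched in a few lines. This is a generic Merkle-tree illustration under simplifying assumptions (SHA-256, the last node duplicated on odd levels, and a plain equality check standing in for verifying the owner's signature on the root):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Build a Merkle tree bottom-up over the ordered (ranked) data items
    and return the root hash the owner would sign."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

leaves = [b"item1", b"item2", b"item3", b"item4"]
signed_root = merkle_root(leaves)  # the owner signs this value once
# Verification: the user rebuilds the root from the returned items and
# compares it with the decrypted signed root.
assert merkle_root(leaves) == signed_root
assert merkle_root([b"item1", b"tampered", b"item3", b"item4"]) != signed_root
```

Any tampering or deletion changes a leaf hash, which propagates to the root and makes the comparison fail.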

However, these methods cannot be applied to verify the top-k ranked search results in cloud computing where multiple data owners are involved. First, each data owner only knows its own partial order; they cannot foresee the total order. Second, these methods need to return a lot of data to verify whether the absent data owners have matching search results. However, returning too much verification data would make the top-k ranked search meaningless. Additionally, in the ‘pay as you consume’ cloud computing environment, returning too much data would cause considerable expense for data users, which would cause cloud computing to lose its attractiveness. Different from these methods, we propose a novel deterrent-based verification scheme. Instead of returning each data owner’s data for verification, we only need to return a verification data buffer, in which some other data owners’ anchor values are secretly embedded, which deters the cloud server from behaving dishonestly. On the other hand, once the cloud behaves dishonestly, our scheme can detect it with a high probability.

3 PROBLEM FORMULATION

3.1 System Model

There are three entities involved in our system model, as illustrated in Fig. 1: data owners, the cloud server, and data users. First of all, each data owner extracts keywords from his file collection and constructs indexes (i.e., computing relevance scores between keywords and files, ranking files based on relevance scores, and getting the partial order). For each keyword, he samples ψ files from the corresponding file set, and obtains the file IDs and relevance scores. Then he exchanges these file IDs and relevance scores as anchor data with θ − 1 other data owners chosen uniformly at random. After getting the sampled data and anchor data, each data owner concatenates these data into a string. The encryption of the string is used as each owner’s verification data. At last, each data owner outsources his encrypted files, indexes, and verification data to the cloud server. Once an authorized data user wants to perform a ranked (top-k) secure keyword search over these encrypted files, he first generates his trapdoor (encrypted keyword) and submits it with the variable k to the cloud server. Upon receiving the search request, the cloud server searches locally and returns the top-k relevant data files. The authorized data user decrypts his search results. If the data user finds any suspicious data, he will construct and submit a secret verification request. The cloud server then returns a verification data buffer without knowing which data owners’ verification data are returned. Finally, the data user decrypts and recovers the verification data, and verifies the search results. If the search results do not pass the verification, they are considered contaminated and abandoned. Note that we do not consider how to securely obtain the ranked search results here; interested readers can refer to [7], [14], [17]. Instead, we focus on the secure verification of the secure ranked keyword search results, which is also very important. Tab. 1 lists the notations used in this paper.

TABLE 1: Notations used in the paper

Notation  Description
O         Data owners
F         The plaintext file
w         Keyword
V         Verification data
m         Number of data owners
d         Number of files corresponding to each keyword
ψ         The number of sampled data
θ         Each data owner exchanges anchor data with θ − 1 data owners
κ         The number of hash functions used for mapping
λ         The number of entries in the verification data buffer
α         The actual number of verification data that are requested
β         The enlarged size of the ID set submitted by the data user

3.2 Threat Model

In our threat model, both data owners and authorized data users are trusted. However, different from previous works [4], [5], [7], [14], the cloud server is not trusted and would probably behave dishonestly, i.e., the cloud server not only aims at revealing the contents of encrypted files, keywords, and verification data, but also tends to return false search results and contaminate the secret verification data. We assume that the data owners and authorized data users share a secret hash function, e.g., the keyed-Hash Message Authentication Code (HMAC) [31]. This assumption is practical; e.g., in a large-scale company, the network administrator would distribute this secret hash function when a new employee is enrolled.


3.3 Design Goals

To enable secure and efficient verification for the ranked keyword search, our system design should simultaneously satisfy the following goals.

• Efficiency: The proposed scheme should allow data owners to construct the verification data efficiently. The cloud server should also return the verification data without introducing heavy costs. Additionally, data users should be able to verify the search results efficiently.

• Security: The proposed scheme should prevent the cloud server from knowing the actual value of the secret verification data, and which data owners’ data are returned as verification data.

• Detectability: The proposed scheme should deter the cloud server from behaving dishonestly. Once the cloud server behaves dishonestly, the scheme should detect it with a high probability.

4 PRELIMINARIES

Before we introduce our detailed construction, we first briefly introduce some techniques that will be used in this paper.

4.1 Paillier Cryptosystem

The Paillier cryptosystem [32] is a public-key cryptosystem with additive homomorphic properties. Let E(a) denote the Paillier encryption of a, and D(E(a)) denote the Paillier decryption of E(a); then we have the following properties: ∀a, b ∈ Z_n,

D(E(a) · E(b) mod n^2) = a + b mod n

D(E(a)^b mod n^2) = a · b mod n
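The two homomorphic properties can be checked concretely with a toy Paillier instance. This is an illustration only, using tiny hard-coded primes; a real deployment would use 2048-bit moduli and a vetted cryptographic library. It requires Python 3.9+ for `math.lcm` and the modular inverse via `pow(x, -1, n)`.

```python
import math
import random

# Toy Paillier parameters (illustration only; not secure).
p, q = 47, 59
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)

def encrypt(m):
    """E(m) = g^m * r^n mod n^2 for a random r coprime to n."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """D(c) = L(c^lam mod n^2) * mu mod n."""
    return (L(pow(c, lam, n2)) * mu) % n

a, b = 15, 20
# D(E(a) * E(b) mod n^2) = a + b mod n
assert decrypt(encrypt(a) * encrypt(b) % n2) == (a + b) % n
# D(E(a)^b mod n^2) = a * b mod n
assert decrypt(pow(encrypt(a), b, n2)) == (a * b) % n
```

The first property lets the cloud add encrypted values without decrypting them; the second lets it scale an encrypted value by a plaintext constant.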

4.2 Privacy-Preserving Ranked Keyword Search Among Multiple Data Owners

In our previous work [17], we introduced how to achieve ranked and privacy-preserving keyword search among multiple data owners. First of all, we systematically constructed protocols for how data owners encrypt keywords, how data users generate trapdoors, and how the cloud server performs blind searching. As a result, different data owners use their own secret keys to encrypt their files and keywords. Authorized data users can issue queries without knowing the secret keys of these data owners. Then an Additive Order Preserving Function family is proposed, which enables different data owners to encode their relevance scores with different secret keys, and helps the cloud server return the top-k relevant search results to data users without revealing any sensitive information.

In this paper, we adopt this ranked and privacy-preserving keyword search scheme to return the top-k search results. Our goal is to systematically construct schemes that can verify whether the returned top-k search results are correct.

Fig. 2: The process of verification: 1. Preparing verification data; 2. Constructing the verification request; 3. Mapping verification data to a data buffer; 4. Recovering and verifying.

5 VERIFYING RANKED TOP-k SEARCH RESULTS

The basic idea of our deterrent-based verification scheme is as follows. We can consider the dishonest cloud server as a suspect, the data user as a police chief, and each piece of verification data as a policeman who knows part of the suspect’s actions. Intuitively, the police chief can gather all the policemen to verify whether the suspect has committed a crime. However, this would waste a great deal of manpower, money, and time. To overcome this problem, each time the suspect takes an action, the police chief only inquires of a few policemen to verify whether the suspect has committed a crime. During this process, the police chief ensures that the suspect knows neither which policemen know his actions, nor which policemen are inquired of by the police chief. What the suspect knows is that, once he behaves dishonestly, he will be discovered with high probability, and punished seriously once discovered. In this way, we deter the suspect from behaving dishonestly.

Rather than discovering misbehaviors of the cloud after they occur, we propose a complementary and preventive scheme to deter the cloud from behaving dishonestly. The deterrent in our scheme is derived from a series of constructions, which include embedding secret sampling data and anchor data in the verification data buffer, forcing the cloud to conduct blind computations on ciphertext, updating the verification data dynamically, and so on. The final goal of our deterrent-based scheme is to deter the cloud from behaving dishonestly, and once it misbehaves, it will be detected with high probability.

In what follows, we first give an overview of the verification construction. Then we introduce the detailed construction step by step. Note that we first introduce how to achieve single-dimensional verification; we leave multi-dimensional verification to the extension subsection.

5.1 Verification Construction Overview

The verification construction is composed of four steps, as illustrated in Fig. 2.

First, each data owner prepares the verification data. Specifically, each data owner samples ψ files from the corresponding file set, and obtains the file IDs and relevance scores. With this effective data sampling, a data user can verify the correctness of search results belonging to a specific data owner with a high probability. Then each data owner exchanges these file IDs and relevance scores


as anchor data with θ − 1 other data owners chosen uniformly at random, which will be used to verify the correctness of search results among data owners. After getting the sampled data and anchor data, each data owner concatenates these data into a string. The encryption of the string is used as each owner’s verification data.

Second, data users construct a secret verification request, and indicate the size of the verification data buffer.

Third, the cloud server operates on the encrypted data and returns the verification data buffer.

Fourth, data users decrypt the returned search results and verify whether misbehavior has occurred.

5.2 Preparing the verification data

In this subsection, we introduce how to prepare the verification data step by step.

5.2.1 Sampling from original data

Our sampling method is conducted in three steps. First, the data owner samples files from its original data set. Second, he extracts the corresponding file IDs and relevance scores. Third, he attaches each file ID and relevance score to the owner’s ID. Assume data owner Oi has d files belonging to keyword wt; he samples ψ files’ data from these d files for keyword wt. The corresponding process is shown in Algorithm 1. First of all, Oi initializes the head of the sampled data string as wt||i, where wt denotes the keyword and i denotes Oi’s ID. Second, Oi ranks all the files corresponding to wt in descending order of relevance scores. Third, Oi concatenates FID[0]||RS0,t to wt||i, where FID[0] denotes the file ID of F0, and RS0,t denotes the relevance score between F0 and wt. Fourth, Oi samples ψ − 1 data items from the remaining d − 1 files. The sampled files are thus of two kinds, i.e., the first file in wt’s sorted file list, and ψ − 1 other uniformly and randomly sampled files. Then, all these ψ sampled data and the head data are concatenated. Finally, the algorithm outputs the sampled data set SDi.

Algorithm 1 Constructing Sampled Data

Input: Oi’s ID: i, number of sampled data: ψ, and wt’s file list: FID[d]
Output: Sampled data: SDi

1: Initialize sampled data SDi to wt||i
2: Rank wt’s file list FID[d] in descending order of relevance scores
3: Concatenate FID[0]||RS0,t to SDi
4: Uniformly and randomly generate a (ψ − 1)-number set R where R[ind] ∈ [1, d]
5: Sort R in increasing order
6: for ind = 1 to ψ − 1 do
7:     concatenate FID[R[ind]]||RSR[ind],t to SDi
8: end for
9: return SDi
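Algorithm 1 can be sketched in Python as follows. This is a hedged illustration of the sampling step: the function name, the `(file_id, relevance_score)` pair representation, and the `||` string concatenation are our own conventions, not part of the paper's construction.

```python
import random

def construct_sampled_data(owner_id, psi, file_list, keyword):
    """Sketch of Algorithm 1. `file_list` is a list of
    (file_id, relevance_score) pairs for the given keyword."""
    # Rank the keyword's files in descending order of relevance score.
    ranked = sorted(file_list, key=lambda e: e[1], reverse=True)
    # Head of the sampled-data string: keyword || owner ID.
    sd = f"{keyword}||{owner_id}"
    # The top-ranked file is always sampled.
    fid, rs = ranked[0]
    sd += f"||{fid}||{rs}"
    # Sample psi - 1 of the remaining d - 1 entries uniformly at random,
    # keeping them in rank order.
    picks = sorted(random.sample(range(1, len(ranked)), psi - 1))
    for ind in picks:
        fid, rs = ranked[ind]
        sd += f"||{fid}||{rs}"
    return sd

sd = construct_sampled_data(7, 3,
                            [("f1", 0.9), ("f2", 0.5), ("f3", 0.7), ("f4", 0.2)],
                            "w_t")
# The string always begins with the keyword, the owner ID, and the top file.
assert sd.startswith("w_t||7||f1||0.9")
```

The resulting string would then be encrypted and outsourced as part of the owner's verification data.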

Now we give an example to illustrate the feasibility of sampling a subset of data as verification data. Assume professor A has 100 students who have different numbers of published papers. Professor A is sure that his student SB has the most papers, and SC has the third most, but he is unclear about the publication counts of the other students. When the professor asks which students have the top-5 most papers, he will detect a false answer with probability greater than 0.999. This probability is computed as follows. To compute the probability that professor A can detect a misbehavior, we first compute the probability that A cannot detect it. A cannot detect the misbehavior only when the answer ranks SB first and SC third; there are P(98, 3) such orderings, where P(n, k) denotes the number of k-permutations of n. Additionally, there are P(100, 5) possible answers in total. Therefore, the probability that A cannot detect the misbehavior is P(98, 3)/P(100, 5). Correspondingly, professor A can detect the misbehavior with probability 1 − P(98, 3)/P(100, 5) > 0.9998.
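The detection probability claimed above can be checked directly (Python 3.8+ for `math.perm`):

```python
from math import perm

# Orderings a dishonest answer can take while still ranking SB first and
# SC third: the other 3 of the top-5 slots are filled from the remaining
# 98 students, i.e. P(98, 3) out of P(100, 5) possible top-5 answers.
undetected = perm(98, 3) / perm(100, 5)
p_detect = 1 - undetected
assert p_detect > 0.9998
```

Note that the ratio simplifies to 1/(100 · 99) ≈ 1.01 × 10^−4, which is why the detection probability exceeds 0.9998.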

5.2.2 Exchanging data among data owners

In our system, multiple data owners are involved. For a given keyword, each data owner only knows its own partial order, i.e., no data owner can obtain a total order for the keyword. This poses a great challenge for data users who want to verify whether the returned results are top-k relevant to the search request. A trivial approach is to ask the cloud to return all encoded relevance scores of all data owners and let the data user recompute the top-k results for verification, but this imposes prohibitive computation and communication costs on the data user.

In our scheme, we propose to let data owners exchange a very small amount of data. Specifically, for Oi's keyword wt, each data owner uses the ψ aforementioned sampled files to generate interactive data. Since these operations are conducted among the data owners, the cloud server cannot know whether one data owner holds the interactive data of another. For ease of description, we assume each data owner exchanges the ψ data items with θ − 1 data owners chosen uniformly at random, i.e., after exchanging, each data owner holds θ − 1 other data owners' data, which serve as secret data to detect false search results even if a data owner's own data is not involved in the search results.

5.2.3 Assembling the verification data

Assume data owner Oi receives θ − 1 other data owners' interactive data. Oi then assembles his verification data as follows: first, Oi extracts all the file IDs and relevance scores from the interactive data of the other θ − 1 data owners. Second, Oi ranks his ψ sampled data together with the (θ − 1) · ψ received interactive data. Third, Oi concatenates all the θ · ψ data entries in descending order, where each entry is composed of a file ID and its corresponding

order. Finally, Oi uses symmetric encryption (e.g., AES [33]) to encrypt the concatenation result with his secret key Hs(i), where Hs() is the aforementioned secretly shared hash function and i is Oi's ID. We denote the result of the encryption by Vi; Vi serves as Oi's verification data, which is outsourced to the cloud server along with Oi's encrypted indexes and files.
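The assembly step might be sketched as follows. This is a self-contained illustration under stated assumptions: the paper encrypts with AES under key Hs(i), while here a SHA-256-derived XOR keystream stands in for AES (so the sketch runs without a crypto library), and SHA-256 of the owner ID stands in for the shared secret Hs(); the `fid:rs` entry encoding is our own.

```python
import hashlib

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """Stand-in for AES: XOR with a SHA-256-derived keystream (illustration only).
    Applying it twice with the same key recovers the plaintext."""
    stream, counter = bytearray(), 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

def assemble_verification_data(owner_id, own_samples, received):
    """own_samples / received: lists of (file_id, relevance_score) entries.
    Ranks all theta * psi entries by descending score, concatenates them,
    and encrypts the result under the owner's secret key (stand-in for Hs(i))."""
    entries = sorted(own_samples + received, key=lambda e: e[1], reverse=True)
    concat = "||".join(f"{fid}:{rs}" for fid, rs in entries).encode()
    key = hashlib.sha256(str(owner_id).encode()).digest()
    return toy_encrypt(key, concat)
```

Because the toy cipher is its own inverse, a holder of the key recovers the ranked concatenation by calling `toy_encrypt` on the ciphertext again.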

5.3 Submitting verification request

When an authorized data user wants to verify the search results, he specifies a set of data owners whose verification data should be returned to help verification. The data user can achieve this by simply setting an ID set of his desired data owners. However, the ID set should not be exposed to the cloud server, for the following reason: if the cloud server knows which data owners' data are frequently verified, he can deduce that these data owners' data are very useful or sensitive, so these data owners would easily become attackers' targets. On the other hand, if the cloud server knows which data owners' data are rarely verified, he can maliciously filter out or delete these data owners' data from the search results.

To prevent the cloud server from knowing which data owners' data are actually returned, we construct a secret verification request as follows. First, the data user enlarges the verification ID set by inserting random IDs. Assume a data user wants to get Oi's verification data; he can add n − 1 other data owners' IDs to the set (encryption or obfuscation can be adopted to hide the true ID; for ease of description, we simply demonstrate with plain IDs hereafter). Second, the data user attaches a bit 0 or 1 to each ID: if the data user wants a data owner's verification data to be returned, he attaches 1 to the corresponding ID; otherwise, he attaches 0. Third, the data user encrypts each attached 0 or 1 with the Paillier encryption. Here, we assume all the data owners and authorized data users share a key pair for the Paillier encryption, i.e., the public key PK and the private key SK. Therefore, 0 is encrypted to E(PK, 0) and 1 is encrypted to E(PK, 1). Since the Paillier encryption is probabilistic, the ciphertext of the same data is different each time. Finally, the data user submits the ID set and the attached encrypted data set to the cloud server.

Now we give an example. Assume a data user Uj needs to download O1's and O2's verification data. First, he formulates a large ID set, say, {O1, O2, O3, · · · , On}. Then he attaches E(PK, 1) to O1 and O2, and E(PK, 0) to the other IDs. Finally, the data user submits {< O1, E(PK, 1) >, < O2, E(PK, 1) >, < O3, E(PK, 0) >, · · · , < On, E(PK, 0) >} as the verification request to the cloud server.

5.4 Returning verification data

Upon receiving a data user's verification request, the cloud server follows Algorithm 2 to prepare and return the

Algorithm 2 Securely returning verification data

Input: Verification request set [< j, E(PK, rj) >], j ∈ [1, β]; the size of the verification data buffer: λ

Output: Verification data buffer VB

1: The cloud initializes VB with λ entries, each entry with initial value 1
2: for j ∈ [1, β] do
3:   Locate Oj's verification data Vj
4:   Compute vd = E(PK, rj)^Vj
5:   for i in range (0, κ) do
6:     VB[hi(j)] = VB[hi(j)] · vd
7:   end for
8: end for
9: return VB

verification data. Specifically, the cloud server first initializes a verification data buffer with λ entries, where λ is specified by the data user. Then, the cloud server finds each data owner's verification data indicated by the requested ID set (line 3), conducts calculations on the encrypted data (line 4), and maps the results into the verification data buffer with κ hash functions (lines 5-7), where the output of each hash function lies in [0, λ). Note that, since the size of the verification buffer, i.e., λ, is specified by the data user, different users will submit different λ to the cloud. To ensure that the output of each hash function lies in [0, λ), instead of changing the κ hash functions all the time, we only need to specify that the cloud apply a modulo-λ operation to the output of each hash function. When the cloud server finishes processing all the IDs in the requested ID set, the resulting verification data buffer is returned to the data user. Note that, during the whole process, the cloud server only sees the enlarged ID set and conducts computation on encrypted data. Therefore, the cloud server knows nothing about which data owners' verification data are actually returned and used for verification.
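Algorithm 2 can be exercised end to end with a minimal textbook Paillier implementation. This is a sketch under explicit assumptions: the primes are small and fixed (completely insecure, illustration only), the verification data are small integers rather than AES ciphertexts, and the hash family `h` is an arbitrary toy stand-in for the paper's κ shared hash functions.

```python
import math
import random

def keygen(p, q):
    """Textbook Paillier key generation with g = n + 1 (toy parameters)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)  # L(g^lam mod n^2)^-1 mod n
    return (n, g), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

pk, sk = keygen(999983, 1000003)        # two known small primes
n, n2 = pk[0], pk[0] ** 2

# Toy instance: four owners; the user actually wants V1 and V2 (r = 1).
V = {1: 1111, 2: 2222, 3: 3333, 4: 4444}
request = {j: encrypt(pk, 1 if j in (1, 2) else 0) for j in V}

lam_buf, kappa = 6, 2                   # buffer length lambda and hash count kappa
def h(i, j):                            # toy hash family (our own choice)
    return (1000003 * i + 31 * j + i * j) % lam_buf

VB = [1] * lam_buf                      # 1 is a valid encryption of 0
for j, Vj in V.items():
    vd = pow(request[j], Vj, n2)        # line 4: E(PK, r_j)^{V_j}
    for i in range(kappa):              # lines 5-7: map vd kappa times
        VB[h(i, j)] = VB[h(i, j)] * vd % n2

decrypted = [decrypt(pk, sk, entry) for entry in VB]
```

Any entry that received exactly one requested owner decrypts to that owner's data, while non-requested owners contribute encryptions of 0; hence summing all decrypted entries yields κ · (V1 + V2) regardless of collisions.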

In the above description, we use multiple hash functions; here we elaborate on the reason for introducing them: to prevent the cloud from knowing which verification data are actually retrieved by the data user. There are three alternative ways to achieve this. First, we can request the cloud to return all the verification data each time, so that the cloud knows nothing about which specific verification data are actually used. However, this method leads to a heavy communication cost between the cloud and the data user. Second, the data user can prepare the enlarged ID set, where some fake IDs are also involved, and specify that the cloud put the verification data at specific positions in the verification data buffer. However, the data user then has to direct the cloud to return data correctly, which is not user friendly. In addition, during the process

Fig. 3: Example of verification data buffer construction

of specification, the data user would reveal his sensitive data. Therefore, we adopt the third way, i.e., hashing the verification data into the verification data buffer directly, which lets the cloud return the verification data without knowing which, or how many, verification data are actually returned.

Now we give an example (shown in Fig. 3) to illustrate how to map the verification data into the verification data buffer, based on the homomorphic property of the Paillier encryption. First, the cloud server finds the encrypted verification data {V1, V2, V3, V4}. Then he conducts calculation on the ciphertexts, i.e., computes {E(PK, 1)^V1, E(PK, 1)^V2, E(PK, 0)^V3, E(PK, 0)^V4}. Due to the homomorphism of the Paillier encryption, we have:

D(E(PK, 1)^V1) = D(E(PK, 1 · V1)) = D(E(PK, V1))
D(E(PK, 1)^V2) = D(E(PK, 1 · V2)) = D(E(PK, V2))
D(E(PK, 0)^V3) = D(E(PK, 0 · V3)) = D(E(PK, 0))
D(E(PK, 0)^V4) = D(E(PK, 0 · V4)) = D(E(PK, 0))

Further, the cloud uses two hash functions h1(i) and h2(i) to map the encrypted data into the verification data buffer. Again, by the homomorphic property, we have, e.g., D(E(PK, 0) · E(PK, V1)) = D(E(PK, 0 + V1)) = D(E(PK, V1)). In this way, the verification data V1 is returned without the cloud knowing it. Finally, the resulting buffer is returned to the data user. From the viewpoint of the cloud server, the verification data of four data owners (i.e., O1, O2, O3, and O4) are processed; as a matter of fact, only O1's and O2's verification data are returned, which the cloud does not know.

5.5 Verifying Search Results

5.5.1 Recovering verification data

Upon receiving the verification data buffer VB, the data user decrypts it with the corresponding private key SK. After decryption, the data user can recover verification data from each entry where no collision happens, i.e., where exactly one owner's data is mapped and no other data is mapped.

Fig. 4: Example of decrypting the verification data buffer

Fig. 4 shows the decryption result of VB in Fig. 3: the data user can recover V1 and V2 from the first and second entries of VB, respectively. Note that, since the data user can pre-compute the entries where no collision occurs, the authorized data user only needs to decrypt those entries instead of decrypting the whole verification data buffer, which helps improve the decryption efficiency.

Note that the cloud server knows that, if a data collision happens in an entry, the data in that entry cannot be recovered. To prevent the data user from recovering the verification data and detecting a misbehavior, the cloud server could contaminate entries in the verification data buffer and pretend that collisions happened there. However, this attack cannot succeed. The fundamental reason is that the data user specifies the IDs of the data owners whose verification data will be returned and knows the κ hash functions; therefore, the data user can foresee whether a collision happens in any entry of the verification data buffer. When the cloud server contaminates the data in collision-free entries, the misbehavior is easily detected.
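The user-side prediction of collision-free entries is a simple counting exercise. A sketch, with a toy hash family of our own choosing standing in for the κ shared hash functions:

```python
from collections import Counter

lam_buf, kappa = 6, 2           # buffer length lambda and hash count kappa
def h(i, j):                    # toy stand-in for the kappa shared hash functions
    return (7 * i + 13 * j) % lam_buf

enlarged_ids = [1, 2, 3, 4]     # the beta IDs the user sent to the cloud
hits = Counter(h(i, j) for j in enlarged_ids for i in range(kappa))
collision_free = sorted(pos for pos, count in hits.items() if count == 1)
# The user decrypts only these entries; if the cloud claims a collision there,
# or tampers with one of them, the misbehavior is immediately exposed.
```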

5.5.2 Verifying the ranked search results

When the data user obtains some data owners' verification data, he can recover all the sampled data and anchor data, and use them to verify whether the returned results are correct. The verification proceeds in two steps: first, the data user verifies whether the data from each specific data owner is correct. If the search results pass this first check, the verification turns to the second step, i.e., with the help of the anchor data, the data user verifies whether the search results across different data owners are correct. After verification, the data user can detect the cloud server's misbehavior with high probability. In Section 7, we give an analysis of the detection probability.

5.6 Extension: Multi-dimensional Verification

As we know, a file is often associated with several keywords, i.e., the indexes of files are often multi-dimensional. If we return verification data for each dimension (keyword), the communication cost increases linearly with the number of dimensions. In practice, however, there are often relationships among different dimensions. For example, if the height of a man is 2.1 meters, then his weight is very likely more than 60 kilograms.

We use the Pearson correlation coefficient to evaluate the correlation between the order lists of two keywords (dimensions). Given w1's order list

Fig. 5: Example of binding dimensions (r2 plotted against r1)

r1 = < r11, r12, · · · , r1n > and w2's order list r2 = < r21, r22, · · · , r2n >, we first compute the covariance cov(r1, r2) of the two lists, where cov(r1, r2) = E((r1i − E(r1))(r2i − E(r2))). Then we compute the Pearson correlation coefficient r = cov(r1, r2)/(σ(r1) · σ(r2)), where σ(·) denotes the standard deviation. Finally, we use the relationship rule of thumb [34] to evaluate the correlation of r1 and r2: if |r| ≥ 2/√n, there exists a strong correlation between r1 and r2; otherwise, the correlation is ignored.

Fig. 5 shows an example of binding two dimensions together. Dimension r1 has values <0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9> and r2 has values <0.5, 1, 1.2, 1.3, 1.5, 1.8, 1.9, 2, 2.1, 2.15, 2.4, 2.45, 2.5, 2.7, 2.75, 2.78, 2.8, 2.88>. We first compute the standard deviations of r1 and r2, obtaining σ(r1) = 2.67 and σ(r2) = 0.7, respectively. Then we compute the covariance of r1 and r2 and get cov(r1, r2) = 1.82. Now we obtain the correlation coefficient r = cov(r1, r2)/(σ(r1) · σ(r2)) = 0.97. Since 0.97 > 2/√18, i.e., r > 2/√n, according to the relationship rule of thumb, there exists a strong correlation between r1 and r2. Therefore, we further deduce the relationship between r1 and r2, namely r2 = √r1, with a sliding interval of ±0.2.
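The numbers in this example can be reproduced directly (using sample standard deviations and the matching sample covariance):

```python
from statistics import mean, stdev

r1 = [0.5 * i for i in range(1, 19)]   # 0.5, 1.0, ..., 9.0
r2 = [0.5, 1, 1.2, 1.3, 1.5, 1.8, 1.9, 2, 2.1, 2.15,
      2.4, 2.45, 2.5, 2.7, 2.75, 2.78, 2.8, 2.88]

m1, m2 = mean(r1), mean(r2)
# sample covariance, consistent with the sample standard deviations below
cov = sum((a - m1) * (b - m2) for a, b in zip(r1, r2)) / (len(r1) - 1)
r = cov / (stdev(r1) * stdev(r2))
threshold = 2 / len(r1) ** 0.5          # rule-of-thumb bound 2 / sqrt(n)
strong = abs(r) >= threshold
```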

By finding the relationships among dimensions, we can bind these dimensions together and return only a few dimensions as verification data. As a result, even if some dimensions are not returned, their values can still be estimated and verified. In our scheme, data owners first dig into the relationships among different dimensions, then model these relationships with functions and set bound values for them. Finally, these functions and bound values are used to bind the correlated dimensions. Therefore, once the value of one dimension is returned, the values of its correlated dimensions can also be estimated.

5.6.1 Discussion

Binding data across dimensions increases the detection efficiency. However, when we bind too many dimensions, it becomes much easier for the cloud to cheat. The problem therefore becomes how to improve the detection efficiency as much as possible without sacrificing too much of the security goals. In future work, we plan to discuss how to combine different dimensions based on the characteristics of the data set, and to what extent data can be combined, so as to achieve a good tradeoff.

5.7 Updating verification data stored on the cloud server

Since the data users in the system may themselves be data owners, if they find that some dishonest behavior is not detected by the existing verification data, they will update the verification data stored on the cloud server. The update is finished in three steps: first, they download the verification data from the cloud server. Second, they decrypt the verification data, update it with new sampled data and anchor data, and encrypt the new verification data. Third, they outsource the ciphertext of the updated verification data to the cloud.

5.8 Auxiliary Deterrent Method

To strengthen the deterrent on the cloud server, we illustrate some auxiliary deterrent methods here. If the data user detects any problem during the verification process, he announces the error to the public. Otherwise, the data user publishes all the returned data from the cloud, his trapdoor, and his secret verification request to the public for supervision. We assume the data owners periodically check the data published by the data users. A data owner who has more relevant files that were not returned as results will soon detect the dishonest behavior of the cloud. This scheme incurs some delay, but it strengthens the deterrent on the cloud server. On the other hand, once the data users discover dishonest behavior, the cloud server should be seriously penalized. This deters the cloud server from daring to behave dishonestly.

6 SETTING THE OPTIMAL PARAMETERS

Since data users are often resource-limited, to control the communication cost, it is important to enable data users to specify the length of the verification data buffer. To increase the detection probability of potentially dishonest behaviors of the cloud server, it is crucial to recover as many verification data items from the data buffer as possible. An intuitive way to improve the detection probability is to let the cloud server map as many data items into the verification data buffer as possible. However, the more data items are mapped, the higher the probability that a collision occurs. As a result, when too many data items are mapped into the verification data buffer, the amount of data that can be recovered from it becomes very small.

Assume the data user specifies the length of the verification data buffer as λ, the number of hash functions used for mapping as κ, and the size of the enlarged ID set (i.e., the number of verification data items that we map into the data buffer) as β. The coming question is that, given

λ and κ, how to set the optimal β so as to maximize the number of verification data items that can be recovered from the data buffer.

Next, we introduce how to obtain the optimal β step by step. First of all, we compute the probability of recovering x data items from a data buffer into which β data items have been mapped. This probability is the same as that of x colors surviving in the following color survival game [35] [36].

Color Survival Game: Assume there are β colors, each color having the same number κ of balls, and we throw all these balls into λ buckets uniformly at random. If exactly one ball falls into a bucket, we say that ball survives; otherwise, the balls in that bucket fail. If any one of the κ balls of a color survives, we say the color survives. Clearly, the probability of successfully recovering x data items from the data buffer equals the probability of x colors surviving.
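The game is easy to simulate, which also gives an empirical check on the analysis that follows. A Monte Carlo sketch with illustrative parameters of our own choosing:

```python
import random

def survived_colors(beta, kappa, lam, trials=1000):
    """Monte Carlo estimate of the expected number of surviving colors:
    beta colors, kappa balls per color, lam buckets."""
    total = 0
    for _ in range(trials):
        balls = [(color, random.randrange(lam))
                 for color in range(beta) for _ in range(kappa)]
        load = [0] * lam
        for _, bucket in balls:
            load[bucket] += 1
        # a color survives if any of its kappa balls is alone in its bucket
        total += len({color for color, bucket in balls if load[bucket] == 1})
    return total / trials
```

With few colors and many buckets nearly every color survives; crowding the buckets (e.g. `survived_colors(200, 3, 100)`) makes survivors rare, which is exactly the trade-off the optimal β balances.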

Since the probability that color i survives depends on how many buckets are covered by the other β − 1 colors, we first compute that number of buckets. Each color covers a fraction κ/λ of the buckets, and the coverages of different colors are essentially independent; therefore, the β − 1 other colors cover a fraction 1 − (1 − κ/λ)^(β−1) of the buckets. Denoting the number of buckets covered by the other β − 1 colors as T, we have

T = ⌊λ · (1 − (1 − κ/λ)^(β−1))⌋    (1)

The probability that all of color i's balls fall into buckets covered by other colors is C(T, κ)/C(λ, κ), where C(n, k) denotes the binomial coefficient. Denoting the survival probability of color i as ps, we have

ps = 1 − C(T, κ)/C(λ, κ)    (2)

Obviously, each color has the same survival probability. Let px denote the probability that exactly x colors survive; then px = C(β, x) · (ps)^x · (1 − ps)^(β−x). Strictly speaking, the colors are not independent in this computation, but for the parameter values of interest they are essentially independent. Therefore, the probability that exactly x data items can be recovered from the verification data buffer is:

px = C(β, x) · (ps)^x · (1 − ps)^(β−x)    (3)

We denote by E(x) the expected number of data items that can be recovered from the data buffer when β data items are mapped into it. Therefore,

E(x) = Σ_{x=1}^{β} x · px
     = Σ_{x=1}^{β} x · C(β, x) · (ps)^x · (1 − ps)^(β−x)    (4)
     = β · ps · Σ_{x=0}^{β−1} C(β − 1, x) · (ps)^x · (1 − ps)^(β−1−x)
     = β · ps

Obviously, for any κ ≥ 4, the maximal expected number of data items that we can recover from the data buffer is E(x) = ⌊λ/κ⌋, i.e., ⌊β · ps⌋ = ⌊λ/κ⌋. With Eq. 1, Eq. 2, and Eq. 4, we can compute β, which is very close to ⌊(2 − ln(κ))/(ln(λ − κ) − ln(λ))⌋.

Therefore, to recover the maximal number of data items from the data buffer, the optimal number of data items to map into the buffer is:

β = ⌊(2 − ln(κ)) / (ln(λ − κ) − ln(λ))⌋    (5)

Now, we can conclude that, if the data user specifies the length of the verification data buffer as λ and the number of hash functions used for mapping as κ, then, to maximize the number of verification data items that can be recovered from the data buffer, the size of the enlarged ID set should be set to ⌊(2 − ln(κ))/(ln(λ − κ) − ln(λ))⌋.

7 ANALYSIS

In this section, we give a thorough analysis of the security and performance of our proposed schemes. First, we analyze the security. Then we illustrate the deterrent imposed by our scheme. Further, we detail the computation and communication costs for the data owners, the cloud server, and the data users. Finally, we analyze the detection probability of our proposed schemes with the verification data.

7.1 Security Analysis

Recall that, for ranked and privacy-preserving keyword search, data owners encrypt the keywords, relevance scores, and files before outsourcing these data items to the cloud. Data users also generate secure trapdoors before submitting them to the cloud. The security of these operations is proven in [17]. For search result verification, each data owner constructs secret verification data encrypted with AES [33]. Therefore, the verification data is secure as long as AES is secure. The verification request is encrypted with the Paillier encryption [32], and the cloud server only conducts computation on the ciphertexts. Therefore, the verification process is secure as long as the Paillier encryption is secure.

Fig. 6: Misbehavior Detection Probability. (a) m = 10, d = 100, ki = 1, for γ = 4 and γ = 6. (b) d = 100, γ = 10, k = 50, for m = 10 and m = 20. (c) d = 100, ki = 1, k = 50, for m = 10 and m = 20.

7.2 Deterrent Analysis

In this paper, we propose a deterrent-based verification scheme. During the whole verification process, the cloud server only conducts computation on ciphertexts. Therefore, he does not know how many data owners' verification data are actually used for verification, or which data owners' data are embedded in the verification data buffer. Furthermore, the cloud server does not know which data owners, or how many data owners, exchange anchor data. We keep all this information secret from the cloud server. All the cloud server knows is that, once he behaves dishonestly, he will be discovered with high probability, and seriously punished once discovered.

7.3 Performance Analysis

7.3.1 Costs for Data Owners

The computational cost that a data owner spends on verification mainly comes from constructing the verification data. For data sampling, the running time mainly comes from ordering the files for each keyword; the computational complexity is therefore O(d · log(d)). For data assembling, a data owner needs to rank the ψ · θ data items and then encrypt the assembled data, which is O(ψ · θ). Therefore, the computational complexity for each data owner is O(max {d · log(d), ψ · θ}).

The communication cost mainly comes from two aspects, i.e., anchor data exchanging and verification data buffer transmission. For anchor data exchanging, each data owner needs to transmit ψ anchor data items to θ − 1 data owners, costing O(θ · ψ). For verification data buffer transmission, the communication cost is also O(θ · ψ). So the total communication cost is O(θ · ψ).

7.3.2 Costs for Cloud Server

The computational cost that the cloud server spends on verification mainly comes from mapping the verification data into the data buffer. Since the data user provides an enlarged ID set of size β, the cloud server needs to map the corresponding β data owners' verification data into the data buffer, where each data item is mapped κ times. Therefore, the computational complexity for the cloud server is O(β · κ).

The communication cost mainly comes from transmitting the data buffer and receiving the verification data. The cloud server needs to return a data buffer with λ entries, where the communication cost of each entry is O(ψ · θ). For receiving the verification data, assuming there are m data owners in the system, the cost is O(m · θ · ψ). Therefore, the communication cost for the cloud server is O(max {λ · ψ · θ, m · θ · ψ}).

7.3.3 Costs for Data User

The computational cost that a data user spends on verification mainly comes from three aspects: first, constructing the verification request, which is O(β); second, decrypting the entries that the α requested IDs map to and where no collision happens, which is O(κ · α); third, detecting whether misbehavior happens, which costs O(α · θ · ψ). So the computational cost is O(max {β, κ · α, α · θ · ψ}).

The communication cost that a data user spends on verification comes from receiving the verification data buffer, so the communication cost is O(λ · ψ · θ).

7.4 Misbehavior Detection Probability

Our proposed scheme should not only impose a strong deterrent against potential attacks, but also achieve a high detection probability once a compromised cloud server misbehaves. We now analyze the detection probability.

For the k returned search results, suppose ki of them are also contained in the verification data. Assume there are m data owners in our system, and the data user recovers γ distinct verification data items from the data buffer. The data user can then detect an error with probability Pe:

Pe = 1 − P(m · d − γ, k − ki) / P(m · d, k)    (6)

Fig. 6 describes the relationship between the detection probability and the corresponding parameters. From Fig. 6(a), we observe that, when we set m = 10 (the number of data owners in the system) and d = 100 (the average number of files corresponding to a keyword), even if ki = 1, i.e., only one of the k search results has corresponding verification data, the detection probability still exceeds 0.999.

We can also see that, as the number of returned results (k) increases, the detection probability increases. Additionally, the larger γ (the number of distinct returned verification data items) is, the higher the detection probability; this follows directly from Equation 6. In Fig. 6(b), we set d = 100, γ = 10, and k = 50; as ki increases, the detection probability also increases, and when ki is larger than 2 it is very close to 1. From Fig. 6(c), we see that, when we set k = 50 and ki = 1, the detection probability grows as γ grows. Additionally, the larger m is, the higher the probability achieved.
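Eq. 6 is cheap to evaluate; the following sketch reproduces a point from the parameter range of Fig. 6(a) (the specific argument values are our own sample choices):

```python
from math import perm

def detection_prob(m, d, gamma, k, ki):
    """Eq. 6: Pe = 1 - P(m*d - gamma, k - ki) / P(m*d, k)."""
    return 1 - perm(m * d - gamma, k - ki) / perm(m * d, k)

# e.g. m = 10 owners, d = 100 files per keyword, gamma = 4, k = 50, ki = 1
pe = detection_prob(10, 100, 4, 50, 1)
```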

8 PERFORMANCE EVALUATION

In this section, we present a thorough evaluation of our proposed scheme. First, we evaluate the computational costs of the data owners, the cloud server, and the data users. Then we evaluate the functionality of the verification data buffer.

8.1 Experiment Settings

The experiment programs are coded in the Python programming language on a laptop with a 2.2 GHz Intel Core CPU and 2 GB of memory. We use the Paillier encryption [32] for data encryption; the secret key is set to 512 bits.

8.2 Experiment Results

Recall that the time cost for each data owner mainly comes from ranking, string concatenation, and symmetric encryption. Fig. 7(a) and Fig. 7(b) show that, as the number of sampled data items (ψ) and exchanged data items (θ) increases, the time cost of the data owners increases linearly. The fundamental reason is that, with ψ and θ increasing, more data items are concatenated and encrypted, which results in more time consumption.

The time cost that the cloud server spends on verification mainly comes from two aspects, i.e., conducting computation on the user-submitted ciphertexts and mapping the computation results into the verification data buffer. Fig. 8(a) shows that, as the size of the enlarged ID set (β) increases, the corresponding time cost increases linearly. The reason is that more IDs lead to more computation on the ciphertexts, which needs more time. Fig. 8(b) demonstrates that the time cost of the cloud server has little connection with the size of the verification data buffer.

Fig. 9 shows the time cost of the data users. Fig. 9(a) illustrates the time cost of generating the verification request. As we can see, when the size of the enlarged ID set increases from 10 to 100, the corresponding time cost increases from 0.024s to 0.24s. The reason is that the larger β is, the more Paillier encryptions are conducted. We also observe that the time cost of the data user has little connection with α; compared with the time spent on the Paillier encryption, the cost of conducting α symmetric encryption operations is relatively low. Fig. 9(b) demonstrates the time cost of decrypting and recovering the verification data. Since the time is mainly spent on decrypting the verification data buffer, as the size of the data buffer increases, the corresponding time cost of the data users increases linearly.

Fig. 10 shows the ratio of data recovered from the verification data buffer under different parameters. Fig. 10(a) shows that, when the size of the verification data buffer is set to 500, i.e., λ = 500, the ratio of recovered data decreases as the number of mapped data items increases. Meanwhile, the more hash functions we use, the lower the ratio of data that can be recovered. The fundamental reason is that the more data we map into the data buffer, the higher the probability of data collisions, which renders some data unrecoverable. Therefore, the ratio of recovered data decreases accordingly.
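This collision effect can be reproduced with a small simulation. The model below is our own simplification: each item places κ copies at random buffer entries, and an item counts as recoverable if at least one copy lands in an entry by itself. The parameter names mirror the paper's λ (entries), κ (hash functions), and θ (mapped items).

```python
import random

def recovery_ratio(num_items: int, num_entries: int, num_hashes: int, seed: int = 0) -> float:
    """Fraction of items with at least one collision-free copy in the buffer."""
    rng = random.Random(seed)
    # each item maps num_hashes copies to random entries
    positions = [[rng.randrange(num_entries) for _ in range(num_hashes)]
                 for _ in range(num_items)]
    load = [0] * num_entries                  # how many copies land in each entry
    for copies in positions:
        for p in copies:
            load[p] += 1
    # an item is recoverable if any of its entries holds exactly one copy
    recovered = sum(1 for copies in positions
                    if any(load[p] == 1 for p in copies))
    return recovered / num_items

# More mapped items -> more collisions -> lower recovery ratio (lambda=500, kappa=30).
print(recovery_ratio(10, 500, 30), recovery_ratio(90, 500, 30))
```

Running this with the figure's parameters shows the same monotone decline in recovery ratio as the number of mapped items grows.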

Fig. 10(b) demonstrates that, when we set the number of hash functions to 30, i.e., κ = 30, the ratio of recovered data increases with the size of the data buffer. Meanwhile, the fewer data items we map, the faster the data recovery ratio increases. The reason is that a larger data buffer and fewer mapped data items reduce the probability of data collisions; therefore, the ratio of recovered data increases.

Fig. 10(c) demonstrates that, when the number of entries is set to 500, the ratio of recovered data decreases as the number of hash functions increases; moreover, the more data items we map into the verification data buffer, the faster the data recovery ratio decreases.

Fig. 11 shows the number of data items recovered from the verification data buffer under different parameters. Fig. 11(a) shows that, when λ = 500, as the amount of mapped data increases, the number of data items recovered from the verification data buffer first increases and then falls. We explain this phenomenon as follows: when we map only a few data items into the verification data buffer, few data collisions occur, and almost all the data can be recovered, so the amount of recovered data increases. However, once the amount of mapped data passes a threshold, data collisions grow with the number of distributed data items. Since data collisions render data unrecoverable, the amount of recovered data decreases when we map too many data items into the verification data buffer.
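The rise-then-fall shape can be checked with a back-of-the-envelope model (our own approximation, not the paper's analysis): a given copy has its buffer entry to itself with probability roughly (1 - 1/λ)^(κθ-1), and an item is recoverable if any of its κ copies survives alone.

```python
def expected_recovered(theta: int, lam: int = 500, kappa: int = 30) -> float:
    """Approximate expected number of recoverable items (independence assumption)."""
    # probability that a given copy has its entry to itself
    p_alone = (1 - 1 / lam) ** (kappa * theta - 1)
    # probability that at least one of the item's kappa copies survives alone
    p_item = 1 - (1 - p_alone) ** kappa
    return theta * p_item

# Recovered count first rises with theta, then falls as collisions dominate.
print([round(expected_recovered(t), 1) for t in (10, 50, 90)])
```

With λ = 500 and κ = 30, the expected recovered count peaks at a moderate θ and drops off on both sides, matching the hump in Fig. 11(a).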

Fig. 11(b) demonstrates that, when we set κ = 30, the number of recovered data items increases with the size of the verification data buffer. An interesting feature shown in Fig. 11(b) is that the fewer items we map into the verification data buffer, the sooner (i.e., at smaller buffer sizes) the data can be fully recovered. The fundamental reason is that, with fewer data items, fewer entries are needed to accommodate data collisions.

Fig. 11(c) demonstrates that, when λ = 500, the amount of recovered data decreases as the number of hash functions increases; additionally, the more data items we map into the verification data buffer, the faster the number of recovered data items decreases. The reason is that more mapping operations and more mapped data items raise the probability of data collisions, which reduces the amount of recoverable data.

[Fig. 7: Time cost of the data owner (ms). (a) vs. number of sampled data, for θ = 10, 20; (b) vs. number of exchanged data, for ψ = 10, 20.]

[Fig. 8: Time cost of the cloud server (s). (a) vs. size of enlarged ID set, for λ = 100, 200; (b) vs. size of verification data buffer, for β = 10, 30.]

[Fig. 9: Time cost of the data user (×10⁻² s). (a) vs. size of enlarged ID set, for α = 4, 10; (b) vs. size of verification data buffer (×10²), for α = 4, 10.]

9 CONCLUSION

In this paper, we explore the problem of result verification for secure ranked keyword search, under a model where the cloud server may behave dishonestly. Different from previous data verification schemes, we propose a novel deterrent-based scheme. During the whole verification process, the cloud server does not know which data owners, or how many data owners, exchange the anchor data used for verification; he also does not know which data owners' data are embedded in the verification data buffer, or how many data owners' verification data are actually used for verification. All the cloud server knows is that, once he behaves dishonestly, he will be discovered with high probability, and punished seriously once discovered. Additionally, when any suspicious action is detected, data owners can dynamically update the verification data stored on the cloud server. Furthermore, our proposed scheme allows data users to control the communication cost of verification according to their preferences, which is especially important for resource-limited data users. Finally, with thorough analysis and extensive experiments, we confirm the efficacy and efficiency of our proposed schemes.

[Fig. 10: Ratio of data recovered from the verification data buffer. (a) vs. number of mapped data, λ = 500, for κ = 20, 30, 40; (b) vs. number of entries, κ = 30, for θ = 50, 70, 90; (c) vs. number of hash functions, λ = 500, for θ = 30, 40, 50.]

[Fig. 11: Amount of data recovered from the verification data buffer. (a) vs. number of mapped data, λ = 500, for κ = 20, 30, 40; (b) vs. number of entries, κ = 30, for θ = 50, 70, 90; (c) vs. number of hash functions, λ = 500, for κ = 30, 40, 50.]

ACKNOWLEDGMENTS

This work is supported in part by the National Natural Science Foundation of China (Project No. 61173038, 61472125). Thanks to the China Scholarship Council for its financial support.

REFERENCES

[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "A view of cloud computing," Communications of the ACM, vol. 53, no. 4, pp. 50–58, 2010.

[2] C. Zhu, V. Leung, X. Hu, L. Shu, and L. T. Yang, "A review of key issues that concern the feasibility of mobile cloud computing," in Proc. IEEE GreenCom and iThings/CPSCom, 2013, pp. 769–776.

[3] Ritz, "Vulnerable iCloud may be the reason to celebrity photo leak." [Online]. Available: http://marcritz.com/icloud-flaw-leak/

[4] C. Wang, N. Cao, J. Li, K. Ren, and W. Lou, "Secure ranked keyword search over encrypted cloud data," in Proc. IEEE ICDCS'10, Genoa, Italy, Jun. 2010, pp. 253–262.

[5] N. Cao, C. Wang, M. Li, K. Ren, and W. Lou, "Privacy-preserving multi-keyword ranked search over encrypted cloud data," in Proc. IEEE INFOCOM'11, Shanghai, China, Apr. 2011, pp. 829–837.

[6] W. Sun, B. Wang, N. Cao, M. Li, W. Lou, Y. T. Hou, and H. Li, "Privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking," in Proc. ASIACCS'13, Hangzhou, China, May 2013, pp. 71–81.

[7] Z. Xu, W. Kang, R. Li, K. Yow, and C. Xu, "Efficient multi-keyword ranked query on encrypted data in the cloud," in Proc. IEEE ICPADS'12, Singapore, Dec. 2012, pp. 244–251.

[8] A. Ibrahim, H. Jin, A. A. Yassin, and D. Zou, "Secure rank-ordered search of multi-keyword trapdoor over encrypted cloud data," in Proc. IEEE APSCC'12, Guilin, China, Dec. 2012, pp. 263–270.

[9] B. Hore, E. C. Chang, M. H. Diallo, and S. Mehrotra, "Indexing encrypted documents for supporting efficient keyword search," in Proc. SDM'12, Istanbul, Turkey, Aug. 2012, pp. 93–110.

[10] J. Li, Q. Wang, C. Wang, N. Cao, K. Ren, and W. Lou, "Fuzzy keyword search over encrypted data in cloud computing," in Proc. IEEE INFOCOM'10, San Diego, CA, Mar. 2010, pp. 1–5.

[11] M. Chuah and W. Hu, "Privacy-aware bedtree based solution for fuzzy multi-keyword search over encrypted data," in Proc. IEEE ICDCS'11, Minneapolis, MN, Jun. 2011, pp. 383–392.

[12] P. Xu, H. Jin, Q. Wu, and W. Wang, "Public-key encryption with fuzzy keyword search: A provably secure scheme under keyword guessing attack," IEEE Transactions on Computers, vol. 62, no. 11, pp. 2266–2277, 2013.

[13] B. Wang, S. Yu, W. Lou, and Y. T. Hou, "Privacy-preserving multi-keyword fuzzy search over encrypted data in the cloud," in Proc. IEEE INFOCOM'14, Toronto, Canada, May 2014, pp. 2112–2120.

[14] C. Wang, K. Ren, S. Yu, and K. M. R. Urs, "Achieving usable and privacy-assured similarity search over outsourced cloud data," in Proc. IEEE INFOCOM'12, Orlando, FL, Mar. 2012, pp. 451–459.

[15] W. Sun, S. Yu, W. Lou, Y. T. Hou, and H. Li, "Protecting your right: Attribute-based keyword search with fine-grained owner-enforced search authorization in the cloud," in Proc. IEEE INFOCOM'14, Toronto, Canada, May 2014, pp. 226–234.

[16] Q. Zheng, S. Xu, and G. Ateniese, "VABKS: Verifiable attribute-based keyword search over outsourced encrypted data," in Proc. IEEE INFOCOM'14, Toronto, Canada, May 2014, pp. 522–530.

[17] W. Zhang, S. Xiao, Y. Lin, T. Zhou, and S. Zhou, "Secure ranked multi-keyword search for multiple data owners in cloud computing," in Proc. IEEE/IFIP DSN 2014, Atlanta, USA, Jun. 2014, pp. 276–286.

[18] W. Zhang, Y. Lin, S. Xiao, Q. Liu, and T. Zhou, "Secure distributed keyword search in multiple clouds," in Proc. IEEE/ACM IWQoS'14, Hong Kong, May 2014, pp. 370–379.

[19] B. Wang, B. Li, and H. Li, "Oruta: Privacy-preserving public auditing for shared data in the cloud," IEEE Transactions on Cloud Computing, vol. 2, no. 1, pp. 43–56, 2014.

[20] J. Li, X. Tan, X. Chen, D. Wong, and F. Xhafa, "OPoR: Enabling proof of retrievability in cloud computing with resource-constrained devices," IEEE Transactions on Cloud Computing, 2014.

[21] H. Pang, A. Jain, K. Ramamritham, and K.-L. Tan, "Verifying completeness of relational query results in data publishing," in Proc. ACM SIGMOD'05, 2005, pp. 407–418.

[22] M. Narasimha and G. Tsudik, "DSAC: Integrity for outsourced databases with signature aggregation and chaining," in Proc. ACM CIKM'05, 2005, pp. 235–236.

[23] H. Pang and K. Mouratidis, "Authenticating the query results of text search engines," Proceedings of the VLDB Endowment, vol. 1, no. 1, pp. 126–137, 2008.

[24] R. C. Merkle, "A certified digital signature," in Proc. CRYPTO'89, California, USA, Aug. 1989, pp. 218–238.

[25] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin, "Dynamic authenticated index structures for outsourced databases," in Proc. ACM SIGMOD'06, 2006, pp. 121–132.

[26] Y. Yang, S. Papadopoulos, D. Papadias, and G. Kollios, "Authenticated indexing for outsourced spatial databases," The VLDB Journal, vol. 18, no. 3, pp. 631–648, 2009.

[27] Q. Chen, H. Hu, and J. Xu, "Authenticating top-k queries in location-based services with confidentiality," Proceedings of the VLDB Endowment, vol. 7, no. 1, 2013.

[28] H. Hu, J. Xu, Q. Chen, and Z. Yang, "Authenticating location-based services without compromising location privacy," in Proc. ACM SIGMOD'12, 2012, pp. 301–312.

[29] W. Sun, B. Wang, N. Cao, M. Li, W. Lou, Y. Hou, and H. Li, "Verifiable privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking," IEEE Transactions on Parallel and Distributed Systems, 2013.

[30] W. Zhang, S. Xiao, Y. Lin, J. Wu, and S. Zhou, "Privacy preserving ranked multi-keyword search for multiple data owners in cloud computing," IEEE Transactions on Computers, 2015.

[31] D. Eastlake and P. Jones, "US secure hash algorithm 1 (SHA1)," 2001.

[32] P. Paillier, "Public-key cryptosystems based on composite degree residuosity classes," in Advances in Cryptology (EUROCRYPT'99). Springer, 1999, pp. 223–238.

[33] J. Daemen and V. Rijmen, The Design of Rijndael: AES, the Advanced Encryption Standard. Springer, 2002.

[34] S. B. Green, "How many subjects does it take to do a regression analysis," Multivariate Behavioral Research, vol. 26, no. 3, pp. 499–510, 1991.

[35] "Importance of being urnest." [Online]. Available: http://www.mathpages.com/home/kmath321.htm

[36] R. Ostrovsky and W. E. Skeith III, "Private searching on streaming data," in Advances in Cryptology (CRYPTO 2005). Springer, 2005, pp. 223–240.

Wei Zhang was born in 1990 and received his B.S. degree in Computer Science from Hunan University, China, in 2011. Since 2011, he has been a Ph.D. candidate in the College of Computer Science and Electronic Engineering, Hunan University. Since 2014, he has been a visiting student in the Department of Computer and Information Sciences, Temple University. His research interests include cloud computing, network security, and data mining.

Yaping Lin received the B.S. degree in Computer Application from Hunan University, China, in 1982, and the M.S. degree in Computer Application from the National University of Defense Technology, China, in 1985. He received the Ph.D. degree in Control Theory and Application from Hunan University in 2000. He has been a professor and Ph.D. supervisor at Hunan University since 1996. During 2004-2005, he worked as a visiting researcher at the University of Texas at Arlington. His research interests include machine learning, network security, and wireless sensor networks.