an efficeient privacy preserving ranked keyword search

An Efficient Privacy-Preserving RankedKeyword Search Method

Chi Chen,Member, IEEE, Xiaojie Zhu, Student Member, IEEE, Peisong Shen, Student Member, IEEE,

Jiankun Hu,Member, IEEE, Song Guo, Senior Member, IEEE, Zahir Tari, Senior Member, IEEE, and

Albert Y. Zomaya, Fellow, IEEE

Abstract—Cloud data owners prefer to outsource documents in an encrypted form for the purpose of privacy preserving. Therefore it

is essential to develop efficient and reliable ciphertext search techniques. One challenge is that the relationship between documents

will be normally concealed in the process of encryption, which will lead to significant search accuracy performance degradation. Also

the volume of data in data centers has experienced a dramatic growth. This will make it even more challenging to design ciphertext

search schemes that can provide efficient and reliable online information retrieval on large volume of encrypted data. In this paper, a

hierarchical clustering method is proposed to support more search semantics and also to meet the demand for fast ciphertext search

within a big data environment. The proposed hierarchical approach clusters the documents based on the minimum relevance threshold,

and then partitions the resulting clusters into sub-clusters until the constraint on the maximum size of cluster is reached. In the search

phase, this approach can reach a linear computational complexity against an exponential size increase of document collection. In order

to verify the authenticity of search results, a structure called minimum hash sub-tree is designed in this paper. Experiments have been

conducted using the collection set built from the IEEE Xplore. The results show that with a sharp increase of documents in the dataset

the search time of the proposed method increases linearly whereas the search time of the traditional method increases exponentially.

Furthermore, the proposed method has an advantage over the traditional method in the rank privacy and relevance of retrieved

documents.

Index Terms—Cloud computing, ciphertext search, ranked search, multi-keyword search, hierarchical clustering, security

Ç

1 INTRODUCTION

AS we step into the big data era, terabyte of data are pro-duced world-wide per day. Enterprises and users who

own a large amount of data usually choose to outsourcetheir precious data to cloud facility in order to reduce datamanagement cost and storage facility spending. As a result,data volume in cloud storage facilities is experiencing adramatic increase. Although cloud server providers (CSPs)claim that their cloud service is armed with strong securitymeasures, security and privacy are major obstacles prevent-ing the wider acceptance of cloud computing service [1].

A traditional way to reduce information leakage is dataencryption. However, this will make server-side data utili-zation, such as searching on encrypted data, become a very

challenging task. In the recent years, researchers haveproposed many ciphertext search schemes [34], [35], [36],[37], [43] by incorporating the cryptography techniques.These methods have been proven with provable security,but their methods need massive operations and have hightime complexity. Therefore, former methods are not suitablefor the big data scenario where data volume is very big andapplications require online data processing. In addition, therelationship between documents is concealed in the abovemethods. The relationship between documents representsthe properties of the documents and hence maintaining therelationship is vital to fully express a document. For exam-ple, the relationship can be used to express its category. If adocument is independent of any other documents exceptthose documents that are related to sports, then it is easy forus to assert this document belongs to the category of thesports. Due to the blind encryption, this important propertyhas been concealed in the traditional methods. Therefore,proposing a method which can maintain and utilize thisrelationship to speed the search phase is desirable.

On the other hand, due to software/hardware failure,and storage corruption, data search results returning to theusers may contain damaged data or have been distorted bythe malicious administrator or intruder. Thus, a verifiablemechanism should be provided for users to verify the cor-rectness and completeness of the search results.

In this paper, a vector space model is used and every doc-ument is represented by a vector, which means every docu-ment can be seen as a point in a high dimensional space.Due to the relationship between different documents, all the

� C. Chen, X. Zhu and P. Shen is with the State Key Laboratory OfInformation Security, Institute of Information Engineering, ChineseAcademy of Sciences, Beijing, China.E-mail: {chenchi, zhuxiaojie, shenpeisong}@iie.ac.cn.

� J. Hu is with the Cyber Security Lab, School of Engineering and IT,University of New South Wales at the Australian Defence Force Academy,Canberra, ACT 2600, Australia. E-mail: [email protected].

� S. Guo is with the School of Computer Science and Engineering, TheUniversity of Aizu, Japan. E-mail: [email protected].

� Z. Tari is with the School of Computer Science, RMIT University,Australia. E-mail: [email protected].

� A. Y. Zomaya is with the School of Information Technologies, TheUniversity of Sydney, Australia. E-mail: [email protected].

Manuscript received 29 Sept. 2014; revised 8 Apr. 2015; accepted 8 Apr. 2015.Date of publication 21 Apr. 2015; date of current version 16 Mar. 2016.Recommended for acceptance by R. Kwok.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference the Digital Object Identifier below.Digital Object Identifier no. 10.1109/TPDS.2015.2425407

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 4, APRIL 2016 951

1045-9219� 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistributionrequires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

www.redpel.com +917620593389


documents can be divided into several categories. In otherwords, the points whose distance are short in the highdimensional space can be classified into a specific category.The search time can be largely reduced by selecting thedesired category and abandoning the irrelevant categories.Comparing with all the documents in the dataset, the num-ber of documents which user aims at is very small. Due tothe small number of the desired documents, a specific cate-gory can be further divided into several sub-categories.Instead of using the traditional sequence search method, abacktracking algorithm is produced to search the targetdocuments. Cloud server will first search the categories andget the minimum desired sub-category. Then the cloudserver will select the desired k documents from the mini-mum desired sub-category. The value of k is previouslydecided by the user and sent to the cloud server. If currentsub-category can not satisfy the k documents, cloud serverwill trace back to its parent and select the desired documentsfrom its brother categories. This process will be executedrecursively until the desired k documents are satisfied or theroot is reached. To verify the integrity of the search result, averifiable structure based on hash function is constructed.Every document will be hashed and the hash result will beused to represent the document. The hashed results of docu-ments will be hashed again with the category informationthat these documents belong to and the result will be used torepresent the current category. Similarly, every categorywill be represented by the hash result of the combination ofcurrent category information and sub-categories informa-tion. A virtual root is constructed to represent all the dataand categories. The virtual root is denoted by the hash resultof the concatenation of all the categories located in the firstlevel. The virtual root will be signed so that it is verifiable.To verify the search result, user only needs to verify the vir-tual root, instead of verifying every document.

2 EXISTING SOLUTIONS

In recent years, searchable encryption which provides textsearch function based on encrypted data has been widelystudied, especially in security definition, formalizationsand efficiency improvement, e.g. [2], [3], [4], [5], [6], [7].As shown in Fig. 1, the proposed method is comparedwith existing solutions and has the advantage in main-taining the relationship between documents.

2.1 Single Keyword Searchable Encryption

Song et al. [2] first introduced the notion of searchableencryption. They propose to encrypt each word in the docu-ment independently. This method has a high searching costdue to the scanning of the whole data collection word by

word. Goh [8] formally defined a secure index structure andformulate a security model for index known as semanticsecurity against adaptive chosen keyword attack (ind-cka).They also developed an efficient ind-cka secure index con-struction called z-idx by using pseudo-random functionsand bloom filters. Cash et al. [41] recently design and imple-ment an efficient data structure. Due to the lack of rankmechanism, users have to take a long time to select whatthey want when massive documents contain the query key-word. Thus, the order-preserving techniques are utilized torealize the rank mechanism, e.g. [9], [10], [11]. Wang et al.[12] use encrypted invert index to achieve secure rankedkeyword search over the encrypted documents. In thesearch phase, the cloud server computes the relevance scorebetween documents and the query. In this way, relevantdocuments are ranked according to their relevance scoreand users can get the top-k results. In the public key setting,Boneh et al. [3] designed the first searchable encryption con-struction, where anyone can use public key to write to thedata stored on server but only authorized users owning pri-vate key can search. However, all the above mentioned tech-niques only support single keyword search.

2.2 Multiple Keyword Searchable Encryption

To enrich search predicates, a variety of conjunctive key-word search methods (e.g. [7], [13], [14], [15], [16]) have beenproposed. These methods show large overhead, such ascommunication cost by sharing secret, e.g. [14], or computa-tional cost by bilinear map, e.g.[7]. Pang et al. [17] propose asecure search scheme based on vector space model. Due tothe lack of the security analysis for frequency informationand practical search performance, it is unclear whether theirscheme is secure and efficient or not. Cao et al. [18] present anovel architecture to solve the problem of multi-keywordranked search over encrypted cloud data. But the searchtime of this method grows exponentially accompanyingwith the exponentially increasing size of the documentcollections. Sun et al. [19] give a new architecture whichachieves better search efficiency. However, at the stage ofindex building process, the relevance between documents isignored. As a result, the relevance of plaintexts is concealedby the encryption, users expectation cannot be fulfilled well.For example: given a query containing Mobile and Phone,only the documents containing both of the keywords will beretrieved by traditional methods. But if taking the semanticrelationship between the documents into consideration, thedocuments containing Cell and Phone should also beretrieved. Obviously, the second result is better at meetingthe users expectation.

2.3 Verifiable Search Based on Authenticated Index

The idea of data verification has been well studied in thearea of databases. In a plaintext database scenario, a varietyof methods have been produced, e.g. [20], [21], [22]. Most ofthese works are based on the original work by Merkle [23],[24] and refinements by Naor and Nissm [25] for certificaterevocation. Merkle hash tree and cryptographic signaturetechniques are used to construct authenticated tree struc-ture upon which end users can verify the correctness andcompleteness of the query results.

Fig. 1. Architecture of ciphertext search.

952 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 4, APRIL 2016



Pang and Mouratidis [26] apply the Merkle hash treebased on authenticated structure to text search engines.However, they only focus on the verification-specific issuesignoring the search privacy preserving capabilities that willbe addressed in this paper.

The hash chain is used to construct a single keywordsearch result verification scheme by Wang et al. [9]. Sun et al.[19] use Merkle hash tree and cryptographic signature tocreate a verifiable MDB-tree. However, their work cannot bedirectly used in our architecture which is oriented for pri-vacy-preserving multiple keyword search. Thus, a propermechanism that can be used to verify the search resultswithinbig data scenario is essential to both the CSPs and end users.

3 OUR CONTRIBUTION

In this paper, we propose a multi-keyword ranked searchover encrypted data based on hierarchical clustering index(MRSE-HCI) to maintain the close relationship between dif-ferent plain documents over the encrypted domain in orderto enhance the search efficiency. In the proposed architec-ture, the search time has a linear growth accompanying withan exponential growing size of data collection. We derivethis idea from the observation that users retrieval needs usu-ally concentrate on a specific field. So we can speed upthe searching process by computing relevance score betweenthe query and documents which belong to the same specificfield with the query. As a result, only documents which areclassified to the field specified by users query will be evalu-ated to get their relevance score. Due to the irrelevant fieldsignored, the search speed is enhanced.

We investigate the problem of maintaining the close rela-tionship between different plain documents over anencrypted domain and propose a clustering method to solvethis problem. According to the proposed clustering method,every document will be dynamically classified into a specificcluster which has a constraint on the minimum relevancescore between different documents in the dataset. Therelevance score is a metric used to evaluate the relationshipbetween different documents. Due to the new documentsadded to a cluster, the constraint on the cluster may be bro-ken. If one of the new documents breaks the constraint, anew cluster center will be added and the current documentwill be chosen as a temporal cluster center. Then all the docu-ments will be reassigned and all the cluster centers will bereelected. Therefore, the number of clusters depends on thenumber of documents in the dataset and the close relation-ship between different plain documents. In other words, thecluster centers are created dynamically and the number ofclusters is decided by the property of the dataset.

We propose a hierarchical method in order to get a betterclustering result within a large amount of data collection.The size of each cluster is controlled as a trade-off betweenclustering accuracy and query efficiency. According to theproposed method, the number of clusters and the minimumrelevance score increase with the increase of the levelswhereas the maximum size of a cluster reduces. Dependingon the needs of the grain level, the maximum size of a clusteris set at each level. Every cluster needs to satisfy the con-straints. If there is a cluster whose size exceeds the limitation,this cluster will be divided into several sub-clusters.

We design a search strategy to improve the rank pri-vacy. In the search phase, the cloud server will first com-pute the relevance score between query and clustercenters of the first level and then chooses the nearest clus-ter. This process will be iterated to get the nearest childcluster until the smallest cluster has been found. Thecloud server computes the relevance score between queryand documents included in the smallest cluster. If thesmallest cluster can not satisfy the number of desireddocuments which is previously decided by user, cloudserver will trace back to the parent cluster of the smallestcluster and the brother clusters of the smallest cluster willbe searched. This process will be iterated until the num-ber of desired documents is satisfied or the root isreached. Due to the special search procedures, the rank-ings of documents among their search results are differ-ent with the rankings derived from traditional sequencesearch. Therefore, the rank privacy is enhanced.

Some part of the above work has been presented in[27]. For further improvement, we also construct a verifi-able tree structure upon the hierarchical clusteringmethod to verify the integrity of the search result in thispaper. This authenticated tree structure mainly takes theadvantage of the Merkle hash tree and cryptographic sig-nature. Every document will be hashed and the hashresult will be used as the representative of the document.The smallest cluster will be represented by the hash resultof the combination of the concatenation of the documentsincluded in the smallest cluster and own category infor-mation. The parent cluster is represented by the hashresult of the combination of the concatenation of its chil-dren and own category information. A virtual root isadded and represented by the hash result of the concate-nation of the categories located in the first level. In addi-tion, the virtual root will be signed so that user canachieve the goal of verifying the search result by verifyingthe virtual root.

In short, our contributions can be summarized as follows:

1) We investigate the problem of maintaining the closerelationship between different plain documents overan encrypted domain and propose a clusteringmethod to solve this problem.

2) We proposed the MRSE-HCI architecture to speedup server-side searching phase. Accompanying withthe exponential growth of document collection, thesearch time is reduced to a linear time instead ofexponential time.

3) We design a search strategy to improve the rank pri-vacy. This search strategy adopts the backtrackingalgorithm upon the above clustering method. Withthe growing of the data volume, the advantage ofthe proposed method in rank privacy tends to bemore apparent.

4) By applying the Merkle hash tree and cryptographicsignature to authenticated tree structure, we providea verification mechanism to assure the correctnessand completeness of search results.

The organization of the following parts of the paper isas follows: Section 4 describes the system model, threatmodel, design goals and notations. The architecture and

CHEN ETAL.: AN EFFICIENT PRIVACY-PRESERVING RANKED KEYWORD SEARCH METHOD 953



detailed algorithm are displayed in Section 5. We discussthe efficiency and security of MRSE-HCI scheme in Section6. An evaluation method is provided in Section 7. Section 8demonstrates the result of our experiments. Section 9concludes the paper.

4 DEFINITION AND BACKGROUND

4.1 System Model

The system model contains three entities, as illustrated inFig. 1, the data owner, the data user, and the cloud server.The box with dashed lines in the figure indicates the addedcomponent to the existing architecture.

The data owner is responsible for collecting documents,building document index and outsourcing them in anencrypted format to the cloud server. Apart from that, thedata user needs to get the authorization from the data ownerbefore accessing to the data. The cloud server provides ahuge storage space, and the computation resources neededby ciphertext search. Upon receiving a legal request from thedata user, the cloud server searches the encrypted index, andsends back top-k documents that are most likely to matchusers query [11]. The number k is properly chosen by thedata user. Our system aims at protecting data from leakinginformation to the cloud server while improving the effi-ciency of ciphertext search.

In this model, both the data owner and the data user aretrusted, while the cloud server is semi-trusted, which is con-sistent with the architecture in [9], [18], [28]. In other words,the cloud server will strictly follow the predicated order andtry to getmore information about the data and the index.

4.2 Threat Model

The adversarys ability can be concluded in two threatmodels.

Known ciphertext model. In thismodel, Cloud server can getencrypted document collection, encrypted data index, andencrypted query keywords.

Known background model. In this model, cloud serverknows more information than that in known ciphertextmodel. Statistical background information of dataset, suchas the document frequency and term frequency informationof a specific keyword, can be used by the cloud server tolaunch a statistical attack to infer or identify specific key-word in the query [9], [10], which further reveals the plain-text content of documents. The adversarys ability can berepresented in the above two threat models.

4.3 Design Goals

� Search efficiency. The time complexity of search timeof the MRSE-HCI scheme needs to be logarithmicagainst the size of data collection in order to dealwith the explosive growth of document size in bigdata scenario.

� Retrieval accuracy. Retrieval precision is related totwo factors: the relevance between the query and thedocuments in result set, and the relevance of docu-ments in the result set.

� Integrity of the search result. The integrity of thesearch results includes three aspects:

1) Correctness. All the documents returned fromservers are originally uploaded by the dataowner and remain unmodified.

2) Completeness. No qualified documents are omit-ted from the search results.

3) Freshness. The returned documents are the latestversion of documents in the dataset.

� Privacy requirements. We set a series of privacyrequirements which current researchers mostlyfocus on.1) Data privacy. Data privacy presents the confi-

dentiality and privacy of documents. The adver-sary cannot get the plaintext of documentsstored on the cloud server if data privacy isguaranteed. Symmetric cryptography is a con-ventional way to achieve data privacy.

2) Index privacy. Index privacy means the ability tofrustrate the adversary attempt to steal the infor-mation stored in the index. Such informationincludes keywords and the TF (Term Frequency)of keywords in documents, the topic of docu-ments, and so on.

3) Keyword privacy. It is important to protect usersquery keywords. Secure query generation algo-rithm should output trapdoors which leak noinformation about the query keywords.

4) Trapdoor unlinkability. Trapdoor unlinkabilitymeans that each trapdoor generated from thequery is different, even for the same query. It canbe realized by integrating a random function inthe trapdoor generation process. If the adversarycan deduce the certain set of trapdoors which allcorresponds to the same keyword, he can calculatethe frequency of this keyword in search request ina certain period. Combined with the documentfrequency of keyword in known backgroundmodel, he/she can use statistical attack to identifythe plain keyword behind these trapdoors.

5) Rank privacy. Rank order of search resultsshould be well protected. If the rank orderremains unchanged, the adversary can comparethe rank order of different search results, furtheridentify the search keyword.

4.4 Notations

In this paper, notations presented in Table 1 are used.

5 ARCHITECTURE AND ALGORITHM

5.1 System Model

In this section, we will introduce the MRSE-HCI scheme.The vector space model adopted by the MRSE-HCI schemeis same as the MRSE [18], while the process of buildingindex is totally different. The hierarchical index structure isintroduced into the MRSE-HCI instead of sequence index.In MRSE-HCI, every document is indexed by a vector.Every dimension of the vector stands for a keyword and thevalue represents whether the keyword appears or not in thedocument. Similarly, the query is also represented by a vec-tor. In the search phase, cloud server calculates the rele-vance score between the query and documents by




computing the inner product of the query vector and docu-ment vectors and return the target documents to useraccording to the top k relevance score.

Due to the fact that all the documents outsourced to thecloud server is encrypted, the semantic relationshipbetween plain documents over the encrypted documents islost. In order to maintain the semantic relationship betweenplain documents over the encrypted documents, a cluster-ing method is used to cluster the documents by clusteringtheir related index vectors. Every document vector isviewed as a point in the n-dimensional space. With thelength of vectors being normalized, we know that the dis-tance of points in the n-dimensional space reflect the rele-vance of corresponding documents. In other word, points ofhigh relevant documents are very close to each other in then-dimensional space. As a result, we can cluster the docu-ments based on the distance measure.

With the volume of data in the data center has experienceda dramatic growth, conventional sequence search approachwill be very inefficient. To promote the search efficiency, ahierarchical clustering method is proposed. The proposedhierarchical approach clusters the documents based on theminimum relevance threshold at different levels, and thenpartitions the resulting clusters into sub-clusters until the con-straint on the maximum size of cluster is reached. Uponreceiving a legal request, cloud server will search the relatedindexes layer by layer instead of scanning all indexes.

5.2 MRSE-HCI Architecture

MRSE-HCI architecture is depicted by Fig. 2, where thedata owner builds the encrypted index depending on the

dictionary, random numbers and secret key, the data usersubmits a query to the cloud server for getting desireddocuments, and the cloud server returns the target docu-ments to the data user. This architecture mainly consistsof following algorithms.

� Keygenð1lðnÞÞ ! ðsk; kÞ. It is used to generate thesecret key to encrypt index and documents.

� IndexðD; skÞ ! I. Encrypted index is generated inthis phase by using the above mentioned secret key.At the same time, clustering process is also includedcurrent phase.

� EncðD; kÞ ! E. The document collection isencrypted by a symmetric encryption algorithmwhich achieves semantic security.

� Trapdoorðw; skÞ ! Tw. It generates encrypted queryvector Tw with users input keywords and secret key.

� SearchðTw; I; ktopÞ ! ðIw; EwÞ. In this phase, cloudserver compares trapdoor with index to get the top-kretrieval results.

� DecðEw; kÞ ! Fw. The returned encrypted documentsare decrypted by the key generated in the first step.

The concrete functions of different components isdescribed as below.

1) Keygenð1lðnÞÞ. The data owner randomly generatesa ðnþ uþ 1Þ bit vector S where every element is ainteger 1 or 0 and two invertible ðnþ uþ 1Þ�ðnþ uþ 1Þmatrices whose elements are random inte-gers as secret key sk. The secret key k is generated bythe data owner choosing an n-bit pseudo sequence.

2) IndexðD; skÞ. As show in the Fig. 3, the data owneruses tokenizer and parser to analyze every docu-ment and gets all keywords. Then data owner usesthe dictionary Dw to transform documents to acollection of document vectors DV . Then the dataowner calculates the DC and CCV by using a qual-ity hierarchical clustering (QHC) method whichwill be illustrated in section C. After that, the dataowner applies the dimension-expanding and

TABEL 1Notations

di The ith document vector, denoted asdi ¼ fdi;1; . . . ; di;ng, where di;j represents whetherthe jth keyword in the dictionary appears indocument di.

m The number of documents in the data collection.n The size of dictionaryDW .CCV The collection of cluster centers vectors, denoted as

CCV ¼ fc1; . . . ; cng, where ci is the average vectorof all document vectors in the cluster.

CCVi The collection of the ith level cluster center vectors,denoted as CCVi ¼ fvi;1; . . . ; vi;ngwhere Vi;j

represents the jth vector in the ith level.DC The information of documents classification such

as document id list of a certain cluster.DV The collection of document vectors, denoted as

DV ¼ fd1; d2; . . . ; dmg.DW The dictionary, denoted asDw ¼ fw1; w2; . . . ; wng.Fw The ranked id list of all documents according to

their relevance to keyword w.Ic The clustering index which contains the encrypted

vectors of cluster centers.Id The traditional index which contains encrypted

document vectors.Li The minimum relevance score between different

documents in the ith level of a cluster.QV The query vector.TH A fixed maximum number of documents in a

cluster.Tw The encrypted query vector for users query.

Fig. 2. MRSE-HCI architecture.

Fig. 3. Algorithm index.




vector-splitting procedure to every document vec-tor. It is worth noting that CCV is treated equallyas DV . For dimension-expanding, every vector inDV is extended to ðnþ uþ 1Þ bit-long, where thevalue in nþ jð0 � j � uÞ dimension is an integernumber generated randomly and the last dimen-sion is set to 1. For vector-splitting, every extendeddocument vector is split into two ðnþ uþ 1Þ bit-long vectors, V 0 and V 00 with the help of theðnþ uþ 1Þbit vector S as a splitting indicator. Ifthe ith element of S (Si ) is 0, then we setV 00i ¼ V 0

i ¼ Vi ; If ith element of S (Si ) is 1, then V 00i

is set to a random number and V 0i ¼ Vi � V 00

i .Finally, the traditional index Id is encrypted as

Id ¼ fMT1 V

0;MT2 V

00gby using matrix multiplicationwith the sk, and Ic is generated in a similar way.After this, Id ,Ic , and DC are outsourced to thecloud server.

3) EncðD; kÞ. The data owner adopts a secure symmet-ric encryption algorithm (e.g. AES) to encrypt theplain document set D and outsources it to the cloudserver.

4) Trapdoorðw; skÞ. The data user sends the query tothe data owner who will later analyze the queryand builds the query vector QV by analyzing thekeywords of query with the help of dictionary DW ,QV then is extended to a ðnþ uþ 1Þ bit queryvector. Subsequently,v random positions chosenfrom a range ðn; nþ u� in QV are set to 1, othersare set to 0.The value at last dimension of QV isset to a random number t�½0; 1�. Then the firstðnþ uÞdimensions of QW , denoted as qw, is scaledby a random number rðr 6¼ 0Þ ,Qw ¼ ðr � qw; tÞ .After that, Qw is split into two random vectors asfQ0

W;Q00Wg with vector-splitting procedure which is

similar to that in the IndexðD; skÞ phase. The dif-ference is that if the ith bit of S is 1, then we haveq0i ¼ q00i ¼ qi; If the ith bit of S is 0, q0i is set as arandom number and q00i ¼ qi � q0i. Finally, theencrypted query vector Tw is generated as Tw ¼fM�1

1 Q0w;M

�12 Q00

wg and sent back to the data user.5) SearchðTw; I; ktopÞ. Upon receiving the Tw from data

user, the cloud server computes the relevance scorebetween Tw and index Ic and then choosesthe matched cluster which has the highest rele-vance score. For every document contained in thematched cluster, the cloud server extract its corre-sponding encrypted document vector in Id , andcalculates its relevance score S with Tw , asdescribed in the Equation (1). Finally, these scoresof documents in the matched cluster are sorted and

the top ktop documents are returned by the cloud

server. The detail will be discussed in the Section 5.5.S ¼ Tw � Ic¼ fM�1

1 Q0w;M

�12 Q00

wg � fMT1 V

0;MT2 V

00g¼ Q0

w � V 0 þQ0w � V 00

¼ Qw � V:

(1)

6) DecðEw; kÞ. The data user utilizes the secret key k todecrypt the returned ciphertext Ew.

5.3 Relevance Measure

In this paper, the concept of coordinate matching [29] isadopted as a relevance measure. It is used to quantify therelevance of document-query and document-document. Itis also used to quantify the relevance of the query andcluster centers. Equation (2) defines the relevance scorebetween document di and query qw . Equation (3) definesthe relevance score between query qw and cluster centerlci;j . Equation (4) defines the relevance score betweendocument di and dj.

Sqdi ¼Xnþuþ1

t¼1

ðqw;t � di;tÞ (2)

Sqci ¼Xnþuþ1

t¼1

ðqw;t � lci;j;tÞ (3)

Sddi ¼Xnþuþ1

t¼1

ðdi;t � dj;tÞ: (4)

5.4 Quality Hierarchical Clustering Algorithm

So far, a lot of hierarchical clustering methods has beenproposed. However all of these methods are not compara-ble to the partition clustering method in terms of timecomplexity performance. K-means [30] and K-medois [31]are popular partition clustering algorithms. But the k isfixed in the above two methods, which can not be appliedto the situation of dynamic number of cluster centers. Wepropose a quality hierarchical clustering algorithm basedon the novel dynamic K-means.

As the proposed dynamic K-means algorithm shown inthe Fig. 4, the minimum relevance threshold of the clus-ters is defined to keep the cluster compact and dense. Ifthe relevance score between a document and its centeris smaller than the threshold, a new cluster center isadded and all the documents are reassigned. The aboveprocedure will be iterated until k is stable. Comparingwith the traditional clustering method, k is dynamicallychanged during the clustering process. This is why it iscalled dynamic K-means algorithm .

The QHC algorithm is illustrated in the Fig. 5. It goeslike that. Every cluster will be checked on whether its

Fig. 4. Algorithm dynamic k-means.

Fig. 5. Algorithm quality hierarchical clustering (QHC).




size exceeds the maximum number TH or not. If theanswer is “yes”, this “big” cluster will be split into childclusters which are formed by using the dynamic K-meanson the documents of this cluster. This procedure will beiterated until all clusters meet the requirement of maxi-mum cluster size. Clustering procedure is illustrated inFig. 6. All the documents are denoted as points in a coor-dinate system. These points are initially partitioned intotwo clusters by using dynamic K-means algorithm whenthe k ¼ 2. These two bigger clusters are depicted by theelliptical shape. Then these two clusters are checked tosee whether their points satisfy the distance constraint.The second cluster does not meet this requirement, thusa new cluster center is added with k ¼ 3 and the dynamicK-means algorithm runs again to partition the second clus-ter into two parts. Then the data owner checks whetherthese clusters size exceed the maximum number TH.Cluster 1 is split into two sub-clusters again due to its bigsize. Finally all points are clustered into four clusters asdepicted by the rectangle.

5.5 Search Algorithm

The cloud server needs to find the cluster that most matchesthe query. With the help of cluster index Ic and documentclassification DC , the cloud server uses an iterative proce-dure to find the best matched cluster. Following instancedemonstrates how to get matched one:

1) The cloud server computes the relevance scorebetween Query Tw and encrypted vectors of the firstlevel cluster centers in cluster index Ic, then choosesthe ith cluster center Ic;1;i which has the highest score.

2) The cloud server gets the child cluster centers of thecluster center, then computes the relevance scorebetween Tw and every encrypted vectors of childcluster centers, and finally gets the cluster center Ic;2;iwith the highest score. This procedure will be iter-ated until that the ultimate cluster center Ic;l;j in lastlevel l is achieved.

In the situation depicted by Fig. 7, there are nine docu-ments which are grouped into three clusters. After calculat-ing the relevance score with trapdoor Tw , cluster 1, which isshown within the box of dummy line in Fig. 7, is found to bethe best match. Documents d1,d3 ,d9 belong to cluster 1, thentheir encrypted document vectors in the Id are extracted outto compute the relevance score with Tw.

5.6 Search Result Verification

The retrieved data have high possibility to be wrong sincethe network is unstable and the data may be damaged dueto the hardware/software failure or malicious administratoror intruder. Verifying the authenticity of search results isemerging as a critical issue in the cloud environment. We,therefore, designed a signed hash tree to verify the correct-ness and freshness of the search results.

� Building. The data owner builds the hash tree basedon the hierarchical index structure. The algorithmshown in the Fig. 8 is described as follows. The hashvalue of the leaf node of the tree is hðid k version kFðidÞÞ where id means document id, version meansdocument version and FðidÞ means the documentcontents. The value of non-leaf node is a pair of valuesðid; hðid k hchildÞÞ where id denotes the value of thecluster center or document vector in the encryptedindex, and hchild is the hash value of its child node.The hash value of tree root node is based on the hashvalues of all clusters in the first level. It is worth not-ing that the root node denotes the data set whichcontains all clusters. Then the data owner generatesthe signature of the hash values of the root node andoutsources the hash tree including the root signatureto the cloud server. Cryptographic signature s (e.g.,RSA signature, DSA signature) can be used here toauthenticate the hash value of root node.

� Processing. By the algorithm shown in the Fig. 9, thecloud server returns the root signature and the mini-mum hash sub-tree (MHST) to client. The minimumhash sub-tree includes the hash values of leaf nodesin the matched cluster and non-leaf node corre-sponding to all cluster centers used to find thematched cluster in the searching phase. For example,in the Fig. 10, the search result is document D, E andF . Then the leaf nodes are D, E, F and G, and non-leaf nodes includes C1, C2, C3, C4, dD, dE , dF , and dG.In addition, the root is included in the non-leaf node.

� Verifying. The data owner uses the minimum hashsub-tree to re-compute the hash values of nodes, inparticular the root node which can be further verifiedby the root signature. If all nodes are matched, then

Fig. 7. Retrieval process.

Fig. 8. Algorithm building-minimum hash sub-tree.

Fig. 9. Algorithm processing-minimum hash sub-tree.

Fig. 6. Clustering process.




the correctness and freshness is guaranteed. Then thedata owner re-searches the index constructed byretrieved values inMHST. If the search result is sameas the retrieved result, the completeness, correctnessand freshness all are guaranteed.

As shown in the Fig. 10, in the building phase, all docu-ments are clustered into two big clusters and four small clus-ters, and each big cluster contains two small clusters. Thehash value of leaf node A is hðidA k version k FðidAÞÞ , thevalue of the non-leaf nodeC3 is ðidC3

; hðidC3k hA k hB k hCÞÞ,

and the value of non-leaf node C1 is ðidC1; hðidC1

k hC3k

hC4ÞÞ. The other values of leaf nodes and non-leaf nodes are

generated similarly. In order to combine all first-level clustersinto a tree, a virtual root node is created by the data ownerwith a hash value hðhC1;2

k hC2;2Þwhere C1;2 and C2;2 denotes

the second part of cluster center 1 and 2 respectively. Then thedata owner signs the root node, e.g., sðhðhC1;2

k hC2;2ÞÞ ¼

ðhC1;2k hC2;2

; eðhðhC1;2k hC2;2

ÞÞk; gÞ, and outsources it to the

cloud server.In the processing phase, suppose that the cluster C4 is the

matched cluster and the returned top-three documents areD, E, and F . Then the minimum hash sub-tree includes thehash values of node D, E, F , dD, dE , dF , dG, C3, C2, C1, C4

and the signed root sðhðhC1;2k hC2;2

ÞÞ.In the verifying phase, upon receiving the signed root, the

data user first check eðhðhC1;2k hC2;2

Þ; gÞk ¼? eðsigkhðhC1;2k

hC2;2Þ; gÞ . If it is not true, the retrieved hash tree is not

authentic, otherwise the returned nodes,D, E, F , dD, dE , dF ,dG, C3, C2, C1, C4, works together to verify each other andreconstruct the hash tree. If all the nodes are authenticate,the returned hash tree are authenticate. Then the data userre-computes the hash value of the leaf nodes D, E and F byusing returned documents. These new generated hash val-ues are compared with the corresponding returned hashvalues. If there is no difference, the retrieved documents iscorrect. Finally, the data user uses the trapdoor to re-searchthe index constructed by the first part of retrieved nodes. Ifthe search result is same as the retrieved result, the searchresult is complete.

5.7 Dynamic Data Collection

As the documents stored at server may be deleted or modi-fied and new documents may be added to the original datacollection, a mechanismwhich supports dynamic data collec-tion is necessary. A naive way to address these problems isdownloading all documents and index locally and updatingthe data collection and index. However, this method needshuge cost in bandwidth and local storage space.

To avoid updating index frequently, we provide a practi-cal strategy to deal with insertion, deletion and modificationoperations. Without loss of generality, we use followingexamples to illustrate the workings of the strategy. The dataowner preserves many empty entries in the dictionary fornew documents. If a new document contains new key-words, the data owner first adds these new keywords to thedictionary and then constructs a document vector based onthe new dictionary. The data owner sends the trapdoor gen-erated by the document vector, encrypted document andencrypted document vector to the cloud sever. The cloudsever finds the closest cluster, and puts the encrypted docu-ment and encrypted document vector into it.

As every cluster has a constraint on the maximum size, itis possible that the number of documents in a clusterexceeds the limitation due to the insertion operation. In thiscase, all the encrypted document vectors belonging to thebroken cluster are returned to the data owner. After decryp-tion of the retrieved document vectors, the data owner re-builds the sub-index based on the deciphered documentvectors. The sub-index is re-encrypted and re-outsourced tothe cloud server.

Upon receiving a deletion order, the cloud serversearches the target document. Then the cloud server deletesthe document and the corresponding document vector.

Modifying a document can be described as deleting the oldversion of the document and inserting the new version. Theoperation of modifying documents, therefore, can be realizedby combining insertion operation and deletion operation.

To deal with this impact on the hash tree, a lazy updatestrategy is designed. For the insertion operation, the corre-sponding hash value will be calculated and marked as araw node, while the original nodes in the hash tree will bekept unchanged because the original hash tree still supportsdocument verification except the new document. Onlywhen the new added document is accessed, the hash treewill be updated. Similar concept is used in the deletionoperation. The only difference is that the deletion operationwill not bring the hash tree update.

6 EFFICIENCY AND SECURITY

6.1 Search Efficiency

The search process can be divided into Trapdoorðw; skÞphase and SearchðTw; I; ktopÞ phase. The number of opera-tion needed in Trapdoorðw; skÞ phase is illustrated as inEquation (5), where, n is the number of keywords in the dic-tionary, and w is the number of query keywords,

OðMRSE �HCIÞ ¼ 5nþ u� v� wþ 5: (5)

Due to the time complexity of Trapdoorðw; skÞ phaseindependent to DC, when DC increases exponentially,itcan be described as O(1).

The difference of the search process between the MRSE-HCI and the MRSE is the retrieval algorithm used in thisphase. In the SearchðTw; I; ktopÞ phase of the MRSE, thecloud server needs to compute the relevance score betweenthe encrypted query vector Tw and all encrypted documentvectors in Id , and get the top-k ranked document list Fw .The number of operations need in SearchðTW ; I; ktopÞ phaseis illustrated as in Equation (6), where m represents the

Fig. 10. Authentication for hierarchical clustering index.




number of documents in DC ,and n represents the numberof keywords in the dictionary,

OðMRSEÞ ¼ 2m � ð2nþ 2uþ 1Þ þm� 1: (6)

However, in the SearchðTW ; I; ktopÞ phase of MRSE-HCI, thecloud server uses the information DC to quickly locate thematched cluster and only compares Tw to a limited numberof encrypted document vectors in Id . The number of opera-tions needed in SearchðTW ; I; ktopÞ phase is illustrated inEquation (7), where ki represents the number of cluster cen-ters needed to be comparedwith in the ith level, and c repre-sents the number of document vectors in the matchedcluster,

OðMRSE �HCIÞ ¼Xli¼1

ki

!� 2 � ð2nþ 2uþ 1Þ

þ cð2 � ð2nþ 2uþ 1ÞÞ þ c� 1:

(7)

When DC increases exponentially, m can be set to 2l. The

time complexity of the traditional MRSE is Oð2lÞ , while thetime complexity of the proposedMRSE-HCI is only OðlÞ.

The total search time can be calculated as given inEquation (8) below, whereOðtrapdoorÞ isOð1Þ ,andOðqueryÞrelies on theDC,

OðsearchTimeÞ ¼ OðtrapdoorÞ þOðqueryÞ: (8)

In short, when the number of documents inDC has an expo-nential growth, the search time of MRSE-HCI increases line-arly while the traditional methods increase exponentially.

6.2 Security Analysis

To express the security analysis briefly, we adopt some con-cepts from [37], [38], [39] and define what kinds of informa-tion will be leaked to the curious-but-honest server.

The basic information of documents and queries areinevitably leaked to the honest-but-curious server since allthe data are stored at the server and the queries submittedto the server. Moreover, the access pattern and search pat-tern cannot be preserved in MRSE-HCI as well as previoussearchable encryption [18], [38], [39], [40].

Definition 1 (Size pattern). Let D be a document collection.The size pattern induced by a q-query is a tupleaðD;QÞ ¼ ðm; jQ1j; . . . ; jQqjÞ wherem is the number of docu-ments and jQij is the size of query Qi.

Definition 2 (Access pattern). Let D be a document collectionand I be an index over D. The access pattern induced by aq-query is a tuple bðD;QÞ ¼ ðIðQ1Þ; IðQqÞÞ, where IðQiÞ is aset of identifiers returned by query Qi, for 1 � i � q.

Definition 3 (Search pattern). Let D be a document collection.The search pattern induced by a q-query is a m� q binarymatrix cðD;QÞ such that for 1 � i � m and 1 � j � q the ele-ment in the ith row and jth column is 1, if an document iden-tifier idi is returned by a query Qj.

Definition 4 (known ciphertext model secure). LetP ¼ ðKeygen; Index; Enc; Trapdoor; Search;DecÞ be anindex-based MRSE-HCI scheme over dictionary Dw, n 2 N ,be the security parameter, the known ciphertext model secure

experiment PrivKkcmA;PðnÞ is described as follows.

1) The adversary submits two document collections D0

andD1 with the same length to a challenger.2) The challenger generates a secret key fsk; kg by run-

ningKeygenð1lðnÞÞ.3) The challenger randomly choose a bit b 2 f0; 1g, and

returns IndexðDb; skbÞ ! Ib and EncðDb; kbÞ ! Eb

to the adversary.4) The adversary outputs a bit b0

5) The output of the experiment is defined to be 1 if b0 ¼ b,and 0 otherwise.

We sayMRSE-HCI scheme is secure under known cipher-text model if for all probabilistic polynomial-time adversar-iesA there exists a negligible function neglðnÞ such that

PrðPrivkkcmA;P ¼ 1Þ � 1=2þ neglðnÞ: (9)

Proof. The adversary A distinguishes the document collec-tions depending on analyzing the secret key, indexand encrypted document collection. Then we haveEquation (10), where AdvðADðsk; kÞÞ is the advantage foradversary A to distinguish the secret key from tworandom matrixes and two random strings, AdvðADðIÞÞ isthe advantage to distinguish the index from a randomstring and AdvðADðEÞÞ is the advantage to distinguishthe encrypted documents from random strings.

PrðPrivKkcmA;PðnÞ ¼ 1Þ ¼ 1=2

þAdvðADðsk; kÞÞ þAdvðADðIÞÞ þAdvðADðEÞÞ (10)

tuThe elements of two matrixes in the secret key are

randomly chosen from f0; 1glðnÞ, and the split indicator Sand key k are also chosen uniformly at random from

f0; 1glðnÞ. Given f0; 1glðnÞ, A distinguishes the secret keyfrom two random matrixes and two random strings witha negligible probability. Then there exits a negligiblefunction negl1ðnÞ such that

AdvðADðsk; kÞÞ ¼ jPrðKeygenð1lðnÞÞ ! ðsk; kÞÞ� PrðRandom ! ðskr; krÞÞj � negl1ðnÞ;

(11)

where skr denotes two random matrixes and a randomstring, and kr is a random string. In our scheme, the encryp-tion of hierarchical index is essential to encrypt all the docu-ment vectors and cluster center vectors. All the clustercenter vectors are treated as document vectors in theencryption phase. Eventually, all the document vectors andcluster center vectors are encrypted by the secure KNN. Asthe secure KNN is known plaintext attack (KPA) secure[32], the hierarchical index is secure under the knownciphertext model. Then there exists a negligible functionnegl2ðnÞ satisfying that

AdvðADðIÞÞ ¼ jPrðIndexðD; skÞ ! ðIÞÞ� PrðRandom ! ðIrÞÞj � negl2ðnÞ;

(12)

where Ir is a random string.Since the encryption algorithm used to encrypt Db is

semantic secure, the encrypted documents are secure under




known ciphertext model. Then there exists a negligible func-tion negl3ðnÞ such that

AdvðADðEÞÞ ¼ jPrðEncðD; kÞ ! ðEÞÞ� PrðRandom ! ðErÞÞj � negl3ðnÞ:

(13)

Where Er is a random string set.According Equations (10), (11), (12) and (13), we can get

Equation (14),

PrðPrivkkcmA;P ¼ 1Þ � 1=2

þ negl1ðnÞ þ negl2ðnÞ þ negl3ðnÞ(14)

neglðnÞ ¼ negl1ðnÞ þ negl2ðnÞ þ negl3ðnÞ (15)

PrðPrivkkcmA;PÞ � 1=2þ neglðnÞ: (16)

By combining Equations (14) and (15), we can concludeEquation (16). Then, we say MRSE-HCI is secure underknow ciphertext model.

7 EVALUATION METHOD

7.1 Search Precision

The search precision can quantify the users satisfaction. TheRetrieval precision is related to two factors: the relevancebetween documents and the query, and the relevance ofdocuments between each other. Equation (17) defines therelevance between retrieved documents and the query,

Pq ¼Xk0i¼1

Sðqw; diÞ=Xki¼1

Sðqw; diÞ !

: (17)

Here, k0 denotes the number of files retrieved by the evalu-ated method, k denotes the number of files retrieved byplain text search, qw represents query vector, di representsdocument vector, and S is a function to compute the rele-vance score between qw and di. Equation (18) defines the rel-evance of different retrieved documents,

Pd ¼Xk0j¼1

Xk0i¼1

Sðdj; diÞ=Xkj¼1

Xki¼1

Sðdj; diÞ !

: (18)

Here, k0 denotes the number of files retrieved by the evalu-ated method, k denotes the number of files retrieved byplaintext search, and both di and dj denote document vector.

Equation (19) combines the relevance between query andretrieved documents and relevance of documents to quan-tify the search precision such that

Acc ¼ aPq þ Pd; (19)

where a functions as a tradeoff parameter to balance the rel-evance between query and documents and relevance ofdocuments. If a is smaller than 1, it puts more emphasis onthe relevance of documents otherwise query keywords.

The above evaluation strategies should be based on thesame dataset and keywords.

7.2 Rank Privacy

Rank privacy can quantify the information leakage of thesearch results. The definition of rank privacy is adoptedfrom [18]. Equation (20) is used to evaluate the rank privacy,

Pk ¼Xki¼1

Pi=k: (20)

Here, k denotes the number of top-k retrieved docu-ments, pi ¼ ci0 � cij j , ci0 is the ranking of document di in theretrieved top-k documents,ci is the actual ranking of docu-ment di in the data set, and Pi is set to k if greater than k .The overall rank privacy measure at point k, denoted as Pk,is defined as the average value of pi for every document diin the retrieved top-k documents.

8 PERFORMANCE ANALYSIS

In order to test the performance of MRSE-HCI on real data-set, we built an experimental platform to test the search effi-ciency, accuracy and rank privacy. We implemented thetarget experiment based on a distributed platform whichincludes three ThinkServer RD830 and a ThinkCenter M8400t.The data set is built from IEEE Xplore, including about51;000 documents, and 22;000 keywords.

According to the notations defined in Section 4, n denotesthe dictionary size, k denotes the number of top-k docu-ments, m denotes the number of documents in the data set,and w denotes the number of keywords in the users query.

Fig. 11 is used to describe search efficiency with differentconditions. Fig. 11a describes search efficiency using the dif-ferent size of document set with unchanged dictionary size,number of retrieved documents and number of query key-words, n ¼ 22;157; k ¼ 20; w ¼ 5. In Fig. 11b, we adjust thevalue of k with unchanged dictionary size, document setsize and number of query keywords, n ¼ 22;157;m ¼ 51;312; w ¼ 5. Fig. 11c tests the different number ofquery keywords with unchanged dictionary size, documentset size and number of retrieved documents, n ¼ 22;157;m ¼ 51;312; k ¼ 20.

From the Fig. 11a, we can observe that with the exponen-tial growth of document set size, the search time of MRSEincreases exponentially, while the search time ofMRSE �HCI increases linearly. As the Figs. 11b and 11cshows, the search time of MRSE �HCI keeps stable withthe increase of query keywords and retrieved documents.Meanwhile, the search time is far below that ofMRSE.

Fig. 11. Search efficiency.




Fig. 12 describes search accuracy by utilizing plaintextsearch as a standard. Fig. 12a illustrates the relevance ofretrieved documents. With the number of documentsincreases from 3;200 to 51;200, the ratio of MRSE-to-plaintextsearch fluctuates at 1, while MRSE-HCI-to-plaintext searchincreases from 1:5 to 2. From the Fig. 12a, we can observethat the relevance of retrieved documents in the MRSE-HCIis almost twice as many as that in the MRSE, which meansretrieved documents generated by MRSE-HCI are muchcloser to each other. Fig. 12b shows the relevance betweenquery and retrieved documents. With the size of documentset increases from 3;200 to 51;200, the MRSE-to-plaintextsearch ratio fluctuates at 0:75. MRSE-HCI-to-plaintext searchratio increases from 0:65 to 0:75 accompanying with thegrowth of document set size. From the Fig. 12b, we can seethat the relevance between query and retrieved documentsin MRSE-HCI is slightly lower than that in MRSE. Espe-cially, this gap narrows when the data size increases since abig document data set has a clear category distributionwhich improves the relevance between query and docu-ments. Fig. 12c shows the rank accuracy according toEquation (19). The tradeoff parameter a is set to 1, whichmeans there is no bias towards relevance of documents orrelevance between documents and query. From the result,we can conclude that MRSE-HCI is better than MRSE inrank accuracy.

Fig. 13 describes the rank privacy according toEquation (20). In this test, no matter the number of retrieveddocuments, MRSE �HCI has better rank privacy thanMRSE. This mainly caused by the relevance of documentsintroduced into search strategy.

9 CONCLUSION

In this paper, we investigated ciphertext search in the sce-nario of cloud storage. We explore the problem of maintain-ing the semantic relationship between different plaindocuments over the related encrypted documents and givethe design method to enhance the performance of thesemantic search. We also propose the MRSE-HCI architec-ture to adapt to the requirements of data explosion, onlineinformation retrieval and semantic search. At the sametime, a verifiable mechanism is also proposed to guaranteethe correctness and completeness of search results. In

addition, we analyze the search efficiency and securityunder two popular threat models. An experimental plat-form is built to evaluate the search efficiency, accuracy, andrank security. The experiment result proves that the pro-posed architecture not only properly solves the multi-key-word ranked search problem, but also brings animprovement in search efficiency, rank security, and the rel-evance between retrieved documents.

ACKNOWLEDGMENTS

This work was supported by Strategic Priority Re-search Program of Chinese Academy of Sciences(No. XDA06040601) and Xinjiang Uygur AutonomousRegion science and technology plan (No. 201230121). Anearly version of this paper is presented at the Workshop onSecurity and Privacy in Big Data at IEEE INFOCOM 2014[27]. Extensive enhancements have been made whichincludes incorporating a novel verification scheme to helpdata user verify the authenticity of the search results, andadding a security analysis as well more details of the pro-posed scheme. Thisworkwas supported by Strategic PriorityResearch Program of Chinese Academy of Sciences(No. XDA06010701) andNational High Technology ResearchandDevelopment Program of China(No. 2013AA01A24).

REFERENCES

[1] S. Grzonkowski, P. M. Corcoran, and T. Coughlin, “Security anal-ysis of authentication protocols for next-generation mobile andCE cloud services,” in Proc. IEEE Int. Conf. Consumer Electron.,2011, Berlin, Germany, 2011, pp. 83–87.

[2] D. X. D. Song, D. Wagner, and A. Perrig, “Practical techniques forsearches on encrypted data,” in Proc. IEEE Symp. Security Priv.,BERKELEY, CA, 2000, pp. 44–55.

[3] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, “Publickey encryption with keyword search,” in Proc. EUROCRYPT,Interlaken, SWITZERLAND, 2004, pp. 506–522.

[4] Y. C. Chang and M. Mitzenmacher, “Privacy preserving key-word searches on remote encrypted data,” in Proc. 3rd Int.Conf. Applied Cryptography Netw. Security, New York, NY, 2005,pp. 442–455.

[5] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, “Searchablesymmetric encryption: improved definitions and efficientconstructions,” in Proc. 13th ACM Conf. Comput. Commun. Security,Alexandria, Virginia, 2006, pp. 79–88.

[6] M. Bellare, A. Boldyreva, and A. O’Neill, “Deterministic and effi-ciently searchable encryption,” in Proc. 27th Annu. Int. Cryptol.Conf. Adv. Cryptol., Santa Barbara, CA, 2007, pp. 535–552.

[7] D. Boneh and B. Waters, “Conjunctive, subset, and range querieson encrypted data,” in Proc. 4th Conf. Theory Cryptography,Amsterdam, NETHERLANDS, 2007, pp. 535–554.

[8] E.-J. Goh, Secure Indexes, IACR Cryptology ePrint Archive,vol. 2003, pp. 216. 2003.

[9] C. Wang, N. Cao, K. Ren, andW. J. Lou, “Enabling secure and effi-cient ranked keyword search over outsourced cloud data,” IEEETrans. Parallel Distrib. Syst., vol. 23, no. 8, pp. 1467–1479, Aug.2012.

Fig. 12. Search precision.

Fig. 13. Rank privacy.




[10] A. Swaminathan, Y. Mao, G. M. Su, H. Gou, A. Varna, S. He,M. Wu, and D. Oard, “Confidentiality-preserving rank-orderedsearch,” in Proc. ACM ACMWorkshop Storage Security Survivability,Alexandria, VA, 2007, pp. 7–12.

[11] S. Zerr, D. Olmedilla, W. Nejdl, and W. Siberski, “Zerber+R: Top-k retrieval from a confidential index,” in Proc. 12th Int. Conf.Extending Database Technol.: Adv. Database Technol., SaintPetersburg, Russia, 2009, pp. 439–449.

[12] C. Wang, N. Cao, J. Li, K. Ren, and W. J. Lou, “Secure ranked key-word search over encrypted cloud data,” in Proc. IEEE 30th Int.Conf. Distrib. Comput. Syst., Genova, ITALY, 2010, pp. 253–262.

[13] P. Golle, J. Staddon, and B. Waters, “Secure conjunctive keywordsearch over encrypted data,” in Proc. Proc. 2nd Int. Conf. Appl.Cryptography Netw. Security, YellowMt, China, 2004, pp. 31–45.

[14] L. Ballard, S. Kamara, and F. Monrose, “Achieving efficient con-junctive keyword searches over encrypted data,” in Proc. 7th Int.Conf. Inform. Commun. Security, Beijing, China, 2005, pp. 414–426.

[15] R. Brinkman, “Searching in encrypted data” in University ofTwente, PhD thesis, 2007.

[16] Y. H. Hwang and P. J. Lee, “Public key encryption with conjunc-tive keyword search and its extension to a multi-user system,” inProc. 1st Int. Conf. Pairing-Based Cryptography, Tokyo, JAPAN,2007, pp. 2–22.

[17] H. Pang, J. Shen, and R. Krishnan, “Privacy-preserving similarity-based text retrieval,” ACM Trans. Internet Technol., vol. 10, no. 1,pp. 39, Feb. 2010.

[18] N. Cao, C.Wang,M. Li, K. Ren, andW. J. Lou, “Privacy-preservingmulti-keyword ranked search over encrypted cloud data,” in Proc.IEEE INFOCOM, Shanghai, China, 2011, pp. 829–837.

[19] W. Sun, B. Wang, N. Cao, M. Li, W. Lou, Y. T. Hou, and H. Li,“Privacy-preserving multi-keyword text search in the cloud sup-porting similarity-based ranking,” in Proc. 8th ACM SIGSACSymp. Inform., Comput. Commun. Security, Hangzhou, China, 2013,pp. 71–82.

[20] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin, “Dynamicauthenticated index structures for outsourced databases,” in Proc.ACM SIGMOD, Chicago, IL, 2006, pp. 121–132.

[21] H. H. Pang and K. L. Tan, “Authenticating query results in edgecomputing,” in Proc. 20th Int. Conf. Data Eng., Boston, MA , 2004,pp. 560–571.

[22] C. Martel, G. Nuckolls, P. Devanbu, M. Gertz, A. Kwong, and S. G.Stubblebine, “A general model for authenticated data structures,”Algorithmica, vol. 39, no. 1, pp. 21–41, May 2004.

[23] C. M. Ralph, “Protocols for public key cryptosystems,” in Proc.IEEE Symp. Security Priv, Oakland, CA, 1980, pp. 122–122.

[24] R. C. Merkle, “A certified digital signature,” in Proc. Adv. cryptol. ,1990, vol. 435, pp. 218–238.

[25] M. Naor and K. Nissim, “Certificate revocation and certificateupdate,” IEEE J. Sel. Areas Commun., vol. 18, no. 4, pp. 561–570,Apr. 2000.

[26] H. Pang and K. Mouratidis, “Authenticating the query resultsof text search engines,” in Proc. VLDB Endow., vol. 1, no. 1,pp. 126–137, Aug. 2008.

[27] C. Chen, X. J. Zhu, P. S. Shen, and J. K. Hu, “A hierarchical cluster-ing method For big data oriented ciphertext search,” in Proc. IEEEINFOCOM, Workshop on Security and Privacy in Big Data, Toronto,Canada, 2014, pp. 559–564.

[28] S. C. Yu, C. Wang, K. Ren, and W. J. Lou, “Achieving secure, scal-able, and fine-grained data access control in cloud computing,” inProc. IEEE INFOCOM, San Diego, CA, 2010, pp. 1–9.

[29] I. H. Witten, A. Moffat, and T. C. Bell, Managing Gigabytes:Compressing and Indexing Documents and Images, 2nd ed. SanFrancisco, CA, USA : Morgan Kaufmann, 1999.

[30] J. MacQueen, “Some methods for classification and analysis ofmultivariate observations,” in Proc. Berkeley Symp. Math. Stat.Prob., California, 1967, p. 14.

[31] Z. X. Huang, “Extensions to the k-means algorithm for clusteringlarge data sets with categorical values,” Data Min. Knowl. Discov.,vol. 2, no. 3, pp. 283–304, Sep. 1998.

[32] W. K. Wong, D. W. Cheung, B. Kao, and N. Mamoulis, “SecurekNN computation on encrypted databases,” in Proc. ACM SIG-MOD Int. Conf. Manage. Data, Providence, RI, 2009, pp. 139–152.

[33] R. X. Li, Z. Y. Xu, W. S. Kang, K. C. Yow, and C. Z. Xu, “EfficientMulti-keyword ranked query over encrypted data in cloud com-puting, Futur. Gener. Comp. Syst., vol. 30, pp. 179–190, Jan. 2014.

[34] G. Craig, “Fully homomorphic encryption using ideal lattices,” inProc. 41st Annu. ACMSymp. Theory Comput., 2009, vol. 9, pp. 169–178

[35] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, “Publickey encryption with keyword search[C],” in Proc. Adv. Cryptol.,Berlin, Heidelberg, 2004, pp. 506–522.

[36] D. Cash, S. Jarecki, C. Jutla, H. Krawczyk, M. Rosu, and M.Steiner, “Highly-scalable searchable symmetric encryption withsupport for Boolean que-ries,” in Proc. Adv. Cryptol,. Berlin, Hei-delberg, 2013, pp. 353–373.

[37] S. Kamara, C. Papamanthou, and T. Roeder, “Dynamic searchablesymmetric encryption,” in Proc. Conf. Comput. Commun. Secur.,2012, pp. 965–976.

[38] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, “Searchablesymmetric encryption: Improved definitions and efficientconstruc-tions,” in Proc. 13th ACM Conf. Comput. Commun. Secur.,2006, pp. 79–88.

[39] M. Chase and S. Kamara, “Structured encryption and controlleddisclosure,” in Proc. Adv. Cryptol., 2010, pp. 577–594.

[40] D. Cash, J. Jaeger, S. Jarecki, C. Jutla, H. Krawczyk, M. C. Rosu,and M. Steiner, “Dynamic searchable encryption in very largedatabases: Data structures and implementation,” in Proc. Netw.Distrib. Syst. Security Symp., vol. 14, 2014, Doi: http://dx.doi.org/10.14722/ndss.2014.23264.

[41] S. Jarecki, C. Jutla, H. Krawczyk, M. Rosu, and M. Steiner,“Outsourced symmetric private information retrieval,” in Proc.ACMSIGSACConf. Comput. Commun. Secur., Nov. 2013, pp. 875–888.

Chi Chen received the BS and MS degrees fromShandong University, Jinan, China, in 2000 and2003, respectively, and the PHD degree fromthe Institute of Software Chinese Academy of Sci-ences, Beijing, China in 2008. He is an associateresearch fellow of the Institute of InformationEngineering, Chinese Academy of Sciences. Hisresearch interest includes the cloud security anddatabase security. From 2003 to 2011, he was aresearch apprentice, research assistant, andassociate research fellow with the State Key Lab-

oratory of Information Security, institute of software Chinese Academy ofSciences. Since 2012, he is an associate research fellow with the StateKey Laboratory of Information Security, institute of information engineer-ing, Chinese Academy of Sciences, Beijing, China. He is a member ofthe IEEE.

Xiaojie Zhu received the BS degree in theZhejiang University of Technology, HangZhou,China, in 2011. He is currently working towardsthe MS degree in the Institute of InformationEngineering, Chinese Academy of Sciences.His research interest includes the informationretrieval, secure cloud storage, and data security.He is a student member of the IEEE.

Peisong Shen received the BS degree in theUniversity of Science and Technology of China,HeFei, China, in 2012. He is currently workingtowards thePhDdegree in the Institute of Informa-tion Engineering, Chinese Academy of Sciences.His research interest includes the informationretrieval, secure cloud storage, and data security.He is a student member of the IEEE.

Jiankun Hu is a professor and research directorof the Cyber Security Lab, School of Engineeringand IT, The University of New South Wales, Can-berra, Australia. His main research interest isCyber security with focus on bio-cryptography,and anomaly intrusion detection. He has obtainedseven ARC (Australian Research Council) Grantsand has served at the prestigious Panel of mathe-matics, information and computing sciences,ARC ERA Evaluation Committee. He is a mem-ber of the IEEE.




Song Guo received the PhD degree in computerscience from University of Ottawa, Canada. He iscurrently a Full Professor at School of ComputerScience and Engineering, the University of Aizu,Japan. His research interests are mainly in theareas of wireless communication and mobilecomputing, cyber-physical systems, data centernetworks, cloud computing and networking, bigdata, and green computing. He has publishedover 250 papers in referred journals and confer-ences in these areas and received three IEEE/

ACM best paper awards. Dr. Guo currently serves as Secretary of IEEEComSoc Technical Committee on Satellite and Space Communication(TCSSC) and Technical Subcommittee on Big Data (TSCBD), AssociateEditor of IEEE Transactions on Parallel and Distributed Systems(TPDS), IEEE Transactions on Emerging Topics (TETC) for the Compu-tational Networks Track, and on editorial boards of many others. He hasalso been in organizing and technical committees of numerous interna-tional conferences and workshops. Dr. Guo is a senior member of theIEEE and the ACM.

Zahir Tari received the degree in mathematicsfrom the University of Science and TechnologyHouari Boumediene, Bab-Ezzouar, Algeria, in1984, the masters degree in operational researchfrom the University of Grenoble, Grenoble,France, in 1985, and the PhD degree in computerscience from the University of Grenoble, in 1989.He is a professor, in distributed systems, at RMITUniversity, Melbourne, Australia. Later, he joinedthe Database Laboratory at EPFL (Swiss FederalInstitute of Technology, 1990-1992) and then

moved to QUT (Queensland University of Technology, 1993-1995) andRMIT (Royal Melbourne Institute of Technology, since 1996). He is thehead of the DSN (Distributed Systems and Networking) at the School ofComputer Scienceand IT, where he pursues high-impact research anddevelopment in computer science. He leads a few research groups thatfocus on some of the core areas, including networking (QoS routing,TCP/IP congestion), distributed systems (performance, security, mobility,reliability), and distributed applications (SCADA, Web/Internet applica-tions, mobile applications).His recent research interests are in perfor-mance (in Cloud) and security (in SCADA systems). He regularlypublishes in prestigious journals (like IEEE Transactions on Parallel andDistributed Systems, IEEE Trans-actions on Web Services, ACM Trans-actions on Databases) and conferences (ICDCS, WWW, ICSOC, etc.).He co-authored two books (John Wiley) and edited more than 10 books.He has been the program committee chair of several internationalconferences, including the DOA (Distributed Object and Appli-cationSymposium), IFIP DS 11.3 on Database Security, and IFIP 2.6 on DataSemantics. He has also been the general chair of more than 12conferences. He is the recipient of 14 ARC (Australian ResearchCouncil)grants. He is a senior member of the IEEE.

Albert Y. Zomaya is currently the chairprofessor of high-performance computing andnetworking and Australian Research Councilprofessorial fellow in the School of InformationTechnologies, The University of Sydney,Sydney, Australia. He is also the director of theCentre for distributed and high-performancecomputing which was established in late 2009.He is the author/co-author of seven books,more than 370 papers, and the editor of ninebooks and 11 conference proceedings. He is

the editor in chief of the IEEE Transactions on Computers and servesas an associate editor for 19 leading journals. He is the recipient ofthe Meritorious Service Award in 2000 and the Golden Core Recogni-tion in 2006, both from the IEEE Computer Society. He is a charteredengineer (CEng), a fellow of the AAAS, the IEEE, the IET (UK), and adistinguished engineer of the ACM.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.




an efficeient privacy preserving ranked keyword search

Education