

A Secure and Dynamic Multi-Keyword Ranked Search Scheme over Encrypted Cloud Data

Zhihua Xia, Member, IEEE, Xinhui Wang, Xingming Sun, Senior Member, IEEE, and Qian Wang, Member, IEEE

Abstract: Due to the increasing popularity of cloud computing, more and more data owners are motivated to outsource their data to cloud servers for greater convenience and reduced cost in data management. However, sensitive data should be encrypted before outsourcing to meet privacy requirements, which rules out conventional data utilization such as keyword-based document retrieval. In this paper, we present a secure multi-keyword ranked search scheme over encrypted cloud data, which simultaneously supports dynamic update operations such as deletion and insertion of documents. Specifically, the vector space model and the widely used TF×IDF model are combined in the index construction and query generation. We construct a special tree-based index structure and propose a “Greedy Depth-first Search” algorithm to provide efficient multi-keyword ranked search. The secure kNN algorithm is used to encrypt the index and query vectors while still ensuring accurate relevance score calculation between the encrypted index and query vectors. To resist statistical attacks, phantom terms are added to the index vector to blind the search results. Owing to the special tree-based index structure, the proposed scheme achieves sub-linear search time and handles the deletion and insertion of documents flexibly. Extensive experiments are conducted to demonstrate the efficiency of the proposed scheme.

Index Terms: Searchable encryption, multi-keyword ranked search, dynamic update, cloud computing

1 INTRODUCTION

Cloud computing has been considered as a new model of enterprise IT infrastructure, which can organize huge resources of computing, storage and applications, and enable users to enjoy ubiquitous, convenient and on-demand network access to a shared pool of configurable computing resources with great efficiency and minimal economic overhead [1]. Attracted by these appealing features, both individuals and enterprises are motivated to outsource their data to the cloud, instead of purchasing software and hardware to manage the data themselves.

Despite the various advantages of cloud services, outsourcing sensitive information (such as e-mails, personal health records, company finance data, government documents, etc.) to remote servers raises privacy concerns. The cloud service providers (CSPs) that keep the data for users may access users’ sensitive information without authorization. A general approach to protecting data confidentiality is to encrypt the data before outsourcing [2]. However, this incurs a huge cost in terms of data usability. For example, the existing techniques for keyword-based information retrieval, which are widely used on plaintext data, cannot be directly applied to encrypted data. Downloading all the data from the cloud and decrypting it locally is obviously impractical.

In order to address the above problem, researchers have designed some general-purpose solutions with fully-homomorphic encryption [3] or oblivious RAMs [4]. However, these methods are not practical due to their high computational overhead for both the cloud server and the user. In contrast, more practical special-purpose solutions, such as searchable encryption (SE) schemes, have made specific contributions in terms of efficiency, functionality and security. Searchable encryption schemes enable the client to store the encrypted data in the cloud and execute keyword search over the ciphertext domain. So far, abundant works have been proposed under different threat models to achieve various search functionality, such as single keyword search, similarity search, multi-keyword boolean search, ranked search, multi-keyword ranked search, etc. Among them, multi-keyword ranked search has attracted increasing attention for its practical applicability. Recently, some dynamic schemes have been proposed to support insertion and deletion operations on the document collection. These are significant works, as it is highly likely that data owners need to update their data on the cloud server. However, few of the dynamic schemes support efficient multi-keyword ranked search.

• Z. Xia, X. Wang, and X. Sun are with the Jiangsu Engineering Center of Network Monitoring, Jiangsu Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, and School of Computer and Software, Nanjing University of Information Science & Technology, Nanjing, China. E-mail: {xia_zhihua, wxh_nuist, sunnudt}@163.com.

• Q. Wang is with the Key Lab of Aerospace Information Security and Trusted Computing, School of Computer, Wuhan University, Wuhan, China. E-mail: [email protected].

Manuscript received 29 Aug. 2014; revised 3 Feb. 2015; accepted 3 Feb. 2015. Date of publication 10 Feb. 2015; date of current version 20 Jan. 2016. Recommended for acceptance by J. Chen. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPDS.2015.2401003

340 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 2, FEBRUARY 2016

1045-9219 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This paper proposes a secure tree-based search scheme over encrypted cloud data, which supports multi-keyword ranked search and dynamic operations on the document collection. Specifically, the vector space model and the widely used “term frequency (TF) × inverse document frequency (IDF)” model are combined in the index construction and query generation to provide multi-keyword ranked search. In order to obtain high search efficiency, we construct a tree-based index structure and propose a “Greedy Depth-first Search (GDFS)” algorithm based on this index tree. Owing to the special structure of our tree-based index, the proposed search scheme can achieve sub-linear search time and flexibly handle the deletion and insertion of documents. The secure kNN algorithm is used to encrypt the index and query vectors while still ensuring accurate relevance score calculation between the encrypted index and query vectors. To resist different attacks in different threat models, we construct two secure search schemes: the basic dynamic multi-keyword ranked search (BDMRS) scheme in the known ciphertext model, and the enhanced dynamic multi-keyword ranked search (EDMRS) scheme in the known background model. Our contributions are summarized as follows:

1) We design a searchable encryption scheme that supports both accurate multi-keyword ranked search and flexible dynamic operations on the document collection.

2) Owing to the special structure of our tree-based index, the search complexity of the proposed scheme is fundamentally kept logarithmic. In practice, the proposed scheme can achieve higher search efficiency by executing our “Greedy Depth-first Search” algorithm. Moreover, parallel search can be flexibly performed to further reduce the time cost of the search process.

The remainder of this paper is organized as follows. Related work is discussed in Section 2, and Section 3 gives a brief introduction to the system model, the threat model, the design goals, and the preliminaries. Section 4 describes the schemes in detail. Section 5 presents the experiments and performance analysis, and Section 6 concludes the paper.

2 RELATED WORK

Searchable encryption schemes enable clients to store encrypted data in the cloud and execute keyword search over the ciphertext domain. Depending on the cryptographic primitives used, searchable encryption schemes can be constructed using public key based cryptography [5], [6] or symmetric key based cryptography [7], [8], [9], [10].

Song et al. [7] proposed the first symmetric searchable encryption (SSE) scheme, and the search time of their scheme is linear in the size of the data collection. Goh [8] proposed formal security definitions for SSE and designed a scheme based on Bloom filters. The search time of Goh’s scheme is O(n), where n is the cardinality of the document collection. Curtmola et al. [10] proposed two schemes (SSE-1 and SSE-2) which achieve the optimal search time. Their SSE-1 scheme is secure against chosen-keyword attacks (CKA1) and SSE-2 is secure against adaptive chosen-keyword attacks (CKA2). These early works are single keyword boolean search schemes, which are very simple in terms of functionality. Afterward, abundant works have been proposed under different threat models to achieve various search functionality, such as single keyword search, similarity search [11], [12], [13], [14], multi-keyword boolean search [15], [16], [17], [18], [19], [20], [21], [22], ranked search [23], [24], [25], and multi-keyword ranked search [26], [27], [28], [29], etc.

Multi-keyword boolean search allows users to input multiple query keywords to request suitable documents. Among these works, conjunctive keyword search schemes [15], [16], [17] only return the documents that contain all of the query keywords. Disjunctive keyword search schemes [18], [19] return all of the documents that contain a subset of the query keywords. Predicate search schemes [20], [21], [22] are proposed to support both conjunctive and disjunctive search. All these multi-keyword search schemes retrieve search results based on the existence of keywords, and therefore cannot provide acceptable result ranking functionality.

Ranked search enables quick retrieval of the most relevant data. Sending back only the top-k most relevant documents can effectively decrease network traffic. Some early works [23], [24], [25] have realized ranked search using order-preserving techniques, but they are designed only for single keyword search. Cao et al. [26] realized the first privacy-preserving multi-keyword ranked search scheme, in which documents and queries are represented as vectors of dictionary size. With “coordinate matching”, the documents are ranked according to the number of matched query keywords. However, the scheme of Cao et al. does not consider the importance of the different keywords, and thus is not accurate enough. In addition, the search time of the scheme is linear in the cardinality of the document collection. Sun et al. [27] presented a secure multi-keyword search scheme that supports similarity-based ranking. The authors constructed a searchable index tree based on the vector space model and adopted the cosine measure together with TF×IDF to provide ranking results. The search algorithm of Sun et al. achieves better-than-linear search efficiency but results in precision loss. Örencik et al. [28] proposed a secure multi-keyword search method which utilizes locality-sensitive hashing (LSH) functions to cluster similar documents. The LSH algorithm is suitable for similarity search but cannot provide exact ranking. In [29], Zhang et al. proposed a scheme to deal with secure multi-keyword ranked search in a multi-owner model. In this scheme, different data owners use different secret keys to encrypt their documents and keywords, while authorized data users can query without knowing the keys of these different data owners. The authors proposed an “Additive Order Preserving Function” to retrieve the most relevant search results. However, these works do not support dynamic operations.

In practice, the data owner may need to update the document collection after uploading it to the cloud server. Thus, SE schemes are expected to support the insertion and deletion of documents. There are also several dynamic searchable encryption schemes. In the work of Song et al. [7], each document is considered as a sequence of fixed-length words and is individually indexed. This scheme supports straightforward update operations but with low efficiency. Goh [8] proposed a scheme to generate a sub-index (Bloom filter) for every document based on keywords. Dynamic operations can then be easily realized by updating a Bloom filter along with the corresponding document. However, Goh’s scheme has linear search time and suffers from false positives. In 2012, Kamara et al. [30] constructed an encrypted inverted index that can handle dynamic data efficiently. However, this scheme is very complex to implement. Subsequently, as an improvement, Kamara and Papamanthou [31] proposed a new search scheme based on a tree-based index, which can handle dynamic updates on document data stored in leaf nodes. However, their scheme is designed only for single-keyword Boolean search. In [32], Cash et al. presented a data structure for keyword/identity tuples named “T-Set”. A document can then be represented by a series of independent T-Sets. Based on this structure, Cash et al. [33] proposed a dynamic searchable encryption scheme. In their construction, newly added tuples are stored in another database in the cloud, and deleted tuples are recorded in a revocation list. The final search result is obtained by excluding the tuples in the revocation list from the ones retrieved from the original and newly added tuples. Yet, the dynamic search scheme of Cash et al. does not realize the multi-keyword ranked search functionality.

3 PROBLEM FORMULATION

3.1 Notations and Preliminaries

• $W$: The dictionary, namely, the set of keywords, denoted as $W = \{w_1, w_2, \ldots, w_m\}$.

• $m$: The total number of keywords in $W$.

• $W_q$: The subset of $W$ representing the keywords in the query.

• $F$: The plaintext document collection, denoted as a collection of $n$ documents $F = \{f_1, f_2, \ldots, f_n\}$. Each document $f$ in the collection can be considered as a sequence of keywords.

• $n$: The total number of documents in $F$.

• $C$: The encrypted document collection stored in the cloud server, denoted as $C = \{c_1, c_2, \ldots, c_n\}$.

• $T$: The unencrypted form of the index tree for the whole document collection $F$.

• $I$: The searchable encrypted tree index generated from $T$.

• $Q$: The query vector for keyword set $W_q$.

• $TD$: The encrypted form of $Q$, which is named the trapdoor for the search request.

• $D_u$: The index vector stored in tree node $u$, whose dimension equals the cardinality of the dictionary $W$. Note that the node $u$ can be either a leaf node or an internal node of the tree.

• $I_u$: The encrypted form of $D_u$.

Vector space model and relevance score function. The vector space model along with the TF×IDF rule is widely used in plaintext information retrieval, where it efficiently supports ranked multi-keyword search [34]. Here, the term frequency is the number of times a given term (keyword) appears within a document, and the inverse document frequency is obtained by dividing the cardinality of the document collection by the number of documents containing the keyword. In the vector space model, each document is denoted by a vector whose elements are the normalized TF values of the keywords in this document. Each query is also denoted as a vector $Q$, whose elements are the normalized IDF values of the query keywords in the document collection. Naturally, the lengths of both the TF vector and the IDF vector are equal to the total number of keywords, and the dot product of the TF vector $D_u$ and the IDF vector $Q$ can be calculated to quantify the relevance between the query and the corresponding document. The following notation is used in our relevance evaluation function:

• $N_{f,w_i}$: The number of occurrences of keyword $w_i$ in document $f$.

• $N$: The total number of documents.

• $N_{w_i}$: The number of documents that contain keyword $w_i$.

• $TF'_{f,w_i}$: The TF value of $w_i$ in document $f$.

• $IDF'_{w_i}$: The IDF value of $w_i$ in the document collection.

• $TF_{u,w_i}$: The normalized TF value of keyword $w_i$ stored in index vector $D_u$.

• $IDF_{w_i}$: The normalized IDF value of keyword $w_i$ in the document collection.

The relevance evaluation function is defined as:

$$RScore(D_u, Q) = D_u \cdot Q = \sum_{w_i \in W_q} TF_{u,w_i} \times IDF_{w_i}. \tag{1}$$

If $u$ is an internal node of the tree, $TF_{u,w_i}$ is calculated from the index vectors in the child nodes of $u$. If $u$ is a leaf node, $TF_{u,w_i}$ is calculated as:

$$TF_{u,w_i} = \frac{TF'_{f,w_i}}{\sqrt{\sum_{w_i \in W} \left(TF'_{f,w_i}\right)^2}}, \tag{2}$$

where $TF'_{f,w_i} = 1 + \ln N_{f,w_i}$. In the search vector $Q$, $IDF_{w_i}$ is calculated as:

$$IDF_{w_i} = \frac{IDF'_{w_i}}{\sqrt{\sum_{w_i \in W_q} \left(IDF'_{w_i}\right)^2}}, \tag{3}$$

where $IDF'_{w_i} = \ln\left(1 + N/N_{w_i}\right)$.
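Formulas (1) to (3) can be illustrated with a minimal Python sketch. The function names, the toy dictionary and the three toy documents are hypothetical and chosen only for illustration; the sketch also assumes that a keyword absent from a document contributes a raw TF value of 0, which the formulas above do not state explicitly.

```python
import math
from collections import Counter

def tf_vector(doc_words, dictionary):
    """Normalized TF vector of one document, following Formula (2)."""
    counts = Counter(doc_words)
    # Raw value TF' = 1 + ln(N_{f,w}); absent keywords get 0 (assumption).
    raw = [1.0 + math.log(counts[w]) if counts[w] > 0 else 0.0 for w in dictionary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm > 0 else 0.0 for x in raw]

def idf_query_vector(query_words, docs, dictionary):
    """Normalized IDF query vector, following Formula (3)."""
    n_docs = len(docs)
    raw = []
    for w in dictionary:
        n_w = sum(1 for d in docs if w in d)  # documents containing w
        if w in query_words and n_w > 0:
            raw.append(math.log(1.0 + n_docs / n_w))  # IDF' = ln(1 + N / N_w)
        else:
            raw.append(0.0)  # keywords outside the query do not contribute
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm > 0 else 0.0 for x in raw]

def rscore(d_u, q):
    """Relevance score of Formula (1): dot product of index and query vectors."""
    return sum(a * b for a, b in zip(d_u, q))

# Hypothetical toy data
dictionary = ["cloud", "index", "search", "tree"]
docs = [
    ["cloud", "search", "search"],
    ["index", "tree", "tree", "tree"],
    ["cloud", "index", "search"],
]
query = {"search", "tree"}

q = idf_query_vector(query, docs, dictionary)
scores = [rscore(tf_vector(d, dictionary), q) for d in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # document indices, most relevant first
```

In this toy collection the second document ranks first: it contains the rarer query keyword "tree" three times, so both its TF and the keyword's IDF are high.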

Keyword balanced binary tree. The balanced binary tree is widely used to deal with optimization problems [35], [36]. The keyword balanced binary (KBB) tree in our scheme is a dynamic data structure in which each node stores a vector $D$. The elements of vector $D$ are the normalized TF values. For simplicity, we sometimes write the vector $D$ stored in node $u$ as $D_u$. Formally, the node $u$ in our KBB tree is defined as follows:

$$u = \langle ID, D, P_l, P_r, FID \rangle, \tag{4}$$

where $ID$ denotes the identity of node $u$, and $P_l$ and $P_r$ are the pointers to the left and right child of node $u$, respectively. If the node $u$ is a leaf node of the tree, $FID$ stores the identity of a document, and $D$ denotes a vector consisting of the normalized TF values of the keywords in the document. If the node $u$ is an internal node, $FID$ is set to null, and $D$ denotes a vector consisting of TF values calculated as follows:

$$D[i] = \max\{u.P_l \rightarrow D[i],\; u.P_r \rightarrow D[i]\}, \quad i = 1, \ldots, m. \tag{5}$$

The detailed construction process of the tree-based index, denoted as BuildIndexTree($F$), is illustrated in Section 4.
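As an illustration of the node structure in Formula (4) and the element-wise maximum in Formula (5), a minimal Python sketch could look as follows. The class name `KBBNode`, the helper `make_parent` and the example vectors are hypothetical; only the field names are taken from Formula (4).

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class KBBNode:
    ID: int                          # identity of the node
    D: List[float]                   # index vector of normalized TF values
    Pl: Optional["KBBNode"] = None   # pointer to the left child
    Pr: Optional["KBBNode"] = None   # pointer to the right child
    FID: Optional[int] = None        # document identity; None for internal nodes

def make_parent(node_id: int, left: KBBNode, right: KBBNode) -> KBBNode:
    """Build an internal node whose vector follows Formula (5)."""
    d = [max(a, b) for a, b in zip(left.D, right.D)]
    return KBBNode(ID=node_id, D=d, Pl=left, Pr=right, FID=None)

# Two hypothetical leaves over a dictionary of m = 4 keywords
leaf1 = KBBNode(ID=1, D=[0.6, 0.0, 0.8, 0.0], FID=101)
leaf2 = KBBNode(ID=2, D=[0.0, 0.7, 0.3, 0.65], FID=102)
parent = make_parent(3, leaf1, leaf2)
print(parent.D)  # [0.6, 0.7, 0.8, 0.65]
```

Each entry of the parent vector is the larger of the two child entries, so a non-zero entry in an internal node signals that some document below it contains the corresponding keyword.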


3.2 The System and Threat Models

The system model in this paper involves three different entities: data owner, data user and cloud server, as illustrated in Fig. 1.

Data owner has a collection of documents $F = \{f_1, f_2, \ldots, f_n\}$ to be outsourced to the cloud server in encrypted form while retaining the capability to search them for effective utilization. In our scheme, the data owner first builds a secure searchable tree index $I$ from the document collection $F$, and then generates an encrypted document collection $C$ for $F$. Afterwards, the data owner outsources the encrypted collection $C$ and the secure index $I$ to the cloud server, and securely distributes the key information for trapdoor generation (including keyword IDF values) and document decryption to the authorized data users.

In addition, the data owner is responsible for the update operations on the documents stored in the cloud server. When updating, the data owner generates the update information locally and sends it to the server.

Data users are those authorized to access the documents of the data owner. With $t$ query keywords, the authorized user can generate a trapdoor $TD$ according to the search control mechanisms to fetch $k$ encrypted documents from the cloud server. Then, the data user can decrypt the documents with the shared secret key.

Cloud server stores the encrypted document collection $C$ and the encrypted searchable tree index $I$ for the data owner. Upon receiving the trapdoor $TD$ from the data user, the cloud server executes a search over the index tree $I$, and finally returns the corresponding collection of top-$k$ ranked encrypted documents. In addition, upon receiving the update information from the data owner, the server needs to update the index $I$ and the document collection $C$ according to the received information.

The cloud server in the proposed scheme is considered “honest-but-curious”, an assumption adopted by many works on secure cloud data search [25], [26], [27]. Specifically, the cloud server honestly and correctly executes instructions in the designated protocol. Meanwhile, it is curious to infer and analyze the received data, which helps it acquire additional information. Depending on what information the cloud server knows, we adopt the two threat models proposed by Cao et al. [26].

Known ciphertext model. In this model, the cloud server only knows the encrypted document collection $C$, the searchable index tree $I$, and the search trapdoor $TD$ submitted by the authorized user. That is to say, the cloud server can conduct a ciphertext-only attack (COA) [37] in this model.

Known background model. Compared with the known ciphertext model, the cloud server in this stronger model is equipped with more knowledge, such as the term frequency statistics of the document collection. This statistical information records how many documents there are for each term frequency of a specific keyword in the whole document collection, as shown in Fig. 2, and could be used as the keyword identity. Equipped with such statistical information, the cloud server can conduct a TF statistical attack to deduce or even identify certain keywords by analyzing the histogram and value range of the corresponding frequency distributions [24], [25], [27].
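The kind of statistic assumed in the known background model can be sketched in a few lines of Python. The function `tf_distribution` and the toy documents are hypothetical, and raw occurrence counts are used here purely for illustration; whether the distributions in Fig. 2 are built from raw or normalized TF values is not specified in this section.

```python
from collections import Counter

def tf_distribution(docs, keyword):
    """Map each term frequency of `keyword` to the number of documents having it."""
    per_doc = (doc.count(keyword) for doc in docs)
    return Counter(tf for tf in per_doc if tf > 0)

# Hypothetical toy collection
docs = [
    ["host", "subnet", "host"],
    ["host"],
    ["subnet", "subnet", "subnet", "host", "host"],
    ["router"],
]

host_hist = tf_distribution(docs, "host")
subnet_hist = tf_distribution(docs, "subnet")
print(dict(host_hist))    # {2: 2, 1: 1}
print(dict(subnet_hist))  # {1: 1, 3: 1}
```

Because the two histograms differ, a server that knows them in advance could try to match an encrypted index dimension to a keyword by its frequency distribution, which is the intuition behind the TF statistical attack.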

3.3 Design Goals

To enable secure, efficient, accurate and dynamic multi-keyword ranked search over outsourced encrypted cloud data under the above models, our system has the following design goals.

Dynamic. The proposed scheme is designed to provide not only multi-keyword query and accurate result ranking, but also dynamic update on document collections.

Search efficiency. The scheme aims to achieve sub-linear search efficiency by exploiting a special tree-based index and an efficient search algorithm.

Privacy-preserving. The scheme is designed to prevent the cloud server from learning additional information about the document collection, the index tree, and the query. The specific privacy requirements are summarized as follows:

1) Index confidentiality and query confidentiality. The underlying plaintext information, including the keywords in the index and query, the TF values of keywords stored in the index, and the IDF values of query keywords, should be protected from the cloud server;

2) Trapdoor unlinkability. The cloud server should not be able to determine whether two encrypted queries (trapdoors) are generated from the same search request;

3) Keyword privacy. The cloud server should not be able to identify the specific keyword in the query, index or document collection by analyzing statistical information such as term frequency. Note that our proposed scheme is not designed to protect the access pattern, i.e., the sequence of returned documents.

Fig. 1. The architecture of ranked search over encrypted cloud data.

Fig. 2. Distribution of term frequency (TF) for (a) keyword “subnet”, and (b) keyword “host”.


4 THE PROPOSED SCHEMES

In this section, we first describe the unencrypted dynamic multi-keyword ranked search (UDMRS) scheme, which is constructed on the basis of the vector space model and the KBB tree. Based on the UDMRS scheme, two secure search schemes (the BDMRS and EDMRS schemes) are constructed against the two threat models, respectively.

4.1 Index Construction of UDMRS Scheme

In Section 3, we briefly introduced the KBB index tree structure, which assists us in introducing the index construction. In the process of index construction, we first generate a tree node for each document in the collection. These nodes are the leaf nodes of the index tree. Then, the internal tree nodes are generated based on these leaf nodes. The formal construction process of the index is presented in Algorithm 1. An example of our index tree is shown in Fig. 3. Note that the index tree $T$ built here is in plaintext.

The following notation is used in Algorithm 1. In addition, the data structure of a tree node is defined as $\langle ID, D, P_l, P_r, FID \rangle$, where the unique identity $ID$ of each tree node is generated through the function GenID().

• CurrentNodeSet: The set of currently processed nodes which have no parents. If the number of nodes is even, the cardinality of the set is denoted as $2h$ ($h \in \mathbb{Z}^+$); otherwise the cardinality is denoted as $(2h+1)$.

• TempNodeSet: The set of the newly generated nodes.

In the index, if $D_u[i] \neq 0$ for an internal node $u$, there is at least one path from the node $u$ to some leaf, which indicates a document containing the keyword $w_i$. In addition, $D_u[i]$ always stores the largest normalized TF value of $w_i$ among its child nodes. Thus, the largest possible relevance score of its children can be easily estimated.
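The estimate can be made explicit with a short derivation. This is a sketch under the assumption that every entry of the query vector $Q$ is non-negative, which holds for the normalized IDF values of Formula (3); the symbol $v$, denoting an arbitrary leaf in the subtree rooted at $u$, is introduced here only for illustration.

```latex
\begin{align*}
D_u[i] &= \max\{D_{u.P_l}[i],\, D_{u.P_r}[i]\} \;\ge\; D_v[i]
  && \text{for every leaf } v \text{ below } u \text{ (by induction on Formula (5))},\\
RScore(D_u, Q) &= \sum_{i=1}^{m} D_u[i]\, Q[i]
  \;\ge\; \sum_{i=1}^{m} D_v[i]\, Q[i] \;=\; RScore(D_v, Q)
  && \text{since } Q[i] \ge 0 .
\end{align*}
```

Hence the score computed at an internal node is an upper bound on the score of every document stored below it, so a subtree whose bound does not exceed the current threshold can be skipped without losing any result.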

Algorithm 1. BuildIndexTree($\mathcal{F}$)
Input: the document collection $\mathcal{F} = \{f_1, f_2, \ldots, f_n\}$ with the identifiers $FID = \{FID \mid FID = 1, 2, \ldots, n\}$.
Output: the index tree $\mathcal{T}$
1: for each document $f_{FID}$ in $\mathcal{F}$ do
2:   Construct a leaf node $u$ for $f_{FID}$, with $u.ID = \mathrm{GenID}()$, $u.P_l = u.P_r = null$, $u.FID = FID$, and $D[i] = TF_{f_{FID}, w_i}$ for $i = 1, \ldots, m$;
3:   Insert $u$ to CurrentNodeSet;
4: end for
5: while the number of nodes in CurrentNodeSet is larger than 1 do
6:   if the number of nodes in CurrentNodeSet is even, i.e., $2h$ then
7:     for each pair of nodes $u'$ and $u''$ in CurrentNodeSet do
8:       Generate a parent node $u$ for $u'$ and $u''$, with $u.ID = \mathrm{GenID}()$, $u.P_l = u'$, $u.P_r = u''$, $u.FID = 0$ and $D[i] = \max\{u'.D[i], u''.D[i]\}$ for each $i = 1, \ldots, m$;
9:       Insert $u$ to TempNodeSet;
10:     end for
11:   else
12:     for each pair of nodes $u'$ and $u''$ of the former $(2h - 2)$ nodes in CurrentNodeSet do
13:       Generate a parent node $u$ for $u'$ and $u''$;
14:       Insert $u$ to TempNodeSet;
15:     end for
16:     Create a parent node $u_1$ for the $(2h - 1)$-th and $2h$-th node, and then create a parent node $u$ for $u_1$ and the $(2h + 1)$-th node;
17:     Insert $u$ to TempNodeSet;
18:   end if
19:   Replace CurrentNodeSet with TempNodeSet and then clear TempNodeSet;
20: end while
21: return the only node left in CurrentNodeSet, namely, the root of index tree $\mathcal{T}$;
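As an illustration, Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the `Node` class, the `make_parent` helper, and the use of a counter in place of GenID() are assumptions made for the example, and the input is assumed to be a non-empty list of normalized TF vectors, one per document.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional

_next_id = count(1)  # hypothetical stand-in for GenID()


@dataclass
class Node:
    D: List[float]                 # index vector of the node
    FID: int = 0                   # 0 marks an internal node, as in Algorithm 1
    Pl: Optional["Node"] = None    # left child
    Pr: Optional["Node"] = None    # right child
    ID: int = 0


def make_parent(left: Node, right: Node) -> Node:
    # D[i] of a parent is the maximum of its children's D[i].
    D = [max(a, b) for a, b in zip(left.D, right.D)]
    return Node(D=D, Pl=left, Pr=right, ID=next(_next_id))


def build_index_tree(tf_vectors: List[List[float]]) -> Node:
    # Lines 1-4: one leaf per document, with FID = 1, 2, ..., n.
    current = [Node(D=list(v), FID=fid, ID=next(_next_id))
               for fid, v in enumerate(tf_vectors, start=1)]
    # Lines 5-20: merge nodes level by level until only the root is left.
    while len(current) > 1:
        temp = []
        odd = len(current) % 2 == 1
        # With 2h+1 nodes, only the first 2h-2 nodes are paired directly.
        pairs_end = len(current) - 3 if odd else len(current)
        for i in range(0, pairs_end, 2):
            temp.append(make_parent(current[i], current[i + 1]))
        if odd:
            # Line 16: the last three nodes are joined under one parent.
            u1 = make_parent(current[-3], current[-2])
            temp.append(make_parent(u1, current[-1]))
        current = temp
    return current[0]
```

The root vector produced this way is the element-wise maximum over all leaf vectors, which is the property the search algorithm relies on for pruning.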

4.2 Search Process of UDMRS Scheme

The search process of the UDMRS scheme is a recursive procedure on the tree, named the “Greedy Depth-first Search” (GDFS) algorithm. We construct a result list denoted as RList, whose elements are defined as $\langle RScore, FID \rangle$. Here, RScore is the relevance score of the document $f_{FID}$ to the query, which is calculated according to Formula (1). The RList stores the $k$ accessed documents with the largest relevance scores to the query. The elements of the list are ranked in descending order of RScore and are updated as the search proceeds. The remaining notation is listed below, and the GDFS algorithm is described in Algorithm 2.

• RScore($D_u$, $Q$): the function that calculates the relevance score for the query vector $Q$ and the index vector $D_u$ stored in node $u$, as defined in Formula (1).

• kthscore: the smallest relevance score in the current RList, which is initialized as 0.

• hchild: the child node of a tree node with the higher relevance score.

• lchild: the child node of a tree node with the lower relevance score.

Since the largest possible relevance score of the documents in the subtree rooted at node $u$ can be predicted, only a part of the nodes in the tree are accessed during the search process. Fig. 3 shows an example of the search process with the document collection $\mathcal{F} = \{f_i \mid i = 1, \ldots, 6\}$, cardinality of the dictionary $m = 4$, and query vector $Q = (0, 0.92, 0, 0.38)$.

Fig. 3. An example of the tree-based index with the document collection $\mathcal{F} = \{f_i \mid i = 1, \ldots, 6\}$ and cardinality of the dictionary $m = 4$. In the construction process of the tree index, we first generate leaf nodes from the documents. Then, the internal tree nodes are generated based on the leaf nodes. This figure also shows an example of the search process, in which the query vector $Q$ is equal to $(0, 0.92, 0, 0.38)$. In this example, we set the parameter $k = 3$, meaning that three documents will be returned to the user. According to the search algorithm, the search starts with the root node, and reaches the first leaf node $f_4$ through $r_{11}$ and $r_{22}$. The relevance score of $f_4$ to the query is 0.92. After that, the leaf nodes $f_3$ and $f_2$ are successively reached with the relevance scores 0.038 and 0.67. Next, the leaf node $f_1$ is reached with score 0.58 and replaces $f_3$ in RList. Finally, the algorithm tries to search the subtree rooted at $r_{12}$, and finds that there are no reasonable results in this subtree because the relevance score of $r_{12}$ is 0.52, which is smaller than the smallest relevance score in RList.

Algorithm 2. GDFS(IndexTreeNode $u$)
1: if the node $u$ is not a leaf node then
2:   if RScore($D_u$, $Q$) > kthscore then
3:     GDFS($u$.hchild);
4:     GDFS($u$.lchild);
5:   else
6:     return
7:   end if
8: else
9:   if RScore($D_u$, $Q$) > kthscore then
10:     Delete the element with the smallest relevance score from RList;
11:     Insert a new element $\langle$RScore($D_u$, $Q$), $u.FID\rangle$ and sort all the elements of RList;
12:   end if
13:   return
14: end if
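The recursion in Algorithm 2 can be sketched in Python as shown below. This is an illustrative sketch under stated assumptions rather than the authors' code: the `Node` class and the function names are hypothetical, RScore is taken to be the inner product of the two vectors, and the deletion in line 10 is only performed once RList already holds $k$ elements (equivalent to pre-filling RList with $k$ zero-score placeholders).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    D: List[float]
    FID: int = 0                   # 0 for internal nodes
    Pl: Optional["Node"] = None
    Pr: Optional["Node"] = None


def rscore(D: List[float], Q: List[float]) -> float:
    # Relevance score taken as the inner product of index and query vectors.
    return sum(d * q for d, q in zip(D, Q))


def gdfs(u: Node, Q: List[float], k: int, rlist: List[Tuple[float, int]]) -> None:
    # kthscore: smallest score in RList once it is full, otherwise 0.
    kthscore = rlist[-1][0] if len(rlist) == k else 0.0
    score = rscore(u.D, Q)
    if score <= kthscore:
        return                     # prune: nothing below u can enter RList
    if u.Pl is None:               # leaf node
        if len(rlist) == k:
            rlist.pop()            # drop the element with the smallest score
        rlist.append((score, u.FID))
        rlist.sort(reverse=True)   # keep RList in descending order
        return
    # Visit the child with the higher relevance score first.
    hchild, lchild = sorted((u.Pl, u.Pr),
                            key=lambda c: rscore(c.D, Q), reverse=True)
    gdfs(hchild, Q, k, rlist)
    gdfs(lchild, Q, k, rlist)
```

A search is started with an empty list, for example `rlist = []; gdfs(root, Q, k, rlist)`, after which `rlist` holds up to $k$ pairs of score and document identifier in descending order of score.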

4.3 BDMRS Scheme

Based on the UDMRS scheme, we construct the basic dynamic multi-keyword ranked search (BDMRS) scheme by using the secure kNN algorithm [38]. The BDMRS scheme is designed to achieve the goal of privacy preservation in the known ciphertext model, and the four algorithms included are described as follows:

• $SK \leftarrow \mathrm{Setup}()$: Initially, the data owner generates the secret key set $SK$, including 1) a randomly generated $m$-bit vector $S$, where $m$ is equal to the cardinality of the dictionary, and 2) two $(m \times m)$ invertible matrices $M_1$ and $M_2$. Namely, $SK = \{S, M_1, M_2\}$.

• $\mathcal{I} \leftarrow \mathrm{GenIndex}(\mathcal{F}, SK)$: First, the unencrypted index tree $\mathcal{T}$ is built on $\mathcal{F}$ by using $\mathcal{T} \leftarrow \mathrm{BuildIndexTree}(\mathcal{F})$. Second, the data owner generates two random vectors $\{D_u', D_u''\}$ for the index vector $D_u$ in each node $u$, according to the secret vector $S$. Specifically, if $S[i] = 0$, $D_u'[i]$ and $D_u''[i]$ are both set equal to $D_u[i]$; if $S[i] = 1$, $D_u'[i]$ and $D_u''[i]$ are set as two random values whose sum equals $D_u[i]$. Finally, the encrypted index tree $\mathcal{I}$ is built, where the node $u$ stores two encrypted index vectors $I_u = \{M_1^T D_u', M_2^T D_u''\}$.

• $TD \leftarrow \mathrm{GenTrapdoor}(\mathcal{W}_q, SK)$: With the keyword set $\mathcal{W}_q$, the unencrypted query vector $Q$ of length $m$ is generated. If $w_i \in \mathcal{W}_q$, $Q[i]$ stores the normalized IDF value of $w_i$; otherwise $Q[i]$ is set to 0. Similarly, the query vector $Q$ is split into two random vectors $Q'$ and $Q''$. The difference is that if $S[i] = 0$, $Q'[i]$ and $Q''[i]$ are set to two random values whose sum equals $Q[i]$; otherwise $Q'[i]$ and $Q''[i]$ are both set equal to $Q[i]$. Finally, the algorithm returns the trapdoor $TD = \{M_1^{-1} Q', M_2^{-1} Q''\}$.

• $RelevanceScore \leftarrow \mathrm{SRScore}(I_u, TD)$: With the trapdoor $TD$, the cloud server computes the relevance score of node $u$ in the index tree $\mathcal{I}$ to the query. Note that the relevance score calculated from the encrypted vectors is equal to that from the unencrypted vectors, as follows:

$$
\begin{aligned}
I_u \cdot TD &= (M_1^T D_u') \cdot (M_1^{-1} Q') + (M_2^T D_u'') \cdot (M_2^{-1} Q'') \\
&= (M_1^T D_u')^T (M_1^{-1} Q') + (M_2^T D_u'')^T (M_2^{-1} Q'') \\
&= {D_u'}^T M_1 M_1^{-1} Q' + {D_u''}^T M_2 M_2^{-1} Q'' \\
&= D_u' \cdot Q' + D_u'' \cdot Q'' \\
&= D_u \cdot Q = \mathrm{RScore}(D_u, Q).
\end{aligned}
\tag{6}
$$

Security analysis. We analyze the BDMRS scheme according to the three predefined privacy requirements in the design goals:

1) Index confidentiality and query confidentiality. In the proposed BDMRS scheme, $I_u$ and $TD$ are obfuscated vectors, which means the cloud server cannot infer the original vectors $D_u$ and $Q$ without the secret key set $SK$. The secret keys $M_1$ and $M_2$ are Gaussian random matrices. According to [38], an attacker (the cloud server) mounting a ciphertext-only attack (COA) cannot calculate the matrices merely from the ciphertext. Thus, the BDMRS scheme is resilient against ciphertext-only attack, and the index confidentiality and the query confidentiality are well protected.

2) Query unlinkability. The trapdoor of a query vector is generated from a random splitting operation, which means that the same search request will be transformed into different query trapdoors, and thus the query unlinkability is protected. However, the cloud server is able to link the same search requests according to the same visited path and the same relevance scores.

3) Keyword privacy. In this scheme, the confidentiality of the index and the query is well protected, in that the original vectors are kept from the cloud server. The search process merely involves inner product computation on encrypted vectors, which leaks no information about any specific keyword. Thus, the keyword privacy is protected in the known ciphertext model. In the known background model, however, the cloud server is supposed to have more knowledge, such as the term frequency statistics of keywords. This statistical information can be visualized as a TF distribution histogram, which reveals how many documents there are for every TF value of a specific keyword in the document collection. Then, due to the specificity of the TF distribution histogram, such as the graph slope and value range, the cloud server could conduct a TF statistical attack to deduce or identify keywords [24], [25], [27]. In the worst case, when there is only one keyword in the query vector, i.e., the normalized IDF value for the keyword is 1, the final relevance score distribution is exactly the normalized TF distribution of this keyword, which is directly exposed to the cloud server. Therefore, the BDMRS scheme cannot resist the TF statistical attack in the known background model.


4.4 EDMRS Scheme

The security analysis above shows that the BDMRS scheme can protect the Index Confidentiality and Query Confidentiality in the known ciphertext model. However, the cloud server is able to link the same search requests by tracking the path of visited nodes. In addition, in the known background model, it is possible for the cloud server to identify a keyword, as the normalized TF distribution of the keyword can be exactly obtained from the final calculated relevance scores. The primary cause is that the relevance score calculated from $I_u$ and $TD$ is exactly equal to that from $D_u$ and $Q$. A heuristic method to further improve the security is to break this exact equality. Thus, we introduce some tunable randomness to disturb the relevance score calculation. In addition, to suit different users' preferences for more accurate ranked results or better protected keyword privacy, the randomness is made adjustable.

The enhanced EDMRS scheme is almost the same as the BDMRS scheme except that:

• $SK \leftarrow \mathrm{Setup}()$: In this algorithm, we set the secret vector $S$ as an $m$-bit vector, and set $M_1$ and $M_2$ to be $(m + m') \times (m + m')$ invertible matrices, where $m'$ is the number of phantom terms.

• $\mathcal{I} \leftarrow \mathrm{GenIndex}(\mathcal{F}, SK)$: Before encrypting the index vector $D_u$, we extend the vector $D_u$ to be an $(m + m')$-dimensional vector. Each extended element $D_u[m + j]$, $j = 1, \ldots, m'$, is set as a random number $\varepsilon_j$.

• $TD \leftarrow \mathrm{GenTrapdoor}(\mathcal{W}_q, SK)$: The query vector $Q$ is extended to be an $(m + m')$-dimensional vector. Among the extended elements, $m''$ elements are randomly chosen and set to 1, and the rest are set to 0.

• $RelevanceScore \leftarrow \mathrm{SRScore}(I_u, TD)$: After the execution of relevance evaluation by the cloud server, the final relevance score for the index vector $I_u$ equals $D_u \cdot Q + \sum \varepsilon_v$, where $v \in \{j \mid Q[m + j] = 1\}$.
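The effect of the phantom terms on the relevance score can be illustrated on plaintext vectors, leaving out the encryption step. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the random numbers $\varepsilon_j$ are drawn from a uniform distribution centred on a value `mu_prime` with half-width `delta`, as the security analysis of this scheme suggests.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def extend_index(D, m_prime, mu_prime, delta, rng):
    # Append m' phantom entries, each a random number eps_j.
    eps = [rng.uniform(mu_prime - delta, mu_prime + delta) for _ in range(m_prime)]
    return D + eps


def extend_query(Q, m_prime, m_dprime, rng):
    # Append m' entries, of which m'' randomly chosen ones are 1 and the rest 0.
    ones = set(rng.sample(range(m_prime), m_dprime))
    return Q + [1.0 if j in ones else 0.0 for j in range(m_prime)]
```

With `D_ext = extend_index(D, ...)` and `Q_ext = extend_query(Q, ...)`, the inner product `dot(D_ext, Q_ext)` equals `dot(D, Q)` plus the sum of the phantom entries selected by the query, so two trapdoors for the same keywords generally yield different scores for the same node.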

Security analysis. The security of the EDMRS scheme is also analyzed according to the three predefined privacy requirements in the design goals:

1) Index confidentiality and query confidentiality. Inherited from the BDMRS scheme, the EDMRS scheme can protect index confidentiality and query confidentiality in the known background model. Due to the use of phantom terms, the confidentiality is further enhanced, as the transformation matrices are harder to figure out [38].

2) Query unlinkability. By introducing the random value $\varepsilon$, the same search request will generate different query vectors and receive different relevance score distributions. Thus, the query unlinkability is better protected. However, since the proposed scheme is not designed to protect the access pattern, for efficiency reasons, a motivated cloud server can analyze the similarity of search results to judge whether the retrieved results come from the same request. In the proposed EDMRS scheme, the data user can control the level of unlinkability by adjusting the value of $\sum \varepsilon_v$. This is a trade-off between accuracy and privacy, which is determined by the user.

3) Keyword privacy. As discussed in Section 4.3, the BDMRS scheme cannot resist the TF statistical attack in the known background model, as the cloud server is able to deduce or identify keywords by analyzing the TF distribution histogram. Thus, the EDMRS scheme is designed to obscure the TF distributions of keywords with the randomness of $\sum \varepsilon_v$. In order to maximize the randomness of the relevance score distributions, we need to get as many different values of $\sum \varepsilon_v$ as possible. Given that there are $2^{\omega}$ different choices of $\sum \varepsilon_v$ for each index vector, the probability that two $\sum \varepsilon_v$ share the same value is $1/2^{\omega}$. In the EDMRS scheme, the number of different $\sum \varepsilon_v$ is equal to $\binom{m'}{m''}$, which reaches the maximum when $m'/m'' = 2$. Hence, considering $\binom{m'}{m''} \geq (m'/m'')^{m''} = 2^{m''}$, we set $m' = 2\omega$ and $m'' = \omega$ so that the number of different $\sum \varepsilon_v$ is greater than $2^{\omega}$. Therefore, there are at least $2\omega$ dummy elements in every vector, and half of them need to be randomly selected to generate $\sum \varepsilon_v$ in each query. In addition, we set every $\varepsilon_j$ to follow the same uniform distribution $U(\mu' - \delta, \mu' + \delta)$. According to the central limit theorem, $\sum \varepsilon_v$ follows the normal distribution $N(\mu, \sigma^2)$, where the expectation $\mu$ and the standard deviation $\sigma$ can be calculated as:

$$
\mu = \omega \mu', \qquad \sigma^2 = \omega \delta^2 / 3.
\tag{7}
$$

In a real application, we can set $\mu = 0$, and balance the accuracy and privacy by adjusting the standard deviation $\sigma$.
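Formula (7) follows from the variance $\delta^2/3$ of a single uniform variable on an interval of half-width $\delta$, summed over $\omega$ independent terms. The short simulation below is a sanity check rather than part of the scheme; the parameter values, the sample size, and the function name `phantom_noise` are arbitrary choices made for the example.

```python
import math
import random
import statistics


def phantom_noise(omega, mu_prime, delta, rng):
    # Sum of omega independent draws from U(mu' - delta, mu' + delta).
    return sum(rng.uniform(mu_prime - delta, mu_prime + delta) for _ in range(omega))


omega, mu_prime, delta = 10, 0.0, 1.0
rng = random.Random(0)
samples = [phantom_noise(omega, mu_prime, delta, rng) for _ in range(20000)]

sample_mean = statistics.fmean(samples)
sample_var = statistics.pvariance(samples)
expected_mean = omega * mu_prime            # mu = omega * mu'
expected_var = omega * delta ** 2 / 3       # sigma^2 = omega * delta^2 / 3

# Number of ways to choose m'' = omega phantom terms out of m' = 2 * omega.
n_choices = math.comb(2 * omega, omega)
```

With these settings the sample mean and variance should land close to `expected_mean` and `expected_var`, and `n_choices` is at least $2^{\omega}$, in line with the bound used above.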

4.5 Dynamic Update Operation of DMRS

After the insertion or deletion of a document, we need to update the index synchronously. Since the index of the DMRS scheme is designed as a balanced binary tree, the dynamic operation is carried out by updating nodes in the index tree. Note that the update of the index is based merely on document identifiers, and no access to the content of the documents is required. The specific process is presented as follows:

• $\{\mathcal{I}'_s, c_i\} \leftarrow \mathrm{GenUpdateInfo}(SK, \mathcal{T}_s, i, updtype)$: This algorithm generates the update information $\{\mathcal{I}'_s, c_i\}$, which will be sent to the cloud server. In order to reduce the communication overhead, the data owner stores a copy of the unencrypted index tree. Here, the notation $updtype \in \{Ins, Del\}$ denotes either an insertion or a deletion of the document $f_i$. The notation $\mathcal{T}_s$ denotes the set of tree nodes that need to be changed during the update. For example, if we want to delete the document $f_4$ in Fig. 3, the subtree $\mathcal{T}_s$ includes the set of nodes $\{r_{22}, r_{11}, r\}$.

- If $updtype$ is equal to $Del$, the data owner deletes from the subtree the leaf node that stores the document identity $i$ and updates the vector $D$ of the other nodes in the subtree $\mathcal{T}_s$, so as to generate the updated subtree $\mathcal{T}'_s$. In particular, if the deletion of the leaf node breaks the balance of the binary index tree, we replace the deleted node with a fake node whose vector is padded with 0 and whose file identity is null. Then, the data owner encrypts the vectors stored in the subtree $\mathcal{T}'_s$ with the key set $SK$ to generate the encrypted subtree $\mathcal{I}'_s$, and sets the output $c_i$ as null.

- If $updtype$ is equal to $Ins$, the data owner generates a tree node $u = \langle \mathrm{GenID}(), D, null, null, i \rangle$ for the document $f_i$, where $D[j] = TF_{f_i, w_j}$ for $j = 1, \ldots, m$. Then, the data owner inserts this new node into the subtree $\mathcal{T}_s$ as a leaf node and updates the vector $D$ of the other nodes in the subtree $\mathcal{T}_s$ according to Formula (5), so as to generate the new subtree $\mathcal{T}'_s$. Here, it is always preferable for the data owner to replace the fake leaf nodes generated by the $Del$ operation with newly inserted nodes, instead of directly inserting new nodes. Next, the data owner encrypts the vectors stored in the subtree $\mathcal{T}'_s$ with the key set $SK$ as described in Section 4.4, to generate the encrypted subtree $\mathcal{I}'_s$. Finally, the document $f_i$ is encrypted to $c_i$.

• $\{\mathcal{I}', \mathcal{C}'\} \leftarrow \mathrm{Update}(\mathcal{I}, \mathcal{C}, updtype, \mathcal{I}'_s, c_i)$: In this algorithm, the cloud server replaces the corresponding subtree $\mathcal{I}_s$ (the encrypted form of $\mathcal{T}_s$) with $\mathcal{I}'_s$, so as to generate a new index tree $\mathcal{I}'$. If $updtype$ is equal to $Ins$, the cloud server inserts the encrypted document $c_i$ into $\mathcal{C}$, obtaining a new collection $\mathcal{C}'$. If $updtype$ is equal to $Del$, the cloud server deletes the encrypted document $c_i$ from $\mathcal{C}$ to obtain the new collection $\mathcal{C}'$.

Similar to the scheme in [31], our scheme can also carry out the update operation without storing the index tree on the data owner side. We choose to store the unencrypted index tree on the data owner side to trade storage cost for a lower communication burden. In both the Kamara and Papamanthou scheme [31] and our design, a set of nodes needs to be changed in order to update a leaf node, because the vector data of an internal node is computed from its children. If the data owner does not store the unencrypted subtree, the whole update process needs two rounds of communication between the cloud server and the data owner. Specifically, the data owner should first download the involved subtree in encrypted form from the cloud server. Second, the data owner decrypts the subtree and updates it with the newly added or deleted leaf node. Third, the data owner re-encrypts the subtree and uploads the encrypted subtree to the cloud server. Finally, the cloud server replaces the old subtree with the updated one. Thus, to reduce the communication cost, we store an unencrypted tree on the data owner side. Then, the data owner can update the subtree directly with the newly added or deleted leaf node, and encrypt and upload the updated subtree to the cloud server. In this case, the update operation can be finished with one round of communication between the cloud server and the data owner.
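On the data owner side, the plaintext part of an update amounts to changing one leaf and recomputing the vectors on the path from that leaf to the root. The sketch below illustrates this step only and omits encryption. It is a simplification under stated assumptions: the `Node` class with a `parent` pointer and all function names are hypothetical, and a deleted leaf is always turned into a fake node (zero vector, null identity), whereas the scheme does so only when the deletion would break the balance of the tree.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Node:
    D: List[float]
    FID: Optional[int] = None          # None marks internal and fake nodes here
    Pl: Optional["Node"] = None
    Pr: Optional["Node"] = None
    parent: Optional["Node"] = None


def link(left: Node, right: Node) -> Node:
    # Build a parent whose vector is the element-wise maximum of its children.
    p = Node(D=[max(a, b) for a, b in zip(left.D, right.D)], Pl=left, Pr=right)
    left.parent = right.parent = p
    return p


def refresh_path(leaf: Node) -> List[Node]:
    # Recompute D for every ancestor of the leaf; the returned nodes, together
    # with the leaf itself, form the subtree T_s that must be re-encrypted.
    changed = []
    node = leaf.parent
    while node is not None:
        node.D = [max(a, b) for a, b in zip(node.Pl.D, node.Pr.D)]
        changed.append(node)
        node = node.parent
    return changed


def delete_document(leaf: Node) -> List[Node]:
    leaf.D = [0.0] * len(leaf.D)       # fake node: vector padded with 0
    leaf.FID = None                    # fake node: null file identity
    return refresh_path(leaf)


def insert_document(fake_leaf: Node, fid: int, tf: List[float]) -> List[Node]:
    # Reuse a fake leaf left behind by an earlier deletion.
    fake_leaf.D = list(tf)
    fake_leaf.FID = fid
    return refresh_path(fake_leaf)
```

Only the nodes returned by `refresh_path`, about $\log n$ of them in a balanced tree, need to be re-encrypted and uploaded, which is why one round of communication suffices when the data owner keeps the plaintext tree.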

For a dynamic scheme, it is not reasonable to fix the length of the vectors as the size of the dictionary, because a newly added document may contain keywords that are not in the dictionary. In the proposed scheme, we add some blank entries to the dictionary and set the corresponding entries in each index vector to 0. If new keywords appear while inserting documents, these blank entries are replaced with the new keywords. Then, the index vectors of newly added documents are generated based on the updated dictionary, while the other index vectors are not affected and remain the same as before.
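The blank-entry idea can be sketched as a dictionary with reserved slots. The class below is a hypothetical illustration, not the authors' data structure; the class name, its methods, and the example keywords are all assumptions.

```python
class KeywordDictionary:
    """Dictionary with reserved blank entries for keywords that appear later."""

    def __init__(self, keywords, n_blank):
        self.slots = list(keywords) + [None] * n_blank   # None marks a blank entry
        self.position = {w: i for i, w in enumerate(keywords)}

    def add(self, keyword):
        # Return the position of the keyword, filling a blank entry if it is new.
        if keyword in self.position:
            return self.position[keyword]
        i = self.slots.index(None)     # raises ValueError when no blank entry is left
        self.slots[i] = keyword
        self.position[keyword] = i
        return i

    def index_vector(self, tf):
        # tf maps keyword -> normalized TF value; unused entries stay 0.
        vec = [0.0] * len(self.slots)
        for keyword, value in tf.items():
            vec[self.add(keyword)] = value
        return vec
```

Because the vector length is fixed at construction time, a vector built before a new keyword is registered stays valid afterwards: its entry for the new keyword is simply 0.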

After several rounds of document updates, the real IDF values of some keywords in the present collection may have changed noticeably. Therefore, as the distributor of the IDF data, the data owner needs to recalculate the IDF values for all keywords and distribute them to authorized users. In Table 1, there are three classes of keywords with different IDF value ranges. A smaller IDF value means the keyword appears more frequently. Table 1 shows that after adding or deleting 100 and 300 documents, the IDF values do not change much. Thus, the data owner does not need to update the IDF values every time an update operation is executed on the data set. The data owner can flexibly choose when to check the change of the IDF values, and distribute the new IDF values when they have changed significantly.
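How slowly IDF values drift can be seen with a small calculation. The sketch below assumes the IDF definition $\ln(1 + N/N_w)$, where $N$ is the number of documents and $N_w$ the number of documents containing the keyword; this definition is not stated in this section and is used here as an assumption. It further assumes a keyword that occurs in exactly one document, which is not removed by the deletions.

```python
import math


def idf(n_docs: int, n_docs_with_keyword: int) -> float:
    # Assumed definition: IDF = ln(1 + N / N_w).
    return math.log(1 + n_docs / n_docs_with_keyword)


# Original collection, then after deleting 100 / 300 and adding 100 / 300 documents.
for n_docs in (5000, 4900, 4700, 5100, 5300):
    print(n_docs, round(idf(n_docs, 1), 4))
```

Under these assumptions the values stay within about 0.06 of the original one, and they appear to agree with the last row of Table 1, which suggests that row corresponds to a keyword contained in a single document.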

4.6 Parallel Execution of Search

TABLE 1
The Change of Keyword IDF Values After Updating in a Collection with 5,000 Documents

| Keyword No. | Original IDF value | After deleting 100 documents | After deleting 300 documents | After adding 100 documents | After adding 300 documents |
|---|---|---|---|---|---|
| 1 | 3.0332 | 3.0253 | 3.0166 | 3.0334 | 3.0267 |
| 2 | 3.2581 | 3.2581 | 3.2530 | 3.2628 | 3.2857 |
| 3 | 3.7616 | 3.7584 | 3.7431 | 3.7647 | 3.7550 |
| 4 | 3.8934 | 3.8926 | 3.8910 | 3.9128 | 3.9226 |
| 5 | 5.6304 | 5.6103 | 5.6861 | 5.6501 | 5.6885 |
| 6 | 5.7478 | 5.7277 | 5.6861 | 5.7675 | 5.8059 |
| 7 | 5.8121 | 5.7920 | 5.8192 | 5.8319 | 5.8702 |
| 8 | 7.4192 | 7.3990 | 7.3573 | 7.4390 | 7.4774 |
| 9 | 7.8244 | 7.8043 | 7.7626 | 7.8442 | 7.8827 |
| 10 | 8.5174 | 8.4972 | 8.4555 | 8.5372 | 8.5757 |

Owing to the tree-based index structure, the proposed search scheme can be executed in parallel, which further improves the search efficiency. For example, assume there is a set of processors $P = \{p_1, \ldots, p_l\}$ available. Given a search request, an idle processor $p_i$ is used to query the root $r$. If the search can be continued on both children, and there is an idle processor $p_j$, the processor $p_i$ continues to deal with one of the children while processor $p_j$ deals with the other one. If there is no idle processor, the current processor is used to deal with the child with the larger relevance score, and the other child is put into a waiting queue. Once there is an idle processor, it takes the oldest node in the queue to continue the search. Note that all the processors share the same result list RList.
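The parallel search of Section 4.6 can be approximated with worker threads that share a waiting queue and a single result list. The sketch below is a simplified illustration, not the authors' C++ implementation: every promising child is placed in the FIFO queue instead of being handed directly to an idle processor, the shared RList is kept as a min-heap guarded by a lock, and the `Node` class and all names are hypothetical.

```python
import heapq
import queue
import threading
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    D: List[float]
    FID: int = 0
    Pl: Optional["Node"] = None
    Pr: Optional["Node"] = None


def rscore(D, Q):
    return sum(d * q for d, q in zip(D, Q))


def parallel_search(root, Q, k, n_threads=4):
    waiting = queue.Queue()            # FIFO: the oldest waiting node is served first
    rlist = []                         # shared min-heap of (score, FID)
    lock = threading.Lock()

    def worker():
        while True:
            node = waiting.get()
            if node is None:           # sentinel: no more work
                waiting.task_done()
                return
            score = rscore(node.D, Q)
            with lock:                 # RList is accessed in a mutually exclusive manner
                kthscore = rlist[0][0] if len(rlist) == k else 0.0
                promising = score > kthscore
                if promising and node.Pl is None:
                    if len(rlist) == k:
                        heapq.heapreplace(rlist, (score, node.FID))
                    else:
                        heapq.heappush(rlist, (score, node.FID))
            if promising and node.Pl is not None:
                # Enqueue the child with the larger relevance score first.
                for child in sorted((node.Pl, node.Pr),
                                    key=lambda c: rscore(c.D, Q), reverse=True):
                    waiting.put(child)
            waiting.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    waiting.put(root)
    waiting.join()                     # returns once every queued node is processed
    for _ in threads:
        waiting.put(None)
    for t in threads:
        t.join()
    return sorted(rlist, reverse=True)
```

Since the score of an internal node bounds the scores of all leaves below it and kthscore never decreases, pruning stays sound under any interleaving of the workers; the order of visits may differ from the sequential search, but the returned top-$k$ set is the same when the scores are distinct.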

5 PERFORMANCE ANALYSIS

We implement the proposed scheme in C++ on the Windows 7 operating system and test its efficiency on a real-world document collection: the Request for Comments (RFC) [39]. The test includes 1) the search precision at different privacy levels, and 2) the efficiency of index construction, trapdoor generation, search, and update. Most of the experimental results are obtained with an Intel Core(TM) Duo Processor (2.93 GHz), except that the efficiency of search is tested on a server with two Intel(R) Xeon(R) CPU E5-2620 Processors (2.0 GHz), which has 12 processor cores and supports 24 parallel threads.

5.1 Precision and Privacy

The search precision of the scheme is affected by the dummy keywords in the EDMRS scheme. Here, the “precision” is defined as in [26]: $P_k = k'/k$, where $k'$ is the number of real top-$k$ documents among the retrieved $k$ documents. If a smaller standard deviation $\sigma$ is set for the random variable $\sum \varepsilon_v$, the EDMRS scheme is expected to obtain higher precision, and vice versa. The results are shown in Fig. 4a.

In the EDMRS scheme, phantom terms are added to the index vector to obscure the relevance score calculation, so that the cloud server cannot identify keywords by analyzing the TF distributions of special keywords. Here, we quantify the obscureness of the relevance score by “rank privacy”, which is defined as:

$$
P'_k = \sum |r_i - r'_i| / k^2,
\tag{8}
$$

where $r_i$ is the rank number of a document in the retrieved top-$k$ documents, and $r'_i$ is its real rank number in the whole ranked results. A larger rank privacy denotes higher security of the scheme, which is illustrated in Fig. 4b.
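Both measures are straightforward to compute from two rankings. The sketch below is illustrative only; the function names and the representation of a ranking as a list of document identifiers, best first, are assumptions made for the example.

```python
def precision(retrieved, true_ranking, k):
    # P_k = k'/k, where k' counts retrieved documents that are real top-k documents.
    true_top_k = set(true_ranking[:k])
    return sum(1 for fid in retrieved[:k] if fid in true_top_k) / k


def rank_privacy(retrieved, true_ranking, k):
    # P'_k = sum(|r_i - r'_i|) / k^2, with r_i the rank in the retrieved list
    # and r'_i the real rank in the whole ranked results.
    true_rank = {fid: r for r, fid in enumerate(true_ranking, start=1)}
    return sum(abs(r - true_rank[fid])
               for r, fid in enumerate(retrieved[:k], start=1)) / k ** 2
```

For example, with the real ranking `[1, 2, 3, 4, 5, 6]` and the retrieved list `[2, 1, 5]` for $k = 3$, two of the three retrieved documents are real top-3 documents, so the precision is 2/3, and the rank differences 1, 1, and 2 give a rank privacy of 4/9.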

In the proposed scheme, data users can meet different requirements on search precision and privacy by adjusting the standard deviation $\sigma$, which can be treated as a balance parameter.

We compare our schemes with a recent work proposed by Sun et al. [27], which achieves high search efficiency. Note that our BDMRS scheme retrieves the search results through exact calculation on the document vector and the query vector. Thus, the top-$k$ search precision of the BDMRS scheme is 100 percent. As a similarity-based multi-keyword ranked search scheme, however, the basic scheme in [27] suffers from precision loss due to the clustering of sub-vectors during index construction. The precision test of the basic scheme of [27] is presented in Table 2. In each test, five keywords are randomly chosen as input, and the precision of the returned top 100 results is observed. The test is repeated 16 times, and the average precision is 91 percent.

5.2 Efficiency

5.2.1 Index Tree Construction

The process of index tree construction for the document collection $\mathcal{F}$ includes two main steps: 1) building an unencrypted KBB tree based on the document collection $\mathcal{F}$, and 2) encrypting the index tree with the splitting operation and two multiplications by an $(m \times m)$ matrix. The index structure is constructed following a post-order traversal of the tree based on the document collection $\mathcal{F}$, and $O(n)$ nodes are generated during the traversal. For each node, the generation of an index vector takes $O(m)$ time, the vector splitting process takes $O(m)$ time, and the two multiplications by an $(m \times m)$ matrix take $O(m^2)$ time. As a whole, the time complexity of index tree construction is $O(nm^2)$. The time cost of building the index tree thus depends mainly on the cardinality of the document collection $\mathcal{F}$ and the number of keywords in the dictionary $\mathcal{W}$. Fig. 5 shows that the time cost of index tree construction is almost linear in the size of the document collection, and is proportional to the number of keywords in the dictionary. Due to the dimension extension, the index tree construction of the EDMRS scheme is slightly more time-consuming than that of the BDMRS scheme. Although the index tree construction consumes a relatively large amount of time on the data owner side, it is noteworthy that this is a one-time operation.

Fig. 4. The precision (a) and rank privacy (b) of searches with different standard deviation $\sigma$.

TABLE 2
Precision Test of [27]'s Basic Scheme

| No. | Precision | No. | Precision |
|---|---|---|---|
| 1 | 88% | 9 | 96% |
| 2 | 94% | 10 | 86.7% |
| 3 | 97% | 11 | 87.5% |
| 4 | 100% | 12 | 100% |
| 5 | 85% | 13 | 82.3% |
| 6 | 89% | 14 | 100% |
| 7 | 89% | 15 | 100% |
| 8 | 96% | 16 | 71.1% |

Fig. 5. Time cost for index tree construction: (a) for different sizes of document collection with the fixed dictionary, $m = 4{,}000$, and (b) for different sizes of dictionary with the fixed document collection, $n = 1{,}000$.

On the other hand, since the underlying balanced binary tree has space complexity $O(n)$ and every node stores two $m$-dimensional vectors, the space complexity of the index tree is $O(nm)$. As listed in Table 3, when the document collection is fixed ($n = 1{,}000$), the storage consumption of the index tree is determined by the size of the dictionary.

5.2.2 Trapdoor Generation

The generation of a trapdoor incurs a vector splitting operation and two multiplications by an $(m \times m)$ matrix; thus the time complexity is $O(m^2)$, as shown in Fig. 6a. Typical search requests usually consist of just a few keywords. Fig. 6b shows that the number of query keywords has little influence on the overhead of trapdoor generation when the dictionary size is fixed. Due to the dimension extension, the time cost of the EDMRS scheme is a little higher than that of the BDMRS scheme.

5.2.3 Search Efficiency

During the search process, if the relevance score at node $u$ is larger than the minimum relevance score in the result list RList, the cloud server examines the children of the node; otherwise it returns. Thus, many nodes are not accessed during a real search. We denote the number of leaf nodes that contain one or more keywords in the query as $\theta$. Generally, $\theta$ is larger than the number of required documents $k$, but far less than the cardinality of the document collection $n$. As the index is a balanced binary tree, its height is maintained at $\log n$, and the complexity of a relevance score calculation is $O(m)$. Thus, the time complexity of search is $O(\theta m \log n)$. Note that the real search time is less than $\theta m \log n$, because 1) many leaf nodes that contain the queried keywords are not visited by our search algorithm, and 2) the access paths of different leaf nodes share common traversed parts. In addition, the parallel execution of the search process can increase the efficiency considerably.

We test the search efficiency of the proposed scheme on a server that supports 24 parallel threads. The search performance is tested by starting 1, 4, 8 and 16 threads, respectively. We compare the search efficiency of our scheme with that of Sun et al. [27]. In the implementation of Sun's code, we divide 4,000 keywords into 50 levels. Thus, each level contains 80 keywords. According to [27], the higher the level at which the query keywords reside, the higher the search efficiency is. In our experiment, we choose 10 keywords from the first level (the highest level, the optimal case) for the search efficiency comparison. Fig. 7 shows that if the query keywords are chosen from the first level, our scheme obtains almost the same efficiency as [27] when we start four threads.

Fig. 7 also shows that the search efficiency of our scheme increases considerably when we increase the number of threads from 1 to 4. However, when we continue to increase the number of threads, the search efficiency does not increase remarkably. Our search algorithm can be executed in parallel to improve the search efficiency, but all the started threads share one result list RList in a mutually exclusive manner. When we start too many threads, the threads spend a lot of time waiting to read and write the RList.

TABLE 3
Storage Consumption of Index Tree

| Size of dictionary | 1,000 | 2,000 | 3,000 | 4,000 | 5,000 |
|---|---|---|---|---|---|
| BDMRS (MB) | 73 | 146 | 220 | 293 | 367 |
| EDMRS (MB) | 95 | 168 | 241 | 315 | 388 |

Fig. 6. Time cost for trapdoor generation: (a) for different sizes of dictionary with the fixed number of query keywords, $t = 10$, and (b) for different numbers of query keywords with the fixed dictionary, $m = 4{,}000$.

Fig. 7. The efficiency of a search with 10 keywords of interest as input: (a) for different sizes of document collection with the same dictionary, $m = 4{,}000$, and (b) for different numbers of retrieved documents with the same document collection and dictionary, $n = 1{,}000$ and $m = 4{,}000$.

An intuitive method to handle this problem is to construct multiple result lists. However, in our scheme, this does not improve the search efficiency much, because we need to find $k$ results for each result list, and the time complexity of retrieving each result list is $O(\theta m \log n / l)$. In this case, the multiple threads do not save much time, and selecting $k$ results from the multiple result lists further increases the time consumption. In Fig. 8, we show the time consumption when we start multiple threads with multiple result lists. The experimental results show that our scheme obtains better search efficiency when we start multiple threads with only one result list.
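The extra selection step required by the multiple-list variant can be sketched as follows. This is a hypothetical illustration of merging per-thread result lists, not the code used in the experiments; each list is assumed to hold up to $k$ pairs of relevance score and document identifier.

```python
import heapq


def merge_result_lists(result_lists, k):
    # Select the k best (score, FID) pairs over all per-thread result lists.
    return heapq.nlargest(k, (item for rl in result_lists for item in rl))
```

With $l$ lists of $k$ elements each, this step examines $l \cdot k$ candidates in addition to the search itself, which is the overhead referred to above.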

5.2.4 Update Efficiency

In order to update a leaf node, the data owner needs to update $\log n$ nodes. Since this involves an encryption operation on the index vector at each node, which takes $O(m^2)$ time, the time complexity of an update operation is $O(m^2 \log n)$. We illustrate the time cost of the deletion of a document. Fig. 9a shows that when the size of the dictionary is fixed, the deletion of a document takes time that is nearly logarithmic in the size of the document collection. Fig. 9b shows that the update time is proportional to the size of the dictionary when the document collection is fixed.

In addition, the space complexity of each node is $O(m)$. Thus, the space complexity of the communication package for updating a document is $O(m \log n)$.

6 CONCLUSION AND FUTURE WORK

In this paper, a secure, efficient and dynamic search scheme is proposed, which supports not only accurate multi-keyword ranked search but also the dynamic deletion and insertion of documents. We construct a special keyword balanced binary tree as the index, and propose a “Greedy Depth-first Search” algorithm to obtain better efficiency than linear search. In addition, the search process can be carried out in parallel to further reduce the time cost. The security of the scheme is protected against two threat models by using the secure kNN algorithm. Experimental results demonstrate the efficiency of our proposed scheme.

There are still many challenging problems in symmetric SE schemes. In the proposed scheme, the data owner is responsible for generating the update information and sending it to the cloud server. Thus, the data owner needs to store the unencrypted index tree and the information that is necessary to recalculate the IDF values. Such an active data owner may not be very suitable for the cloud computing model. It would be a meaningful but difficult piece of future work to design a dynamic searchable encryption scheme whose update operation can be completed by the cloud server alone, while preserving the ability to support multi-keyword ranked search. In addition, like most work on searchable encryption, our scheme mainly considers the challenge from the cloud server. In fact, there are many security challenges in a multi-user scheme. First, all the users usually keep the same secret key for trapdoor generation in a symmetric SE scheme. In this case, the revocation of a user is a big challenge. If a user needs to be revoked in this scheme, we need to rebuild the index and distribute the new secret keys to all the authorized users. Second, symmetric SE schemes usually assume that all the data users are trustworthy. This is not practical, and a dishonest data user will lead to many security problems. For example, a dishonest data user may search the documents and distribute the decrypted documents to unauthorized parties. Moreover, a dishonest data user may distribute his or her secret keys to unauthorized parties. In future work, we will try to improve the SE scheme to handle these challenging problems.

ACKNOWLEDGMENTS

This work is supported by the NSFC (61173141, 61232016, U1405254, 61173142, 61173136, 61103215, 61373132, 61373133, 61300237, 61373167), GYHY201206033, 201301030, 2013DFG12860, BC2013012, Fund of Jiangsu Engineering Center of Network Monitoring (KJR1308, KJR1402), Fund of MOE Internet Innovation Platform (KJRP1403), Natural Science Foundation of Hubei Province (2013CFB297) and PAPD fund. The authors would like to thank Wenhai Sun and Bing Wang for providing their experimental code for performance comparisons, and thank the editor and reviewers for their valuable suggestions. Z. Xia is the corresponding author.

Fig. 8. The efficiency of a search with 10 keywords of interest as input: (a) for different sizes of document collection with the same dictionary, m = 4,000, and (b) for different numbers of retrieved documents with the same document collection and dictionary, n = 1,000 and m = 4,000.

Fig. 9. Time cost for deletion of a document: (a) for different sizes of document collection with the same dictionary, m = 4,000, and (b) for the same document collection with different sizes of dictionary, n = 1,000.

350 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 2, FEBRUARY 2016


REFERENCES

[1] K. Ren, C. Wang, and Q. Wang, “Security challenges for the public cloud,” IEEE Internet Comput., vol. 16, no. 1, pp. 69–73, Jan./Feb. 2012.

[2] S. Kamara and K. Lauter, “Cryptographic cloud storage,” in Proc. Financ. Cryptography Data Secur., 2010, pp. 136–149.

[3] C. Gentry, “A fully homomorphic encryption scheme,” Ph.D. dissertation, Stanford Univ., Stanford, CA, USA, 2009.

[4] O. Goldreich and R. Ostrovsky, “Software protection and simulation on oblivious RAMs,” J. ACM, vol. 43, no. 3, pp. 431–473, 1996.

[5] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, “Public key encryption with keyword search,” in Proc. Adv. Cryptol.-Eurocrypt, 2004, pp. 506–522.

[6] D. Boneh, E. Kushilevitz, R. Ostrovsky, and W. E. Skeith III, “Public key encryption that allows PIR queries,” in Proc. Adv. Cryptol., 2007, pp. 50–67.

[7] D. X. Song, D. Wagner, and A. Perrig, “Practical techniques for searches on encrypted data,” in Proc. IEEE Symp. Secur. Privacy, 2000, pp. 44–55.

[8] E.-J. Goh, “Secure indexes,” IACR Cryptol. ePrint Archive, vol. 2003, p. 216, 2003.

[9] Y.-C. Chang and M. Mitzenmacher, “Privacy preserving keyword searches on remote encrypted data,” in Proc. 3rd Int. Conf. Appl. Cryptography Netw. Secur., 2005, pp. 442–455.

[10] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, “Searchable symmetric encryption: Improved definitions and efficient constructions,” in Proc. 13th ACM Conf. Comput. Commun. Secur., 2006, pp. 79–88.

[11] J. Li, Q. Wang, C. Wang, N. Cao, K. Ren, and W. Lou, “Fuzzy keyword search over encrypted data in cloud computing,” in Proc. IEEE INFOCOM, 2010, pp. 1–5.

[12] M. Kuzu, M. S. Islam, and M. Kantarcioglu, “Efficient similarity search over encrypted data,” in Proc. IEEE 28th Int. Conf. Data Eng., 2012, pp. 1156–1167.

[13] C. Wang, K. Ren, S. Yu, and K. M. R. Urs, “Achieving usable and privacy-assured similarity search over outsourced cloud data,” in Proc. IEEE INFOCOM, 2012, pp. 451–459.

[14] B. Wang, S. Yu, W. Lou, and Y. T. Hou, “Privacy-preserving multi-keyword fuzzy search over encrypted data in the cloud,” in Proc. IEEE INFOCOM, 2014, pp. 2112–2120.

[15] P. Golle, J. Staddon, and B. Waters, “Secure conjunctive keyword search over encrypted data,” in Proc. Appl. Cryptography Netw. Secur., 2004, pp. 31–45.

[16] Y. H. Hwang and P. J. Lee, “Public key encryption with conjunctive keyword search and its extension to a multi-user system,” in Proc. 1st Int. Conf. Pairing-Based Cryptography, 2007, pp. 2–22.

[17] L. Ballard, S. Kamara, and F. Monrose, “Achieving efficient conjunctive keyword searches over encrypted data,” in Proc. 7th Int. Conf. Inf. Commun. Secur., 2005, pp. 414–426.

[18] D. Boneh and B. Waters, “Conjunctive, subset, and range queries on encrypted data,” in Proc. 4th Conf. Theory Cryptography, 2007, pp. 535–554.

[19] B. Zhang and F. Zhang, “An efficient public key encryption with conjunctive-subset keywords search,” J. Netw. Comput. Appl., vol. 34, no. 1, pp. 262–267, 2011.

[20] J. Katz, A. Sahai, and B. Waters, “Predicate encryption supporting disjunctions, polynomial equations, and inner products,” in Proc. Adv. Cryptol., 2008, pp. 146–162.

[21] E. Shen, E. Shi, and B. Waters, “Predicate privacy in encryption systems,” in Proc. 6th Theory of Cryptography Conf. Theory Cryptography, 2009, pp. 457–473.

[22] A. Lewko, T. Okamoto, A. Sahai, K. Takashima, and B. Waters, “Fully secure functional encryption: Attribute-based encryption and (hierarchical) inner product encryption,” in Proc. 29th Annu. Int. Conf. Theory Appl. Cryptographic Tech., 2010, pp. 62–91.

[23] A. Swaminathan, Y. Mao, G.-M. Su, H. Gou, A. L. Varna, S. He, M. Wu, and D. W. Oard, “Confidentiality-preserving rank-ordered search,” in Proc. ACM Workshop Storage Security Survivability, 2007, pp. 7–12.

[24] S. Zerr, D. Olmedilla, W. Nejdl, and W. Siberski, “Zerber+R: Top-k retrieval from a confidential index,” in Proc. 12th Int. Conf. Extending Database Technol.: Adv. Database Technol., 2009, pp. 439–449.

[25] C. Wang, N. Cao, K. Ren, and W. Lou, “Enabling secure and efficient ranked keyword search over outsourced cloud data,” IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 8, pp. 1467–1479, Aug. 2012.

[26] N. Cao, C. Wang, M. Li, K. Ren, and W. Lou, “Privacy-preserving multi-keyword ranked search over encrypted cloud data,” in Proc. IEEE INFOCOM, Apr. 2011, pp. 829–837.

[27] W. Sun, B. Wang, N. Cao, M. Li, W. Lou, Y. T. Hou, and H. Li, “Privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking,” in Proc. 8th ACM SIGSAC Symp. Inf., Comput. Commun. Secur., 2013, pp. 71–82.

[28] C. Orencik, M. Kantarcioglu, and E. Savas, “A practical and secure multi-keyword search method over encrypted cloud data,” in Proc. IEEE 6th Int. Conf. Cloud Comput., 2013, pp. 390–397.

[29] W. Zhang, S. Xiao, Y. Lin, T. Zhou, and S. Zhou, “Secure ranked multi-keyword search for multiple data owners in cloud computing,” in Proc. 44th Annu. IEEE/IFIP Int. Conf. Dependable Syst. Netw. (DSN), 2014, pp. 276–286.

[30] S. Kamara, C. Papamanthou, and T. Roeder, “Dynamic searchable symmetric encryption,” in Proc. ACM Conf. Comput. Commun. Secur., 2012, pp. 965–976.

[31] S. Kamara and C. Papamanthou, “Parallel and dynamic searchable symmetric encryption,” in Proc. Financ. Cryptography Data Secur., 2013, pp. 258–274.

[32] D. Cash, S. Jarecki, C. Jutla, H. Krawczyk, M.-C. Rosu, and M. Steiner, “Highly-scalable searchable symmetric encryption with support for Boolean queries,” in Proc. Adv. Cryptol., 2013, pp. 353–373.

[33] D. Cash, J. Jaeger, S. Jarecki, C. Jutla, H. Krawczyk, M.-C. Rosu, and M. Steiner, “Dynamic searchable encryption in very large databases: Data structures and implementation,” in Proc. Netw. Distrib. Syst. Security Symp., vol. 14, 2014, pp. 1–16.

[34] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge, U.K.: Cambridge Univ. Press, 2008.

[35] B. Gu and V. S. Sheng, “Feasibility and finite convergence analysis for accurate on-line ν-support vector learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 8, pp. 1304–1315, Aug. 2013.

[36] X. Wen, L. Shao, W. Fang, and Y. Xue, “Efficient feature selection and classification for vehicle detection,” IEEE Trans. Circuits Syst. Video Technol., DOI: 10.1109/TCSVT.2014.2358031.

[37] H. Delfs and H. Knebl, Introduction to Cryptography: Principles and Applications. New York, NY, USA: Springer, 2007.

[38] W. K. Wong, D. W.-L. Cheung, B. Kao, and N. Mamoulis, “Secure kNN computation on encrypted databases,” in Proc. ACM SIGMOD Int. Conf. Manag. Data, 2009, pp. 139–152.

[39] (2014). Request for comments. [Online]. Available: http://www.rfc-editor.org/index.html

Zhihua Xia received the BE degree from Hunan City University, China, in 2006, and the PhD degree in computer science and technology from Hunan University, China, in 2011. He works as a lecturer in the School of Computer & Software, Nanjing University of Information Science & Technology. His research interests include cloud computing security and digital forensics. He is a member of the IEEE.

Xinhui Wang received the BE degree in software engineering from the Nanjing University of Information Science & Technology, China, in 2012. She is currently working towards the MS degree in computer science and technology at the College of Computer and Software, Nanjing University of Information Science & Technology, China. Her research interests include cloud computing security.


Xingming Sun received the BS degree in mathematics from Hunan Normal University, China, in 1984, the MS degree in computing science from the Dalian University of Science and Technology, China, in 1988, and the PhD degree in computing science from Fudan University, China, in 2001. He is currently a professor in the School of Computer & Software, Nanjing University of Information Science & Technology, China. His research interests include network and information security, digital watermarking, and data security in the cloud. He is a senior member of the IEEE.

Qian Wang received the BS degree from Wuhan University, China, in 2003, the MS degree from the Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, China, in 2006, and the PhD degree from the Illinois Institute of Technology in 2012, all in electrical and computer engineering. He is currently a professor with the School of Computer Science, Wuhan University. His research interests include wireless network security and privacy, cloud computing security, and applied cryptography. Qian is an expert under the “1000 Young Talents Program” of China. He is a co-recipient of the Best Paper Award at IEEE ICNP 2011. He is a member of the IEEE and a member of the ACM.
