
1045-9219 (c) 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2015.2425407, IEEE Transactions on Parallel and Distributed Systems


An Efficient Privacy-Preserving Ranked Keyword Search Method

Chi Chen, Member, IEEE, Xiaojie Zhu, Student Member, IEEE, Peisong Shen, Student Member, IEEE, J. Hu, Member, IEEE, S. Guo, Senior Member, IEEE, Z. Tari, Senior Member, IEEE, and Albert Y. Zomaya, Fellow, IEEE

Abstract—Cloud data owners prefer to outsource documents in an encrypted form for the purpose of privacy preservation. Therefore it is essential to develop efficient and reliable ciphertext search techniques. One challenge is that the relationship between documents is normally concealed in the process of encryption, which leads to significant degradation of search accuracy. Also, the volume of data in data centers has experienced a dramatic growth. This makes it even more challenging to design ciphertext search schemes that can provide efficient and reliable online information retrieval on large volumes of encrypted data. In this paper, a hierarchical clustering method is proposed to support more search semantics and also to meet the demand for fast ciphertext search within a big data environment. The proposed hierarchical approach clusters the documents based on the minimum relevance threshold, and then partitions the resulting clusters into sub-clusters until the constraint on the maximum size of a cluster is reached. In the search phase, this approach can reach a linear computational complexity against an exponential size increase of the document collection. In order to verify the authenticity of search results, a structure called minimum hash sub-tree is designed in this paper. Experiments have been conducted using the collection set built from IEEE Xplore. The results show that with a sharp increase of documents in the dataset the search time of the proposed method increases linearly whereas the search time of the traditional method increases exponentially. Furthermore, the proposed method has an advantage over the traditional method in the rank privacy and relevance of retrieved documents.

Index Terms—Cloud computing, ciphertext search, ranked search, multi-keyword search, hierarchical clustering, big data, security


1 INTRODUCTION

As we step into the big data era, terabytes of data are produced worldwide every day. Enterprises and users who own a large amount of data usually choose to outsource their precious data to a cloud facility in order to reduce data management cost and storage facility spending. As a result, data volume in cloud storage facilities is experiencing a dramatic increase. Although cloud server providers (CSPs) claim that their cloud service is armed with strong security measures, security and privacy are major obstacles preventing the wider acceptance of cloud computing services [1].

• An early version of this paper was presented at the Workshop BigSecurity with IEEE INFOCOM 2014 [28]. Extensive enhancements have been made, which include incorporating a novel verification scheme to help the data user verify the authenticity of the search results, and adding a security analysis as well as more details of the proposed scheme. This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA06010701) and the National High Technology Research and Development Program of China (No. 2013AA01A24).

• Chi Chen is now with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China (e-mail: [email protected]).

• Xiaojie Zhu is now with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China (e-mail: [email protected]).

• Peisong Shen is now with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China (e-mail: [email protected]).

• J. Hu is now with the Cyber Security Lab, School of Engineering and IT, University of New South Wales at the Australian Defence Force Academy, Canberra, ACT 2600, Australia (e-mail: [email protected]).

• Song Guo is with the School of Computer Science and Engineering, The University of Aizu, Japan (e-mail: [email protected]).

• Zahir Tari is with the School of Computer Science, RMIT University, Australia (e-mail: [email protected]).

• Albert Zomaya is with the School of Information Technologies, The University of Sydney, Australia (e-mail: [email protected]).

A traditional way to reduce information leakage is data encryption. However, this makes server-side data utilization, such as searching on encrypted data, a very challenging task. In recent years, researchers have proposed many ciphertext search schemes [35-38][43] by incorporating cryptographic techniques. These methods come with provable security, but they require massive operations and have high time complexity. Therefore, they are not suitable for the big data scenario, where data volume is very large and applications require online data processing. In addition, the relationship between documents is concealed in the above methods. The relationship between documents represents the properties of the documents, and hence maintaining the relationship is vital to fully express a document. For example, the relationship can be used to express its category. If a document is independent of any other documents except those that are related to sports, then it is easy to assert that this document belongs to the sports category. Due to the blind encryption, this important property has been concealed in the traditional methods. Therefore, a method that can maintain and utilize this relationship to speed up the search phase is desirable.

On the other hand, due to software/hardware failure and storage corruption, the search results returned to the users may contain damaged data or may have been distorted by a malicious administrator or intruder. Thus, a verification mechanism should be provided for users to verify the correctness and completeness of the search results.

In this paper, a vector space model is used and every document is represented by a vector, which means every document can be seen as a point in a high-dimensional space. Due to the relationship between different documents, all the documents can be divided into several categories. In other words, the points that lie close to each other in the high-dimensional space can be classified into a specific category. The search time can be largely reduced by selecting the desired category and abandoning the irrelevant categories. Compared with all the documents in the dataset, the number of documents that the user aims at is very small. Because the number of desired documents is small, a specific category can be further divided into several sub-categories. Instead of using the traditional sequence search method, a backtracking algorithm is used to search for the target documents. The cloud server first searches the categories and gets the minimum desired sub-category. Then the cloud server selects the desired k documents from the minimum desired sub-category. The value of k is decided by the user in advance and sent to the cloud server. If the current sub-category cannot supply k documents, the cloud server traces back to its parent and selects the desired documents from its sibling categories. This process is executed recursively until the desired k documents are obtained or the root is reached.

To verify the integrity of the search result, a verifiable structure based on a hash function is constructed. Every document is hashed and the hash result is used to represent the document. The hash results of the documents are hashed again together with the information of the category that these documents belong to, and the result is used to represent the current category. Similarly, every category is represented by the hash result of the combination of the current category information and the sub-category information. A virtual root is constructed to represent all the data and categories. The virtual root is denoted by the hash result of the concatenation of all the categories located in the first level. The virtual root is signed so that it is verifiable. To verify the search result, the user only needs to verify the virtual root, instead of verifying every document.

2 EXISTING SOLUTIONS

In recent years, searchable encryption, which provides a text search function based on encrypted data, has been widely studied, especially in security definition, formalizations and efficiency improvement, e.g. [2-7]. As shown in Fig. 1, the proposed method is compared with existing solutions and has the advantage in maintaining the relationship between documents.

2.1 Single Keyword Searchable Encryption

Song et al. [8] first introduced the notion of searchable encryption. They proposed encrypting each word in the document independently. This method has a high search cost because the whole data collection is scanned word by word. Goh et al. [9] formally defined a secure index structure and formulated a security model for indexes known as semantic security against adaptive chosen keyword attack (ind-cka). They also developed an efficient ind-cka secure index construction called z-idx by using pseudo-random functions and Bloom filters. Cash et al. [42] recently designed and implemented an efficient data structure. Due to the lack of a ranking mechanism, users have to spend a long time selecting what they want when a massive number of documents contain the query keyword. Thus, order-preserving techniques are utilized to realize the ranking mechanism, e.g. [10-12]. Wang et al. [13] use an encrypted inverted index to achieve secure ranked keyword search over the encrypted documents. In the search phase, the cloud server computes the relevance score between the documents and the query. In this way, relevant documents are ranked according to their relevance score and users can get the top-k results. In the public key setting, Boneh et al. [3] designed the first searchable encryption construction, where anyone can use the public key to write to the data stored on the server but only authorized users owning the private key can search. However, all the above-mentioned techniques only support single keyword search.

2.2 Multiple Keyword Searchable Encryption

To enrich search predicates, a variety of conjunctive keyword search methods (e.g. [7, 14-17]) have been proposed. These methods show large overhead, such as communication cost caused by secret sharing, e.g. [15], or computational cost caused by bilinear maps, e.g. [7]. Pang et al. [18] propose a secure search scheme based on the vector space model. Due to the lack of a security analysis for frequency information and of practical search performance results, it is unclear whether their scheme is secure and efficient. Cao et al. [19] present a novel architecture to solve the problem of multi-keyword ranked search over encrypted cloud data. But the search time of this method grows exponentially with the exponentially increasing size of the document collection. Sun et al. [20] give a new architecture which achieves better search efficiency. However, in the index building process, the relevance between documents is ignored. As a result, the relevance of plaintexts is concealed by the encryption, and users' expectations cannot be fulfilled well. For example, given a query containing Mobile and Phone, only the documents containing both keywords will be retrieved by traditional methods. But if the semantic relationship between the documents is taken into consideration, the documents containing Cell and Phone should also be retrieved. Obviously, the second result is better at meeting users' expectations.

2.3 Verifiable Search Based on Authenticated Index

The idea of data verification has been well studied in the area of databases. In a plaintext database scenario, a variety of methods have been produced, e.g. [21-23]. Most of these works are based on the original work by Merkle [24, 25] and refinements by Naor and Nissim [26] for certificate revocation. Merkle hash tree and cryptographic signature techniques are used to construct an authenticated tree structure upon which end users can verify the correctness and completeness of the query results.

Pang et al. [27] apply an authenticated structure based on the Merkle hash tree to text search engines. However, they focus only on verification-specific issues and ignore the search privacy-preserving capabilities that are addressed in this paper.

The hash chain is used to construct a single keyword search result verification scheme by Wang et al. [10]. Sun et al. [20] use a Merkle hash tree and cryptographic signature to create a verifiable MDB-tree. However, their work cannot be directly used in our architecture, which is oriented toward privacy-preserving multiple keyword search. Thus, a proper mechanism that can be used to verify the search results within the big data scenario is essential to both the CSPs and end users.

3 OUR CONTRIBUTION

In this paper, we propose a multi-keyword ranked search over encrypted data based on hierarchical clustering index (MRSE-HCI) to maintain the close relationship between different plain documents over the encrypted domain in order to enhance the search efficiency. In the proposed architecture, the search time grows linearly while the size of the data collection grows exponentially. We derive this idea from the observation that users' retrieval needs usually concentrate on a specific field. So we can speed up the searching process by computing the relevance score only between the query and the documents that belong to the same field as the query. As a result, only documents classified into the field specified by the user's query are evaluated to get their relevance score. Because the irrelevant fields are ignored, the search speed is enhanced.

We investigate the problem of maintaining the close relationship between different plain documents over an encrypted domain and propose a clustering method to solve this problem. According to the proposed clustering method, every document is dynamically classified into a specific cluster which has a constraint on the minimum relevance score between different documents in the dataset. The relevance score is a metric used to evaluate the relationship between different documents. When new documents are added to a cluster, the constraint on the cluster may be broken. If one of the new documents breaks the constraint, a new cluster center is added and the current document is chosen as a temporary cluster center. Then all the documents are reassigned and all the cluster centers are reelected. Therefore, the number of clusters depends on the number of documents in the dataset and the close relationship between different plain documents. In other words, the cluster centers are created dynamically and the number of clusters is decided by the properties of the dataset.

We propose a hierarchical method in order to get a better clustering result within a large data collection. The size of each cluster is controlled as a trade-off between clustering accuracy and query efficiency. According to the proposed method, the number of clusters and the minimum relevance score increase with the level, whereas the maximum size of a cluster decreases. Depending on the required granularity, the maximum size of a cluster is set at each level. Every cluster needs to satisfy the constraints. If there is a cluster whose size exceeds the limit, this cluster is divided into several sub-clusters.

We design a search strategy to improve the rank privacy. In the search phase, the cloud server first computes the relevance score between the query and the cluster centers of the first level and then chooses the nearest cluster. This process is iterated to get the nearest child cluster until the smallest cluster has been found. The cloud server then computes the relevance score between the query and the documents included in the smallest cluster. If the smallest cluster cannot supply the number of desired documents, which is decided by the user in advance, the cloud server traces back to the parent cluster of the smallest cluster and searches the sibling clusters of the smallest cluster. This process is iterated until the number of desired documents is obtained or the root is reached. Due to this special search procedure, the rankings of documents in the search results differ from the rankings derived from traditional sequence search. Therefore, the rank privacy is enhanced.
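The search strategy described above can be sketched in a few lines of Python. This is a minimal plaintext illustration, not the scheme's actual implementation: the `Cluster` class, the `score`, `collect` and `search` names, and the toy vectors are assumptions made for this example, and the real scheme computes the scores over encrypted vectors.

```python
from dataclasses import dataclass, field


def score(a, b):
    """Inner product, used here as the relevance score."""
    return sum(x * y for x, y in zip(a, b))


@dataclass
class Cluster:
    center: list                                  # cluster center vector
    children: list = field(default_factory=list)  # sub-clusters (empty for a leaf)
    docs: dict = field(default_factory=dict)      # doc id -> vector (leaf only)


def collect(node, query, k, found):
    """Visit the nearest child first; fall back to siblings while short of k."""
    if not node.children:
        # Smallest cluster: rank its documents and take what is still needed.
        ranked = sorted(node.docs, key=lambda d: score(query, node.docs[d]),
                        reverse=True)
        found.extend(ranked[: k - len(found)])
        return
    for child in sorted(node.children,
                        key=lambda c: score(query, c.center), reverse=True):
        if len(found) >= k:
            break  # enough documents, no need to visit further siblings
        collect(child, query, k, found)


def search(root, query, k):
    found = []
    collect(root, query, k, found)
    return found


# Hypothetical two-cluster collection over a three-keyword dictionary.
root = Cluster(center=[0, 0, 0], children=[
    Cluster(center=[1, 1, 0], docs={"a1": [1, 1, 0], "a2": [1, 0, 0]}),
    Cluster(center=[0, 0, 1], docs={"b1": [0, 1, 1], "b2": [0, 0, 1]}),
])
print(search(root, [1, 1, 0], 3))
```

With k = 3 the nearest leaf holds only two documents, so the recursion unwinds to the parent and takes the remaining document from the sibling cluster. The result order follows the order in which clusters are visited rather than a global ranking, which is the effect the paragraph above relies on for rank privacy.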

Part of the above work has been presented in [28]. As a further improvement, in this paper we also construct a verifiable tree structure upon the hierarchical clustering method to verify the integrity of the search result. This authenticated tree structure mainly takes advantage of the Merkle hash tree and cryptographic signature. Every document is hashed and the hash result is used as the representative of the document. The smallest cluster is represented by the hash result of the combination of the concatenation of the documents included in the smallest cluster and its own category information. The parent cluster is represented by the hash result of the combination of the concatenation of its children and its own category information. A virtual root is added and represented by the hash result of the concatenation of the categories located in the first level. In addition, the virtual root is signed so that the user can verify the search result by verifying the virtual root.
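A minimal sketch of such a hash tree is given below, under several assumptions that are not taken from the paper: SHA-256 as the hash function, document hashes (rather than raw documents) being concatenated, category information appended after the child hashes, and an HMAC standing in for the cryptographic signature on the virtual root. All function names are hypothetical.

```python
import hashlib
import hmac


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_cluster_hash(category: bytes, documents: list) -> bytes:
    """Hash of a smallest cluster: document hashes concatenated, then category info."""
    return h(b"".join(h(doc) for doc in documents) + category)


def parent_cluster_hash(category: bytes, child_hashes: list) -> bytes:
    """Hash of a parent cluster: child hashes concatenated, then category info."""
    return h(b"".join(child_hashes) + category)


def virtual_root(first_level_hashes: list) -> bytes:
    """Virtual root: hash of the concatenated first-level cluster hashes."""
    return h(b"".join(first_level_hashes))


def sign(key: bytes, root: bytes) -> bytes:
    # Assumption: HMAC is only a stand-in for a real digital signature.
    return hmac.new(key, root, hashlib.sha256).digest()


def verify(key: bytes, signature: bytes, first_level_hashes: list) -> bool:
    """Recompute the virtual root and compare it with the signed value."""
    expected = sign(key, virtual_root(first_level_hashes))
    return hmac.compare_digest(expected, signature)


# Data owner side: build the tree and sign the virtual root.
key = b"demo-key"
sports = leaf_cluster_hash(b"sports", [b"doc1", b"doc2"])
finance = leaf_cluster_hash(b"finance", [b"doc3"])
signature = sign(key, virtual_root([sports, finance]))

# Data user side: recompute the cluster hash from the returned documents.
returned = leaf_cluster_hash(b"sports", [b"doc1", b"doc2"])
print(verify(key, signature, [returned, finance]))
tampered = leaf_cluster_hash(b"sports", [b"doc1", b"docX"])
print(verify(key, signature, [tampered, finance]))
```

The user only needs the returned documents plus the hashes of the clusters that were not returned in order to recompute the virtual root, so a single signature check covers the whole result.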

In short, our contributions can be summarized as follows:

1) We investigate the problem of maintaining the close relationship between different plain documents over an encrypted domain and propose a clustering method to solve this problem.

2) We propose the MRSE-HCI architecture to speed up the server-side searching phase. As the document collection grows exponentially, the search time grows linearly instead of exponentially.

3) We design a search strategy to improve the rank privacy. This search strategy adopts the backtracking algorithm upon the above clustering method. As the data volume grows, the advantage of the proposed method in rank privacy tends to become more apparent.

4) By applying the Merkle hash tree and cryptographic signature to an authenticated tree structure, we provide a verification mechanism to assure the correctness and completeness of search results.

The rest of the paper is organized as follows: Section 4 describes the system model, threat model, design goals and notations. The architecture and detailed algorithm are presented in Section 5. We discuss the efficiency and security of the MRSE-HCI scheme in Section 6. An evaluation method is provided in Section 7. Section 8 demonstrates the results of our experiments. Section 9 concludes the paper.

4 DEFINITION AND BACKGROUND

4.1 System Model

The system model contains three entities, as illustrated in Fig. 1: the data owner, the data user, and the cloud server. The box with dashed lines in the figure indicates the component added to the existing architecture.

Fig. 1 Architecture of ciphertext search

The data owner is responsible for collecting documents, building the document index and outsourcing them in an encrypted format to the cloud server. Apart from that, the data user needs to get authorization from the data owner before accessing the data. The cloud server provides a huge storage space and the computation resources needed by ciphertext search. Upon receiving a legal request from the data user, the cloud server searches the encrypted index and sends back the top-k documents that are most likely to match the user's query [12]. The number k is chosen by the data user. Our system aims at protecting data from leaking information to the cloud server while improving the efficiency of ciphertext search.

In this model, both the data owner and the data user are trusted, while the cloud server is semi-trusted, which is consistent with the architecture in [10, 19, 29]. In other words, the cloud server will strictly follow the prescribed protocol but will try to learn more information about the data and the index.

4.2 Threat Model

The adversary's capabilities can be captured by two threat models.

Known Ciphertext Model
In this model, the cloud server can get the encrypted document collection, the encrypted data index, and the encrypted query keywords.

Known Background Model
In this model, the cloud server knows more information than in the known ciphertext model. Statistical background information about the dataset, such as the document frequency and term frequency information of a specific keyword, can be used by the cloud server to launch a statistical attack to infer or identify specific keywords in the query [10, 11], which further reveals the plaintext content of documents. The adversary's capabilities can be represented by the above two threat models.

4.3 Design Goals

• Search efficiency. The search time complexity of the MRSE-HCI scheme needs to be logarithmic in the size of the data collection in order to deal with the explosive growth of document collections in the big data scenario.




• Retrieval accuracy. Retrieval precision is related to two factors: the relevance between the query and the documents in the result set, and the relevance among the documents in the result set.

• Integrity of the search result. The integrity of the search results includes three aspects:

1) Correctness. All the documents returned from servers were originally uploaded by the data owner and remain unmodified.

2) Completeness. No qualified documents are omitted from the search results.

3) Freshness. The returned documents are the latest versions of the documents in the dataset.

• Privacy requirements. We set a series of privacy requirements on which current researchers mostly focus.

1) Data privacy. Data privacy covers the confidentiality and privacy of documents. The adversary cannot get the plaintext of documents stored on the cloud server if data privacy is guaranteed. Symmetric cryptography is a conventional way to achieve data privacy.

2) Index privacy. Index privacy means the ability to frustrate the adversary's attempts to steal the information stored in the index. Such information includes keywords and the TF (Term Frequency) of keywords in documents, the topic of documents, and so on.

3) Keyword privacy. It is important to protect users' query keywords. A secure query generation algorithm should output trapdoors which leak no information about the query keywords.

4) Trapdoor unlinkability. Trapdoor unlinkability means that each trapdoor generated from a query is different, even for the same query. It can be realized by integrating a random function in the trapdoor generation process. If the adversary can deduce the set of trapdoors that all correspond to the same keyword, he/she can calculate the frequency of this keyword in search requests over a certain period. Combined with the document frequency of the keyword in the known background model, he/she can use a statistical attack to identify the plain keyword behind these trapdoors.

5) Rank privacy. The rank order of search results should be well protected. If the rank order remains unchanged, the adversary can compare the rank order of different search results and further identify the search keyword.

4.4 Notations

In this paper, the notations presented in Table 1 are used.

TABLE 1
Notation

$d_i$: The ith document vector, denoted as $d_i = \{d_{i,1}, \ldots, d_{i,n}\}$, where $d_{i,j}$ represents whether the jth keyword in the dictionary appears in document $d_i$.
$m$: The number of documents in the data collection.
$n$: The size of dictionary $D_W$.
$CCV$: The collection of cluster center vectors, denoted as $CCV = \{c_1, \ldots, c_n\}$, where $c_i$ is the average vector of all document vectors in the cluster.
$CCV_i$: The collection of the ith level cluster center vectors, denoted as $CCV_i = \{v_{i,1}, \ldots, v_{i,n}\}$, where $v_{i,j}$ represents the jth vector in the ith level.
$DC$: The information of documents classification such as the document id list of a certain cluster.
$DV$: The collection of document vectors, denoted as $DV = \{d_1, d_2, \ldots, d_m\}$.
$D_W$: The dictionary, denoted as $D_W = \{w_1, w_2, \ldots, w_n\}$.
$F_w$: The ranked id list of all documents according to their relevance to keyword $w$.
$I_c$: The clustering index which contains the encrypted vectors of cluster centers.
$I_d$: The traditional index which contains encrypted document vectors.
$L_i$: The minimum relevance score between different documents in the ith level of a cluster.
$QV$: The query vector.
$TH$: A fixed maximum number of documents in a cluster.
$T_w$: The encrypted query vector for the user's query.

5 ARCHITECTURE AND ALGORITHM

5.1 System Model

In this section, we introduce the MRSE-HCI scheme. The vector space model adopted by the MRSE-HCI scheme is the same as that of MRSE [19], while the process of building the index is totally different. A hierarchical index structure is introduced into MRSE-HCI instead of a sequence index. In MRSE-HCI, every document is indexed by a vector. Every dimension of the vector stands for a keyword and the value represents whether the keyword appears in the document or not. Similarly, the query is also represented by a vector. In the search phase, the cloud server calculates the relevance score between the query and the documents by computing the inner product of the query vector and the document vectors, and returns the target documents to the user according to the top k relevance scores.
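As a plaintext illustration of this vector space model, the sketch below builds binary keyword vectors, scores them against a query by inner product, and returns the top k document ids. The dictionary, the documents and the function names are hypothetical, and encryption is left out entirely.

```python
import heapq


def to_vector(keywords, dictionary):
    """Binary vector: 1 if the dictionary keyword appears, else 0."""
    return [1 if word in keywords else 0 for word in dictionary]


def relevance(query_vec, doc_vec):
    """Inner product of the query vector and a document vector."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))


def top_k(query_vec, doc_vectors, k):
    """Ids of the k documents with the highest relevance score."""
    return heapq.nlargest(
        k, doc_vectors, key=lambda doc_id: relevance(query_vec, doc_vectors[doc_id])
    )


# Hypothetical dictionary and documents.
dictionary = ["cloud", "search", "privacy", "hash"]
doc_vectors = {
    "d1": to_vector({"search"}, dictionary),
    "d2": to_vector({"privacy"}, dictionary),
    "d3": to_vector({"cloud", "search", "privacy"}, dictionary),
}
query_vec = to_vector({"cloud", "privacy"}, dictionary)
print(top_k(query_vec, doc_vectors, 2))
```

Here `d3` matches both query keywords, `d2` matches one and `d1` matches none, so the two highest-scoring documents are `d3` followed by `d2`.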

Because all the documents outsourced to the cloud server are encrypted, the semantic relationship between plain documents is lost over the encrypted documents. In order to maintain this semantic relationship, a clustering method is used to cluster the documents by clustering their related index vectors. Every document vector is viewed as a point in the n-dimensional space. With the length of the vectors being normalized, the distance between points in the n-dimensional space reflects the relevance of the corresponding documents. In other words, the points of highly relevant documents are very close to each other in the n-dimensional space. As a result, we can cluster the documents based on the distance measure.

Fig. 2 MRSE-HCI architecture

As the volume of data in the data center has experienced a dramatic growth, the conventional sequence search approach will be very inefficient. To improve the search efficiency, a hierarchical clustering method is proposed. The proposed hierarchical approach clusters the documents based on the minimum relevance threshold at different levels, and then partitions the resulting clusters into sub-clusters until the constraint on the maximum size of a cluster is reached. Upon receiving a legal request, the cloud server searches the related indexes layer by layer instead of scanning all indexes.
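The link between distance and relevance for length-normalized vectors, stated earlier in this section, can be made explicit with a standard identity. For two document vectors $a$ and $b$ (symbols introduced here only for illustration) with unit length:

```latex
\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\, a \cdot b = 2 - 2\, a \cdot b
\qquad \text{when } \|a\| = \|b\| = 1 .
```

A larger inner product (higher relevance) therefore corresponds exactly to a smaller Euclidean distance, which is why clustering by distance groups highly relevant documents together.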

5.2 MRSE-HCI Architecture

The MRSE-HCI architecture is depicted in Fig. 2, where the data owner builds the encrypted index depending on the dictionary, random numbers and secret key, the data user submits a query to the cloud server to get the desired documents, and the cloud server returns the target documents to the data user. This architecture mainly consists of the following algorithms.

• Keygen($1^{l(n)}$) → (sk, k): It is used to generate the secret keys to encrypt the index and the documents.

• Index(D, sk) → I: The encrypted index is generated in this phase by using the above-mentioned secret key. The clustering process is also included in this phase.

• Enc(D, k) → E: The document collection is encrypted by a symmetric encryption algorithm which achieves semantic security.

• Trapdoor(w, sk) → $T_w$: It generates the encrypted query vector $T_w$ from the user's input keywords and the secret key.

• Search($T_w$, I, $k_{top}$) → ($I_w$, $E_w$): In this phase, the cloud server compares the trapdoor with the index to get the top-k retrieval results.

• Dec($E_w$, k) → $F_w$: The returned encrypted documents are decrypted by the key generated in the first step.

The concrete functions of the different components are described below.

1) Keygen($1^{l(n)}$): The data owner randomly generates an $(n+u+1)$-bit vector $S$, where every element is either 1 or 0, and two invertible $(n+u+1) \times (n+u+1)$ matrices $M_1$ and $M_2$ whose elements are random integers, as the secret key sk. The secret key k is generated by the data owner by choosing an n-bit pseudo-random sequence.

2) Index(D, sk): As shown in Fig. 3, the data owner uses a tokenizer and parser to analyze every document and gets all keywords. Then the data owner uses the dictionary $D_W$ to transform the documents into a collection of document vectors $DV$. Then the data owner calculates $DC$ and $CCV$ by using a quality hierarchical clustering (QHC) method, which is illustrated in Section 5.4. After that, the data owner applies the dimension-expanding and vector-splitting procedures to every document vector. It is worth noting that $CCV$ is treated in the same way as $DV$. For dimension-expanding, every vector in $DV$ is extended to $(n+u+1)$ dimensions, where the value in the $(n+j)$th dimension ($0 \le j \le u$) is a randomly generated integer and the last dimension is set to 1. For vector-splitting, every extended document vector $V$ is split into two $(n+u+1)$-dimensional vectors, $V'$ and $V''$, with the help of the $(n+u+1)$-bit vector $S$ as a splitting indicator. If the ith element of $S$ ($S_i$) is 0, then we set $V''_i = V'_i = V_i$; if the ith element of $S$ ($S_i$) is 1, then $V''_i$ is set to a random number and $V'_i = V_i - V''_i$. Finally, the traditional index $I_d$ is encrypted as $I_d = \{M_1^T V', M_2^T V''\}$ by using matrix multiplication with sk, and $I_c$ is generated in a similar way. After this, $I_d$, $I_c$, and $DC$ are outsourced to the cloud server.

3) Enc(D, k): The data owner adopts a secure symmetric encryption algorithm (e.g. AES) to encrypt the plain document set D and outsources it to the cloud server.

Fig. 3 Algorithm Index

Fig. 4 Algorithm Dynamic k-means

4) Trapdoor(w, sk): The data user sends the query to the data owner, who analyzes the keywords of the query with the help of the dictionary $D_W$ and builds the query vector $QV$. $QV$ is then extended to an $(n+u+1)$-dimensional query vector. Subsequently, $v$ random positions chosen from the range $(n, n+u]$ in $QV$ are set to 1, and the others are set to 0. The value at the last dimension of $QV$ is set to a random number $t \in [0, 1]$. Then the first $(n+u)$ dimensions of $Q_w$, denoted as $q_w$, are scaled by a random number $r$ ($r \neq 0$), giving $Q_w = (r \cdot q_w, t)$. After that, $Q_w$ is split into two random vectors $\{Q'_w, Q''_w\}$ with a vector-splitting procedure similar to that in the Index(D, sk) phase. The difference is that if the ith bit of $S$ is 1, then we have $q'_i = q''_i = q_i$; if the ith bit of $S$ is 0, $q'_i$ is set to a random number and $q''_i = q_i - q'_i$. Finally, the encrypted query vector $T_w$ is generated as $T_w = \{M_1^{-1} Q'_w, M_2^{-1} Q''_w\}$ and sent back to the data user.

5) Search($T_w$, I, $k_{top}$): Upon receiving $T_w$ from the data user, the cloud server computes the relevance score between $T_w$ and the index $I_c$ and then chooses the matched cluster, which has the highest relevance score. For every document contained in the matched cluster, the cloud server extracts its corresponding encrypted document vector in $I_d$ and calculates its relevance score $S$ with $T_w$, as described in Equation 1. Finally, the scores of the documents in the matched cluster are sorted and the top $k_{top}$ documents are returned by the cloud server. The details will be discussed in Section D.

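$$
\begin{aligned}
S &= T_w \cdot I_c \\
  &= \{M_1^{-1} Q'_w,\ M_2^{-1} Q''_w\} \cdot \{M_1^{T} V',\ M_2^{T} V''\} \\
  &= Q'_w \cdot V' + Q''_w \cdot V'' \\
  &= Q_w \cdot V
\end{aligned}
\tag{1}
$$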

6) Dec(Ew, k):The data user utilizes the secret keyk to decrypt the returned ciphertext Ew.
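The identity in Equation 1 can be checked numerically. The following is a minimal sketch, not the authors' implementation: it uses tiny integer matrices and exact fractions instead of real-valued keys, the helper names (`keygen`, `split`, `encrypt_index`, `trapdoor`, `score`) are hypothetical, and the index-side splitting rule (split where the bit of S is 1) is an assumption chosen as the complement of the query-side rule stated in step 4.

```python
import random
from fractions import Fraction


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mat_vec(m, v):
    return [dot(row, v) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def invert(m):
    """Gauss-Jordan inversion over exact fractions; raises ValueError if singular."""
    n = len(m)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def keygen(dim, rng):
    """Return (S, M1, M2, M1_inv, M2_inv) built from small random integer matrices."""
    def random_invertible():
        while True:
            m = [[rng.randint(-5, 5) for _ in range(dim)] for _ in range(dim)]
            try:
                return m, invert(m)
            except ValueError:
                continue  # singular draw, try again
    s = [rng.randint(0, 1) for _ in range(dim)]
    m1, m1_inv = random_invertible()
    m2, m2_inv = random_invertible()
    return s, m1, m2, m1_inv, m2_inv


def split(vec, s, split_bit, rng):
    """Split vec into two shares: random additive shares where s[i] == split_bit, copies elsewhere."""
    first, second = [], []
    for x, bit in zip(vec, s):
        if bit == split_bit:
            r = Fraction(rng.randint(-9, 9))
            first.append(r)
            second.append(x - r)
        else:
            first.append(Fraction(x))
            second.append(Fraction(x))
    return first, second


def encrypt_index(v, key, rng):
    s, m1, m2, _, _ = key
    v1, v2 = split(v, s, 1, rng)  # assumed index rule: split where S[i] == 1
    return mat_vec(transpose(m1), v1), mat_vec(transpose(m2), v2)


def trapdoor(q, key, rng):
    s, _, _, m1_inv, m2_inv = key
    q1, q2 = split(q, s, 0, rng)  # query rule from step 4: split where S[i] == 0
    return mat_vec(m1_inv, q1), mat_vec(m2_inv, q2)


def score(enc_index, enc_query):
    """Relevance score computed on encrypted vectors only, as in Equation 1."""
    return dot(enc_index[0], enc_query[0]) + dot(enc_index[1], enc_query[1])


if __name__ == "__main__":
    rng = random.Random(7)
    key = keygen(5, rng)
    v, q = [1, 0, 1, 1, 0], [3, 0, 3, 0, 2]
    print(score(encrypt_index(v, key, rng), trapdoor(q, key, rng)) == dot(v, q))
```

Because the two splitting rules are complementary, every coordinate contributes either $q_i (v'_i + v''_i)$ or $v_i (q'_i + q''_i)$, so the server-side score equals the plaintext inner product without the server seeing either vector.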

5.3 Relevance Measure

In this paper, the concept of coordinate matching [30] is adopted as the relevance measure. It is used to quantify the relevance between a document and a query, between two documents, and between a query and a cluster center. Equation 2 defines the relevance score between document $d_i$ and query $q_w$. Equation 3 defines the relevance score between query $q_w$ and cluster center $lc_{i,j}$. Equation 4 defines the relevance score between documents $d_i$ and $d_j$.

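$$
S_{qd_i} = \sum_{t=1}^{n+u+1} (q_{w,t} \times d_{i,t}) \tag{2}
$$

$$
S_{qc_i} = \sum_{t=1}^{n+u+1} (q_{w,t} \times lc_{i,j,t}) \tag{3}
$$

$$
S_{dd_i} = \sum_{t=1}^{n+u+1} (d_{i,t} \times d_{j,t}) \tag{4}
$$

Fig. 5 Algorithm Quality Hierarchical Clustering (QHC)

Fig. 6 Clustering Process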
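Equations 2 to 4 are the same inner product applied to different pairs of vectors. A minimal sketch, assuming plain Python lists as vectors; the function name `relevance` and the sample vectors are illustrative only:

```python
def relevance(a, b):
    """Coordinate-matching score: inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


query = [1, 0, 1, 0]
doc_i = [1, 1, 1, 0]
doc_j = [0, 1, 1, 1]
center = [0.5, 1.0, 1.0, 0.5]  # hypothetical cluster center (mean of doc_i and doc_j)

s_qd = relevance(query, doc_i)   # Equation 2: query vs. document
s_qc = relevance(query, center)  # Equation 3: query vs. cluster center
s_dd = relevance(doc_i, doc_j)   # Equation 4: document vs. document
```

For binary vectors the score simply counts the keywords that both vectors contain.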

5.4 Quality Hierarchical Clustering Algorithm

Many hierarchical clustering methods have been proposed so far. However, none of them is comparable to partition clustering methods in terms of time complexity. K-means [31] and K-medoids [32] are popular partition clustering algorithms, but k is fixed in both of them, so they cannot be applied when the number of cluster centers changes dynamically. We therefore propose a quality hierarchical clustering (QHC) algorithm based on a novel dynamic K-means.

In the proposed dynamic K-means algorithm, shown in Fig. 4, a minimum relevance threshold is defined for the clusters to keep each cluster compact and dense. If the relevance score between a document and its center is smaller than the threshold, a new cluster center is added and all the documents are reassigned. This procedure is iterated until k is stable. In contrast to the traditional clustering method, k changes dynamically during the clustering process, which is why the algorithm is called dynamic K-means.
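The loop described above can be sketched as follows. This is an illustrative reading of the paragraph, not the algorithm of Fig. 4: the choice of the worst-scoring document as the new center, the fixed number of refinement passes (`lloyd_iters`), and the cap of k at the number of documents are all assumptions added to keep the sketch short and terminating.

```python
def relevance(a, b):
    return sum(x * y for x, y in zip(a, b))


def assign(docs, centers):
    """Label each document with the index of its most relevant center."""
    return [max(range(len(centers)), key=lambda c: relevance(d, centers[c]))
            for d in docs]


def update(docs, labels, centers):
    """Recompute each center as the mean of its members; keep empty clusters as-is."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [d for d, label in zip(docs, labels) if label == c]
        if members:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centers.append(old)
    return new_centers


def dynamic_kmeans(docs, min_relevance, k=1, lloyd_iters=10):
    centers = [list(d) for d in docs[:k]]
    while True:
        for _ in range(lloyd_iters):
            labels = assign(docs, centers)
            centers = update(docs, labels, centers)
        labels = assign(docs, centers)
        scores = [relevance(d, centers[label]) for d, label in zip(docs, labels)]
        worst = min(range(len(docs)), key=scores.__getitem__)
        # k is stable once every document meets the threshold (or k cannot grow).
        if scores[worst] >= min_relevance or len(centers) >= len(docs):
            return centers, labels
        centers.append(list(docs[worst]))  # add a center and reassign everything


docs = [[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3
centers, labels = dynamic_kmeans(docs, min_relevance=1.5)
```

In this toy input a single center scores 1.0 against every document, which is below the threshold of 1.5, so a second center is added and the two keyword groups separate.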

The QHC algorithm is illustrated in Fig. 5 and proceeds as follows. Every cluster is checked to see whether its size exceeds the maximum number TH. If it does, this "big" cluster is split into child clusters, which are formed by running dynamic K-means on the documents of this cluster. This procedure is iterated until all clusters meet the maximum cluster size requirement. The clustering procedure is illustrated in Fig. 6. All the documents are denoted as points in a coordinate system. These points are initially partitioned into two clusters by the dynamic K-means algorithm with k = 2. These two bigger clusters are depicted by ellipses. Then the two clusters are checked to see whether their points satisfy the distance constraint. The second cluster does not meet this requirement, so a new cluster center is added, giving k = 3, and the dynamic K-means algorithm runs again to partition the second cluster into two parts. Then the data owner checks whether the size of any of these clusters exceeds the maximum number TH. Cluster 1 is split into two sub-clusters again because of its size. Finally, all points are clustered into 4 clusters, depicted by rectangles.

Fig. 7 Retrieval Process

Fig. 8 Algorithm Building-minimum hash sub-tree
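The size-driven recursion of QHC can be sketched independently of the splitter. In the sketch below, `split_fn` stands for whatever produces child clusters (dynamic K-means in the paper); the `halve` function is only a placeholder so that the example runs on its own, and the dictionary-based tree layout is an assumption.

```python
def qhc(doc_ids, split_fn, max_size):
    """Return a nested tree {'docs': [...], 'children': [...]} whose leaves hold at most max_size docs."""
    node = {"docs": list(doc_ids), "children": []}
    if len(doc_ids) > max_size:
        parts = [p for p in split_fn(doc_ids) if p]
        if len(parts) > 1:  # guard against a splitter that cannot divide further
            node["children"] = [qhc(p, split_fn, max_size) for p in parts]
    return node


def leaves(node):
    """Collect the document lists of all leaf clusters, left to right."""
    if not node["children"]:
        return [node["docs"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]


def halve(ids):
    # Placeholder splitter for the demo; the paper uses dynamic K-means here.
    mid = len(ids) // 2
    return [ids[:mid], ids[mid:]]


tree = qhc(list(range(10)), halve, max_size=3)
leaf_clusters = leaves(tree)
```

Every internal node of the resulting tree corresponds to a cluster that was too large and had to be split, which is what gives the index its hierarchy.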

5.5 Search Algorithm

The cloud server needs to find the cluster that best matches the query. With the help of the cluster index $I_c$ and the document classification DC, the cloud server uses an iterative procedure to find the best-matched cluster. The following steps show how the matched cluster is found:

1) The cloud server computes the relevance score between the query $T_w$ and the encrypted vectors of the first-level cluster centers in the cluster index $I_c$, then chooses the ith cluster center $I_{c,1,i}$, which has the highest score.

2) The cloud server gets the child cluster centers of that cluster center, computes the relevance score between $T_w$ and every encrypted vector of the child cluster centers, and finally gets the cluster center $I_{c,2,i}$ with the highest score. This procedure is iterated until the ultimate cluster center $I_{c,l,j}$ in the last level l is reached.

In the situation depicted in Fig. 7, there are 9 documents grouped into 3 clusters. After the relevance scores with trapdoor $T_w$ are calculated, cluster 1, shown within the dashed box in Fig. 7, is found to be the best match. Documents $d_1$, $d_3$, and $d_9$ belong to cluster 1, so their encrypted document vectors in $I_d$ are extracted to compute the relevance score with $T_w$.
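The level-by-level descent can be sketched as below. For readability the sketch scores plaintext vectors with a plain inner product; in the scheme the server would instead score encrypted vectors against $T_w$ as in Equation 1. The tree layout, the sample centers, and the function name `search` are assumptions made for illustration.

```python
def relevance(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(first_level, query, k_top):
    """Descend to the best-matching leaf cluster, then rank only its documents."""
    level = first_level
    while True:
        best = max(level, key=lambda node: relevance(node["center"], query))
        if not best.get("children"):
            break
        level = best["children"]
    ranked = sorted(best["docs"].items(),
                    key=lambda item: relevance(item[1], query),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k_top]]


tree = [
    {"center": [0.7, 0.3, 0.0, 0.3], "children": [
        {"center": [1.0, 0.0, 0.0, 0.5],
         "docs": {"d1": [1, 0, 0, 0], "d3": [1, 0, 0, 1]}},
        {"center": [0.0, 1.0, 0.0, 0.0],
         "docs": {"d9": [0, 1, 0, 0]}},
    ]},
    {"center": [0.0, 0.0, 1.0, 0.2],
     "docs": {"d2": [0, 0, 1, 1]}},
]

top = search(tree, [1, 0, 0, 1], k_top=2)
```

Only the centers on the chosen path and the documents of one leaf cluster are scored, which is the source of the efficiency gain analyzed in Section 6.1.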

5.6 Search Result Verification

The retrieved data may well be wrong, since the network is unstable and the data may be damaged by hardware or software failures, a malicious administrator, or an intruder. Verifying the authenticity of search results is emerging as a critical issue in the cloud environment. We therefore designed a signed hash tree to verify the correctness and freshness of the search results.

• Building. The data owner builds the hash tree based on the hierarchical index structure. The algorithm shown in Fig. 8 is described as follows. The hash value of a leaf node of the tree is $h(id \| version \| \Phi(id))$, where id is the document id, version is the document version, and $\Phi(id)$ is the document contents. The value of a non-leaf node is a pair $(id, h(id \| h_{child}))$, where id denotes the value of the cluster center or document vector in the encrypted index, and $h_{child}$ is the hash value of its child node. The hash value of the tree root node is based on the hash values of all clusters in the first level. It is worth noting that the root node denotes the data set, which contains all clusters. The data owner then generates the signature of the hash value of the root node and outsources the hash tree, including the root signature, to the cloud server. A cryptographic signature σ (e.g., an RSA or DSA signature) can be used here to authenticate the hash value of the root node.

• Processing. By the algorithm shown in Fig. 9, the cloud server returns the root signature and the minimum hash sub-tree (MHST) to the client. The minimum hash sub-tree includes the hash values of the leaf nodes in the matched cluster and the non-leaf nodes corresponding to all cluster centers used to find the matched cluster in the searching phase. For example, in Fig. 10, the search result is documents D, E, and F. Then the leaf nodes are D, E, F, and G, and the non-leaf nodes include $C_1$, $C_2$, $C_3$, $C_4$, $d_D$, $d_E$, $d_F$, and $d_G$. In addition, the root is included among the non-leaf nodes.

• Verifying. The data owner uses the minimum hash sub-tree to re-compute the hash values of the nodes, in particular the root node, which can be further verified by the root signature. If all nodes match, correctness and freshness are guaranteed. Then the data owner re-searches the index constructed from the retrieved values in the MHST. If the search result is the same as the retrieved result, completeness, correctness, and freshness are all guaranteed.

Fig. 9 Algorithm Processing-minimum hash sub-tree

As shown in Fig. 10, in the building phase, all documents are clustered into 2 big clusters and 4 small clusters, and each big cluster contains 2 small clusters. The hash value of leaf node A is $h(id_A \| version \| \Phi(id_A))$, the value of the non-leaf node $C_3$ is $(id_{C_3}, h(id_{C_3} \| h_A \| h_B \| h_C))$, and the value of non-leaf node $C_1$ is $(id_{C_1}, h(id_{C_1} \| h_{C_3} \| h_{C_4}))$. The other values of leaf nodes and non-leaf nodes are generated similarly. In order to combine all first-level clusters into a tree, a virtual root node is created by the data owner with a hash value $h(h_{C_{1,2}} \| h_{C_{2,2}})$, where $C_{1,2}$ and $C_{2,2}$ denote the second part of cluster centers 1 and 2 respectively. Then the data owner signs the root node, e.g., $\sigma(h(h_{C_{1,2}} \| h_{C_{2,2}})) = (h_{C_{1,2}} \| h_{C_{2,2}}, e(h(h_{C_{1,2}} \| h_{C_{2,2}}))^k, g)$, and outsources it to the cloud server.

Fig. 10 Authentication for hierarchical clustering index

In the processing phase, suppose that cluster $C_4$ is the matched cluster and the returned top-3 documents are D, E, and F. Then the minimum hash sub-tree includes the hash values of nodes D, E, F, $d_D$, $d_E$, $d_F$, $d_G$, $C_3$, $C_2$, $C_1$, $C_4$ and the signed root $\sigma(h(h_{C_{1,2}} \| h_{C_{2,2}}))$.

In the verifying phase, upon receiving the signed root, the data user first checks whether

$$
e(h(h_{C_{1,2}} \| h_{C_{2,2}}), g)^k \stackrel{?}{=} e(sig_k\, h(h(h_{C_{1,2}} \| h_{C_{2,2}})), g)
$$

holds. If it does not, the retrieved hash tree is not authentic; otherwise the returned nodes D, E, F, $d_D$, $d_E$, $d_F$, $d_G$, $C_3$, $C_2$, $C_1$, $C_4$ work together to verify each other and reconstruct the hash tree. If all the nodes are authentic, the returned hash tree is authentic. Then the data user re-computes the hash values of the leaf nodes D, E, and F by using the returned documents. These newly generated hash values are compared with the corresponding returned hash values. If there is no difference, the retrieved documents are correct. Finally, the data user uses the trapdoor to re-search the index constructed from the first part of the retrieved nodes. If the search result is the same as the retrieved result, the search result is complete.
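The hash-recomputation part of this example can be sketched with the Python standard library. Several simplifications are assumptions of the sketch and not part of the scheme: SHA-256 with length-prefixed inputs stands in for $h(\cdot \| \cdot)$, string labels stand in for the encrypted cluster-center values used as node ids, the $C_2$ subtree is replaced by a placeholder hash, and an HMAC stands in for the root signature (an HMAC is a symmetric MAC, so unlike the RSA, DSA, or pairing-based signatures mentioned in the text, the verifier must hold the signing key).

```python
import hashlib
import hmac


def h(*parts):
    """SHA-256 over length-prefixed parts, standing in for h(a || b || ...)."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(len(part).to_bytes(4, "big"))
        digest.update(part)
    return digest.digest()


def leaf_hash(doc_id, version, content):
    return h(doc_id.encode(), str(version).encode(), content)


def cluster_hash(cluster_id, child_hashes):
    return h(cluster_id.encode(), *child_hashes)


def sign(key, root):
    # Stand-in only: symmetric MAC instead of a public-key signature.
    return hmac.new(key, root, hashlib.sha256).digest()


def recompute_root(returned_docs, hash_g, hash_c3, hash_c2):
    """Rebuild the path (D, E, F, G) -> C4 -> C1 -> root from an MHST-like response."""
    leaf_hashes = [leaf_hash(i, v, c) for i, v, c in returned_docs] + [hash_g]
    h_c4 = cluster_hash("C4", leaf_hashes)
    h_c1 = cluster_hash("C1", [hash_c3, h_c4])
    return h(h_c1, hash_c2)


def verify(key, signature, returned_docs, hash_g, hash_c3, hash_c2):
    root = recompute_root(returned_docs, hash_g, hash_c3, hash_c2)
    return hmac.compare_digest(sign(key, root), signature)


# Building phase (data owner side), following the layout of the example above.
docs = {name: (name, 1, ("contents of " + name).encode()) for name in "ABCDEFG"}
leaf = {name: leaf_hash(*docs[name]) for name in docs}
h_c3 = cluster_hash("C3", [leaf[x] for x in "ABC"])
h_c4 = cluster_hash("C4", [leaf[x] for x in "DEFG"])
h_c1 = cluster_hash("C1", [h_c3, h_c4])
h_c2 = h(b"placeholder for the C2 subtree")  # hypothetical, subtree omitted
owner_key = b"demo-key"
root_signature = sign(owner_key, h(h_c1, h_c2))

# Processing phase: the server returns D, E, F plus the sibling hashes.
result = [docs[x] for x in "DEF"]
ok = verify(owner_key, root_signature, result, leaf["G"], h_c3, h_c2)
```

Changing the contents or version of any returned document changes its leaf hash and therefore the recomputed root, so the check against the signed root fails.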

5.7 Dynamic Data Collection

As the documents stored at the server may be deleted or modified and new documents may be added to the original data collection, a mechanism that supports a dynamic data collection is necessary. A naive way to address these problems is to download all documents and the index locally and then update the data collection and the index. However, this method incurs a huge cost in bandwidth and local storage space.

To avoid updating the index frequently, we provide a practical strategy to deal with insertion, deletion, and modification operations. Without loss of generality, we use the following examples to illustrate how the strategy works. The data owner reserves many empty entries in the dictionary for new documents. If a new document contains new keywords, the data owner first adds these new keywords to the dictionary and then constructs a document vector based on the new dictionary. The data owner sends the trapdoor generated from the document vector, the encrypted document, and the encrypted document vector to the cloud server. The cloud server finds the closest cluster and puts the encrypted document and the encrypted document vector into it.

As every cluster has a constraint on its maximum size, it is possible that the number of documents in a cluster exceeds the limit after an insertion. In this case, all the encrypted document vectors belonging to the broken cluster are returned to the data owner. After decrypting the retrieved document vectors, the data owner rebuilds the sub-index based on the deciphered document vectors. The sub-index is re-encrypted and re-outsourced to the cloud server.
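The insertion path and the overflow check can be sketched as follows. The sketch works on plaintext vectors for readability, whereas in the scheme the server locates the closest cluster by scoring the trapdoor against encrypted centers; the data layout and the name `insert_document` are assumptions.

```python
def relevance(a, b):
    return sum(x * y for x, y in zip(a, b))


def insert_document(clusters, doc_id, vector, max_size):
    """Put the document into its closest cluster; report whether that cluster now overflows.

    clusters: list of {'center': [...], 'docs': {doc_id: vector}}.
    Returns (cluster_index, overflow). When overflow is True, the data owner
    would fetch that cluster's vectors, rebuild the sub-index, and re-outsource it.
    """
    best = max(range(len(clusters)),
               key=lambda i: relevance(clusters[i]["center"], vector))
    clusters[best]["docs"][doc_id] = vector
    return best, len(clusters[best]["docs"]) > max_size


clusters = [
    {"center": [1.0, 0.0], "docs": {"a": [1, 0], "b": [1, 0]}},
    {"center": [0.0, 1.0], "docs": {"c": [0, 1]}},
]
first = insert_document(clusters, "d", [1, 0], max_size=2)
second = insert_document(clusters, "e", [0, 1], max_size=2)
```

Only the overflowing cluster is sent back for re-clustering, so the rest of the index stays untouched.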

Upon receiving a deletion order, the cloud server searches for the target document. Then the cloud server deletes the document and the corresponding document vector.

Modifying a document can be described as deleting the old version of the document and inserting the new version. The operation of modifying documents can therefore be realized by combining the insertion and deletion operations.

To deal with the impact of these operations on the hash tree, a lazy update strategy is designed. For the insertion operation, the corresponding hash value is calculated and marked as a raw node, while the original nodes in the hash tree are kept unchanged, because the original hash tree still supports verification of every document except the new one. Only when the newly added document is accessed is the hash tree updated. A similar concept is used in the deletion operation. The only difference is that the deletion operation does not trigger a hash tree update.

6 EFFICIENCY AND SECURITY

6.1 Search Efficiency

The search process can be divided into the Trapdoor(w, sk) phase and the Search($T_w$, I, $k_{top}$) phase. The number of operations needed in the Trapdoor(w, sk) phase is given in Equation 5, where n is the number of keywords in the dictionary and w is the number of query keywords.

$$
O(\text{MRSE-HCI}) = 5n + u - v - w + 5 \tag{5}
$$

Because the time complexity of the Trapdoor(w, sk) phase is independent of DC, it can be described as O(1) when DC increases exponentially.




The difference between the search processes of MRSE-HCI and MRSE is the retrieval algorithm used in this phase. In the Search($T_w$, I, $k_{top}$) phase of MRSE, the cloud server needs to compute the relevance score between the encrypted query vector $T_w$ and all encrypted document vectors in $I_d$, and get the top-k ranked document list $F_w$. The number of operations needed in the Search($T_w$, I, $k_{top}$) phase is given in Equation 6, where m represents the number of documents in DC and n represents the number of keywords in the dictionary.

$$
O(\text{MRSE}) = 2m \cdot (2n + 2u + 1) + m - 1 \tag{6}
$$

However, in the Search($T_w$, I, $k_{top}$) phase of MRSE-HCI, the cloud server uses the information DC to quickly locate the matched cluster and only compares $T_w$ to a limited number of encrypted document vectors in $I_d$. The number of operations needed in the Search($T_w$, I, $k_{top}$) phase is given in Equation 7, where $k_i$ represents the number of cluster centers that need to be compared in the ith level, and c represents the number of document vectors in the matched cluster.

$$
O(\text{MRSE-HCI}) = \Big(\sum_{i=1}^{l} k_i\Big) \cdot 2 \cdot (2n + 2u + 1) + c \cdot \big(2 \cdot (2n + 2u + 1)\big) + c - 1 \tag{7}
$$

When DC increases exponentially, m can be set to $2^l$. The time complexity of the traditional MRSE is $O(2^l)$, while the time complexity of the proposed MRSE-HCI is only $O(l)$.

The total search time can be calculated as given in Equation 8 below, where O(trapdoor) is O(1) and O(query) depends on DC.

$$
O(searchTime) = O(trapdoor) + O(query) \tag{8}
$$

In short, when the number of documents in DC grows exponentially, the search time of MRSE-HCI increases linearly while that of the traditional methods increases exponentially.
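The operation counts of Equations 5 to 7 can be compared directly. In the sketch below the function names are illustrative, and the demo values for u, the number of centers compared per level, and the matched-cluster size c are hypothetical; only n = 22157 is taken from the experimental setup in Section 8.

```python
def trapdoor_ops(n, u, v, w):
    """Equation 5: operations in the Trapdoor phase (independent of the collection size)."""
    return 5 * n + u - v - w + 5


def mrse_search_ops(m, n, u):
    """Equation 6: MRSE scores all m encrypted document vectors."""
    return 2 * m * (2 * n + 2 * u + 1) + m - 1


def mrse_hci_search_ops(centers_per_level, c, n, u):
    """Equation 7: MRSE-HCI scores the centers on the search path plus c documents."""
    per_vector = 2 * (2 * n + 2 * u + 1)
    return sum(centers_per_level) * per_vector + c * per_vector + c - 1


if __name__ == "__main__":
    n, u = 22157, 200  # u is a hypothetical number of dummy keywords
    for levels in (10, 12, 14):
        m = 2 ** levels  # collection size grows exponentially with the tree depth
        flat = mrse_search_ops(m, n, u)
        tree = mrse_hci_search_ops([2] * levels, 2, n, u)  # 2 centers per level, c = 2
        print(levels, m, flat, tree)
```

With m set to $2^l$, the first count doubles whenever l grows by one, while the second grows by a fixed amount per extra level, which is the linear versus exponential behavior stated above.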

6.2 Security Analysis

To express the security analysis briefly, we adopt some concepts from [38-40] and define what kinds of information will be leaked to the honest-but-curious server.

The basic information of documents and queries is inevitably leaked to the honest-but-curious server, since all the data are stored at the server and the queries are submitted to the server. Moreover, as in previous searchable encryption schemes [19], [39-41], the access pattern and search pattern cannot be preserved in MRSE-HCI.

Definition 1 (Size Pattern) Let D be a document collection. The size pattern induced by a q-query is a tuple $a(D, Q) = (m, |Q_1|, \cdots, |Q_q|)$, where m is the number of documents and $|Q_i|$ is the size of query $Q_i$.

Definition 2 (Access Pattern) Let D be a document collection and I be an index over D. The access pattern induced by a q-query is a tuple $b(D, Q) = (I(Q_1), \cdots, I(Q_q))$, where $I(Q_i)$ is the set of identifiers returned by query $Q_i$, for $1 \le i \le q$.

Definition 3 (Search Pattern) Let D be a document collection. The search pattern induced by a q-query is an $m \times q$ binary matrix $c(D, Q)$ such that, for $1 \le i \le m$ and $1 \le j \le q$, the element in the ith row and jth column is 1 if the document identifier $id_i$ is returned by query $Q_j$, and 0 otherwise.

Definition 4 (Known ciphertext model security) Let Π = (Keygen, Index, Enc, Trapdoor, Search, Dec) be an index-based MRSE-HCI scheme over dictionary $D_w$, and let $n \in N$ be the security parameter. The known ciphertext model security experiment $PrivK^{kcm}_{A,\Pi}(n)$ is described as follows.

1) The adversary submits two document collections $D_0$ and $D_1$ of the same length to a challenger.

2) The challenger generates a secret key {sk, k} by running $Keygen(1^{l(n)})$.

3) The challenger randomly chooses a bit $b \in \{0, 1\}$, and returns $Index(D_b, sk_b) \to I_b$ and $Enc(D_b, k_b) \to E_b$ to the adversary.

4) The adversary outputs a bit b′.

5) The output of the experiment is defined to be 1 if b′ = b, and 0 otherwise.

We say the MRSE-HCI scheme is secure under the known ciphertext model if for all probabilistic polynomial-time adversaries A there exists a negligible function negl(n) such that

$$
\Pr(PrivK^{kcm}_{A,\Pi}(n) = 1) \le \frac{1}{2} + negl(n) \tag{9}
$$

Proof. The adversary A distinguishes the document collections by analyzing the secret key, the index, and the encrypted document collection. We then have Equation 10, where $Adv(A_D(\{sk, k\}))$ is the advantage of adversary A in distinguishing the secret key from two random matrices and two random strings, $Adv(A_D(I))$ is the advantage in distinguishing the index from a random string, and $Adv(A_D(E))$ is the advantage in distinguishing the encrypted documents from random strings.

$$
\Pr(PrivK^{kcm}_{A,\Pi}(n) = 1) = \frac{1}{2} + Adv(A_D(sk, k)) + Adv(A_D(I)) + Adv(A_D(E)) \tag{10}
$$

The elements of the two matrices in the secret key are randomly chosen from $\{0, 1\}^{l(n)}$, and the split indicator S and key k are also chosen uniformly at random from $\{0, 1\}^{l(n)}$. Given $\{0, 1\}^{l(n)}$, A distinguishes the secret key from two random matrices and two random strings with negligible probability. Then there exists a negligible function $negl_1(n)$ such that




$$
Adv(A_D(sk, k)) = |\Pr(Keygen(1^{l(n)}) \to (sk, k)) - \Pr(Random \to (sk_r, k_r))| \le negl_1(n) \tag{11}
$$

where $sk_r$ denotes two random matrices and a random string, and $k_r$ is a random string. In our scheme, encrypting the hierarchical index essentially means encrypting all the document vectors and cluster center vectors. All the cluster center vectors are treated as document vectors in the encryption phase. Eventually, all the document vectors and cluster center vectors are encrypted by the secure kNN scheme. As secure kNN is secure against known-plaintext attacks (KPA) [33], the hierarchical index is secure under the known ciphertext model. Then there exists a negligible function $negl_2(n)$ satisfying

$$
Adv(A_D(I)) = |\Pr(Index(D, sk) \to (I)) - \Pr(Random \to (I_r))| \le negl_2(n) \tag{12}
$$

where $I_r$ is a random string. Since the encryption algorithm used to encrypt $D_b$ is semantically secure, the encrypted documents are secure under the known ciphertext model. Then there exists a negligible function $negl_3(n)$ such that

$$
Adv(A_D(E)) = |\Pr(Enc(D, k) \to (E)) - \Pr(Random \to (E_r))| \le negl_3(n) \tag{13}
$$

where $E_r$ is a set of random strings. According to Equations 10, 11, 12, and 13, we obtain Equation 14.

$$
\Pr(PrivK^{kcm}_{A,\Pi}(n) = 1) \le \frac{1}{2} + negl_1(n) + negl_2(n) + negl_3(n) \tag{14}
$$

$$
negl(n) = negl_1(n) + negl_2(n) + negl_3(n) \tag{15}
$$

$$
\Pr(PrivK^{kcm}_{A,\Pi}(n) = 1) \le \frac{1}{2} + negl(n) \tag{16}
$$

By combining Equations 14 and 15, we can conclude Equation 16. We therefore say that MRSE-HCI is secure under the known ciphertext model.

7 EVALUATION METHOD

7.1 Search Precision

Search precision quantifies the user's satisfaction. Retrieval precision is related to two factors: the relevance between the documents and the query, and the relevance of the documents to each other. Equation 17 defines the relevance between the retrieved documents and the query.

$$
P_q = \sum_{i=1}^{k'} S(q_w, d_i) \Big/ \Big(\sum_{i=1}^{k} S(q_w, d_i)\Big) \tag{17}
$$

Here, k′ denotes the number of files retrieved by the evaluated method, k denotes the number of files retrieved by plaintext search, $q_w$ represents the query vector, $d_i$ represents a document vector, and S is a function that computes the relevance score between $q_w$ and $d_i$. Equation 18 defines the relevance between the different retrieved documents.

$$
P_d = \sum_{j=1}^{k'} \sum_{i=1}^{k'} S(d_j, d_i) \Big/ \Big(\sum_{j=1}^{k} \sum_{i=1}^{k} S(d_j, d_i)\Big) \tag{18}
$$

Here, k′ denotes the number of files retrieved by the evaluated method, k denotes the number of files retrieved by plaintext search, and both $d_i$ and $d_j$ denote document vectors.

Equation 19 combines the relevance between the query and the retrieved documents with the relevance between the documents to quantify the search precision:

$$
Acc = aP_q + P_d \tag{19}
$$

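where a functions as a tradeoff parameter that balances the relevance between the query and the documents against the relevance between the documents. With Equation 19 as written, a value of a greater than 1 puts more emphasis on the relevance between the query and the documents ($P_q$), and a value smaller than 1 puts more emphasis on the relevance between the documents ($P_d$).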

The above evaluation strategies should be based on the same dataset and keywords.
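Equations 17 to 19 can be computed as follows. This is a minimal sketch that assumes the inner product of Section 5.3 as the scoring function S; the function names and the sample vectors are illustrative only.

```python
def relevance(a, b):
    return sum(x * y for x, y in zip(a, b))


def query_precision(query, retrieved, baseline):
    """Equation 17: query relevance of the evaluated result relative to plaintext search."""
    return (sum(relevance(query, d) for d in retrieved)
            / sum(relevance(query, d) for d in baseline))


def document_precision(retrieved, baseline):
    """Equation 18: pairwise document relevance relative to plaintext search."""
    def pairwise(docs):
        return sum(relevance(dj, di) for dj in docs for di in docs)
    return pairwise(retrieved) / pairwise(baseline)


def accuracy(query, retrieved, baseline, a=1.0):
    """Equation 19: Acc = a * Pq + Pd."""
    return a * query_precision(query, retrieved, baseline) + document_precision(retrieved, baseline)


query = [1, 1, 0]
baseline = [[1, 1, 0], [1, 0, 0]]    # documents returned by plaintext search
retrieved = [[1, 1, 0], [0, 1, 1]]   # documents returned by the evaluated method

p_q = query_precision(query, retrieved, baseline)
p_d = document_precision(retrieved, baseline)
acc = accuracy(query, retrieved, baseline, a=1.0)
```

Both ratios use the plaintext result as the denominator, so a value above 1 means the evaluated method scored higher than plaintext search on that criterion.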

7.2 Rank Privacy

Rank privacy quantifies the information leakage of the search results. The definition of rank privacy is adopted from [19]. Equation 20 is used to evaluate the rank privacy.

$$
P_k = \sum_{i=1}^{k} p_i / k \tag{20}
$$

Here, k denotes the number of top-k retrieved documents, $p_i = |c_i' - c_i|$, $c_i'$ is the ranking of document $d_i$ in the retrieved top-k documents, $c_i$ is the actual ranking of document $d_i$ in the data set, and $p_i$ is set to k if it is greater than k. The overall rank privacy measure at point k, denoted as $P_k$, is defined as the average value of $p_i$ over every document $d_i$ in the retrieved top-k documents.
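Equation 20 can be sketched in a few lines. The input format, a list of (retrieved rank, actual rank) pairs for the top-k documents, and the function name are assumptions of this sketch.

```python
def rank_privacy(rank_pairs):
    """Equation 20: mean rank perturbation over the top-k results, each capped at k.

    rank_pairs: one (retrieved_rank, actual_rank) pair per retrieved document.
    """
    k = len(rank_pairs)
    perturbations = [min(abs(c_ret - c_true), k) for c_ret, c_true in rank_pairs]
    return sum(perturbations) / k


pairs = [(1, 1), (2, 5), (3, 2), (4, 40)]  # illustrative ranks for k = 4
p_k = rank_privacy(pairs)
```

A larger value means the returned order reveals less about the true ranking of the documents.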

8 PERFORMANCE ANALYSIS

In order to test the performance of MRSE-HCI on a real dataset, we built an experimental platform to test the search efficiency, accuracy, and rank privacy. We implemented the experiment on a distributed platform that includes three ThinkServer RD830 machines and a ThinkCentre M8400t. The data set is built from IEEE Xplore and includes about 51000 documents and 22000 keywords.

According to the notations defined in Section 4, n denotes the dictionary size, k denotes the number of top-k documents, m denotes the number of documents in the data set, and w denotes the number of keywords in the user's query.

Fig. 11 Search efficiency: (a) search time with an increasing number of documents; (b) search time with an increasing number of retrieved documents; (c) search time with an increasing number of query keywords

Fig. 11 describes search efficiency under different conditions. Fig. 11(a) describes search efficiency for different document set sizes with unchanged dictionary size, number of retrieved documents, and number of query keywords: n = 22157, k = 20, w = 5. In Fig. 11(b), we adjust the value of k with unchanged dictionary size, document set size, and number of query keywords: n = 22157, m = 51312, w = 5. Fig. 11(c) tests different numbers of query keywords with unchanged dictionary size, document set size, and number of retrieved documents: n = 22157, m = 51312, k = 20.

From Fig. 11(a), we can observe that with the exponential growth of the document set size, the search time of MRSE increases exponentially, while the search time of MRSE-HCI increases linearly. As Fig. 11(b) and (c) show, the search time of MRSE-HCI remains stable as the number of query keywords and the number of retrieved documents increase. Meanwhile, its search time is far below that of MRSE.

Fig. 12 describes search accuracy, using plaintext search as the standard. Fig. 12(a) illustrates the relevance of retrieved documents. As the number of documents increases from 3200 to 51200, the MRSE-to-plaintext search ratio fluctuates around 1, while the MRSE-HCI-to-plaintext search ratio increases from 1.5 to 2. From Fig. 12(a), we can observe that the relevance of retrieved documents in MRSE-HCI is almost twice that in MRSE, which means that the retrieved documents generated by MRSE-HCI are much closer to each other. Fig. 12(b) shows the relevance between the query and the retrieved documents. As the size of the document set increases from 3200 to 51200, the MRSE-to-plaintext search ratio fluctuates around 0.75, while the MRSE-HCI-to-plaintext search ratio increases from 0.65 to 0.75. From Fig. 12(b), we can see that the relevance between the query and the retrieved documents in MRSE-HCI is slightly lower than that in MRSE. This gap narrows as the data size increases, since a big document data set has a clear category distribution, which improves the relevance between the query and the documents. Fig. 12(c) shows the rank accuracy according to Equation 19. The tradeoff parameter a is set to 1, which means there is no bias towards the relevance of documents or the relevance between documents and query. From the result, we can conclude that MRSE-HCI is better than MRSE in rank accuracy.

Fig. 12 Search precision: (a) relevance of documents; (b) relevance between documents and query; (c) overall evaluation

Fig. 13 Rank privacy

Fig. 13 describes the rank privacy according to Equation 20. In this test, regardless of the number of retrieved documents, MRSE-HCI has better rank privacy than MRSE. This is mainly caused by the relevance of documents being introduced into the search strategy.

9 CONCLUSION

In this paper, we investigated ciphertext search in the scenario of cloud storage. We explored the problem of maintaining the semantic relationship between different plain documents over the related encrypted documents and gave a design method to enhance the performance of semantic search. We also proposed the MRSE-HCI architecture to adapt to the requirements of data explosion, online information retrieval, and semantic search. At the same time, a verifiable mechanism was proposed to guarantee the correctness and completeness of search results. In addition, we analyzed the search efficiency and security under two popular threat models. An experimental platform was built to evaluate the search efficiency, accuracy, and rank security. The experimental results show that the proposed architecture not only properly solves the multi-keyword ranked search problem, but also improves search efficiency, rank security, and the relevance between retrieved documents.

10 ACKNOWLEDGEMENT

This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA06040602) and the Xinjiang Uygur Autonomous Region science and technology plan (No. 201230121).

REFERENCES

[1] S. Grzonkowski, P. M. Corcoran, and T. Coughlin, "Security analysis of authentication protocols for next-generation mobile and CE cloud services," in Proc. ICCE, Berlin, Germany, 2011, pp. 83-87.

[2] D. X. D. Song, D. Wagner, and A. Perrig, "Practical techniques for searches on encrypted data," in Proc. S & P, Berkeley, CA, 2000, pp. 44-55.

[3] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano, "Public key encryption with keyword search," in Proc. EUROCRYPT, Interlaken, Switzerland, 2004, pp. 506-522.

[4] Y. C. Chang and M. Mitzenmacher, "Privacy preserving keyword searches on remote encrypted data," in Proc. ACNS, Columbia Univ., New York, NY, 2005, pp. 442-455.

[5] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, "Searchable symmetric encryption: improved definitions and efficient constructions," in Proc. ACM CCS, Alexandria, Virginia, USA, 2006, pp. 79-88.

[6] M. Bellare, A. Boldyreva, and A. O'Neill, "Deterministic and efficiently searchable encryption," in Proc. CRYPTO, Santa Barbara, CA, 2007, pp. 535-552.

[7] D. Boneh and B. Waters, "Conjunctive, subset, and range queries on encrypted data," in Proc. TCC, Amsterdam, Netherlands, 2007, pp. 535-554.

[8] D. X. D. Song, D. Wagner, and A. Perrig, "Practical techniques for searches on encrypted data," in Proc. S & P 2000, Berkeley, CA, 2000, pp. 44-55.

[9] E.-J. Goh, "Secure Indexes," IACR Cryptology ePrint Archive, vol. 2003, pp. 216, 2003.

[10] C. Wang, N. Cao, K. Ren, and W. J. Lou, "Enabling Secure and Efficient Ranked Keyword Search over Outsourced Cloud Data," IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 8, pp. 1467-1479, Aug. 2012.

[11] A. Swaminathan, Y. Mao, G. M. Su, H. Gou, A. Varna, S. He, M. Wu, and D. Oard, "Confidentiality-Preserving Rank-Ordered Search," in Proc. ACM StorageSS, Alexandria, VA, 2007, pp. 7-12.

[12] S. Zerr, D. Olmedilla, W. Nejdl, and W. Siberski, "Zerber+R: top-k retrieval from a confidential index," in Proc. EDBT, Saint Petersburg, Russia, 2009, pp. 439-449.

[13] C. Wang, N. Cao, J. Li, K. Ren, and W. J. Lou, "Secure Ranked Keyword Search over Encrypted Cloud Data," in Proc. ICDCS, Genova, Italy, 2010.

[14] P. Golle, J. Staddon, and B. Waters, "Secure conjunctive keyword search over encrypted data," in Proc. ACNS, Yellow Mt, China, 2004, pp. 31-45.

[15] L. Ballard, S. Kamara, and F. Monrose, "Achieving efficient conjunctive keyword searches over encrypted data," in Proc. ICICS, Beijing, China, 2005, pp. 414-426.

[16] R. Brinkman, "Searching in encrypted data," University of Twente, 2007.

[17] Y. H. Hwang and P. J. Lee, "Public key encryption with conjunctive keyword search and its extension to a multi-user system," in Proc. Pairing, Tokyo, Japan, 2007, pp. 2-22.

[18] H. Pang, J. Shen, and R. Krishnan, "Privacy-Preserving Similarity-Based Text Retrieval," ACM Trans. Internet Technol., vol. 10, no. 1, pp. 39, Feb. 2010.

[19] N. Cao, C. Wang, M. Li, K. Ren, and W. J. Lou, "Privacy-Preserving Multi-keyword Ranked Search over Encrypted Cloud Data," in Proc. IEEE INFOCOM, Shanghai, China, 2011, pp. 829-837.

[20] W. Sun, B. Wang, N. Cao, M. Li, W. Lou, Y. T. Hou, and H. Li, "Privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking," in Proc. ASIACCS, Hangzhou, China, 2013, pp. 71-82.

[21] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin, "Dynamic authenticated index structures for outsourced databases," in Proc. ACM SIGMOD, Chicago, IL, USA, 2006, pp. 121-132.

[22] H. H. Pang and K. L. Tan, "Authenticating query results in edge computing," in Proc. ICDE, Boston, MA, 2004, pp. 560-571.

[23] C. Martel, G. Nuckolls, P. Devanbu, M. Gertz, A. Kwong, and S. G. Stubblebine, "A general model for authenticated data structures," Algorithmica, vol. 39, no. 1, pp. 21-41, May 2004.

[24] R. C. Merkle, "Protocols for Public Key Cryptosystems," in Proc. S & P, Oakland, CA, 1980, pp. 122-122.

[25] R. C. Merkle, "A Certified Digital Signature," Lect. Notes Comput. Sci., vol. 435, pp. 218-238, 1990.

[26] M. Naor and K. Nissim, "Certificate revocation and certificate update," IEEE J. Sel. Areas Commun., vol. 18, no. 4, pp. 561-570, Apr. 2000.

[27] H. Pang and K. Mouratidis, "Authenticating the query results of text search engines," Proc. VLDB Endow., vol. 1, no. 1, pp. 126-137, Aug. 2008.

[28] C. Chen, X. J. Zhu, P. S. Shen, and J. K. Hu, "A Hierarchical Clustering Method For Big Data Oriented Ciphertext Search," presented at Proc. BigSecurity, Toronto, Canada, Apr. 27-May 2, 2014.

[29] S. C. Yu, C. Wang, K. Ren, and W. J. Lou, "Achieving Secure, Scalable, and Fine-grained Data Access Control in Cloud Computing," in Proc. IEEE INFOCOM, San Diego, CA, 2010, pp. 1-9.

[30] I. H. Witten, A. Moffat, and T. C. Bell, Managing gigabytes: compressing and indexing documents and images, 2nd ed., San Francisco: Morgan Kaufmann, 1999.

[31] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. Berkeley Symp. Math. Stat. Prob, California, USA, 1967, p. 14.

[32] Z. X. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values," Data Min. Knowl. Discov., vol. 2, no. 3, pp. 283-304, Sep. 1998.

[33] W. K. Wong, D. W. Cheung, B. Kao, and N. Mamoulis, "Secure kNN Computation on Encrypted Databases," in Proc. ACM SIGMOD, Providence, RI, 2009, pp. 139-152.

[34] R. X. Li, Z. Y. Xu, W. S. Kang, K. C. Yow, and C. Z. Xu, "Efficient multi-keyword ranked query over encrypted data in cloud computing," Futur. Gener. Comp. Syst., vol. 30, pp. 179-190, Jan. 2014.

[35] C. Gentry, "Fully homomorphic encryption using ideal lattices," in STOC, vol. 9, 2009.

[36] D. Boneh, G. Di Crescenzo, R. Ostrovsky, et al., "Public key encryption with keyword search," in Advances in Cryptology - EUROCRYPT 2004, Springer Berlin Heidelberg, 2004, pp. 506-522.

[37] D. Cash, S. Jarecki, C. Jutla, et al., "Highly-scalable searchable symmetric encryption with support for boolean queries," in Advances in Cryptology - CRYPTO 2013, Springer Berlin Heidelberg, 2013, pp. 353-373.

[38] S. Kamara, C. Papamanthou, and T. Roeder, "Dynamic searchable symmetric encryption," in Proceedings of the 2012 ACM conference on Computer and communications security, ACM, 2012, pp. 965-976.

[39] R. Curtmola, J. Garay, S. Kamara, et al., "Searchable symmetric encryption: improved definitions and efficient constructions," in Proceedings of the 13th ACM conference on Computer and communications security, ACM, 2006, pp. 79-88.

[40] M. Chase and S. Kamara, "Structured encryption and controlled disclosure," in Advances in Cryptology - ASIACRYPT 2010, Springer Berlin Heidelberg, 2010, pp. 577-594.

[41] D. Cash, J. Jaeger, S. Jarecki, C. Jutla, H. Krawczyk, M. C. Rosu, and M. Steiner, "Dynamic searchable encryption in very large databases: Data structures and implementation," in Proc. of NDSS, vol. 14, 2014.

[42] S. Jarecki, C. Jutla, H. Krawczyk, M. Rosu, and M. Steiner, "Outsourced symmetric private information retrieval," in Proceedings of the 2013 ACM SIGSAC conference on Computer and communications security, ACM, Nov. 2013, pp. 875-888.

Chi Chen (M'14) received the B.S. (2000) and M.S. (2003) degrees from Shandong University, Jinan, China. He received the Ph.D. (2008) degree from the Institute of Software, Chinese Academy of Sciences, Beijing, China. He is an associate research fellow of the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include cloud security and database security. From 2003 to 2011, he was a research apprentice, research assistant, and associate research fellow with the State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences. Since 2012, he has been an associate research fellow with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China.

Xiaojie Zhu received the B.S. degree from Zhejiang University of Technology, Hangzhou, China, in 2011. He is currently pursuing the M.S. degree at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include information retrieval, secure cloud storage, and data security.

Peisong Shen received the B.S. degree from the University of Science and Technology of China, Hefei, China, in 2012. He is currently pursuing the Ph.D. degree at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include information retrieval, secure cloud storage, and data security.

Jiankun Hu is a Professor and Research Director of the Cyber Security Lab, The University of New South Wales, Canberra, Australia. He has obtained 7 ARC (Australian Research Council) Grants and is now serving on the prestigious Panel of Mathematics, Information and Computing Sciences, ARC ERA Evaluation Committee.

Song Guo received his Ph.D. in computer science from the University of Ottawa, Canada. From 2001 to 2006, he worked as chief software architect for Liska Biometry Inc., NH, USA. Dr. Guo also held a position with the Department of Electrical and Computer Engineering, the University of British Columbia, on a prestigious NSERC (Natural Sciences and Engineering Research Council of Canada) Postdoctoral Fellowship in 2006. From 2006 to 2007, he was an Assistant Professor at the Department of Computer Science, University of Northern British Columbia, Canada. He is currently a Full Professor with the School of Computer Science and Engineering, the University of Aizu, Japan.

Zahir Tari received the degree in mathematics from the University of Science and Technology Houari Boumediene, Bab-Ezzouar, Algeria, in 1984, the Master's degree in operational research from the University of Grenoble, Grenoble, France, in 1985, and the Ph.D. degree in computer science from the University of Grenoble in 1989. He is a Professor (in distributed systems) at RMIT University, Melbourne, Australia. Later, he joined the Database Laboratory at EPFL (Swiss Federal Institute of Technology, 1990-1992) and then moved to QUT (Queensland University of Technology, 1993-1995) and RMIT (Royal Melbourne Institute of Technology, since 1996). He is the Head of the DSN (Distributed Systems and Networking) at the School of Computer Science and IT, where he pursues high-impact research and development in computer science. He leads a few research groups that focus on some of the core areas, including networking (QoS routing, TCP/IP congestion), distributed systems (performance, security, mobility, reliability), and distributed applications (SCADA, Web/Internet applications, mobile applications). His recent research interests are in performance (in Cloud) and security (in SCADA systems). Dr. Tari regularly publishes in prestigious journals (like IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Web Services, ACM Transactions on Databases) and conferences (ICDCS, WWW, ICSOC, etc.). He co-authored two books (John Wiley) and edited more than 10 books. He has been the Program Committee Chair of several international conferences, including the DOA (Distributed Object and Application Symposium), IFIP DS 11.3 on Database Security, and IFIP 2.6 on Data Semantics. He has also been the General Chair of more than 12 conferences. He is the recipient of 14 ARC (Australian Research Council) grants. He is a senior member of the IEEE.

Albert Y. Zomaya is currently the Chair Professor of High Performance Computing and Networking and Australian Research Council Professorial Fellow in the School of Information Technologies, The University of Sydney, Sydney, Australia. He is also the Director of the Centre for Distributed and High Performance Computing, which was established in late 2009. He is the author/co-author of seven books and more than 370 papers, and the editor of nine books and 11 conference proceedings. Prof. Zomaya is the Editor in Chief of the IEEE Transactions on Computers and serves as an Associate Editor for 19 leading journals. He is the recipient of the Meritorious Service Award (in 2000) and the Golden Core Recognition (in 2006), both from the IEEE Computer Society. He is a Chartered Engineer (CEng), a Fellow of the AAAS, the IEEE, the IET (UK), and a Distinguished Engineer of the ACM.
