Hiding Tree-Structured Data and Queries from Untrusted Data Stores

22 I N F O R M A T I O N S Y S T E M S S E C U R I T Y

M A Y / J U N E 2 0 0 4


Ping Lin and K. Selçuk Candan

In Web and mobile computing, clients usually do not have sufficient computation power or memory, and they need remote servers to do the computation or store data for them. Publishing data or applications on remote servers helps improve data availability and system scalability, reducing the client burden of managing data. Application Service Providers (ASPs) and Distributed Application Hosting Services (DAHSs), which rent out storage, (Internet) presence, and computation power to clients with IT needs (but without appropriate infrastructures), are becoming popular. Especially with the emergence of enabling technologies, such as J2EE and .NET, there is currently a shift toward services hosted by third parties.

Those providing data outsourcing service, with their computation power and large memory, can be called data stores or oracles. Typically, these data stores cannot be fully trusted, for we have to realize that with the data outsourced on third-party hosts, entities independent of data/service owners, (1) it is possible for the hosts to illegally utilize sensitive information about data to make a profit; (2) host operators may accept bribes from or be compelled by adversaries to grant insider control or leak sensitive information to them; (3) even if hosts are honest, despite all security methods (firewalls, etc.) that are employed, chances are that the host can be hijacked, leaking sensitive information about data to the hijacker; or (4) host machines may be physically seized by adversaries. Due to the above possibilities, a data store cannot be treated as a trusted party. Thus, in security-critical domains, in addition to traditional security provisions, appropriate methods should be taken to protect sensitive information about data from adversaries at hosts. Without enough security guarantees, data stores cannot survive.

PING LIN is a Ph.D. student in the Computer Science and Engineering Department at Arizona State University. She is interested in information security and distributed information systems. Her current research focuses on data and application security for outsourced services. She has published a book chapter and several conference papers in this research area. She received her M.S. degree from the Institute of Software, Chinese Academy of Sciences in 2000, and her master’s thesis was about the implementation of GSM protocols for mobile terminals. She received her B.S. in computer science from Xiangtan University in China in 1993.

K. SELÇUK CANDAN is an associate professor at the Department of Computer Science and Engineering at Arizona State University. He joined the department in August 1997, after receiving his Ph.D. from the Computer Science Department at the University of Maryland at College Park, where he received the 1997 ACM DC Chapter award of the Samuel N. Alexander Fellowship for his Ph.D. work. He has worked extensively on heterogeneous, distributed, and secure data management and integration issues. His research interests also include development of models, query processing, and optimization algorithms for heterogeneous data management systems. He received his B.S. degree, first ranked in the department, in computer science from Bilkent University in Turkey in 1993.

Clients with sensitive data (e.g., personally identifiable data) may require that their data content be protected from such data storage oracles. This leads to encrypted database research, in which sensitive data is encrypted so that the content is hidden from the database. This form of protection is defined as content privacy.

Sometimes not only the data outsourced to a data store but also the queries are of value, and a malicious data store can make use of such information for its own benefit. Protecting the queries is defined as access privacy. Typical scenarios demanding access privacy include:

• A mineral company wants to hide the locations to be explored when retrieving relevant maps from the IT department map database; and

• In a stock database, the kind of stock a user is retrieving is sensitive and needs to be kept private.

This leads to private information retrieval research, which studies how to let users retrieve information from a database without leaking (even to the server) the location of the retrieved data item.

The tree structure is a very important data structure, and tree-structured data appears in many application domains. In this article, we address outsourcing and hiding of tree-structured data and queries on this data. For this work, we have two motivating applications: hiding XML data that is stored in the form of trees and XML queries in the form of tree paths; and hiding tree-indexed data and queries for the data.

Hiding XML Data and Queries. Some data, such as XML documents, has a natural tree structure. DOM and LORE are two well-known tree data models for XML documents. XML introduces much flexibility into data structures and has become a de facto standard for data exchange and representation over the Internet. Some work has been done on selective and authentic untrusted third-party distribution of XML documents. That work focuses on access control and authentication of document (i.e., query result) source and content. With more and more data stored in XML documents, techniques to hide tree structures (the content and structure of XML documents) from untrusted data stores are in great need. In an XML database, a query is often given in the form of tree paths, as in XQuery. To hide XML queries and the structure of XML documents, clients need to traverse XML trees in a hidden way.

Hiding Index Structures and Queries. In databases, proper index trees are often built for convenient access to data. However, most index structures closely reflect the distribution of the data. For example, the quadtree data structure in Figure 1 demonstrates the close relationship between multidimensional index structures and the underlying data distribution. Such is the case with the B-tree, a tree generally adopted for indexing in DBMSs, which can be seen as a one-dimensional quadtree. Multidimensional index structures are often used in spatial and multimedia databases. Thus, in order to hide the data and data distribution from the database, tree structure hiding techniques must be adopted to protect index trees from oracles. To retrieve an indexed data item, a client needs to traverse the index tree to find the data’s location. In order to hide the query and the index tree, the traversal path needs protection.

We concentrate here on hiding tree-structured data and traversal of trees from oracles. Noticing that existing private information retrieval techniques require either heavy replication of the database onto multiple noncommunicating servers or large communication costs, we give a single-server tree-traversal protocol that provides a balance between the communication cost and security requirements. To protect the client from the malicious data store, some tasks (such as traversing the tree structures) are delegated to the client.

In the proposed technique, clients learn how to traverse a remotely stored tree structure to locate and retrieve data, while the tree structure and the traversal are hidden from the server. Client responsibilities include encryption and decryption of the data received from the data store during the traversal of the tree. The computation capability required at the client side can be achieved by assistant hardware equipment, such as smartcards, which are cheap (generally no more than several dollars) and now commonly used in mobile environments. In this article, we analyze the overhead incurred by the proposed technique, including communication, encryption, decryption, and locking costs. Although Chor et al.⁴ have argued that information-theoretical private information retrieval cannot be achieved without a significant communication overhead, we discuss methods to minimize those costs in a computational hiding sense.

Related Work: Private Information Retrieval

Private information retrieval aims to hide the address of the target data to be retrieved from the database server in an information-theoretic sense. Consequently, it is closely related to our problem. The concept of private information retrieval was first introduced by Chor et al.⁴ They found that if a single copy of the database is used, sending the whole database to the client is the only way to hide information retrieval in the information-theoretic sense. Consequently, replication of the whole database to multiple servers that are not able to cooperate with each other is necessary if less than the whole database is transferred. Later on, Ostrovsky et al.⁹ presented a technique by which a user can privately write data into the database. Based on basic private information retrieval techniques, Chor et al.⁴ proposed a technique for private information retrieval with keywords, which utilizes search tree structures to transform keywords into data addresses.

The existing private information retrieval protocols share some limitations and constraints: the heavy communication cost and the need for replication of databases. A trade-off exists between communication and replication: to reduce the communication cost to subpolynomial complexity, measured in terms of the size of the database, it is necessary to replicate more than a constant number of copies of the database. Furthermore, no communication between the copies is allowed (otherwise copies can cheat in coalition to learn about the data). This is almost impossible in the real world. Protocols built on one-way functions claim that a single database scheme could be achieved with polylogarithmic communication complexity by lowering the security requirement from information-theoretic privacy to computational privacy. However, these algorithms are very complicated, involving heavy computation at both the client and server. Furthermore, they are built on a binary-bit model of the database, which is difficult to transform into a realistic database system.

FIGURE 1 Data Distribution and Its Effect on the Quad Tree: (a) Data Distribution in 2D Space; (b) Corresponding Quadtree Structure

In this article, we discuss technical challenges to privately retrieve hidden tree-structured data, especially how to let a client traverse a tree structure to find the desired data node, while minimizing the leakage of the tree structure and the target data. We propose a protocol and provide tree traversal algorithms. We show that using the protocol, by adjusting the security and communication cost parameters, we can achieve the required level of computational security with an acceptable communication cost.

Organization of the Article

We first present a general overview of the framework and the outline of the private tree data retrieval protocol. Next we discuss how redundancy enables oblivious traversal of a tree structure, and address the underlying technical challenges, providing an oblivious traversal algorithm. We then give a quantitative analysis of the protocol and discuss how to tune the various system and security parameters to optimize the performance. We show how to apply the protocol for private accessing of outsourced XML documents. We implement the protocol, analyze experiment results, discuss the amount of security the protocol can achieve, and suggest ways to improve the security of the protocol in the future. Finally, we conclude.

OVERVIEW OF THE HIDING FRAMEWORK

In this section, we first give a general overview of the hiding framework. We then provide an outline of the proposed hidden data retrieval protocol.

There are three types of entities with different roles in the system: data owners, licensed users, and a data store (oracle). The data owners and licensed users are thin clients (as explained before). A data owner has the right to publish its data in the oracle, and a licensed user has the permission granted by some data owner to retrieve information from the data owner’s data storage space in the oracle. The oracle manages data storage spaces, where data and tree structures are stored in a hidden way.

Clients run data encryption algorithms and have initial secret keys for decryption. Encryption algorithms are used to encrypt data and tree structures before sending them to the oracle to ensure that the data content and the data structure are hidden from the oracle. If clients are accessing an outsourced index tree, they have point or range queries. If they are accessing outsourced XML trees, they have query patterns. Query patterns are used to traverse a tree structure along paths described by expressions similar to regular expressions. These tasks are accomplished efficiently by thin clients with the help of specialized embedded hardware, such as smartcards, distributed to the licensed users by data owners. Smartcards have often been used in mobile computing. They are relatively cheap, costing no more than several dollars. Such embedded hardware also helps in solving the secret key distribution problem; that is, by distributing smartcards that contain secret keys, a data owner distributes keys to licensed users.

Data Store Architecture

The storage space of the data store is divided into partitions, each owned by a different data owner. We focus here on a single partition.

The unit of storage and access is a node. Each node is identified by a unique Node ID (NID). A client retrieves a node from the data store or server by sending a request that includes the NID of the node.

Because the IDs of the nodes that constitute a tree structure are not known a priori to the clients, there is a special entry node called snode whose NID is known to all legal data owners and licensed users. It stores pointers to roots of tree structures. This node is encrypted by a fixed secret key known to all legal clients. We discuss its fields in greater detail later.

Note that the NID of the snode may be distributed to licensed users in the same way as keys by data owners.

Outline of Private Tree Data Retrieval Protocol

An outline of the private tree data retrieval protocol is depicted in Figure 2.

Every time the data owner wants to insert new data into the tree structure or delete a data item from it, the owner

• Encrypts the data with a secret key;

• Walks the tree structure in an oblivious manner so that the traversal path is hidden from the data store;

• Locates the node of interest (either for insertion or deletion); and

• Updates the tree structure by inserting or deleting encrypted tree nodes in proper positions, in an oblivious way with respect to the data store.

By walking or updating the tree structure in an oblivious way with respect to the data store, we mean minimizing the leakage of information about the data and the tree structure as much as possible; the details of how to walk and update tree structures in an oblivious way are described later.

Client traversal of the tree for retrieving information is similar to an update: in order to prevent the database server from differentiating between read and write operations, a read operation is always implemented as a read followed by a writing back of the contents.

OBLIVIOUS TRAVERSAL OF THE TREE STRUCTURE

Nodes of a tree structure are always encrypted before they are passed to the data store. Consequently, their content is already hidden from a malicious store. However, if a client traverses the tree structure in a plain way, the relationships between nodes in the tree, and therefore the tree structure as well as the user’s query, are revealed. The paper “Private Information Retrieval with Keywords”¹⁰ presents an oblivious way to traverse a tree structure: it models every level of a database as a separate array, which is then hidden using the basic private information retrieval technique. This scheme is simple, but like other information-theoretical private information retrieval approaches, it needs replication of the whole database across multiple noncooperating data stores, and the resulting communication cost is very high. Our protocol aims to provide computational hiding guarantees with a single data store, while minimizing the communication cost. We propose two adjustable techniques to achieve oblivious traversal of tree structures: access redundancy and node swapping.

FIGURE 2 Outline of the Protocol (between client and database: 1. analyze query; 2. obliviously walk tree structure; 3. retrieve encrypted data; 4. decrypt data)

Access Redundancy

Access redundancy requires that each time a client accesses a node, instead of simply retrieving that particular node, it asks the server for a set of m – 1 randomly selected nodes in addition to the target node. Consequently, the probability with which the data store will guess the intended node is 1/m, where m is an adjustable security parameter. We discuss how to choose the value of m later and define this set as the redundancy set of the target node. We also discuss how a client can choose those m nodes with minimal knowledge about the organization of the whole data storage space.

The problem with redundancy sets, on the other hand, is that their repeated use can leak information about the target node. For example, if the root node’s address is fixed, then multiple access requests for the root node reveal its position (despite the use of redundancy) because the root is always in the first redundancy set any client asks for. By intersecting all the redundancy sets, the data store can learn the root node. If the root is revealed, there is a high risk that its children (and also the whole tree structure) may be exposed. The situation is depicted in Figure 3, where large circles represent redundancy sets for different queries that retrieve the same target node and small circles denote the target node.
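The intersection attack described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: NIDs are modelled as the integers 0 to num_nodes - 1, the target never moves, and the function names are hypothetical rather than part of the protocol.

```python
import random

def redundancy_set(target, num_nodes, m, rng):
    """Return the target NID plus m - 1 distinct, randomly chosen decoy NIDs."""
    others = [nid for nid in range(num_nodes) if nid != target]
    return set(rng.sample(others, m - 1)) | {target}

def intersect_requests(target, num_nodes, m, num_queries, seed=0):
    """Server-side view: intersect every redundancy set observed so far."""
    rng = random.Random(seed)
    candidates = set(range(num_nodes))
    for _ in range(num_queries):
        candidates &= redundancy_set(target, num_nodes, m, rng)
    return candidates

if __name__ == "__main__":
    for queries in (1, 2, 3, 5):
        left = intersect_requests(target=0, num_nodes=1000, m=10, num_queries=queries)
        print(queries, "queries:", len(left), "candidate nodes remain")
```

With a fixed target address the candidate set can only shrink, and the target is always in it, which is why redundancy alone is not sufficient.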

Node Swapping

Consequently, in order to prevent the server from using an attack based on intersecting repeated or related requests, we have to move nodes each time they are accessed. Preferably, the move should have minimal impact on the tree structure and should not leak information about where a given node is moved. To achieve this, each time a client needs to access a node from the server, it asks the server for a redundancy set consisting of m nodes that includes at least one empty node along with the target node. The client then

1. Decrypts the target node,

2. Swaps it with the empty node, and

3. Re-encrypts the redundancy set and writes it back.

FIGURE 3 Repeated Accesses Reveal the Position of the Target Node: (a) After the First Access; (b) After the Second Access; (c) After the Third Access

Figure 4 shows how this approach prevents information leakage: Figure 4(a) shows that after the first access, the position of the target node is moved (the arrow shows the node’s movement). Figures 4(b) and (c) show that after the second and the third accesses, the position of the target node is moved again. As shown in Figure 4, during the course of an access, the oracle has the chance to know the position of the node only if the redundancy set for the access has little intersection with the set of the previous access, so that the position to which the node moved after the previous access is revealed. But because the node moves again once the nodes are written back after the access, such leakage is of no use to the server. In this way, the possible position of the target node is randomly distributed in the data storage space, and thus the repeated-access attack is avoided.

Re-Encryption to Enable Node Swapping

Node swapping requires re-encryption of nodes before they are rewritten to the server. Re-encryption should employ a new encryption scheme/key, because if the same encryption scheme is used, by comparing the content of nodes in the redundancy set after rewriting with their original content, the server can easily identify the new position of the node. This means that a client has to identify how each node is encrypted. We achieve this by adding a new field that contains the secret key for that particular node. This field is always encrypted using a single/fixed secret key. This way, the client can decrypt this field to learn how to decrypt the rest of the node.
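The key-field idea can be sketched as follows. This is a toy illustration, not the article's implementation: the stream cipher built from SHA-256 is a stand-in chosen only to keep the example free of external libraries (a real client would use a vetted cipher), and the per-node nonce, the field layout, and the names seal_node and open_node are all assumptions.

```python
import hashlib
import secrets

def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode); for illustration only."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + nonce + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)

def seal_node(master_key: bytes, payload: bytes) -> bytes:
    """Encrypt payload under a fresh node key and wrap that key with the fixed key."""
    node_key = secrets.token_bytes(32)   # new key on every write-back
    nonce = secrets.token_bytes(16)      # assumption: random nonce stored in the clear
    key_field = keystream_xor(master_key, nonce, node_key)
    body = keystream_xor(node_key, nonce, payload)
    return nonce + key_field + body

def open_node(master_key: bytes, blob: bytes) -> bytes:
    """Decrypt the key field first, then use the recovered key on the rest."""
    nonce, key_field, body = blob[:16], blob[16:48], blob[48:]
    node_key = keystream_xor(master_key, nonce, key_field)
    return keystream_xor(node_key, nonce, body)
```

Because every call to seal_node draws a fresh key and nonce, writing the same node back twice produces unrelated byte strings, so the server cannot match a rewritten node to its earlier ciphertext by comparison.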

Empty Nodes

The node-swapping technique we described uses empty nodes to move the content. This, however, requires allocation and (hidden) maintenance of empty nodes. Obviously, an alternative technique would be to swap the content of two existing nodes, instead of using an empty node. The main disadvantage of the second approach, on the other hand, is that it would require maintenance of parent–child relationships for both swapped nodes. As we discuss in the next section, maintaining node/parent-node relationships is not a trivial task, and maintaining empty nodes is easier than maintaining the parent relationship of both swapped nodes.

HIDDEN TREE TRAVERSAL ALGORITHM

To implement oblivious traversal of a tree structure, some critical issues have to be solved.

• How does a client know NIDs of the empty nodes?

• In order to randomly distribute the target node in the database, other nodes in addition to the target one and the empty one in the redundancy set should be randomly chosen from the database. How can a client randomly choose them?

• After moving one node, in order to maintain the integrity of the tree structure, the parent’s pointer to this node has to be updated accordingly. How can this be performed without revealing parent–child relationships on the tree structure?

• How can consistency of a tree structure be kept when there are many clients accessing it concurrently?

• How can we choose the values of various system parameters, such as the size of the redundancy set?

FIGURE 4 Swapping Enables Movement of the Target Node: (a) After the First Access; (b) After the Second Access; (c) After the Third Access

In this section, we provide techniques to address the first four of these challenges, and provide hidden retrieval algorithms based on them and the underlying protocol. In the next section, we discuss the choice of system parameters in greater detail.

Keeping Track of the Empty Nodes

Empty nodes are stored in hidden linked lists. Each node (data + pointer) in such a list is encrypted, so the server can identify neither the nodes that are empty nor the structure of the linked list. The encrypted snode, whose address is well known to all legitimate clients, keeps three kinds of pointers: eheads point to the heads of empty node lists, etails point to the tails of empty node lists, and roots point to the roots of tree structures. Because all the nodes including the snode are always encrypted when passed back to the server, the server learns neither the empty nodes nor the linked list. Furthermore, because every time an empty node is required, a different empty node is used, the server cannot identify empty nodes through repeated use.

Random Choice of Nodes in Database

When a client requests a node, it asks for a set of m nodes, including the target one. If node swapping is applied, those nodes should include at least one empty one. All nodes other than the target and the empty one in a redundancy set must be randomly chosen from the data storage space so that maximum independence is achieved. To enable a client to choose nodes randomly, the snode should record the range of NIDs of nodes in the data storage space. By generating m – 2 random NIDs within the range and ensuring that neither the NID of the target node nor that of the empty node is among them, the client generates a redundancy set.
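A minimal sketch of this selection step is given below. It assumes that NIDs are consecutive integers in the range [nid_low, nid_high] recorded in the snode; the function name and the choice to return the set in ascending NID order (so that the request order says nothing about which entry is the target) are illustrative assumptions.

```python
import random

def build_redundancy_set(target, empty, nid_low, nid_high, m, rng=None):
    """Return m distinct NIDs: the target, one empty node, and m - 2 random others."""
    if target == empty:
        raise ValueError("target and empty node must differ")
    if m < 2 or nid_high - nid_low + 1 < m:
        raise ValueError("m must be at least 2 and no larger than the NID range")
    rng = rng or random.SystemRandom()   # OS-backed randomness by default
    chosen = {target, empty}
    while len(chosen) < m:
        # Duplicates (including the target and empty NIDs) are rejected by the set.
        chosen.add(rng.randint(nid_low, nid_high))
    return sorted(chosen)
```

For example, build_redundancy_set(target=17, empty=42, nid_low=0, nid_high=99, m=8) yields eight distinct NIDs between 0 and 99 that include 17 and 42.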

Maintaining Integrity of Tree Structures

As to the challenge of maintaining node/parent-node relationships after node swapping, we propose two solutions.

The first solution is to find the empty node to be swapped with the child node and update the parent node correspondingly before actually moving the child node; that is, when a client has obtained a redundancy set for a parent node and wants a child node, it

• Identifies the NID of the next empty node in the empty node list of the next level; this empty node is the one to be used for swapping the child once it is fetched from the server;

• Changes the parent node’s corresponding child pointer to the NID of this empty node;

• Uses another encryption scheme to encode the nodes of the parent’s redundancy set and writes them back to the server; and

• Asks for a new set of m nodes, which includes the child node, the empty node, and m – 2 randomly selected nodes from the next level; they constitute a redundancy set for the child node.

This way, parents are always updated considering the future locations of their children.

The second solution is to let the client keep track of all nodes on the path while it walks down the tree structure, deferring all the updates to a later time. Once the node containing the data to be retrieved is accessed, the client visits the entire path in the reverse order and swaps all nodes on this path.

The benefit of the first solution is that the client does not need to keep track of the nodes on the path, thus requiring less memory than the second solution. The benefit of the second solution is that, because the client updates the nodes on the tree structure at the end of the data retrieval (and especially because it updates a parent node right after moving its child node, without any other processing in between), it is easier to maintain the consistency of the database in the case of aborted transactions. Those two solutions also introduce different concurrency penalties. We discuss them in the following subsection.

Concurrency Control Without Deadlocks

The proposed protocol will be applied to Web-based mobile computing environments with a large number of clients. In order to keep the tree structure consistent when many clients access it simultaneously, proper concurrency control must be used at the server side. There has been intensive study of index locking so that maximum concurrency is achieved while preserving the integrity of the tree structure. Because there is no pure read operation in the scheme (each node, after being read, should be written back), only exclusive locks are needed. To prevent deadlocks, we organize nodes in a data owner’s data storage space into d levels. Each level of a data owner’s data storage space requires an empty node list to maintain empty nodes at this level. The client always asks for locks of parent-level nodes before asking for locks of child-level nodes, and always asks for locks of nodes belonging to the same level in some predetermined order (e.g., in the order of ascending NIDs). In this way, all nodes in a data owner’s data storage area are accessed by all clients in a fixed predetermined order. This ensures that circular waits cannot occur. Hence deadlocks are prevented.
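The lock-ordering rule can be sketched with ordinary in-process locks. This is an illustration under assumptions, not the server design of the article: nodes are identified by hypothetical (level, NID) pairs, LevelLockTable is an invented name, and the sketch shows only the global acquisition order, not the release of parent-level locks once child-level locks are held.

```python
import threading
from contextlib import ExitStack

class LevelLockTable:
    """Hypothetical table of exclusive locks keyed by (level, NID)."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()   # protects the dictionary itself

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def acquire_all(self, nodes):
        """Lock every (level, nid) pair: parent levels first, ascending NID within a level."""
        stack = ExitStack()
        for key in sorted(set(nodes)):   # tuple ordering is exactly (level, nid) ordering
            stack.enter_context(self._lock_for(key))
        return stack                     # closing the stack releases all locks
```

A client would write `with table.acquire_all([(1, 7), (0, 2), (1, 3)]): ...`; whatever order the pairs are listed in, the locks are taken as (0, 2), (1, 3), (1, 7), so two clients with overlapping sets cannot wait on each other in a cycle.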

These two solutions introduce different concurrency costs. Considering the first solution, in which a parent node is updated before its child node is moved, the exclusive locks on nodes of the redundancy set of the parent node can be released immediately after the locks on the nodes of the redundancy set of the child node are gained. Then another client may have the chance to access the parent. In the second solution, because the whole path from the root to the node where data is retrieved is updated in the reverse order of their being accessed, the locks on nodes along the path should also be released in the reverse order of being gained. This means the lock on the root is the first to be gained and the last to be released; thus other clients are prohibited from accessing the tree structure at the same time. Therefore, from the concurrency perspective, the first solution is more desirable.

Pseudo Code of an Oblivious Tree Traversal Algorithm

According to the first tree integrity maintenance solution, we provide the pseudo code of an oblivious traversal algorithm in the following paragraph. The time complexity for this algorithm is O(d × m), with d denoting the depth of the tree storage space and m denoting the redundancy set size, and the space complexity is O(m). We omit the details of the second solution, but its time complexity is O(d × m) and its space complexity is O(d × m).

Oblivious Traversal Algorithm. Input: feature values of the target data item and the identifier of the data owner from whose tree structure the client wants to retrieve data.

Output: pointer to the node that contains the data if it exists; or null pointer.

1. Lock and fetch the public entry node of the data store, let it be PARENT, find the root, and let it be CURRENT.

2. Select a redundancy set for CURRENT, lock the nodes in the set, and let the empty node in the set be EMPTY.

3. Update PARENT’s pointer to refer to EMPTY, and release the locks on the PARENT level.

4. Swap CURRENT with EMPTY.

5. If CURRENT contains the data, return CURRENT; else

6. Make CURRENT the new PARENT, find the child node to be traversed next, make it the new CURRENT, and repeat steps 2 through 5. If there is no such child node, return a null pointer.
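The steps above can be sketched in a few lines of Python. This is a simplified model under several assumptions that are not in the article: the tree is a binary search tree with one storage level per tree depth, the snode and the empty-node lists are kept on the client as plain Python objects (standing in for the encrypted snode and hidden linked lists), and encryption and locking are omitted. Store, plain_insert, and oblivious_lookup are hypothetical names.

```python
import random

class Store:
    """Hypothetical in-memory stand-in for the untrusted store; it only sees NID sets."""
    def __init__(self, levels, slots):
        self.slots = slots
        self.nodes = {nid: None for nid in range(levels * slots)}  # None = empty slot
        self.log = []                                # every redundancy set requested

    def fetch(self, nids):
        self.log.append(sorted(nids))
        return {n: (dict(self.nodes[n]) if self.nodes[n] else None) for n in nids}

    def write(self, batch):
        self.nodes.update(batch)

def make_client_state(levels, slots):
    """snode holds the root pointer; empties[level] replaces the hidden empty lists."""
    return {"root": None}, [list(range(l * slots, (l + 1) * slots)) for l in range(levels)]

def plain_insert(store, snode, empties, key, value):
    """Owner-side setup only: ordinary (non-oblivious) binary search tree insert."""
    holder, field, level = snode, "root", 0
    while holder[field] is not None:
        holder = store.nodes[holder[field]]
        field = "left" if key < holder["key"] else "right"
        level += 1
    nid = empties[level].pop()
    store.nodes[nid] = {"key": key, "value": value, "left": None, "right": None}
    holder[field] = nid

def take_empty(empties, level, rng):
    return empties[level].pop(rng.randrange(len(empties[level])))

def oblivious_lookup(store, snode, empties, key, m, rng):
    """Return the value stored under key, or None; every visited node is moved."""
    level, current = 0, snode["root"]                # step 1
    if current is None:
        return None
    empty = take_empty(empties, level, rng)
    snode["root"] = empty                            # step 3 for the entry node
    while True:
        base = level * store.slots
        rset = {current, empty}                      # step 2: target, empty, decoys
        while len(rset) < m:
            rset.add(rng.randrange(base, base + store.slots))
        batch = store.fetch(rset)
        node = batch[current]
        batch[empty], batch[current] = node, None    # step 4: swap CURRENT with EMPTY
        empties[level].append(current)
        child = None
        if node["key"] != key:
            side = "left" if key < node["key"] else "right"
            child = node[side]
            if child is not None:                    # repoint parent before moving child
                node[side] = take_empty(empties, level + 1, rng)
        store.write(batch)                           # re-encryption would happen here
        if node["key"] == key:
            return node["value"]                     # step 5
        if child is None:
            return None
        current, empty, level = child, node[side], level + 1   # step 6
```

Each loop iteration fetches and rewrites exactly m nodes, so a lookup that descends d levels touches d × m nodes while the client holds only one redundancy set at a time, in line with the O(d × m) time and O(m) space stated above.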



IDENTIFYING APPROPRIATE VALUES FOR THE SYSTEM PARAMETERS

Choosing the appropriate design parameter values for a hiding system depends on various system constraints, including the acceptable communication cost and the required degree of hiding. We model a data owner’s data storage space as d levels. Suppose the tree structure is an l-level tree. Then the following parameters and constraints have to be considered.

• The maximum probability δ for the server to be able to find the actual node that the client is asking for from a redundancy set gives us (1/m) < δ.

• The maximum probability λ for the server to find the path along which a client walks the tree structure gives us (1/m^l) < λ. We emphasize here that although it is easy for the data store to guess the target node from the redundancy set if m is small, it becomes much harder to guess the parent–child relations between sequential node accesses. The probability of discovering a path is reduced exponentially with the increase of the length of the path and hence should be slim even with a small value of m.

• The total communication cost ε that clients are allowed to incur for each data retrieval gives us (read(m) + write(m)) × l ≤ ε; here read(m)/write(m) denotes the communication cost to read/write m nodes from/to the server.

• A node may contain multiple data points. We denote the node size (i.e., the number of data points a node is able to contain) as s. The value of s can be determined by considering the following. Let c denote the function of one round-trip communication cost for data points to be received from and sent to the server, e and d denote the encryption and decryption cost functions (d here is distinct from the number of storage levels), and w and r denote the write and read cost functions. Theoretically, they are linear functions. Then:

total_cost_for_data_retrieval
= tree_depth × m × (communication + decryption + encryption + read + write cost_per_node)
= l × m × (c(s) + d(s) + e(s) + r(s) + w(s)).

As node size s increases, tree depth l decreases and costs per node increase. If all other parameters are known, we can calculate the optimal node size to minimize the total cost. However, as s increases and l decreases, the probability for the data store to find a path, which is 1/m^l, increases. Therefore, the value of s should be carefully chosen to ensure that the security requirement is satisfied and the total cost is minimized as much as possible. Note that most of the above constraints are linear; hence an appropriate parameter setting can be easily identified using efficient algorithms.
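As a worked illustration of these constraints, the sketch below picks the smallest redundancy set size that satisfies both probability bounds and evaluates the total cost formula. The numeric cost rates and the function names are made-up assumptions for the example; the only modelling choice taken from the text is that each cost function is linear in s.

```python
def min_redundancy(delta, lam, depth):
    """Smallest integer m >= 2 with 1/m < delta and 1/m**depth < lam."""
    if not (0 < delta <= 1 and 0 < lam <= 1 and depth >= 1):
        raise ValueError("need 0 < delta, lam <= 1 and depth >= 1")
    m = 2                                # swapping needs at least target + empty
    while not (1 / m < delta and 1 / m ** depth < lam):
        m += 1
    return m

def total_cost(depth, m, s, rates):
    """l x m x (c(s) + d(s) + e(s) + r(s) + w(s)), each cost taken as rate * s."""
    return depth * m * sum(rate * s for rate in rates.values())

if __name__ == "__main__":
    m = min_redundancy(delta=0.25, lam=0.001, depth=4)
    rates = {"c": 1.0, "d": 0.5, "e": 0.5, "r": 0.25, "w": 0.25}  # hypothetical units
    print("m =", m, "total cost =", total_cost(4, m, 10, rates))
```

With δ = 0.25, λ = 0.001, and l = 4, the first bound alone would allow m = 5, but 1/5^4 = 0.0016 exceeds λ, so the path bound pushes the choice to m = 6.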

ENABLING PRIVATE ACCESS TO OUTSOURCED XML DOCUMENTS

In this section, we describe how our protocol can be extended to provide private access to XML documents.

In an XML document, each node of the tree corresponds to an element or an attribute of an element in the XML document. The root node contains the document's root. A child node corresponds to a subelement or an attribute of the parent node. For each child of a node, in addition to the pointer to the child, there can be a tag in the node that indicates the name of the child node. If the child is a subelement, the name is the element tag of the subelement. If the child is an attribute, the name is the attribute name.

XML query processing is concerned with finding the instances of a given pattern tree (query tree or twig) in a given target tree (XML document or document collection). Given directed labeled trees Q and T, the query tree Q matches the target tree T at a node x iff there exists a one-to-one mapping from the nodes of Q to the nodes of T (with respect to the node x) which preserves the labels, the order, and the ancestor/descendant relationships of the nodes. However, the wildcard symbol * can be used in a query to match any character in the alphabet and // can be used to match any sequence of symbols (Figure 5).
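To make the role of the two wildcards concrete, the following sketch matches a single root-to-node label path against a linear path pattern. It is a simplification, not the full twig matching defined above: it treats * as matching exactly one label and // as matching any (possibly empty) run of labels, and the function name `matches` is hypothetical.

```python
def matches(pattern, labels):
    """Return True if the label path `labels` matches the path `pattern`.

    '*' matches exactly one label; '//' matches any run of labels,
    including the empty run. Both arguments are lists of strings.
    """
    if not pattern:
        return not labels
    head, rest = pattern[0], pattern[1:]
    if head == "//":
        # Try every possible split point for the skipped run of labels.
        return any(matches(rest, labels[k:]) for k in range(len(labels) + 1))
    if not labels:
        return False
    if head == "*" or head == labels[0]:
        return matches(rest, labels[1:])
    return False


print(matches(["A", "//", "E", "F"], ["A", "B", "C", "E", "F"]))
print(matches(["A", "//", "E", "F"], ["A", "E", "F"]))
```

Both calls correspond to a path of the kind shown in Figure 5, where the query A // E F is matched by a document in which E lies several levels below A and by one in which E is a direct child of A.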

A naive approach to executing tree queries is to navigate from the root to the leaves to look for tree matches. A more advanced approach is to use inverted indexes to access nodes of the trees and then use join operations to match the tree structures.

Unfortunately, structural joins can be very costly. A variety of algorithms, such as ViST, use intervals to index nodes based on element names and use these intervals to perform structural join operations. Our protocol can be extended to both types of accesses.

Hiding Navigational Accesses to XML Data

The first kind of access is the naive navigational access that involves explicit traversal of the tree structure. We consider a path-based query that may be expressed using a regular expression plus predicates that filter nodes. Figure 6 gives an example XML document.

Suppose a licensed user has an XQuery with the XPath description professors/Professor[name="K.S.Candan"]/RAs/RA, which aims to find all RAs of the professor named "K.S. Candan." For this document, the user checks the root element and finds that the root node matches the element professors. The user would then retrieve the children of the root and check if any is of type Professor. If any of the children satisfies this condition, then the user would further evaluate the filtering predicate and pick the Professor nodes with the name attribute "K.S.Candan." The user would then continue navigating down the tree, checking at each step the element types and filtering conditions, until the required RA elements are retrieved.

Navigational access is very similar to the tree traversal we described earlier. To hide this type of access, the client would first find where the root node is by checking the snode. Then it would generate a redundancy set for the root. It would retrieve the nodes in the redundancy set from the data store and would identify the candidate nodes to be evaluated at the next stage. This way the navigation (or tree traversal) process would continue until the user has visited all relevant nodes in the tree.
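The retrieval loop just described can be sketched as follows. This is a deliberately reduced illustration, not the full protocol: it omits encryption, the snode lookup (the root location is passed in directly), and any node relocation or write-back performed by the tree traversal algorithm described earlier in the article. The in-memory `store`, the per-level id lists in `levels`, the assumption that child labels are unique within a node, and all function names are hypothetical.

```python
import random


def redundancy_set(target_id, level_ids, m, rng):
    """Return the target plus m - 1 distinct decoys from the same level, shuffled."""
    decoys = rng.sample([i for i in level_ids if i != target_id], m - 1)
    request = decoys + [target_id]
    rng.shuffle(request)
    return request


def navigate(store, levels, root_id, child_labels, m, rng):
    """Follow child_labels from the root; return the final node and what the server saw."""
    observed = []
    target = root_id
    for depth in range(len(child_labels) + 1):
        request = redundancy_set(target, levels[depth], m, rng)
        observed.append(request)
        fetched = {node_id: store[node_id] for node_id in request}  # one round trip
        node = fetched[target]  # only the client knows which node it wanted
        if depth < len(child_labels):
            target = node["children"][child_labels[depth]]
    return node, observed


# Toy data store: three real tree nodes, the rest are filler (empty) nodes.
filler = {"label": None, "children": {}}
store = {i: filler for i in range(11)}
store[0] = {"label": "professors", "children": {"Professor": 3}}
store[3] = {"label": "Professor", "children": {"RAs": 7}}
store[7] = {"label": "RAs", "children": {}}
levels = [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9, 10]]

node, observed = navigate(store, levels, 0, ["Professor", "RAs"], m=3,
                          rng=random.Random(0))
print(node["label"], observed)
```

The server-side view is the list `observed`: one set of m node ids per level, in which the node actually needed is indistinguishable from the decoys by position.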

FIGURE 5 (a) Query Tree (// Means Match Any Number of Elements); (b) and (c) Two Matching Documents

Hiding Join-Based Accesses to XML Data

The second kind of access is through structural join operations. The general idea of a structural join is as follows: (1) Label the XML tree nodes so that it is straightforward to tell the structural relationship (parent/child or ancestor/descendant) between any two nodes by their labels. A typical example of such a label is the interval. With interval labeling, XML trees are traversed in pre-order and each node is assigned an interval such that every child's interval is contained in the parent's interval. (2) Build an index tree on element tags or labels to facilitate fast retrieval of required XML nodes. (3) For a structural join XML query, split it into the proper set of subqueries and execute them. (4) Use the labels to structurally join the subquery results to obtain the XML query result.
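Step (1) above can be illustrated with a minimal interval-labeling sketch. The counter-based numbering scheme, the adjacency-list tree representation, and the names `label_intervals` and `is_ancestor` are assumptions made for illustration; actual structural join algorithms differ in how they assign intervals.

```python
def label_intervals(tree, root):
    """Assign (start, end) to every node so that a child's interval nests in its parent's.

    `tree` maps a node name to the ordered list of its children.
    """
    intervals = {}
    counter = 0

    def visit(node):
        nonlocal counter
        start = counter          # number handed out on entering the node (pre-order)
        counter += 1
        for child in tree.get(node, []):
            visit(child)
        intervals[node] = (start, counter)  # closes after all descendants
        counter += 1

    visit(root)
    return intervals


def is_ancestor(intervals, a, d):
    """True if node a is a proper ancestor of node d (strict interval containment)."""
    a_start, a_end = intervals[a]
    d_start, d_end = intervals[d]
    return a_start < d_start and d_end < a_end


tree = {"professors": ["prof1", "prof2"], "prof1": ["ras1"], "ras1": ["ra1", "ra2"]}
iv = label_intervals(tree, "professors")
print(iv["professors"], iv["ra1"])
print(is_ancestor(iv, "professors", "ra1"), is_ancestor(iv, "prof2", "ra1"))
```

Once nodes carry such labels, the ancestor/descendant test needs only two comparisons, which is what makes joining subquery results by label practical.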

Different structural join algorithms differ in (a) how they assign intervals to nodes, (b) how they split queries into subqueries to minimize query cost, (c) how they select the join order, (d) how they index the tags and intervals, and (e) how they perform the structural join operations. However, a common property of most approaches is that, to speed up the structural join operations and to process tree queries efficiently, these algorithms use tree-based index structures, such as B+-trees, to store node intervals. Such index structures facilitate efficient retrieval of data. However, unprotected access to the index can expose not only the structure of the index and its contents, but also the data distribution, for most indexes closely reflect the distribution of the data.

Thus, in order to hide the data and data distribution from the database, tree-structure hiding techniques must be adopted to protect the access of index trees by oracles.

To retrieve an indexed data item, a client needs to traverse the index tree to find the data's location. In order to hide the query and the index tree, the traversal path needs protection. Therefore, access to tree-structured index structures can also benefit from the oblivious tree traversal techniques we have previously described.

EXPERIMENT RESULTS

To validate the protocol, we simulated the protocol with the first tree traversal algorithm and conducted some experiments. We do not present results here regarding the second algorithm, as its concurrency cost is much higher; we focus on the better of the two. The computing environment consisted of a Linux server acting as a data store and a 1.0 GHz/256 MB laptop generating client requests. They were connected via a wireless LAN. We implemented a two-dimensional k-d tree as the index structure due to its simplicity. This simple structure enables us to observe experiment results more effectively.

In this article, we do not experiment with range queries inasmuch as we focus on path traversal. We point out that using this protocol, range queries can be implemented as multiple path traversals without deadlocks. We generated 40,000 data points that were uniformly distributed in the region (0, 0) to (1,000,000, 1,000,000), and stored them in a data storage space with a capacity of 30,000 nodes. The size of the redundancy set, m, is set to 8.

FIGURE 6 Example XML Document (a Professors root with three Professor elements, named Kasim S. Candan, Ridda A. Bazzi, and Suzanne W. Dietrich, whose subtrees contain RAs, TAs, and Instructors subelements)

Response Time and Node Size

We executed a set of experiments to show the relationship between node size and response time, that is, the time between a client sending a data retrieval request and getting the response.

Figure 7 shows the two sets of experiment results. The dark points denote the results of experiments with encryption/decryption implemented in software. This set shows that when node size is set to around 50 data points, the minimum response time (about 38 s) is achieved. This supports the theoretical observation that there must be an optimal node size. Considering the probability for the malicious server to find the path (denoted as the path probability, which is a function of node size, $1/m^{\log_s(num)}$; here m is the redundancy parameter, num is the total number of data points stored, and s denotes node size), a suitable node size can be chosen to satisfy security requirements and minimize response time.

The set of white points depicts experiments with efficient hardware encryption/decryption. From the result, we found that encryption and decryption constitute a heavy cost and that, with hardware assistance, response time can be greatly reduced to about 8 s.

To compare our protocol with the one-server Private Information Retrieval (PIR) technique, we also simulated PIR by transferring the whole database to a client. It takes about 3643 s to finish transferring. By this measure, our protocol is much more efficient.

Another interesting phenomenon we observe from Figure 7 is that although the two sets of points differ greatly in their values, they have a similar zigzag pattern. This shows that the discontinuities and sharp variances in response time values are mainly determined by costs other than encryption/decryption (communication cost c(s), write cost w(s), and read cost r(s)).

We also notice that the response time for the set of black points has a strong tendency to increase with the node size, whereas the increase is very slight for the white points. This can be explained by the significant part encryption/decryption plays in the total cost and its linear increase with the node size.

FIGURE 7 Response Time and Node Size (x-axis: node size in points; y-axis: response time in seconds)

Concurrency Control

Furthermore, we conducted a set of experiments to show the effect of concurrency control. In this set of experiments, 50 retrieval requests for independently selected random data points were launched one by one at intervals varying from every 10 ms to every 300 ms. In the experiment results, we found no deadlocks. We also found that the total time to finish all the requests was much less than the time needed when the server processed those retrievals sequentially. To give a sample result, when requests were launched every 20 ms, the total time required to finish them was 734.8 s, and the time to process them sequentially was 1442.9 s.

Figure 8 gives the ratio of the time required to process the requests sequentially to the time required by our protocol with concurrency control. We can see that the ratio is about 2; that is, concurrency control roughly halves the total time. Figure 8 also shows that this ratio increases with the time interval. This is consistent with the common knowledge that the efficiency of the data store is reduced when more clients access trees simultaneously.
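As a quick check, the sample figures quoted above for the 20 ms launch interval give:

$$\frac{1442.9\ \text{s}}{734.8\ \text{s}} \approx 1.96, \qquad 1 - \frac{734.8}{1442.9} \approx 0.49,$$

that is, a ratio close to 2, which corresponds to a reduction of roughly 49 percent in total completion time relative to sequential processing.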

SECURITY ANALYSIS AND FUTURE WORK

To ensure computational private security, the protocol should be able to protect queries and the tree data structure from a polynomial-time server. To study the security guarantee the protocol provides, suppose that the server keeps a history of all redundancy sets users retrieved, and the server makes inferences about queries and data by statistical analysis of the history. We define each redundancy set as a call, and the history as a view of the server. The required amount of security is defined as:

• For any two different queries Q1 and Q2 posed in the view, the distributions of their sequences of calls are indistinguishable in polynomial time; and

• For any two queries Q1 and Q2 posed in the view, it is hard to tell if they are identical by observing their sequences of calls.

FIGURE 8 Effect of Concurrency Control (x-axis: time interval in ms; y-axis: ratio)

We first assume that:

• Queries are uniformly distributed; that is, every tree node is accessed by clients as the target data to be retrieved with the same probability; and

• Every client can pose calls anonymously; that is, the identity of the clients can be hidden from the data store.

We remove the first assumption later. As to the second one, anonymous access is achievable with anonymous access protocols. Our proof that the protocol satisfies both levels of the security definition is based on the following proposition.

Proposition. If the data storage space is randomly initialized and queries are uniformly posed, tree nodes will always be uniformly distributed in each layer of the data storage space.

Proof. Suppose there are a total of N nodes in a level of the data storage space, and there are n tree nodes at that level. We use the N-bit random variable X(t) to represent the distribution of tree nodes in the level at time t, so that:

$$X(t)_i = \begin{cases} 1, & \text{if the } i\text{th node in the level stores the tree data} \\ 0, & \text{otherwise} \end{cases} \qquad \text{for } i \in [1..N].$$

Here $X(t)_i$ denotes the ith bit of X(t). It is obvious that $\sum_i X(t)_i = n$. Let S be X(t)'s state space; that is, $S = \{ s \mid s$ is an N-bit binary number such that $\sum_i s_i = n \}$. There are $C_N^n$ elements in S. Now we prove by induction that if queries are uniformly distributed and the data storage space is randomly initialized, tree nodes will always be uniformly distributed in each level after queries.

First, after the data storage space is randomly initialized, tree nodes are distributed uniformly at each level. Let X(0) be the initial state of the level; then:

$$\mathrm{PROB}[X(0) = s] = \frac{1}{C_N^n}, \quad \forall s \in S.$$

Suppose that X(t) (i.e., the state of the level at time t) conforms to the uniform distribution and that a uniformly distributed random query switches its target tree node with a randomly selected empty node at the level. Hence at time t + 1:

$$\mathrm{PROB}[X(t+1) = s] = \sum_{s' \in S} \mathrm{PROB}[X(t) = s'] \times \mathrm{PROB}[X(t+1) = s \mid X(t) = s'].$$

Let $D(s,2) = \{ s' \mid s' \in S$ and the Hamming distance between s′ and s is 2 $\}$. It includes all possible states of X(t) if X(t + 1) = s. Hence:

$$\mathrm{PROB}[X(t+1) = s] = \sum_{s' \in D(s,2)} \mathrm{PROB}[X(t) = s'] \times \mathrm{PROB}[X(t+1) = s \mid X(t) = s'].$$

Consider that X(t) conforms to the uniform distribution, that $|D(s,2)| = C_n^1 \times C_{N-n}^1$, and that, given an s′ in D(s,2), there are $C_n^1 \times C_{N-n}^1$ ways to perform a single random switch, only one of which moves the state of the level from s′ to s. The equation then becomes:

$$\mathrm{PROB}[X(t+1) = s] = \frac{1}{C_N^n} \times C_n^1 \times C_{N-n}^1 \times \frac{1}{C_n^1 \times C_{N-n}^1} = \frac{1}{C_N^n}, \quad \forall s \in S.$$

Corollary. If the data store is randomly initialized and queries are uniformly distributed, redundancy sets are also uniformly distributed.

The proof of the corollary is quite similar to the proof of the proposition and we omit the details here.

Under the assumption of uniform distribution of queries, for two different queries, if their query path lengths are equal, the distributions of their sequences of calls (redundancy sets) are identical (uniform distribution), and hence indistinguishable in polynomial time; if their query path lengths are not equal, clients can execute dummy calls at deeper levels so that they always make the same number of calls.
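The induction step in the proof of the proposition can be checked exhaustively for a small level. The sketch below is an illustration only: it models a level as the set of occupied positions, applies one switch in which the target tree node and the destination empty node are each chosen uniformly, and uses exact fractions so that no rounding is involved. The function names and the values N = 5 and n = 2 are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def uniform_distribution(N, n):
    """Every placement of n tree nodes among N slots gets probability 1 / C(N, n)."""
    p = Fraction(1, comb(N, n))
    return {frozenset(c): p for c in combinations(range(N), n)}


def one_switch(dist, N):
    """Move one uniformly chosen tree node to one uniformly chosen empty slot."""
    result = {}
    for state, p in dist.items():
        empty = [i for i in range(N) if i not in state]
        share = p / (len(state) * len(empty))  # n * (N - n) equally likely switches
        for source in state:
            for destination in empty:
                nxt = (state - {source}) | {destination}
                result[nxt] = result.get(nxt, 0) + share
    return result


N, n = 5, 2
after = one_switch(uniform_distribution(N, n), N)
print(all(p == Fraction(1, comb(N, n)) for p in after.values()))
```

Each state has n × (N − n) predecessors at Hamming distance 2, and each predecessor reaches it with probability 1 / (n × (N − n)), so the uniform distribution is carried over unchanged, as the proof states.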

As to the second security requirement, if two identical queries are posed consecutively without any interfering calls, their calls at the same level will always intersect; hence it seems possible for data stores to deduce some hints about identical queries from intersections. However, under the assumption of uniform distribution of queries, the probability for two identical queries to occur consecutively, which is $1/N^d$ (here d denotes the depth of the tree), is smaller than the probability for any two sequences of random sets to intersect, which is $(1 - C_{N-m}^m / C_N^m)^d$. Because every call includes some randomly selected nodes, the intersections introduced by identical queries will be perfectly hidden by intersections introduced by random nodes.
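The comparison between the two probabilities can be evaluated numerically. The sketch below reads them as 1/N^d for two identical queries occurring consecutively and (1 − C(N−m, m)/C(N, m))^d for two sequences of random m-node sets intersecting at every one of d levels. The values N = 1000, m = 8, and d = 4 are hypothetical and chosen only for illustration, as are the function names.

```python
from math import comb


def p_identical_consecutive(N, d):
    """Probability 1 / N**d that two consecutive uniform queries are identical."""
    return 1 / N ** d


def p_random_intersection(N, m, d):
    """Probability that two random m-subsets of N nodes intersect at each of d levels."""
    per_level = 1 - comb(N - m, m) / comb(N, m)
    return per_level ** d


N, m, d = 1000, 8, 4
print(p_identical_consecutive(N, d))
print(p_random_intersection(N, m, d))
```

For these values the first probability is several orders of magnitude smaller than the second, which is the relationship the argument above relies on.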

However, under other assumptions of query distribution, our protocol may not provide the required security guarantee. For example, if a query Q occurs at a very high frequency, the intersections introduced by random nodes cannot hide the intersections between consecutive Qs. We are studying how to improve the protocol under other query distributions by methodically introducing dummy calls and intersections to make other query distributions appear uniform in the view.

CONCLUSION

In this article, we propose a simple, adaptive, and deadlock-free protocol to hide tree-structured data and traversal paths from a data store. Because many kinds of data, such as XML documents, have a tree structure and queries can be expressed as traversal paths, this protocol can be utilized to hide such data and queries. Research has been done to study access control, content hiding, and authentication of XML documents, but none has been done on hiding XML queries and structures. We believe that this is the first protocol to address this need. Compared with existing private information retrieval techniques, our protocol does not need replication of databases and requires only moderate communication, and is thus practical. We show how to apply it to hide XML documents and tree-path-based queries. We conduct experiments and observe that the proposed techniques achieve hiding without generating unacceptable concurrency problems. Finally, we give a security analysis of the protocol and point out future research directions.

Note

This work is supported by the AFOSR grant #F49620-00-1-0063 P0003.

References

1. Hacigümüs, H., Iyer, B.R., Li, C., and Mehrotra, S. (2002). Executing SQL over Encrypted Data in the Database-Service-Provider Model. In Proceedings of 2002 ACM SIGMOD International Conference on Management of Data, Madison, WI, June 3–6, 2002, 216–227.

2. Oracle Corp. (1999). Database Security in Oracle8i, 1999. Retrieved on February 26, 2004, from http://otn.oracle.com/depoly/security/oracle8i/index.html.

3. Smith, S.W. and Safford, D. (2001). Practical Server Privacy with Secure Coprocessors. IBM Systems Journal, 40(3), 683–695.

4. Chor, B., Goldreich, O., Kushilevitz, E., and Sudan, M. (1995). Private Information Retrieval. In Proceedings of the 36th IEEE Conference on the Foundations of Computer Sciences, Milwaukee, WI, October 23–25, 1995, 41–50.

5. Bouganim, L. and Pucheral, P. (2002). Chip-Secured Data Access: Confidential Data on Untrusted Servers. In Proceedings of the 28th Very Large Data Bases Conference, Hong Kong, 2002, 131–142.

6. Bayer, R. and Schkolnick, M. (1977). Concurrency of Operations on B-Trees. Acta Informatica, 9, 1–21.

7. Mohan, C. (1996). Concurrency Control and Recovery Methods for B+-Tree Indexes: ARIES/KVL and ARIES/IM. In V. Kumar (Ed.), Performance of Concurrency Control Mechanisms in Centralized Database Systems, Prentice-Hall, Englewood Cliffs, NJ, 248–306.

8. Mohan, C. (2002). An Efficient Method for Performing Record Deletions and Updates Using Index Scans. In Proceedings of the 28th Very Large Data Bases Conference, Hong Kong, 2002, 940–949.


9. Ostrovsky, R. and Shoup, V. (1997). Private Information Storage. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, El Paso, TX, May 4–6, 1997, 294–303.

10. Chor, B., Gilboa, N., and Naor, M. (1997). Private Information Retrieval by Keywords, Technical Report TR CS0917. Technion, Israel.

11. Lin, P. and Candan, K.S. (2003). Data and Application Security for Distributed Application Hosting Services. In M. Fugini and C. Bellettini (Eds.), Information Security Policies and Actions in Modern Integrated Systems, 273–316.

12. Lin, P., Candan, K.S., Bazzi, R., and Liu, Z. (2003). Hiding Data and Code Security for Application Hosting Infrastructure. In H. Chen et al. (Eds.), Intelligence and Security Informatics, 388.

13. Lin, P. and Candan, K.S. (2003). Hiding Traversal of Tree Structured Data from Untrusted Data Stores. In H. Chen et al. (Eds.), Intelligence and Security Informatics, 385.

14. Candan, K.S., Jajodia, S., and Subrahmanian, V.S. (1996). Secure Mediated Databases. In Proceedings of the IEEE Conference on Data Engineering, 490–501.

15. Paparizos, S., Patel, J.M., Srivastava, D., Wiwatwattana, N., Wu, Y., and Yu, C. (2003). TIMBER: A Native XML Database for Querying XML. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, CA, June 9–12, 2003, 672.

16. Wang, H., Park, S., Fan, W., and Yu, P.S. (2003). ViST: A Dynamic Index Method for Querying XML Data by Tree Structures. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, CA, June 9–12, 2003, 110–121.
