Efﬁcient R-Tree Based Indexing Scheme for Server-Centric ... · two-layer indexing scheme for...

1041-4347 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2016.2526006, IEEETransactions on Knowledge and Data Engineering

1

Efficient R-Tree Based Indexing Scheme forServer-Centric Cloud Storage System

Yang Hong, Qiwei Tang, Xiaofeng Gao, Bin Yao, Guihai Chen, Senior Member, IEEE , and Shaojie Tang

Abstract—Cloud storage system poses new challenges to the community to support efficient concurrent querying tasks for variousdata-intensive applications, where indexes always hold important positions. In this paper, we explore a practical method to construct atwo-layer indexing scheme for multi-dimensional data in diverse server-centric cloud storage system. We first propose RT-HCN, anindexing scheme integrating R-tree based indexing structure and HCN-based routing protocol. RT-HCN organizes storage andcompute nodes into an HCN overlay, one of the newly proposed sever-centric data center topologies. Based on the properties of HCN,we design a specific index mapping technique to maintain layered global indexes and corresponding query processing algorithms tosupport efficient query tasks. Then we expand the idea of RT-HCN onto another server-centric data center topology DCell, discoveringa potential generalized and feasible way of deploying two-layer indexing schemes on other server-centric networks. Furthermore, weprove theoretically that RT-HCN is both space-efficient and query-efficient, by which each node actually maintains a tolerable numberof global indexes while high concurrent queries can be processed within accepted overhead. We finally conduct targeted experimentson Amazons EC2 platforms, comparing our design with RT-CAN, a similar indexing scheme for traditional P2P network. The resultsvalidate the query efficiency, especially the speedup of point query of RT-HCN, depicting its potential applicability in future data centers.

Index Terms—Distributed Index, R-Tree, Data Center Network.

F

1 INTRODUCTION

C LOUD storage systems keep gaining attentions fromboth academia and industry nowadays. From classical

systems for general data services, such as Google’s GFS [1],Amazon’s Dynamo [2], Facebook’s Cassandra [3], to newlydesigned systems with specialities, such as Haystack [4],Megastore [5], Spanner [6], various distributed storage sys-tems were constructed to satisfy the increasing demandof online data-intensive applications that require massivescalability, efficient manageability, reliable availability, andlow latency in the storage layer. Many works have beenproposed for designing new indexing scheme and datamanagement system to support large-scale data analyticaljobs and high concurrent OLTP queries [7–10].

To achieve query efficiency, most existing mature cloudstorage systems employ a pure key-value data model orsome of its variants, which are lacking in built-in supportfor multi-dimensional index, as can be seen from Dynamoand BigTable. Actually, two-layer index [7] is just an efficientand suitable framework for multi-dimensional querying inCloud systems. This framework has been put into successfulpractice in [8–10]. However, they construct their global in-dexes on the P2P networks, like BATON [11] and CAN [12].We know that P2P networks give better illustrations forconnections on the logic level than the Internet level, buttheir underlying topologies are actually undefined and thenodes may scatter widely with unbounded physical hop

• Yang Hong, Qiwei Tang, Xiaofeng Gao, Bin Yao, and Guihai Chen arewith the Shanghai Key Laboratory of Scalable and Computing and Sys-tems, Department of Computer Science and Engineering, Shanghai JiaoTong University. Shaojie Tang is with the Naveen Jindal School of Man-agement, University of Texas at Dallas. E-mail: [email protected],[email protected], {gao-xf, yaobin, gchen}@cs.sjtu.edu.cn, [email protected].

• Xiaofeng Gao is the corresponding author.

distance, bringing instability of performance.As is known, more and more cloud systems today favor

an infrastructure called data center, which consists of largenumber of servers interconnected by a specific Data CenterNetwork (DCN) [13–25]. For instance, Cisco applies Fat-Tree topology as its DCN architecture for provably efficientcommunication [13]. Designing efficient indexing schemeon data center topologies has been put on the agenda.Different from P2P network, DCN is more structured withlow equipment cost, high network capacity, and support ofincremental expansion. Particularly, DCN has specific phys-ical topologies, with the connections among nodes strictlydefined. It is not wise to simply transplanted the indexingscheme for P2P networks onto DCN topologies. Such in-frastructures bring new challenges for researchers to designefficient indexing scheme to support query processing forvarious applications.

In this paper, we show the construction of a distributedmulti-dimensional indexing scheme for server-centric datacenter networks. We start from one of the mainstreamserver-centric topologies, named Hierarchical Irregular Com-pound Networks (HCN) [26, 27], as an example topology.HCN is mathematically a simple and beautiful topologydue to the inherent regularity and symmetry, which bringsin convenience for the indexing building and potentialfor expansion. Correspondingly, we propose a two-layerindexing scheme, RT-HCN. Since datasets are distributedamong different servers, we can use an R-Tree like indexingstructure to index locally stored multi-dimensional data foreach server. Next, RT-HCN portionably distributes theselocal indices across servers as their global indices. To avoidsingle master server bottleneck, each server only maintainspartial global index for its potential indexing range. Basedon the characteristics of HCN, we design an index mapping



2

method to promote the distribution of the global index ontocomputer servers. Corresponding query processing algo-rithms are also presented then. We also prove theoreticallythat RT-HCN is both query-efficient and space-efficient, bywhich servers will not maintain too may redundant indiceswhile a large number of users can concurrently processqueries with not relatively high routing cost. Before eval-uation, we discuss our work on expanding our RT-HCNindexing scheme into another data center topology DCell.We compare our design with RT-CAN [8], a similar designfor traditional P2P network, in terms of performance onmulti-dimensional and skewed data. Experiments validatethe query efficiency with low false positives of our proposedscheme and depict its potential implementation in datacenters.

Our contribution of this paper is threefold:

• we are the first to propose a distributed multi-dimensional indexing scheme for server-centric DCNstructures, and also the first to develop a general andpractical method to expand this two-layer indexingscheme onto other data center networks;

• we present a specialized mapping technique toimprove global index distribution in the network,bringing query-efficiency and load-balancing for thecloud system; we besides combine practical tech-niques to solve data skewness and querying falsepositives, greatly increasing the adaptability andquerying performance of RT-HCN;

• we theoretically prove the efficiency of RT-HCN, andcompare it numerically with RT-CAN [8], an index-ing scheme for P2P network. Experiments on realplatforms show that our scheme performs excellentlyfor point query.

The rest of this paper is organized as follows. Section. 2summarizes the related work. Section. 3 introduces theoverview of our system, including a coding method ofHCN nodes and meta-servers. In Section. 4 we illustrate theprocedure of two-layer indexing construction with a wealthof details, including some techniques for multi-dimensionindexing, data skewness handling, and false positives con-trolling. Section. 5 briefly depicts the query processing al-gorithms. In Section. 6 we expand the RT-HCN indexingscheme to fit another server-centric data center networks,DCell. Section. 7 proves the efficiency of our design andperforms numerical experiments in comparison with the RT-CAN [8]. Finally, Section. 8 provides a conclusion and givessome discussion about future work.

2 RELATED WORKS

2.1 Two-layer Indexing

To get a better speed for data retrieval in these distributedstorage, usually a mechanism for parallel querying is re-quired. A two-level indexing framework is a good choice. Akey idea about the combination of two-level indexing withthe overlay networks was first proposed in [7]. A generalmode is offered: in the indexing framework, processingnodes are organized in a structured overlay network, andeach processing node builds its local index to speed up

data access. A global index is built by selecting and pub-lishing a portion of the local index in the overlay network,while the global index is distributed and maintained inall nodes over the network. Several attempts have alreadybeen carried out on P2P overlay networks. RT-CAN [8] wasproposed as a multi-dimensional indexing scheme built ontop of local R-tree indexes and organized on a CAN overlaynetwork. It helps to provide efficient data retrieval servicefor large-scale shared-nothing clusters. Different from this,CG-index [9] organizes computer nodes into a structuredP2P network, BATON, and builds B-tree indexes to supporthigh throughput one-dimensional queries. The generaliz-able work in [10] presents an extensible indexing frameworkto support DBMS-like indexes in the cloud based on theobservation that many P2P overlays are instances of theCayley graph [28]. This extensible framework is able to sup-port multiple types of distributed indexes simultaneously,such as hash, B-tree-like and multi-dimensional index, sig-nificantly reducing the maintenance cost and providing theneeded scalability.

2.2 Date Center NetworksWhat our work pursues is a scalable method of adjustmentfor two-layer indexing building on DCN networks. One ofthe two core components is that our underlying topology isa specific server-centric DCN structure. Data center network(DCN) is the network infrastructure for a data center, whichconnects a large number of servers via high-speed linksand switches. Compared to traditional cloud system whichis usually based on P2P network, specially and carefullydesigned DCN topologies fulfill the requirements with low-cost, high scalability, low configuration overhead, robust-ness and energy-saving. DCN structures can be roughlydivided into two categories, one is switch-centric, whichmeans that the function of switches is enhanced to accom-modate the need of the interconnection and routing, whilethe servers require no modification. There are three kinds ofit, tree-like, flat and optical switch based. The Fat-Tree [13],VL2 [14] and Aspen trees [15] belongs to the tree-like kind.The other category is server-centric, which means that eachserver enables the functions of interconnection and routingwhile the switches require no change and only provide easycrossbar function. Among them, DCell [16], FiConn [17, 18],Dpillar [19], SWCube and SWKuatz [20] are designedfor mega data centers. While BCube [21], MDCube [22],uFix [23], snowflake [24], hyper-fat-tree network (HFN) [25]are topologies for the modular data centers. Chronologically,they usually have more advantages than the former designs.HCN [26, 27], the topology chosen in our system falls intothe server-centric topology. It is a well-designed network fordata center and offers a high degree of regularity, scalability,and symmetry. Different from traditional P2P network, thephysical connection of DCN is known, we can guarantee theprocessing time by calculating out the physical hops neededfor a given query, while in P2P network only logical hops ofthe overlay network can be estimated. We have to be awareof the physical topology when we are discussing DCN andthat is why we need to improve the mapping technique fordistributing global index to fix a given network.

Actually, similar works on the DCN-based indexingscheme has yielded good results. FT-Index [29] is a dis-



3

tributed indexing scheme based on Fat-Tree topology. InFT-Index, the B+-tree and the Interval tree are adopted toorganize data set locally and globally. FT-Index has a goodperformance on false positives and space-efficiency, withthe help of two assistant tools FT-Gap and FT-Bloom. Inanother work [30], the B+-tree is combined with Segmenttree, and the two-layer indexing is implemented on threetree-like switch-centric DCN topologies. However, eitherFT-Index or the later one performs worse for general rangequery than point query. This is essentially the limit of B+-tree indexing for one-dimensional data. Inspired by this, wewant to enrich the types of querying in our new indexingscheme.

2.3 R-tree in Distributed Environment

The other core components is a multi-dimension orientedindexing structure: we use the R-tree. R-trees [31] [32],developed for indexing multi-dimensional data, are themost popular indexes for spatial query processing due totheir simplicity and efficiency. The R-tree extended the B-tree from one dimension to multi-dimensions for indexingstatic features. Spatial features and their relationships wererecognized and stored in a tree structure so that featurescould be more quickly retrieved by tracing along the treestructure. R-tree’s advantages in multi-dimensional queryhas additional value in the distributed systems. With theproliferation of location-based e-commerce and mobile com-puting, the need for moving objects querying begins to arise.Assuming object movement trajectories are known as priori,Saltenis et al [33] proposed the Time-Parameterized R-tree(TPR-tree) for indexing moving objects. By combining withthe improved UK-means algorithm, Ben Kao et al [34] usesan R-tree index to solve the problem of clustering uncertainobjects whose locations are described by probability den-sity functions. In [35] Kai Zheng et al developed efficientalgorithms to answer fuzzy objects querying, by extendingthe R-tree indexing structure and deriving several highlyeffective heuristic rules.

3 SYSTEM OVERVIEW

We focus on the server-centric DCN structure for the dis-tributed indexing scheme construction, and take HCN asthe first example to illustrate our design. In this section,some basic features of HCN topology will be introduced andillustrated. Then, we explain some preliminary definitionsfor further illustration of index construction strategy.

3.1 Hierarchical Irregular Compound Network

HCN (Hierarchical irregular Compound Network) is a well-designed network for data center and offers a high degree ofregularity, scalability, and symmetry. A level-h HCN with nservers in every single unit is denoted as HCN(n, h). HCNis a recursively defined structure. A high-level HCN(n, h)employs n low level HCN(n, h − 1) as a unit cluster andconnects all the clusters by means of a complete graph.HCN(n, 0) is the smallest module (basic construction unit)that consists of n dual-port servers and an n-port mini-switch. For each server, its first port is used to connect with

the mini-switch while the second port is employed to inter-connect with another server in different smallest modulesfor constituting a larger network. Figure. 1 illustrates anexample of HCN with n = 4 and h = 2, which consistsof 64 servers. Each server is labeled according to a codingprocess, which will be introduced in Section. 3.2.

0 1

2 3

4 5

6 7

8 9

10 11

12 13

14 15

16 17

18 19

20 21

22 23

24 25

26 27

28 29

30 31

32 33

35

36 37

38 39

40 41

42 43

44 45

46 47

48 49

50 51

52 53

54 55

56 57

58 59

60 61

62 63

switch server

64 68

72 76

80 84

88 92

96 100

104 108

112 116

120 124

128 144

160 176

192

34

Fig. 1. HCN(4,2) with Coding for Meta-Server

It is easy to see that there is always multiple routes be-tween any two servers in HCN and this is called multi-pathrouting which provides good network features like highbandwidth, good balancing and fault tolerance. This is themain aspect we want to concern during index constructionand also becomes a main reason why special designed indexscheme for specific network is well worth being discussed.

For clarity, we summarize the symbols with their mean-ings in Table. 1. Some of them will be described in thefollowing sections.

TABLE 1Symbol Description

Sym Description

n Code for server and meta-serverSn The server with code nMn The meta-server with code nRn Representatives for Mn

Nn Sn’s node set for publishingh The highest level of HCNB Data boundaryBn Potential index range for Mn

Rin The ith representative for Mn

N in The ith publishing node for Sn

3.2 Meta-Server and Coding StrategyGiven an HCN(4, h), there are 4h+1 servers in total and arecoded by n ranging from 0 to 4h+1 − 1. Thus we use Sn

to denote the nth server in the HCN. Since HCN itself is arecursively defined structure, there are also 4h−l differentHCN(4, l)’s (0 ≤ l ≤ h) in one HCN(4, h). We considereach Sn and HCN(4, l) (0 ≤ l ≤ h) as a meta-server in oursystem.

Definition 1. A Meta-server is a single server or an HCN(4, l)(0 ≤ l ≤ h) considered entirely.



4

Meta-servers together constitute the overlay networkand facilitate global queries, which is going to be explainedin detail later. Meta-servers are also coded by n and denotedas Mn. The coding strategy of Mn is given by:

Mn =

{Sn, 0 ≤ n < 4h+1

HCN(4, q − 1), n ≥ 4h+1 , (1)

where the HCN(4, q−1) exactly consists of servers rang-ing over Sr, Sr+1, · · · , Sr+4q−1, with q and r from the aboveequation represent the quotient and the reminder when thegiven n is divided by 4h+1. From the function, we knowthat Mn is equivalent to Sn when n < 4h+1, and whenn ≥ 4h+1, Mn represents a specific HCN(4, l) (0 ≤ l ≤ h).For example, Figure. 1 shows that in an HCN(4, 2), 64 meta-servers consist of only one server are coded with the numberranging from 0 to 63, 16 meta-servers formed by HCN(4, 0)are coded with 64, 68, · · · , 124 (shown as red squares), 4bigger meta-servers formed by HCN(4, 1) are coded withnumber 128, 144, 160, 176 (shown as blue squares), whilethe biggest green square formed by the whole HCN(4, 2) iscoded with 192.

3.3 Representatives in Meta-Server

As is already mentioned, meta-servers form a higher-leveloverlay network and assist query processing in the network.However, as meta-servers are merely an abstract concept,we need to pick up several physical servers to be in chargeof queries that are sent to corresponding meta-server fromthe overlay network.

Definition 2. Representatives of a given meta-server Mn areseveral physical servers that actually deal with queries forwardedto Mn, which are picked out based on the following strategy.

1) If n < 4h+1, we chose Sn as the representative ofMn since Mn = Sn.

2) If n ≥ 4h+1, Mn, now, is actually an HCN(4, l) (0 ≤l ≤ h), and we provide a special bit manipulationto find out its representatives. First we calculate thequaternary form of n, denoted as q0q1 · · · qm. Herem > h stands since n ≥ 4h+1. Secondly, we pick upthe first m−h bits as shown in Eqn. (2) and calculatethe decimal number b. Then we replace each of thelast b bits with q∗, where q∗ = qm−b and will getseveral newly formed number. Finally we calculatethe decimal form of the last h + 1 bits of the newnumbers and servers coded with those numbers arethe representatives for this Mn.

m − h bits h + 1 bits︷︸︸︷q0q1 · · · qm−h−1

︷︸︸︷qm−h · · · qm−bqm−b+1 · · · qm

=b (decimal) replace part

(2)

For example, the grey nodes in Figure. 2 (a), (b) il-lustrates the representatives for corresponding meta-serverHCN(4, l) when l equals to 0 and 1. It is obvious thatthese representatives actually takes the advantages of HCNtopology and offer good connectivity which will facilitatequery processing.

We denote the representatives of Mn as set Rn and Rn

has different number of entities in different situation. In case

0 1

2 3

4 5

6 7

8 9

10 11

12 13

14 15

64 68

72 76

128

(a). l = 0

0 1

2 3

4 5

6 7

8 9

10 11

12 13

14 15

16 17

18 19

20 21

22 23

24 25

26 27

28 29

30 31

32 33

35

36 37

38 39

40 41

42 43

44 45

46 47

48 49

50 51

52 53

54 55

56 57

58 59

60 61

62 63

128 144

160 176

34

192

(b). l = 1

Fig. 2. The Representatives for Meta-servers in HCN(4,2)

1, Rn = {R1n} while in case 2, Rn = {R1

n, R2n, R

3n}. Here

Rin stands for the server which is the ith representative of

meta-server Mn, and Rin is called an lth-level representative

if and only if Mn is a meta-server formed by an HCN(4, l)(0 ≤ l ≤ h).

Now we give some explanation on choosing represen-tatives. It is obvious to choose Sn as representatives forMn when n < 4h+1 because they are exactly the same.So we focus our discussion on case 2 here. According tothe topology of HCN, it is not hard to find that all of therepresentatives we choose for Mn (n ≥ 4h+1) are serversthat are more closely connected to other HCN(4, l)’s inthe same level. Thus, our strategy cut the cost of queriesforwarding and since there are three representatives fora single Mn, it also offers flexibility to choose the closetrepresentative or the dullest one. This strategy also fits quitewell with the multi-path routing of HCN. What’s more, thefollowing lemma and theorem show that our strategy alsooffers good scalability and balancing property.

Lemma 1. Each meta-server Mn (n ≥ 4h+1) has exactly threerepresentatives except for the highest level of meta-server withfour.

Proof: There are four different bits in quaternary numberand our strategy replace the bits with a same bit differentwith qm−b in Eqn. (2), so the result is three. The biggestM(h+1)·4h+1 is the only special one that has four representa-tives since the bit qm−b does not exist in this case. □Theorem 1. Each server Sn will be the representative for exactlytwo different meta-servers.

Proof: Firstly, it is obvious that each server should be chosenas a representative for Mn where n < 4h+1. Secondly, eachmeta-server with its number≥ 4h+1 picks disjoint groups ofrepresentatives based on Eqn. 2, which means each serverwould also be chosen as a representative for only one meta-server Mn where n ≥ 4h+1.

From Lemma 1 and Theorem 1, we can claim that in RT-HCN, the number of replications for every global indexingis well controlled as three or four, bringing in good space-efficient, and that different scales of querying tasks areevenly distributed over all the HCN nodes as different levelsof representatives.

4 INDEXING CONSTRUCTION

4.1 Potential Indexing RangeBefore we begin our discussion of index construction, weillustrate another essential concept about our meta-servers.



5

In order to construct the global index for multi-dimensionaldata in our overlay network, we have assigned each meta-server a potential indexing range. Thus, for any givenqueries, we can figure out which meta-server is responsiblefor it and then the query is processed by the representativesof that meta-server.

Definition 3. Potential indexing range is an abstract attributeassigned to meta-server and indicates which meta-server (indeedthe subordinate representatives) is (are) responsible for processinga given query.

The multi-dimensional data forms a data boundarydenoted as B, which is a k-dimensional rectangle asthe bounding box of the spatial data objects: B =(B0, B1, B2, · · · , Bd). Here d is the number of dimensionsand each Bi is a closed bounded interval [li, ui] describingthe extent of the data along dimension i. Next, we focuson the 2-dimensional situation while for d ≥ 3, we take aslightly different consideration in Section. 4.2.

When d = 2, the data is bounded by B = (B0, B1),where B0 is [l0, u0] and B1 is [l1, u1]. We calculate a quater-nary number Qn for each meta-server Mn and use it to helpfigure out the potential indexing range Bn for Mn. Supposeq0q1 · · · qm is the corresponding quaternary form for n, thenQn is calculated according to the following two cases:

1) If m ≤ h, we add h−m consecutive 0’s to the frontof q0q1 · · · qm and construct an h+ 1 bit quaternarynumber Qn = 00 · · · 0q0q1 · · · qm.

2) If m > h, q0q1 · · · qm can be split as the formexplained in Eqn. (2). In this situation, we pick outthe last h + 1 bits and delete the replace part to getQn = qm−hqm−h+1 · · · qm−b.

Now, we use the following iteratively defined functionto calculate Bn.

Bn = pir(B, Qn)

=

pir(([l0,

l0+u0

2 ], [l1,l1+u1

2 ]), Q′n), q0 = 0

pir(([ l0+u0

2 , u0], [l1,l1+u1

2 ]), Q′n), q0 = 1

pir(([l0,l0+u0

2 ], [ l1+u1

2 , u1]), Q′n), q0 = 2

pir(([ l0+u0

2 , u0], [l1+u1

2 , u1]), Q′n), q0 = 3

([l0, u0], [l1, u1]) Qn = ∅

,

(3)where Qn is a quaternary number denoted as q0q1 · · · qi

and Q′n is a newly constructed quaternary number

q1q2 · · · qi.For example in Figure. 3, two axes B0 and B1 denote

the range of data in two dimensional space (w.l.o.g., weassume it generates a square area, rather than a rectangle).Then the potential indexing range for M80 (denoted as redsquare) should be ([ l0+u0

2 , l0+3u0

4 ],[l1, 3l1+u1

4 ]); the pir forM144 (denoted as blue square) is ([ l0+u0

2 , u0], [l1, l1+u1

2 ]),while the pir for M192 is ([l0, u0], [l1, u1]).

4.2 Cumulative Mapping

This section concerns about skewed data processing. Con-sider a one-dimensional skewed dataset, and there are someservers waiting to be assigned the potential indexing ranges.If the data is sorted, we can assign each server the range

0 1

2 3

4 5

6 7

8 9

10 11

12 13

14 15

16 17

18 19

20 21

22 23

24 25

26 27

28 29

30 31

32 33

34 35

36 37

38 39

40 41

42 43

44 45

46 47

48 49

50 51

52 53

54 55

56 57

58 59

60 61

62 63

B0

B1

l0 u0

l1

u1

HCN(4,2)

HCN(4,1)

HCN(4,0)

80

144

192

Fig. 3. Potential Index Range

containing equally proportional data according to the or-dered sequence. In this way, the number of data fall intothe responsible range of each server is almost the same andskewness is eliminated. However, the ideal method is notpractical in that we must collect all the data to do the sorting,which is both storage and time intolerable.

That being said, Zhang et al. [36] offered a PiecewiseMapping Function (PMF) to achieve an approximately uni-form distribution for the skewed data. To construct thePMF, we evenly partition the data range into some buck-ets. Then the boundary coordinates of buckets are thecoordinates used to compute the PMF. Consider 10 num-bers {0.5, 0.9, 1.4, 2.5, 2.7, 3.2, 3.7, 5.4, 7.0, 8.9} distributedin [0, 10]. If we divide them into 5 buckets with boundarycoordinates 0, 2, 4, 6, 8, and 10, the count of data in eachbuckets will be 3, 4, 1, 1, 1, with a variance of 1.6. Then wecan easily get the cumulative values of the boundary coor-dinates: 0, 3

10 ,710 ,

810 ,

910 , 1. Based on the 6 pairs of boundary

coordinate and cumulative value, we can build a piecewisefunction with 5 pieces well-reasonably. Then we map all the10 real data distributed on [0, 10] through this piecewisefunction to values distributed on [0, 1], gaining respectiveapproximate cumulative value, as shown in Table. 2. Aftermapping, the same buckets gains a variance of 0, balancingthe counts perfectly.

TABLE 2Mapping for Approximate Cumulative Value

[0, 2] [2, 4] [4, 6] [6, 8] [8, 10]︷︸︸︷0.5

︷︸︸︷2.5

︷︸︸︷5.4

︷︸︸︷7.0

︷︸︸︷8.9

0.9 2.71.4 3.2

3.7

0.075 0.21 0.44 0.64 0.850.135︸︷︷︸ 0.31︸︷︷︸ 0.54︸︷︷︸ 0.77︸︷︷︸ 0.945︸︷︷︸[0, 0.2] [0.2, 0.4] [0.4, 0.6] [0.6, 0.8] [0.8, 1.0]

Formally, PMF is obtained as follows:

1) Take a random sample for the whole dataset to dothe later computation on a small scale and keep theoriginal feature of data distribution as well.

2) Equally divide the data space into N buckets, cal-culate the cumulative values of the N +1 boundary



6

coordinates of these N buckets, and build N piece-wise linear functions PMF.

3) Based on the PMF, compute the mapped value forall real data in the sample set.

4) Treat the mapped values as a new sample set, repeatstep 2 and 3 until a satisfied variance of the countsin each buckets has been achieved.

As for multi-dimensional data, we compute PMFs fordifferent dimensions in the same way.

Cumulative mapping actually applies to both skewedand uniform data, as the consequence of mapping is a nearlyuniform distribution. However, since space mapping is a rel-atively static manner, original distribution maybe essentiallychanged due to data updating, the whole indexing systemwill not benefit from the mapping any longer. To handle this,we can periodically check the variance of number of data ineach bucket, and we shut down to rebuild the system in thecase of its obvious fluctuation beyond some threshold.

4.3 Bloom FilterA Bloom filter is a space-efficient probabilistic data struc-ture, which can be used as an accelerator in point query.For a set S with n elements {e1, e2, · · · en}, a filter usea bit array of m bits, initially all set to 0. k independenthash functions are used, denoted as f1, f2, · · · , fk, eachproducing an integer in [1,m]. For each element ei, thebits at positions f1(ei), f2(ei), · · · , fk(ei) in the array of mbits are set to 1. In practical use, we need to choose theparameters m and k appropriately in order to minimize theinevitable false positives and maximize the filtering ability.According to [37], we draw several salutary lessons. For apreviously defined Bloom filter, the probability of a falsepositive for an element not in the set, or the false positiverate, can be estimated as f = (1 − e−

knm )k. On one hand,

given n and m, more hash functions provides more chancesto find a 0 bit for an element that is not a member of S,but using fewer hash functions increases the fraction of 0bits in the array. The optimal number of hash functions thatminimizes f as a function of k is ln 2 · mn . On the other hand,given n and optimal k, the length of bit array m needs tobe essentially n log2 e · log2 1

ϵ for any representation schemewith a false positive rate bounded by ϵ. For our experiments,the parameter selection for Bloom filters depends largely onthese theoretical results.

As for multi-dimension, a direct idea in [38] is to gen-erate a bloom filter for each dimension with the sameseries of hash functions. When a multi-dimensional datais to be checked, we only need to check its componentson different dimensions with corresponding bit arrays andget the intersected results of yes or no. However, thereare high probability that a misjudgment at any dimensionwould have probably brought in a final false positive. Thisis because the union information of the multi-dimensionis not fully used. Based on this observation, we decide toconcatenate all the multidimensional components togetheras a single string. Moreover, the Murmur hash we use togenerator bloom filters is very good at processing longstrings.

Specifically, bloom filter is added as an inherent propertyof our local R-tree nodes. In the construction of RT-HCN,

each selected local R-tree node will create a bit array for allthe data items it covers with Murmur hash function as itsbloom filter. Then all the created filters are published alongwith the nodes to be buffered remotely. They will serve forthe RT-HCN point query procedure. The details of indexpublishing and update of bloom filters is to be explainedlater.

4.4 Indexing PublishingAs is introduced before, each data server Sn build an R-treeindex for its local data to facilitate multi-dimensional search.Then, Sn adaptively selects a set of index nodes Nn ={N1

n, N2n, · · · , Ndn

n } from its local R-tree and publishes eachN i

n to the representatives of a specific meta-server whosepotential indexing range just covers the minimum boundingrange of N i

n. Bloom filters of the corresponding nodes isalso generated and published. The choose of the publishingnodes starts from the second level of R-tree to an endinglevel in a probabilistic manner. For each level before theending level, we publish nodes who have no publishedancestors with a fixed probability. For the ending level, wepublish all the left nodes who have no published ancestors.In this way , the principle of index completeness and uniqueindex when doing index publishing [8] are both satisfied.Meanwhile, various sizes of R-tree nodes get a chance to bepublished, which conforms to the RT-HCN’s design ideasof “indexing hierarchical management”. We set the endinglevel as antepenult level and the probability as 0.7, givingmore publishing chances for the larger nodes with thepurpose of checking the bloom filters of high level nodes assoon as possible in the point query procedure and keepingthe total number of global indexes at a relatively low level.The format of the published R-tree nodes is (n,mbr), wheren indicates the origin server for storing the data and mbr isthe minimal bounding range of the published R-tree node.After receiving the published nodes, representatives buffersthe index in memory. Alg. 1 describes the process of indexpublishing.

Algorithm 1: Index Publishing (For Sn)

1 Nn =getSelectedRTreeNode(Sn)2 for each N i

n ∈ Nn do3 Find the least n′ s.t. Bn′ fully covers N i

n.mbr4 Get the representatives Rn′ for Mn′

5 for each Sk ∈ Rn′ do6 Sk inserts (n,N i

n.mbr) into its global index set

Figure. 4 provides a simple example of a local R-tree.If the node R1 is selected to be published, it should bepublished to server S17, S18, S19, which are representativesof the HCN(4, 0) shown in red. Similarly, if the node R3 isselected to be published, it should be published to serverS16, S26, S31, which are representatives of the HCN(4, 1)shown in blue.

The average routing cost for one-to-one traffic in HCN isO(2h+1), and the cost for information transmission betweenrepresentatives of the same meta-server is o(2h+1), thus thecost to publish an index node should be O(2h+1) on aver-age, and it equals to O(

√N) when n = 4, where N is the



7

Fig. 4. Index Distribution

total number of servers in the given HCN. For more generalsituation, the cost is given by N− log2 n and can be reducedas n get larger. We also want to mention here that the costshown above is exactly physical hops between servers whileprevious works claims that it takes only O(logN) to publishindex in P2P network but they are only discussing hopsin the overlay network. Since the physical connection ofP2P network is unclear, the physical hops can be hard toexam and constrain. Another improved feature for indexpublishing in our system is that under this design we canmake sure that each index node is published to exactly onlythree servers in the system. However, the strategy used in [8]publishes the larger index nodes to many servers as long asthe range of a auxiliary circle centered with the publishingnode’s center overlaps with the responsible query range ofthem. Then, the amount of index nodes published in thesystem is hard to control.

As for index maintenance, we consider the index updatetriggered by local data insertion. If an insertion causes therange of a leaf node to be expanded, we simply publishedthis leaf node with a newly generated bloom filter to theright place, which costs 1 or 3 or 4 HCN routing messages(depending on the number of representatives) with infor-mation of the newly published node and at most dozens oflocal hashing. If an insertion does not cause any expansion,it needs to update the remote bloom filter of its publishedancestor node and there must exist such one. In this case,the updating still costs 1 or 3 or 4 HCN routing messages(depending on the number of representatives) with infor-mation of new bit array positions to be set to 1 and severaltimes of local hashing. In this way, the correctness of pointand range query can both be guaranteed. We also implementanother version of index updating when bloom filter is notused for our experiment, which is the same with RT-CAN:when published node is split due to local insertion, we justdelete the remote published ones and republish two newnodes, which costs triple HCN routing messages comparingwith the former updating method.

4.5 Multi-Dimension Indexing

When dimension is 3 or higher, we weaken the concept ofpotential indexing range. HCN can not naturally supportthe one-one mapping between topology and data space

like RT-CAN but has its own advantage that it is a recur-sively defined multi-level topology. We can use the levelsto process the dimensions. An HCN(n, h) consists of nHCN(n, h − 1), or n2 HCN(n, h − 2), etc. Similarly, a d-dimensional space can be divided into d

n or dn2 (d − 1)-

dimension space, to be further processed by HCN(n, h− 1)or HCN(n, h − 2). And so it goes on. Finally, the last 2-dimensional space is just what we have already discussed.To achieve this, we need to find an appropriate way ofallocating HCN levels.

In our design, an HCN(4, h) can be used to process dataof at most (h+ 2)-dimension. Given the top level h and thedata dimension (A1, A2, · · · , AD) (D ≤ h+2, the order heremakes no sense), we use the following principles to allocatethe total (h+ 1) HCN levels for different dimensions:

1) Let l2 be the number of levels shared by dimensionA1 and A2. For d ≥ 3, let ld be the number of levelsallocated to dimension Ad, then ld = ld−1 or ld =ld−1 − 1.

2) ΣDi=2li = h+ 1.

Take HCN(4, 2) for example. If the target dataset is two-dimensional, we calculate the Meta-servers and correspond-ing potential indexing ranges just as the previous discussiondoes, which means we allocate total 3 HCN levels for 2-dimensional data. In the case of 3-dimension, we allocate 2levels for the first two dimensions, and 1 level for the thirddimension according to the above principles. Figure. 5 (b)is the 3d viewpoint of planar HCN(4, 2) in (a) with 4 levelsof HCN(4, 1) with the ability to partition a 3-dimensionaldata space. To be simple, we only highlight the 0th and 3rdHCN(4, 1) unities as dark regions. In another word, we usefour HCN(4, 1) to divide one dimension, and use one singleHCN(4, 1) to process the other two dimensions. Above all,the previous concepts of Meta-server, representative andpotential indexing range only make sense in the singleHCN(4, 1) for two-dimension processing. As for the nodepublishing of 3-dimensional R-tree, each node appears asa cuboid in a 3-d space in Figure. 5 (b). If a selected R-tree node overlaps with some of the 4 subspaces along thez-axis, we will distribute the same copies of it into theseoverlapping subspaces for the same representatives alongthe z-axis’s direction. Generally speaking, with one more di-mension, the number of global indexes will be quadrupled.Correspondingly, querying process for multi-dimensionaldata will firstly determine which smallest subsequences for2-d processing are involved in. This step just consists of lotsof simple partitions of the search space.

0 1

2 3

4 5

6 7

8 9

10 11

12 13

14 15

16 17

18 19

20 21

22 23

24 25

26 27

28 29

30 31

32 33

34 35

36 37

38 39

40 41

42 43

44 45

46 47

48 49

50 51

52 53

54 55

56 57

58 59

60 61

62 63

(a). Planar View (b). 3D View

Fig. 5. Two Views of HCN(4, 2)



8

5 QUERY PROCESSING

RT-HCN can process kinds of queries, such as point query,range query, and k−NN query. In this section, we explainhow different queries for 2-dimensioal data are executed inRT-HCN.

The point query in RT-HCN is a two-stage processing,similar to most other two-layer indexing schemes. Firststage happens among the Meta-servers, where a query pointQ(v0, v1) is first forwarded to the nearest hth level represen-tative Snh

which represents the largest Meta-server coveringQ. Snh

returns qualified results through checking the bloomfilters of its buffered global indexing nodes, afterwards, itforwards the query to the nearest h− 1th level representatveSnh−1

which represents the h− 1th Meta-server covering Qand Snh−1

does the same querying and forwarding jobs tillthe level of a single server. After all the qualified resultsreceived, the query processing goes into the second stageof remote querying on local R-trees. In all, a point queryneeds to search h + 1 servers in total, which actually arerepresentatives from 0th to hth levels. Our index publishingscheme does bring out some extra cost for some queries, andwe present the theoretical analysis about the comparison ofthis extra routing cost with the simplest one-to-one pointquery in Section. 7.1.

The range query is a general version of the point query.We regard a range R([a0, b0], [a1, b1]) as a large queriedpoint and find out the smallest HCN(4, t) that fully coversthis large queried point. Then, a point query procedure iscarried out until the tth level. During the query before lth

level, the query is always forwarded to the Meta-serverwhose potential indexing range covers the queried rangeR while from then on, the query shall be forwarded to allthe Meta-servers whose potential indexing range overlapswith R, guaranteeing the correctness and completeness ofresults.

The k-nearest neighbours (k−NN) query returns thetop-k nearest results to the data Q(v0, · · · , vd−1) given thequery (Q, k). Since the dataset has been mapped into anapproximately evenly distributed data space, we considerthe density of data in the mapped space as ρ = K

Π(Ui−Li),

where K is the total number of data and the denominatoris the size of the whole data space. ρ is thus the averagenumber of data in a single unit of the data space. Wecan use ρ to estimate the ranges of k nearest results insuch space. We will first query the range ([v0 − γ d

√kρ , v0 +

γ d

√kρ ], · · · , [vd−1 − γ d

√kρ , vd−1 − γ d

√kρ ]), where d is the

dimension, γ d

√kρ is a uniform offset, and γ is a scaling

parameter typically equals to 0.5. If more than k results arereturned, we select k nearest ones. Otherwise, we increasethe offset linearly.

6 SCHEME GENERALIZATION

Server-centric data center topologies generally are recursivetopologies (usually recursively defined multi-level struc-tures), in which a high-level structure consists of certainlow-level structures and the structures at the same levelare connected with each other in a well-defined way. Asis mentioned in the beginning, HCN is our first target topol-ogy since it is relatively simple and regular. We further seek

for methods to build similar two-layer indexing schemes onother structured server-centric data center networks. Com-pared to developing specific indexing schemes for differenttypes of underlying topologies, a reusable and adaptablemethod for this procedure is much more admired.

DCell is a structure that has many desirable features fordata centers. It uses servers equipped with multiple networkports and mini-switches to construct its recursively definedarchitecture. Any high-level DCell is constituted by connect-ing certain number of the next lower level DCells. DCell0is the basic building block in which n servers are connectedto an n-port commodity switch. DCells at the same level arefully connected with each other. Given t servers in a DCellk,t+1 DCellk’s are used to build a DCellk+1. The t serversin a DCellk connect to the other t DCellk’s, respectively.This way, DCell achieves high scalability and high bisectionwidth.

The DCell topology is our first attempt for generalizationand is believed to be a good breakthrough point, derivedfrom some internal connection between HCN and DCell.Figure. 6 shows the topology of DCell1 with n = 4, referredas the level-1 DCell. The DCell1 consists of 5 DCell0s,ranging from DCell0[0] to DCell0[4]. In fact, we can getdifferent “pictures” of this topology from different angles.Figure. 6 (a) is the normal version of DCell1, which hasbeen the most common form well known since its birth.However, what Figure. 6 (b) illustrates is also the topologyof DCell1. This version can be transformed from Figure. 6(a) by “lifting up” one of the five building blocks (Herewe choose DCell0[0]) and doing some rearrangement onthe wiring. Such wiring rearrangement is definitely feasiblebecause as long as the fully connective feature at the samelevel is guaranteed, the topology must be a DCell. Hence,an HCN(4,1) appears in the transformational version ofDCell1. This inclusion relation is no by accident, which maybe a specific result of the small DCell1. Later, we will seethis kind of relation also exists when the DCell topologyscales up further.

DCell0[0]

DCell0[1]

DCell0[2] DCell0[3]

DCell0[4]

(a). Normal Version

DCell0[1]

DCell0[0]

DCell0[2]

DCell0[3]

DCell0[4]

(b). Transformational Version

Fig. 6. Two Forms of DCell1 Topology with n=4

Based on this internal connection between DCell andHCN, we can promote a similar two-layer indexing scheme,RT-DCell, on the DCell topology just like what we havedone for HCN. The concrete method stated here basicallyfollows the principles of RT-HCN scheme, and also bringsnew attractive features.

We take Figure. 6 (b) as an illustration for RT-DCell.The part of HCN-like topology in DCell1 is used for theindexing construction, and actually it is an HCN(4,1). Thenodes left are treated as common data nodes with no index-ing abilities. To be brief, in practice data is distributed all



9

over the nodes in DCell while part of them help to build anindexing scheme. We consider the two kinds of nodes in thefollowing definition.

Definition 4. Starter nodes are the nodes constituting the highestlevel of an HCN-like topology as long as they can in a DCelltopology. All the nodes left are called Bencher nodes.

For starter nodes, they make up of the HCN topology.As in Figure. 6 (b) for n=4, a DCell1 contains an HCN(4,1).Hence the concept of Meta-server, representative and poten-tial indexing range for RT-HCN can be used directly withoutany modification. Data is distributed all over the nodes inDCell, but the global indexing nodes are published to starternodes only.

For bencher nodes, they are not meaningless roles in RT-DCell scheme. DCell has a characteristic that all the one-level lower DCells are fully connected at any level. It meansthe HCN-like topology consisting of starter nodes is not afixed one, and the set of starter nodes is not fixed, either.In Figure. 6 (b), any four of the five DCell0s can make upof an HCN(4,1). In other words, if some DCell0 unit usedto build the HCN(4,1) does not work well due to machinefault or network congestion, it can be replaced by thebencher DCell0, keeping the whole HCN indexing schemecomplete and unaffected. We can change the HCN topologycorrectly and naturally by letting the indexing publishingprocedure to produce replicated global indexing nodes andstore redundant ones on bencher nodes. This behaviour isjust like the bench players substitute the starters, and thiscycle is sustainable.

To better understand the roles of bencher nodes in RT-DCell, we let the DCell1 expands a step further.

DCell1[0] - DCell1[15]

DCell1[20]

DCell1[16] - DCell1[19]

HCN(4,2)

HCN(4,3)

HCN(4,1)

Bencher nodes

Starter nodes

Fig. 7. DCell2 Topology with n=4

Figure. 7 shows a DCell2 topology with n=4. It consistsof 21 DCell1 numbered from 0 to 20, each of which appearsas a rectangular pyramid like the DCell1[20] at the topright corner. These DCell1s should be fully connected withpart of the links represented. A node in Figure. 7 actuallyrepresents a DCell0 unit. Based on Definition 4, starternodes are distributed in DCell1[0] to DCell1[15] limitedin the yellow range, forming a highest level of HCN(4,3).In general, green range indicates an HCN(4,2), blue range

indicates an HCN(4,1), and red range indicates an HCN(4,0).All left nodes from DCell1[16] to DCell1[20] are benchernodes. Interestingly, these bencher nodes can be elegantlydivided to form an HCN(4,2) and an HCN(4,1).

The two-layer indexing scheme normally runs on theHCN(4,3) topology, consisting of starter nodes. However,the bencher nodes offers rich alternatives for substitution.No matter what size of a sub-HCN topology in the HCN(4,3)encounters problems, bencher nodes can greatly smooth thebad effects by providing an HCN(4,1) for small scale faultsand an HCN(4,2) for larger ones. This back up functionfrom bencher nodes seems to be a customized service forthe RT-HCN indexing architecture in RT-DCell. The faultyproblem mentioned here is rather vital and makes sensein today’s data centers, and need to be treated carefullyin the theoretical design if possible. RT-DCell is the firstgeneralization from RT-HCN scheme on other server-centricdata center topologies, and we believe it is a highly fault-tolerant and robust indexing architecture.

7 PERFORMANCE EVALUATION

In this section, we evaluate the performance of RT-HCNtheoretically and experimentally.

7.1 Theoretical Analysis for Extra Routing CostWe propose a comparison of the routing cost between RT-HCN indexing routing strategy and the shortest routingstrategy here. The shortest routing strategy is the mostcommon routing strategy for ordinary distributed indexingmechanisms. It acts as a reference in the following analysis,reflecting the relative performance of the RT-HCN indexingrouting strategy. We take the point query process for illus-tration.

The RT-HCN indexing routing strategy for point querycan be described as four steps. Firstly, a query comes ata random starting node S in HCN(n, h), and S forwardsthe query to the nearest hth level representative Snh

whichis responsible for it. Secondly, Snh

forwards the query tothe nearest h− 1th level representative Snh−1

, and this steprecursively continues until the 0th level representative isreached. Thirdly, each time a representative is reached, itreturns back eligible indexing nodes to the starting nodeS. Fourthly, S search for the final result with the receivedindexing nodes.

However, if we use the shortest routing strategy for pointquery, one vital precondition is that every node has bufferedall the local indexes of all the other nodes. As long as a querycomes at a starting node S, it knows which node the finalresult is located at, with no need of any extra routing. Next,we make the extra routing cost more formalized to get abetter understanding.

Following are the computation procedures of 3 variablesdefined for the two routing strategies.

1) Lh: the average distance between any starter nodeS and the nearest hth level representative Snh

inHCN(n, h). According to whether the starter nodefalls into the same sub-HCN with the hth level rep-resentative, Lh has two probable values. It equalsto Lh−1 with the probability of 1

n , and equals to



10

Lh−1 + 2h−1 with the probability of n−1n . L0 is 0.

Thus,

Lh =n− 1

n(Lh−1 + 2h−1) +

1

nLh−1

= Lh−1 +n− 1

n2h−1 =

n− 1

n(2h − 1).

2) Rh: the average distance of the routing path whichstarts from the hth level representative Snh

, andends up at the 0th level representative Sn0 . Simi-larly, for any two representatives of two adjacentlevels, Sni and Sni−1 , the average distance has twoprobable values according to whether they falls intothe same sub HCN. Rh is their summation. Thus,

Rh =h∑

i=1

[1

n(2k − 1) +

n− 1

n2k] + 1

= 2h+1 − n+ h

n.

3) Ah: the average distance between any pair of nodesin HCN(n, h). Similarly, according to whether thetwo nodes fall into the same sub HCN, Ah has twoprobable values. In HCN(n, h), the probability ofany two nodes falling into the same HCN(n, h − 1)

is (n1)(nh

2 )

(nh+1

2 ), while the probability is (n2)(

nh

1 )(nh

1 )

(nh+1

2 )for

the contrary. A0 is 1. Thus,

Ah =

(n1

)(nh

2

)(nh+1

2

) Ah−1 +

(n2

)(nh

1

)(nh

1

)(nh+1

2

) (2Lh + 1)

=nh − 1

nh+1 − 1Ah−1 +

(n− 1)nh

nh+1 − 1(2Lh + 1)

=2n− 2

2n− 1

(2n)h+1 − 2(2n)h + 1

nh+1 − 1−

nh+1 − 2nh + 1

nh+1 − 1.

The Lh here has been defined above.

In fact, Ah stands for the routing cost of the shortestrouting strategy, while Lh and Rh stands for the first twosteps of the RT-HCN indexing routing strategy, which bringsin the major extra routing cost. Consider that HCN(n, h)usually has a small value for h, we take limits of the 3variables by tending n to ∞.

1) limn→∞

Lh = limn→∞

n−1n (2h − 1) = 2h − 1.

2) limn→∞

Rh = limn→∞

(2h+1 − n+hn ) = 2h+1 − 1.

3) limn→∞

Ah = limn→∞

( 2n−22n−1

(2n)h+1−2(2n)h+1nh+1−1

−nh+1−2nh+1

nh+1−1) = 2h+1 − 1.

The relative major extra routing cost can be representedas Lh+Rh

Ah= 1 + 2h−1

2h+1−1< 3

2 . Regarding the vital precon-dition that the RT-HCN indexing routing strategy stores farless global indexing items than the shortest routing, we candraw a conclusion that this two-layer indexing scheme hasperfect space efficiency while it may suffer a little burdenfor extra routing, which is quite reasonable.

7.2 Numerical ExperimentsTo testify the proposed indexing scheme, we evaluate RT-HCN on Amazon’s EC2 platform. We implement our index-ing system in Python 2.7.9, with bloom filters implementedin C++. The instance each has a 3.3GHz Turbo Intel Xeonprocessor, 4GB memory and 8GB EBS storage. The net-work links are 100Mbps. Experimental network scale rangesfrom 4, 16 and 64, corresponding to the total number ofservers supported by HCN(4, 0), HCN(4, 1), and HCN(4, 2).The experiments involves 4 synthetic datasets and 1 realdataset, classified into 2 groups. One group follows uni-form distribution, named as Uniform 2d, Uniform 3d, andUniform 4d, generated for 2 to 4 dimensions. The othergroup follows skewed distribution, named as Zipfian 2dand Hypsogr. Zipfian 2d is a 2-dimensional dataset, strictlyfollows zipfian distribution with skewness factor 0.8 in eachdimension. Hypsogr is a real dataset obtained from the R-tree Portal, containing 76999 two-dimensional MBRs. Foreachdataset, we generate 200000N data points, where N isthe number of instances. Particularly, for Hyposogr, 3 stepsare taken to generate 200000N data items. We first extractfrom the original MBRs for distinct 2-dimensional points.Then we do several minor offsets for each point as newdata. Finally, we randomly rearrange the data sequence.Synthetic datasets are all distributed in the range [0, 2.5].In each experiment concerning one dataset, we initiallydivide and distribute the whole dataset randomly over theinstances, making all of them maintain roughly the samenumber of data items. Table 3 is the parameters used in ourexperiments, where default values are in bold.

TABLE 3Experiment Settings

Parameter Values

Cardinality 800000, 3200000, 1280000Dimensionality 2, 3, 4Distribution Uniform, Zipfian, RealUniform Datasets Uniform 2d, Uniform 3d, Uniform 4dSkewed Datasets Zipfian 2d, HypsogrSelectivity 0.01%, 0.1%HCN Level 0, 1, 2Participants RT-HCN, RT-CANM of Local R-tree 10

Experiments are conducted in the following ways. Foreach network scale and specific dataset, we execute pointquery, range query, k−NN query, and update-point-mixquery. We interact with the running HCN topology withanother EC2 instance called client in a centralized manner,mainly in charge of query tasks distribution and informationgathering. For each experiment, client instance continuouslydistribute the query tasks randomly over the HCN clusters.Query time (seconds consumed per 1000 queries or 500queries) is used as our performance metric. We execute10000 query tasks per kind of query, and record the process-ing time every 1000 queries. Each test is repeated 10 timesand the average results are used.

7.2.1 Query PerformanceWe test the performance of 3 query types on Uniform 2dand Hypsogr datasets.

For point query, the query sets are generated through arandom extraction from the target dataset. Considering that



11

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

4 16 64

Quer

y

Tim

e (s

)

Number of HCN Nodes

uniform_2d

hypsogr

Fig. 8. Performance of Point Query

3

6

9

12

15

4 16 64

Quer

y

Tim

e (s

)

Number of HCN Nodes

uniform_2d (0.1%)uniform_2d (1%)hypsogr (0.1%)hypsogr (1%)

Fig. 9. Performance of Range Query

0

2

4

6

8

10

4 16 64

Quer

y

Tim

e (s

)

Number of HCN Nodes

uniform_2d

hypsogr

Fig. 10. Performance of k−NN Query(k=8)

0

2

4

6

8

10

2 4 6 8

Quer

y

Tim

e (s

)

Number of HCN Nodes

uniform_2d

hypsogr

Fig. 11. Performance of k−NN Query(64 Nodes)

0

1

2

3

4

5

6

7

4 16 64Q

uer

y

Tim

e (s

)Number of HCN Nodes

pure load (with BF)mix load (with BF)pure load (no BF)mix load (no BF)

Fig. 12. Performance of Update

0

0.2

0.4

0.6

0.8

1

1.2

1.4

4 16 64

Quer

y

Tim

e (s

)

Number of HCN Nodes

uniform_2duniform_3duniform_4d

Fig. 13. Muilti-Dimension

the dense portion of a dataset is more likely to be queried,this method of query generation is natural. For RT-HCN,point query is not a special type of range query due tothe bloom filters. The filters carry more reliable informationabout the existence of a point, saving much time wastedin the remote verification. Figure. 8 shows the performanceof point query in different network scales. Processing timeper 1000 queries is the metric. Regarded collectively, pointquery gets acceleration to some extent as network scalesup. This is because the bloom filters play great roles inthe parallel processing. Considering the data features andindexing publishing rules, each representative will bufferlots of global indexes with many interval overlaps. Givena querying point in a uniformly distributed dataset, wehave reasons to believe that almost all the servers may storeproximal points in the worst case, leading to the verificationwork spread all over the network. Theoretically, this methodcannot make full use of the parallelism of the scaling net-work. What bloom filters contribute is the sharp cutoff of thecandidate locations to be verified. For Uniform 2d, as thenetwork scales from 4 to 16 and 16 to 64, RT-HCN obtainsa speedup of 2.15 and 1.64, respectively. As for Hypsogr,the performance trend of point query is similar with theuniform dataset but a little higher in value. This has tworeasons. One is the cumulative mapping helps to handleskewed datasets just as the uniform distribution. Anotherreason is that each point in the real dataset occupys morebytes, which may affect the efficiency of hashing in bloomfilters.

For range query, the query sets are generated throughdifferent selectivities, which are defined as the percentageof searched space. Figure. 9 shows the performance of rangequery in different network scales. The metric is processingtime per 500 queries. For Uniform 2d and 0.01% selectiv-ity, when network scales 16 times up, we observe a totalspeedup of 1.39 times. A similar performance appears inhypsogr. However, when selectivity increases up to tenfold,the speedup almost disappears. Actually, the first step of RT-HCN querying algorithm concerning four top-level repre-

sentatives do not benefit from the network scaling up. Whenrange query involves small searched space, at least not allthe servers are to be involved, and the parallelism can makesome contribution. But as the selectivity increases, the worstcase of searching spread all over the network will frequentlyoccur in range query, which makes the overall performancelargely depend on the first stage of candidate global indexsearching, leaving the four starting nodes as bottlenecks.As for RT-CAN with a similar indexing structure, it hasdifferent behaviors for range query, which will be shownlater.

For k−NN query, query sets are basically the same aspoint query, with extra parameter k. As stated before, weuse an efficient estimation of inital queried range throughthe density of data in the mapped space. Processing timeper 1000 queries is used. In Figure. 10 , we show the perfor-mance in different network scales when k = 8. For uniformdataset, performance improves slowly like the range querybut the values are better. This is because the k−NN searchrange is much smaller. Real dataset has a similar perfor-mance as the estimation of initial queried range also appliesfor skewed data due to the work of cumulative mapping.Figure. 11 shows the performance of various values of kfrom 2 to 8 for the two datasets. Network scale is set to 64.The query time increases when parameter k increases, as thesearch space increases with k.

7.2.2 Update PerformanceFor index update experiment, we generate a mixed loadof point queries and point insertions. To be specific, ourclient instance distribute the same querying tasks as in thepoint query test and generates a random local R-tree pointinsertion every 5 queries. To have a better view of updateperformance, we also do a comparative experiment withno bloom filters. Processing time per 1000 queries is used.Results are shown in Figure. 12, where pure load is just thepoint query tasks. From the perspective of pure load, wesee clearly how fast bloom filters are. In network scale of4, point query performance of RT-HCN with bloom filtersis 3.21 times the no-bloom-filter version. With the network



12

0

2

4

6

8

10

12

4 16 64

Quer

y

Tim

e (s

)

Number of HCN Nodes

RT-HCN (with BF)RT-HCN (no BF)RT-CAN

Fig. 14. Comparison of Point Query(Uniform Dataset)

0

3

6

9

12

15

4 16 64

Quer

y

Tim

e (s

)

Number of HCN Nodes

RT-HCN

RT-CAN

Fig. 15. Comparison of Range Query(Uniform Dataset)

0

2

4

6

8

10

12

4 16 64

Quer

y

Tim

e (s

)

Number of HCN Nodes

RT-HCN (with BF)

RT-HCN (no BF)

RT-CAN

Fig. 16. Comparison of Point Query(Zipfian Dataset)

growing up, RT-HCN with bloom filters continuously bene-fits from the parallelism and keeps gaining speedup, whilethe performance of no-bloom-filter RT-HCN grows slowlyowing to the similar reasons as range query. As for themixed load containing 20% data insertions, there are twokinds of performance. For RT-HCN without bloom filters,updating brings better performance, while for RT-HCN withbloom filters, they are very close. When filter is not used, theglobal updating is node-oriented, concerning the deletionand republishing of related global nodes through at most 4routing messages, faster than the point query. When filter isused, the global updating is bit-array-oriented, concerningthe new positions to be set to 1 of related published bloomfilters through at most 4 routing messages. However, theupdate of published bloom filters still needs a linear scan asquery. In all, RT-HCN can handle both queries and updatesefficiently.

7.2.3 Multi-dimensional PerformanceWe also conduct a simple experiment to verify the index-ing ability for other dimensions. Based on the desingingprinciples for dimension greater than 2, the HCN levelsare allocated for extra dimensions. Thus, network scale of4 only supports Uniform 2d. As the the network scales upfor one level, a new dataset with one more dimension issupported. We present the results of point query for threeuniform datasets in Figure. 13. Processing time per 1000queries is the metric. Undoubtedly, 2-dimension has the bestperformance, since RT-HCN works in two different stylesfor 2-dimension and other dimensions. When doing queriesfor dimensions greater than 2, lots of partition work needsto be done before locating a region for further 2-dimensionalprocessing.

7.2.4 Comparison with RT-CANRT-CAN is a well known two-layer indexing schemes onP2P networks, sharing similar designing philosophy withRT-HCN. It is the most relevant work we have learned aboutso far that can be experimentally implemented to comparewith for our scheme. Hence we actually make RT-CAN runon the same batch of EC2 instances as RT-HCN, treatingthem to be logically connected like the P2P topology. Weimplement the main querying functions for RT-CAN inpython, except for the index tuning mechanism, becausewe do not involve dynamic updating behaviors but staticpoint and range querying tasks in this experiment. Threeother notes need to be declared here. First, our local R-treehas at least 5 or 6 levels with branch limitation M set to10, which is different from 3 levels in the RT-CAN paper.Second, the parameter Rmax is set as the half of the side

length of each CAN node zone, since the selection of Rmax

in RT-CAN paper has no common sense for different sizes oflocal R-tree. Third, we published the antepenult level nodesof local R-tree as global index, rather than the last but onelevel. We compare RT-HCN and RT-CAN on Uniform 2dand Zipfian 2d for different network scales. Point and rangequery are both tested. Processing time per 1000 queries isused for point query while for range query, the metric is per500 queries.

Figure. 14 and Figure. 15 show their performances ofpoint and range query for Uniform 2d, respectively. InFigure. 14, we find that RT-CAN does not act very well,even slower than the RT-HCN without bloom filters. How-ever, we do not regard this as the real performance of RT-CAN. Having checked out the number of global indexesmaintained on each server, we discover that each RT-CANserver buffers global indexes 10 times more than RT-HCNserver on average, which slow down the first stage of two-layer indexing. The situation may result from our selectionsof indexing publishing levels and Rmax. We try differentsettings and never get a remarkable improvement. Fromanother point of view, as the network grows up, the speedupof RT-CAN is better than RT-HCN, because RT-CAN hasno bottleneck when doing global indexing search like thefour top-level representatives in RT-HCN. But the numberof global index buffered in RT-CAN servers is really hard tocontrol due to its arbitrary redundancy of published nodesand often stays at a high value, its absolute query perfor-mance is limited. As for range query in Figure. 15, a similarpattern appears. Figure. 16 shows the point query perfor-mance for Zipfian 2d. For RT-HCN, cumulative mappinghas made the zipfian distribution equivalent to the uniformdistribution, thus RT-HCN has almost the same queryingperformance as Figure. 14. For RT-CAN, it does not offersany preprocessing mechanisms for data skeweness. It isjust passively adapted to handle skewness, thus sufferinga performance degradation.

In short, RT-CAN indeed perform better than RT-HCNin regard of performance speedup as the network scale in-creases. However, the absolute performance depends largelyon the selections of index pulishing levels and Rmax whichdecide the total number of published global indexing nodes.Moreover, RT-CAN is bad at processing skewed data. ForRT-HCN, we equip it with bloom filters to bypass thetop-level bottleneck and regain speedup for point query,achieving tremendous performance improvement. With thehelp of representative mechanisms and specific selection ofpublishing levels, the number of published global indexes iscontrolled at a relatively low level, which also point out an-other significant fact: considering the enormous redundancy



13

of published nodes, bloom filters are not appropriate for RT-CAN. Last but not least, RT-HCN use cumulative mappingto efficiently process data skewness.

8 CONCLUSION

In this paper, we proposed an indexing scheme namedRT-HCN for multi-dimensional query processing in datacenters, which are the infrastructures for building cloudstorage systems and are interconnected using a specificdata center network (DCN). RT-HCN is a two-layer in-dexing scheme, which integrates HCN-based routing pro-tocol [27] and the R-Tree based indexing technology, andis portionably distributed on every server. Based on thecharacteristics of HCN, we present a specialized mappingtechnique to improve global index allocation in the network,resulting query-efficiency and load-balancing for the cloudsystem. We also combine practical techniques in face of dataskewness and querying false positives, greatly increasingthe adaptability and querying performance of RT-HCN. Weprove theoretically that RT-HCN is both query-efficient andspace-efficient, by which each server will only maintain aconstrained number of indices while a large number of userscan concurrently process queries with low routing cost. Wealso give an insight into the inner relation between the HCNand other data center topologies, and apply the two-layerindexing scheme to the DCell topology in a generalizedway. We compare our design with RT-CAN [8], a similar de-sign for traditional P2P network. Experiments validate theefficiency of our proposed scheme and depict its potentialimplementation in data centers.

REFERENCES

[1] S. Ghemawat, H. Gobioff, and S. T. Leung. The googlefile system. In ACM SIGOPS operating systems review,volume 37, pages 29–43, 2003.

[2] G. DeCandia, D. Hastorun, M. Jampani, et al. Dynamo:amazon’s highly available key-value store. In ACMSIGOPS Operating Systems Review, volume 41, pages205–220, 2007.

[3] A. Lakshman and P. Malik. Cassandra: a decentralizedstructured storage system. ACM SIGOPS OperatingSystems Review, 44:35–40, 2010.

[4] D. Beaver, S. Kumar, H. C. Li, et al. Finding a needlein haystack: Facebook’s photo storage. In OSDI, vol-ume 10, pages 1–8, 2010.

[5] J. Baker, C. Bond, J. C. Corbett, et al. Megastore: Pro-viding scalable, highly available storage for interactiveservices. In CIDR, volume 11, pages 223–234, 2011.

[6] J. C. Corbett, J. Dean, M. Epstein, et al. Spanner:Googles globally distributed database. ACM Transac-tions on Computer Systems (TOCS), 31:8, 2013.

[7] S. Wu and K. L. Wu. An indexing framework forefficient retrieval on the cloud. IEEE Data Eng. Bull.,32:75–82, 2009.

[8] J. Wang, S. Wu, H. Gao, et al. Indexing multi-dimensional data in a cloud system. In Proceedingsof the 2010 ACM SIGMOD International Conference onManagement of data, pages 591–602, 2010.

[9] S. Wu, D. Jiang, B. C. Ooi, and K. L. Wu. Efficient b-treebased indexing for cloud data processing. Proceedingsof the VLDB Endowment, 3:1207–1218, 2010.

[10] G. Chen, H. T. Vo, S. Wu, et al. A framework forsupporting dbms-like indexes in the cloud. Proceedingsof the VLDB Endowment, 4:702–713, 2011.

[11] H. V. Jagadish, B. C. Ooi, and Q. H. Vu. Baton: Abalanced tree structure for peer-to-peer networks. InProceedings of the 31st international conference on Verylarge data bases, pages 661–672, 2005.

[12] S. Ratnasamy, P. Francis, M. Handley, and et.al. Ascalable content-addressable network, volume 31. ACM,2001.

[13] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable,commodity data center network architecture. ACMSIGCOMM Computer Communication Review, 38:63–74,2008.

[14] A. Greenberg, J. R. Hamilton, N. Jain, et al. Vl2: ascalable and flexible data center network. In ACMSIGCOMM computer communication review, volume 39,pages 51–62, 2009.

[15] M. Walraed-Sullivan, A. Vahdat, and K. Marzullo. As-pen trees: balancing data center fault tolerance, scala-bility and cost. In Proceedings of the ninth ACM confer-ence on Emerging networking experiments and technologies,pages 85–96, 2013.

[16] C. Guo, H. Wu, K. Tan, et al. Dcell: a scalable andfault-tolerant network structure for data centers. ACMSIGCOMM Computer Communication Review, 38:75–86,2008.

[17] D. Li, C. Guo, H. Wu, et al. Ficonn: Using backupport for server interconnection in data centers. In IEEEINFOCOM, pages 2276–2285, 2009.

[18] D. Li, C. Guo, H. Wu, et al. Scalable and cost-effectiveinterconnection of data-center servers using dual serverports. IEEE/ACM Transactions on Networking, 19:102–114, 2011.

[19] Y. Liao, D. Yin, and L. Gao. Dpillar: Scalable dual-port server interconnection for data center networks. InProceedings of 19th International Conference on ComputerCommunications and Networks (ICCCN), pages 1–6, 2010.

[20] J. Wu D. Li. On the design and analysis of datacenter network architectures for interconnecting dual-port servers. In Proceedings IEEE INFOCOM, pages1851–1859, 2014.

[21] C. Guo, G. Lu, D. Li, et al. Bcube: a high performance,server-centric network architecture for modular datacenters. ACM SIGCOMM Computer Communication Re-view, 39:63–74, 2009.

[22] H. Wu, G. Lu, D. Li, et al. Mdcube: a high performancenetwork structure for modular data center interconnec-tion. In Proceedings of the 5th international conference onEmerging networking experiments and technologies, pages25–36, 2009.

[23] D. Li, M. Xu, H. Zhao, and X. Fu. Building mega datacenter from heterogeneous containers. In 19th IEEE In-ternational Conference on Network Protocols (ICNP), pages256–265, 2011.

[24] X. Q. Liu, S. B. Yang, L. M. Guo, et al. Snowflake:a new-type network structure of data center. JisuanjiXuebao(Chinese Journal of Computers), 34:76–86, 2011.



14

[25] Z. Ding, D. Guo, X. Liu, et al. A mapreduce-supportednetwork structure for data centers. Concurrency andComputation: Practice and Experience, 24:1271–1295, 2012.

[26] D. Guo, T. Chen, D. Li, et al. Bcn: Expansible net-work structures for data centers using hierarchicalcompound graphs. In Proceedings IEEE INFOCOM,pages 61–65, 2011.

[27] D. Guo, T. Chen, D. Li, et al. Expandable and cost-effective network structures for data centers usingdual-port servers. IEEE Transactions on Computers,62:1303–1317, 2013.

[28] M. Lupu, B. C. Ooi, and Y. C. Tay. Paths to stardom:calibrating the potential of a peer-based data manage-ment system. In Proceedings of the 2008 ACM SIGMODinternational conference on Management of data, pages265–278, 2008.

[29] X. Gao, B. Li, Z. Chen, M. Yin, G. Chen, and Y. Jin. Ft-index: A distributed indexing scheme for switch-centriccloud storage system. In review for ICC, 2015.

[30] Y. Liu, X. Gao, and G. Chen. Design and optimizationfor distributed indexing scheme in switch-centric cloudstorage system. ISCC, 2015.

[31] A. Guttman. R-trees: a dynamic index structure for spatialsearching, volume 14. ACM, 1984.

[32] N. Beckmann, H. P. Kriegel, R. Schneider, and B. Seeger.The R*-tree: an efficient and robust access method for pointsand rectangles, volume 19. ACM, 1990.

[33] S. Saltenis, C. S. Jensen, S. T. Leutenegger, et al. Indexingthe positions of continuously moving objects, volume 29.ACM, 2000.

[34] B. Kao, S. D. Lee, F. K. F. Lee, et al. Clustering un-certain data using voronoi diagrams and r-tree index.Knowledge and Data Engineering, IEEE Transactions on,22:1219–1233, 2010.

[35] K. Zheng, X. Zhou X, P. C. Fung, et al. Spatial queryprocessing for fuzzy objects. The VLDB JournalTheInternational Journal on Very Large Data Bases, 21:729–751, 2012.

[36] R. Zhang, J. Qi, M. Stradling, et al. Towards a painlessindex for spatial objects. ACM Transactions on DatabaseSystems (TODS), 39:19, 2014.

[37] A. Border and M. Mitzenmacher. Network applicationsof bloom filters: A survey. Internet mathematics, 1:485–509, 2003.

[38] D. Guo, J. Wu, H. Chen, and X. Luo. Theory andnetwork applications of dynamic bloom filters. IEEEINFOCOM, pages 1–12, 2006.

Yang Hong is a postgraduate student in the De-partment of Computer Science and Engineeringat Shanghai Jiao Tong University.P.R.China. Heobtained his B.S. degree in Computer Scienceand Technology from Nanjing University in 2014.His research interests include data communica-tion and engineering, data center networks anddistributed systems.

Qiwei Tang is a undergraduate student in theDepartment of Computer Science and Engineer-ing at Shanghai Jiao Tong University, China. Hisresearch interests include data communicationand engineering, data center networks virtualreality.

Xiaofeng Gao received the B.S. degree in infor-mation and computational science from NankaiUniversity, China; the M.S. degree in operationsresearch and control theory from Tsinghua Uni-versity, China; and the Ph.D. degree in computerscience from The University of Texas at Dallas,Richardson, USA. She is an Associate Profes-sor with the Department of Computer Scienceand Engineering, Shanghai Jiao Tong University,China. Her research interests include data en-gineering, wireless communications, and combi-

natorial optimizations. She has produced a number of research papersin indexing and data retrieval, distributed database systems, and algo-rithm design.

Bin Yao received the BS degree and the MS de-gree in computer science from the South ChinaUniversity of Technology in 2003 and 2007, andthe PhD degree in computer science from theFlorida State University in 2011. He has been anassociate professor in the Department of Com-puter Science and Engineering, Shanghai JiaoTong University since 2011. His research inter-estes are management and indexing of largedatabases, and scalable data analytics.

Guihai Chen obtained his BS degree from Nan-jing University, M. Engineering from SoutheastUniversity, and PhD from University of HongKong. He visited Kyushu Institute of Technology,Japan in 1998 as a research fellow, and Univer-sity of Queensland, Australia in 2000 as a visitingProfessor. He is a Distinguished Professor withthe Department of Computer Science, ShanghaiJiao Tong University. Prof. Chen has publishedmore than 280 papers in peer-reviewed journalsand refereed conference proceedings in the ar-

eas of wireless sensor networks, high-performance computer architec-ture, peer-to-peer computing and performance evaluation.

Shaojie Tang is currently an assistant profes-sor of Naveen Jindal School of Management atUniversity of Texas at Dallas. He received hisPhD in computer science from Illinois Instituteof Technology in 2012. His research interest in-cludes social networks, mobile commerce, gametheory, e-business and optimization. He receivedthe Best Paper Awards in ACM MobiHoc 2014and IEEE MASS 2013. He also received theACM SIGMobile service award in 2014. Tangserved in various positions (as chairs and TPC

members) at numerous conferences, including ACM MobiHoc, IEEEINFOCOM, IEEE ICDCS and IEEE ICNP.

Efﬁcient R-Tree Based Indexing Scheme for Server-Centric ... · two-layer indexing scheme for...

Documents

Transcript of Efﬁcient R-Tree Based Indexing Scheme for Server-Centric ... · two-layer indexing scheme for...