
Analysis of the Clustering Properties of the Hilbert Space-Filling Curve

Bongki Moon, H.V. Jagadish, Christos Faloutsos, Member, IEEE, and

Joel H. Saltz, Member, IEEE

Abstract: Several schemes for the linear mapping of a multidimensional space have been proposed for various applications, such as access methods for spatio-temporal databases and image compression. In these applications, one of the most desired properties of such linear mappings is clustering, which means that the locality between objects in the multidimensional space is preserved in the linear space. It is widely believed that the Hilbert space-filling curve achieves the best clustering [1], [14]. In this paper, we analyze the clustering property of the Hilbert space-filling curve by deriving closed-form formulas for the number of clusters in a given query region of an arbitrary shape (e.g., polygons and polyhedra). Both the asymptotic solution for the general case and the exact solution for a special case generalize previous work [14]. They agree with the empirical results that the number of clusters depends on the hypersurface area of the query region and not on its hypervolume. We also show that the Hilbert curve achieves better clustering than the z curve. From a practical point of view, the formulas given in this paper provide a simple measure that can be used to predict the required disk access behavior and, hence, the total access time.

Index Terms: Locality-preserving linear mapping, range queries, multiattribute access methods, data clustering, Hilbert curve, space-filling curves, fractals.

1 INTRODUCTION

THE design of multidimensional access methods is difficult compared to one-dimensional cases because there is no total ordering that preserves spatial locality. Once a total ordering is found for a given spatial or multidimensional database, one can use any one-dimensional access method (e.g., B+-tree), which may yield good performance for multidimensional queries. An interesting application of the ordering arises in a multidimensional indexing technique proposed by Orenstein [19]. The idea is to develop a single numeric index on a one-dimensional space for each point in a multidimensional space, such that, for any given object, the range of indices, from the smallest index to the largest, includes few points not in the object itself.

Consider a linear traversal or a typical range query for a database where record signatures are mapped with multiattribute hashing [24] to buckets stored on disk. The linear traversal specifies the order in which the objects are fetched from disk, as well as the number of blocks fetched. The number of nonconsecutive disk accesses will be determined by the order of blocks fetched. Although the order of blocks fetched is not explicitly specified in the range query, it is reasonable to assume that the set of blocks fetched can be rearranged into a number of groups of consecutive blocks by a database server or disk controller mechanism [25]. Since it is more efficient to fetch a set of consecutive disk blocks rather than a randomly scattered set, in order to reduce additional seek time, it is desirable that objects close together in a multidimensional attribute space also be close together in the one-dimensional disk space. A good clustering of multidimensional data points on the one-dimensional sequence of disk blocks may also reduce the number of disk accesses that are required for a range query.

In addition to the applications described above, several other applications also benefit from a linear mapping that preserves locality:

1. In traditional databases, a multiattribute data space must be mapped into a one-dimensional disk space to allow efficient handling of partial-match queries [22]; in numerical analysis, large multidimensional arrays [6] have to be stored on disk, which is a linear structure.

2. In image compression, a family of methods uses a linear mapping to transform an image into a bit string; subsequently, any standard compression method can be applied [18]. A good clustering of pixels will result in fewer, longer runs of similar pixel values, thereby improving the compression ratio.

3. In geographic information systems (GIS), run-encoded forms of image representations are ordering-sensitive, as they are based on representations of the image as sets of runs [1].

4. Heuristics in computational geometry problems use a linear mapping. For example, for the traveling salesman problem, the cities are linearly ordered and visited accordingly [2].

124 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 13, NO. 1, JANUARY/FEBRUARY 2001

. B. Moon is with the Department of Computer Science, University of Arizona, Tucson, AZ 85721-0077. E-mail: [email protected].

. H.V. Jagadish is with the EECS Department, University of Michigan, Ann Arbor, MI 48109-2122. E-mail: [email protected].

. C. Faloutsos is with the Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3891. E-mail: [email protected].

. J.H. Saltz is with the Department of Computer Science, University of Maryland, College Park, MD 20742. E-mail: [email protected].

Manuscript received 18 Mar. 1996; revised 9 Jan. 1998; accepted 19 Aug. 1999. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 104359.

1041-4347/01/$10.00 © 2001 IEEE


5. Locality-preserving mappings are used for bandwidth reduction of digitally sampled signals [4] and for graphics display generation [20].

6. In scientific parallel processing, locality-preserving linearization techniques are widely used for dynamic unstructured mesh partitioning [17].

Sophisticated mapping functions have been proposed in the literature. One, based on interleaving bits from the coordinates, which is called z-ordering, was proposed in [19]. An improvement was suggested by Faloutsos [8], using Gray coding on the interleaved bits. A third method, based on the Hilbert curve [13], was proposed for secondary key retrieval [11]. In a mathematical context, these three mapping functions are based on different space-filling curves: the z curve, the Gray-coded curve, and the Hilbert curve, respectively. Fig. 1 illustrates the linear orderings yielded by the space-filling curves for a 4×4 grid.
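The bit-interleaving construction behind z-ordering can be sketched in a few lines. This is an illustrative sketch, not the authors' code; the function name and the restriction to a 2-D grid whose side is a power of two are our assumptions:

```python
def z_index(x: int, y: int, bits: int) -> int:
    """Interleave the bits of x and y (y taken as the higher bit of each pair)."""
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)
        z |= ((y >> b) & 1) << (2 * b + 1)
    return z

# Linear order in which the z curve visits the 16 cells of a 4x4 grid:
order = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: z_index(c[0], c[1], 2))
print(order[:4])  # → [(0, 0), (1, 0), (0, 1), (1, 1)]
```

The first four cells form the lower 2×2 quadrant, reflecting the recursive quadrant-by-quadrant traversal of the z curve.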

It was shown that, under most circumstances, the linear mapping based on the Hilbert space-filling curve outperforms the others in preserving locality [14]. In this paper, we provide analytic results on the clustering effects of the Hilbert space-filling curve, focusing on arbitrarily shaped range queries, which require the retrieval of all objects inside a given hyperrectangle or polyhedron in multidimensional space.

For purposes of analysis, we assume a multidimensional space with finite granularity, where each point corresponds to a grid cell. The Hilbert space-filling curve imposes a linear ordering on the grid cells, assigning a single integer value to each cell. Ideally, it is desirable to have mappings that result in fewer disk accesses. The number of disk accesses, however, depends on several factors, such as the capacity of the disk pages, the splitting algorithm, the insertion order, etc. Here, we use the average number of clusters, or continuous runs, of grid points within a subspace representing a query region as the measure of the clustering performance of the Hilbert curve. If each grid point is mapped to one disk block, this measure exactly corresponds to the number of nonconsecutive disk accesses, which involve additional seek time. This measure is also highly correlated with the number of disk blocks accessed since (with many grid points in a disk block) consecutive points are likely to be in the same block, while points across a discontinuity are likely to be in different blocks. This measure is used only to render the analysis tractable; some weaknesses of this measure were discussed in [14].

Definition 1.1. Given a d-dimensional query, a cluster is defined to be a group of grid points inside the query that are consecutively connected by a mapping (or a curve).

For example, there are two clusters in the z curve (Fig. 2a), but only one in the Hilbert curve (Fig. 2b) for the same two-dimensional rectangular query Sx × Sy. Now, the problem we will investigate is formulated as follows: Given a d-dimensional rectilinear polyhedron representing a query, find the average number of clusters inside the polyhedron for the Hilbert curve.
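The cluster count of Definition 1.1 is straightforward to compute for any concrete mapping: sort the linear indices of the grid points inside the query and count the maximal runs of consecutive integers. A minimal sketch, using the z curve purely for illustration (all helper names are ours, not from the paper):

```python
def z_index(x, y, bits):
    """z-ordering by bit interleaving (illustrative helper)."""
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)
        z |= ((y >> b) & 1) << (2 * b + 1)
    return z

def count_clusters(indices):
    """Number of maximal runs of consecutive integers among the indices."""
    s = sorted(indices)
    return sum(1 for i, v in enumerate(s) if i == 0 or v != s[i - 1] + 1)

# A 2x2 query anchored at (1, 1) on a 4x4 grid: its four cells receive the
# z-curve indices 3, 6, 9, 12, i.e., four separate clusters.
query = [(x, y) for x in (1, 2) for y in (1, 2)]
print(count_clusters(z_index(x, y, 2) for x, y in query))  # → 4
```

The same counting routine works for any curve once a cell-to-index mapping is available; only the index function changes.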


Fig. 1. Illustration of space-filling curves.

Fig. 2. Illustration of (a) two clusters for the z curve and (b) one cluster for the Hilbert curve.


The definition of the d-dimensional rectilinear polyhedron is given in Section 3. Note that in the d-dimensional space with finite granularity, for any d-dimensional object, such as spheres, ellipsoids, quadric cones, etc., there exists a corresponding (rectilinear) polyhedron that contains exactly the same set of grid points inside the given object. Thus, the solution to the problem above will cover more general cases concerning any simple connected object of arbitrary shape.

The rest of the paper is organized as follows: Section 2 surveys historical work on space-filling curves and other related analytic studies. Section 3 presents an asymptotic formula for the average number of clusters for d-dimensional range queries of arbitrary shape. Section 4 derives a closed-form exact formula for the average number of clusters in a two-dimensional space. In Section 5, we provide empirical evidence to demonstrate the correctness of the analytic results for various query shapes. Finally, in Section 6, we discuss the contributions of this paper and suggest future work.

2 HISTORICAL SURVEY AND RELATED WORK

Peano, in 1890, discovered the existence of a continuous curve which passes through every point of a closed square [21]. According to Jordan's precise notion (in 1887) of continuous curves, Peano's curve is a continuous mapping of the closed unit interval I = [0, 1] onto the closed unit square S = [0, 1]^2. Curves of this type have come to be called Peano curves or space-filling curves [28]. Formally,

Definition 2.1. If a mapping f : I → E^n (n ≥ 2) is continuous, and f(I), the image of I under f, has positive Jordan content (area for n = 2 and volume for n ≥ 3), then f(I) is called a space-filling curve. E^n denotes an n-dimensional Euclidean space.

Although Peano discovered the first space-filling curve, it was Hilbert, in 1891, who first recognized a general geometric procedure that allows the construction of an entire class of space-filling curves [13]. If the interval I can be mapped continuously onto the square S, then after partitioning I into four congruent subintervals and S into four congruent subsquares, each subinterval can be mapped continuously onto one of the subsquares. If this is carried on ad infinitum, I and S are partitioned into 2^{2n} congruent replicas for n = 1, 2, 3, ..., ∞. Hilbert demonstrated that the subsquares can be arranged so that the inclusion relationships are preserved, that is, if a square corresponds to an interval, then its subsquares correspond to the subintervals of that interval. Fig. 3 describes how this process is to be carried out for the first three steps. It has been shown that the Hilbert curve is a continuous, surjective, and nowhere differentiable mapping [26]. However, Hilbert gave the space-filling curve in a geometric form only, for mapping I onto S (i.e., two-dimensional Euclidean space). The generation of a three-dimensional Hilbert curve was described in [14], [26]. A generalization of the Hilbert curve, in an analytic form, for higher-dimensional spaces was given in [5].

In this paper, a d-dimensional Euclidean space with finite granularity is assumed. Thus, we use the kth order approximation of a d-dimensional Hilbert space-filling curve (k ≥ 1 and d ≥ 2), which maps an integer set [0, 2^{kd} - 1] into a d-dimensional integer space [0, 2^k - 1]^d.

Notation 2.1. For k ≥ 1 and d ≥ 2, let H^d_k denote the kth order approximation of a d-dimensional Hilbert space-filling curve, which maps [0, 2^{kd} - 1] into [0, 2^k - 1]^d.

The drawings of the first, second, and third steps of the Hilbert curve in Fig. 3 correspond to H^2_1, H^2_2, and H^2_3, respectively.
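For the two-dimensional case, H^2_k can be computed with the widely used iterative rotate/reflect conversion between a Hilbert index and grid coordinates. The paper defines the curve only geometrically; the routine below is the standard published algorithm for one common orientation convention, not the authors' code:

```python
def hilbert_d2xy(k: int, t: int):
    """Map a Hilbert index t in [0, 4**k - 1] to a cell (x, y) of the
    2**k x 2**k grid (kth order 2-D curve, one common orientation)."""
    x = y = 0
    s = 1
    while s < 2 ** k:
        rx = 1 & (t // 2)          # quadrant bits at the current level
        ry = 1 & (t ^ rx)
        if ry == 0:                # rotate/reflect the lower quadrants
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

path = [hilbert_d2xy(2, t) for t in range(16)]
print(path[:4])  # → [(0, 0), (1, 0), (1, 1), (0, 1)]
```

Consecutive indices always map to edge-adjacent cells, which is the unit-increment property the analysis in Section 3 relies on; the inverse mapping (coordinates to index) follows the same loop from the top level down.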

Jagadish [14] compared the clustering properties of several space-filling curves by considering only 2×2 range queries. Among the z curve (2.625), the Gray-coded curve (2.5), and the Hilbert curve (2), the Hilbert curve was the best in minimizing the number of clusters. The numbers within the parentheses are the average numbers of clusters for 2×2 range queries. Rong and Faloutsos [23] derived a closed-form expression of the average number of clusters for the z curve, which gives 2.625 for 2×2 range queries (exactly the same as the result given in [14]) and, in general, approaches one-third of the perimeter of the query rectangle plus two-thirds of the side length of the rectangle in the unfavored direction. Jagadish [16] derived closed-form, exact expressions of the average number of clusters for the Hilbert curve in a two-dimensional grid, but only for 2×2 and 3×3 square regions. This is a special case of the more general formulae derived in this paper.
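The quoted averages can be checked numerically. The sketch below scans every position of a 2×2 window on a 64×64 grid and compares the mean cluster counts under the z curve and the Hilbert curve; on a finite grid the values only approximate the limiting 2.625 and 2, so this is an illustration rather than a derivation (all helper names are ours, and the Hilbert mapping is the standard iterative algorithm, not the authors' code):

```python
def z_index(x, y, bits):
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)
        z |= ((y >> b) & 1) << (2 * b + 1)
    return z

def hilbert_d2xy(k, t):
    x = y = 0
    s = 1
    while s < 2 ** k:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

def count_clusters(indices):
    s = sorted(indices)
    return sum(1 for i, v in enumerate(s) if i == 0 or v != s[i - 1] + 1)

k = 6
n = 2 ** k
h_index = {hilbert_d2xy(k, t): t for t in range(n * n)}

def avg_2x2(index_of):
    counts = [count_clusters(index_of((x0 + dx, y0 + dy))
                             for dx in (0, 1) for dy in (0, 1))
              for x0 in range(n - 1) for y0 in range(n - 1)]
    return sum(counts) / len(counts)

avg_z = avg_2x2(lambda c: z_index(c[0], c[1], k))
avg_h = avg_2x2(lambda c: h_index[c])
print(avg_z, avg_h)  # the Hilbert average should be the smaller one
```

Note that aligned 2×2 blocks (even coordinates) are always a single Hilbert cluster; the average above 1 comes entirely from unaligned window positions.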

Abel and Mark [1] reported empirical studies to explore the relative properties of such mapping functions using various metrics. They reached the conclusion that the Hilbert ordering deserves closer attention as an alternative to the z curve ordering. Bugnion et al. estimated the average number of clusters and the distribution of intercluster intervals for two-dimensional rectangular queries. They derived the estimations based on the fraction of vertical and horizontal edges of any particular space-filling curve.


Fig. 3. The first three steps of the Hilbert space-filling curve: (a) first step, (b) second step, and (c) third step.


However, those fractions were provided only for a two-dimensional space and without any calculation or formal verification. In this paper, we formally prove that, in a d-dimensional space, the d different edge directions approach the uniform distribution as the order of the Hilbert curve approximation grows to infinity.

Several closely related analyses of the average number of two-dimensional quadtree nodes have been presented in the literature. Dyer [7] presented an analysis of the best, worst, and average case of a square of size 2^n × 2^n, giving an approximate formula for the average case. Shaffer [27] gave a closed formula for the exact number of blocks that such a square requires when anchored at a given position (x, y); he also gave a formula for the average number of blocks for such squares (averaged over all possible positions). Some of these formulae were generalized for arbitrary two-dimensional and d-dimensional rectangles [9], [10].

3 ASYMPTOTIC ANALYSIS

In this section, we give an asymptotic formula for the clustering property of the Hilbert space-filling curve for general polyhedra in a d-dimensional space. The symbols used in this section are summarized in Table 1. The polyhedra we consider here are not necessarily convex, but are rectilinear in the sense that any (d-1)-dimensional polygonal surface is perpendicular to one of the d coordinate axes.

Definition 3.1. A rectilinear polyhedron is bounded by a set V of polygonal surfaces, each of which is perpendicular to one of the d coordinate axes, where V is a subset of R^d and homeomorphic^1 to a (d-1)-dimensional sphere S^{d-1}.

For d = 2, the set V is, by definition, a Jordan curve [3], which is essentially a simple closed curve in R^2. The set of surfaces of a polyhedron divides the d-dimensional space R^d into two connected components, which may be called the interior and the exterior.

The basic intuition is that each cluster within a given polyhedron corresponds to a segment of the Hilbert curve connecting a group of grid points in the cluster, which has two endpoints adjacent to the surface of the polyhedron. The number of clusters is then equal to half the number of endpoints of the segments bounded by the surface of the polyhedron. In other words,

Remark 3.1. The number of clusters within a given d-dimensional polyhedron is equal to the number of entries (or exits) of the Hilbert curve into (or from) the polyhedron.

Thus, we expect that the number of clusters is approximately proportional to the perimeter or hypersurface area of the d-dimensional polyhedron (d ≥ 2). With this observation, the task is reduced to finding the constant factor of a linear function.

Our approach to deriving the asymptotic solution largely depends on the self-similar nature of the Hilbert curve, which stems from the recursive process of the curve expansion. Specifically, we shall show in the following lemmas that the edges of d different orientations are uniformly distributed in a d-dimensional Euclidean space. That is, approximately one dth of the edges are aligned with the ith dimensional axis for each i (1 ≤ i ≤ d). Here, we mean by edges the line segments of the Hilbert curve connecting two neighboring points. The uniform distribution of the edges provides key leverage for deriving the asymptotic solution. To show the uniform distribution, it is important to understand


1. Two subsets, X and Y, of Euclidean space are called homeomorphic if there exists a continuous bijective mapping, f : X → Y, with a continuous inverse f^{-1} [12].

TABLE 1. Definition of Symbols


. how the kth order approximation of the Hilbert curve is derived from lower order approximations, and

. how the d-dimensional Hilbert curve is extended from the two-dimensional Hilbert curve, which was described only in a geometric form in [13]. (Analytic forms for the d-dimensional Hilbert curves were presented in [5].)

In a d-dimensional space, the kth order approximation of the d-dimensional Hilbert curve H^d_k is derived from the first order approximation H^d_1 by replacing each vertex in the H^d_1 by H^d_{k-1}, which may be rotated about a coordinate axis and/or reflected about a hyperplane perpendicular to a coordinate axis. Since there are 2^d vertices in the H^d_1, the H^d_k is considered to be composed of 2^d H^d_{k-1} vertices and (2^d - 1) edges, each connecting two of them.

Before describing the extension for the d-dimensional Hilbert curve, we define the orientations of H^d_k. Consider H^d_1, which consists of 2^d vertices and (2^d - 1) edges. No matter where the Hilbert curve starts its traversal, the coordinates of the start and end vertices of the H^d_1 differ only in one dimension, meaning that both vertices lie on a line parallel to one of the d coordinate axes. We say that H^d_1 is i-oriented if its start and end vertices lie on a line parallel to the ith coordinate axis. For any k (k > 1), the orientation of H^d_k is equal to that of the H^d_1 from which it is derived.

Figs. 4 and 5 illustrate the processes that generate H^3_k from H^2_k and H^4_k from H^3_k, respectively. In general, when the dth dimension is added to the (d-1)-dimensional Hilbert curve, each vertex of H^{d-1}_1 (i.e., H^{d-1}_{k-1}) is replaced by an H^d_{k-1} of the same orientation, except the 2^{d-1}th one (i.e., the end vertex of H^{d-1}_1), whose orientation is changed from 1-oriented to d-oriented, parallel to the dth dimensional axis. For example, in Fig. 5, the orientations of the two vertices connected by a dotted line have been changed from 1 to 4. Since the orientations of all the other (2^{d-1} - 1) H^d_{k-1} vertices remain unchanged, they are all j-oriented for some j (1 ≤ j < d). The whole 2^{d-1} H^d_{k-1} vertices are then replicated by reflection and, finally, the two replicas are connected by an edge parallel to the dth coordinate axis (called a d-oriented edge) to form a d-oriented H^d_k. In short, whenever a dimension (say, the dth dimension) is added, two d-oriented H^d_{k-1} vertices are introduced, the number of 1-oriented H^d_{k-1} vertices remains unchanged at two, and the number of H^d_{k-1} vertices of each of the other orientations is doubled.

Notation 3.1. Let φ_i be the number of i-oriented H^d_{k-1} vertices in a given d-oriented H^d_k.

Lemma 1. For a d-oriented H^d_k (d ≥ 2),

    φ_i = 2           if i = 1,
    φ_i = 2^{d+1-i}   if 1 < i ≤ d.    (1)

Proof. By induction on d. □

The following lemma shows that the edges of the d different orientations approach the uniform distribution as the order of the Hilbert curve approximation grows to infinity.

Notation 3.2. Let ε_{i,k} denote the number of i-oriented edges in a d-oriented H^d_k.

Lemma 2. In a d-dimensional space, for any i and j (1 ≤ i, j ≤ d), ε_{i,k}/ε_{j,k} approaches unity as k grows to infinity.

Proof. We begin by deriving recurrence relations among the terms ε_{i,k} and φ_i. As we mentioned previously, the fundamental operations involved in expanding the Hilbert curve (i.e., from H^d_{k-1} to H^d_k) are rotation and reflection. During the expansion of H^d_k, the orientation of an H^d_{k-1} vertex in a quantized subregion is changed only by rotation; a set of subregions of an orientation are replicated from one of the same orientation, which leaves the directions of their edges unchanged. Consequently,


Fig. 4. The three-dimensional Hilbert curve (H^3_k, with vertices representing H^3_{k-1} approximations annotated by their orientations).

Fig. 5. The four-dimensional Hilbert curve (H^4_k, with vertices representing H^4_{k-1} approximations annotated by their orientations).


any two distinct H^d_{k-1} vertices of the same orientation contain the same number of edges ε_{i,k-1} for each direction i (1 ≤ i ≤ d). Therefore, the set of the 1-oriented edges in the H^d_k consists of 2^{d-1} connection edges (in H^d_1), the d-oriented edges of the 1-oriented H^d_{k-1} vertices, the (d-1)-oriented edges of the 2-oriented H^d_{k-1} vertices, the (d-2)-oriented edges of the 3-oriented H^d_{k-1} vertices, and so on. By applying the same procedure to the other directions, we obtain

    ε_{1,k} = φ_1 ε_{d,k-1} + φ_2 ε_{d-1,k-1} + ... + φ_d ε_{1,k-1} + 2^{d-1}
    ε_{2,k} = φ_2 ε_{d,k-1} + φ_3 ε_{d-1,k-1} + ... + φ_1 ε_{1,k-1} + 2^{d-2}
    ε_{3,k} = φ_3 ε_{d,k-1} + φ_4 ε_{d-1,k-1} + ... + φ_2 ε_{1,k-1} + 2^{d-3}
    ...
    ε_{d,k} = φ_d ε_{d,k-1} + φ_1 ε_{d-1,k-1} + ... + φ_{d-1} ε_{1,k-1} + 1.    (2)

The initial values are given by ε_{i,1} = 2^{d-i}, and the values of φ_i are given in Lemma 1. With the constants in the last terms ignored, the recurrence relations are completely symmetric. From the symmetry, it can be shown that, for any i and j (1 ≤ i, j ≤ d),

    lim_{k→∞} ε_{i,k}/ε_{j,k} = 1.
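The limit can also be observed by iterating recurrence (2) directly. A sketch for d = 3, with the φ_i values from Lemma 1 (the array-indexing conventions are ours):

```python
# Iterate recurrence (2) for d = 3 with the phi values of Lemma 1
# (phi_1 = 2, phi_i = 2^(d+1-i) for 1 < i <= d); 1-based arrays, index 0 unused.
d = 3
phi = [0] + [2 if i == 1 else 2 ** (d + 1 - i) for i in range(1, d + 1)]
eps = [0] + [2 ** (d - i) for i in range(1, d + 1)]   # eps[i] = eps_{i,1}
for k in range(2, 16):
    eps = [0] + [
        sum(phi[(m - 1 + t) % d + 1] * eps[d - t] for t in range(d)) + 2 ** (d - m)
        for m in range(1, d + 1)
    ]
    # sanity: H^d_k visits 2^(dk) cells, hence has 2^(dk) - 1 edges in total
    assert sum(eps[1:]) == 2 ** (d * k) - 1

print(max(eps[1:]) / min(eps[1:]))  # ratio of most to least common direction
```

The ratio collapses toward 1 geometrically in k, consistent with the symmetry argument above (for d = 3 the direction counts at k = 1 are 4, 2, 1; at k = 2 they are already 22, 18, 23).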

The proof is complete. □

Now, we consider a d-dimensional grid space, which is equivalent to a d-dimensional Euclidean integer space. In the d-dimensional grid space, each grid point y = (x_1, ..., x_d) has 2d neighbors. The coordinates of a neighbor differ from those of y by unity in only one dimension. In other words, the coordinates of the neighbors that lie in a line parallel to the ith axis must be either (x_1, ..., x_i + 1, ..., x_d) or (x_1, ..., x_i - 1, ..., x_d). We call them the i⁺-neighbor and the i⁻-neighbor of y, respectively.

Butz showed that any unit increment in the Hilbert order produces a unit increment in one of the d coordinates and leaves the other d - 1 coordinates unchanged [5]. The implication is that, for any grid point y, both neighbors of y in the linear order imposed by the Hilbert curve are chosen from the 2d neighbors of y in the d-dimensional grid space. Of the two neighbors of y in the Hilbert order, the one closer to the start of the Hilbert traversal is called the predecessor of y.

Notation 3.3. For a grid point y in a d-dimensional grid space, let p_i⁺ be the probability that the predecessor of y is the i⁺-neighbor of y, and let p_i⁻ be the probability that the predecessor of y is the i⁻-neighbor of y.

Lemma 3. In a sufficiently large d-dimensional grid space, for any i (1 ≤ i ≤ d),

    p_i⁺ = p_i⁻ = 1/d.

Proof. Assume y is a grid point in a d-dimensional space and z is its predecessor. Then, the edge yz adjacent to y and z is parallel to one of the d dimensional axes. From Lemma 2 and the recursive definition of the Hilbert curve, the probability that yz is parallel to the ith dimensional axis is 1/d for any i (1 ≤ i ≤ d). This implies that the probability that z is either the i⁺-neighbor or the i⁻-neighbor of y is 1/d. □
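Lemma 3 can be checked empirically in two dimensions: tabulate, over all consecutive index pairs of H^2_k, which of the four neighbors the predecessor is; each case should be near 1/(2d) · 2 = 1/4. The mapping routine is the standard iterative algorithm (an assumption on our part, since the paper gives the curve only geometrically):

```python
from collections import Counter

def hilbert_d2xy(k, t):
    """Standard iterative 2-D Hilbert index-to-coordinates mapping."""
    x = y = 0
    s = 1
    while s < 2 ** k:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

k = 6
cells = [hilbert_d2xy(k, t) for t in range(4 ** k)]
# Step direction from each point to its successor; by symmetry, the four
# directions (+x, -x, +y, -y) should each carry roughly a quarter of the steps.
steps = Counter((b[0] - a[0], b[1] - a[1]) for a, b in zip(cells, cells[1:]))
total = sum(steps.values())
for delta in sorted(steps):
    print(delta, round(steps[delta] / total, 3))
```

The residual imbalance comes from the curve's net displacement between its start and end cells, which vanishes relative to the 4^k - 1 total steps as k grows.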

For a d-dimensional rectilinear polyhedron representing a query region, the number, sizes, and shapes of the surfaces can be arbitrary. Due to the constraint of surface alignment, however, it is feasible to classify the surfaces of a d-dimensional rectilinear polyhedron into 2d different kinds: For any i (1 ≤ i ≤ d),

. If a point y is inside the polyhedron and its i⁺-neighbor is outside, then the point y faces an i⁺-surface.

. If a point y is inside the polyhedron and its i⁻-neighbor is outside, then the point y faces an i⁻-surface.

For example, Fig. 6 illustrates grid points which face surfaces in a two-dimensional grid space. The shaded region represents the inside of the polyhedron. Assuming that the first dimension is vertical and the second dimension is horizontal, grid points A and D face a 1⁺-surface, and grid point B (on the convex corner) faces both a 1⁺-surface and a 2⁺-surface. Although grid point C (on the concave corner) is close to the boundary, it does not face any surface because all of its neighbors are inside the polyhedron. Consequently, the chance that the Hilbert curve enters the polyhedron through grid point B is approximately twice that of entering through grid point A (or D). The Hilbert curve cannot enter through grid point C.

For any d-dimensional rectilinear polyhedron, it is interesting to see that the aggregate area of the i⁺-surfaces is exactly as large as that of the i⁻-surfaces. In a d-dimensional grid space, we mean by surface area the number of interior grid points that face a given surface of any kind.

Notation 3.4. For a d-dimensional rectilinear polyhedron, let S_i^+ and S_i^- denote the aggregate number of interior grid points that face an i^+-surface and an i^--surface, respectively.

Before proving the following theorem, we state an elementary remark without proof.

Remark 3.2. Given a d-dimensional rectilinear polyhedron, S_i^+ = S_i^- for any i (1 ≤ i ≤ d).

Notation 3.5. Let N_d be the average number of clusters within a given d-dimensional rectilinear polyhedron.
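Remark 3.2 in fact holds for any finite set of grid points, because each axis-parallel line through the region decomposes it into intervals, and every interval contributes exactly one i^+ face and one i^- face. A minimal sketch (helper names are ours):

```python
import random

def surface_areas(region, d):
    """Compute S_i^+ and S_i^-: the number of points of `region` whose
    i+/i- neighbor lies outside the region, for each axis i."""
    s_plus, s_minus = [0] * d, [0] * d
    for p in region:
        for i in range(d):
            up = tuple(c + (1 if j == i else 0) for j, c in enumerate(p))
            dn = tuple(c - (1 if j == i else 0) for j, c in enumerate(p))
            if up not in region:
                s_plus[i] += 1
            if dn not in region:
                s_minus[i] += 1
    return s_plus, s_minus

random.seed(0)
d = 3
region = {tuple(random.randrange(8) for _ in range(d)) for _ in range(200)}
sp, sm = surface_areas(region, d)
print(sp, sm)   # the two lists agree axis by axis, as Remark 3.2 predicts
```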

MOON ET AL.: ANALYSIS OF THE CLUSTERING PROPERTIES OF THE HILBERT SPACE-FILLING CURVE 129

Fig. 6. Illustration of grid points facing surfaces.


Theorem 1. In a sufficiently large d-dimensional grid space mapped by H_k^d, let S_q be the total surface area of a given rectilinear polyhedral query q. Then,

  lim_{k→∞} N_d = S_q / (2d).  (3)

Proof. Assume a grid point y faces an i^+-surface (or an i^--surface). Then, the probability that the Hilbert curve enters the polyhedron through y is equivalent to the probability that the predecessor of y is the i^+-neighbor (or the i^--neighbor) of y. Thus, the expected number of entries through the i^+-surfaces (or the i^--surfaces) is S_i^+ p_i^+ (or S_i^- p_i^-). Since the number of clusters is equal to the total number of entries into the polyhedron through any of the 2d kinds of surfaces (Remark 3.1), it follows that

  lim_{k→∞} N_d = Σ_{i=1}^{d} (S_i^+ p_i^+ + S_i^- p_i^-)
                = Σ_{i=1}^{d} S_i^+ (p_i^+ + p_i^-)    (by Remark 3.2)
                = Σ_{i=1}^{d} S_i^+ · (1/d)            (by Lemma 3)
                = S_q / (2d),

where the last step uses Σ_{i=1}^{d} (S_i^+ + S_i^-) = S_q together with S_i^+ = S_i^-, so that Σ_{i=1}^{d} S_i^+ = S_q / 2. The proof is complete. ∎

Theorem 1 confirms our early conjecture that the number of clusters is approximately proportional to the hypersurface area of a d-dimensional polyhedron and provides (2d)^{-1} as the constant factor of the linear function. In a two-dimensional space, the average number of clusters for the z curve approaches one-third of the perimeter of a query rectangle plus two-thirds of the side length of the rectangle in the unfavored direction [23]. It follows that the Hilbert curve achieves better clustering than the z curve because the average number of clusters for the Hilbert curve is approximately equal to one-fourth of the perimeter of a two-dimensional query rectangle.
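For concreteness, the two asymptotic estimates can be evaluated side by side; the function names are ours, and the z-curve expression is the one quoted above from [23].

```python
def hilbert_clusters(sx, sy):
    """Asymptotic average number of clusters for an sx x sy query rectangle
    under the Hilbert curve: one-fourth of the perimeter (Theorem 1, d = 2)."""
    return 2 * (sx + sy) / 4

def z_clusters(sx, sy, unfavored):
    """Asymptotic average for the z curve [23]: one-third of the perimeter
    plus two-thirds of the side length in the unfavored direction."""
    return 2 * (sx + sy) / 3 + 2 * unfavored / 3

for s in (2, 8, 32):
    print(s, hilbert_clusters(s, s), z_clusters(s, s, s))
```

For square queries the z-curve estimate is exactly twice the Hilbert estimate (2s versus s for side length s).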

Corollary 1. In a sufficiently large d-dimensional grid space mapped by H_k^d, the following properties are satisfied:

1. Given an s_1 × s_2 × ... × s_d hyperrectangle,

   lim_{k→∞} N_d = (1/d) Σ_{i=1}^{d} ( (1/s_i) Π_{j=1}^{d} s_j ).

2. Given a hypercube of side length s,

   lim_{k→∞} N_d = s^{d−1}.

For a square of side length 2, item 2 of Corollary 1 gives 2 as the average number of clusters, which is exactly the same as the result given in [14].
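Both items of Corollary 1 can be checked by exhaustive simulation on a modest grid. The sketch below is ours; it uses the standard coordinate-to-index conversion (any fixed orientation of the curve yields the same averages, since the statistics are invariant under rotation and reflection), and on a 64 × 64 grid the measured averages fall slightly below their asymptotic limits.

```python
def xy2d(n, x, y):
    """Hilbert index of grid point (x, y) on an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def avg_clusters(n, wx, wy):
    """Average number of clusters of a wx x wy window over all positions:
    a cluster is a maximal run of consecutive Hilbert indices in the window."""
    total = positions = 0
    for x0 in range(n - wx + 1):
        for y0 in range(n - wy + 1):
            idx = sorted(xy2d(n, x, y)
                         for x in range(x0, x0 + wx)
                         for y in range(y0, y0 + wy))
            total += 1 + sum(b != a + 1 for a, b in zip(idx, idx[1:]))
            positions += 1
    return total / positions

n = 64
sq = avg_clusters(n, 2, 2)     # Corollary 1.2 predicts s^(d-1) = 2
rect = avg_clusters(n, 2, 4)   # Corollary 1.1 predicts (s1 + s2)/2 = 3
print(sq, rect)
```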

4 EXACT ANALYSIS: A SPECIAL CASE

Theorem 1 states that, as the size of a grid space grows infinitely, the average number of clusters approaches half the surface area of a given query region divided by the dimensionality. It does not provide an intuition as to how rapidly the number of clusters converges to the asymptotic solution. To address this issue, in this section, we derive a closed-form, exact formula for a two-dimensional finite space. We can then measure how closely the asymptotic solution reflects the reality in a finite space by comparing it with the exact formula. Specifically, we assume that a finite 2^{k+n} × 2^{k+n} grid space is mapped by H^2_{k+n} and a query region is a square of size 2^k × 2^k. We first describe our approach and then present the formal derivation of the solution in several lemmas and a theorem. Table 2 summarizes the symbols used in this section.

4.1 Basic Concepts

Remark 3.1 states that the number of clusters within a given query region is equal to the number of entries into the region made by the Hilbert curve traversal. Since each entry is eventually followed by an exit from the region, an entry is equivalent to two cuts of the Hilbert curve by the boundary of the query region. We restate Remark 3.1 as follows:

Remark 4.1. The number of clusters within a given query region is equal to half the number of edges cut by the boundary of the region.

Here, by edges we mean the line segments of the Hilbert curve connecting two neighboring grid points. We now know from Remark 4.1 that deriving the exact formula reduces to counting the number of edge cuts by the boundary of a 2^k × 2^k query window at all possible positions within a 2^{k+n} × 2^{k+n} grid region. The average number of clusters is then simply obtained by dividing this number by twice the number of possible positions of the query window.

Notation 4.1. Let N_2(k, k+n) be the average number of clusters inside a 2^k × 2^k square window in a 2^{k+n} × 2^{k+n} grid region.

The difficulty of counting the edge cuts lies in the fact that, for each edge within the grid region, the number of cuts varies depending on the location of the edge. Intuitively, the edges near the boundary of the grid region are cut less often than those near the center, because a smaller number of square windows can cut the edges near the boundary. Thus, to make it easier to count the edge cuts, the grid region H^2_{k+n} is divided into nine subregions, as shown in Fig. 7. The width of the subregions on the boundary is 2^k. Then, the 2^{k+n} × 2^{k+n} grid region (H^2_{k+n}) can be considered as a collection of 2^{2n} H^2_k approximations, each of which is connected to one or two neighbors by connection edges. From now on, by an internal edge we mean one of the 2^{2k} − 1 edges in an H^2_k, and by a connection edge one that connects two H^2_k subregions. For example, subregion F includes only one H^2_k and is connected to subregions B and D by a horizontal and a vertical connection edge, respectively. Subregion B includes (2^n − 2) H^2_k approximations, each of which is connected to its two neighbors by connection edges.

130 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 13, NO. 1, JANUARY/FEBRUARY 2001

Consider an edge (internal or connection) near the center of subregion A, and a horizontal edge in subregion B. An edge in subregion A can be cut by 2^{k+1} square windows, whose positions within the region are mutually distinct. On the other hand, a horizontal edge in subregion B can be cut by a different number of distinct windows, depending on the position of the edge. Specifically, if the edge in subregion B is on the i-th row from the topmost, then it is cut 2i times. The observations we have made are summarized as follows:

1. Every edge (either horizontal or vertical) at least one of whose end points resides in subregion A is cut 2^{k+1} times.

2. Every vertical edge in subregions B and C is cut 2^k times by the top or bottom side of a window.

3. Every horizontal edge in subregions D and E is cut 2^k times by the left or right side of a window.

4. Every connection edge in subregions {B,F,H} is horizontal, resides in the 2^k-th row from the topmost of those subregions, and is cut 2^{k+1} times by the left and right sides of a window. Similarly, every connection edge in subregions {C,G,I} is horizontal, resides in the 2^k-th row from the topmost of those subregions, and is cut twice by the left and right sides of a window.

5. Every connection edge in subregions {D,F,G} is vertical, resides in the first column from the leftmost, and is cut twice by the top and bottom sides of a window. Every connection edge in subregions {E,H,I} is vertical, resides in the first column from the rightmost, and is cut twice by the top and bottom sides of a window.

6. Every horizontal edge in the i-th row from the topmost of subregion B is cut 2i times by both the left and right sides of a window, and every horizontal edge in the i-th row from the topmost of subregion C is cut 2^{k+1} − 2i + 2 times by both the left and right sides of a window.

7. Every vertical edge in the i-th column from the leftmost of subregion D is cut 2i times by both the top and bottom sides of a window, and every vertical edge in the i-th column from the leftmost of subregion E is cut 2^{k+1} − 2i + 2 times by both the top and bottom sides of a window.


TABLE 2. Definition of Symbols.

Fig. 7. H^2_{k+n} divided into nine subregions.


8. Every horizontal edge in the i-th row from the topmost of subregions {F,H} is cut i times by either the left or right side of a window.

9. Every horizontal edge in the i-th row from the topmost of subregions {G,I} is cut 2^k − i + 1 times by either the left or right side of a window.

10. Every vertical edge in the i-th column from the leftmost of subregions {F,G} is cut i times by either the top or bottom side of a window.

11. Every vertical edge in the i-th column from the leftmost of subregions {H,I} is cut 2^k − i + 1 times by either the top or bottom side of a window.

12. The two connection edges through which the Hilbert curve enters into and leaves from the grid region are cut once each.
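Cut counts like those in observation 6 can be verified by brute force over all window positions. The helper below is ours; it uses 1-indexed coordinates with row 1 the topmost, and calls an edge cut when exactly one of its endpoints lies inside the window.

```python
def windows_cutting(n, w, x, y):
    """Number of w x w windows in an n x n grid of points whose boundary cuts
    the horizontal edge joining grid points (x, y) and (x + 1, y)."""
    cuts = 0
    for wx in range(1, n - w + 2):           # leftmost column of the window
        for wy in range(1, n - w + 2):       # topmost row of the window
            def inside(px, py):
                return wx <= px < wx + w and wy <= py < wy + w
            if inside(x, y) != inside(x + 1, y):
                cuts += 1
    return cuts

n, k = 64, 3
w = 2 ** k
# observation 6: a horizontal edge in the i-th row of subregion B (top strip,
# away from the corner strips) is cut 2i times
results = {i: windows_cutting(n, w, 3 * w, i) for i in (1, 2, w)}
print(results)
```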

From these observations, we can categorize the edges in the H^2_{k+n} grid region into the following five groups:

1. E_1: the edges described in observation 1. Each edge is cut 2^{k+1} times.

2. E_2: the edges described in observations 2 and 3. Each edge is cut 2^k times.

3. E_3: the edges described in observations 4 and 5. Each connection edge on the top boundary (i.e., subregions {B,F,H}) is cut 2^{k+1} times, and any other connection edge is cut twice.

4. E_4: the edges described in observations 6 and 7. Each edge is cut 2i or 2(2^k − i + 1) times if it is in the i-th row (or column) from the topmost (or leftmost).

5. E_5: the edges described in observations 8 to 11. Each edge is cut i or 2^k − i + 1 times if it is in the i-th row (or column) from the topmost (or leftmost).

Notation 4.2. N_i denotes the number of edge cuts from an edge group E_i.

In an H^2_{k+n} grid region, the number of all possible positions of a 2^k × 2^k window is (2^{k+n} − 2^k + 1)^2. Since there are two more cuts from observation 12, in addition to N_1, ..., N_5, the average number of clusters N_2(k, k+n) is given by

  N_2(k, k+n) = (N_1 + N_2 + N_3 + N_4 + N_5 + 2) / (2 (2^{k+n} − 2^k + 1)^2).  (4)

In the next section, we derive a closed-form expression for each of N_1, ..., N_5.

4.2 Formal Derivation

We adopt the notion of orientations of H_k^d given in Section 3 and extend it so that it can be used to derive inductions.

Notation 4.3. An i-oriented H_k^d is called i^+-oriented (or i^--oriented) if the i-th coordinate of its start point is not greater (or not less) than that of any grid point in the H_k^d.

Fig. 8 illustrates 1^+-oriented, 1^--oriented, 2^+-oriented, and 2^--oriented H^2_2 approximations. Note that either of the two end points can be a start point for each curve.

We begin by deriving N_1 and N_3. At first glance, the derivation of N_1 appears simple because each edge in E_1 is cut 2^{k+1} times. However, the derivation of N_1 involves counting the number of connection edges crossing the boundary between subregion A and the other subregions, as well as the number of edges inclusive to subregion A. We accomplish this by counting the number of edges in the complementary set Ē_1 (that is, {edges in H^2_{k+n}} − E_1). Since Ē_1 consists of the edges in the 4(2^n − 1) H^2_k approximations in the boundary subregions B through I and the connection edges in E_3, |Ē_1| is equal to

  4(2^n − 1)(2^{2k} − 1) + |E_3|.

To find the number of connection edges in E_3, we define the number of connection edges in different parts of the boundary subregions. In the following, without loss of generality, we assume that the grid region is a 2^+-oriented H^2_{k+n}.

Notation 4.4. Let t_n, b_n, and s_n denote the number of connection edges in the top boundary (i.e., subregions {B,F,H}), in the bottom boundary (i.e., subregions {C,G,I}), and in the left or right boundary (i.e., subregions {D,F,G} or {E,H,I}) of a 2^+-oriented H^2_{k+n}, respectively.

Note that the number of connection edges in subregions {D,F,G} and the number of connection edges in subregions {E,H,I} are identical because the 2^+-oriented H^2_{k+n} is vertically self-symmetric.

Lemma 4. For any positive integer n,

  t_n = 2^{n−1}  and  b_n + 2 s_n = 2 (2^n − 1).  (5)

Proof. Given in Appendix A. ∎
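Lemma 4 can be spot-checked against an actual curve. The sketch below is ours: it enumerates a Hilbert curve whose start and end lie on one side of the grid, treats that side as the bottom, and counts the connection edges (steps that cross between 2^k subgrids) lying wholly inside each boundary strip.

```python
def d2xy(n, d):
    """Convert Hilbert index d to (x, y) on an n x n grid (n a power of two)."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

k, m = 2, 3                          # subgrid order k; 2^m x 2^m subgrids
N, W = 2 ** (k + m), 2 ** k
pts = [d2xy(N, d) for d in range(N * N)]

def strip_edges(in_strip):
    """Connection edges whose both endpoints lie in the given boundary strip."""
    return sum(1 for p, q in zip(pts, pts[1:])
               if (p[0] // W, p[1] // W) != (q[0] // W, q[1] // W)
               and in_strip(p) and in_strip(q))

# this curve starts and ends on the y = 0 side, so "top" here is y >= N - W
t = strip_edges(lambda p: p[1] >= N - W)
b = strip_edges(lambda p: p[1] < W)
s_side = strip_edges(lambda p: p[0] < W) + strip_edges(lambda p: p[0] >= N - W)
print(t, b + s_side)   # Lemma 4 (with m playing the role of n): 2^(m-1) and 2(2^m - 1)
```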

From Lemma 4, the number of connection edges inclusive to the boundary subregions (i.e., in E_3) is given by t_n + b_n + 2 s_n = 5 · 2^{n−1} − 2. From this, we can obtain the number of edges in E_1 as well as E_3 and, hence, the number


Fig. 8. Four different orientations of H^2_2: (a) 1^+-oriented, (b) 1^--oriented, (c) 2^+-oriented, and (d) 2^--oriented.


of cuts from E_1 and E_3. The results are presented in the following lemma.

Lemma 5. The numbers of edge cuts from E_1 and E_3 are

  N_1 = 2 (2^n − 2)^2 · 2^{3k} + 3 (2^n − 2) · 2^k,  (6)

  N_3 = 2^{n+k} + 4 (2^n − 1).  (7)

Proof. Given in Appendix A. ∎

Then, all that we need to derive N_2 is to count the number of vertical edges in subregions {B,C} and the number of horizontal edges in subregions {D,E}. No connection edges in these subregions are involved. Since the number of horizontal (or vertical) edges in an H^2_k is determined by its orientation, it is necessary to find the number of H^2_k approximations of different orientations in subregions {B,C,D,E}. In the following, we give notations for the number of horizontal and vertical edges in an H^2_k and the number of H^2_k approximations of different orientations in the boundary subregions in Fig. 7.

Notation 4.5. Let H_k and V_k denote the number of horizontal and vertical edges in a 2-oriented H^2_k, respectively.

By definition, the numbers of horizontal and vertical edges in a 1-oriented H^2_k are V_k and H_k, respectively.

Notation 4.6. For a set of subregions {R_1, R_2, ..., R_j} in Fig. 7, let #^{R_1,...,R_j}_{i^+,n} and #^{R_1,...,R_j}_{i^-,n} denote the number of i^+-oriented and i^--oriented H^2_k approximations in those subregions, respectively.

Lemma 6. Given a 2^+-oriented H^2_{k+n}, as depicted in Fig. 7,

  #^{B}_{2^+,n} = 2^n − 2,  (8)

  #^{D}_{1^+,n} + #^{E}_{1^-,n} + #^{C}_{2^+,n} = 2^n − 2,  (9)

  #^{C}_{1^+,n} + #^{C}_{1^-,n} + #^{D,E}_{2^+,n} + #^{D,E}_{2^-,n} = 2 (2^n − 2).  (10)

Proof. Given in Appendix A. ∎

From Lemma 6, a closed-form expression of N_2 is derived in the following lemma:

Lemma 7. The number of edge cuts from E_2 is

  N_2 = 2 (2^n − 2) · 2^{3k} − 2 (2^n − 2) · 2^k.  (11)

Proof. Given in Appendix A. ∎

Now, we consider the number of cuts from E_4 and E_5. The edges in these groups are cut a different number of times depending on their relative locations within the H^2_k to which they belong. Consequently, the expressions for N_4 and N_5 include such terms as i · v_k(i) and i · h_k(i). The definitions of v_k(i) and h_k(i) are given below. We call H^2_k approximations having such terms gradients.

Notation 4.7. Let h_k(i) be the number of horizontal edges in the i-th row from the topmost, and v_k(i) the number of vertical edges in the i-th column from the leftmost, of a 2^+-oriented H^2_k.

To derive the closed-form expressions for N_4 and N_5, we first define different types of gradients. Consider the 2^+-oriented H^2_k approximations in subregions {B,C,D,E}. From observations 6 and 7, the number of cuts from the horizontal edges in a 2^+-oriented H^2_k in subregion B is Σ_{i=1}^{2^k} 2i h_k(i). Likewise, the number of cuts from the horizontal edges in a 2^+-oriented H^2_k in subregion C is Σ_{i=1}^{2^k} 2(2^k − i + 1) h_k(i), and the number of cuts from the vertical edges in a 2^+-oriented H^2_k in subregion D or E is Σ_{i=1}^{2^k} 2i v_k(i). The number of cuts from vertical edges is the same in both subregions D and E because a 2^+-oriented H^2_k is vertically self-symmetric. Based on this, we define three types of gradients for a 2^+-oriented H^2_k:

Definition 4.1.

1. A 2^+-oriented H^2_k is called u-gradient_k if each of its horizontal edges in the i-th row from the topmost is cut i or 2i times.

2. A 2^+-oriented H^2_k is called d-gradient_k if each of its horizontal edges in the i-th row from the topmost is cut 2^k − i + 1 or 2(2^k − i + 1) times.

3. A 2^+-oriented H^2_k is called s-gradient_k if each of its vertical edges in the i-th column from either the leftmost or rightmost is cut i or 2i times.

Fig. 9 illustrates the three different gradients (u-gradient_2, d-gradient_2, and s-gradient_2) and the cutting boundaries of a sliding window. These definitions can be


Fig. 9. Three different gradients and cutting windows: (a) u-gradient_2, (b) d-gradient_2, and (c) s-gradient_2.


applied to the H^2_k approximations of different orientations as well, by simply rotating the directions. For example, a 1^+-oriented H^2_k in subregion D is d-gradient_k, and a 2^--oriented H^2_k in subregion D is s-gradient_k.

Lemma 8. Let α_k = Σ_{i=1}^{2^k} i h_k(i), β_k = Σ_{i=1}^{2^k} (2^k − i + 1) h_k(i), and γ_k = Σ_{i=1}^{2^k} i v_k(i). Then,

  α_k + β_k = (2^k + 1) H_k  and  γ_k = (1/2)(2^k + 1) V_k.  (12)

Proof. Given in Appendix A. ∎
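The second identity of (12) rests on the left-right symmetry of the curve and can be spot-checked numerically. The sketch below is ours; it uses a standard Hilbert implementation whose start and end lie on the same side, so the curve is mirror-symmetric across the vertical center line.

```python
def d2xy(n, d):
    """Convert Hilbert index d to (x, y) on an n x n grid (n a power of two)."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

k = 5
n = 2 ** k
pts = [d2xy(n, d) for d in range(n * n)]
v = [0] * n                      # v[c]: number of vertical edges in column c
for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
    if x0 == x1:                 # a vertical edge of the curve
        v[x0] += 1
V = sum(v)
gamma = sum((c + 1) * v[c] for c in range(n))   # gamma_k, columns 1-indexed
print(2 * gamma, (n + 1) * V)   # equal, i.e., gamma_k = (1/2)(2^k + 1) V_k
```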

Next, we need to know the number of gradients of each type in the boundary subregions B through I so that we can derive N_4 and N_5. For the H^2_k approximations in subregions {B,C,D,E},

- Every 2^+-oriented H^2_k in B is u-gradient_k.
- Every 2^+-oriented H^2_k in C, 1^+-oriented H^2_k in D, and 1^--oriented H^2_k in E is d-gradient_k.
- Every 1^+-oriented or 1^--oriented H^2_k in C and 2^+-oriented or 2^--oriented H^2_k in {D,E} is s-gradient_k.

The H^2_k approximations in subregions {F,G,H,I} are dual-type gradients. In other words,

- Each of the 2^+-oriented H^2_k approximations in {F,H} is both u-gradient_k and s-gradient_k.
- The H^2_k in G is both d-gradient_k and s-gradient_k because the subgrid is either 2^+-oriented or 1^+-oriented.
- The H^2_k in I is both d-gradient_k and s-gradient_k because the subgrid is either 2^+-oriented or 1^--oriented.

Thus, in subregions {B,C,D,E}, the number of u-gradient_k approximations is #^{B}_{2^+,n}, the number of d-gradient_k approximations is #^{C}_{2^+,n} + #^{D}_{1^+,n} + #^{E}_{1^-,n}, and the number of s-gradient_k approximations is

  #^{D,E}_{2^+,n} + #^{D,E}_{2^-,n} + #^{C}_{1^-,n} + #^{C}_{1^+,n}.

In subregions {F,G,H,I}, the number of u-gradient_k approximations is two, the number of d-gradient_k approximations is two, and the number of s-gradient_k approximations is four. From this observation and Lemmas 6 and 8, it follows that:

Lemma 9. The numbers of edge cuts from E_4 and E_5 are

  N_4 = 2 (2^n − 2)(2^k + 1)(2^{2k} − 1),  (13)

  N_5 = 2 (2^k + 1)(2^{2k} − 1).  (14)

Proof. Given in Appendix A. ∎

Finally, in the following theorem, we present a closed-form expression of the average number of clusters.

Theorem 2. Given a 2^{k+n} × 2^{k+n} grid region, the average number of clusters within a 2^k × 2^k query window is

  N_2(k, k+n) = ((2^n − 1)^2 · 2^{3k} + (2^n − 1) · 2^{2k} + 2^n) / (2^{k+n} − 2^k + 1)^2.  (15)

Proof. From (4),

  N_2(k, k+n) = (N_1 + N_2 + N_3 + N_4 + N_5 + 2) / (2 (2^{k+n} − 2^k + 1)^2)
              = ((2^n − 1)^2 · 2^{3k} + (2^n − 1) · 2^{2k} + 2^n) / (2^{k+n} − 2^k + 1)^2.  ∎

For increasing n, N_2(k, k+n) asymptotically approaches a limit of 2^k, which is the side length of the square query region. This matches the asymptotic solution given in Corollary 1.2 for d = 2.
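The convergence rate can be tabulated by evaluating eq. (15) directly (a small sketch; the function name is ours):

```python
def exact_clusters(k, n):
    """Average number of clusters of a 2^k x 2^k window in a 2^(k+n) x 2^(k+n)
    grid, per the closed form of Theorem 2 (eq. (15))."""
    num = (2 ** n - 1) ** 2 * 2 ** (3 * k) + (2 ** n - 1) * 2 ** (2 * k) + 2 ** n
    den = (2 ** (k + n) - 2 ** k + 1) ** 2
    return num / den

vals = [exact_clusters(3, n) for n in (1, 2, 4, 8, 16)]
print(vals)   # increases toward the side length 2^3 = 8 as the grid grows
```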

5 EXPERIMENTAL RESULTS

To demonstrate the correctness of the asymptotic and exact analyses presented in the previous sections, we carried out simulation experiments for range queries of various sizes and shapes. The objective of our experiments was to evaluate the accuracy of the formulas given in Theorem 1 and Theorem 2. Specifically, we intended to show that the asymptotic solution is an excellent approximation for general d-dimensional range queries of arbitrary sizes and shapes. We also intended to validate the correctness of the exact solution for a two-dimensional 2^k × 2^k square query.

5.1 Arrangements of Experiments

To obtain exact measurements of the average number of clusters, we had to average the number of clusters within a query region at all possible positions in a given grid space. Such exhaustive simulation runs allowed us to validate empirically the correctness of the exact formula given in Theorem 2 for a 2^k × 2^k square query.

However, the number of all possible queries is exponential in the dimensionality. In a d-dimensional N × N × ... × N grid space, the total number of distinct positions of a d-dimensional k × k × ... × k hypercubic query is (N − k + 1)^d. Consequently, for a large grid space and a high dimensionality, each simulation run may require processing an excessively large number of queries, which in turn makes the simulation take too long. Thus, we carried out exhaustive simulations only for relatively small two-dimensional and three-dimensional grid spaces; for relatively large or high-dimensional grid spaces, we instead performed statistical simulation by random sampling of queries.

For query shapes, we chose squares, circles, and concave polygons for two-dimensional cases, and cubes, concave polyhedra, and spheres for three-dimensional cases. Fig. 10 illustrates some of the query shapes used in our experiments. In higher-dimensional spaces, we used hypercubic and hyperspherical query shapes because it was relatively easy to identify the query regions by simple mathematical formulas.

5.2 Empirical Validation

The first set of experiments was carried out in two-dimensional grid spaces of two different sizes. The table in Fig. 11a compares the empirical measurements with the exact and asymptotic formulas for a 2^k × 2^k square query. The second column of the table contains the average



numbers of clusters obtained by an exhaustive simulation performed on a 1,024 × 1,024 grid space. The numbers in the third and fourth columns were computed by the formulas in Theorem 1 and Theorem 2, respectively. The numbers from the simulation are identical to those from the exact formula, ignoring round-off errors. Moreover, by comparing the second and third columns, we can measure how closely the asymptotic formula reflects the reality in a finite grid space.

Fig. 11b compares three different two-dimensional query shapes: squares, circles, and concave polygons. The average numbers of clusters were obtained by a statistical simulation performed on a 32K × 32K grid space. For the statistical simulation, a total of 200 queries were generated and placed randomly within the grid space for each combination of query shape and size. With a few exceptional cases, the numbers of clusters form a linear curve for each query shape; the linear correlation coefficients are 0.999253 for squares, 0.999936 for circles, and 0.999267 for concave polygons. The numbers are almost identical for the three different query shapes despite their covering different areas: a square covers s^2 grid points, a concave polygon 3s^2/4 grid points, and a circle approximately πs^2/4 grid points.

However, this should not be surprising, as the three query shapes have the same length of perimeter for a given side length s. For a circular query of diameter s, we can always find a rectilinear polygon that contains the same set of grid points as the circular query region, and it is always the case that the perimeter of this rectilinear polygon (as shown in Fig. 10c) is equal to that of a square of side length s. In general, in a two-dimensional grid space, the perimeter of a rectilinear polygon is greater than or equal to that of the minimum bounding rectangle (MBR) of the polygon. This justifies the general approach of using a minimum bounding rectangle to represent a two-dimensional range query, because the use of an MBR does not increase the actual number of clusters (i.e., the number of nonconsecutive disk accesses).
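The perimeter argument can be sketched directly on a small concave region; `perimeter` and `mbr_perimeter` are our own helpers operating on a region given as a set of unit cells.

```python
def perimeter(cells):
    """Perimeter of a rectilinear region (a set of unit cells): the number of
    cell sides whose neighboring cell lies outside the region."""
    return sum((x + dx, y + dy) not in cells
               for (x, y) in cells
               for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)))

def mbr_perimeter(cells):
    """Perimeter of the minimum bounding rectangle of the region."""
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    return 2 * ((max(xs) - min(xs) + 1) + (max(ys) - min(ys) + 1))

# an L-shaped (concave) region: a 4 x 4 square with a 2 x 2 corner removed
region = ({(x, y) for x in range(4) for y in range(4)}
          - {(x, y) for x in range(2, 4) for y in range(2, 4)})
print(perimeter(region), mbr_perimeter(region))   # both 16: no increase for the MBR
```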

A similar set of experiments was carried out in higher-dimensional grid spaces. The results in Fig. 12a were obtained by a statistical simulation performed on a 32K × 32K × 32K grid space; again, a total of 200 queries were generated and placed randomly within the grid space for each combination of query shape and size. Those in Fig. 12b were obtained by a statistical simulation with 200 random d-dimensional 3 × 3 × ... × 3 hypercubic queries in a d-dimensional 32K × 32K × ... × 32K grid space (2 ≤ d ≤ 10).

In Fig. 12a, the numbers of clusters form quadratic curves for all three query shapes, but with slightly different coefficients for the quadratic term. To determine the quadratic functions, we applied the least-squares curve-fitting method for each query shape. The approximate quadratic functions were obtained as follows:

  f_cube(s) = 1.02307 s^2 + 1.60267 s + 1.93663
  f_poly(s) = 0.947168 s^2 + 1.26931 s + 1.95395
  f_sphere(s) = 0.816674 s^2 + 1.27339 s + 2.6408.

The approximate function f_cube(s) for a cubic query confirms the asymptotic solution given in Corollary 1, item 2, as it is quite close to s^2. Furthermore, Fig. 12b illustrates that the empirical results from the hypercubic queries


Fig. 10. Illustration of sample query shapes: (a) square, (b) polygon, (c) circle, (d) cube, and (e) polyhedron.

Fig. 11. Average number of clusters for two-dimensional queries: (a) exhaustive simulation (grid: 1,024 × 1,024) and (b) statistical simulation (grid: 32K × 32K).


coincide with the formula (s^{d−1}) even in higher-dimensional spaces.² The numbers from the experiments were less than 2 percent off from the formula.

In contrast, the functions f_poly(s) and f_sphere(s) for concave polyhedral and spherical queries are lower than s^2. The reason is that, unlike in the two-dimensional case, the surface area of a concave polyhedron or a sphere is smaller than that of its minimum bounding cube. For example, the surface area of the polyhedron illustrated in Fig. 10e is (11/2) s^2, while that of the corresponding cube is 6 s^2. For a sphere of diameter s = 16, the surface area (i.e., the number of grid points on the surface of the sphere) is 1,248, which is far smaller than the surface area of the corresponding cube, 6 × 16^2. Note that the coefficients of the quadratic terms in f_poly(s) and f_sphere(s) are fairly close to 11/12 ≈ 0.9166 and 1,248/(6 × 16^2) = 0.8125, respectively. This indicates that, in a d-dimensional space (d ≥ 3), accessing the minimum bounding hyperrectangle of a given query region may incur additional nonconsecutive disk accesses and, hence, supports the argument made in [15] that the minimum bounding rectangle may not be a good approximation of a nonrectangular object.

5.3 Comparison with the Gray-Coded and Z Curves

It may be argued that it is not convincing to draw a definitive conclusion that the Hilbert curve is better or worse than others solely on the basis of average behaviors, because the clustering achieved by the Hilbert curve might have a wider deviation from the average than other curves. Therefore, it is desirable to perform a worst-case analysis to determine the bounds on the deviation. A full-fledged worst-case analysis, however, is beyond the scope of this paper. Instead, we measured the worst-case numbers of clusters for the Hilbert curve and compared them with those of the Gray-coded and z curves in the same simulation experiments.

Fig. 13 and Fig. 14 show the worst-case and average numbers of clusters, respectively. Each figure presents the results from an exhaustive simulation performed on a 1K × 1K two-dimensional space and a statistical simulation performed on a 32K × 32K × 32K three-dimensional space. The Hilbert curve achieves much better clustering than the other curves in both the worst and average cases. For example, for a two-dimensional square query, the Hilbert curve significantly reduced the numbers of clusters, yielding an improvement of up to 43 percent for the worst-case behaviors and 48 percent for the average cases. For a three-dimensional spherical query, the Hilbert curve achieved an improvement of up to 28 percent over the z curve and 18 percent over the Gray-coded curve in the worst cases, and up to 31 percent over the z curve and 22 percent over the Gray-coded curve in the average cases.

Although it is not the focus of this paper, it is worth noting that the Gray-coded curve was not always better than the z curve, in contrast to a previous study [14] which found that the Gray-coded curve achieves better clustering than the z curve for a two-dimensional 2 × 2 square query. In particular, for two-dimensional circular queries (Fig. 13b and Fig. 14b), the Gray-coded curve was worse than the z curve in both the worst and average cases. For two-dimensional square queries, the Gray-coded curve was better than the z curve for average clustering only by a negligible amount (the two measurements were almost identical, as shown in Fig. 14a). Furthermore, it was surprising that the Gray-coded and z curves performed exactly the same for the worst-case clustering (the two measurements were completely identical, as shown in Fig. 13a). In a three-dimensional space, however, the Gray-coded curve was clearly better than the z curve for both types of queries in both the worst and average cases.
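The two-dimensional comparison can be reproduced in miniature with an exhaustive sweep; the sketch below is ours, contrasts only the Hilbert and z (Morton) curves for a 4 × 4 window on a 64 × 64 grid, and omits the Gray-coded curve for brevity.

```python
def xy2d(n, x, y):
    """Hilbert index of grid point (x, y) on an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def z_index(x, y, bits):
    """z (Morton) index: interleave the bits of x and y."""
    d = 0
    for b in range(bits):
        d |= ((x >> b) & 1) << (2 * b + 1)
        d |= ((y >> b) & 1) << (2 * b)
    return d

def avg_clusters(index, n, w):
    """Average number of clusters of a w x w window over all positions."""
    total = positions = 0
    for x0 in range(n - w + 1):
        for y0 in range(n - w + 1):
            idx = sorted(index(x, y) for x in range(x0, x0 + w)
                                     for y in range(y0, y0 + w))
            total += 1 + sum(b != a + 1 for a, b in zip(idx, idx[1:]))
            positions += 1
    return total / positions

n, w = 64, 4
hil = avg_clusters(lambda x, y: xy2d(n, x, y), n, w)
zed = avg_clusters(lambda x, y: z_index(x, y, 6), n, w)
print(hil, zed)   # the Hilbert average is markedly lower
```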

5.4 Summary

The main conclusions from our experiments are:

- The exact solution given in Theorem 2 matches exactly the experimental results from exhaustive simulations for the square queries of size 2^k × 2^k. (See Fig. 11a.)

- The asymptotic solutions given in Theorem 1 and Corollary 1 provide excellent approximations for d-dimensional queries of arbitrary shapes and sizes.


Fig. 12. Average number of clusters for higher-dimensional queries: (a) three-dimensional queries and (b) d-dimensional hypercubic queries.

2. The exponential growth gives rise to the question of whether using the Hilbert curve is a practical technique for clustering high-dimensional data objects. For instance, in a 10-dimensional space, the expected number of clusters was 19,683.

Page 14: Analysis of the clustering properties of the hilbert space ...cacs.usc.edu › education › cs653 › Moon-HilbertCurve-TKDE01.pdf · called Peano curves or space-filling curves

(See Fig. 11b and Fig. 12.) For example, the relative errors did not exceed 2 percent for d-dimensional (2 ≤ d ≤ 10) hypercubic queries.

- Assuming that blocks are arranged on disk by the Hilbert ordering, accessing the minimum bounding rectangle of a d-dimensional (d ≥ 3) query region may increase the number of nonconsecutive accesses, whereas this is not the case for a two-dimensional query.

. The Hilbert curve outperforms the z and Gray-coded curves by a wide margin for both the worst- and average-case clustering. (See Fig. 13 and Fig. 14.)

. For three-dimensional cubic and spherical queries, the Gray-coded curve outperformed the z curve for both the worst-case and average clustering. However, the clustering by the Gray-coded curve was almost identical to that by the z curve for two-dimensional square queries (in Fig. 13a and Fig. 14a) and clearly worse for two-dimensional circular queries (in Fig. 13b and Fig. 14b).
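The comparisons summarized above are easy to reproduce in miniature. The Python sketch below is an illustrative check we added (it is not the authors' simulator, and all function names are ours): it counts clusters, i.e., maximal runs of consecutive curve indices, over every placement of a small square query window under the Hilbert and z (Morton) orderings.

```python
def hilbert_index(n, x, y):
    """Map (x, y) to its index along a Hilbert curve filling an n x n grid
    (n a power of two); the standard iterative rotate-and-accumulate scheme."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant so the
            if rx == 1:                  # recursion stays consistent
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def z_index(n, x, y):
    """Map (x, y) to its index along the z (Morton) curve by bit interleaving."""
    d, b = 0, 0
    while (1 << b) < n:
        d |= ((x >> b) & 1) << (2 * b)
        d |= ((y >> b) & 1) << (2 * b + 1)
        b += 1
    return d

def clusters(indices):
    """Number of maximal runs of consecutive integers in the index set."""
    s = sorted(indices)
    return 1 + sum(1 for a, b in zip(s, s[1:]) if b != a + 1)

def avg_clusters(curve, n, q):
    """Average cluster count over all placements of a q x q query in an n x n grid."""
    total = places = 0
    for x0 in range(n - q + 1):
        for y0 in range(n - q + 1):
            idx = [curve(n, x, y)
                   for x in range(x0, x0 + q)
                   for y in range(y0, y0 + q)]
            total += clusters(idx)
            places += 1
    return total / places

# 4x4 queries on a 16x16 grid; Hilbert should average fewer clusters than z.
print(avg_clusters(hilbert_index, 16, 4), avg_clusters(z_index, 16, 4))
```

On such small grids the averages are distorted by boundary effects, but the ordering of the two curves should already match the figures: the Hilbert average stays near one-fourth of the query perimeter, while the z average is noticeably larger.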

6 CONCLUSIONS

We have studied the clustering property of the Hilbert space-filling curve as a linear mapping of a multidimensional space. Through algebraic analysis, we have provided simple formulas that state the expected number of clusters for a given query region and also validated their correctness through simulation experiments. The main contributions of this paper are:

. Theorem 2 generalizes the previous work done only for a 2×2 query region [14] by providing an exact closed-form formula for 2^k × 2^k square queries for any k (k ≥ 1). The asymptotic solution given in Theorem 1 further generalizes it to d-dimensional polyhedral query regions (d ≥ 2).

. We have proven that the Hilbert curve achieves better clustering than the z curve in a two-dimensional space; the average number of clusters for the Hilbert curve is one-fourth of the perimeter of a query rectangle, while that of the z curve is one-third of the perimeter plus two-thirds of the side length of the rectangle in the unfavored direction [23]. Furthermore, by simulation experiments, we have shown that the Hilbert curve outperforms both the z and Gray-coded curves in two-dimensional and three-dimensional spaces. We conjecture that this trend will hold in higher-dimensional spaces as well.

. We have shown that accessing the minimum bounding hyperrectangle of a d-dimensional nonrectangular query (d ≥ 3) may incur extra overhead because it may increase the number of clusters (i.e., nonconsecutive disk accesses).

MOON ET AL.: ANALYSIS OF THE CLUSTERING PROPERTIES OF THE HILBERT SPACE-FILLING CURVE 137

Fig. 13. Worst-case number of clusters for three different space-filling curves: (a) two-dimensional square queries, (b) two-dimensional circular queries, (c) three-dimensional cubic queries, and (d) three-dimensional spherical queries.


The approaches used in this paper can be applied to other space-filling curves. In particular, the basic intuitions summarized in Remark 3.1 and Remark 4.1 hold for any space-filling curve.

From a practical point of view, it is important to predict and minimize the number of clusters because it determines the number of nonconsecutive disk accesses, which, in turn, incur additional seek time. Assuming that blocks are arranged on disk by the Hilbert ordering, we can provide a simple measure that depends only on the perimeter or surface area of a given query region and its dimensionality. The measure can then be used to predict the required disk access behaviors and, thereby, the total access time.
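If, as the two-dimensional perimeter/4 result suggests, the asymptotic measure takes the form (hypersurface area)/(2d), the predictor is a one-liner. The sketch below is our illustration under that assumption (the function names are ours, not from the paper):

```python
def predicted_clusters(surface_area, d):
    # Assumed asymptotic estimate: average number of clusters (and hence
    # nonconsecutive disk-access runs) ~ hypersurface area / (2 * d).
    return surface_area / (2 * d)

def hypercube_estimate(side, d):
    # A d-dimensional hypercubic query of side s has hypersurface area
    # 2 * d * s**(d - 1), so the estimate reduces to s**(d - 1).
    return predicted_clusters(2 * d * side ** (d - 1), d)

print(predicted_clusters(16, 2))   # 4x4 square query: perimeter 16 -> 4.0
print(hypercube_estimate(3, 10))   # side-3 query in 10 dimensions: 3**9 = 19683.0
```

The second value is consistent with the 10-dimensional figure of 19,683 clusters quoted in the footnote earlier, which corresponds to a side-3 hypercubic query under this reading.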

The full-fledged analysis of the worst-case behaviors of the Hilbert curve is left for future research. Future work also includes the extension of the exact analysis to d-dimensional spaces (d ≥ 3) and the investigation of the distribution of distances between clusters.

APPENDIX

PROOFS

Proof of Lemma 4. A 2+-oriented H_{2^{k+n}} approximation is composed of four H_{2^{k+n−1}} approximations (two on the top and two on the bottom) and three connection edges. The two H_{2^{k+n−1}} approximations on the top half are 2+-oriented, and the two H_{2^{k+n−1}} approximations on the bottom half are 1+-oriented on the left and 1−-oriented on the right. Among the three edges connecting the four H_{2^{k+n−1}} approximations, the horizontal edge is not included in the boundary subregion of the H_{2^{k+n}} because the edge resides on the 2^{k+n−1}th row from the topmost row of the H_{2^{k+n}}. The other two vertical connection edges are on the leftmost and rightmost columns and are included in the boundary subregion of the H_{2^{k+n}}. Thus, the main observations are:

1. The number of connection edges in the top boundary subregion of the 2+-oriented H_{2^{k+n}} is the sum of those in the top boundary subregions of the two 2+-oriented H_{2^{k+n−1}} approximations.

2. The number of connection edges in the bottom boundary subregion of the 2+-oriented H_{2^{k+n}} is the sum of those in the bottom boundary subregions of the 1+-oriented H_{2^{k+n−1}} and 1−-oriented H_{2^{k+n−1}} approximations.

3. The number of connection edges in the left (or right) boundary subregion of the 2+-oriented H_{2^{k+n}} is the sum of those in the left (or right) boundary subregions of the 2+-oriented H_{2^{k+n−1}} and 1+-oriented (or 1−-oriented) H_{2^{k+n−1}} approximations, plus one for a connection edge.

Since the bottom boundary subregion of a 1+-oriented H_{2^{k+n−1}} is equivalent to the right boundary subregion of a 2+-oriented H_{2^{k+n−1}}, etc., it follows that


Fig. 14. Average number of clusters for three different space-filling curves: (a) two-dimensional square queries, (b) two-dimensional circular queries, (c) three-dimensional cubic queries, and (d) three-dimensional spherical queries.


t_n = 2 t_{n−1}
b_n = 2 s_{n−1}
s_n = s_{n−1} + b_{n−1} + 1.

Since t_1 = 1, b_1 = 0, and s_1 = 1, we obtain t_n = 2^{n−1} and

b_n + 2 s_n = 2(b_{n−1} + 2 s_{n−1}) + 2,

which yields b_n + 2 s_n = 2(2^n − 1). □

Proof of Lemma 5. The H_{2^{k+n}} and H_{2^k} approximations contain 2^{2(k+n)} − 1 and 2^{2k} − 1 edges, respectively. Since there are a total of 4(2^n − 1) H_{2^k} approximations in the boundary subregions, the total number of edges in E1 is given by

(2^{2(k+n)} − 1) − 4(2^n − 1)(2^{2k} − 1) − (5 · 2^{n−1} − 2) = 2^{2k}(2^n − 2)^2 + 3(2^{n−1} − 1).

Because each edge in E1 is cut 2^{k+1} times, it follows that

N1 = 2^{k+1}(2^{2k}(2^n − 2)^2 + 3(2^{n−1} − 1)) = 2(2^n − 2)^2 2^{3k} + 3(2^n − 2) 2^k.

Among the 5 · 2^{n−1} − 2 edges in E3, t_n edges are cut 2^{k+1} times, and the other b_n + 2 s_n edges are cut twice. Therefore,

N3 = 2^{k+1} t_n + 2(b_n + 2 s_n) = 2^{n+k} + 4(2^n − 1). □
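The recurrences and closed forms above can be sanity-checked numerically. The short script below is an illustrative check we added (not part of the paper): it iterates the Lemma 4 recurrences and verifies the closed forms used in the proof of Lemma 5.

```python
def tbs(n):
    """Iterate the Lemma 4 recurrences t_n = 2 t_{n-1}, b_n = 2 s_{n-1},
    s_n = s_{n-1} + b_{n-1} + 1 from the base case t_1 = 1, b_1 = 0, s_1 = 1."""
    t, b, s = 1, 0, 1
    for _ in range(n - 1):
        t, b, s = 2 * t, 2 * s, s + b + 1
    return t, b, s

for n in range(1, 13):
    t, b, s = tbs(n)
    assert t == 2 ** (n - 1)                        # closed form for t_n
    assert b + 2 * s == 2 * (2 ** n - 1)            # closed form for b_n + 2 s_n
    assert t + b + 2 * s == 5 * 2 ** (n - 1) - 2    # total edge count in E3 (Lemma 5)
print("recurrences consistent with closed forms")
```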

Proof of Lemma 6. Consider a 2+-oriented H_{2^{k+n}}, which is composed of four H_{2^{k+n−1}} approximations and three connection edges. The number of 2+-oriented H_{2^k} approximations in the top subregions (i.e., {B,F,H}) of the 2+-oriented H_{2^{k+n}} is twice the number of 2+-oriented H_{2^k} approximations in the top subregions of the 2+-oriented H_{2^{k+n−1}}. This is because the top half of the 2+-oriented H_{2^{k+n}} consists of two 2+-oriented H_{2^{k+n−1}} approximations. Thus, the recurrence relation is {B,F,H}_{2+,n} = 2 {B,F,H}_{2+,n−1}. Since {B,F,H}_{2+,1} = 2, we obtain

{B,F,H}_{2+,n} = 2^n.

The bottom half of the 2+-oriented H_{2^{k+n}} consists of a 1+-oriented H_{2^{k+n−1}} and a 1−-oriented H_{2^{k+n−1}}. In the bottom boundary subregions {C,G,I}, each 1−-oriented H_{2^k} in the 1+-oriented H_{2^{k+n−1}} approximation becomes a 2+-oriented H_{2^k} in the 2+-oriented H_{2^{k+n}} approximation; each 1+-oriented H_{2^k} in the 1−-oriented H_{2^{k+n−1}} approximation becomes a 2+-oriented H_{2^k} in the 2+-oriented H_{2^{k+n}} approximation. No H_{2^k} approximations in the H_{2^{k+n−1}} approximations other than the 1−-oriented and 1+-oriented ones become 2+-oriented H_{2^k} in the H_{2^{k+n}}. Thus, it follows that

{C,G,I}_{2+,n} = {C,G,I}_{1−,n−1} + {C,G,I}_{1+,n−1}.

Since there exist no 2−-oriented H_{2^k} approximations in the bottom boundary subregions, {C,G,I}_{2−,n} = 0. Thus,

{C,G,I}_{2+,n} + {C,G,I}_{1−,n} + {C,G,I}_{1+,n} = 2^n.

Similarly, on the left boundary subregion, we obtain the following recurrence relations:

{D,F,G}_{1+,n} = {D,F,G}_{2+,n−1} + {D,F,G}_{2−,n−1}
{D,F,G}_{1+,n} + {D,F,G}_{2+,n} + {D,F,G}_{2−,n} = 2^n.

Then, from the above four recurrence relations,

{C,G,I}_{2+,n} + 2 {D,F,G}_{1+,n}
= (2^{n−1} − {C,G,I}_{2+,n−1}) + 2(2^{n−1} − {D,F,G}_{1+,n−1})
= (2^{n−2} + {C,G,I}_{2+,n−2}) + 2(2^{n−2} + {D,F,G}_{1+,n−2})
= 3 · 2^{n−2} + ({C,G,I}_{2+,n−2} + 2 {D,F,G}_{1+,n−2}).

Since

{C,G,I}_{2+,1} + 2 {D,F,G}_{1+,1} = 2

and

{C,G,I}_{2+,2} + 2 {D,F,G}_{1+,2} = 4,

we obtain

{C,G,I}_{2+,n} + 2 {D,F,G}_{1+,n} = 2^n.

From {E,H,I}_{1−,n} = {D,F,G}_{1+,n}, due to the self-symmetry of the 2+-oriented H_{2^{k+n}}, it follows that

{C,G,I}_{2+,n} + {D,F,G}_{1+,n} + {E,H,I}_{1−,n} = {C,G,I}_{2+,n} + 2 {D,F,G}_{1+,n} = 2^n.

Now, consider subregions {F,G,H,I}. The H_{2^k} approximations in {F,H} are always 2+-oriented, the H_{2^k} in {G} is either 2+-oriented or 1+-oriented, and the H_{2^k} in {I} is either 2+-oriented or 1−-oriented. Thus, {F,H}_{2+,n} = 2 and {G,I}_{2+,n} + {G,I}_{1+,n} + {G,I}_{1−,n} = 2. Therefore,

{B}_{2+,n} = {B,F,H}_{2+,n} − {F,H}_{2+,n} = 2^n − 2
{C}_{2+,n} + {D}_{1+,n} + {E}_{1−,n} = ({C,G,I}_{2+,n} + {D,F,G}_{1+,n} + {E,H,I}_{1−,n}) − ({G,I}_{2+,n} + {G,I}_{1+,n} + {G,I}_{1−,n}) = 2^n − 2.

So far, we have derived the first two equations given in this lemma. Finally, to derive the third equation, consider subregions {B,C,D,E}. Since the total number of H_{2^k} approximations in those subregions is 4(2^n − 2),

{B,C,D,E}_{2+,n} + {B,C,D,E}_{2−,n} + {B,C,D,E}_{1−,n} + {B,C,D,E}_{1+,n} = 4(2^n − 2).

There exist no 2−-oriented H_{2^k} in {B,C}, no 1−-oriented H_{2^k} in {B,D}, and no 1+-oriented H_{2^k} in {B,E}. That is, {B,C}_{2−,n} = {B,D}_{1−,n} = {B,E}_{1+,n} = 0. Therefore,

{D,E}_{2+,n} + {D,E}_{2−,n} + {C}_{1−,n} + {C}_{1+,n}
= 4(2^n − 2) − ({B,C}_{2+,n} + {B,C}_{2−,n} + {B,D,E}_{1−,n} + {B,D,E}_{1+,n})
= 4(2^n − 2) − ({B,C}_{2+,n} + {E}_{1−,n} + {D}_{1+,n})
= 2(2^n − 2). □



Proof of Lemma 7. Every H_{2^k} approximation in subregion {B} is 2+-oriented, and there exists no 2−-oriented H_{2^k} approximation in subregion {C}. Thus, the number of vertical edges in subregions {B,C} is the sum of {B,C}_{2+,n} V_k and ({C}_{1+,n} + {C}_{1−,n}) H_k. Likewise, the number of horizontal edges in subregions {D,E} is the sum of ({D,E}_{2+,n} + {D,E}_{2−,n}) H_k and ({D}_{1+,n} + {E}_{1−,n}) V_k because there exist no 1−-oriented H_{2^k} in subregion {D} and no 1+-oriented H_{2^k} in subregion {E}. Thus, the total number of edges in E2 is given by

({B,C}_{2+,n} + {D}_{1+,n} + {E}_{1−,n}) V_k + ({C}_{1+,n} + {C}_{1−,n} + {D,E}_{2+,n} + {D,E}_{2−,n}) H_k = 2(2^n − 2)(H_k + V_k)   (by Lemma 6).

Each edge in E2 is cut 2^k times and H_k + V_k = 2^{2k} − 1. Therefore,

N2 = 2(2^n − 2)(2^{2k} − 1) 2^k = 2(2^n − 2) 2^{3k} − 2(2^n − 2) 2^k. □

Proof of Lemma 8. First,

α_k + β_k = Σ_{i=1}^{2^k} i h_k(i) + Σ_{i=1}^{2^k} (2^k − i + 1) h_k(i) = Σ_{i=1}^{2^k} (2^k + 1) h_k(i).

From the definition of H_k, H_k = Σ_{i=1}^{2^k} h_k(i). Therefore,

α_k + β_k = (2^k + 1) H_k.

Second,

γ_k = Σ_{i=1}^{2^{k−1}} i v_k(i) + Σ_{i=2^{k−1}+1}^{2^k} i v_k(i) = Σ_{i=1}^{2^{k−1}} i v_k(i) + Σ_{i=1}^{2^{k−1}} (2^{k−1} + i) v_k(2^{k−1} + i).

Since 2-oriented H_{2^k} approximations are vertically self-symmetric,

v_k(2^k − i + 1) = v_k(i)

holds for any i (1 ≤ i ≤ 2^{k−1}). Thus,

γ_k = Σ_{i=1}^{2^{k−1}} i v_k(i) + Σ_{i=1}^{2^{k−1}} (2^{k−1} + i) v_k(2^{k−1} − i + 1) = Σ_{i=1}^{2^{k−1}} i v_k(i) + Σ_{i=1}^{2^{k−1}} (2^k − i + 1) v_k(i).

From the definition of V_k and self-symmetry, V_k = 2 Σ_{i=1}^{2^{k−1}} v_k(i). Therefore,

γ_k = Σ_{i=1}^{2^{k−1}} (2^k + 1) v_k(i) = (1/2)(2^k + 1) V_k. □

Proof of Lemma 9. In E4, the number of horizontal cuts from a single u-gradient_k is 2α_k, the number of horizontal cuts from a single d-gradient_k is 2β_k, and the number of vertical cuts from a single s-gradient_k is 2γ_k. Thus,

N4 = 2α_k {B}_{2+,n} + 2β_k ({C}_{2+,n} + {D}_{1+,n} + {E}_{1−,n}) + 2γ_k ({D,E}_{2+,n} + {D,E}_{2−,n} + {C}_{1−,n} + {C}_{1+,n})
= 2α_k (2^n − 2) + 2β_k (2^n − 2) + 4γ_k (2^n − 2)   (by Lemma 6)
= 2(2^n − 2)(α_k + β_k + 2γ_k)
= 2(2^n − 2)(2^k + 1)(H_k + V_k)   (by Lemma 8)
= 2(2^n − 2)(2^k + 1)(2^{2k} − 1).

In E5, the number of horizontal cuts from a single u-gradient_k is α_k, the number of horizontal cuts from a single d-gradient_k is β_k, and the number of vertical cuts from a single s-gradient_k is γ_k. Thus,

N5 = 2α_k + 2β_k + 4γ_k = 2(2^k + 1)(2^{2k} − 1). □

ACKNOWLEDGMENTS

This work was sponsored in part by US National Science Foundation CAREER Award IIS-9876037 and grants IRI-9625428 and DMS-9873442, and by the Advanced Research Projects Agency under contract No. DABT63-94-C-0049. It was also supported by US National Science Foundation Cooperative Agreement No. IRI-9411299 and by DARPA/ITO through Order F463, issued by ESC/ENS under contract N66001-97-C-851. Additional funding was provided by donations from NEC and Intel. The authors assume all responsibility for the contents of the paper.

REFERENCES

[1] D.J. Abel and D.M. Mark, "A Comparative Analysis of Some Two-Dimensional Orderings," Int'l J. Geographical Information Systems, vol. 4, no. 1, pp. 21-31, Jan. 1990.

[2] J.J. Bartholdi and L.K. Platzman, "An O(n log n) Travelling Salesman Heuristic Based on Spacefilling Curves," Operations Research Letters, vol. 1, no. 4, pp. 121-125, Sept. 1982.

[3] M. Berger, Geometry II. New York: Springer-Verlag, 1987.

[4] T. Bially, "Space-Filling Curves: Their Generation and Their Application to Bandwidth Reduction," IEEE Trans. Information Theory, vol. 15, no. 6, pp. 658-664, Nov. 1969.

[5] A.R. Butz, "Convergence with Hilbert's Space Filling Curve," J. Computer and System Sciences, vol. 3, pp. 128-146, 1969.

[6] I.S. Duff, "Design Features of a Frontal Code for Solving Sparse Unsymmetric Linear Systems Out-of-Core," SIAM J. Scientific and Statistical Computing, vol. 5, no. 2, pp. 270-280, June 1984.

[7] C.R. Dyer, "The Space Efficiency of Quadtrees," Computer Graphics and Image Processing, vol. 19, no. 4, pp. 335-348, Aug. 1982.

[8] C. Faloutsos, "Multiattribute Hashing Using Gray Codes," Proc. 1986 ACM SIGMOD Conf., pp. 227-238, May 1986.

[9] C. Faloutsos, "Analytical Results on the Quadtree Decomposition of Arbitrary Rectangles," Pattern Recognition Letters, vol. 13, no. 1, pp. 31-40, Jan. 1992.

[10] C. Faloutsos, H.V. Jagadish, and Y. Manolopoulos, "Analysis of the N-Dimensional Quadtree Decomposition for Arbitrary Hyperrectangles," IEEE Trans. Knowledge and Data Eng., vol. 9, no. 3, pp. 373-383, May/June 1997.



[11] C. Faloutsos and S. Roseman, "Fractals for Secondary Key Retrieval," Proc. ACM Principles of Database Systems Conf., pp. 247-252, Mar. 1989.

[12] P.J. Giblin, Graphs, Surfaces, and Homology, second ed. New York: Chapman and Hall, 1981.

[13] D. Hilbert, "Über die stetige Abbildung einer Linie auf ein Flächenstück," Math. Ann., vol. 38, pp. 459-460, 1891.

[14] H.V. Jagadish, "Linear Clustering of Objects with Multiple Attributes," Proc. ACM SIGMOD Conf., pp. 332-342, May 1990.

[15] H.V. Jagadish, "Spatial Search with Polyhedra," Proc. Sixth Int'l Conf. Data Eng., pp. 311-319, Feb. 1990.

[16] H.V. Jagadish, "Analysis of the Hilbert Curve for Representing Two-Dimensional Space," Information Processing Letters, vol. 62, no. 1, pp. 17-22, Apr. 1997.

[17] M. Kaddoura, C.-W. Ou, and S. Ranka, "Partitioning Unstructured Computational Graphs for Nonuniform and Adaptive Environments," IEEE Parallel and Distributed Technology, vol. 3, no. 3, pp. 63-69, Fall 1995.

[18] A. Lempel and J. Ziv, "Compression of Two-Dimensional Images," NATO ASI Series, vol. F12, pp. 141-154, June 1984.

[19] J. Orenstein, "Spatial Query Processing in an Object-Oriented Database System," Proc. 1986 ACM SIGMOD Conf., pp. 326-336, May 1986.

[20] E.A. Patrick, D.R. Anderson, and F.K. Bechtel, "Mapping Multidimensional Space to One Dimension for Computer Output Display," IEEE Trans. Computers, vol. 17, no. 10, pp. 949-953, Oct. 1968.

[21] G. Peano, "Sur une Courbe qui Remplit Toute une Aire Plane," Math. Ann., vol. 36, pp. 157-160, 1890.

[22] R.L. Rivest, "Partial Match Retrieval Algorithms," SIAM J. Computing, vol. 5, no. 1, pp. 19-50, Mar. 1976.

[23] Y. Rong and C. Faloutsos, "Analysis of the Clustering Property of Peano Curves," Technical Report CS-TR-2792, UMIACS-TR-91-151, Univ. of Maryland, Dec. 1991.

[24] J.B. Rothnie and T. Lozano, "Attribute Based File Organization in a Paged Memory Environment," Comm. ACM, vol. 17, no. 2, pp. 63-69, Feb. 1974.

[25] C. Ruemmler and J. Wilkes, "An Introduction to Disk Drive Modeling," IEEE Computer, vol. 27, no. 3, pp. 17-28, Mar. 1994.

[26] H. Sagan, "A Three-Dimensional Hilbert Curve," Int'l J. Math. Education in Science and Technology, vol. 24, pp. 541-545, 1993.

[27] C.A. Shaffer, "A Formula for Computing the Number of Quadtree Node Fragments Created by a Shift," Pattern Recognition Letters, vol. 7, no. 1, pp. 45-49, Jan. 1988.

[28] G.F. Simmons, Introduction to Topology and Modern Analysis. New York: McGraw-Hill, 1963.

Bongki Moon received the BS and MS degrees in computer engineering from Seoul National University, Korea, in 1983 and 1985, respectively, and the PhD degree in computer science from the University of Maryland, College Park, in 1996. He is an assistant professor in the Department of Computer Science, University of Arizona. His current research interests include high-performance spatial databases, scalable web servers, data mining and warehousing, and parallel and distributed processing. He worked for Samsung Electronics Corp., Korea, as a member of the research staff at the Communication Systems Division from 1985 to 1990.

H.V. Jagadish received the PhD degree from Stanford University in 1985. He spent over a decade at AT&T Bell Laboratories in Murray Hill, N.J., eventually becoming head of AT&T Labs' database research department at the Shannon Laboratory in Florham Park, New Jersey. He has also served as a professor at the University of Illinois at Urbana-Champaign. Currently, he is a professor of computer science at the University of Michigan in Ann Arbor. Dr. Jagadish is well-known for his broad-ranging research on databases and is currently the founding editor of the ACM SIGMOD Digital Review. Among the many professional positions he has held, he has been an associate editor of the ACM Transactions on Database Systems (1992-1995) and program chair of the ACM SIGMOD annual conference (1996). The focus of his current work is the design of hierarchical databases for highly distributed and heterogeneous contexts.

Christos Faloutsos received the BSc degree in electrical engineering (1981) from the National Technical University of Athens, Greece, and the MSc and PhD degrees in computer science from the University of Toronto, Canada. Dr. Faloutsos is currently a faculty member at Carnegie Mellon University (CMU) in Pittsburgh, Pennsylvania. Prior to joining CMU, he was on the faculty of the Department of Computer Science at the University of Maryland, College Park. He has spent sabbaticals at IBM Almaden and AT&T Bell Labs. He has worked as a consultant to several companies, including AT&T Research, Lucent, and SUN. He received the Presidential Young Investigator Award from the US National Science Foundation (1989), two "best paper" awards (SIGMOD '94, VLDB '97), and four teaching awards. He has published more than 70 refereed articles and one monograph, and has filed for four patents. His research interests include physical database design, searching methods for text, geographic information systems, indexing methods for multimedia databases, and data mining. He is a member of the IEEE and the IEEE Computer Society.

Joel H. Saltz received the BS degree in mathematics and physics and the MA degree in mathematics from the University of Michigan, Ann Arbor, in 1977 and 1978, respectively, and the MD and PhD degrees in computer science from Duke University in 1985. He is a professor with the Department of Computer Science and the Institute for Advanced Computer Studies, University of Maryland, College Park. He leads a research group whose goal is to develop methods to produce portable compilers that generate efficient multiprocessor code for irregular scientific problems that are unstructured, sparse, adaptive, or block structured. He is a member of the IEEE and the IEEE Computer Society.
