Indexing Multidimensional Data


Rui Zhang, http://www.csse.unimelb.edu.au/~rui

The University of Melbourne, Aug 2006

Outline

  Background
    Multidimensional data and queries
  Approaches
    Mapping based indexing
      Z-curve
      iDistance
    Hierarchical-tree based indexing
      R-tree
      k-d-tree
      Quad-tree
    Compression based indexing
      VA-file

Multidimensional Data

  Spatial data (low-dimensionality)
    Geographic information: Melbourne (37, 145)
      Which city is at (30, 140)?
    Computer Aided Design: width and height (40, 50)
      Any part that has a width of 40 and a height of 50?
  Records with multiple attributes (medium-dimensionality)
    Employee (ID, age, score, salary, …)
      Is there any employee whose age is under 25, performance score is greater than 80, and salary is between 3000 and 5000?
  Multimedia data (high-dimensionality)
    Multimedia features: color, shape, texture
    Color histograms of images
      Give me the most similar image to [query image]

Multidimensional Queries

  Point query
    Return the objects located at Q(x1, x2, …, xd).
    E.g. Q = (3.4, 6.6).
  Window query
    Return all the objects enclosed or intersected by the hyper-rectangle W = {[L1, U1], [L2, U2], …, [Ld, Ud]}.
    E.g. W = {[0, 4], [2, 5]}.
  K-Nearest Neighbor query (KNN query)
    Return k objects whose distances to Q are no larger than any other object's distance to Q.
    E.g. 3NN of Q = (4, 1).
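The three query types can be written down as brute-force scans, which is a useful baseline before any index is introduced. This is only a sketch: the point set is borrowed from the building table used in the mapping slides, and the function names are made up for illustration.

```python
import math

# Buildings from the mapping example in these slides: name -> (x, y).
POINTS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}


def point_query(points, q):
    """Names of the objects located exactly at q."""
    return sorted(name for name, p in points.items() if p == q)


def window_query(points, window):
    """Names of the objects inside the hyper-rectangle [(L1, U1), ..., (Ld, Ud)]."""
    return sorted(
        name for name, p in points.items()
        if all(lo <= x <= hi for x, (lo, hi) in zip(p, window))
    )


def knn_query(points, q, k):
    """Names of the k objects closest to q (Euclidean distance), nearest first."""
    ranked = sorted(points, key=lambda name: math.dist(points[name], q))
    return ranked[:k]


print(point_query(POINTS, (3.4, 6.6)))
print(window_query(POINTS, [(0, 4), (2, 5)]))
print(knn_query(POINTS, (4, 1), 3))
```

Every query here touches all objects; the index structures in the rest of the deck aim to answer the same queries while touching far fewer.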

Mapping Based Multidimensional Indexing

  Story
    The CBD: {[0, 4], [2, 5]}
    Blocks in the CBD are: [8, 15], [32, 33] and [36, 37]
  General strategy: three steps
    Data mapping and indexing
    Query mapping and data retrieval
    Filtering out false positives

  Buildings as given:

    Name  x    y    Block  Height
    A     0.7  1.2  2      100
    B     5.8  1.2  19     50
    C     2.7  2.3  12     80
    D     5.5  2.4  25     90
    E     6.6  2.5  28     40
    F     1.7  3.8  11     120
    G     2.8  4.7  36     100
    H     0.6  5.8  34     50
    I     1.6  6.7  41     60
    J     3.4  6.6  45     40

  Sort (by block number):

    Name  x    y    Block  Height
    A     0.7  1.2  2      100
    F     1.7  3.8  11     120
    C     2.7  2.3  12     80
    B     5.8  1.2  19     50
    D     5.5  2.4  25     90
    E     6.6  2.5  28     40
    H     0.6  5.8  34     50
    G     2.8  4.7  36     100
    I     1.6  6.7  41     60
    J     3.4  6.6  45     40
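The three steps can be sketched on the building table above. Several things here are assumptions rather than part of the slides: the grid is taken to be 8 x 8 unit cells, the block number is obtained by interleaving the bits of the cell coordinates with the y bit first (this reproduces the Block column), a sorted Python list stands in for the B+-tree, and the window is treated as half-open so that the CBD covers exactly the blocks listed in the story.

```python
import bisect
import math

# (name, x, y) rows from the building table.
BUILDINGS = [
    ("A", 0.7, 1.2), ("B", 5.8, 1.2), ("C", 2.7, 2.3), ("D", 5.5, 2.4),
    ("E", 6.6, 2.5), ("F", 1.7, 3.8), ("G", 2.8, 4.7), ("H", 0.6, 5.8),
    ("I", 1.6, 6.7), ("J", 3.4, 6.6),
]
BITS = 3  # assumed grid: 8 x 8 unit cells


def block(cx, cy):
    """Block number of grid cell (cx, cy): interleave bits, y bit before x bit."""
    z = 0
    for i in reversed(range(BITS)):
        z = (z << 2) | (((cy >> i) & 1) << 1) | ((cx >> i) & 1)
    return z


# Step 1: data mapping and indexing (a sorted list stands in for a B+-tree).
index = sorted((block(int(x), int(y)), name, x, y) for name, x, y in BUILDINGS)
keys = [entry[0] for entry in index]


def window_query(x_lo, x_hi, y_lo, y_hi):
    """Window query over [x_lo, x_hi) x [y_lo, y_hi); returns (intervals, names)."""
    # Step 2a: query mapping. Collect the blocks of all cells touched by the
    # window and merge consecutive block numbers into intervals.
    cells = sorted(
        block(cx, cy)
        for cx in range(int(x_lo), math.ceil(x_hi))
        for cy in range(int(y_lo), math.ceil(y_hi))
    )
    intervals = []
    for z in cells:
        if intervals and z == intervals[-1][1] + 1:
            intervals[-1][1] = z
        else:
            intervals.append([z, z])
    # Step 2b: data retrieval, one range scan per interval.
    hits = []
    for lo, hi in intervals:
        hits.extend(index[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)])
    # Step 3: filter out false positives using the exact coordinates.
    names = [name for _, name, x, y in hits
             if x_lo <= x < x_hi and y_lo <= y < y_hi]
    return intervals, names


print(window_query(0, 4, 2, 5))
```

In this particular query the window is aligned with the grid, so step 3 removes nothing; a window that cuts through cells would retrieve objects from partially covered blocks, and the filter would then discard those that fall outside.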

The Z-curve and Other Space-Filling Curves

  The Z-curve
    Z-value calculation: bit-interleaving
    Support efficient window queries
    Disadvantage: jumps
  Other space-filling curves
    Hilbert-curves
    Gray-code
    Column-wise scan
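Bit-interleaving and the "jumps" disadvantage can both be seen in a few lines. The sketch below assumes 3 bits per dimension and places the y bit before the x bit, a convention chosen because it reproduces the block numbers in the building table; `z_value` and `z_cell` are illustrative names.

```python
def z_value(cx, cy, bits=3):
    """Interleave the bits of cell (cx, cy), y bit before x bit."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((cy >> i) & 1)
        z = (z << 1) | ((cx >> i) & 1)
    return z


def z_cell(z, bits=3):
    """Inverse mapping: recover the cell (cx, cy) from a Z-value."""
    cx = cy = 0
    for i in reversed(range(bits)):
        cy = (cy << 1) | ((z >> (2 * i + 1)) & 1)
        cx = (cx << 1) | ((z >> (2 * i)) & 1)
    return cx, cy


# Building B lies in cell (5, 1): x = 101, y = 001 -> 0 1 0 0 1 1 = 19.
print(z_value(5, 1))
# A jump: Z-values 7 and 8 are consecutive, but their cells are far apart.
print(z_cell(7), z_cell(8))
```

Consecutive Z-values 7 and 8 map to cells (3, 1) and (0, 2), so two neighbours on the curve can be distant in space. This is why a window usually maps to several disjoint Z-intervals rather than one.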


Mapping for KNN Queries

  Story continued
    New factory at Q = (4, 1)
    Find 3 nearest buildings to Q
  Termination condition
    K candidates
    All in the current search circle

  Buildings as given:

    Name  x    y    Street  Height
    A     0.7  1.2  14      100
    B     5.8  1.2  32      50
    C     2.7  2.3  12      80
    D     5.5  2.4  31      90
    E     6.6  2.5  32      40
    F     1.7  3.8  13      120
    G     2.8  4.7  24      100
    H     0.6  5.8  23      50
    I     1.6  6.7  22      60
    J     3.4  6.6  24      40

  Sort (by street number):

    Name  x    y    Street  Height
    C     2.7  2.3  12      80
    F     1.7  3.8  13      120
    A     0.7  1.2  14      100
    I     1.6  6.7  22      60
    H     0.6  5.8  23      50
    G     2.8  4.7  24      100
    J     3.4  6.6  24      40
    D     5.5  2.4  31      90
    B     5.8  1.2  32      50
    E     6.6  2.5  32      40

  [Figure: streets 11 to 14, 21 to 24, and 31 to 32; search circle around Q with radius R = 0.35, 0.70, 1.05, 1.40, 1.75, 2.10]

  Distances to Q:
    ||AQ|| = 3.31, ||FQ|| = 3.62, ||BQ|| = 1.81, ||EQ|| = 3.00, ||CQ|| = 1.84, ||DQ|| = 2.05

  Candidate list as the search circle grows (candidate, distance to Q):

    Rank 1    Rank 2    Rank 3
    A 3.31
    A 3.31    F 3.62
    B 1.81    A 3.31    F 3.62
    B 1.81    E 3.00    A 3.31
    B 1.81    C 1.84    E 3.00
    B 1.81    C 1.84    D 2.05

The iDistance

  Data partitioned into a number of clusters
  Streets are concentric circles
  Data mapping
    Objects mapped to street numbers
  Query mapping
    Search circle mapped to streets intersected
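The data mapping, query mapping and termination condition can be put together in a small sketch. It is a simplification under stated assumptions: the three reference points in `REFS` and the constant `C` are hypothetical (the slides do not give the cluster centres, so the keys differ from the street numbers in the table), a sorted list stands in for the B+-tree, and each round rescans the whole key range instead of extending the previous scan as a real implementation would.

```python
import bisect
import math

# Buildings from the slides: name -> (x, y).
POINTS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}
REFS = [(2.0, 2.0), (2.0, 6.0), (6.0, 2.0)]  # hypothetical reference points
C = 100.0  # hypothetical constant, larger than any distance in the data


def build_index(points):
    """Data mapping: key = i * C + distance to the nearest reference point i."""
    entries, radius = [], [0.0] * len(REFS)
    for name, p in points.items():
        i = min(range(len(REFS)), key=lambda j: math.dist(p, REFS[j]))
        d = math.dist(p, REFS[i])
        radius[i] = max(radius[i], d)  # extent of partition i
        entries.append((i * C + d, name))
    entries.sort()
    return entries, radius


def knn(points, q, k, step=0.35):
    """k nearest neighbours of q; assumes 1 <= k <= len(points)."""
    entries, radius = build_index(points)
    keys = [key for key, _ in entries]
    rounds = 0
    while True:
        rounds += 1
        r = rounds * step  # enlarge the search circle
        found = []
        for i, ref in enumerate(REFS):
            dq = math.dist(q, ref)
            if dq - r > radius[i]:
                continue  # search circle does not reach partition i
            # Query mapping: the circle covers distances [dq - r, dq + r].
            lo = bisect.bisect_left(keys, i * C + max(0.0, dq - r))
            hi = bisect.bisect_right(keys, i * C + min(radius[i], dq + r))
            found.extend(name for _, name in entries[lo:hi])
        best = sorted(found, key=lambda n: math.dist(points[n], q))[:k]
        # Termination: k candidates, all inside the current search circle.
        if len(best) == k and math.dist(points[best[-1]], q) <= r:
            return best, r


print(knn(POINTS, (4.0, 1.0), 3))
```

With the step of 0.35 taken from the slide, the search stops at R = 2.10, the first radius that is no smaller than the distance of the third neighbour (2.05).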

Hierarchical Tree Structures

  R-tree
    Minimum bounding rectangle (MBR)
    Incomplete and overlapping partitioning
    Disk-based; Balanced
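How MBRs prune a window query can be shown with a hand-built two-level tree. This sketch does not implement R-tree insertion or node splitting; the grouping of the buildings into leaves `n1`, `n2`, `n3` is hypothetical, and the coordinates are taken from the building table in the mapping slides.

```python
def mbr(rects):
    """Minimum bounding rectangle of rectangles given as (x_lo, y_lo, x_hi, y_hi)."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def intersects(a, b):
    """True if rectangles a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class Node:
    def __init__(self, entries, leaf):
        self.entries = entries  # leaf: (name, rect) pairs; internal: child Nodes
        self.leaf = leaf
        rects = [e[1] for e in entries] if leaf else [e.rect for e in entries]
        self.rect = mbr(rects)  # the node's MBR encloses everything below it


def search(node, window, visited):
    """Names of objects intersecting the window; records the nodes visited."""
    visited.append(node)
    if node.leaf:
        return [name for name, rect in node.entries if intersects(rect, window)]
    hits = []
    for child in node.entries:
        if intersects(child.rect, window):  # skip subtrees outside the window
            hits.extend(search(child, window, visited))
    return hits


def pt(name, x, y):
    return (name, (x, y, x, y))  # a point is a degenerate rectangle


# Hypothetical leaf grouping of the buildings.
n1 = Node([pt("A", 0.7, 1.2), pt("C", 2.7, 2.3), pt("F", 1.7, 3.8)], leaf=True)
n2 = Node([pt("B", 5.8, 1.2), pt("D", 5.5, 2.4), pt("E", 6.6, 2.5)], leaf=True)
n3 = Node([pt("G", 2.8, 4.7), pt("H", 0.6, 5.8), pt("I", 1.6, 6.7),
           pt("J", 3.4, 6.6)], leaf=True)
root = Node([n1, n2, n3], leaf=False)

visited = []
print(search(root, (0, 2, 4, 5), visited), len(visited))
```

Leaf `n2` is never read because its MBR lies entirely to the right of the window. When the MBRs of sibling nodes overlap, a query may have to descend into several of them, which is the overlap disadvantage noted in the summary table.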

  [Figure: R-tree examples over objects A to G]

  K-d-tree
    Space division recursively
    Complete and disjoint partitioning
    In-memory; Unbalanced
    There are algorithms to page and balance the tree, but with more complex manipulations
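Recursive space division can be illustrated with a minimal insertion-based k-d-tree. The sketch assumes the simplest variant, in which each inserted point becomes a split point and the split dimension alternates with depth; the building coordinates from the mapping slides are reused as data.

```python
class KDNode:
    def __init__(self, name, point):
        self.name, self.point = name, point
        self.left = self.right = None


def insert(node, name, point, depth=0):
    """Insert a point; the splitting dimension alternates with the depth."""
    if node is None:
        return KDNode(name, point)
    axis = depth % len(point)
    if point[axis] < node.point[axis]:
        node.left = insert(node.left, name, point, depth + 1)
    else:
        node.right = insert(node.right, name, point, depth + 1)
    return node


def window_query(node, window, depth=0, out=None):
    """Collect names of points inside window = [(L1, U1), ..., (Ld, Ud)]."""
    out = [] if out is None else out
    if node is None:
        return out
    if all(lo <= x <= hi for x, (lo, hi) in zip(node.point, window)):
        out.append(node.name)
    axis = depth % len(node.point)
    lo, hi = window[axis]
    if lo < node.point[axis]:   # window reaches into the "smaller" half
        window_query(node.left, window, depth + 1, out)
    if hi >= node.point[axis]:  # window reaches into the "larger or equal" half
        window_query(node.right, window, depth + 1, out)
    return out


def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


POINTS = [
    ("A", (0.7, 1.2)), ("B", (5.8, 1.2)), ("C", (2.7, 2.3)), ("D", (5.5, 2.4)),
    ("E", (6.6, 2.5)), ("F", (1.7, 3.8)), ("G", (2.8, 4.7)), ("H", (0.6, 5.8)),
    ("I", (1.6, 6.7)), ("J", (3.4, 6.6)),
]
root = None
for name, point in POINTS:
    root = insert(root, name, point)

print(sorted(window_query(root, [(0, 4), (2, 5)])), height(root))
```

Because the shape depends on insertion order, the tree built here is deeper than the 4 levels a balanced tree over 10 points would need, which illustrates the "Unbalanced" property.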

  [Figure: example trees with nodes N1 to N5 over objects A to G, with split values 0.5 and 0.3]

  Problem: Overlap
  Problem: Empty space

Hierarchical Tree Structures (continued)

  Quad-tree
    Space divided into 4 rectangles recursively.
    Complete and disjoint partitioning
    In-memory; Unbalanced
    There are algorithms to page and balance the tree, but with more complex manipulations
    The point quad-tree
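A point quad-tree can be sketched in the same spirit: every stored point divides its region into NW, NE, SW and SE quadrants. The tie-breaking rule (points on a dividing line go to the north or east side) is an assumption, and the buildings from the mapping slides are reused as data.

```python
class QuadNode:
    def __init__(self, name, x, y):
        self.name, self.x, self.y = name, x, y
        self.children = {}  # quadrant label ("NW", "NE", "SW", "SE") -> QuadNode


def quadrant(node, x, y):
    """Quadrant of (x, y) relative to the point stored in node."""
    ns = "N" if y >= node.y else "S"
    ew = "E" if x >= node.x else "W"
    return ns + ew


def insert(root, name, x, y):
    """Insert a point and return the (possibly new) root."""
    if root is None:
        return QuadNode(name, x, y)
    node = root
    while True:
        q = quadrant(node, x, y)
        if q not in node.children:
            node.children[q] = QuadNode(name, x, y)
            return root
        node = node.children[q]


def point_query(root, x, y):
    """Name of the object located at (x, y), or None if there is none."""
    node = root
    while node is not None:
        if (node.x, node.y) == (x, y):
            return node.name
        node = node.children.get(quadrant(node, x, y))
    return None


POINTS = [
    ("A", 0.7, 1.2), ("B", 5.8, 1.2), ("C", 2.7, 2.3), ("D", 5.5, 2.4),
    ("E", 6.6, 2.5), ("F", 1.7, 3.8), ("G", 2.8, 4.7), ("H", 0.6, 5.8),
    ("I", 1.6, 6.7), ("J", 3.4, 6.6),
]
root = None
for name, x, y in POINTS:
    root = insert(root, name, x, y)

print(point_query(root, 3.4, 6.6), sorted(root.children))
```

With A inserted first, almost every other building falls into A's NE quadrant, so the tree degenerates on this data. This matches the "Inefficient for skewed data" comment in the summary table.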

  [Figure: point quad-tree over objects A to G, with children labelled NW, NE, SW, SE]

Compression Based Indexing

The dimensionality curse

The Vector Approximation File (VA-File)

VA File Skewed data
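The idea behind the VA-File, scanning small approximations first and reading full vectors only when necessary, can be sketched as follows. This is a simplified two-phase search, not the exact algorithm of the VA-File paper: it uses only lower bounds, and the grid range, `BITS`, and the reuse of the 2-D building data are assumptions made for illustration (the technique targets much higher dimensionality).

```python
import math

BITS = 2            # hypothetical: bits per dimension
LO, HI = 0.0, 8.0   # hypothetical data range in every dimension
CELLS = 1 << BITS
WIDTH = (HI - LO) / CELLS

POINTS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}


def approximate(p):
    """Cell index per dimension: the compact approximation of vector p."""
    return tuple(min(CELLS - 1, int((x - LO) / WIDTH)) for x in p)


def lower_bound(q, approx):
    """Smallest possible distance from q to any vector inside the cell."""
    total = 0.0
    for x, c in zip(q, approx):
        lo, hi = LO + c * WIDTH, LO + (c + 1) * WIDTH
        gap = max(lo - x, 0.0, x - hi)
        total += gap * gap
    return math.sqrt(total)


def knn(vectors, q, k):
    """Return (names of the k nearest vectors, number of full vectors read)."""
    # Phase 1: scan the approximations only and order them by lower bound.
    va = {name: approximate(p) for name, p in vectors.items()}
    order = sorted(vectors, key=lambda n: lower_bound(q, va[n]))
    # Phase 2: read full vectors in that order; stop when the next lower
    # bound already exceeds the k-th best exact distance found so far.
    best, accessed = [], 0
    for name in order:
        if len(best) == k and lower_bound(q, va[name]) > best[-1][0]:
            break
        accessed += 1
        best.append((math.dist(vectors[name], q), name))
        best.sort()
        best = best[:k]
    return [name for _, name in best], accessed


print(knn(POINTS, (4.0, 1.0), 3))
```

The approximation file is always scanned in full, but it is much smaller than the data; the saving comes from reading only a few of the full vectors in phase 2.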

Summary of the Indexing Techniques

  Index              Disk-based / In-memory  Balanced  Efficient query type   Dimensionality  Comments
  R-tree             Disk-based              Yes       Point, window, kNN     Low             Disadvantage is overlap
  K-d-tree           In-memory               No        Point, window, kNN(?)  Low             Inefficient for skewed data
  Quad-tree          In-memory               No        Point, window, kNN(?)  Low             Inefficient for skewed data
  Z-curve + B+-tree  Disk-based              Yes       Point, window          Low             Order of the Z-curve affects performance
  iDistance          Disk-based              Yes       Point, kNN             High            Not good for uniform data in very high-D
  VA-File            Disk-based              -         Point, window, kNN     High            Not good for skewed data

Index Implementations in major DBMS

  SQL Server
    B+-Tree data structure
    Clustered indexes are sparse
    Indexes maintained as updates/insertions/deletes are performed
  Oracle
    B+-tree, hash, bitmap, spatial extender for R-Tree
    Clustered index
      Index organized table (unique/clustered)
      Clusters used when creating tables
  DB2
    B+-Tree data structure, spatial extender for R-tree
    Clustered indexes are dense
    Explicit command for index reorganization

Recommended Readings and References

  Survey on multidimensional indexing techniques
    Christian Böhm, Stefan Berchtold, Daniel A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 2001.
    Volker Gaede, Oliver Günther. Multidimensional Access Methods. ACM Computing Surveys, 1998.
  Mapping based indexing
    Rui Zhang, Panos Kalnis, Beng Chin Ooi, Kian-Lee Tan. Generalized Multi-dimensional Data Mapping and Query Processing. ACM Transactions on Database Systems (TODS), 30(3), 2005.
  Space-filling curves
    H. V. Jagadish. Linear Clustering of Objects with Multiple Attributes. ACM SIGMOD Conference (SIGMOD), 1990.
  iDistance
    H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, Rui Zhang. iDistance: An Adaptive B+-tree Based Indexing Method for Nearest Neighbor Search. ACM Transactions on Database Systems (TODS), 30(2), 2005.
  R-tree
    Antonin Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. ACM SIGMOD Conference (SIGMOD), 1984.
  Quad-tree
    Hanan Samet. The Quadtree and Related Hierarchical Data Structures. ACM Computing Surveys, 1984.
  VA-File
    Roger Weber, Hans-Jörg Schek, Stephen Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. International Conference on Very Large Data Bases (VLDB), 1998.
