Multidimensional Indexing: Spatial Data Management & High Dimensional Indexing
Indexing Multidimensional Data
Rui Zhang
http://www.csse.unimelb.edu.au/~rui
The University of Melbourne
Aug 2006
Outline
  Background
    Multidimensional data and queries
  Approaches
    Mapping based indexing
      Z-curve
      iDistance
    Hierarchical-tree based indexing
      R-tree
      k-d-tree
      Quad-tree
    Compression based indexing
      VA-file
Multidimensional Data
  Spatial data (low dimensionality)
    Geographic information: Melbourne (37, 145)
      Which city is at (30, 140)?
    Computer-aided design: width and height (40, 50)
      Is there any part with a width of 40 and a height of 50?
  Records with multiple attributes (medium dimensionality)
    Employee (ID, age, score, salary, …)
      Is there any employee whose age is under 25, whose performance score is greater than 80, and whose salary is between 3000 and 5000?
  Multimedia data (high dimensionality)
    Features: color, shape, texture
    Color histograms of images
      Give me the image most similar to a given query image.
Multidimensional Queries
  Point query
    Return the objects located at Q(x1, x2, …, xd).
    E.g. Q = (3.4, 6.6).
  Window query
    Return all the objects enclosed or intersected by the hyper-rectangle W = {[L1, U1], [L2, U2], …, [Ld, Ud]}.
    E.g. W = {[0, 4], [2, 5]}.
  K-nearest-neighbor query (KNN query)
    Return k objects whose distances to Q are no larger than any other object's distance to Q.
    E.g. 3NN of Q = (4, 1).
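As a baseline, the three query types can be written as linear scans with no index at all. The sketch below is only an illustration: the function names are made up for this example, the window test treats both bounds as inclusive (an assumption), and the ten points are the buildings A to J from the example table used later in these slides.

```python
import math

# Buildings (name -> (x, y)) taken from the example table used later in the slides.
POINTS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}


def point_query(points, q):
    """Names of the objects located exactly at q."""
    return [name for name, p in points.items() if p == q]


def window_query(points, window):
    """Names of the points inside the hyper-rectangle [(L1, U1), ..., (Ld, Ud)]."""
    return [name for name, p in points.items()
            if all(lo <= v <= hi for v, (lo, hi) in zip(p, window))]


def knn_query(points, q, k):
    """The k names closest to q by Euclidean distance (full scan, then sort)."""
    return sorted(points, key=lambda name: math.dist(points[name], q))[:k]


print(point_query(POINTS, (3.4, 6.6)))          # the point query example
print(window_query(POINTS, [(0, 4), (2, 5)]))   # the window query example
print(knn_query(POINTS, (4, 1), 3))             # the 3NN example
```

Every query here touches all ten points; the indexing techniques in the rest of the slides aim to avoid exactly that.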
Mapping Based Multidimensional Indexing
  Story
    The CBD: [0,4][2,5]
    Blocks in the CBD are: [8,15], [32,33] and [36,37]
  General strategy: three steps
    Data mapping and indexing
    Query mapping and data retrieval
    Filtering out false positives

  Buildings in their original order:

  Name  x    y    Block  Height
  A     0.7  1.2   2     100
  B     5.8  1.2  19      50
  C     2.7  2.3  12      80
  D     5.5  2.4  25      90
  E     6.6  2.5  28      40
  F     1.7  3.8  11     120
  G     2.8  4.7  36     100
  H     0.6  5.8  34      50
  I     1.6  6.7  41      60
  J     3.4  6.6  45      40

  Sorted by block number:

  Name  x    y    Block  Height
  A     0.7  1.2   2     100
  F     1.7  3.8  11     120
  C     2.7  2.3  12      80
  B     5.8  1.2  19      50
  D     5.5  2.4  25      90
  E     6.6  2.5  28      40
  H     0.6  5.8  34      50
  G     2.8  4.7  36     100
  I     1.6  6.7  41      60
  J     3.4  6.6  45      40
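The three steps can be sketched on the building table above. This is a minimal illustration, not the author's implementation: a sorted Python list with `bisect` stands in for the B+-tree, the block ranges of the query are passed in by hand (taken from the CBD story), and the function name `window_query` is made up for this example.

```python
from bisect import bisect_left, bisect_right

# Step 1 (data mapping and indexing): rows keyed by block number, kept sorted.
# A sorted list stands in for the B+-tree here (an assumption of this sketch).
ROWS = sorted([
    (2, "A", 0.7, 1.2), (19, "B", 5.8, 1.2), (12, "C", 2.7, 2.3),
    (25, "D", 5.5, 2.4), (28, "E", 6.6, 2.5), (11, "F", 1.7, 3.8),
    (36, "G", 2.8, 4.7), (34, "H", 0.6, 5.8), (41, "I", 1.6, 6.7),
    (45, "J", 3.4, 6.6),
])
KEYS = [row[0] for row in ROWS]


def window_query(window, block_ranges):
    """Return building names inside window, scanning only the given block ranges."""
    (x_lo, x_hi), (y_lo, y_hi) = window
    result = []
    for lo, hi in block_ranges:
        # Step 2 (query mapping and data retrieval): one range scan per interval.
        for _, name, x, y in ROWS[bisect_left(KEYS, lo):bisect_right(KEYS, hi)]:
            # Step 3 (filtering): keep only rows whose true coordinates qualify.
            if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
                result.append(name)
    return result


CBD_BLOCKS = [(8, 15), (32, 33), (36, 37)]
print(window_query([(0, 4), (2, 5)], CBD_BLOCKS))
# A window that only partly covers those blocks shows the filtering step at work:
print(window_query([(0, 4), (2.5, 5)], CBD_BLOCKS))
```

In the second call, building C lies in a retrieved block but outside the window, so it is a false positive that step 3 removes.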
The Z-curve and Other Space-Filling Curves
  The Z-curve
    Z-value calculation: bit-interleaving
    Supports efficient window queries
    Disadvantage: jumps
  Other space-filling curves
    Hilbert curves
    Gray-code
    Column-wise scan
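Bit-interleaving can be sketched in a few lines. Two assumptions are made here because they reproduce the block numbers in the building table, not because the slides state them: each building falls in the unit cell given by the integer part of its coordinates on an 8 x 8 grid, and in every bit pair the y bit is the more significant one. The helper `window_to_z_ranges` is hypothetical and simply enumerates cells, which is fine for a small grid but not how a production index would decompose a window.

```python
def z_value(x, y, bits=3):
    """Interleave the bits of integer cell coordinates, y bit first in each pair."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((y >> i) & 1)
        z = (z << 1) | ((x >> i) & 1)
    return z


def window_to_z_ranges(x_cells, y_cells, bits=3):
    """Z-values of all cells in a cell-aligned window, merged into consecutive runs."""
    zs = sorted(z_value(x, y, bits) for x in x_cells for y in y_cells)
    ranges = [[zs[0], zs[0]]]
    for z in zs[1:]:
        if z == ranges[-1][1] + 1:
            ranges[-1][1] = z          # extend the current run
        else:
            ranges.append([z, z])      # a jump in the curve starts a new run
    return [tuple(r) for r in ranges]


# Building B at (5.8, 1.2) lies in cell (5, 1).
print(z_value(5, 1))
# The CBD window [0,4][2,5] covers cells x = 0..3, y = 2..4.
print(window_to_z_ranges(range(0, 4), range(2, 5)))
```

The window maps to several separate intervals because the curve jumps between quadrants; each extra interval costs one more range scan.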
Mapping for KNN Queries
  Story continued
    New factory at Q = (4, 1)
    Find the 3 nearest buildings to Q
  Termination condition
    K candidates found
    All of them lie within the current search circle

  Buildings with their street numbers:

  Name  x    y    Street  Height
  A     0.7  1.2  14      100
  B     5.8  1.2  32       50
  C     2.7  2.3  12       80
  D     5.5  2.4  31       90
  E     6.6  2.5  32       40
  F     1.7  3.8  13      120
  G     2.8  4.7  24      100
  H     0.6  5.8  23       50
  I     1.6  6.7  22       60
  J     3.4  6.6  24       40

  Sorted by street number:

  Name  x    y    Street  Height
  C     2.7  2.3  12       80
  F     1.7  3.8  13      120
  A     0.7  1.2  14      100
  I     1.6  6.7  22       60
  H     0.6  5.8  23       50
  G     2.8  4.7  24      100
  J     3.4  6.6  24       40
  D     5.5  2.4  31       90
  B     5.8  1.2  32       50
  E     6.6  2.5  32       40

  Streets: 11, 12, 13, 14; 21, 22, 23, 24; 31, 32
  Search radius: R = 0.35, 0.70, 1.05, 1.40, 1.75, 2.10
  Distances to Q: ||AQ|| = 3.31, ||FQ|| = 3.62, ||BQ|| = 1.81, ||EQ|| = 3.00, ||CQ|| = 1.84, ||DQ|| = 2.05

  Candidate lists as the search circle grows:

  Rank           1     2     3
  Candidate      A
  Distance to Q  3.31

  Rank           1     2     3
  Candidate      A     F
  Distance to Q  3.31  3.62

  Rank           1     2     3
  Candidate      B     A     F
  Distance to Q  1.81  3.31  3.62

  Rank           1     2     3
  Candidate      B     E     A
  Distance to Q  1.81  3.00  3.31

  Rank           1     2     3
  Candidate      B     C     E
  Distance to Q  1.81  1.84  3.00

  Rank           1     2     3
  Candidate      B     C     D
  Distance to Q  1.81  1.84  2.05
The iDistance
  Data partitioned into a number of clusters
    Streets are concentric circles
  Data mapping
    Objects mapped to street numbers
  Query mapping
    Search circle mapped to the streets it intersects
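The mapping and the expanding search can be sketched as follows. The key formula `i * C + dist(p, O_i)` is the one from the iDistance paper listed in the references; everything else is an assumption of this sketch: the three reference points `REFS` are hand-picked so that the buildings fall into the same three groups as the street numbers, `C` is an arbitrary large constant, a sorted list stands in for the B+-tree, and each round rescans from scratch for brevity where a real implementation would continue the scan outward.

```python
import math
from bisect import bisect_left, bisect_right

POINTS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}
REFS = [(1.7, 2.4), (2.1, 6.0), (6.0, 2.0)]  # hypothetical reference points, one per cluster
C = 100.0                                    # assumed constant, larger than any distance here


def build_index(points):
    """Sorted (key, name) list; key = i * C + distance to the nearest reference point O_i."""
    entries = []
    for name, p in points.items():
        i = min(range(len(REFS)), key=lambda j: math.dist(p, REFS[j]))
        entries.append((i * C + math.dist(p, REFS[i]), name))
    return sorted(entries)


def knn(entries, points, q, k, step=0.35):
    """Grow the search radius until k candidates all lie inside the search circle."""
    keys = [key for key, _ in entries]
    r, best = 0.0, []
    while r < C / 2:                         # guard: stop even if fewer than k points exist
        r += step
        found = []
        for i, ref in enumerate(REFS):
            d = math.dist(q, ref)
            # Query mapping: the ring of cluster i that the search circle can intersect.
            lo, hi = i * C + max(0.0, d - r), i * C + d + r
            found += [name for _, name in entries[bisect_left(keys, lo):bisect_right(keys, hi)]]
        best = sorted(found, key=lambda n: math.dist(points[n], q))[:k]
        if len(best) == k and math.dist(points[best[-1]], q) <= r:
            break                            # termination condition from the slides
    return best, r


INDEX = build_index(POINTS)
nearest, radius = knn(INDEX, POINTS, (4, 1), 3)
print(nearest, round(radius, 2))
```

By the triangle inequality, every point within distance r of Q has a key inside one of the scanned rings, so once the k-th candidate is within r no closer point can have been missed.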
Hierarchical Tree Structures
  R-tree
    Minimum bounding rectangle (MBR)
    Incomplete and overlapping partitioning
    Disk-based; Balanced
    [Figure: R-tree example over points A to G]
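The role of the MBR in a window query can be shown with a tiny two-level structure. This is only a sketch: the grouping of points into leaves N1 to N3 is done by hand and reuses the building coordinates from the mapping example, whereas a real R-tree builds and rebalances its nodes through insertion and node splitting, which is omitted here. All function names are made up for this example.

```python
def mbr(points):
    """Minimum bounding rectangle of 2-D points as (xmin, ymin, xmax, ymax)."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a, b):
    """True if rectangles a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hand-made leaf nodes (an assumption of this sketch, not the result of R-tree insertion).
LEAVES = {
    "N1": {"A": (0.7, 1.2), "C": (2.7, 2.3), "F": (1.7, 3.8)},
    "N2": {"G": (2.8, 4.7), "H": (0.6, 5.8), "I": (1.6, 6.7), "J": (3.4, 6.6)},
    "N3": {"B": (5.8, 1.2), "D": (5.5, 2.4), "E": (6.6, 2.5)},
}
# The root stores one MBR per child node.
ROOT = {node: mbr(list(entries.values())) for node, entries in LEAVES.items()}


def window_query(window):
    """window = (xmin, ymin, xmax, ymax). Returns (matching names, leaves visited)."""
    hits, visited = [], []
    for node, box in ROOT.items():
        if not intersects(box, window):
            continue                      # the whole subtree is pruned by its MBR
        visited.append(node)
        for name, (x, y) in LEAVES[node].items():
            if window[0] <= x <= window[2] and window[1] <= y <= window[3]:
                hits.append(name)
    return hits, visited


print(window_query((0, 2, 4, 5)))
```

Leaf N3 is never read because its MBR does not touch the window. When MBRs overlap, a query may have to descend into several nodes, which is the overlap problem noted for the R-tree.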
  K-d-tree
    Space divided recursively
    Complete and disjoint partitioning
    In-memory; Unbalanced
    There are algorithms to page and balance the tree, but with more complex manipulations
    [Figure: tree structures over points A to G, with nodes N1 to N5 and split values 0.5 and 0.3]
    Problem: Overlap
    Problem: Empty space
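A k-d-tree built by plain insertion illustrates both the recursive division and why the tree is unbalanced. The sketch below makes a few assumptions: the split dimension alternates with depth, ties go to the right child, nodes are plain dictionaries, and the points are the buildings from the mapping example inserted in alphabetical order.

```python
POINTS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}


def insert(node, name, p, depth=0):
    """Insert p, splitting on x at even depths and on y at odd depths."""
    if node is None:
        return {"name": name, "p": p, "left": None, "right": None}
    axis = depth % len(p)
    side = "left" if p[axis] < node["p"][axis] else "right"
    node[side] = insert(node[side], name, p, depth + 1)
    return node


def window_query(node, window, depth=0):
    """window = [(x_lo, x_hi), (y_lo, y_hi)]; visits only subtrees the window can reach."""
    if node is None:
        return []
    p = node["p"]
    axis = depth % len(p)
    lo, hi = window[axis]
    hits = [node["name"]] if all(l <= v <= h for v, (l, h) in zip(p, window)) else []
    if lo < p[axis]:                      # left subtree holds values < p[axis]
        hits += window_query(node["left"], window, depth + 1)
    if hi >= p[axis]:                     # right subtree holds values >= p[axis]
        hits += window_query(node["right"], window, depth + 1)
    return hits


def height(node):
    return 0 if node is None else 1 + max(height(node["left"]), height(node["right"]))


root = None
for name, p in POINTS.items():
    root = insert(root, name, p)

print(sorted(window_query(root, [(0, 4), (2, 5)])))
print(height(root))   # a perfectly balanced tree over 10 points would have height 4
```

Because the shape depends on insertion order, the tree here ends up noticeably taller than a balanced one, which matches the "Unbalanced" note above.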
Hierarchical Tree Structures (continued)
  Quad-tree
    Space divided into 4 rectangles recursively
    Complete and disjoint partitioning
    In-memory; Unbalanced
    There are algorithms to page and balance the tree, but with more complex manipulations
  The point quad-tree
    [Figure: point quad-tree over points A to G, with quadrants NW, NE, SW, SE]
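In a point quad-tree each stored point splits its region into the four quadrants NW, NE, SW and SE. The sketch below is illustrative only: the tie rule (points on a dividing line go north or east) is an assumption, nodes are plain dictionaries, and the points are the buildings from the mapping example.

```python
POINTS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}


def quadrant(center, p):
    """Quadrant of p relative to center; ties go north / east (assumed convention)."""
    north_south = "N" if p[1] >= center[1] else "S"
    east_west = "E" if p[0] >= center[0] else "W"
    return north_south + east_west


def insert(node, name, p):
    """Insert p below node, descending into the quadrant that contains it."""
    if node is None:
        return {"name": name, "p": p, "children": {}}
    q = quadrant(node["p"], p)
    node["children"][q] = insert(node["children"].get(q), name, p)
    return node


def point_query(node, p):
    """Name of the object stored at p, or None if there is none."""
    while node is not None:
        if node["p"] == p:
            return node["name"]
        node = node["children"].get(quadrant(node["p"], p))
    return None


root = None
for name, p in POINTS.items():
    root = insert(root, name, p)

print(point_query(root, (3.4, 6.6)))
print(sorted(root["children"]))   # which quadrants of the root are occupied
```

With A inserted first, almost every other building falls into A's NE quadrant, so the tree is lopsided, in line with the "Unbalanced" note above.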
Compression Based Indexing
  The dimensionality curse
  The Vector Approximation File (VA-File)
  VA-File: skewed data
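The idea behind the VA-File can be sketched as a two-phase search: scan compact approximations first, then fetch full vectors only where the approximation cannot rule a point out. Everything specific below is an assumption for illustration: a uniform grid with 2 bits per dimension over the domain [0, 8), the building coordinates from the mapping example as data, and a simple visiting order by lower-bound distance. The names `approximate`, `lower_bound` and `knn` are made up for this sketch.

```python
import math

POINTS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}
LO, HI = 0.0, 8.0   # assumed data domain in every dimension


def approximate(p, bits):
    """Replace each coordinate by the index of its grid cell (bits bits per dimension)."""
    cells = 1 << bits
    width = (HI - LO) / cells
    return tuple(min(cells - 1, int((v - LO) / width)) for v in p)


def lower_bound(q, approx, bits):
    """Smallest possible distance from q to any point inside the approximated cell."""
    width = (HI - LO) / (1 << bits)
    total = 0.0
    for qv, cell in zip(q, approx):
        c_lo, c_hi = LO + cell * width, LO + (cell + 1) * width
        gap = max(c_lo - qv, 0.0, qv - c_hi)
        total += gap * gap
    return math.sqrt(total)


def knn(points, q, k, bits=2):
    """Return (k nearest names, number of full vectors that had to be fetched)."""
    va = {name: approximate(p, bits) for name, p in points.items()}  # the approximation file
    order = sorted(va, key=lambda n: lower_bound(q, va[n], bits))
    best, fetched = [], 0                 # best holds (exact distance, name) pairs
    for name in order:
        if len(best) == k and lower_bound(q, va[name], bits) > best[-1][0]:
            break                         # no remaining cell can beat the k-th candidate
        fetched += 1
        best = sorted(best + [(math.dist(points[name], q), name)])[:k]
    return [name for _, name in best], fetched


nearest, fetched = knn(POINTS, (4, 1), 3)
print(nearest, fetched, "of", len(POINTS))
```

With skewed data many points share the same few cells, so the lower bounds filter out less and more full vectors must be fetched.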
Summary of the Indexing Techniques

Index              Disk-based / In-memory  Balanced  Efficient query type   Dimensionality  Comments
R-tree             Disk-based              Yes       Point, window, kNN     Low             Disadvantage is overlap
K-d-tree           In-memory               No        Point, window, kNN(?)  Low             Inefficient for skewed data
Quad-tree          In-memory               No        Point, window, kNN(?)  Low             Inefficient for skewed data
Z-curve + B+-tree  Disk-based              Yes       Point, window          Low             Order of the Z-curve affects performance
iDistance          Disk-based              Yes       Point, kNN             High            Not good for uniform data in very high-D
VA-File            Disk-based              -         Point, window, kNN     High            Not good for skewed data
Index Implementations in major DBMS
  SQL Server
    B+-Tree data structure
    Clustered indexes are sparse
    Indexes maintained as updates/insertions/deletes are performed
  Oracle
    B+-tree, hash, bitmap, spatial extender for R-Tree
    Clustered index
      Index organized table (unique/clustered)
      Clusters used when creating tables
  DB2
    B+-Tree data structure, spatial extender for R-tree
    Clustered indexes are dense
    Explicit command for index reorganization
Recommended Readings and References
  Survey on multidimensional indexing techniques
    Christian Böhm, Stefan Berchtold, Daniel A. Keim. Searching in High-Dimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases. ACM Computing Surveys, 2001.
    Volker Gaede, Oliver Günther. Multidimensional Access Methods. ACM Computing Surveys, 1998.
  Mapping based indexing
    Rui Zhang, Panos Kalnis, Beng Chin Ooi, Kian-Lee Tan. Generalized Multi-dimensional Data Mapping and Query Processing. ACM Transactions on Database Systems (TODS), 30(3), 2005.
  Space-filling curves
    H. V. Jagadish. Linear Clustering of Objects with Multiple Attributes. ACM SIGMOD Conference (SIGMOD), 1990.
  iDistance
    H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, Rui Zhang. iDistance: An Adaptive B+-tree Based Indexing Method for Nearest Neighbor Search. ACM Transactions on Database Systems (TODS), 30(2), 2005.
  R-tree
    Antonin Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. ACM SIGMOD Conference (SIGMOD), 1984.
  Quad-tree
    Hanan Samet. The Quadtree and Related Hierarchical Data Structures. ACM Computing Surveys, 1984.
  VA-File
    Roger Weber, Hans-Jörg Schek, Stephen Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. International Conference on Very Large Data Bases (VLDB), 1998.