A Spatial Index Structure for High Dimensional Point Data
Wei Wang, Jiong Yang, and Richard Muntz
Data Mining Lab
Department of Computer Science
University of California, Los Angeles
Outline
• Introduction
• Structure of PK-tree
• Operations on PK-tree
• Performance
• Conclusions
Introduction
• Dynamic spatial index methods have been an active research area.
  – Index structures based on spatial decomposition
    • PR-Quad tree, K-D tree, K-D-B tree, ...
    • No overlap among sibling nodes
    • Achieving high disk page utilization at high dimensionality with skewed data distributions remains a challenge.
  – R-tree family of index structures
    • R*-tree, SR-tree, X-tree, ...
    • Overlap among sibling nodes grows with dimensionality and degrades performance severely.
Introduction
• PK-tree
  – Spatial decomposition
    • No overlap among sibling nodes
  – Bound on height
  – Bounds on number of children
  – Uniqueness for any data set
    • Independent of the order of insertion and deletion
  – Solid theoretical foundation
  – Fast retrieval and updates
Structure of PK-tree
• The space is divided recursively and rectilinearly.
[Figure: a two-dimensional space (axes dim 1 and dim 2) in which a cell at the ith level is divided into cells at the (i+1)th level]
Set notation is used to express relationships among cells and points (for example, a point belonging to a cell, or one cell being a sub-cell of another).
Structure of PK-tree
• Space is recursively divided until a level L_D is reached at which each cell contains at most one point.
[Figure: the same point set shown divided at Level 0, Level 1, Level 2, and Level 3]
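The stopping level can be found by refining the grid until no cell holds two points. The following is a minimal sketch, not the authors' code. It assumes 2-D points with integer coordinates on a 2**max_level by 2**max_level grid and a split of each dimension in half at every level; the names `cell_of` and `point_level` are hypothetical.

```python
def cell_of(point, level, max_level):
    """Cell containing `point` at `level`, as a (level, column, row) triple."""
    x, y = point
    shift = max_level - level  # each level halves the cell side
    return (level, x >> shift, y >> shift)


def point_level(points, max_level):
    """Smallest level at which every cell holds at most one distinct point."""
    distinct = set(points)
    for level in range(max_level + 1):
        cells = {cell_of(p, level, max_level) for p in distinct}
        if len(cells) == len(distinct):  # no two points share a cell
            return level
    raise ValueError("points must lie on the 2**max_level grid")
```

Two far-apart points separate after one split, while two adjacent points need the finest level of the grid.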
Structure of PK-tree
• Point cell: a non-empty cell at level L_D
• A cell C is K-instantiable iff
  – C is a point cell, or
  – there do not exist (K-1) or fewer K-instantiable sub-cells $C_1, \ldots, C_{K-1} \subset C$ such that $\forall d \in D \; (d \in C \Rightarrow d \in \bigcup_{i=1}^{K-1} C_i)$.
[Figure: K-instantiable cells for K = 3, shown at Level 3 (L_D), Level 2, and Level 1]
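The K-instantiable test can be evaluated bottom-up. Because cells are either nested or disjoint, the smallest set of K-instantiable sub-cells covering the points of C consists of the largest K-instantiable cells strictly inside C, so C passes the test exactly when there are at least K of them. That reading of the definition, the 2-D integer grid, the halving of each dimension per level, and the name `k_instantiable_cells` are assumptions of this sketch rather than details from the slides.

```python
from collections import defaultdict


def k_instantiable_cells(points, depth, k):
    """Return the K-instantiable cells as (level, column, row) triples.

    Assumes 2-D integer points on a 2**depth grid, with level `depth`
    playing the role of L_D (each cell there holds at most one point).
    """
    # cover[c] = largest K-instantiable cells inside c (c itself if it qualifies)
    cover = {(depth, x, y): [(depth, x, y)] for x, y in sorted(set(points))}
    found = set(cover)  # every point cell is K-instantiable
    for level in range(depth - 1, -1, -1):
        merged = defaultdict(list)
        for (_, col, row), cells in cover.items():
            merged[(level, col >> 1, row >> 1)].extend(cells)
        cover = {}
        for cell, cells in merged.items():
            if len(cells) >= k:  # no cover by K-1 or fewer sub-cells exists
                found.add(cell)
                cover[cell] = [cell]
            else:  # K-1 or fewer sub-cells already cover every point
                cover[cell] = cells
    return found
```

The result depends only on the point set, not on the order in which the points are listed, which matches the uniqueness property claimed for the PK-tree.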
Structure of PK-tree
[Figure: Example of a PK-tree of rank 3. An 8 x 8 grid (columns a to h, rows 1 to 8) is shown at Level 3 (L_D), Level 2, and Level 1, with cells labeled B, D, K, M, N, U, and R. In the tree, the root sits above these nodes, and the leaves are the point cells a2, c2, d1, b4, d2, c3, e1, e2, f3, h1, g1, g2, h2, g3, a7, g5, f6, f5, e5, d8, c8, d7, b8, b7.]
Structure of PK-tree
• Given a finite set of points D over an index space C0 and a dividing ratio R, a PK-tree of rank K (K > 1) is defined as follows.
  – The cell at level 0 (C0) is always instantiated and serves as the root of the PK-tree.
  – Every other node in the PK-tree is mapped one-to-one to a K-instantiable cell.
  – For any two nodes C1 and C2 in the PK-tree, C1 is a child of C2 (or C2 is the parent of C1) iff
    • C1 is a proper sub-cell of C2, i.e., $C_1 \subset C_2$, and
    • there does not exist C3 in the PK-tree such that $C_1 \subset C_3$ and $C_3 \subset C_2$.
• Properties: existence and uniqueness, bounds on node outdegree, bounded storage space, bounds on expected height, no overlap among sibling nodes, and so on.
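The parent/child rule in the definition says that a node's parent is its nearest instantiated enclosing cell. A minimal sketch of that linking step follows, assuming cells are written as (level, column, row) triples on a 2-D grid whose dimensions are halved at each level; `ancestor` and `link_nodes` are hypothetical names, and the set of instantiated cells is taken as given.

```python
def ancestor(cell, level):
    """The cell at a coarser `level` that encloses `cell`."""
    lvl, col, row = cell
    shift = lvl - level
    return (level, col >> shift, row >> shift)


def link_nodes(instantiated):
    """Map every node to the list of its children."""
    root = (0, 0, 0)
    nodes = set(instantiated) | {root}  # the root is always instantiated
    children = {node: [] for node in nodes}
    for node in sorted(nodes):
        if node == root:
            continue
        # Walk upward: the first instantiated enclosing cell is the parent,
        # so no other node lies strictly between the two.
        for level in range(node[0] - 1, -1, -1):
            parent = ancestor(node, level)
            if parent in nodes:
                children[parent].append(node)
                break
    return children
```

With point cells at level 2 and a single instantiated level-1 cell, the isolated point cell (2, 3, 3) attaches directly to the root because its level-1 enclosing cell is not a node.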
Properties of PK-tree
[Figure: Expected height of a PK-tree. The longest path, of length H, runs from the root (N points) down to a leaf; each node along it is labeled "at least K-1", and for consecutive cells C_i and C_{i+1} on the path, $P(d \in C_{i+1} \mid d \in C_i) < 1$.]
Properties of PK-tree
• M-Level Clustering Spatial Distribution
  – 0-level: uniform distribution over C0: $P(d \in C_{i+1} \mid d \in C_i) = 1/r$
  – 1-level: Let $A \subset C_0$ be some subset of C0 and $A^c = C_0 - A$. The distributions of the points in A and in $A^c$ are each 0-level clustering spatial distributions.
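[Figure: a region A and its complement A^c within C0, with a cell C_i and one of its sub-cells C_{i+1}]
[Equation: an expression for $P(d \in C_{i+1} \mid d \in C_i)$ involving r, D_A, and D_{A^c}]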
Operations on PK-tree
• Pagination of the PK-tree
  – Pick the parameter K and the number of dimensions to split at each level such that the maximum node size is close to the page size.
  – Allocate one node to a page.
  – Space utilization can be guaranteed to be at least 50% and is much higher than 50% in experiments.
• Insertion
  – First, follow the path from the root to locate all (potential) ancestors of the inserted leaf cell.
  – Then walk from the leaf level back to the root along the same path, making all necessary changes (e.g., instantiating or de-instantiating cells).
• Search
  – K Nearest Neighbor Query
  – Range Query
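The slides name the range query without spelling it out. One plausible form, given that sibling cells never overlap, is a top-down traversal that discards every cell lying outside the query box. The sketch below assumes a tree stored as a dict from (level, column, row) cells to child lists on a 2-D integer grid with halving splits; `cell_bounds` and `range_query` are hypothetical names, not the authors' procedure.

```python
def cell_bounds(cell, depth):
    """Half-open extent [x0, x1) x [y0, y1) of a (level, column, row) cell."""
    level, col, row = cell
    side = 1 << (depth - level)
    return col * side, row * side, (col + 1) * side, (row + 1) * side


def range_query(children, depth, query, node=(0, 0, 0)):
    """Return points (x, y) with qx0 <= x <= qx1 and qy0 <= y <= qy1."""
    qx0, qy0, qx1, qy1 = query
    x0, y0, x1, y1 = cell_bounds(node, depth)
    if x0 > qx1 or x1 <= qx0 or y0 > qy1 or y1 <= qy0:
        return []  # the cell misses the query box: prune the whole subtree
    if node[0] == depth:
        return [(node[1], node[2])]  # point cell: its index is the point
    hits = []
    for child in children.get(node, []):
        hits.extend(range_query(children, depth, query, child))
    return hits


# A small hand-built tree on a 4 x 4 grid (depth 2).
tree = {
    (0, 0, 0): [(1, 0, 0), (2, 3, 3)],
    (1, 0, 0): [(2, 0, 0), (2, 0, 1), (2, 1, 0)],
}
```

Calling `range_query(tree, 2, (0, 0, 1, 0))` visits the level-1 cell and returns the two points on the bottom row, while the isolated point (3, 3) is pruned at the root's child list without any further descent.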
Performance
• Setup: Sparc 10 workstation (SunOS 5.5) with 208 MB main memory and a local disk with 9 GB capacity
• Synthetic data sets (each contains 100,000 points)
  – u: uniform distribution
  – c1, c2: 20% of the data are uniformly distributed and 80% are distributed in disjoint clusters
• Height of generated trees

Dimension       2    4    8    16   32   64
PK-tree (u)     4    4    5    6    7    9
PK-tree (c1)    5    7    7    6    7    8
PK-tree (c2)    7    7    6    7    8    9
X-tree          4    4    4    4    5    6
SR-tree         4    4    5    5    6    7
Performance
• Size of index in MB with 100,000 points

            Dimension 2       Dimension 4       Dimension 8       Dimension 16
            u     c1    c2    u     c1    c2    u     c1    c2    u     c1    c2
PK-tree     1.8   1.9   1.9   2.8   2.8   2.8   4.9   4.8   4.9   9.4   9.3   9.4
X-tree      1.8   1.8   1.8   3.0   3.0   3.0   5.6   5.5   5.6   10.7  10.4  10.6
SR-tree     69    70    70    74    73    74    74    74    75    90    91    92
Performance
• Range query on uniform data distribution
Performance
• Range query on clustered data distribution
Performance
• KNN query on uniform data distribution
Performance
• KNN query on clustered data distribution
Performance
• Real data set: NASA Sky Telescope Data
  – 200,000 two-dimensional points (the coordinates of crater locations on the surface of Mars)

           Height  Size     KNN CPU  KNN I/O  Range CPU  Range I/O
PK-tree    5       3.7 MB   4 ms     4        3 ms       4
X-tree     4       5.7 MB   90 ms    4        10 ms      4
SR-tree    5       120 MB   28 ms    8        14 ms      6
Conclusions
• PK-tree: employs spatial decomposition to ensure no overlap among sibling nodes, while avoiding the large number of nodes that usually results from a skewed spatial distribution of objects.
  – The total number of nodes in a PK-tree is O(N) and the expected height of a PK-tree is O(log N) under some general conditions.
• Other properties: uniqueness, bounds on number of children.
• Empirical studies show that the PK-tree outperforms the SR-tree and X-tree by a wide margin.