Multidimensional Indexing: Spatial Data Management & High Dimensional Indexing
Transcript of Multidimensional Indexing: Spatial Data Management & High Dimensional Indexing
Types of Spatial Data
• Point Data
  o Points in a multidimensional space
  o E.g., raster data such as satellite imagery, where each pixel stores a measured value
  o E.g., feature vectors extracted from text
• Region Data
  o Objects have spatial extent with location and boundary
  o DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data
Spatial Indexing
• Point Access Methods (PAMs) vs Spatial Access Methods (SAMs)
  o PAM: index only point data. Hierarchical (tree-based) structures, multidimensional hashing, space-filling curves
  o SAM: index both points and regions. Transformations, overlapping regions, clipping methods (non-overlapping)
• Data partitioning vs space partitioning
Types of Spatial Queries
• Spatial Range Queries
  o Find all cities within 50 miles of Troy
  o Query has associated region (location, boundary)
  o Answer includes overlapping or contained data regions
• Nearest-Neighbor Queries
  o Find the 10 cities nearest to Troy
  o Results must be ordered by proximity
• Spatial Join Queries
  o Find all cities near a lake
  o Expensive, join condition involves regions and proximity
Applications of Spatial Data
• Geographic Information Systems (GIS)
  o E.g., ESRI’s ArcInfo; OpenGIS Consortium
  o Geospatial information
  o All classes of spatial queries and data are common
• Computer-Aided Design/Manufacturing
  o Store spatial objects such as surface of airplane fuselage
  o Range queries and spatial join queries are common
• Multimedia Databases
  o Images, video, text, etc. stored and retrieved by content
  o First converted to feature vector form; high dimensionality
  o Nearest-neighbor queries are the most common
• Requirements
  o Fast range/window query search (range query)
  o Fast similarity search: similarity range query and K-nearest neighbour query (KNN query)
High Dimensional Indexing
[Figure: complex objects (example label: Nasdaq) are turned into feature vectors by feature extraction and transformation; index construction then produces an index for range/similarity search, which serves similarity queries]
Feature Base Similarity Search
• Similarity search based on the color composition of a sample image
Retrieval by Colour
• Given a sample image
Query Requirement
• Window/range query: retrieve the data points that fall within a given range along each dimension.
• Designed to support range retrieval and to facilitate joins and similarity search (if applicable).
Query Requirement
• Similarity queries: similarity range queries and KNN queries
• Similarity range query: given a query point, find all data points within a given distance r of the query point.
• KNN query: given a query point, find the K nearest neighbours of the point by distance.
[Figure: a query point with search radius r, and the distance to its Kth nearest neighbour (Kth NN)]
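The three query types can be pinned down with a brute-force sketch that scans every point. This is only a reference definition of the expected answers, not an index; the function names, the sample points, and the choice of Euclidean distance are assumptions made for illustration.

```python
import heapq
import math

def window_query(points, low, high):
    """Points that fall inside [low[i], high[i]] along every dimension i."""
    return [p for p in points
            if all(l <= x <= h for x, l, h in zip(p, low, high))]

def similarity_range_query(points, q, r):
    """Points whose Euclidean distance to the query point q is at most r."""
    return [p for p in points if math.dist(p, q) <= r]

def knn_query(points, q, k):
    """The k points nearest to q, ordered by increasing distance."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

# Hypothetical sample data.
points = [(0, 0), (1, 1), (2, 2), (5, 5), (6, 1)]
print(window_query(points, (0, 0), (2, 2)))
print(similarity_range_query(points, (0.5, 0), 1.5))
print(knn_query(points, (0.5, 0), 2))
```

A multidimensional index aims to return the same answers while reading far fewer points than this full scan.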
Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.
• When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal.
• Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>
[Figure: the four entries plotted with age (11 to 13) on one axis and sal (10 to 80) on the other; a path labelled "B+ tree order" links them in sorted key order]
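A small sketch of the linearization, using Python tuple ordering as a stand-in for the composite <age, sal> key order. The four entries are the ones from the slide; the query bounds (sal between 70 and 80) are an assumption chosen to show the effect.

```python
# Entries are (age, sal) pairs; tuple comparison sorts by age, then by sal,
# which mimics the order of a composite-key B+ tree on <age, sal>.
entries = [(12, 20), (13, 75), (11, 80), (12, 10)]
bplus_order = sorted(entries)
print(bplus_order)

# A range condition on sal alone does not map to one contiguous run of the
# sorted order: the matching positions are scattered.
positions = [i for i, (age, sal) in enumerate(bplus_order) if 70 <= sal <= 80]
print(positions)
```

The two matching entries sit at opposite ends of the sorted order, so a scan driven by this index has to pass over the non-matching entries between them.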
Multidimensional Indexes
• A multidimensional index clusters entries so as to exploit “nearness” in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge!
• Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>
[Figure: the same four entries plotted with age (11 to 13) against sal (10 to 80), showing the spatial clusters alongside the B+ tree order]
Motivation for Multidimensional Indexes
• Spatial queries (GIS, CAD).
  o Find all hotels within a radius of 5 miles from the conference venue.
  o Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
  o Find all cities that lie on the Nile in Egypt.
  o Find all parts that touch the fuselage (in a plane design).
• Similarity queries (content-based retrieval).
  o Given a face, find the five most similar faces.
• Multidimensional range queries.
  o 50 < age < 55 AND 80K < sal < 90K
What’s the difficulty?
• An index based on spatial location is needed.
  o One-dimensional indexes don’t support multidimensional searching efficiently.
  o Hash indexes only support point queries; want to support range queries as well.
  o Must support inserts and deletes gracefully.
• Ideally, want to support non-point data as well (e.g., lines, shapes).
Multi-dimensional Indexes
• Multi-key Indexes
• Grid Files
• Partitioned Hash Indexes
• kd-Trees
• Quad Trees
• R Trees
• Bitmap indexes
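Partitioned hash function
Idea: hash each key separately, Key1 with h1 and Key2 with h2, and concatenate the two results into one bucket address (e.g., 010110 from h1 followed by 1110010 from h2).
Example: buckets 000, 001, 010, 011, 100, 101, 110, 111
  h1(toy) = 0      h2(10k) = 01
  h1(sales) = 1    h2(20k) = 11
  h1(art) = 1      h2(30k) = 01
  ...              h2(40k) = 00
                   ...
Insert <Fred,toy,10k>, <Joe,sales,10k>, <Sally,art,30k>
[Figure: after the inserts, <Joe> and <Sally> share one bucket and <Fred> is in another]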
Example queries (same hash functions h1 and h2, buckets 000 through 111):
• Find Emp. with Dept. = Sales and Sal = 40k
• Find Emp. with Sal = 30k
• Find Emp. with Dept. = Sales
[Figure, repeated for each query: the buckets hold <Fred>, <Joe>, <Jan>, <Mary>, <Sally>, <Tom>, <Bill>, <Andy>]
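The partitioned hash example can be traced with a short sketch. The hash values are the ones listed on the slides; the assumption here is that the bucket address is the h1 bit followed by the two h2 bits, and the function names `bucket` and `candidate_buckets` are hypothetical.

```python
from itertools import product

# Hash values as listed on the slides.
H1 = {"toy": "0", "sales": "1", "art": "1"}
H2 = {"10k": "01", "20k": "11", "30k": "01", "40k": "00"}

def bucket(dept, sal):
    """Bucket address: h1(dept) followed by h2(sal) (assumed bit order)."""
    return H1[dept] + H2[sal]

def candidate_buckets(dept=None, sal=None):
    """Buckets a query must examine; an unspecified key leaves its bits free."""
    first = [H1[dept]] if dept is not None else ["0", "1"]
    last = [H2[sal]] if sal is not None else ["00", "01", "10", "11"]
    return sorted(a + b for a, b in product(first, last))

buckets = {}
for name, dept, sal in [("Fred", "toy", "10k"),
                        ("Joe", "sales", "10k"),
                        ("Sally", "art", "30k")]:
    buckets.setdefault(bucket(dept, sal), []).append(name)

print(buckets)
print(candidate_buckets(dept="sales", sal="40k"))  # both keys known
print(candidate_buckets(sal="30k"))                # only the h2 bits known
print(candidate_buckets(dept="sales"))             # only the h1 bit known
```

Under that assumed bit order, a query that fixes both keys touches one bucket, a query on Sal alone touches two, and a query on Dept alone touches four.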
Grid File
• Hashing method for multidimensional points (an extension of extensible hashing)
• Idea: use a grid to partition the space; each cell is associated with one page
• Two-disk-access principle (exact match)
Grid File
• Start with one bucket for the whole space.
• Select dividers along each dimension. Partition space into cells.
• Dividers cut all the way.
• Each cell corresponds to 1 disk page.
• Many cells can point to the same page.
• Cell directory potentially exponential in the number of dimensions
Grid File Implementation
• Dynamic structure using a grid directory
  o Grid array: a 2-dimensional array with pointers to buckets (this array can be large, disk resident): G(0, …, nx-1, 0, …, ny-1)
  o Linear scales: two 1-dimensional arrays that are used to access the grid array (main memory): X(0, …, nx-1), Y(0, …, ny-1)
Example
[Figure: linear scale X and linear scale Y index into the grid directory, whose cells point to buckets (disk blocks)]
Grid File Search
• Exact match search: at most 2 I/Os, assuming the linear scales fit in memory.
  o First use the linear scales to determine the index into the cell directory.
  o Access the cell directory to retrieve the bucket address (may cause 1 I/O if the cell directory does not fit in memory).
  o Access the appropriate bucket (1 I/O).
• Range queries:
  o Use the linear scales to determine the index into the cell directory.
  o Access the cell directory to retrieve the bucket addresses of the buckets to visit.
  o Access the buckets.
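The search steps can be sketched in memory as follows. The divider values, the directory layout, and the bucket contents are made-up sample data, and the dictionary lookups stand in for the directory and bucket I/Os; this is an illustration of the lookup path, not a disk-based implementation.

```python
from bisect import bisect_right

# Linear scales: divider positions along each dimension (hypothetical values).
x_scale = [10, 20]   # x columns: (-inf, 10), [10, 20), [20, +inf)
y_scale = [50]       # y rows:    (-inf, 50), [50, +inf)

# Grid directory: grid[column][row] -> bucket id. Several cells share a bucket.
grid = [["A", "A"],
        ["B", "C"],
        ["B", "C"]]

# Buckets stand in for disk pages.
buckets = {"A": [(5, 30), (7, 80)],
           "B": [(12, 10), (25, 40)],
           "C": [(15, 70)]}

def cell(x, y):
    """Use the linear scales to find the directory cell of a point."""
    return bisect_right(x_scale, x), bisect_right(y_scale, y)

def exact_match(x, y):
    i, j = cell(x, y)            # linear scales (in memory)
    bucket_id = grid[i][j]       # directory access (possibly 1 I/O)
    return (x, y) in buckets[bucket_id]   # bucket access (1 I/O)

def range_query(x_lo, x_hi, y_lo, y_hi):
    i_lo, j_lo = cell(x_lo, y_lo)
    i_hi, j_hi = cell(x_hi, y_hi)
    # Collect each bucket once, even if several overlapping cells point to it.
    ids = {grid[i][j]
           for i in range(i_lo, i_hi + 1)
           for j in range(j_lo, j_hi + 1)}
    return sorted(p for b in ids for p in buckets[b]
                  if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)

print(exact_match(12, 10))
print(range_query(0, 16, 0, 100))
```

Using a set of bucket ids in `range_query` reflects the point that many cells can share a page, so each page is read only once.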
Grid File Insertions
• Determine the bucket into which the insertion must occur.
• If there is space in the bucket, insert.
• Else, split the bucket.
  o How to choose a good dimension to split?
• If a bucket split causes the cell directory to split, do so and adjust the linear scales.
  o Insertion of these new entries potentially requires a complete reorganization of the cell directory, which is expensive!
Grid File Deletions
• Deletions may decrease the space utilization. Merge buckets.
• We need to decide which cells to merge and a merging threshold.
• Buddy system and neighbor system
  o Buddy system: a bucket can merge with only one buddy in each dimension.
  o Neighbor system: merge adjacent regions if the result is a rectangle.
Grid File Example
[Figure sequence: points 1 to 16 are inserted into a grid file (N=6); the grid directory and the bucket contents are shown after each split]
• Points 1 to 6: A = {1, 2, 3, 4, 5, 6}
• After point 7: A = {1, 3, 5, 7}, B = {2, 4, 6}
• After points 8 to 12: A = {1, 3, 5, 7, 8, 10}, B = {2, 4, 6, 9, 11, 12}
• After point 13: A = {1, 7, 8, 13}, B = {2, 4, 6, 9, 11, 12}, C = {3, 5, 10}; points 14 and 15 are inserted next
• After point 16: A = {1, 8, 13, 16}, B = {2, 4, 6, 9, 11, 12}, C = {3, 5, 10}, D = {7, 14, 15}
Grid File Example
[Figure: a grid file with dividers x1 to x4 and y1 to y4 and buckets A to I; several directory cells point to the same bucket]
Kd-Trees
• Binary partitioning of space. Each internal node splits on some attribute a and value V: a < V on one side, a >= V on the other.
• The dimension used to “cut” or “split” alternates among all dimensions.
• A split doesn’t have to span the whole dimension (unlike Grid Files).
• Leaves are blocks that hold the points.
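A minimal in-memory sketch of the properties above: internal nodes test a < V versus a >= V, the split dimension alternates with depth, and leaves are small blocks of points. Choosing the median as V, the leaf size of 2, the dictionary node layout, and the sample points are all assumptions for illustration.

```python
def build(points, depth=0, leaf_size=2):
    """Build a kd-tree; leaves are blocks holding at most leaf_size points."""
    if len(points) <= leaf_size:
        return {"leaf": list(points)}
    dim = depth % len(points[0])           # alternate the split dimension
    pts = sorted(points, key=lambda p: p[dim])
    v = pts[len(pts) // 2][dim]            # median value as the split value V
    left = [p for p in pts if p[dim] < v]      # a < V
    right = [p for p in pts if p[dim] >= v]    # a >= V
    if not left:                           # duplicates: cannot split on V
        return {"leaf": pts}
    return {"dim": dim, "value": v,
            "left": build(left, depth + 1, leaf_size),
            "right": build(right, depth + 1, leaf_size)}

def range_search(node, low, high):
    """Return the points p with low[i] <= p[i] <= high[i] in every dimension."""
    if "leaf" in node:
        return [p for p in node["leaf"]
                if all(l <= x <= h for x, l, h in zip(p, low, high))]
    dim, v = node["dim"], node["value"]
    found = []
    if low[dim] < v:                       # query reaches the a < V side
        found += range_search(node["left"], low, high)
    if high[dim] >= v:                     # query reaches the a >= V side
        found += range_search(node["right"], low, high)
    return found

# Hypothetical sample points.
pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(pts)
print(sorted(range_search(tree, (3, 1), (8, 5))))
```

The search descends into a subtree only when the query range can intersect that side of the split, which is what lets it skip whole blocks.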
Kd-Trees Example
[Figure: a kd-tree built step by step; block A is split by the lines x1, y1, x2 and y2 into leaf blocks B, C, D, E and F]
Kd-Trees Example
[Figure: kd-tree "kdA" over points 1 to 21, with split lines x1 to x9 and y1 to y9]
Kd-Trees Example
[Figure: kd-tree "kdB" over points 1 to 21, with split lines x1 to x9 and y1 to y9 and blocks B, C, D, E and F]
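Region Quadtree
[Figure: a region divided recursively into NW, NE, SW and SE quadrants with numbered leaf regions 1 to 19, and the corresponding tree with nodes A, B, C, D, E and F]
Point Quad-tree
[Figure: a square with corners (0,0), (100,0), (0,100) and (100,100) and labelled coordinates (50,50), (25,25), (75,25), (75,75), (20,88), (88,65), (52,15) and (92,1)]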
The R-Tree
• The R-tree is a tree-structured index that remains balanced on inserts and deletes.
• Each key stored in a leaf entry is intuitively a box, or collection of intervals, with one interval per dimension.
• Example in 2-D:
[Figure: boxes in the X-Y plane, from the root of the R-tree down to the leaf level]
R-Tree Properties
• Leaf entry = <n-dimensional box, rid>
  o The key value is a box; the box is the tightest bounding box for a data object.
• Non-leaf entry = <n-dim box, ptr to child node>
  o The box covers all boxes in the child node (in fact, in the subtree).
• All leaves are at the same distance from the root.
• Nodes can be kept 50% full (except the root).
  o Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.
Example of an R-Tree
[Figure: boxes R1 to R19 in the plane. Labels: leaf entry; index entry; spatial object approximated by bounding box R8]
Example R-Tree (Contd.)
[Figure: the corresponding tree. Root: R1, R2. Next level: R3, R4, R5, R6, R7. Leaf level: R8 R9 R10 | R11 R12 | R13 R14 | R15 R16 | R17 R18 R19]
Search for Objects Overlapping Box Q
Start at root.
1. If the current node is a non-leaf, for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf, for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
Note: May have to search several subtrees at each node! (In contrast, a B-tree equality search goes to just one leaf.)
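The two search steps translate almost directly into a recursive function. The sketch below assumes 2-D boxes stored as (xmin, ymin, xmax, ymax) tuples and a hypothetical `Node` class; the tiny hand-built tree is sample data, not an R-tree produced by an insertion algorithm.

```python
from dataclasses import dataclass, field

def overlaps(a, b):
    """True if boxes a and b, given as (xmin, ymin, xmax, ymax), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    is_leaf: bool
    # Leaf entries are (box, rid); non-leaf entries are (box, child Node).
    entries: list = field(default_factory=list)

def search(node, q):
    """Return the rids of leaf entries whose box overlaps the query box q."""
    hits = []
    for box, ref in node.entries:
        if overlaps(box, q):
            if node.is_leaf:
                hits.append(ref)              # candidate object
            else:
                hits.extend(search(ref, q))   # may descend several subtrees
    return hits

# Hypothetical two-level tree.
leaf1 = Node(True, [((0, 0, 2, 2), "a"), ((1, 1, 3, 3), "b")])
leaf2 = Node(True, [((6, 6, 8, 8), "c"), ((7, 0, 9, 2), "d")])
root = Node(False, [((0, 0, 3, 3), leaf1), ((6, 0, 9, 8), leaf2)])

print(search(root, (2.5, 2.5, 6.5, 6.5)))
print(search(root, (4, 4, 5, 5)))
```

In the first query both subtrees are visited because the query box overlaps both root entries; in the second neither is, so no leaf is read.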
Improving Search Using Constraints
• It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
• But why not use convex polygons to approximate query regions more accurately?
  o Will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
  o Cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.
Insert Entry <B, ptr>
• Start at root and go down to the “best-fit” leaf L.
  o Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the smallest-area child.
• If best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  o Adjust the entry for L in its parent so that the box now covers (only) L1.
  o Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)
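The "least enlargement" rule used on the way down can be written as a small helper. The sketch assumes 2-D boxes as (xmin, ymin, xmax, ymax) tuples; the function names and the sample boxes are hypothetical.

```python
def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def union(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

def enlargement(box, new):
    """Extra area box would need in order to cover new."""
    return area(union(box, new)) - area(box)

def choose_child(child_boxes, new):
    """Index of the child needing least enlargement; ties go to smaller area."""
    return min(range(len(child_boxes)),
               key=lambda i: (enlargement(child_boxes[i], new),
                              area(child_boxes[i])))

print(choose_child([(0, 0, 4, 4), (5, 5, 7, 7)], (6, 6, 8, 8)))
print(choose_child([(0, 0, 6, 6), (0, 0, 4, 4)], (1, 1, 2, 2)))
```

In the second call both children already cover the new box (zero enlargement), so the tie is resolved in favour of the smaller one.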
Splitting a Node During Insertion
• The entries in node L plus the newly inserted entry must be distributed between L1 and L2.
• Goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
• Idea: redistribute so as to minimize the area of L1 plus the area of L2.
• An exhaustive algorithm is too slow; quadratic and linear heuristics are described in the paper.
[Figure: two ways of splitting the same entries, one labelled "GOOD SPLIT!" and the other "BAD!"]
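To make the split criterion concrete, the sketch below runs the exhaustive search on a tiny node and reports the distribution with the smallest total area. It is the slow exhaustive approach, shown only because it states the goal directly; it is not the quadratic or linear heuristic. Box layout, function names, and sample boxes are assumptions.

```python
from itertools import combinations

def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def mbr(boxes):
    """Minimum bounding rectangle of a non-empty list of boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def split_cost(l1, l2):
    return area(mbr(l1)) + area(mbr(l2))

def best_split(boxes, min_fill=1):
    """Try every 2-way distribution; exponential, so only for tiny nodes."""
    n = len(boxes)
    best = None
    for k in range(min_fill, n - min_fill + 1):
        for group in combinations(range(n), k):
            if 0 not in group:
                continue  # keep box 0 in L1 to skip mirrored duplicates
            l1 = [boxes[i] for i in group]
            l2 = [boxes[i] for i in range(n) if i not in group]
            cost = split_cost(l1, l2)
            if best is None or cost < best[0]:
                best = (cost, l1, l2)
    return best

# Hypothetical overflowing node: two boxes near the origin, two far away.
boxes = [(0, 0, 1, 1), (1, 1, 2, 2), (8, 8, 9, 9), (9, 9, 10, 10)]
cost, l1, l2 = best_split(boxes)
print(cost, l1, l2)
print(split_cost([boxes[0], boxes[3]], [boxes[1], boxes[2]]))  # a bad split
```

Grouping nearby boxes gives two small rectangles, while pairing a near box with a far one produces two large, overlapping rectangles that most queries would both hit.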
R-Tree Variants
• The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  o Remove some (say, 30% of the) entries and reinsert them into the tree.
  o Could result in all reinserted entries fitting on some existing pages, avoiding a split.
• R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
• Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  o Searches now take a single path to a leaf, at the cost of redundancy.
GiST
• The Generalized Search Tree (GiST) abstracts the “tree” nature of a class of indexes including B+ trees and R-tree variants.
  o Striking similarities in insert/delete/search and even concurrency control algorithms make it possible to provide “templates” for these algorithms that can be customized to obtain the many different tree index structures.
  o B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
  o GiST provides an alternative for implementing other tree indexes in an ORDBMS.
Comments on R-Trees
• Deletion consists of searching for the entry to be deleted, removing it, and if the node becomes under-full, deleting the node and then re-inserting the remaining entries.
• Overall, works quite well for 2D and 3D datasets. Several variants (notably, R+ and R* trees) have been proposed; widely used.
• Can improve search performance by using a convex polygon to approximate query shape (instead of a bounding box) and testing for polygon-box intersection.
Bitmap Index
• Bitmap index: a specialized index that takes advantage of:
  o Read-mostly data: data produced from scientific experiments can be appended in large groups.
  o Fast operations: “predicate queries” can be performed with bitwise logical operations, which are well supported by hardware.
    Predicate ops: =, <, >, <=, >=, range. Logical ops: AND, OR, XOR, NOT.
• Easy to compress, potentially small index size.
• Each individual bitmap is small, and frequently used ones can be cached in memory.
Bitmap Index
• Can be useful for stable columns with few values
• Bitmap:
  o String of bits: 0 (no match) or 1 (match)
  o One bit for each row
• Bitmap index record
  o Column value
  o Bitmap
  o DBMS converts bit position into row identifier.
Bitmap Index Example
Faculty Table
RowId  FacSSN       …  FacRank
1      098-55-1234     Asst
2      123-45-6789     Asst
3      456-89-1243     Assc
4      111-09-0245     Prof
5      931-99-2034     Asst
6      998-00-1245     Prof
7      287-44-3341     Assc
8      230-21-9432     Asst
9      321-44-5588     Prof
10     443-22-3356     Assc
11     559-87-3211     Prof
12     220-44-5688     Asst
Bitmap Index on FacRank
FacRank  Bitmap
Asst     110010010001
Assc     001000100100
Prof     000101001010
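The FacRank bitmaps can be rebuilt from the table with a few lines of code. This sketch stores each bitmap as a plain list of 0/1 values rather than packed machine words, and the helper names are hypothetical; the FacRank column is copied from the Faculty Table in row order.

```python
# FacRank column of the Faculty Table, rows 1 to 12.
ranks = ["Asst", "Asst", "Assc", "Prof", "Asst", "Prof",
         "Assc", "Asst", "Prof", "Assc", "Prof", "Asst"]

def build_bitmap_index(values):
    """One bitmap per distinct value; bit i is 1 if row i+1 has that value."""
    index = {}
    for row, v in enumerate(values):
        index.setdefault(v, [0] * len(values))[row] = 1
    return index

def to_string(bits):
    return "".join(str(b) for b in bits)

def bit_or(a, b):
    return [x | y for x, y in zip(a, b)]

def row_ids(bits):
    """Convert bit positions back into 1-based row identifiers."""
    return [i + 1 for i, b in enumerate(bits) if b]

index = build_bitmap_index(ranks)
for rank in ("Asst", "Assc", "Prof"):
    print(rank, to_string(index[rank]))

# Predicate "FacRank = Asst OR FacRank = Prof" as a bitwise OR.
print(row_ids(bit_or(index["Asst"], index["Prof"])))
```

A real implementation would pack the bits into words so that one hardware instruction combines many rows at once.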
Compressing Bitmaps: Run Length Encoding
• Bit vector for Assc: 0010000000010000100
• Runs: 2, 8, 4
• Can we do: 10 1000 100? Is this unambiguous?
• Fixed bits per run (max=4): 0010 1000 0100
• Variable bits per run: 1010 11101000 110100
  o 11101000 is broken into: 1110 (#bits), 1000 (value), i.e., 4 bits are required, value is 8
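The variable-bits scheme can be checked with a short encoder and decoder. The sketch follows the slide's example: each run length is written in binary, preceded by a unary prefix (j-1 ones and a zero) giving the number of bits j. Function names are hypothetical, and the zeros that trail the last 1 are simply not recorded.

```python
def runs_of_zeros(bits):
    """Lengths of the runs of 0s that precede each 1 in the bit string."""
    runs, count = [], 0
    for b in bits:
        if b == "0":
            count += 1
        else:
            runs.append(count)
            count = 0
    return runs   # zeros after the last 1 are not recorded

def encode_run(n):
    """Unary bit count (j-1 ones, then a zero) followed by n in j binary bits."""
    value = format(n, "b")
    return "1" * (len(value) - 1) + "0" + value

def encode(bits):
    return "".join(encode_run(r) for r in runs_of_zeros(bits))

def decode(code):
    """Recover the run lengths from a variable-bits encoding."""
    runs, i = [], 0
    while i < len(code):
        nbits = 1
        while code[i] == "1":     # count the ones of the unary prefix
            nbits += 1
            i += 1
        i += 1                    # skip the terminating zero
        runs.append(int(code[i:i + nbits], 2))
        i += nbits
    return runs

vector = "0010000000010000100"
print(runs_of_zeros(vector))
print(encode(vector))
print(decode(encode(vector)))
```

Because the prefix tells the decoder how many value bits follow, the concatenated code can be split unambiguously, which the bare form 10 1000 100 cannot.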
Operation-efficient Compression Methods
• Based on variations of Run Length Compression
• Uncompressed: 0000000000001111000000000 ...... 0000001000000001111111100000000 .... 0000100
• Compressed: 12, 0, 0, 0, 0, 1000, 8, 0, 0, 0, 0, 0, 0, 0, 0, 1000
• Store very short sequences as-is
• AND/OR/COUNT operations: can uncompress on the fly
Indexing High-Dimensional Data
• Typically, high-dimensional datasets are collections of points, not regions.
  o E.g., feature vectors in multimedia applications.
  o Very sparse.
• Nearest neighbor queries are common.
  o R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
• As dimensionality increases, contrast (ratio of distances between nearest and farthest points) usually decreases; “nearest neighbor” becomes less meaningful.
  o In any given data set, it is advisable to empirically test contrast.
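One way to test contrast empirically is sketched below. It measures contrast as the farthest distance divided by the nearest distance from a single query point, on uniformly random synthetic data; that definition of the ratio, the synthetic data, the point count, and the seed are all assumptions, and a real test would use the actual dataset and several query points.

```python
import math
import random

def contrast(dim, n_points=500, seed=0):
    """Farthest-to-nearest distance ratio from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return max(dists) / min(dists)

for dim in (2, 10, 100):
    print(dim, round(contrast(dim), 2))
```

A ratio close to 1 means the nearest and farthest points are almost equally far from the query, so the answer to a nearest-neighbor query carries little information.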