Multidimensional Indexing: Spatial Data Management & High Dimensional Indexing


Types of Spatial Data

• Point data
  o Points in a multidimensional space
  o E.g., raster data such as satellite imagery, where each pixel stores a measured value
  o E.g., feature vectors extracted from text
• Region data
  o Objects have spatial extent, with a location and a boundary
  o The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data

Spatial Indexing

• Point Access Methods (PAMs) vs. Spatial Access Methods (SAMs)
• PAM: indexes only point data
  o Hierarchical (tree-based) structures
  o Multidimensional hashing
  o Space-filling curves
• SAM: indexes both points and regions
  o Transformations
  o Overlapping regions
  o Clipping methods (non-overlapping)
• Data partitioning vs. space partitioning

Types of Spatial Queries

• Spatial range queries
  o Find all cities within 50 miles of Troy
  o The query has an associated region (location, boundary)
  o The answer includes overlapping or contained data regions
• Nearest-neighbor queries
  o Find the 10 cities nearest to Troy
  o Results must be ordered by proximity
• Spatial join queries
  o Find all cities near a lake
  o Expensive: the join condition involves regions and proximity

Applications of Spatial Data

• Geographic Information Systems (GIS)
  o E.g., ESRI’s ArcInfo; OpenGIS Consortium
  o Geospatial information
  o All classes of spatial queries and data are common
• Computer-Aided Design/Manufacturing
  o Store spatial objects such as the surface of an airplane fuselage
  o Range queries and spatial join queries are common
• Multimedia Databases
  o Images, video, text, etc. stored and retrieved by content
  o First converted to feature vector form; high dimensionality
  o Nearest-neighbor queries are the most common

• Requirements
  o Fast range/window query search (range query)
  o Fast similarity search
    • Similarity range query
    • K-nearest neighbour query (KNN query)

High Dimensional Indexing

[Figure: complex objects (one panel is labelled Nasdaq) pass through feature extraction and transformation to become feature vectors; index construction then yields an index for range/similarity search, which answers similarity queries]

Feature-Based Similarity Search

[Figure: sample image, http://unicorn.comp.nus.edu.sg/~mmir/cgi-bin/PLIB-IMAGE2/VOL1/pdisc6/pd6_151.jpg]

Retrieval by Colour

• Given a sample image, similarity search is based on its colour composition

Query Requirement

• Window/range query: retrieve the data points that fall within a given range along each dimension.
  o The index is designed to support range retrieval and to facilitate joins and similarity search (if applicable).
• Similarity queries: similarity range queries and KNN queries
  o Similarity range query: given a query point, find all data points within a given distance r of the query point.
  o KNN query: given a query point, find the K data points nearest to it by distance.

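[Figure: a similarity range query drawn as a circle of radius r around the query point, and a KNN query whose circle reaches the Kth nearest neighbour (Kth NN)]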
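The two similarity queries can be sketched with a brute-force scan, which is the baseline that any index tries to beat. This is only an illustration: the function names, the sample points and the choice of Euclidean distance are assumptions, not part of the slides.

```python
import heapq
import math

def similarity_range_query(points, query, r):
    """Return all points whose Euclidean distance to `query` is at most r."""
    return [p for p in points if math.dist(p, query) <= r]

def knn_query(points, query, k):
    """Return the k points nearest to `query`, ordered by increasing distance."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Hypothetical 2-D feature vectors.
points = [(1.0, 1.0), (2.0, 2.0), (5.0, 5.0), (9.0, 1.0)]
query = (0.0, 0.0)

in_range = similarity_range_query(points, query, r=3.0)
nearest_two = knn_query(points, query, k=2)
```

Both functions inspect every point, so their cost grows linearly with the data set; the index structures in the rest of the deck aim to prune most of that work.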

Single-Dimensional Indexes

B+ trees are fundamentally single-dimensional indexes.

When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space since we sort entries first by age and then by sal.

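Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>

[Figure: the four entries plotted with age (11 to 13) on the x-axis and sal (10 to 80) on the y-axis; a path through the points shows the B+ tree order]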
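A small sketch of the linearization, using the four entries from the slide. Sorting tuples in Python compares age first and then sal, which mirrors a composite <age, sal> key; the helper `range_on_sal` is a hypothetical name used only for this illustration.

```python
# The four <age, sal> entries from the slide, in arbitrary insertion order.
entries = [(13, 75), (12, 20), (11, 80), (12, 10)]

# Tuples compare by age first, then by sal: the order of a composite-key B+ tree.
bplus_order = sorted(entries)

def range_on_sal(sorted_entries, lo, hi):
    """A range on sal alone cannot exploit the <age, sal> order,
    so every entry has to be inspected."""
    return [e for e in sorted_entries if lo <= e[1] <= hi]

high_sal = range_on_sal(bplus_order, 70, 90)
positions = [bplus_order.index(e) for e in high_sal]
```

The two entries with similar salaries, <11, 80> and <13, 75>, end up at opposite ends of the sorted order, even though they are close together in the 2-dimensional space.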

Multidimensional Indexes

• A multidimensional index clusters entries so as to exploit “nearness” in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge!

Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>

[Figure: the entries plotted with age (11 to 13) on the x-axis and sal (10 to 80) on the y-axis, contrasting the spatial clusters with the B+ tree order]

Motivation for Multidimensional Indexes

• Spatial queries (GIS, CAD).
  o Find all hotels within a radius of 5 miles from the conference venue.
  o Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
  o Find all cities that lie on the Nile in Egypt.
  o Find all parts that touch the fuselage (in a plane design).
• Similarity queries (content-based retrieval).
  o Given a face, find the five most similar faces.
• Multidimensional range queries.
  o 50 < age < 55 AND 80K < sal < 90K

What’s the difficulty?

• An index based on spatial location is needed.
  o One-dimensional indexes don’t support multidimensional searching efficiently.
  o Hash indexes only support point queries; we want to support range queries as well.
  o Must support inserts and deletes gracefully.
• Ideally, want to support non-point data as well (e.g., lines, shapes).

Multi-dimensional Indexes

• Multi-key indexes
• Grid files
• Partitioned hash indexes
• kd-trees
• Quad trees
• R-trees
• Bitmap indexes

Partitioned hash function

Idea: hash each key with its own hash function and concatenate the resulting bits to form the bucket address.

[Figure: Key1 is hashed by h1 and Key2 by h2; the two bit strings (e.g., 010110 and 1110010) are placed side by side]

Example:

h1(toy)   = 0      h2(10k) = 01
h1(sales) = 1      h2(20k) = 11
h1(art)   = 1      h2(30k) = 01
                   h2(40k) = 00

Bucket addresses: 000, 001, 010, 011, 100, 101, 110, 111

Insert <Fred,toy,10k>, <Joe,sales,10k>, <Sally,art,30k>:
• <Fred> goes to bucket 001 (0 followed by 01)
• <Joe> and <Sally> both go to bucket 101 (1 followed by 01)

[Figure: the buckets after further inserts, holding <Fred>, <Joe>, <Jan>, <Mary>, <Sally>, <Tom>, <Bill>, <Andy>]

Queries on this structure:
• Find Emp. with Dept. = Sales and Sal = 40k: only bucket 100 needs to be examined.
• Find Emp. with Sal = 30k: the first bit is unknown, so buckets 001 and 101 are examined.
• Find Emp. with Dept. = Sales: the last two bits are unknown, so buckets 100, 101, 110 and 111 are examined.
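The partitioned hash example can be reproduced in a few lines. The hash values below are the ones given on the slide; the function names (`insert`, `candidate_buckets`, `find`) and the use of an in-memory dictionary for the buckets are assumptions made for this sketch.

```python
from collections import defaultdict
from itertools import product

# Hash values taken from the slide: h1 yields 1 bit, h2 yields 2 bits.
H1 = {"toy": "0", "sales": "1", "art": "1"}
H2 = {"10k": "01", "20k": "11", "30k": "01", "40k": "00"}

buckets = defaultdict(list)

def insert(name, dept, sal):
    """Bucket address = h1(dept) bits followed by h2(sal) bits."""
    buckets[H1[dept] + H2[sal]].append((name, dept, sal))

def candidate_buckets(dept=None, sal=None):
    """Addresses to inspect; an unspecified key leaves its bits free."""
    first = [H1[dept]] if dept is not None else ["0", "1"]
    rest = [H2[sal]] if sal is not None else ["00", "01", "10", "11"]
    return sorted(a + b for a, b in product(first, rest))

def find(dept=None, sal=None):
    """Scan the candidate buckets and filter out hash collisions."""
    hits = []
    for addr in candidate_buckets(dept, sal):
        for rec in buckets.get(addr, []):
            if (dept is None or rec[1] == dept) and (sal is None or rec[2] == sal):
                hits.append(rec)
    return hits

for rec in [("Fred", "toy", "10k"), ("Joe", "sales", "10k"), ("Sally", "art", "30k")]:
    insert(*rec)
```

A fully specified query touches a single bucket, while a partial-match query touches every bucket whose fixed bits agree with the known key.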

Grid File

• Hashing method for multidimensional points (an extension of extensible hashing)
• Idea: use a grid to partition the space; each cell is associated with one page
• Two-disk-access principle (exact match)

Grid File

• Start with one bucket for the whole space.
• Select dividers along each dimension. Partition the space into cells.
• Dividers cut all the way.
• Each cell corresponds to 1 disk page.
• Many cells can point to the same page.
• The cell directory is potentially exponential in the number of dimensions.

Grid File Implementation

• Dynamic structure using a grid directory
  o Grid array: a 2-dimensional array with pointers to buckets (this array can be large and disk resident): G(0, …, nx-1, 0, …, ny-1)
  o Linear scales: two 1-dimensional arrays that are used to access the grid array (kept in main memory): X(0, …, nx-1), Y(0, …, ny-1)

Example

[Figure: linear scale X and linear scale Y index into the grid directory, whose cells point to buckets (disk blocks)]

Grid File Search

• Exact match search: at most 2 I/Os, assuming the linear scales fit in memory.
  o First use the linear scales to determine the index into the cell directory.
  o Access the cell directory to retrieve the bucket address (may cause 1 I/O if the cell directory does not fit in memory).
  o Access the appropriate bucket (1 I/O).
• Range queries:
  o Use the linear scales to determine the indexes into the cell directory.
  o Access the cell directory to retrieve the addresses of the buckets to visit.
  o Access the buckets.
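The search steps above can be sketched as follows. The divider positions, directory contents and bucket contents are made-up illustration data, and in-memory lists stand in for the disk-resident directory and buckets.

```python
from bisect import bisect_right

# Hypothetical linear scales: divider positions along each dimension.
X_SCALE = [10, 20]   # x-cells: x < 10, 10 <= x < 20, x >= 20
Y_SCALE = [50]       # y-cells: y < 50, y >= 50

# Grid directory: GRID[i][j] is the bucket for x-cell i and y-cell j.
# Several cells may share one bucket.
GRID = [["A", "B"],
        ["A", "C"],
        ["D", "C"]]

BUCKETS = {
    "A": [(5, 10), (15, 40)],
    "B": [(3, 70)],
    "C": [(12, 60), (25, 90)],
    "D": [(30, 20)],
}

def cell_of(x, y):
    """Use the linear scales to find the directory indexes of a point."""
    return bisect_right(X_SCALE, x), bisect_right(Y_SCALE, y)

def exact_match(x, y):
    i, j = cell_of(x, y)
    bucket_id = GRID[i][j]               # directory access (possibly 1 I/O)
    return (x, y) in BUCKETS[bucket_id]  # bucket access (1 I/O)

def range_query(x_lo, x_hi, y_lo, y_hi):
    i_lo, j_lo = cell_of(x_lo, y_lo)
    i_hi, j_hi = cell_of(x_hi, y_hi)
    # Collect each bucket once, even if several cells point to it.
    bucket_ids = {GRID[i][j]
                  for i in range(i_lo, i_hi + 1)
                  for j in range(j_lo, j_hi + 1)}
    return sorted(p for b in bucket_ids for p in BUCKETS[b]
                  if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)
```

Using a set for `bucket_ids` reflects the fact that many cells can point to the same page, so a shared bucket is read only once per query.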

Grid File Insertions

• Determine the bucket into which the insertion must occur.
• If there is space in the bucket, insert.
• Else, split the bucket.
  o How to choose a good dimension to split?
• If the bucket split causes the cell directory to split, do so and adjust the linear scales.
• Insertion of these new directory entries potentially requires a complete reorganization of the cell directory, which is expensive!

Grid File Deletions

• Deletions may decrease the space utilization, so buckets are merged.
• We need to decide which cells to merge and a merging threshold.
• Buddy system and neighbor system
  o Buddy system: a bucket can merge with only one buddy in each dimension
  o Neighbor system: merge adjacent regions if the result is a rectangle

Grid File Example

[Figure sequence: points 1 to 16 are inserted into a grid file with (N=6). Bucket contents shown at each stage:]

• A = {1, 2, 3, 4, 5, 6}
• After point 7 (A splits): A = {1, 3, 5, 7}, B = {2, 4, 6}
• After points 8 to 12: A = {1, 3, 5, 7, 8, 10}, B = {2, 4, 6, 9, 11, 12}
• After point 13 (A splits, creating C): A = {1, 7, 8, 13}, B = {2, 4, 6, 9, 11, 12}, C = {3, 5, 10}
• After points 14 to 16 (A splits again, creating D): A = {1, 8, 13, 16}, B = {2, 4, 6, 9, 11, 12}, C = {3, 5, 10}, D = {7, 14, 15}

Grid File Example

[Figure: a grid directory with linear scales x1 to x4 and y1 to y4 whose cells point to buckets A to I; several cells share the same bucket]

Kd-Trees

• Binary partitioning of space. Each internal node splits on some attribute a at a value V: a < V on one side and a >= V on the other.
• The dimension to “cut” or “split” alternates among all dimensions.
• A split doesn’t have to span the whole dimension (unlike in Grid Files).
• Leaves are blocks that hold the points.
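A minimal kd-tree sketch along these lines is shown below. Splitting at the median and the block capacity `leaf_size` are assumptions made for the illustration; the slides do not prescribe how V is chosen.

```python
def build(points, depth=0, leaf_size=2):
    """Build a kd-tree as nested dicts; leaves are small blocks of points."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    axis = depth % len(points[0])            # alternate the split dimension
    pts = sorted(points, key=lambda p: p[axis])
    split = pts[len(pts) // 2][axis]         # median value, used as V
    left = [p for p in pts if p[axis] < split]     # a < V
    right = [p for p in pts if p[axis] >= split]   # a >= V
    if not left:
        # The median equals the minimum, so this split separates nothing;
        # keep the points together as an oversized leaf.
        return {"leaf": pts}
    return {"axis": axis, "split": split,
            "left": build(left, depth + 1, leaf_size),
            "right": build(right, depth + 1, leaf_size)}

def range_search(node, lo, hi):
    """Return points p with lo[d] <= p[d] <= hi[d] in every dimension d."""
    if "leaf" in node:
        return [p for p in node["leaf"]
                if all(l <= x <= h for l, x, h in zip(lo, p, hi))]
    axis, split = node["axis"], node["split"]
    found = []
    if lo[axis] < split:        # the query box reaches the a < V side
        found += range_search(node["left"], lo, hi)
    if hi[axis] >= split:       # the query box reaches the a >= V side
        found += range_search(node["right"], lo, hi)
    return found

# Hypothetical sample data.
points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
result = sorted(range_search(tree, lo=(3, 0), hi=(8, 5)))
```

Because each split only divides the region of its own subtree, a range search can skip whole subtrees whose side of the split does not intersect the query box.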

[Figure: step-by-step construction of a kd-tree with split lines x1, x2, y1 and y2; leaf blocks A to F appear as the space is repeatedly split]

Kd-Trees Example

[Figure: 21 numbered points partitioned by split lines x1 to x9 and y1 to y9, shown next to the corresponding kd-tree]

KdB-Trees Example

[Figure: the same 21 points and split lines, with the tree nodes grouped into pages B to F]

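Region Quadtree

[Figure: a region recursively divided into quadrants (NW, NE, SW, SE) with blocks numbered 1 to 19, shown next to the corresponding quadtree with internal nodes A to F]

Point Quad-tree

[Figure: a point quadtree over the square from (0,0) to (100,100), containing the points (50,50), (75,75), (25,25), (75,25), (20,88), (88,65), (52,15) and (92,1)]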

The R-Tree

• The R-tree is a tree-structured index that remains balanced on inserts and deletes.
• Each key stored in a leaf entry is intuitively a box, or collection of intervals, with one interval per dimension.
• Example in 2-D:

[Figure: boxes in the X-Y plane, with the root of the R-tree and the leaf level marked]

R-Tree Properties

• Leaf entry = < n-dimensional box, rid >
  o The key value is a box: the tightest bounding box for a data object.
• Non-leaf entry = < n-dim box, ptr to child node >
  o The box covers all boxes in the child node (in fact, in the subtree).
• All leaves are at the same distance from the root.
• Nodes can be kept 50% full (except the root).
  o Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of an R-Tree

[Figure: spatial objects approximated by bounding boxes R8 to R19 (leaf entries), grouped into index entries R1 to R7]

Example R-Tree (Contd.)

[Figure: the tree structure. Root: R1, R2. Next level: R3, R4, R5, R6, R7. Leaf level: R8 to R19.]

Search for Objects Overlapping Box Q

Start at root.
1. If current node is non-leaf, for each entry <E, ptr>, if box E overlaps Q, search subtree identified by ptr.
2. If current node is leaf, for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.

Note: May have to search several subtrees at each node! (In contrast, a B-tree equality search goes to just one leaf.)
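The two search steps translate almost directly into a recursive function. The sketch below assumes 2-D boxes stored as (xmin, ymin, xmax, ymax) tuples and nodes stored as dictionaries; the tiny hand-built tree is illustration data, not the R1 to R19 example from the slides.

```python
def overlaps(a, b):
    """Boxes are (xmin, ymin, xmax, ymax); touching edges count as overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q):
    """Return the rids of all leaf entries whose box overlaps q."""
    hits = []
    for box, child in node["entries"]:
        if overlaps(box, q):
            if node["leaf"]:
                hits.append(child)               # child is a rid
            else:
                hits.extend(search(child, q))    # child is a subtree
    return hits

# Hypothetical two-level tree: each root entry's box covers its leaf's boxes.
leaf1 = {"leaf": True, "entries": [((0, 0, 2, 2), "r1"), ((1, 3, 3, 5), "r2")]}
leaf2 = {"leaf": True, "entries": [((6, 1, 8, 3), "r3"), ((5, 4, 7, 6), "r4")]}
root = {"leaf": False, "entries": [((0, 0, 3, 5), leaf1), ((5, 1, 8, 6), leaf2)]}

one_subtree = search(root, (2.5, 2.5, 4, 3.5))
two_subtrees = search(root, (2, 1, 6, 2))
```

The second query box overlaps both root entries, so both subtrees are visited, which is the behaviour the note above warns about.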

Improving Search Using Constraints

• It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
• But why not use convex polygons to approximate query regions more accurately?
  o This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
  o The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>

• Start at root and go down to the “best-fit” leaf L.
  o Go to the child whose box needs least enlargement to cover B; resolve ties by going to the smallest-area child.
• If best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  o Adjust the entry for L in its parent so that the box now covers (only) L1.
  o Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)
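The descent rule (least enlargement, ties broken by smallest area) can be sketched as below. Boxes are assumed to be 2-D (xmin, ymin, xmax, ymax) tuples, and the helper names are hypothetical.

```python
def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def union(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(box, new_box):
    """Extra area the box needs in order to cover new_box."""
    return area(union(box, new_box)) - area(box)

def choose_child(child_boxes, new_box):
    """Index of the child needing least enlargement; ties go to the smallest area."""
    return min(range(len(child_boxes)),
               key=lambda i: (enlargement(child_boxes[i], new_box),
                              area(child_boxes[i])))

# Hypothetical child boxes of one non-leaf node.
children = [(0, 0, 4, 4), (5, 0, 7, 2)]
picked = choose_child(children, (6, 1, 8, 3))
```

Here the first child would grow by 16 area units and the second by only 5, so the second child is chosen. Applying `choose_child` at every level leads to the best-fit leaf L.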

Splitting a Node During Insertion

• The entries in node L plus the newly inserted entry must be distributed between L1 and L2.
• Goal is to reduce likelihood of both L1 and L2 being searched on subsequent queries.
• Idea: Redistribute so as to minimize area of L1 plus area of L2.
• Exhaustive algorithm is too slow; quadratic and linear heuristics are described in the paper.

[Figure: two ways of splitting the same entries, one labelled GOOD SPLIT! and the other BAD!]
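To make the objective concrete, the sketch below implements the exhaustive algorithm, not the quadratic or linear heuristics: it tries every two-way partition and keeps the one with the smallest total bounding-box area. It is only practical for a handful of entries, which is exactly why heuristics are used instead. The `min_fill` parameter and the sample boxes are assumptions.

```python
from itertools import combinations

def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def mbr(boxes):
    """Minimum bounding rectangle of a list of (xmin, ymin, xmax, ymax) boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def best_split(boxes, min_fill=1):
    """Try every 2-way partition; return (cost, L1 indexes, L2 indexes)."""
    n = len(boxes)
    best = None
    for k in range(min_fill, n - min_fill + 1):
        for group in combinations(range(n), k):
            if 0 not in group:
                continue    # keep box 0 in L1 to skip mirror-image partitions
            rest = [i for i in range(n) if i not in group]
            cost = (area(mbr([boxes[i] for i in group])) +
                    area(mbr([boxes[i] for i in rest])))
            if best is None or cost < best[0]:
                best = (cost, list(group), rest)
    return best

# Two natural clusters: a good split keeps each cluster together.
boxes = [(0, 0, 1, 1), (1, 0, 2, 1), (8, 8, 9, 9), (9, 8, 10, 9)]
cost, l1, l2 = best_split(boxes)
```

Mixing the clusters (for example, boxes 0 and 2 together) would give two large, heavily overlapping bounding boxes, which corresponds to the bad split in the figure.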

R-Tree Variants

• The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  o Remove some (say, 30% of the) entries and reinsert them into the tree.
  o This could result in all reinserted entries fitting on some existing pages, avoiding a split.
• R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
• Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  o Searches now take a single path to a leaf, at the cost of redundancy.

GiST

• The Generalized Search Tree (GiST) abstracts the “tree” nature of a class of indexes including B+ trees and R-tree variants.
  o Striking similarities in insert/delete/search and even concurrency control algorithms make it possible to provide “templates” for these algorithms that can be customized to obtain the many different tree index structures.
  o B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
  o GiST provides an alternative for implementing other tree indexes in an ORDBS.

Comments on R-Trees

• Deletion consists of searching for the entry to be deleted, removing it, and if the node becomes under-full, deleting the node and then re-inserting the remaining entries.
• Overall, works quite well for 2D and 3D datasets. Several variants (notably, R+ and R* trees) have been proposed; widely used.
• Can improve search performance by using a convex polygon to approximate query shape (instead of a bounding box) and testing for polygon-box intersection.

Bitmap Index

• Bitmap index: a specialized index that takes advantage of
  o Read-mostly data: data produced from scientific experiments can be appended in large groups
  o Fast operations
    • “Predicate queries” can be performed with bitwise logical operations
      Predicate ops: =, <, >, <=, >=, range
      Logical ops: AND, OR, XOR, NOT
    • They are well supported by hardware
• Easy to compress, potentially small index size
• Each individual bitmap is small, and frequently used ones can be cached in memory

Bitmap Index

• Can be useful for stable columns with few values
• Bitmap:
  o String of bits: 0 (no match) or 1 (match)
  o One bit for each row
• Bitmap index record
  o Column value
  o Bitmap
  o DBMS converts bit position into row identifier.

Bitmap Index Example

Faculty Table

RowId  FacSSN       …  FacRank
1      098-55-1234     Asst
2      123-45-6789     Asst
3      456-89-1243     Assc
4      111-09-0245     Prof
5      931-99-2034     Asst
6      998-00-1245     Prof
7      287-44-3341     Assc
8      230-21-9432     Asst
9      321-44-5588     Prof
10     443-22-3356     Assc
11     559-87-3211     Prof
12     220-44-5688     Asst

Bitmap Index on FacRank

FacRank  Bitmap
Asst     110010010001
Assc     001000100100
Prof     000101001010
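The bitmaps in this example can be rebuilt from the FacRank column, and a predicate such as "FacRank = Assc OR FacRank = Prof" becomes a bitwise OR. The sketch keeps bitmaps as strings of '0' and '1' for readability; a real system would use packed machine words. The function names are assumptions.

```python
# FacRank column of the Faculty table, in RowId order 1..12.
fac_rank = ["Asst", "Asst", "Assc", "Prof", "Asst", "Prof",
            "Assc", "Asst", "Prof", "Assc", "Prof", "Asst"]

def build_bitmap_index(column):
    """One bitmap per distinct value, with one bit per row."""
    index = {}
    for pos, value in enumerate(column):
        bits = index.setdefault(value, ["0"] * len(column))
        bits[pos] = "1"
    return {value: "".join(bits) for value, bits in index.items()}

def bitmap_or(a, b):
    """Bitwise OR of two equal-length bitmaps."""
    return "".join("1" if x == "1" or y == "1" else "0" for x, y in zip(a, b))

def rows_matching(bitmap):
    """Convert bit positions into 1-based RowIds."""
    return [i + 1 for i, bit in enumerate(bitmap) if bit == "1"]

index = build_bitmap_index(fac_rank)
assc_or_prof = bitmap_or(index["Assc"], index["Prof"])
matching_rows = rows_matching(assc_or_prof)
```

The last step mirrors the point that the DBMS converts bit positions into row identifiers.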

Compressing Bitmaps: Run Length Encoding

• Bit vector for Assc: 0010000000010000100
• Runs: 2, 8, 4
• Can we do: 10 1000 100? Is this unambiguous?
• Fixed bits per run (max = 4)
  o 0010 1000 0100
• Variable bits per run
  o 1010 11101000 110100
  o 11101000 is broken into: 1110 (#bits), 1000 (value)
    • i.e., 4 bits are required, value is 8
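The variable-bits scheme can be sketched as follows, reading the prefix as a unary count: a run whose binary value needs j bits is written as j-1 ones, a zero, and then the j value bits. This interpretation is inferred from the example codes on the slide; the function names are assumptions.

```python
def runs_of(bits):
    """Lengths of the runs of 0s preceding each 1 (trailing 0s are dropped)."""
    runs, count = [], 0
    for b in bits:
        if b == "0":
            count += 1
        else:
            runs.append(count)
            count = 0
    return runs

def encode_run(i):
    value = format(i, "b")                        # run length in binary, j bits
    return "1" * (len(value) - 1) + "0" + value   # unary length prefix, then value

def encode(bits):
    return "".join(encode_run(i) for i in runs_of(bits))

def decode_runs(code):
    """Recover the run lengths from an encoded string."""
    runs, pos = [], 0
    while pos < len(code):
        j = 1
        while code[pos] == "1":     # count the 1s of the length prefix
            j += 1
            pos += 1
        pos += 1                    # skip the 0 that ends the prefix
        runs.append(int(code[pos:pos + j], 2))
        pos += j
    return runs

assc = "0010000000010000100"
encoded = encode(assc)
```

Because trailing zeros are not encoded, rebuilding the original bit vector also requires knowing its length (the number of rows).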

Operation-efficient Compression Methods

• Based on variations of Run Length Compression
• Uncompressed: 0000000000001111000000000 ...... 0000001000000001111111100000000 .... 0000100
• Compressed: 12, 0, 0, 0, 0, 1000, 8, 0, 0, 0, 0, 0, 0, 0, 0, 1000
• Store very short sequences as-is
• AND/OR/COUNT operations: can uncompress on the fly

Indexing High-Dimensional Data

• Typically, high-dimensional datasets are collections of points, not regions.
  o E.g., feature vectors in multimedia applications.
  o Very sparse
• Nearest neighbor queries are common.
  o R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
• As dimensionality increases, contrast (ratio of distances between nearest and farthest points) usually decreases; “nearest neighbor” is then not meaningful.
  o In any given data set, it is advisable to empirically test contrast.
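One way to test contrast empirically is sketched below. It assumes uniformly random points, a single random query point and Euclidean distance, and it reports the farthest-to-nearest distance ratio, so a value close to 1 means low contrast. On a real data set the same measurement would be made on the actual feature vectors.

```python
import math
import random

def contrast(dim, n_points=500, seed=0):
    """Farthest / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return max(dists) / min(dists)

# Compare the ratio for a low-, medium- and high-dimensional space.
ratios = {d: contrast(d) for d in (2, 10, 100)}
```

With this setup the ratio for 2 dimensions is typically much larger than for 100 dimensions, where the nearest and farthest points end up at similar distances from the query.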
