The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

48
The A-tree: An Index Structure for High- dimensional Spaces Using Relative Approximation Yasushi Sakurai (NTT Cyber Space Laboratories) Masatoshi Yoshikawa (Nara Institute of Science and Technology) Shunsuke Uemura (Nara Institute of Science and Technology) Haruhiko Kojima (NTT Cyber Solutions Laboratories)

description

The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation. Yasushi Sakurai (NTT Cyber Space Laboratories) Masatoshi Yoshikawa (Nara Institute of Science and Technology) Shunsuke Uemura (Nara Institute of Science and Technology) - PowerPoint PPT Presentation

Transcript of The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Page 1: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Yasushi Sakurai (NTT Cyber Space Laboratories)

Masatoshi Yoshikawa (Nara Institute of Science and Technology)

Shunsuke Uemura (Nara Institute of Science and Technology)

Haruhiko Kojima (NTT Cyber Solutions Laboratories)

Page 2: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Introduction Demand

– High-performance multimedia database systems– Content-based retrieval with high speed and

accuracy Multimedia databases

– Large size– Various features, high-dimensional data

More efficient spatial indices for high-dimensional data

Page 3: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Our Approach VA-File and SR-tree are excellent search

methods for high-dimensional data. Comparisons of them motivated the concept

of the A-tree.– No comparisons of them have been reported.– We performed experiments using various data

sets Approximation tree (A-tree)

– Relative approximation: MBRs and data objects are approximated based on their parent MBR.

– About 77% reduction in the number of page accesses compared with VA-File and SR-tree

Page 4: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Related Work (1)

R5R3 R4 R8R6 R7

R1 R2Non-leaf Node

Leaf Node

R-tree family– Tree structure using MBRs (Minimum Bounding

Rectangles) and/or MBSs (Minimum Bounding Spheres)

– SR-tree: • Structured by both MBRs and MBSs• Outperforms SS-tree and R*-tree for 16-dimensional data

R1R2R3

R4

R5 R6

R7

R8

Page 5: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Related Work (2) VA-File (Vector Approximation File)

– Use approximation file and vector file

1. Divide the entire data space into cells

2. Approximate vector data by using the cells, then create the approximation file

3. Select candidate vectors by scanning the approximation file

4. Access to the candidate vectors in the vector file– Better than X-tree and R*-tree beyond dimensionality of 6

11

10

01

00

11100100

0.6 0.8

0.9 0.1

10 11

11 00

Approximation Vector Data

Page 6: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Experimental Results and Analysis

Structure suitable for non-uniformly distributed data– Structure changes according to data distribution.

Large entry size for high-dimensional spaces– Large entries small fanout many node accesses

Changing node size and fanout– Larger node size does NOT lead to low IO cost. – Larger fanout always contributes to the reduction in node

accesses.

MBS contribution– The contribution of MBSs in node pruning is small in high-

dimensional spaces.

--- Properties of the SR-tree ---

Page 7: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Experimental Results and Analysis

Data skew degenerates search performance.– Absolute approximation: the approximation is

independent of data distribution.– Effective for uniformly distributed data– Unsuitable for non-uniformly distributed data

• A large amount of dense data tends to be approximated by the same value.

• Absolute approximation leads to large approximation errors.

--- Properties of the VA-File ---

Page 8: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree (Approximation tree) Ideas from the SR-tree and VA-File comparison:

– Tree structure• Tree structures are suitable for non-uniformly distributed

data.

– Relative approximation• MBRs and data objects are approximated based on their

parent bounding rectangle. • Small approximation error• Small entry size and large fanout low IO cost

– Partial usage of MBSs in high-dimensional searches• MBSs are not stored in the A-tree. • The centroid of data objects in a subtree is used only for

update.

Page 9: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

(28, 20)

(28, 4)(4, 4)

(4, 20)Rectangle A

Rectangle B

VBR C (22, 16)

(22, 10)(10, 10)

(10, 16)

(11, 15)

(11, 11) (21, 11)

(21, 15)

Virtual Bounding Rectangle (VBR) C approximates a rectangle B. C is calculated from rectangles A and B. Search using VBRs guarantees the same res

ult as that of MBRs.

Page 10: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

0 1 2 3 4 5 6 7

i-th dimensional coordinate axis

3 19

6 8

Edge of rectangle A

Edge of rectangle B

Subspace Code Subspace code represents a VBR. The edge of child MBR B is quantized in

relation to the edge of parent MBR A. The edge of B is approximated as a pair of 8-

ary codes (1, 2) or binary codes (001, 010).

Page 11: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Rectangle B

VBR C

Subspace Code C is the VBR of B in A C is represented by the subspace codes:

S = (010, 011, 101, 101)

010

011

101

101

Rectangle A

Page 12: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree Structure Relative approximation:

– MBRs and data objects in child nodes are approximated based on parent MBR.

Configuration– One node contains partial information of

rectangles in two consecutive generations.

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD1 CD2

CD3 CD4V2

V1

M1

M2V4

M4

V3M3

P2

P1C1

C2

R (Entire space)

Page 13: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree Structure

P1 and P2: data objects,

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD1 CD2

CD3 CD4V2

V1

M1

M2V4

M4

V3M3

P2

P1C1

C2

R (Entire space)

Page 14: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree Structure

P1 and P2: data objects, M1 -- M4: MBRs

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD1 CD2

CD3 CD4V2

V1

M1

M2V4

M4

V3M3

P2

P1C1

C2

R (Entire space)

Page 15: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree Structure

P1 and P2: data objects, M1 -- M4: MBRs

SC(V1) -- SC(V4): subspace codes of VBRs for the MBRs

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD1 CD2

CD3 CD4V2

V1

M1

M2V4

M4

V3M3

P2

P1C1

C2

R (Entire space)

Page 16: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree Structure

P1 and P2: data objects, M1 -- M4: MBRs

SC(V1) -- SC(V4): subspace codes of VBRs for the MBRs

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD1 CD2

CD3 CD4V2

V1

M1

M2V4

M4

V3M3

P2

P1C1

C2

R (Entire space)

Page 17: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree Structure

P1 and P2: data objects, M1 -- M4: MBRs

SC(V1) -- SC(V4): subspace codes of VBRs for the MBRs

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD1 CD2

CD3 CD4V2

V1

M1

M2V4

M4

V3M3

P2

P1C1

C2

R (Entire space)

Page 18: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree Structure

P1 and P2: data objects, M1 -- M4: MBRs

SC(V1) -- SC(V4): subspace codes of VBRs for the MBRs

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD1 CD2

CD3 CD4V2

V1

M1

M2V4

M4

V3M3

P2

P1C1

C2

R (Entire space)

Page 19: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree Structure

P1 and P2: data objects, M1 -- M4: MBRs

SC(V1) -- SC(V4): subspace codes of VBRs for the MBRs

SC(C1) and SC(C2): subspace codes of VBRs for the data objects

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD1 CD2

CD3 CD4V2

V1

M1

M2V4

M4

V3M3

P2

P1C1

C2

R (Entire space)

Page 20: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree Structure

P1 and P2: data objects, M1 -- M4: MBRs

SC(V1) -- SC(V4): subspace codes of VBRs for the MBRs

SC(C1) and SC(C2): subspace codes of VBRs for the data objects

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD1 CD2

CD3 CD4V2

V1

M1

M2V4

M4

V3M3

P2

P1C1

C2

R (Entire space)

Page 21: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree Structure

P1 and P2: data objects, M1 -- M4: MBRs

SC(V1) -- SC(V4): subspace codes of VBRs for the MBRs

SC(C1) and SC(C2): subspace codes of VBRs for the data objects

CD1 -- CD4: centroid of the data objects in the subtree

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD1 CD2

CD3 CD4V2

V1

M1

M2V4

M4

V3M3

P2

P1C1

C2

R (Entire space)

Page 22: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree Structure

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD1 CD2

CD3 CD4

Index nodes

Data nodes

Data nodes Index nodes

– leaf nodes– intermediate nodes– root node

Page 23: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree Structure

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD1 CD2

CD3 CD4

Index nodes

Data nodes

Data node– data objects– pointers to the data description records

Data node

Page 24: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree Structure

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD1 CD2

CD3 CD4

Index nodes

Data nodes

Leaf node– an MBR– a pointer to the data node– subspace codes of VBRs

Leaf nodes

Page 25: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree Structure

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD1 CD2

CD3 CD4

Index nodes

Data nodes

Intermediate node– an MBR– a list of entries

• a pointer to the child node• the subspace code of a VBR• the centroid of data objects in the subtree• the number of the data objects

Intermediate nodes

Page 26: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree Structure

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD1 CD2

CD3 CD4

Index nodes

Data nodes

Root node:– a list of entries

• a pointer to the child node• the subspace code of a VBR• the centroid of data objects in the subtree• the number of the data objects

Root node

Page 27: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Search Algorithm Basic ideas:

– VBRs are calculated from parent MBR and the subspace codes.

– Exception: the entire space is used in the root node.

– The algorithm uses calculated VBRs for pruning.

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

V2

V1M1

M2V4

M4

V3M3

P2

P1C1

C2Root node

R (Entire space)

Page 28: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Search Algorithm Calculate V1 and V2 from R, SC(V1) and SC(V2)

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

V2

V1M1

M2V4

M4

V3M3

P2

P1C1

C2

R (Entire space)Query point

Page 29: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Search Algorithm Calculate V1 and V2 from R, SC(V1) and SC(V2)

Calculate V3 and V4 from M1, SC(V3) and SC(V4)

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

V2

V1M1

M2V4

M4

V3M3

P2

P1C1

C2

R (Entire space)Query point

Page 30: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Search Algorithm Calculate V1 and V2 from R, SC(V1) and SC(V2)

Calculate V3 and V4 from M1, SC(V3) and SC(V4)

Calculate C1 and C2 from M3, SC(C1) and SC(C2)

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

V2

V1M1

M2V4

M4

V3M3

P2

P1C1

C2

R (Entire space)Query point

Page 31: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Search Algorithm Calculate V1 and V2 from R, SC(V1) and SC(V2)

Calculate V3 and V4 from M1, SC(V3) and SC(V4)

Calculate C1 and C2 from M3, SC(C1) and SC(C2)

Access to P1

M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

V2

V1M1

M2V4

M4

V3M3

P2

P1C1

C2

R (Entire space)Query point

Page 32: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Update Algorithm

Basic idea: – Based on the update algorithm of the SR-tree, but:– Needs to update subspace codes

CD3M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD4

CD1 CD2

P3

SC(C3)

Page 33: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Code Calculation

Parent MBR

VBRs

Page 34: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Code Calculation

Inserted point Parent MBR

VBRs

If parent MBR does not change, calculate the subspace code for the inserted data object.

Page 35: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Code Calculation

Inserted point Parent MBR

VBRs

Inserted point

If parent MBR does not change, calculate the subspace code for the inserted data object.

If parent MBR changes, calculate all subspace codes

Page 36: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Update data node and leaf node– Insert a new data object P3– Update M3

Update Algorithm

CD3M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD4

CD1 CD2

P3

SC(C3)

Page 37: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Update data node and leaf node– Insert a new data object P3– Update M3– If M3 does not change, calculate SC(C3).

Update Algorithm

CD3M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD4

CD1 CD2

P3

SC(C3)

Page 38: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Update data node and leaf node– Insert a new data object P3– Update M3– If M3 does not change, calculate SC(C3).– If M3 changes, calculate SC(C1), SC(C2) and SC

(C3).

Update Algorithm

CD3M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD4

CD1 CD2

P3

SC(C3)

Page 39: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Update intermediate node – If M3 changes, update M1.– If M3 changes but M1 does not change, calculate

SC(V3).– If M1 changes, calculate SC(V3), SC(V4).– Calculate CD3

Update Algorithm

CD3M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD4

CD1 CD2

P3

SC(C3)

Page 40: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Update root node – If M1 changes, calculate SC(V1)– Calculate CD1

Update Algorithm

CD3M1 SC(V3) SC(V4) M2

M4M3 SC(C1)

P1 P2

SC(V1) SC(V2)

SC(C2)

CD4

CD1 CD2

P3

SC(C3)

Page 41: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Performance Test Data sets: real data set (hue histogram image data),

uniformly distributed data set, cluster data set. Data size: 100,000 Dimension: varies from 4 to 64 Page size: 8 KB 20-nearest neighbor queries Evaluation is based on the average for 1,000

insertion or query points. CPU: 296 MHz Code length:

– The code length that gave the best performance was chosen.

– A-tree: code length varies from 4 to 12.– VA-File: code length varies from 4 to 8 according to [18].

Page 42: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Search Performance

A-tree gives significantly superior performance! 77% reduction in number of page accesses for

64-dimensional real data Relative approximation

– Small entry size and large fanout low IO cost

Real data Uniformly distributed data

Page 43: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Approximation error ε: error of the distance between p and Vi during a search

p: query point, S: the number of visited VBRs, Vi: visited VBRs, Mi : the MBRs corresponding to Vi

Optimum code length depends on dimensionality and data distribution

Influence of Code Length

S

i i

i

Mp

Vp

Srr

1 ,

,1,100)1(

Page 44: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

VA-File/A-tree Comparison

VA-File (absolute approximation)– approximated using the entire space edge length 2-l

A-tree (relative approximation)– approximated using parent MBR smaller VBR size,

fewer object accesses

Edge length of VBRs/cells Number of data object accesses

Page 45: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

CPU-time

CPU-time for real data– Similar to the SR-tree and outperforms the VA-File

VA-File– Calculates the approximated position coordinate for all objects

A-tree– Reducing node accesses leads to low CPU cost.

Page 46: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Insertion and Storage Cost

Increase in the insertion cost is modest. About 20% less storage cost for 64-dimensional data

(1) VBRs need only small storage volumes.

(2) The number of index nodes is extremely small.

Insertion cost Storage cost

Page 47: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

The A-tree offers excellent search performance for high-dimensional data– Relative approximation

• MBRs and data objects in child nodes are approximated based on parent MBR.

– About 77% reduction in the number of page accesses compared with VA-File and SR-tree

Future work– Cost model for finding optimum code length

Conclusions

Page 48: The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation

Contribution of MBSs for Pruning

SR-tree contains both MBRs and MBSs but:

the frequency of the usage of MBSs decreases as dimensionality increases.