Christian Böhm & Hans-Peter Kriegel, Ludwig Maximilians Universität München

A Cost Model and Index Architecture for the Similarity Join

Page 2: Feature Based Similarity

Page 3: Simple Similarity Queries

Specify a query object and
• Find similar objects – range query
• Find the k most similar objects – nearest neighbor query
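The two query types can be sketched in a few lines. This is a minimal illustration that assumes Euclidean distance on feature vectors and a linear scan instead of an index; the function names and the sample data are hypothetical and not taken from the slides.

```python
import heapq
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def range_query(points, query, eps):
    """Return all points within distance eps of the query object."""
    return [p for p in points if euclidean(p, query) <= eps]


def knn_query(points, query, k):
    """Return the k points closest to the query object, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: euclidean(p, query))


# Hypothetical 2D feature vectors.
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
in_range = range_query(features, (0.0, 0.0), 1.5)
nearest = knn_query(features, (0.0, 0.0), 3)
```

Both queries take a single query object as input, which is what distinguishes them from the join operations on the following pages.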

Page 4: Join Applications: Catalogue Matching

Catalogue matching
• E.g. astronomical catalogues

[Figure: two catalogues R and S]

Page 5: Join Applications: Clustering

Clustering (e.g. DBSCAN)

Similarity self-join
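As a rough illustration of why clustering maps to a similarity self-join: the ε-neighborhood sizes that DBSCAN needs for its core-point test can be read off the join result. The sketch below uses a brute-force self-join; the function names, the parameter `min_pts`, and the sample data are assumptions made for this example.

```python
import math
from collections import defaultdict


def similarity_self_join(points, eps):
    """Return index pairs (i, j), i < j, whose points lie within eps."""
    pairs = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                pairs.append((i, j))
    return pairs


def core_points(points, eps, min_pts):
    """Indices of points with at least min_pts neighbors (self included)."""
    neighbors = defaultdict(lambda: 1)  # every point is its own neighbor
    for i, j in similarity_self_join(points, eps):
        neighbors[i] += 1
        neighbors[j] += 1
    return sorted(i for i in range(len(points)) if neighbors[i] >= min_pts)


# Hypothetical data: three nearby points and one outlier.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
pairs = similarity_self_join(data, 1.0)
cores = core_points(data, 1.0, 3)
```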

Page 6: R-tree Spatial Join (RSJ)

procedure r_tree_sim_join (R, S, ε)
  if IsDirpg (R) ∧ IsDirpg (S) then
    foreach r ∈ R.children do
      foreach s ∈ S.children do
        if mindist (r,s) ≤ ε then
          CacheLoad(r); CacheLoad(s);
          r_tree_sim_join (r,s,ε) ;
  else (* assume R,S both DataPg *)
    foreach p ∈ R.points do
      foreach q ∈ S.points do
        if |p − q| ≤ ε then report (p,q);

[Figure: the two join inputs R and S]
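The recursion can be tried out on a tiny in-memory tree. The following is a sketch under simplifying assumptions: the `Node` class, the `data_page` and `dir_page` helpers, and the sample points are hypothetical, both trees are assumed to have the same height, and the `CacheLoad` calls are omitted because nothing is read from disk here.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """R-tree page: a directory page has children, a data page has points."""
    lo: tuple
    hi: tuple
    children: list = field(default_factory=list)
    points: list = field(default_factory=list)

    def is_dir(self):
        return bool(self.children)


def mindist(r, s):
    """Minimum Euclidean distance between the bounding boxes of r and s."""
    gaps = (max(rl - sh, sl - rh, 0.0)
            for rl, rh, sl, sh in zip(r.lo, r.hi, s.lo, s.hi))
    return math.sqrt(sum(g * g for g in gaps))


def r_tree_sim_join(r, s, eps, out):
    """Append every pair (p, q) with distance <= eps to out."""
    if r.is_dir() and s.is_dir():
        for rc in r.children:
            for sc in s.children:
                if mindist(rc, sc) <= eps:  # prune page pairs that are too far apart
                    r_tree_sim_join(rc, sc, eps, out)
    else:  # assume both are data pages
        for p in r.points:
            for q in s.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))


def data_page(points):
    """Data page whose box tightly encloses the given points."""
    lo = tuple(map(min, zip(*points)))
    hi = tuple(map(max, zip(*points)))
    return Node(lo, hi, points=list(points))


def dir_page(children):
    """Directory page whose box encloses all child boxes."""
    lo = tuple(map(min, zip(*(c.lo for c in children))))
    hi = tuple(map(max, zip(*(c.hi for c in children))))
    return Node(lo, hi, children=list(children))


R = dir_page([data_page([(0.0, 0.0), (1.0, 1.0)]),
              data_page([(10.0, 10.0), (11.0, 11.0)])])
S = dir_page([data_page([(1.5, 1.0), (2.0, 2.0)]),
              data_page([(20.0, 20.0), (21.0, 21.0)])])
result = []
r_tree_sim_join(R, S, 0.6, result)
```

In this example only one of the four data page pairs survives the `mindist` test, so only four point distances are computed instead of sixteen.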

Page 7: Cost Modeling

Single similarity queries: the access probability of a page is modeled using the concept of the Minkowski sum

Page 8: Cost Modeling

Binomial formula:
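The formula itself is not part of this transcript. As an assumption about its general shape, the sketch below uses the standard closed form for the volume of the Minkowski sum of a d-dimensional hypercube with side length a and a sphere of radius ε, which is a sum over k = 0..d of binomial(d, k) · a^(d−k) · V_k(ε), where V_k(ε) is the volume of a k-dimensional sphere. The function names are hypothetical, and the exact formula on the slide may differ.

```python
import math


def sphere_volume(k, radius):
    """Volume of a k-dimensional ball; equals 1 for k = 0."""
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1) * radius ** k


def minkowski_sum_volume(d, side, eps):
    """Volume of a d-dimensional hypercube enlarged by a ball of radius eps."""
    return sum(math.comb(d, k) * side ** (d - k) * sphere_volume(k, eps)
               for k in range(d + 1))


def access_probability(d, side, eps):
    """Access probability of a page in a unit data space.

    Assumes uniformly distributed query points and ignores boundary effects.
    """
    return min(1.0, minkowski_sum_volume(d, side, eps))


# 2D check: area + perimeter * eps + pi * eps^2 for a unit square, eps = 0.5.
volume_2d = minkowski_sum_volume(2, 1.0, 0.5)
```

In two dimensions the sum reduces to the area of the square, plus a strip of width ε along its perimeter, plus the four quarter circles at the corners.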

Page 9: Cost Modeling

Mating probability of index pages:
• Probability that the distance between two pages is at most ε
• Two-fold application of the Minkowski sum
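One way to read the two-fold application, stated here as an assumption rather than as the authors' exact derivation: first the two page regions are added (two hypercubes of side a give a hypercube of side 2a), then the result is enlarged by a sphere of radius ε. The sketch assumes hypercube-shaped pages of equal size, uniformly distributed page positions in a unit data space, and no boundary effects; all names are hypothetical.

```python
import math


def sphere_volume(k, radius):
    """Volume of a k-dimensional ball; equals 1 for k = 0."""
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1) * radius ** k


def cube_sphere_sum_volume(d, side, eps):
    """Volume of the Minkowski sum of a hypercube and a ball of radius eps."""
    return sum(math.comb(d, k) * side ** (d - k) * sphere_volume(k, eps)
               for k in range(d + 1))


def mating_probability(d, side, eps):
    """Estimated probability that two pages of the given side lie within eps.

    Step 1: cube + cube gives a cube of side 2 * side.
    Step 2: that cube + ball of radius eps gives the mating region.
    """
    return min(1.0, cube_sphere_sum_volume(d, 2 * side, eps))


prob_2d = mating_probability(2, 0.1, 0.05)
```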

Page 10: Page Capacity Optimization

The cost model can determine the index selectivity, which depends on various parameters

Page capacity (number of stored points) is an important parameter

Known from similarity search: page capacity optimization yields a considerable improvement

Page 11: Analysis of the Index Overhead

Assuming 100% selectivity (the index doesn't work): how much more expensive is index usage?

CPU:
• Distance between boxes is more expensive to compute than distance between points
• Smaller capacity means more box distance computations
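Both CPU effects can be made concrete with a small sketch. It assumes axis-aligned boxes given by their lower and upper corners, and it counts only data page pairs for two inputs of equal size, with every page pair being compared (100% selectivity); the function names are hypothetical.

```python
import math


def point_distance(p, q):
    """Euclidean distance: one subtraction per dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def box_mindist(lo1, hi1, lo2, hi2):
    """Minimum distance between two boxes: two subtractions and a
    three-way maximum per dimension, so more work than for points."""
    total = 0.0
    for l1, h1, l2, h2 in zip(lo1, hi1, lo2, hi2):
        gap = max(l1 - h2, l2 - h1, 0.0)
        total += gap * gap
    return math.sqrt(total)


def box_distance_computations(n_points, capacity):
    """Number of data page pairs when every page pair is compared."""
    pages = math.ceil(n_points / capacity)
    return pages * pages


small_pages = box_distance_computations(1000, 10)
large_pages = box_distance_computations(1000, 100)
```

Reducing the capacity by a factor of 10 increases the number of box distance computations by a factor of 100 in this simplified count.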

Page 12: Analysis of the Index Overhead

Disk I/O:
• High constant cost per page access (moving the disk head)
• A page access is more expensive than continuously reading a point by a factor of 10000 / d
• Smaller capacity means more disk head movement
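The effect of the factor 10000 / d can be illustrated with a little arithmetic. In this sketch, cost is measured in units of "sequentially reading one point", d is taken to be the dimensionality of the data, and every page access is assumed to pay the constant head-movement cost once before reading the page contents. The function names and the example figures (100,000 points, d = 16) are assumptions for illustration only.

```python
def page_access_factor(d):
    """Constant cost of one page access, in units of one sequential point read."""
    return 10000 / d


def io_cost(num_points, capacity, d):
    """Cost of reading all points page by page with random page accesses."""
    num_pages = -(-num_points // capacity)  # ceiling division
    return num_pages * page_access_factor(d) + num_points


sequential_scan = 100_000                   # no head movement at all
cost_small = io_cost(100_000, 100, 16)      # 1,000 pages
cost_large = io_cost(100_000, 10_000, 16)   # 10 pages
```

With these assumed figures, pages of 100 points cost roughly seven times as much as a sequential scan, while pages of 10,000 points add only about 6%.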

Page 13: Analysis of the Index Overhead

What selectivity is needed for the index to pay off?

Page 14: Optimization

I/O cost function:

is optimized by

CPU cost function:

is optimized by:

Page 15: Optimization

I/O cost:
• Large capacity optimum (typically several tens of thousands of points)

CPU cost:
• Small capacity optimum (typically < 100 points)

• No compromise achievable

Page 16: Multipage Index (MuX)

CPU performance like that of a CPU-optimized index

I/O performance like that of an I/O-optimized index

Separate optimization
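The idea of separate optimization can be sketched as a two-level layout: large pages serve as the unit of disk transfer, and each page is subdivided into small buckets that serve as the unit of CPU filtering. The code below is only an illustration of that idea under simplifying assumptions (in-memory lists, points grouped by lexicographic sort order, bounding boxes recomputed on the fly); the names `build_pages`, `mux_join`, `page_cap`, and `bucket_cap` are hypothetical and do not describe the authors' actual data structure.

```python
import math


def bbox(points):
    """Lower and upper corner of the bounding box of a point list."""
    return tuple(map(min, zip(*points))), tuple(map(max, zip(*points)))


def mindist(a, b):
    """Minimum distance between two boxes given as (lo, hi) corner pairs."""
    (alo, ahi), (blo, bhi) = a, b
    return math.sqrt(sum(max(l1 - h2, l2 - h1, 0.0) ** 2
                         for l1, h1, l2, h2 in zip(alo, ahi, blo, bhi)))


def build_pages(points, page_cap, bucket_cap):
    """Group points into large pages, each split into small buckets."""
    pts = sorted(points)
    pages = []
    for i in range(0, len(pts), page_cap):
        chunk = pts[i:i + page_cap]
        pages.append([chunk[j:j + bucket_cap]
                      for j in range(0, len(chunk), bucket_cap)])
    return pages


def mux_join(r_pages, s_pages, eps):
    """Join with page-level (I/O) and bucket-level (CPU) filtering."""
    result = []
    for rp in r_pages:
        r_box = bbox([p for bucket in rp for p in bucket])
        for sp in s_pages:
            s_box = bbox([q for bucket in sp for q in bucket])
            if mindist(r_box, s_box) > eps:
                continue  # page pair pruned: no transfer needed
            for rb in rp:
                for sb in sp:
                    if mindist(bbox(rb), bbox(sb)) > eps:
                        continue  # bucket pair pruned: no point comparisons
                    result += [(p, q) for p in rb for q in sb
                               if math.dist(p, q) <= eps]
    return result


# Hypothetical 2D data.
R = [(float(i), float(i % 3)) for i in range(12)]
S = [(i + 0.4, float(i % 2)) for i in range(12)]
joined = mux_join(build_pages(R, 6, 2), build_pages(S, 6, 2), 1.0)
```

Because both filter steps only discard pairs whose boxes are farther apart than ε, the result is the same as that of a brute-force comparison of all point pairs.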

Page 17: Experimental Evaluation

[Figures: Uniform 4D; Uniform 8D]

Page 18: Experimental Evaluation

[Figures: CAD Data 16D; Color Images 64D]

Page 19: Conclusions

Summary
• High potential for performance gains of the similarity join by page capacity optimization
• Necessary to separately optimize I/O and CPU

Future research potential
• Similarity join for metric index structures
• Approximate similarity join
• Parallel similarity join algorithms

Page 20: Consequences

Assumption for I/O optimization: with respect to selectivity, pages are accessed in a nested-block-loop-like style:

fill cache with pages of R (1 page free) ;
foreach S-page s do
  if s joins some of the cached R-pg then
    load (s) ;
    foreach joining R-page r in cache do
      if mindist (r,s) ≤ ε then join (r,s) ;
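The page access count of such a scheme can be estimated with a short calculation. The sketch assumes that the cache is refilled with the next block of R-pages until R is exhausted (an outer loop that the slide does not show), that one cache page is reserved for the current S-page, and that every S-page joins some cached R-page (the worst case). The function name and the example sizes are hypothetical.

```python
import math


def nested_block_loop_accesses(r_pages, s_pages, cache_pages):
    """Worst-case number of page accesses for a nested block loop join.

    Each R-page is loaded once; all S-pages are loaded once per R-block.
    """
    block_size = cache_pages - 1  # one page stays free for the S-page
    if block_size < 1:
        raise ValueError("cache must hold at least two pages")
    blocks = math.ceil(r_pages / block_size)
    return r_pages + blocks * s_pages


small_cache = nested_block_loop_accesses(100, 50, 11)
large_cache = nested_block_loop_accesses(100, 50, 101)
```

With a cache that holds all of R plus one extra page, each page of both inputs is read exactly once.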
