Christian Böhm¹, Florian Krebs², and Hans-Peter Kriegel²
¹ University for Health Informatics and Technology, Innsbruck
² University of Munich
Optimal Dimension Order: A Generic Technique for the Similarity Join
Feature Based Similarity
Simple Similarity Queries
Specify a query object and
• find similar objects: range query
• find the k most similar objects: nearest-neighbor query
Join Applications: Catalogue Matching
Catalogue matching
• e.g. astronomy catalogues
[Figure: two catalogues R and S]
Join Applications: Clustering
Clustering (e.g. DBSCAN)
Similarity self-join
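As a point of reference, the similarity self-join reports all pairs of points whose distance does not exceed a threshold ε. The following is a minimal nested-loop sketch, not one of the algorithms from the talk; the function names `sq_dist` and `self_join_count` and the row-by-row array layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Squared Euclidean distance between two d-dimensional vectors. */
static double sq_dist(const double *p, const double *q, size_t d) {
    double s = 0.0;
    for (size_t i = 0; i < d; i++) {
        double diff = p[i] - q[i];
        s += diff * diff;
    }
    return s;
}

/* Counts the unordered pairs (i, j), i < j, with dist <= eps.
 * pts holds n vectors of d doubles each, stored row by row
 * (hypothetical layout chosen for this sketch). */
size_t self_join_count(const double *pts, size_t n, size_t d, double eps) {
    assert(eps >= 0.0);
    size_t pairs = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (sq_dist(pts + i * d, pts + j * d, d) <= eps * eps)
                pairs++;
    return pairs;
}
```

Comparing squared distances avoids the square root. The quadratic number of distance tests is what the partition-based join algorithms on the following slides try to reduce.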
R-Tree Similarity Join
Depth-first traversal of two trees
[Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993]
[Figure: R-trees over R and S]
The ε-kdB-Tree
[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]
Assumption: 2 adjacent ε-stripes fit in main memory. Unrealistic for large data sets which are
• clustered,
• skewed, and
• high-dimensional
Epsilon Grid Order
[Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order. SIGMOD Conf. 2001]
Common Properties
Decomposition of data/space into regions
Regions described by hyper-rectangles
for each pair (P,Q) of partitions having dist (P,Q) ≤ ε
    for each pair of points (p,q) on (P,Q)
        test dist (p,q) ≤ ε ;
Most CPU effort goes into the distance test between vectors.
Idea: speed up the distance test
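The partition filter in the loop above can be sketched as follows, assuming that dist(P,Q) denotes the minimum Euclidean distance between the two bounding hyper-rectangles. The names `sq_mindist` and `partitions_may_join` and the lower/upper bound arrays are hypothetical and not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Squared minimum distance between two axis-parallel hyper-rectangles,
 * each given by its lower and upper bound in every dimension. */
double sq_mindist(const double *p_lo, const double *p_hi,
                  const double *q_lo, const double *q_hi, size_t d) {
    double s = 0.0;
    for (size_t i = 0; i < d; i++) {
        assert(p_lo[i] <= p_hi[i] && q_lo[i] <= q_hi[i]);
        double gap = 0.0;              /* projections overlap: no gap */
        if (p_hi[i] < q_lo[i])
            gap = q_lo[i] - p_hi[i];   /* P lies entirely below Q */
        else if (q_hi[i] < p_lo[i])
            gap = p_lo[i] - q_hi[i];   /* Q lies entirely below P */
        s += gap * gap;
    }
    return s;
}

/* Returns 1 if the partition pair can still contain a join pair. */
int partitions_may_join(const double *p_lo, const double *p_hi,
                        const double *q_lo, const double *q_hi,
                        size_t d, double eps) {
    return sq_mindist(p_lo, p_hi, q_lo, q_hi, d) <= eps * eps;
}
```

Only partition pairs that pass this test reach the inner loop over point pairs, which is where the bulk of the CPU time is spent.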
Related Work: Plane Sweep for Polygons
[Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]
Observations:
• More efficient to use the x-axis as sweep direction.
• Projection of the polygons onto the y-axis yields high overlap.
• Decide by the projections of the bounding boxes (integrate a pdf).
Feature Vectors in the Similarity Join
Distance computation between feature vectors p, q:
    dist2 = 0 ;
    for (i = 0 ; i < d ; i++) {
        dist2 = dist2 + (p[i] - q[i]) * (p[i] - q[i]) ;
        if (dist2 > eps * eps)
            break ;
    }
Order dimensions by mating probability (increasing)
[Figure: axes d0 and d1]
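To illustrate why the order of the dimensions matters, here is a sketch of the early-abort distance test that visits the dimensions through a permutation. The function name `within_eps_ordered`, the `order` array, and the `tested` counter are assumptions introduced for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if dist(p, q) <= eps. The dimensions are visited in the
 * sequence order[0..d-1] (a permutation of 0..d-1), and the loop stops
 * as soon as the partial squared distance exceeds eps^2.
 * If tested is non-NULL, it receives the number of dimensions
 * that were actually evaluated. */
int within_eps_ordered(const double *p, const double *q,
                       const size_t *order, size_t d, double eps,
                       size_t *tested) {
    assert(eps >= 0.0);
    double dist2 = 0.0;
    size_t k = 0;
    int ok = 1;
    while (k < d) {
        double diff = p[order[k]] - q[order[k]];
        dist2 += diff * diff;
        k++;
        if (dist2 > eps * eps) {   /* early abort */
            ok = 0;
            break;
        }
    }
    if (tested != NULL)
        *tested = k;
    return ok;
}
```

For p = (0, 0, 0), q = (0.1, 0.1, 9) and ε = 1, the natural order 0, 1, 2 evaluates all three dimensions before rejecting the pair, whereas the order 2, 0, 1 rejects it after a single dimension.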
Computation of the Mating Probability
To determine the mating probability for di:
• Project the bounding boxes onto the di-axis
• Consider the two projections in a 2-dimensional space with axes d0[P] and d0[Q]
• The d0-projection of each point pair is located in this event space
• Mating point pairs lie on the ε-stripe between the lines y = x − ε and y = x + ε
[Figure: event space spanned by d0[P] and d0[Q], crossed by the ε-stripe; the part of the event space on the stripe is labelled "Mating Probability for d0"]
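A possible way to compute the mating probability of one dimension is sketched below. It assumes that the projections of the points are uniformly distributed within the projected bounding boxes, so that the probability is the area of the event space covered by the ε-stripe divided by the total area of the event space. The sketch uses an inclusion-exclusion closed form instead of a case distinction over shapes; all function names are hypothetical.

```c
#include <assert.h>

/* Half of the squared positive part: H(z) = max(z, 0)^2 / 2. */
static double half_sq_pos(double z) {
    return z > 0.0 ? 0.5 * z * z : 0.0;
}

/* Area of {(x, y) in [a,b] x [c,e] : x - y <= t}, obtained by
 * inclusion-exclusion over the four corners of the rectangle. */
static double area_below(double a, double b, double c, double e, double t) {
    return half_sq_pos(t - (a - e)) - half_sq_pos(t - (b - e))
         - half_sq_pos(t - (a - c)) + half_sq_pos(t - (b - c));
}

/* Fraction of the event space [p_lo,p_hi] x [q_lo,q_hi] that lies on
 * the stripe |x - y| <= eps. Assumes uniformly distributed projections
 * and projections of positive extent. */
double mating_probability(double p_lo, double p_hi,
                          double q_lo, double q_hi, double eps) {
    assert(p_lo < p_hi && q_lo < q_hi && eps >= 0.0);
    double total = (p_hi - p_lo) * (q_hi - q_lo);
    double stripe = area_below(p_lo, p_hi, q_lo, q_hi, eps)
                  - area_below(p_lo, p_hi, q_lo, q_hi, -eps);
    return stripe / total;
}
```

For two unit intervals [0,1] and ε = 0.5 the stripe leaves out two corner triangles of area 0.125 each, so the function returns 0.75. For projections that are more than ε apart it returns 0.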
Optimal Dimension Order
For a given pair (P,Q) of partitions, the optimal dimension order (ODO) is the sequence of dimensions with increasing mating probability.
Algorithm:
for each pair (P,Q) of partitions having dist (P,Q) ≤ ε
    determine ODO ;
    for each pair of points (p,q) on (P,Q)
        test dist (p,q) ≤ ε using ODO ;
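The step "determine ODO" amounts to sorting the dimension indices by their mating probability. The sketch below assumes that the per-dimension probabilities for the partition pair have already been computed and are passed in as an array; the name `optimal_dimension_order` and the choice of insertion sort are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Fills order[0..d-1] with the dimension indices sorted by increasing
 * mating probability prob[i]. A stable insertion sort is used here,
 * which is simple and adequate for a small number of dimensions. */
void optimal_dimension_order(const double *prob, size_t d, size_t *order) {
    assert(prob != NULL && order != NULL);
    for (size_t i = 0; i < d; i++) {
        size_t j = i;
        /* shift dimensions with a larger probability one slot back */
        while (j > 0 && prob[order[j - 1]] > prob[i]) {
            order[j] = order[j - 1];
            j--;
        }
        order[j] = i;
    }
}
```

The order is computed once per partition pair and then reused for every point pair of that partition pair, so its cost is amortized over many distance tests.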
Shape of the Intersection Area
20 different shapes are possible, e.g.
[Figure: three example shapes with the codes 1223, 2233, 2223]
Easy proof of completeness and efficient case distinction by assigning codes to the corners:
• 1: Corner is left of or above the ε-stripe
• 2: Corner is on the ε-stripe
• 3: Corner is right of or below the ε-stripe
Easy formulas (only 45° and 90° angles)
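The corner codes can be sketched as follows, assuming that the event space has the P-projection on the horizontal axis (x) and the Q-projection on the vertical axis (y), and that the ε-stripe is the region |x − y| ≤ ε. The order in which the four corner codes are combined into one number (upper-left, lower-left, upper-right, lower-right) is an assumption of this sketch and may differ from the convention used on the slide.

```c
#include <assert.h>

/* Code of one corner (x, y) of the event space relative to the stripe
 * |x - y| <= eps:
 * 1 = left of or above the stripe, 2 = on the stripe,
 * 3 = right of or below the stripe. */
int corner_code(double x, double y, double eps) {
    assert(eps >= 0.0);
    if (x - y < -eps)
        return 1;
    if (x - y > eps)
        return 3;
    return 2;
}

/* Four-digit shape code of the event space [p_lo,p_hi] x [q_lo,q_hi].
 * Assumed corner order: upper-left, lower-left, upper-right,
 * lower-right (hypothetical convention). */
int shape_code(double p_lo, double p_hi,
               double q_lo, double q_hi, double eps) {
    return 1000 * corner_code(p_lo, q_hi, eps)
         +  100 * corner_code(p_lo, q_lo, eps)
         +   10 * corner_code(p_hi, q_hi, eps)
         +        corner_code(p_hi, q_lo, eps);
}
```

With this convention, two unit intervals [0,1] and ε = 0.5 yield the code 1223. A `switch` over the shape code could then select the area formula for each case.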
Experimental Evaluation: R-tree Sim. Join
8-dimensional data, uniformly distributed
[Figure: bar chart, axis from 0% to 700%; bars: base technique, ODO-algorithm, SDO dimension 1 to SDO dimension 8]
Experimental Evaluation: R-tree Sim. Join
16-dimensional data, from CAD-similarity search
Experimental Evaluation: Scalability
[Figure: two panels: MuX, uniform data; Z-RSJ, uniform data]
Experimental Evaluation: Scalability
[Figure: EGO, CAD data]
Conclusion
• Similarity join is an important database primitive for knowledge discovery in databases
• Many different basic algorithms
• Most can be accelerated by our optimal dimension order
Future Work:
• New applications of the similarity join
• Further optimization (multi-parameter) of the similarity join
• Parallel and distributed environments