Feature Based Similarity
Christian Böhm, Bernhard Braunmüller, Florian Krebs, and Hans-Peter Kriegel, University of Munich
Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data

Feature Based Similarity

Simple Similarity Queries
Specify a query object and
• Find similar objects – range query
• Find the k most similar objects – nearest neighbor query
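
The two query types above can be sketched with a plain linear scan. This is only an illustration: the Euclidean distance and the names `range_query`, `knn_query`, and `points` are assumptions made for the example, not part of the slides.

```python
import heapq
import math

def range_query(points, query, eps):
    """Return all points within Euclidean distance eps of the query object."""
    return [p for p in points if math.dist(p, query) <= eps]

def knn_query(points, query, k):
    """Return the k points closest to the query object, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Hypothetical 2-d feature vectors.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
in_range = range_query(points, (0.0, 0.0), 1.5)
nearest = knn_query(points, (0.0, 0.0), 3)
```

A range query fixes the distance threshold and lets the result size vary; a nearest neighbor query fixes the result size and lets the distance vary.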

Join Applications: Catalogue Matching
Catalogue matching
• E.g. astronomical catalogues
[Figure: catalogues R and S]
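
As a baseline for what a similarity join between two catalogues computes, here is a minimal nested-loop sketch. It assumes Euclidean distance and a threshold `eps`; the function name `similarity_join` and the sample coordinates are made up for the example, and this quadratic scan is not the algorithm presented in the talk.

```python
import math

def similarity_join(R, S, eps):
    """Nested-loop similarity join: all pairs (r, s) with distance <= eps."""
    return [(r, s) for r in R for s in S if math.dist(r, s) <= eps]

# Two hypothetical catalogues with slightly different positions per object.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.1, 0.0), (5.0, 5.2), (9.0, 9.0)]
matches = similarity_join(R, S, eps=0.25)
```

Every pair is compared once, so the cost grows with |R| · |S|, which is what grid-based approaches try to avoid.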

Join Applications: Clustering
Clustering (e.g. DBSCAN)
Similarity self-join
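
One way a similarity self-join can feed a density-based clustering step is sketched below: the join result gives each point's neighbor count, from which DBSCAN-style core points follow. The names `self_join`, `core_points`, and `min_pts` are assumptions for illustration, and the brute-force join stands in for any faster join algorithm.

```python
import math

def self_join(points, eps):
    """Similarity self-join: index pairs (i, j), i < j, with distance <= eps."""
    n = len(points)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(points[i], points[j]) <= eps]

def core_points(points, eps, min_pts):
    """Indices of points whose eps-neighborhood holds at least min_pts points."""
    counts = [1] * len(points)  # every point lies in its own neighborhood
    for i, j in self_join(points, eps):
        counts[i] += 1
        counts[j] += 1
    return [i for i, c in enumerate(counts) if c >= min_pts]

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
cores = core_points(points, eps=0.2, min_pts=3)
```

Here the three points near the origin are mutual neighbors, while the isolated point at (5, 5) is not a core point.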

Grid partitioning
General idea: Grid approximation where grid line distance = ε
Similar idea in the ε-kdB-tree [Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]
Disadvantage of any grid approach: Number of neighboring grid cells: 3^d − 1
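
The neighbor count can be checked with a small sketch. It assumes a grid anchored at the origin with cell width `eps`; `grid_cell` and `neighbor_cells` are hypothetical helper names.

```python
import itertools
import math

def grid_cell(point, eps):
    """Integer cell coordinates of a point on a grid with line distance eps."""
    return tuple(math.floor(x / eps) for x in point)

def neighbor_cells(cell):
    """All cells adjacent to `cell` (sharing a face, edge, or corner)."""
    offsets = itertools.product((-1, 0, 1), repeat=len(cell))
    return [tuple(c + o for c, o in zip(cell, off))
            for off in offsets if any(off)]  # skip the all-zero offset (the cell itself)

cell = grid_cell((0.25, 0.9), eps=0.5)
count_3d = len(neighbor_cells((0, 0, 0)))
count_8d = len(neighbor_cells((0,) * 8))
```

Each dimension contributes three choices (one cell down, same cell, one cell up), and removing the cell itself leaves 3^d − 1 neighbors: 26 for d = 3 and already 6560 for d = 8.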

Scalability of the ε-kdB-tree
Assumption: 2 adjacent ε-stripes fit in main memory. Unrealistic for large data sets which are ...
• clustered,
• skewed and
• high-dimensional

Epsilon Grid Order

ε-Grid-Order Is a Total Strict Order
Strict Order:
• Irreflexivity
• Transitivity
• Asymmetry
ε-grid-order can be used in any sorting algorithm
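
The slide defining the order did not survive extraction, so the sketch below assumes the usual reading: points are compared lexicographically by their grid-cell coordinates, with cell width ε and the grid anchored at the origin. The names `cell`, `ego_less`, and `ego_sorted` are hypothetical.

```python
import math
from functools import cmp_to_key

def cell(p, eps):
    """Grid-cell coordinates of point p for cell width eps."""
    return tuple(math.floor(x / eps) for x in p)

def ego_less(p, q, eps):
    """True if p precedes q: the first dimension whose cells differ decides."""
    for a, b in zip(cell(p, eps), cell(q, eps)):
        if a != b:
            return a < b
    return False  # same cell: neither point precedes the other

def ego_sorted(points, eps):
    """Sort points with a standard sorting routine driven by ego_less."""
    def cmp(p, q):
        if ego_less(p, q, eps):
            return -1
        return 1 if ego_less(q, p, eps) else 0
    return sorted(points, key=cmp_to_key(cmp))

points = [(1.2, 0.1), (0.3, 0.9), (0.4, 0.2), (0.1, 0.6)]
ordered = ego_sorted(points, eps=0.5)
```

Because the comparison is irreflexive, asymmetric, and transitive, it can be plugged into an ordinary sort, including an external sort for files that do not fit in memory.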

ε-Interval
Coarse approximation of join mates: Used for I/O processing
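
A sketch of the idea, assuming the ε-interval of a point p is the part of the sorted file between p shifted by −ε and by +ε in every dimension. Any join mate q differs from p by at most ε per coordinate, so its cell is at most one step away in each dimension and must fall inside that range. The helper names and the random test data are made up for the example.

```python
import bisect
import math
import random

def cell(p, eps):
    """Grid-cell coordinates of point p for cell width eps."""
    return tuple(math.floor(x / eps) for x in p)

def eps_interval(sorted_points, keys, p, eps):
    """Slice of the sorted file that can contain join mates of p."""
    c = cell(p, eps)
    lo = bisect.bisect_left(keys, tuple(x - 1 for x in c))   # cell of p - [eps, ..., eps]
    hi = bisect.bisect_right(keys, tuple(x + 1 for x in c))  # cell of p + [eps, ..., eps]
    return sorted_points[lo:hi]

random.seed(0)
eps = 0.1
points = sorted(((random.random(), random.random()) for _ in range(200)),
                key=lambda q: cell(q, eps))
keys = [cell(q, eps) for q in points]  # sorted cell keys, one per point

p = points[100]
candidates = eps_interval(points, keys, p, eps)
mates = [q for q in candidates if math.dist(p, q) <= eps]
```

The interval is only a coarse filter: it may contain many points that are not join mates, but no join mate lies outside it, which is what makes it usable for deciding which parts of the file to load.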

I/O Processing for the Self Join
Decompose the sorted file into I/O units

Epsilon Grid Order

CPU Processing
• I/O units are further decomposed before joining
• Simple divide-and-conquer: no further sorting
• Decomposition: maximize active dimensions

CPU Processing
Point distance computations: Order of dimensions
• Neighboring inactive dimensions
• Unspecified dimensions
• Active dimension
• Aligned inactive dimensions
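
Why the order of dimensions matters can be seen in a distance test that stops early. The sketch below is an assumption about how such a test might look, not the authors' code: `within_eps` and `dim_order` are hypothetical names, and the caller supplies whatever dimension order it considers best.

```python
def within_eps(p, q, eps, dim_order):
    """Check dist(p, q) <= eps, visiting dimensions in dim_order and
    stopping as soon as the partial squared distance exceeds eps**2."""
    limit = eps * eps
    total = 0.0
    for i in dim_order:
        diff = p[i] - q[i]
        total += diff * diff
        if total > limit:
            return False  # remaining dimensions cannot bring the sum back down
    return True

p = (0.0, 0.0, 0.0)
far = (0.1, 0.1, 5.0)
near = (0.1, 0.1, 0.1)
far_ok = within_eps(p, far, 1.0, dim_order=[2, 0, 1])
near_ok = within_eps(p, near, 1.0, dim_order=[2, 0, 1])
```

For the pair `p`, `far`, starting with dimension 2 rejects the pair after a single coordinate, whereas the order [0, 1, 2] would evaluate all three. Putting the dimensions most likely to show a large difference first reduces the average work per point pair.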

Experimental Results
8-dimensional uniformly distributed vectors

Experimental Results (2)
16-dimensional feature vectors from a CAD application

Conclusions
Summary
• High potential for performance gains of the similarity join by page capacity optimization
• Necessary to separately optimize I/O and CPU
Future research potential
• Similarity join for metric index structures
• Approximate similarity join
• Parallel similarity join algorithms