Similarity Search in High Dimensions via Hashing
Aristides Gionis, Piotr Indyk, Rajeev Motwani
Presented by:
Fatih Uzun
Outline
• Introduction
• Problem Description
• Key Idea
• Experiments and Results
• Conclusions
Introduction
• Similarity Search over High-Dimensional Data
  – Image databases, document collections, etc.
• Curse of Dimensionality
  – All space-partitioning techniques degrade to linear search in high dimensions
• Exact vs. Approximate Answer
  – An approximate answer might be good enough and much faster
  – Time-quality trade-off
Problem Description
• $\epsilon$-Nearest Neighbor Search ($\epsilon$-NNS)
  – Given a set P of points in a normed space $l_p^d$, preprocess P so as to efficiently return a point $p \in P$ for any given query point q, such that
  – $\mathrm{dist}(q,p) \le (1 + \epsilon) \min_{r \in P} \mathrm{dist}(q,r)$
• Generalizes to K-nearest neighbor search (K > 1)
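The approximation condition can be checked directly against a brute-force scan. The following is a minimal sketch, assuming Euclidean distance and points given as tuples; the function name `is_eps_nearest` is hypothetical and not taken from the slides.

```python
import math

def is_eps_nearest(p, q, points, eps):
    """Return True if p satisfies dist(q, p) <= (1 + eps) * min_r dist(q, r).

    Assumption: Euclidean distance; the slides only require a normed space.
    """
    best = min(math.dist(q, r) for r in points)  # exact nearest-neighbor distance
    return math.dist(q, p) <= (1 + eps) * best

points = [(0.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
q = (1.0, 0.0)

# The exact nearest neighbor always qualifies, for any eps >= 0.
exact_ok = is_eps_nearest((0.0, 0.0), q, points, eps=0.0)
# (3, 0) is twice as far as the nearest point, so it needs eps >= 1.
loose_ok = is_eps_nearest((3.0, 0.0), q, points, eps=1.0)
tight_ok = is_eps_nearest((3.0, 0.0), q, points, eps=0.5)
```

A larger `eps` admits more points as acceptable answers, which is the time-quality trade-off the introduction mentions.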
Key Idea
• Locality Sensitive Hashing (LSH) to get sub-linear dependence on the data size for high-dimensional data
• Preprocessing
  – Hash each data point using several LSH functions so that the probability of collision is higher for closer objects
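To make "probability of collision is higher for closer objects" concrete, here is a minimal sketch under an assumption the slides do not state: points are binary vectors compared by Hamming distance, and each hash function samples one coordinate uniformly at random (bit sampling). The helper names are hypothetical.

```python
def hamming(a, b):
    """Number of coordinates on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def collision_probability(p, q, k=1):
    """Probability that k independently sampled coordinates all agree on p and q.

    Assumption: bit-sampling hash functions over binary vectors.
    """
    d = len(p)
    return (1 - hamming(p, q) / d) ** k

q    = (0, 0, 0, 0, 1, 1, 0, 1)
near = (0, 0, 0, 0, 1, 1, 1, 1)  # differs from q in 1 of 8 coordinates
far  = (1, 1, 1, 1, 0, 0, 0, 1)  # differs from q in 6 of 8 coordinates

p_near = collision_probability(q, near)
p_far = collision_probability(q, far)
```

Concatenating k sampled coordinates lowers both probabilities, but it lowers the probability for the far pair much faster, which is why several sampled bits are combined into one hash key.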
Algorithm: Preprocessing
• Input
  – Set of N points {p1, …, pN}
  – L (number of hash tables)
• Output
  – Hash tables Ti, i = 1, 2, …, L
• Foreach i = 1, 2, …, L
  – Initialize Ti with a random hash function gi(.)
• Foreach i = 1, 2, …, L
    Foreach j = 1, 2, …, N
      Store point pj in bucket gi(pj) of hash table Ti
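The preprocessing loop can be sketched as follows. This is an illustration rather than the authors' implementation: it assumes binary vectors and builds each gi by sampling k coordinates at random, and the names `make_g` and `preprocess` are hypothetical.

```python
import random
from collections import defaultdict

def make_g(dim, k, rng):
    """Build one hash function g that projects a point onto k random coordinates.

    Assumption: bit sampling; the slides only say "random hash function".
    """
    coords = [rng.randrange(dim) for _ in range(k)]
    return lambda p: tuple(p[c] for c in coords)

def preprocess(points, num_tables, k, seed=0):
    """Create L = num_tables hash tables and store every point index in each."""
    rng = random.Random(seed)
    dim = len(points[0])
    gs = [make_g(dim, k, rng) for _ in range(num_tables)]
    tables = [defaultdict(list) for _ in range(num_tables)]
    for g, table in zip(gs, tables):          # Foreach i = 1..L
        for j, p in enumerate(points):        # Foreach j = 1..N
            table[g(p)].append(j)             # store pj in bucket gi(pj) of Ti
    return gs, tables

points = [
    (0, 0, 0, 0, 1, 1, 0, 1),
    (0, 0, 0, 0, 1, 1, 1, 1),
    (1, 1, 1, 1, 0, 0, 1, 0),
]
gs, tables = preprocess(points, num_tables=4, k=3)
```

Buckets hold point indices rather than the points themselves, so the storage overhead is one index per point per table.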
LSH - Algorithm
[Figure: each point pi of the set P is hashed by g1(pi), g2(pi), …, gL(pi) into the hash tables T1, T2, …, TL]
Algorithm: $\epsilon$-NNS Query
• Input
  – Query point q
  – K (number of approx. nearest neighbors)
• Access
  – Hash tables Ti, i = 1, 2, …, L
• Output
  – Set S of K (or fewer) approx. nearest neighbors
• $S \leftarrow \emptyset$
  Foreach i = 1, 2, …, L
  – $S \leftarrow S \cup \{$ points found in bucket gi(q) of hash table Ti $\}$
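A minimal sketch of the query step follows. It assumes binary vectors, Hamming distance, and two hand-picked coordinate-sampling hash functions so that the result is deterministic. Ranking the collected candidates by true distance before returning K of them is an assumption here; the slide only says that S holds K or fewer approximate neighbors.

```python
def hamming(a, b):
    """Number of coordinates on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def query(q, points, gs, tables, K):
    """Collect candidates from bucket gi(q) of every table, then keep the K closest."""
    candidates = set()                       # S <- empty set
    for g, table in zip(gs, tables):         # Foreach i = 1..L
        candidates.update(table.get(g(q), []))
    ranked = sorted(candidates, key=lambda j: hamming(points[j], q))
    return ranked[:K]                        # may hold fewer than K indices

points = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1)]

# Hypothetical hash functions: each samples two fixed coordinates.
gs = [lambda p: (p[0], p[1]), lambda p: (p[2], p[3])]

# Build one table per hash function (the preprocessing step).
tables = []
for g in gs:
    table = {}
    for j, p in enumerate(points):
        table.setdefault(g(p), []).append(j)
    tables.append(table)

two_nearest = query((0, 0, 1, 0), points, gs, tables, K=2)
one_nearest = query((1, 1, 1, 0), points, gs, tables, K=1)
```

Only the points that share a bucket with q are ever compared against it, which is where the saving over a linear scan comes from. If no bucket contains a candidate, the result is empty and the query counts as a miss.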
LSH - Analysis
• Family H of (r1, r2, p1, p2)-sensitive functions {hi(.)}
  – $\mathrm{dist}(p,q) < r_1 \Rightarrow \mathrm{Prob}_H[h(q) = h(p)] \ge p_1$
  – $\mathrm{dist}(p,q) \ge r_2 \Rightarrow \mathrm{Prob}_H[h(q) = h(p)] \le p_2$
  – $p_1 > p_2$ and $r_1 < r_2$
• LSH functions: gi(.) = {h1(.), …, hk(.)}
• For a proper choice of k and L, a simpler problem, ($r$, $\epsilon$)-Neighbor, and hence the actual problem, can be solved
• Query time: $O(d \, n^{1/(1+\epsilon)})$
  – d: dimensions, n: data size
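The slides do not spell out the "proper choice" of parameters. The sketch below uses the standard settings from the LSH analysis, with several simplifying assumptions: bit sampling over d-dimensional binary vectors, so that p1 = 1 - r/d and p2 = 1 - (1 + ε)r/d, a bucket size of one, and a hypothetical function name `lsh_parameters`.

```python
import math

def lsh_parameters(n, d, r, eps):
    """Return (rho, k, L) for n points, assuming bit sampling in d dimensions.

    Requires 0 < r and (1 + eps) * r < d so that both probabilities lie in (0, 1).
    """
    p1 = 1 - r / d                      # collision probability at distance r
    p2 = 1 - (1 + eps) * r / d          # collision probability at distance (1 + eps) * r
    rho = math.log(1 / p1) / math.log(1 / p2)
    k = math.ceil(math.log(n) / math.log(1 / p2))   # hash bits per table
    L = math.ceil(n ** rho)                         # number of tables
    return rho, k, L

rho, k, L = lsh_parameters(n=20000, d=64, r=4, eps=1.0)
```

For this family the exponent rho stays below 1/(1 + ε), so the number of tables grows like n^{1/(1+ε)} at most, matching the query-time bound stated on the slide.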
Experiments
• Data Sets
  – Color images from the COREL Draw library (20,000 points, dimensions up to 64)
  – Texture information of aerial photographs (270,000 points, 60 dimensions)
• Evaluation
  – Speed, Miss Ratio, Error (%) for various data sizes, dimensions, and K values
  – Compare performance with SR-Tree (spatial data structure)
Performance Measures
• Speed
  – Number of disk block accesses needed to answer the query (# hash tables)
• Miss Ratio
  – Fraction of cases when fewer than K points are found for K-NNS
• Error
  – Average fractional error in the distance to the point found by LSH, compared to the nearest-neighbor distance, taken over the entire set of queries
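The miss ratio and error measures can be computed as in the sketch below. The function names and the sample numbers are hypothetical, and the error function assumes every query found a point and that exact nearest-neighbor distances are positive.

```python
def miss_ratio(found_counts, K):
    """Fraction of queries for which fewer than K points were found."""
    return sum(count < K for count in found_counts) / len(found_counts)

def mean_fractional_error(lsh_dists, exact_dists):
    """Average of (d_LSH / d_exact - 1) over all queries.

    Assumption: each exact distance is positive and each query returned a point.
    """
    errors = [found / exact - 1.0 for found, exact in zip(lsh_dists, exact_dists)]
    return sum(errors) / len(errors)

# Made-up illustration values, not results from the experiments.
counts = [1, 0, 1, 1]   # points found per query for 1-NNS
ratio = miss_ratio(counts, K=1)
error = mean_fractional_error([2.0, 3.0], [2.0, 2.0])
```

An error of 0 means LSH always returned a true nearest neighbor; an error of 0.25 means the returned points were on average 25% farther than the true nearest neighbor.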
Speed vs. Data Size (Approximate 1-NNS)
[Figure: disk accesses (0 to 20) against number of database points (0 to 20,000), for LSH with error = 0.2, 0.1, 0.05, 0.02 and for SR-Tree]
Speed vs. Dimension (Approximate 1-NNS)
[Figure: disk accesses (0 to 20) against dimensions (0 to 80), for LSH with error = 0.2, 0.1, 0.05, 0.02 and for SR-Tree]
Speed vs. Nearest Neighbors (Approximate K-NNS)
[Figure: disk accesses (0 to 16) against number of nearest neighbors (0 to 120), for LSH with error = 0.2, 0.1, 0.05]
Speed vs. Error
[Figure: disk accesses (0 to 450) against error in % (10 to 50), for SR-Tree and LSH]
Miss Ratio vs. Data Size (Approximate 1-NNS)
[Figure: miss ratio (0 to 0.25) against number of database points (0 to 20,000), for error = 0.1 and error = 0.05]
Conclusion
• Better query time than spatial data structures
• Scales well to higher dimensions and larger data sizes (sub-linear dependence)
• Predictable running time
• Extra storage overhead
• Inefficient for data with distances concentrated around the average
Future Work
• Investigate hybrid data structures obtained by merging tree-based and hash-based structures
• Make use of the structure of the data-set to systematically obtain LSH functions
• Explore other applications of LSH-type techniques to data mining
Questions ?