Similarity Search in High Dimensions via Hashing Aristides Gionis, Piotr Indyk, Rajeev Motwani...

Similarity Search in High Dimensions via Hashing

Aristides Gionis, Piotr Indyk, Rajeev Motwani

Presented by:

Fatih Uzun

Outline

• Introduction

• Problem Description

• Key Idea

• Experiments and Results

• Conclusions

Introduction

• Similarity Search over High-Dimensional Data

– Image databases, document collections, etc.

• Curse of Dimensionality

– All space-partitioning techniques degrade to linear search in high dimensions

• Exact vs. Approximate Answer

– An approximate answer might be good enough and much faster

– Time-quality trade-off

Problem Description

• $\epsilon$-Nearest Neighbor Search ($\epsilon$-NNS)

– Given a set P of points in a normed space, preprocess P so as to efficiently return a point $p \in P$ for any given query point q, such that

$\mathrm{dist}(q, p) \le (1 + \epsilon) \min_{r \in P} \mathrm{dist}(q, r)$

• Generalizes to K-nearest neighbor search (K > 1)

Key Idea

• Locality Sensitive Hashing (LSH) to get sub-linear dependence on the data size for high-dimensional data

• Preprocessing

– Hash the data points using several LSH functions so that the probability of collision is higher for closer objects

Algorithm : Preprocessing

• Input

– Set of N points {p1, …, pN}

– L (number of hash tables)

• Output

– Hash tables Ti, i = 1, 2, …, L

• Foreach i = 1, 2, …, L

– Initialize Ti with a random hash function gi(.)

• Foreach i = 1, 2, …, L

– Foreach j = 1, 2, …, N: store point pj in bucket gi(pj) of hash table Ti
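The preprocessing loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes binary vectors in Hamming space and takes each gi to be a projection onto k randomly sampled coordinates. The names `make_g` and `preprocess`, the parameter `k`, and the sample points are all hypothetical.

```python
import random
from collections import defaultdict


def make_g(d, k, rng):
    """Return a hash function g(p) that reads k randomly chosen coordinates of p.

    Assumption: points are binary tuples of length d (Hamming space).
    """
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[c] for c in coords)


def preprocess(points, L, k, seed=0):
    """Build L hash tables; table i maps the bucket key g_i(p_j) to a list of indices j."""
    rng = random.Random(seed)
    d = len(points[0])
    # Initialize each table T_i with its own random hash function g_i.
    gs = [make_g(d, k, rng) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    # Store every point p_j in bucket g_i(p_j) of every table T_i.
    for g, table in zip(gs, tables):
        for j, p in enumerate(points):
            table[g(p)].append(j)
    return gs, tables


# Illustrative data only.
points = [
    (0, 0, 0, 0, 1, 1, 1, 1),
    (0, 0, 0, 0, 1, 1, 1, 0),
    (1, 1, 1, 1, 0, 0, 0, 0),
]
gs, tables = preprocess(points, L=3, k=2)
```

Storing point indices rather than the points themselves keeps the buckets small; every point appears exactly once in each of the L tables.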

LSH - Algorithm

[Figure: each point pi in P is hashed by g1(pi), g2(pi), …, gL(pi) into hash tables T1, T2, …, TL]

Algorithm : $\epsilon$-NNS Query

• Input

– Query point q

– K (number of approx. nearest neighbors)

• Access

– Hash tables Ti, i = 1, 2, …, L

• Output

– Set S of K (or fewer) approx. nearest neighbors

• $S \leftarrow \emptyset$

• Foreach i = 1, 2, …, L

– $S \leftarrow S \cup$ { points found in bucket gi(q) of hash table Ti }

• Return the K nearest neighbors of q found in S
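The query procedure above could look roughly like the following Python sketch. It is self-contained, so it rebuilds a small index first. As an assumption for illustration, points are binary tuples compared by Hamming distance and each gi samples k random coordinates; `build`, `query`, `hamming`, and the sample data are hypothetical names and values.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of coordinates in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def build(points, L, k, seed=0):
    """Build L tables; each entry pairs the sampled coordinates with its bucket map."""
    rng = random.Random(seed)
    d = len(points[0])
    index = []
    for _ in range(L):
        coords = [rng.randrange(d) for _ in range(k)]
        table = defaultdict(list)
        for j, p in enumerate(points):
            table[tuple(p[c] for c in coords)].append(j)
        index.append((coords, table))
    return index


def query(index, points, q, K):
    """Collect candidates from bucket g_i(q) of every table, then keep the K closest."""
    S = set()
    for coords, table in index:
        key = tuple(q[c] for c in coords)
        S.update(table.get(key, []))  # .get avoids creating empty buckets
    return sorted(S, key=lambda j: hamming(points[j], q))[:K]


# Illustrative data only.
points = [
    (0, 0, 0, 0, 1, 1, 1, 1),
    (0, 0, 0, 0, 1, 1, 1, 0),
    (1, 1, 1, 1, 0, 0, 0, 0),
    (1, 1, 1, 0, 0, 0, 0, 0),
]
index = build(points, L=4, k=3)
result = query(index, points, points[0], K=2)
```

The result may contain fewer than K indices when the probed buckets hold fewer than K distinct points, which is the situation the miss ratio measures.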

LSH - Analysis

• Family H of $(r_1, r_2, p_1, p_2)$-sensitive functions $\{h_i(\cdot)\}$

– $\mathrm{dist}(p, q) < r_1 \Rightarrow \Pr_H[h(q) = h(p)] \ge p_1$

– $\mathrm{dist}(p, q) > r_2 \Rightarrow \Pr_H[h(q) = h(p)] \le p_2$

– $p_1 > p_2$ and $r_1 < r_2$

• LSH functions: $g_i(\cdot) = \{h_1(\cdot), \ldots, h_k(\cdot)\}$

• For a proper choice of k and L, a simpler problem, $(r, \epsilon)$-Neighbor, and hence the actual problem, can be solved

• Query Time: $O(d \, n^{1/(1+\epsilon)})$

– d : dimensions, n : data size
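To see why the choice of k and L matters, the short sketch below computes how concatenating k functions and using L tables changes collision probabilities. It assumes the k functions inside each gi and the L tables are chosen independently; the values p1 = 0.9, p2 = 0.5, k = 10, and L = 20 are illustrative assumptions, not figures from the paper.

```python
def g_collision_prob(p, k):
    """Probability that g = (h_1, ..., h_k) collides when each h collides with probability p."""
    return p ** k


def retrieval_prob(p, k, L):
    """Probability that a point shares a bucket with the query in at least one of L tables."""
    return 1.0 - (1.0 - p ** k) ** L


# Hypothetical per-function collision probabilities for a near and a far point.
p1, p2 = 0.9, 0.5
k, L = 10, 20

near = retrieval_prob(p1, k, L)  # chance a near point is retrieved
far = retrieval_prob(p2, k, L)   # chance a far point is retrieved
```

Raising k suppresses collisions with far points much faster than with near points, and raising L restores the chance of retrieving near points, at the cost of more tables to store and probe.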

Experiments

• Data Sets

– Color images from COREL Draw library (20,000 points, dimensions up to 64)

– Texture information of aerial photographs (270,000 points, 60 dimensions)

• Evaluation

– Speed, Miss Ratio, Error (%) for various data sizes, dimensions, and K values

– Compare performance with SR-Tree (spatial data structure)

Performance Measures

• Speed

– Number of disk block accesses needed to answer the query (# hash tables)

• Miss Ratio

– Fraction of cases in which fewer than K points are found for K-NNS

• Error

– Fractional error in the distance to the point found by LSH, compared to the true nearest-neighbor distance, averaged over the entire set of queries
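The miss ratio and error measures can be written down as a small Python sketch. It assumes the fractional error of one query is (d_LSH - d_true) / d_true with d_true > 0, which is one reading of the definition above; the function names and the sample numbers are hypothetical.

```python
def miss_ratio(result_counts, K):
    """Fraction of queries for which fewer than K points were returned."""
    return sum(c < K for c in result_counts) / len(result_counts)


def mean_fractional_error(lsh_dists, true_dists):
    """Average of (d_lsh - d_true) / d_true over queries that returned a point.

    Assumption: every true nearest-neighbor distance is strictly positive.
    """
    errs = [(a - t) / t for a, t in zip(lsh_dists, true_dists)]
    return sum(errs) / len(errs)


# Made-up example: four 1-NNS queries, one of which found nothing.
counts = [1, 0, 1, 1]
mr = miss_ratio(counts, K=1)

# Distances for two queries that did return a point.
err = mean_fractional_error([1.1, 2.0], [1.0, 2.0])
```

Queries that return no point contribute to the miss ratio only, since no distance is available for them.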

Speed vs. Data Size: Approximate 1-NNS

[Figure: disk accesses (0 to 20) vs. number of database points (0 to 20,000); series: LSH with error = 0.2, 0.1, 0.05, 0.02, and SR-Tree]

Speed vs. Dimension: Approximate 1-NNS

[Figure: disk accesses (0 to 20) vs. dimensions (0 to 80); series: LSH with error = 0.2, 0.1, 0.05, 0.02, and SR-Tree]

Speed vs. Nearest Neighbors: Approximate K-NNS

[Figure: disk accesses (0 to 16) vs. number of nearest neighbors (0 to 120); series: LSH with error = 0.2, 0.1, 0.05]

Speed vs. Error

[Figure: disk accesses (0 to 450) vs. error in % (10 to 50); series: SR-Tree and LSH]

Miss Ratio vs. Data Size: Approximate 1-NNS

[Figure: miss ratio (0 to 0.25) vs. number of database points (0 to 20,000); series: error = 0.1 and error = 0.05]

Conclusion

• Better query time than spatial data structures

• Scales well to higher dimensions and larger data size (sub-linear dependence)

• Predictable running time

• Extra storage overhead

• Inefficient for data with distances concentrated around the average

Future Work

• Investigate hybrid data structures obtained by merging tree-based and hash-based structures

• Make use of the structure of the data set to systematically obtain LSH functions

• Explore other applications of LSH-type techniques to data mining

Questions ?
