
Similarity Search in High Dimensions via Hashing

Aristides Gionis, Piotr Indyk, Rajeev Motwani

Presented by:

Fatih Uzun

Outline

• Introduction

• Problem Description

• Key Idea

• Experiments and Results

• Conclusions

Introduction

• Similarity Search over High-Dimensional Data
  – Image databases, document collections, etc.

• Curse of Dimensionality
  – All space-partitioning techniques degrade to linear search in high dimensions

• Exact vs. Approximate Answer
  – An approximate answer may be good enough and much faster
  – Time-quality trade-off

Problem Description

• ε-Nearest Neighbor Search (ε-NNS)
  – Given a set P of points in a normed space, preprocess P so as to efficiently return a point p ∈ P for any given query point q, such that

    dist(q,p) ≤ (1 + ε) · min_{r ∈ P} dist(q,r)

    (a small numeric check of this guarantee appears below)
  – Generalizes to K-nearest neighbor search (K > 1)
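As a numeric illustration of the guarantee (all distances here are invented): with ε = 0.1, the returned point may be at most 10% farther from q than the true nearest neighbor.

```python
# Illustration of the epsilon-NNS guarantee with invented distances.
eps = 0.1
true_nn_dist = 1.00      # min over r in P of dist(q, r)
returned_dist = 1.08     # dist(q, p) for the point the index returned
assert returned_dist <= (1 + eps) * true_nn_dist   # 1.08 <= 1.10
```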


Key Idea

• Locality-Sensitive Hashing (LSH) to get sub-linear dependence on the data size for high-dimensional data

• Preprocessing:
  – Hash each data point using several LSH functions so that the probability of collision is higher for closer objects

Algorithm: Preprocessing

• Input
  – Set of n points {p1, …, pn}
  – L (number of hash tables)

• Output
  – Hash tables Ti, i = 1, 2, …, L

• Foreach i = 1, 2, …, L
  – Initialize Ti with a random hash function gi(·)

• Foreach i = 1, 2, …, L
  – Foreach j = 1, 2, …, n
    • Store point pj in bucket gi(pj) of hash table Ti (see the code sketch below)
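A minimal runnable sketch of this preprocessing step, assuming points are binary vectors in Hamming space and that each gi concatenates k randomly sampled coordinates (the bit-sampling family the paper uses for Hamming space); the names make_g, build_tables, and num_tables are illustrative, not from the paper:

```python
import random
from collections import defaultdict

def make_g(dim, k, rng):
    """g(.) projects a point onto k randomly chosen coordinates."""
    coords = rng.sample(range(dim), k)
    return lambda p: tuple(p[c] for c in coords)

def build_tables(points, dim, k=8, num_tables=10, seed=0):
    rng = random.Random(seed)
    gs = [make_g(dim, k, rng) for _ in range(num_tables)]    # g1 .. gL
    tables = [defaultdict(list) for _ in range(num_tables)]  # T1 .. TL
    for j, p in enumerate(points):
        for g, table in zip(gs, tables):
            table[g(p)].append(j)  # store pj in bucket gi(pj) of Ti
    return gs, tables
```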

LSH - Algorithm

[Diagram: each point pi in P is hashed by g1(pi), g2(pi), …, gL(pi) into buckets of the hash tables T1, T2, …, TL.]

Algorithm: ε-NNS Query

• Input
  – Query point q
  – K (number of approximate nearest neighbors)

• Access
  – Hash tables Ti, i = 1, 2, …, L

• Output
  – Set S of K (or fewer) approximate nearest neighbors

• S ← ∅

• Foreach i = 1, 2, …, L
  – S ← S ∪ {points found in bucket gi(q) of hash table Ti} (see the sketch below)
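A matching sketch of the query step, under the same Hamming-space assumptions as the preprocessing code above; ranking the candidate union by true distance and keeping the best K is one straightforward way to finish, and knns_query and hamming are again illustrative names:

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary vectors."""
    return sum(x != y for x, y in zip(a, b))

def knns_query(q, points, gs, tables, K):
    # S <- union over i of the bucket gi(q) in table Ti
    candidates = set()
    for g, table in zip(gs, tables):
        candidates.update(table.get(g(q), []))
    # Rank candidates by true distance; fewer than K hits is the
    # "miss" counted by the miss-ratio measure later in the deck.
    ranked = sorted(candidates, key=lambda j: hamming(points[j], q))
    return [points[j] for j in ranked[:K]]
```

For example, gs, tables = build_tables(pts, dim=64) followed by knns_query(pts[0], pts, gs, tables, K=10) returns up to 10 approximate neighbors of the first point.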

LSH - Analysis

• Family H of (r1, r2, p1, p2)-sensitive functions {hi(·)}
  – dist(p,q) < r1 ⇒ ProbH[h(q) = h(p)] ≥ p1
  – dist(p,q) ≥ r2 ⇒ ProbH[h(q) = h(p)] ≤ p2
  – p1 > p2 and r1 < r2

• LSH functions: gi(·) = (h1(·), …, hk(·))

• For a proper choice of k and L, a simpler problem, (r, ε)-Neighbor, and hence the actual problem can be solved

• Query time: O(d · n^(1/(1+ε)))
  – d: dimensions, n: data size
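As a concrete illustration (the numbers below are mine, not the slides'): for the bit-sampling family over the d-dimensional Hamming cube, a single hash h(p) returns one uniformly random coordinate of p, so Prob[h(q) = h(p)] = 1 − dist(p,q)/d, which immediately yields an (r1, r2, 1 − r1/d, 1 − r2/d)-sensitive family:

```python
# Sensitivity of the bit-sampling family over Hamming space.
# h(p) = p[i] for a random coordinate i, so the collision probability
# is 1 - dist(p, q) / d. The thresholds below are made up.
d = 64            # dimension of the Hamming cube
r1, r2 = 4, 16    # "close" and "far" thresholds, r1 < r2
p1 = 1 - r1 / d   # 0.9375: lower bound for pairs within r1
p2 = 1 - r2 / d   # 0.75:   upper bound for pairs beyond r2
assert p1 > p2 and r1 < r2   # the (r1, r2, p1, p2)-sensitivity conditions

# Concatenating k hashes into g(.) separates the cases exponentially:
k = 8
print(p1 ** k, p2 ** k)   # ~0.60 vs ~0.10; close pairs still collide,
                          # far pairs rarely do, and L independent
                          # tables boost the chance of at least one hit.
```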

Experiments

• Data Sets
  – Color images from the COREL Draw library (20,000 points, up to 64 dimensions)
  – Texture information of aerial photographs (270,000 points, 60 dimensions)

• Evaluation
  – Speed, miss ratio, and error (%) for various data sizes, dimensions, and K values
  – Compare performance with the SR-Tree (a spatial data structure)

Performance Measures

• Speed
  – Number of disk block accesses needed to answer the query (for LSH, proportional to the number of hash tables)

• Miss Ratio
  – Fraction of cases in which fewer than K points are found for K-NNS

• Error
  – Average fractional error in the distance to the point found by LSH, relative to the true nearest-neighbor distance, taken over the entire set of queries (a small sketch of this measure follows below)
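A minimal sketch of how this error measure could be computed from the definition above (the function name and all numbers are invented for illustration):

```python
# Average fractional error of LSH distances vs. true NN distances.
def avg_error(lsh_dists, true_dists):
    return sum(l / t - 1 for l, t in zip(lsh_dists, true_dists)) / len(true_dists)

# Three queries: 5% off, 20% off, exact.
print(avg_error([1.05, 1.20, 1.00], [1.0, 1.0, 1.0]))  # ~0.083
```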

Speed vs. Data Size (Approximate 1-NNS)

[Chart: disk accesses vs. number of database points (0–20,000) for LSH with error = 0.2, 0.1, 0.05, 0.02, and for the SR-Tree.]

Speed vs. Dimension (Approximate 1-NNS)

[Chart: disk accesses vs. dimensions (0–80) for LSH with error = 0.2, 0.1, 0.05, 0.02, and for the SR-Tree.]

Speed vs. Nearest Neighbors (Approximate K-NNS)

[Chart: disk accesses vs. number of nearest neighbors (0–120) for LSH with error = 0.2, 0.1, 0.05.]

Speed vs. Error

[Chart: disk accesses vs. error (%) (10–50) for LSH and the SR-Tree.]

Miss Ratio vs. Data Size (Approximate 1-NNS)

[Chart: miss ratio (0–0.25) vs. number of database points (0–20,000) for error = 0.1 and error = 0.05.]

Conclusion

• Advantages
  – Better query time than spatial data structures
  – Scales well to higher dimensions and larger data sizes (sub-linear dependence)
  – Predictable running time

• Disadvantages
  – Extra storage overhead
  – Inefficient for data with distances concentrated around the average

Future Work

• Investigate hybrid data structures obtained by merging tree-based and hash-based structures

• Make use of the structure of the data set to systematically obtain LSH functions

• Explore other applications of LSH-type techniques to data mining

Questions?