
Locality-Sensitive Hashing:

Finding a Needle in a Haystack

Malcolm Slaney (Yahoo! Research) and Michael Casey (Goldsmiths College, University of London)

Fifth Draft! Do not distribute

Draft of an article to be published in IEEE Signal Processing Magazine, Lecture Notes, March 2008.

I. SCOPE

One of the more surprising changes in computing during the last few years is the wealth of data that is now

available at our fingertips. We can easily carry in our pockets thousands of songs, hundreds of thousands of images,

and hundreds of hours of video. But even with the rapid growth of computer performance, we don’t have the

processing power to search this amount of data by brute force.

This note describes a technique known as locality-sensitive hashing (LSH) that allows one to quickly find similar

entries in large databases of text, images or music. This approach belongs to a novel and interesting class of

algorithms that are known as randomized algorithms. A randomized algorithm does not guarantee an exact answer,

but instead provides a high probability guarantee that it will return the correct answer or one close to it. By investing

additional computational effort for more sampling, the probability can be pushed as high as desired.

II. RELEVANCE

There are many problems that involve finding similar items. These problems are often solved by finding the nearest

neighbor to an object in some metric space. This is an easy problem to state, but when the database is large and

the objects are complicated the processing time grows linearly with the number of items and the complexity of the

object. LSH is most valuable when searching for near matches (as opposed to exact matches) of high-dimensional

items in very large databases. In these searches it can drastically reduce the computational time, at the cost of a

small probability of failing to find the absolute closest match.

III. PREREQUISITES

This tutorial note is based on simple geometric reasoning. Some knowledge of probability and a comfort with the mathematics of high-dimensional space are useful.

IV. PROBLEM STATEMENT

Given a query point, we wish to find the points in a large database that are closest to the query. We wish to

guarantee with high probability, 1− δ, that we return the nearest neighbor for any query point.

Conceptually, this problem is easily solved by iterating through each point in the database and calculating the

distance to the query object. But our database may contain billions of objects—each object described by a vector


that contains hundreds of dimensions. Therefore, we wish to find a solution that does not depend on a linear search

of the database.
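To make the baseline concrete, here is a minimal brute-force search in Python with NumPy (a sketch only; the database size, dimensionality, and random data are purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
database = rng.standard_normal((100_000, 128))   # 100,000 points with 128 dimensions each
query = rng.standard_normal(128)

# Linear scan: one distance computation per database entry, O(N * n) work.
distances = np.linalg.norm(database - query, axis=1)
print(distances.argmin(), distances.min())       # index and distance of the closest point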

A. Trees

In a one-dimensional world, it is easy to search for values by building a tree of objects. Then, given a query we

start at the top node, ask if our query object is to the left or to the right of the current node, and then recursively

descend the tree. If the tree is properly constructed, this solves the query problem in O(log N ) time, where N is the

number of objects. In a one-dimensional world, this is a binary search; the k-d tree algorithm is a multi-dimensional

version of this idea [4]. But multi-dimensional algorithms, such as k-d trees, break down when the dimensionality

of the search space is greater than a few dimensions—we end up testing nearly all the nodes in the data set and

the computational complexity grows to O(N ).

B. Hashes

Many search problems are solved using conventional computer hashing algorithms. A hash table is a data structure

that allows one to quickly map between a symbol (i.e. a string) and a value. This is done by calculating an arbitrary,

pseudo-random function of the symbol that maps the symbol into an integer that indexes a table. Thus a symbol

with dozens of characters, and perhaps hundreds of bits of data, is mapped to a relatively small index into the table.

A collision occurs when two symbols hash to the same value, and special provisions allow more than one symbol per hash value. But a well-designed hash table allows a symbol lookup in O(1) time with O(N) memory,

where N is the number of entries in the table.

A hash table returns an exact match. A well-designed hash function separates two symbols that are close

together into different buckets. This makes a hash table a good means of finding exact matches, but not for

finding approximate matches. By contrast, a locality-sensitive hash is an efficient means of finding near matches.

V. SOLUTION

LSH is based on a simple idea. Consider the most general view: If two points are close together, then after a

“projection” operation these two points will remain close together. Figure 1 illustrates the basic idea. Two points

that are close together on the sphere are also close together when the sphere is imaged (or projected) onto the

two-dimensional page. This is true no matter how we rotate the sphere. Two other points on the sphere that are far

apart will, for some orientations, be close together on the page, but it is more likely that the points will remain far

apart. We will describe two different kinds of projection operators, but thinking about rendering a multi-dimensional

sphere onto a two-dimensional page is a good metaphor.

Given a random “projection” operation, we note which points are close to our query. A projection maps a data

point from a high-dimensional space to a low-dimensional subspace. We create projections from a number of

different directions and keep track of the nearby points. We keep a list of these found points and note the points

that appear close to each other in more than one projection. Part of the art of solving this problem is defining a

notion of “nearby” (a threshold) so that we keep track of a manageable number of points. Commonly, the projection operation projects the points onto a line, so the similarity test is a simple comparison.

Fig. 1. Splat. Two examples showing projections of two close (circles) and two distant points (squares) onto the printed page.

There are two common ways to do these projections. The original formulation for LSH assumed all points are

described by a large number of binary features [9], [10]. Projections are formed by selecting a subset of the dimensions.

A better formulation for signal processing applications computes a dot-product with Gaussian vectors, creating

arbitrary projections [2].

Calculating the distance between two objects, x and y, in a binary feature space is easy with the Hamming distance

D = Σ_i (x_i ⊕ y_i)    (1)

where x_i is the value of the i-th feature for object x and ⊕ is the exclusive-or operator. We implement a locality-

sensitive hash by performing the Hamming calculation over subsets of the dimensions. On average, points that are

close together because they share many of the same features will remain close together in a random subspace.

Binary features are not a limitation for signal processing because integers can be represented with a unary code.

Thus in an N-bit unary code, bit i for 0 ≤ i < N represents the feature that the value of the number is less than i. In an implementation of LSH, this bit vector does not need to be calculated. Instead, we expand the calculation in Equation 1 to include the numerical comparison for bit i.
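As a concrete illustration (a minimal Python sketch; the feature length and subset size are arbitrary choices, not values from the text), the fragment below computes the Hamming distance of Equation 1 and hashes binary feature vectors by sampling a random subset of their dimensions, so that vectors sharing many features tend to collide:

import numpy as np

rng = np.random.default_rng(0)

def hamming(x, y):
    """Equation 1: count the positions where the binary features disagree."""
    return int(np.sum(x != y))

def make_bit_sampler(num_features, subset_size):
    """Pick a random subset of feature indices once; the hash key is those sampled bits."""
    idx = rng.choice(num_features, size=subset_size, replace=False)
    return lambda bits: tuple(bits[idx])

x = rng.integers(0, 2, 64)           # a binary feature vector
y = x.copy()
y[:2] ^= 1                           # a nearby point: differs in only 2 of 64 features
z = rng.integers(0, 2, 64)           # an unrelated point

h = make_bit_sampler(64, 8)
print(hamming(x, y), hamming(x, z))  # small distance versus roughly 32
print(h(x) == h(y), h(x) == h(z))    # likely True, likely False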

A. Random Projections

The key to locality-sensitive hashes of the query point v from a real-valued high-dimensional space is the dot product

h(v) = v · x    (2)

where the elements of the vector x are chosen at random from a Gaussian distribution, for example N(0, 1). This

scalar projection is then quantized into a set of hash bins, with the intention that nearby items in the original space


will fall into the same bin. The full hash function is given by

h_{x,b}(v) = ⌊(x · v + b) / w⌋    (3)

where ⌊·⌋ is the floor operation and 0 ≤ b < w is a uniformly distributed random variable that makes the quantization

error easier to analyze, with no loss in performance.
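In code, one such hash follows directly from Equation 3 (a minimal sketch; the dimensionality, bucket width w, and random seed are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(1)

def make_hash(dim, w):
    """Return h(v) = floor((x . v + b) / w), with x drawn from N(0, 1) and b uniform on [0, w)."""
    x = rng.standard_normal(dim)    # random Gaussian projection direction
    b = rng.uniform(0, w)           # random offset that simplifies the quantization analysis
    return lambda v: int(np.floor((x @ v + b) / w))

h = make_hash(dim=100, w=4.0)
p = rng.standard_normal(100)
q = p + 0.01 * rng.standard_normal(100)   # a point very close to p
print(h(p), h(q))                         # usually the same bucket index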

In order for the projection operator to “converge,” it must project nearby points to positions that are close together.

Thus, for any points p and q in R^d, we want a high probability, P1, that two close points fall into the same bucket:

P_H[h(p) = h(q)] ≥ P1  for  ||p − q|| ≤ R1    (4)

and we want a low probability, P2 < P1, that two points that are far apart, R2 > R1, fall into the same bucket:

P_H[h(p) = h(q)] ≤ P2  for  ||p − q|| ≥ cR1 = R2.    (5)

Because of the linearity of the dot product, the difference between two image points, ||h(p) − h(q)||_2, has a magnitude whose distribution is proportional to ||p − q||_2. By this argument we see that P1 > P2. We further

magnify the difference between P1 and P2, thus increasing the performance of each projection, by performing k

dot products in parallel. This increases the ratio of the probability of nearby points over not-so-close points:

(P1 / P2)^k > P1 / P2.    (6)

The k independent hashes form a projection we call g_j. The projection g_j transforms the query point v into k real numbers. For efficient comparisons, we put all points (the query points and all the points in the database) into buckets, quantizing the hash values as in Equation 3, with the hope that similar points will fall in the same bucket.

The width of the buckets, w, determines how many points collide. A small value for w means there is a bigger

table and fewer nearest neighbor points to check; a large value means we have to sort through many points to find

the true nearest neighbors.

Within each set of k dot products that form a projection, we achieve success if the query and the nearest neighbor

are in the same bin in all k dot products. Hence, P_1^k falls as we include more dot products. However, when we

repeat this L times, only some of the projections will fail to find the nearest neighbor. This gives us additional

error tolerance. Thus, we form L of these projections to get the desired level of probability. By increasing L we

can find the nearest neighbor with arbitrarily high probability. We will discuss how these parameters are chosen in

Section V-C.
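A toy version of the whole scheme (a sketch under the assumptions above; a production system such as E2LSH adds the bucket-hashing machinery described in the next subsection, and the parameters here are illustrative) stores every point in L tables, each keyed by a k-tuple of quantized dot products, and checks only the colliding candidates at query time:

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(2)

class ToyLSH:
    def __init__(self, dim, k, L, w):
        # L projections g_j, each made of k independent hashes of the form in Equation 3.
        self.X = rng.standard_normal((L, k, dim))
        self.b = rng.uniform(0, w, (L, k))
        self.w = w
        self.tables = [defaultdict(list) for _ in range(L)]

    def _keys(self, v):
        # One k-tuple of bucket indices per projection g_j.
        return [tuple(np.floor((X @ v + b) / self.w).astype(int))
                for X, b in zip(self.X, self.b)]

    def add(self, i, v):
        for table, key in zip(self.tables, self._keys(v)):
            table[key].append(i)

    def query(self, v):
        # Union of all buckets the query falls into; only these candidates
        # need their true distance to v checked.
        return set().union(*(set(t.get(key, [])) for t, key in zip(self.tables, self._keys(v))))

index = ToyLSH(dim=20, k=4, L=10, w=4.0)
data = rng.standard_normal((1000, 20))
for i, v in enumerate(data):
    index.add(i, v)

q = data[0] + 0.05 * rng.standard_normal(20)   # a query very close to point 0
candidates = index.query(q)
print(0 in candidates, len(candidates))        # True with high probability, far fewer than 1000 candidates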

B. Implementation

At this point, we have mapped a data point into a hash bucket described by k integer indices. This k-dimensional

space is sparse, but we can use conventional hashes, with no loss of performance, to efficiently find the right bucket.

For illustration, we describe the approach used by E2LSH [1], but more sophisticated approaches based on reusing hashes [2], or different spatial tilings [3], are also possible. Even after the projection operator, one still must find

nearest neighbors along a line. A naïve algorithm could easily take O(log N) operations, but we reduce this to O(1)

operations using a pair of conventional hash tables.


One projection of the LSH algorithm using dot products is shown in Figure 2. This processing puts one point

through k dot products and then stores a fingerprint (T2) that describes the k-dimensional result into a hash bucket

computed by hash T1. Collisions are potential nearest neighbors. This single projection allows us to find nearby

points in a small fraction of the time it takes to look at all the points in the database.

Fig. 2. A block diagram of one LSH projection.

We first use a conventional hash to map the k-dimensional projection output into a single linear index. This is

implemented by computing

T1 = (Σ_i H_ij k_i) mod P1    (7)

for an arbitrary prime number (and hash-table size) P1. With a well-designed hash, the hash table is efficient, but

we still have a chance that two points will collide under T1. Thus, we need to chain the entries in a bucket and

verify that we have found the right entry by comparing the query's k-dimensional projection value to the database values. This calculation grows as k gets larger, in addition to taking more space. Instead, we store a fingerprint, similar to T1, for the projection vector. The fingerprint is calculated with

T2 = (Σ_i H_ij k_i) mod P2.    (8)

Now when checking to see if the query projection is in the T1 hash bucket, we just compare the fingerprints. Even

with a 16-bit fingerprint as determined by P2, the chance of a mistake is very small.
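The double-hashing trick can be sketched as follows (illustrative only; the primes, random weights, and key length are arbitrary choices, not the values used by E2LSH):

import numpy as np

rng = np.random.default_rng(3)

P1 = 2_000_003                      # hash-table size (a prime) for T1
P2 = 65_521                         # fingerprint modulus (a 16-bit prime) for T2
r1 = rng.integers(1, P1, size=8)    # random weights for T1
r2 = rng.integers(1, P2, size=8)    # random weights for T2

def t1(key):
    """Bucket index for a k-dimensional projection key."""
    return int(np.dot(r1, key) % P1)

def t2(key):
    """Short fingerprint stored in the bucket in place of the full key."""
    return int(np.dot(r2, key) % P2)

key_a = np.array([3, -1, 7, 0, 2, 2, -4, 5])
key_b = np.array([3, -1, 7, 0, 2, 2, -4, 5])    # identical key: same bucket, same fingerprint
key_c = np.array([3, -1, 7, 0, 2, 2, -4, 6])    # differs in one coordinate
print(t1(key_a) == t1(key_b), t2(key_a) == t2(key_b))   # True True
print(t2(key_a) == t2(key_c))                           # almost certainly False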

C. s-Stable distributions

As described above, any projection operation can be used to reduce the data to a lower-dimensional space.

Forming a dot product with a vector based on a family of random variables from an s-stable distribution simplifies the analysis of the performance of LSH. A weighted sum of random variables from an s-stable distribution has a

probability distribution that is similar to the original distribution. More formally, a distribution D is s-stable if

for any independent, identically distributed (iid) random variables X1, ...,Xn distributed according to D and any

real numbers v1, ..., vn, the random variable Σ_i v_i X_i has a probability distribution that is the same as the random variable

(Σ_i |v_i|^s)^{1/s} X    (9)

where X is drawn from D. For s = 2, or the L2 norm, a Gaussian probability distribution is s-stable.
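The 2-stable property is easy to check numerically (a sketch; the weight vector is an arbitrary example): a Gaussian-weighted sum of independent Gaussians behaves like a single Gaussian scaled by the L2 norm of the weights.

import numpy as np

rng = np.random.default_rng(4)
v = np.array([0.5, -1.2, 2.0, 0.3])             # arbitrary real weights v_1, ..., v_n
X = rng.standard_normal((100_000, v.size))      # iid N(0, 1) samples of X_1, ..., X_n

lhs = X @ v                                     # sum_i v_i X_i
rhs = np.linalg.norm(v) * rng.standard_normal(100_000)   # (sum_i |v_i|^2)^(1/2) X

print(lhs.std(), rhs.std())    # both approximately ||v||_2 = 2.40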


Using an s-stable distribution for our projections allows us to analytically describe the performance of LSH. We

start by calculating the probability that two points, p and q, separated by distance u = ||p − q||2, collide and fall

into the same hash bucket. The projections of the two close points will always be close, but because of quantization

they might fall on opposite sides of the barrier and thus land in different buckets. The probability that these two

points hash to the same value is given by

p(u) = Pr_{a,b}[h_{a,b}(p) = h_{a,b}(q)] = ∫_0^w (1/u) f_s(t/u) (1 − t/w) dt    (10)

where f_s is the probability density function (pdf) of the s-stable distribution that underlies the hash, and the 1 − t/w term represents the probability that the two points fall in the same bin of width w. For any given bucket width, w, this

probability falls as the distance u grows.

This probability can be used to calculate the probabilities in Equations 4 and 5 for an L2 space with R1 equal to the bin width w [2]:

P2 = 1 − 2F(−w/c) − (2 / (√(2π) w/c)) (1 − e^{−w²/(2c²)}).    (11)

Here F(·) is the cumulative distribution function of a Gaussian random variable. P1 is found by setting c = 1.
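Both expressions are straightforward to evaluate numerically (a sketch using NumPy and SciPy; the bucket width and distances are illustrative, and c is interpreted here as the distance between the two points):

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def p_collision(u, w):
    """Equation 10 for the Gaussian (2-stable) case: probability that two points
    at distance u fall into the same bucket of width w."""
    integrand = lambda t: (2.0 / u) * norm.pdf(t / u) * (1.0 - t / w)   # (1/u) f_s(t/u), f_s the pdf of |N(0, 1)|
    return quad(integrand, 0.0, w)[0]

def p_closed_form(w, c):
    """Equation 11, with c the distance between the two points."""
    return (1 - 2 * norm.cdf(-w / c)
            - (2.0 / (np.sqrt(2 * np.pi) * w / c)) * (1 - np.exp(-w**2 / (2 * c**2))))

w = 4.0
print(p_collision(1.0, w), p_closed_form(w, 1.0))   # about 0.80: close points
print(p_collision(4.0, w), p_closed_form(w, 4.0))   # about 0.37: distant points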

P_1^k is the probability that a single point falls within the same bucket as the query, so the probability that all L projections fail to produce a collision between the query and the true nearest neighbor is equal to (1 − P_1^k)^L. If we want the probability that our algorithm fails to find the true nearest neighbor to be no more than δ, then L must be at least

L = ⌈ log δ / log(1 − P_1^k) ⌉    (12)

for a fixed value of k to be determined.
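For example (a sketch; the values of P1, k, and δ are illustrative), Equation 12 translates directly into code:

import math

def tables_needed(p1, k, delta):
    """Equation 12: the smallest L with (1 - p1**k)**L <= delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p1**k))

p1, delta = 0.8, 0.01
for k in (4, 7, 10, 14):
    print(k, tables_needed(p1, k, delta))
# Larger k makes each projection more selective (fewer collisions to check),
# but more tables L are then needed to keep the failure rate below delta.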

The amount of time needed to find a nearest neighbor is the time needed to calculate the hash functions, plus

the time needed to search the buckets for collisions. Because there are kL projections of n-dimensional vectors, the

first time, Tg, is O(nkL), where n is the dimensionality of the search space. The second time, Tc, increases linearly with the expected number of collisions for each projection, Tc = O(dLNc), where d is the average number of points in each bucket. The expected number of collisions for a single projection is

Nc = Σ_{q′ ∈ D} p^k(||q − q′||)    (13)

where p(·) from Equation 10 gives the probability that each point contributes to a collision and D represents all the points in the database. It is easy to see that Tg increases as a function of k, while Tc decreases, since p^k < p for p < 1 and k > 1.

The E2LSH algorithm finds the best value for k by experimentally evaluating the cost of the calculation for

samples in the given data set. E2LSH scales the data so that w is always equal to 4. In the cover-song experiments described elsewhere [8], E2LSH used between 7 and 14 dot products per hash (k) and more than 150 projections

(L).


VI. APPLICATIONS

The LSH algorithm, and related randomized algorithms, make it possible to quickly find nearest neighbors in

very large databases. Conventional hashes work well for finding exact matches, but do not help us find neighbors.

Instead, a hashing algorithm, as we have described, needs to take into account the locality of the points so that

nearby points remain nearby. These locality-sensitive hashes have been applied in a number of domains, which we now use to illustrate the idea.

A. Web

A randomized algorithm was first applied to finding duplicate pages on the web. The web is full of duplicate

pages, partly because content is duplicated across sites, and partly because there is more than one URL that points

to the same file on a disk. Yet search engines don’t want to return 10 copies of the same page. A solution is based on shingles: each shingle represents a portion of a web page and is computed by forming a histogram of the words found within that portion of the page. We can test to see if a portion of the page is duplicated elsewhere on

the web by looking for other shingles with the same histogram. Given that there are billions of pages on the web,

and any portion of any page might be a duplicate, there are a large number of shingles to test.
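A toy version of word-histogram shingles (a sketch; the window and step sizes and the sample pages are invented for illustration) shows why near-duplicate pages share most of their shingles:

from collections import Counter

def shingles(text, window=50, step=25):
    """Word-histogram shingles computed over overlapping windows of a page."""
    words = text.lower().split()
    return [frozenset(Counter(words[i:i + window]).items())
            for i in range(0, max(1, len(words) - window + 1), step)]

page_a = "the quick brown fox jumps over the lazy dog " * 20
page_b = ("the quick brown fox jumps over the lazy dog " * 18
          + "completely different navigation and footer words here " * 2)

common = set(shingles(page_a)) & set(shingles(page_b))
print(len(common), len(set(shingles(page_a))))   # most of page_a's shingles also appear in page_b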

Broder [5] solved this problem by considering random selections, analogous to LSH, to test the similarity of pages. If the shingles of the new page match shingles from the database, then it is likely that the new page bears a strong

resemblance to an existing page. The nearest-neighbor solution is important because web pages are surrounded by

navigational and other information that changes from site to site. An approximate solution to this problem is fine,

especially when balanced with the computational savings of a solution like LSH.

B. Image retrieval

A second application of LSH is for object recognition [11]. We compute a detailed metric for many different

orientations and configurations of an object we want to recognize. Then, given a new image we simply check our

database to see if a pre-computed object’s metrics are close to our query. This database can contain millions of

these poses. Using LSH allows us to quickly check to see if the query object is known. A similar idea was applied

to genomic data [6].

C. Music retrieval

We use conventional hashes to find exact musical matches. Fingerprints are representations of an audio signal

that are robust to common types of abuse that are performed to audio before it reaches our ears [7]. This can be done, for example, by noting the peaks in the spectrum, because they are robust to noise, and encoding their position

in time and frequency. One then just has to query the database for the same fingerprint. With such robust features, one

can use conventional hashing, especially when looking at many samples over time, because one only needs to find

one exact match to reduce the search space.

To find similar songs, as might happen when a song is remixed for a new audience, or more dramatically when

a different artist performs the same song, we cannot use a fingerprint. Instead, we can use several seconds of the


song, a snippet, as a shingle. To determine if two songs are similar, we need to query the database and see if a large

enough number of the query shingles are close to one song in the database [8]. Closeness depends on the feature

vector, but long shingles provide specificity and make LSH more important. This similarity measure is important in the Internet era because it lets us eliminate duplicates to improve search results and link recommendation data

between similar songs. LSH is important for this nearest-neighbor check because of the size of the database.

VII. CONCLUSIONS - WHAT WE HAVE LEARNED

In this note, we have described the theory and implementation of a randomized algorithm known as locality-

sensitive hashing (LSH). Unlike conventional computer hashes that are designed to return exact matches in O(1)

time, an LSH algorithm uses dot products with random vectors to quickly find nearest neighbors. LSH provides a

probabilistic guarantee that it will return the correct answer. In systems that have other sources of error, perhaps due

to mislabeled data or the difficulties of pattern recognition, one can reduce the error due to LSH below these other sources of error and still gain a significant reduction in computational effort. These randomized algorithms are important in

today’s world of Internet-sized databases.

VIII. ACKNOWLEDGMENTS

We appreciate thoughtful comments we have received from Alex Jaffe, Sara Anderson, and several reviewers.

REFERENCES

[1] Alexandr Andoni and Piotr Indyk. E2LSH 0.1 User Manual. http://web.mit.edu/andoni/www/LSH, June 21, 2005

[2] A. Andoni, M. Datar, N. Immorlica, V. Mirrokni. Locality-sensitive hashing using stable distributions. In Nearest Neighbor Methods

in Learning and Vision: Theory and Practice, T. Darrell, P. Indyk, G. Shakhnarovich (eds.), MIT Press, 2006.

[3] Alexandr Andoni and Piotr Indyk. Near-Optimal Hashing Algorithms for Near Neighbor Problem in High Dimensions. In Proceedings

of the Symposium on Foundations of Computer Science (FOCS’06), 2006.

[4] J.L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18:509–517, 1975.

[5] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. In Proc. of WWW,

pages 1157–1166, Santa Clara, CA, 1997.

[6] Jeremy Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17:419–428, 2001.

[7] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of algorithms for audio fingerprinting. In International Workshop on

Multimedia Signal Processing, December 2002.

[8] Michael Casey and Malcolm Slaney. Fast Recognition of Remixed Music Audio. In Proceedings ICASSP 2007, Volume 4, IV-1425–

IV-1428 , 2007.

[9] Aristides Gionis, Piotr Indyk and Rajeev Motwani. Similarity Search in High Dimensions via Hashing. in The VLDB Journal,

518–529, 1999.

[10] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality, in STOC, 1998.

[11] Gregory Shakhnarovich, Paul Viola, Trevor Darrell. Fast Pose Estimation with Parameter-Sensitive Hashing. In Nearest Neighbor

Methods in Learning and Vision: Theory and Practice, T. Darrell, P. Indyk, G. Shakhnarovich (eds.), MIT Press, 2006.

IX. AUTHORS

Malcolm Slaney (Senior Member, IEEE) received his PhD at Purdue University for his work on diffraction

tomography. Since the start of his career he has been a researcher at Bell Labs, Schlumberger Palo Alto Research,

Apple’s Advanced Technology Lab, Interval Research, IBM Almaden Research Center, and most recently at Yahoo!


Research. Since 1990 he has organized the Stanford CCRMA Hearing Seminar, where he now holds the title

(Consulting) Professor. He is a coauthor (with A. C. Kak) of the book Principles of Computerized Tomographic

Imaging, which has been republished as a Classics in Applied Mathematics by SIAM Press. He is a coeditor of the

book Computational Models of Hearing. Malcolm once wondered what computer science theory researchers did

of any practical importance. He is pleasantly surprised by the applicability of LSH to signal processing problems,

and this note is partial penance.

Michael Casey (Member, IEEE) received his PhD from the MIT Media Lab’s Machine Listening Group in 1998

for research in structured audio analysis and synthesis. He was a Research Scientist at Mitsubishi Electric Research

Laboratories (MERL) and Professor of Computer Science at Goldsmiths College, University of London, where

he still holds the title of Visiting Research Professor, prior to taking up his current post as Professor of Music at

Dartmouth College, Hanover, NH, USA. Michael was a co-editor for the Audio section of the MPEG-7 International

Standard for Multimedia Content Description and principal investigator for several large grants in the UK in the field of music information retrieval, which is the main subject of his current research.
