Locality-Sensitive Hashing:
Finding a Needle in a Haystack
Malcolm Slaney (Yahoo! Research) and Michael Casey (Goldsmiths College, University of London)
Fifth draft. Do not distribute. (Draft of an article to be published in IEEE Signal Processing Magazine, Lecture Notes, March 2008.)
I. SCOPE
One of the more surprising changes in computing during the last few years is the wealth of data that is now
available at our fingertips. We can easily carry in our pockets thousands of songs, hundreds of thousands of images,
and hundreds of hours of video. But even with the rapid growth of computer performance, we don’t have the
processing power to search this amount of data by brute force.
This note describes a technique known as locality-sensitive hashing (LSH) that allows one to quickly find similar
entries in large databases of text, images or music. This approach belongs to a novel and interesting class of
algorithms known as randomized algorithms. A randomized algorithm does not guarantee an exact answer,
but instead provides a high-probability guarantee that it will return the correct answer or one close to it. By investing
additional computational effort in more sampling, the probability of success can be pushed as high as desired.
II. RELEVANCE
There are many problems that involve finding similar items. These problems are often solved by finding the nearest
neighbor to an object in some metric space. This is an easy problem to state, but when the database is large and
the objects are complicated the processing time grows linearly with the number of items and the complexity of the
object. LSH is most valuable when searching for near matches (as opposed to exact matches) of high-dimensional
items in very large databases. In these searches it can drastically reduce the computational time, at the cost of a
small probability of failing to find the absolute closest match.
III. PREREQUISITES
This tutorial note is based on simple geometric reasoning. Some knowledge of probability, and comfort with
the mathematics of high-dimensional spaces, is useful.
IV. PROBLEM STATEMENT
Given a query point, we wish to find the points in a large database that are closest to the query. We wish to
guarantee, with high probability 1 − δ, that we return the nearest neighbor for any query point.
Conceptually, this problem is easily solved by iterating through each point in the database and calculating the
distance to the query object. But our database may contain billions of objects—each object described by a vector
that contains hundreds of dimensions. Therefore, we wish to find a solution that does not depend on a linear search
of the database.
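For concreteness, here is what the brute-force search looks like (a minimal Python sketch using numpy; the database and query are random stand-ins):

    import numpy as np

    rng = np.random.default_rng(0)
    database = rng.standard_normal((100_000, 100))  # N points in n dimensions
    query = rng.standard_normal(100)

    # One pass over every point: O(Nn) work for every single query.
    distances = np.linalg.norm(database - query, axis=1)
    nearest = int(np.argmin(distances))

The rest of this note is about avoiding that full pass over the database.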
A. Trees
In a one-dimensional world, it is easy to search for values by building a tree of objects. Then, given a query we
start at the top node, ask if our query object is to the left or to the right of the current node, and then recursively
descend the tree. If the tree is properly constructed, this solves the query problem in O(log N) time, where N is the
number of objects. In a one-dimensional world, this is a binary search; the k-d tree algorithm is a multi-dimensional
version of this idea [4]. But multi-dimensional algorithms, such as k-d trees, break down when the dimensionality
of the search space is greater than a few dimensions: we end up testing nearly all the nodes in the data set and
the computational complexity grows to O(N).
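A sketch of the tree approach, using scipy's cKDTree as an off-the-shelf k-d tree (the data here is an arbitrary stand-in):

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    points = rng.standard_normal((10_000, 3))       # low-dimensional data: trees shine
    tree = cKDTree(points)
    dist, idx = tree.query(rng.standard_normal(3))  # O(log N) on average
    # With hundreds of dimensions the same query degenerates toward O(N):
    # the tree must visit nearly every leaf to rule out closer points.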
B. Hashes
Many search problems are solved using conventional computer hashing algorithms. A hash table is a data structure
that allows one to quickly map between a symbol (e.g., a string) and a value. This is done by calculating a
pseudo-random function that maps the symbol into an integer that indexes a table. Thus a symbol
with dozens of characters, and perhaps hundreds of bits of data, is mapped to a relatively small index into the table.
A collision occurs when two symbols hash to the same value, and there are special provisions to allow more than one
symbol per hash value. But a well-designed hash table allows a symbol lookup in O(1) time with O(N) memory,
where N is the number of entries in the table.
A hash table returns an exact match. A well-designed hash function sends two symbols to different buckets
even when the symbols are close together. This makes a hash table a good means of finding exact matches, but not
approximate matches. By contrast, a locality-sensitive hash is an efficient means of finding near matches.
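The distinction is easy to see with an ordinary hash table (a two-line Python illustration):

    # A conventional hash table returns exact matches only: keys that are
    # nearly identical still land in unrelated buckets.
    table = {"needle": 42}
    print(table.get("needle"))   # 42 -- exact match, found in O(1)
    print(table.get("needles"))  # None -- one character away, no match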
V. SOLUTION
LSH is based on a simple idea. Consider the most general view: If two points are close together, then after a
“projection” operation these two points will remain close together. Figure 1 illustrates the basic idea. Two points
that are close together on the sphere are also close together when the sphere is imaged (or projected) onto the
two-dimensional page. This is true no matter how we rotate the sphere. Two other points on the sphere that are far
apart will, for some orientations, be close together on the page, but it is more likely that the points will remain far
apart. We will describe two different kinds of projection operators, but thinking about rendering a multi-dimensional
sphere onto a two-dimensional page is a good metaphor.
Given a random “projection” operation, we note which points are close to our query. A projection maps a data
point from a high-dimensional space to a low-dimensional subspace. We create projections from a number of
different directions and keep track of the nearby points. We keep a list of these found points and note the points
that appear close to each other in more than one projection. Part of the art of solving this problem is defining a
Fig. 1. Two examples showing projections of two close points (circles) and two distant points (squares) onto the printed page.
notion of “nearby” (a threshold) so that we keep track of a manageable number of points. Commonly, the projection
operation projects the points onto a line, so the similarity test is a simple comparison.
There are two common ways to do these projections. The original formulation for LSH assumed all points are
described by a large number of binary features [10]. Projections are formed by selecting a subset of the dimensions.
A better formulation for signal processing applications computes a dot product with Gaussian vectors, creating
arbitrary projections [2].
Calculating the distance between two objects, x and y, in a binary feature space is easy with the Hamming
distance
$$D = \sum_i \left( x_i \oplus y_i \right) \qquad (1)$$
where $x_i$ is the value of the i'th feature for object x and $\oplus$ is the exclusive-or operator. We implement a locality-sensitive hash by performing the Hamming calculation over subsets of the dimensions. On average, points that are
close together because they share many of the same features will remain close together in a random subspace.
Binary features are not a limitation for signal processing because integers can be represented with a unary code.
Thus in an N-bit unary code, bit i, for 0 ≤ i < N, represents the feature that the value of the number is less than
i. In an implementation of LSH, this bit vector does not need to be calculated explicitly. Instead, we expand the
calculation in Equation (1) to include the numerical comparison for bit i.
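A small numpy sketch of the binary formulation (the feature vectors and subset size are arbitrary choices): the Hamming distance of Equation (1), and one hash formed from a random subset of the dimensions.

    import numpy as np

    rng = np.random.default_rng(0)
    nbits = 256
    x = rng.integers(0, 2, nbits)                    # two binary feature vectors
    y = x.copy()
    y[rng.choice(nbits, 10, replace=False)] ^= 1     # y differs from x in 10 bits

    hamming = int(np.sum(x != y))                    # Equation (1): D = 10

    subset = rng.choice(nbits, 16, replace=False)    # one random projection
    print(hamming, tuple(x[subset]) == tuple(y[subset]))
    # Points sharing most features usually agree on a random subset of them.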
A. Random Projections
The key to locality-sensitive hashes of the query point $\vec{v}$ from a real-valued high-dimensional space is the dot
product
$$h(\vec{v}) = \vec{v} \cdot \vec{x} \qquad (2)$$
where the elements of the vector $\vec{x}$ are chosen at random from a Gaussian distribution, for example $\mathcal{N}(0, 1)$. This
scalar projection is then quantized into a set of hash bins, with the intention that nearby items in the original space
will fall into the same bin. The full hash function is given by
$$h_{\vec{x},b}(\vec{v}) = \left\lfloor \frac{\vec{x} \cdot \vec{v} + b}{w} \right\rfloor \qquad (3)$$
where $\lfloor \cdot \rfloor$ is the floor operation and 0 ≤ b < w is a uniformly distributed random variable that makes the quantization
error easier to analyze, with no loss in performance.
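Equation (3) is only a few lines of code (a sketch; the dimensionality and bucket width here are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n, w = 100, 4.0                        # dimensionality and bucket width
    x = rng.standard_normal(n)             # random Gaussian projection vector
    b = rng.uniform(0, w)                  # random offset, 0 <= b < w

    def h(v):
        # Equation (3): project onto x, shift by b, quantize into a bucket.
        return int(np.floor((x @ v + b) / w))

    v = rng.standard_normal(n)
    print(h(v), h(v + 0.01 * rng.standard_normal(n)))  # usually the same bucket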
For the projection operator to be useful, it must project nearby points to positions that are close together.
Thus, for any points p and q in $\mathbb{R}^d$, we want a high probability, $P_1$, that two close points fall into the same bucket:
$$P_H[h(p) = h(q)] \geq P_1 \quad \text{for} \quad \|p - q\| \leq R_1 \qquad (4)$$
and we want a low probability, $P_2 < P_1$, that two points that are far apart, with $R_2 > R_1$, fall into the same bucket:
$$P_H[h(p) = h(q)] \leq P_2 \quad \text{for} \quad \|p - q\| \geq cR_1 = R_2. \qquad (5)$$
Because of the linearity of the dot product, the difference between the two projected points, $|h(p) - h(q)|$, has a
distribution whose scale is proportional to $\|p - q\|_2$. By this argument we see that $P_1 > P_2$. We further
magnify the difference between $P_1$ and $P_2$, thus increasing the performance of each projection, by performing k
dot products in parallel. This increases the ratio of the collision probability for nearby points over that for not-so-close points:
$$\left( \frac{P_1}{P_2} \right)^k > \frac{P_1}{P_2}. \qquad (6)$$
The k independent hashes together form a projection we call $g_j$. The projection $g_j$ transforms the query point $\vec{v}$ into k real
numbers. For efficient comparisons, we put all points (the query points and all the points in the database) into
buckets, quantizing the hash values as in Equation (3), with the hope that similar points will fall in the same bucket.
The width of the buckets, w, determines how many points collide. A small value for w means there is a bigger
table and fewer nearest neighbor points to check; a large value means we have to sort through many points to find
the true nearest neighbors.
Within each set of k dot products that form a projection, we achieve success only if the query and the nearest neighbor
fall in the same bin for all k dot products. Hence, the success probability $P_1^k$ falls as we include more dot products. However, when we
repeat this L times, only some of the projections will fail to find the nearest neighbor. This gives us additional
error tolerance. Thus, we form L of these projections to reach the desired level of probability. By increasing L we
can find the nearest neighbor with arbitrarily high probability. We will discuss how these parameters are chosen in
Section V-C; a sketch of the whole index follows.
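Putting k and L together gives the whole index. The sketch below is our own minimal rendering of the idea (all names are illustrative; real implementations such as E2LSH add the hashing tricks of the next subsection):

    import numpy as np
    from collections import defaultdict

    class LSHIndex:
        def __init__(self, n, k, L, w=4.0, seed=0):
            rng = np.random.default_rng(seed)
            self.X = rng.standard_normal((L, k, n))  # L groups of k directions
            self.b = rng.uniform(0, w, (L, k))       # one offset per dot product
            self.w = w
            self.tables = [defaultdict(list) for _ in range(L)]

        def _keys(self, v):
            # g_j(v): k quantized dot products (Equation (3)) per projection.
            return [tuple(np.floor((X_j @ v + b_j) / self.w).astype(int))
                    for X_j, b_j in zip(self.X, self.b)]

        def add(self, i, v):
            for table, key in zip(self.tables, self._keys(v)):
                table[key].append(i)

        def query(self, v):
            # Candidates: every point that collides with v in at least one table.
            return {i for table, key in zip(self.tables, self._keys(v))
                      for i in table.get(key, ())}

The candidates returned by query() are then checked with exact distance calculations; only this short list, rather than the whole database, is searched.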
B. Implementation
At this point, we have mapped a data point into a hash bucket described by k integer indices. This k-dimensional
space is sparse, but we can use conventional hashes, with no loss of performance, to efficiently find the right bucket.
For illustration, we describe the approach used by E2LSH [1], but more sophisticated approaches based on reusing
hashes [2], or on different spatial tilings [3], are also possible. Even after the projection operator, one still must find
nearest neighbors along a line. A naïve algorithm could easily take O(log N) operations, but we reduce this to O(1)
operations using a pair of conventional hash tables.
One projection of the LSH algorithm using dot products is shown in Figure 2. This processing puts one point
through k dot products and then stores a fingerprint (T2) that describes the k-dimensional result into a hash bucket
computed by hash T1. Collisions are potential nearest neighbors. This single projection allows us to find nearby
points in a small fraction of the time it takes to look at all the points in the database.
Fig. 2. A block diagram of one LSH projection.
We first use a conventional hash to map the k-dimensional projection output into a single linear index. This is
implemented by computing
$$T_1 = \left( \sum_i H_{1i} k_i \right) \bmod P_1 \qquad (7)$$
where the $k_i$ are the k integer indices that describe the bucket, the $H_{1i}$ are fixed random integer weights, and $P_1$
is an arbitrary prime number (and hash-table size). With a well-designed hash, the hash table is efficient, but
we still have a chance that two points will collide under $T_1$. Thus, we need to chain the entries in a bucket and
verify that we have found the right entry by comparing the query's k-dimensional projection value to the database
values. This calculation grows as k gets larger, in addition to taking more space. Instead, we store a fingerprint,
similar to $T_1$, for the projection vector. The fingerprint is calculated with
$$T_2 = \left( \sum_i H_{2i} k_i \right) \bmod P_2 \qquad (8)$$
with a second set of random weights, $H_{2i}$, and a second prime, $P_2$.
Now when checking to see if the query projection is in the T1 hash bucket, we just compare the fingerprints. Even
with a 16-bit fingerprint as determined by P2, the chance of a mistake is very small.
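A sketch of the two-hash bookkeeping (a hypothetical rendering; the weight vectors H1 and H2 and the primes are our own illustrative choices, not E2LSH's actual constants):

    import numpy as np

    rng = np.random.default_rng(0)
    k = 10
    P1 = 2_147_483_647                    # prime hash-table size for T1
    P2 = 65_521                           # prime giving a 16-bit fingerprint T2
    H1 = rng.integers(1, P1, k)           # random integer weights, Equation (7)
    H2 = rng.integers(1, P2, k)           # random integer weights, Equation (8)

    def t1_t2(bucket):
        # bucket: the k integer indices produced by one projection g_j.
        bucket = np.asarray(bucket, dtype=np.int64)
        return int(H1 @ bucket % P1), int(H2 @ bucket % P2)

    # Store (T2, point_id) in the chain at table slot T1. At query time we
    # compare only the small T2 fingerprints, not all k indices.
    print(t1_t2([3, -1, 0, 2, 7, 5, -4, 1, 0, 6]))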
C. s-Stable Distributions
As described above, any projection operation can be used to reduce the data to a lower-dimensional space.
Forming a dot product with a vector whose elements are drawn from an s-stable distribution simplifies
the analysis of the performance of LSH. A weighted sum of random variables from an s-stable distribution has a
probability distribution in the same family as the original distribution. More formally, a distribution D is s-stable if,
for any independent, identically distributed (iid) random variables $X_1, \ldots, X_n$ distributed according to D and any
real numbers $v_1, \ldots, v_n$, the random variable $\sum_i v_i X_i$ has the same probability distribution as the random
variable
$$\left( \sum_i |v_i|^s \right)^{1/s} X \qquad (9)$$
where X is drawn from D. For s = 2, corresponding to the $L_2$ norm, the Gaussian distribution is s-stable.
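The 2-stable property is easy to verify numerically (a quick sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    v = np.array([3.0, -1.0, 2.0])                  # arbitrary real weights
    sums = rng.standard_normal((100_000, 3)) @ v    # samples of sum_i v_i X_i
    # Equation (9) predicts a Gaussian with scale (sum_i |v_i|^2)^(1/2).
    print(sums.std(), np.sqrt(np.sum(v ** 2)))      # both close to 3.742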
Using an s-stable distribution for our projections allows us to analytically describe the performance of LSH. We
start by calculating the probability that two points, p and q, separated by distance $u = \|p - q\|_2$, collide and fall
into the same hash bucket. The projections of two close points will always be close, but because of quantization
they might fall on opposite sides of a bucket boundary and thus land in different buckets. The probability that these two
points hash to the same value is given by
$$p(u) = \Pr_{a,b}\left[ h_{a,b}(p) = h_{a,b}(q) \right] = \int_0^w \frac{1}{u}\, f_s\!\left( \frac{t}{u} \right) \left( 1 - \frac{t}{w} \right) dt \qquad (10)$$
where $f_s$ is the probability density function (pdf) of the s-stable variable underlying the hash of Equation (3) and the $1 - t/w$ term
represents the probability that the two points fall in the same bin of width w. For any given bucket width, w, this
probability falls as the distance u grows.
This probability can be used to calculate the probabilities in Equations (4) and (5) for an $L_2$ space with $R_1$ equal
to the bin width w [2]:
$$P_2 = 1 - 2F(-w/c) - \frac{2}{\sqrt{2\pi}\, w/c} \left( 1 - e^{-w^2/(2c^2)} \right). \qquad (11)$$
Here F(·) is the cumulative distribution function of a Gaussian random variable. $P_1$ is found by setting c = 1.
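Both forms are easy to evaluate numerically (a sketch using scipy; for the Gaussian case, $f_s$ is the pdf of the absolute value of a standard normal, 2ϕ(t)):

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    w = 4.0

    def p_integral(u):
        # Equation (10) with f_s(t) = 2*phi(t), the pdf of |N(0, 1)|.
        f = lambda t: (1.0 / u) * 2.0 * norm.pdf(t / u) * (1.0 - t / w)
        return quad(f, 0.0, w)[0]

    def p_closed(c):
        # Equation (11); c = 1 gives P1.
        return (1.0 - 2.0 * norm.cdf(-w / c)
                - 2.0 / (np.sqrt(2.0 * np.pi) * w / c)
                * (1.0 - np.exp(-w ** 2 / (2.0 * c ** 2))))

    print(p_integral(1.0), p_closed(1.0))   # the two forms agree: about 0.80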
$P_1^k$ is the probability that a single projection puts the true nearest neighbor in the same bucket as the query, so the probability that all
L projections fail to produce a collision between the query and the true nearest neighbor is $(1 - P_1^k)^L$.
If we want the probability that our algorithm fails to find the true nearest neighbor to be no more than δ, then L
must be at least
$$L = \left\lceil \frac{\log \delta}{\log\left( 1 - P_1^k \right)} \right\rceil \qquad (12)$$
for a fixed value of k, to be determined.
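Equation (12) then fixes L. For instance (with illustrative values of δ, $P_1$, and k):

    import math

    delta, P1, k = 0.01, 0.8, 10     # 99% success; P1 from Equation (11)
    L = math.ceil(math.log(delta) / math.log(1 - P1 ** k))
    print(L)                         # 41 projections for these values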
The amount of time needed to find a nearest neighbor is the time needed to calculate the hash functions plus
the time needed to search the buckets for collisions. Because there are kL projections of n-dimensional vectors, the
first time, $T_g$, is O(nkL), where n is the dimensionality of the search space. The second time, $T_c$, grows linearly
with the expected number of collisions for each projection, $T_c = O(d L N_c)$, where d is the average number of
points in each bucket. The expected number of collisions for a single projection is
$$N_c = \sum_{q' \in D} p^k(\|q - q'\|) \qquad (13)$$
where p(·) from Equation (10) gives the probability that each point contributes to a collision and D represents all the
points in the database. It is easy to see that $T_g$ increases as a function of k, while $T_c$ decreases, since $p^k < p$ for
p < 1 and k > 1.
The E2LSH algorithm finds the best value for k by experimentally evaluating the cost of the calculation for
samples of the given data set. E2LSH scales the data so that w is always equal to 4. In the cover-song experiments
described elsewhere [8], E2LSH used between 7 and 14 dot products per hash (k) and more than 150 projections
(L).
VI. APPLICATIONS
The LSH algorithm, and related randomized algorithms, make it possible to quickly find nearest neighbors in
very large databases. Conventional hashes work well for finding exact matches, but do not help us find neighbors.
Instead, a hashing algorithm, as we have described, needs to take into account the locality of the points so that
nearby points remain nearby after hashing. These locality-sensitive hashes have been applied in a number of domains,
which we now use to illustrate the idea.
A. Web
A randomized algorithm was first applied to finding duplicate pages on the web. The web is full of duplicate
pages, partly because content is duplicated across sites, and partly because there is more than one URL that points
to the same file on a disk. Yet search engines don't want to return 10 copies of the same page. A solution is based
on shingles: each shingle represents a portion of a web page and is computed by forming a histogram of the
words found within that portion of the page. We can test whether a portion of a page is duplicated elsewhere on
the web by looking for other shingles with the same histogram. Given that there are billions of pages on the web,
and any portion of any page might be a duplicate, there are a large number of shingles to test.
Broder [5] solved this problem by considering random selections, analogous to LSH, to test the similarity of
pages. If the shingles of the new page match shingles from the database, then it is likely that the new page bears a strong
resemblance to an existing page. The nearest-neighbor solution is important because web pages are surrounded by
navigational and other information that changes from site to site. An approximate solution to this problem is fine,
especially when balanced with the computational savings of a solution like LSH.
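A minimal sketch of the random-selection idea, in the min-hash style that underlies Broder's scheme (for brevity, the shingles here are word 4-grams rather than the histograms described above):

    import random

    def shingles(text, n=4):
        words = text.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def signature(shingle_set, num_hashes=64, seed=0):
        rng = random.Random(seed)
        salts = [rng.getrandbits(32) for _ in range(num_hashes)]
        # One random selection per salt: keep the shingle with the smallest hash.
        return [min(hash((salt, s)) for s in shingle_set) for salt in salts]

    a = shingles("the quick brown fox jumps over the lazy dog every day")
    b = shingles("the quick brown fox jumps over the lazy cat every day")
    sig_a, sig_b = signature(a), signature(b)
    match = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
    print(match)   # estimates the Jaccard similarity of the two shingle sets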
B. Image retrieval
A second application of LSH is for object recognition [11]. We compute a detailed metric for many different
orientations and configurations of an object we want to recognize. Then, given a new image we simply check our
database to see if a pre-computed object’s metrics are close to our query. This database can contain millions of
these poses. Using LSH allows us to quickly check whether the query object is known. A similar idea was applied
to genomic data [6].
C. Music retrieval
We use conventional hashes to find exact musical matches. Fingerprints are representations of an audio signal
that are robust to the common types of abuse performed on audio before it reaches our ears [7]. A fingerprint can be
computed, for example, by noting the peaks in the spectrum, because they are robust to noise, and encoding their positions
in time and frequency. One then just has to query the database for the same fingerprint. With such robust features, one
can use conventional hashing, especially when looking at many samples over time, because one only needs to find
one exact match to reduce the search space.
To find similar songs, as might happen when a song is remixed for a new audience, or more dramatically when
a different artist performs the same song, we cannot use a fingerprint. Instead, we can use several seconds of the
song, a snippet, as a shingle. To determine whether two songs are similar, we query the database and see whether a large
enough number of the query shingles are close to one song in the database [8]. Closeness depends on the feature
vector, but long shingles provide specificity and make LSH all the more important. This similarity measure matters
in the Internet era because it lets us eliminate duplicates to improve search results and link recommendation data
between similar songs. LSH makes this nearest-neighbor check feasible given the size of the database.
VII. CONCLUSIONS - WHAT WE HAVE LEARNED
In this note, we have described the theory and implementation of a randomized algorithm known as locality-
sensitive hashing (LSH). Unlike conventional computer hashes that are designed to return exact matches in O(1)
time, an LSH algorithm uses dot products with random vectors to quickly find nearest neighbors. LSH provides a
probabilistic guarantee that it will return the correct answer. In systems that have other sources of error, perhaps due
to mislabeled data or the difficulties of pattern recognition, one can push the error due to LSH below those other sources
of error and gain a significant reduction in computational effort. These randomized algorithms are important in
today’s world of Internet-sized databases.
VIII. ACKNOWLEDGMENTS
We appreciate thoughtful comments we have received from Alex Jaffe, Sara Anderson, and several reviewers.
REFERENCES
[1] Alexandr Andoni and Piotr Indyk. E2LSH 0.1 User Manual. http://web.mit.edu/andoni/www/LSH, June 21, 2005
[2] A. Andoni, M. Datar, N. Immorlica, V. Mirrokni. Locality-sensitive hashing using stable distributions. In Nearest Neighbor Methods
in Learning and Vision: Theory and Practice, T. Darrell, P. Indyk, G. Shakhnarovich (eds.), MIT Press, 2006.
[3] Alexandr Andoni and Piotr Indyk. Near-Optimal Hashing Algorithms for Near Neighbor Problem in High Dimensions. In Proceedings
of the Symposium on Foundations of Computer Science (FOCS’06), 2006.
[4] J.L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18:509–517, 1975.
[5] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. In Proc. of WWW,
pages 1157–1166, Santa Clara, CA, 1997.
[6] Jeremy Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17(5):419–428, 2001.
[7] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of algorithms for audio fingerprinting. In International Workshop on
Multimedia Signal Processing, December 2002.
[8] Michael Casey and Malcolm Slaney. Fast Recognition of Remixed Music Audio. In Proceedings ICASSP 2007, Volume 4, IV-1425–
IV-1428 , 2007.
[9] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th
International Conference on Very Large Data Bases (VLDB), pages 518–529, 1999.
[10] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality, in STOC, 1998.
[11] Gregory Shakhnarovich, Paul Viola, Trevor Darrell. Fast Pose Estimation with Parameter-Sensitive Hashing. In Nearest Neighbor
Methods in Learning and Vision: Theory and Practice, T. Darrell, P. Indyk, G. Shakhnarovich (eds.), MIT Press, 2006.
IX. AUTHORS
Malcolm Slaney (Senior Member, IEEE) received his PhD at Purdue University for his work on diffraction
tomography. Since the start of his career he has been a researcher at Bell Labs, Schlumberger Palo Alto Research,
Apple’s Advanced Technology Lab, Interval Research, IBM Almaden Research Center, and most recently at Yahoo!
Research. Since 1990 he has organized the Stanford CCRMA Hearing Seminar, where he now holds the title
(Consulting) Professor. He is a coauthor (with A. C. Kak) of the book Principles of Computerized Tomographic
Imaging, which has been republished as a Classics in Applied Mathematics by SIAM Press. He is a coeditor of the
book Computational Models of Hearing. Malcolm once wondered what computer science theory researchers did
of any practical importance. He is pleasantly surprised by the applicability of LSH to signal processing problems,
and this note is partial penance.
Michael Casey (Member, IEEE) received his PhD from the MIT Media Lab’s Machine Listening Group in 1998
for research in structured audio analysis and synthesis. He was a Research Scientist at Mitsubishi Electric Research
Laboratories (MERL) and Professor of Computer Science at Goldsmiths College, University of London, where
he still holds the title of Visiting Research Professor, prior to taking up his current post as Professor of Music at
Dartmouth College, Hanover, NH, USA. Michael was a co-editor for the Audio section of the MPEG-7 International
Standard for Multimedia Content Description and principal investigator for several large grants in the UK in the
field of music information retrieval, which is the main subject of his current research.