Nearest Neighbor Search in High Dimensions
Seminar in Algorithms and Geometry
Mica Arie-Nachimson and Daniel Glasner
April 2009
Talk Outline
• Nearest neighbor problem
  – Motivation
• Classical nearest neighbor methods
  – KD-trees
• Efficient search in high dimensions
  – Bucketing method
  – Locality Sensitive Hashing
• Conclusion

Main results: Indyk and Motwani, 1998; Gionis, Indyk and Motwani, 1999
Nearest Neighbor Problem
• Input: A set P of points in R^d (or any metric space).
• Output: Given a query point q, find the point p* in P which is closest to q.
[Figure: a query q and its nearest neighbor p*]
What is it good for? Many things! Examples:
• Optical Character Recognition
• Spell Checking
• Computer Vision
• DNA sequencing
• Data compression
[Figure: OCR — handwritten digits (2, 3, 8, 7, 4, …) in feature space, with a query digit]
[Figure: spell checking — the query "abaut" near the words about, boat, bat, abate, able, scout, shout in feature space]
And many more…
Approximate Nearest Neighbor (ε-NN)
• Input: A set P of points in R^d (or any metric space).
• Given a query point q, let:
  – p* be the point in P closest to q
  – r* be the distance ||p*-q||
• Output: Some point p' with distance at most r*(1+ε)
[Figure: query q, nearest neighbor p* at distance r*, and the ball of radius r*(1+ε) around q]
Approximate vs. Exact Nearest Neighbor
• Many applications give similar results with approximate NN
• Example from Computer Vision: Retiling
  – Exact NNS ~27 sec
  – Approximate NNS ~0.6 sec
Slide from Lihi Zelnik-Manor
Solution Method
• Input: A set P of n points in R^d.
• Method: Construct a data structure to answer nearest neighbor queries
• Complexity
  – Preprocessing: space and time to construct the data structure
  – Query: time to return answer
Solution Method
• Naïve approach:
  – Preprocessing O(nd)
  – Query time O(nd)
• Reasonable requirements:
  – Preprocessing time and space poly(nd).
  – Query time sublinear in n.
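The naïve approach above is just a linear scan; a minimal sketch in Python (the function name and toy point set are ours):

```python
import math

def nearest_neighbor(P, q):
    """Naive NN: scan all n points in R^d -- O(nd) per query, no preprocessing."""
    best, best_dist = None, math.inf
    for p in P:
        d = math.dist(p, q)  # Euclidean distance
        if d < best_dist:
            best, best_dist = p, d
    return best, best_dist

# Example: 3 points in R^2
P = [(0.0, 0.0), (5.0, 5.0), (2.0, 1.0)]
print(nearest_neighbor(P, (1.0, 1.0)))  # -> ((2.0, 1.0), 1.0)
```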
Talk Outline
• Nearest neighbor problem
  – Motivation
• Classical nearest neighbor methods
  – KD-trees
• Efficient search in high dimensions
  – Bucketing method
  – Locality Sensitive Hashing
• Conclusion
Classical nearest neighbor methods
• Tree structures
  – kd-trees
• Voronoi Diagrams
  – Preprocessing poly(n), exp(d)
  – Query log(n), exp(d)
• Difficult problem in high dimensions
  – The solutions still work, but are exp(d)…
KD-tree
• d=1 (binary search tree)
[Figure: a binary search tree over the points 7, 8, 10, 12, 13, 15, 18]
KD-tree
• d=1 (binary search tree)
[Figure: the same tree with query 17 — the search descends to 18, min dist = 1]

KD-tree
• d=1 (binary search tree)
[Figure: the same tree with query 16 — the search first reaches 18 (min dist = 2), then backtracks to 15 (min dist = 1)]
KD-tree
• d>1: alternate between dimensions
• Example: d=2
[Figure: 2-d kd-tree over the points (12,5), (6,8), (17,4), (23,2), (20,10), (9,9), (1,6) — a vertical split on x separates {(12,5),(6,8),(1,6),(9,9)} from {(17,4),(23,2),(20,10)}; subsequent splits alternate between y and x]
KD-tree: complexity
• Preprocessing O(nd)
• Query
  – O(logn) if points are randomly distributed
  – worst case O(k·n^(1-1/k)), almost linear when n is close to k
• Need to search the whole tree
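The d>1 construction and the backtracking query can be sketched as follows (a toy implementation, not the course's code; median splits and the plane-distance pruning test are the standard choices):

```python
import math

def build(points, depth=0):
    """Build a kd-tree: split on dimension (depth mod d) at the median."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def query(node, q, best=None):
    """Descend toward q, then backtrack into the far subtree only when the
    splitting plane is closer than the best distance found so far."""
    if node is None:
        return best
    d = math.dist(node["point"], q)
    if best is None or d < best[1]:
        best = (node["point"], d)
    diff = q[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = query(near, q, best)
    if abs(diff) < best[1]:   # in the worst case this prunes nothing
        best = query(far, q, best)
    return best

pts = [(12, 5), (6, 8), (17, 4), (23, 2), (20, 10), (9, 9), (1, 6)]  # the d=2 example above
print(query(build(pts), (11, 5)))  # -> ((12, 5), 1.0)
```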
Talk Outline
• Nearest neighbor problem
  – Motivation
• Classical nearest neighbor methods
  – KD-trees
• Efficient search in high dimensions
  – Bucketing method
  – Locality Sensitive Hashing
• Conclusion
Sublinear solutions

Method      Query time                          Preprocessing
Bucketing   O(logn)                             n^O(1/ε²)
LSH         O(n^(1/(1+ε))) [sqrt(n) at ε=1]     O(n^(1+1/(1+ε))) [n^(3/2) at ε=1]

(All bounds are linear in d; logn factors not counted.)
Solve ε-NN by reduction

r-PLEB (Point Location in Equal Balls)
• Given n balls of radius r, for every query q, find a ball that it resides in, if one exists.
• If q doesn't reside in any ball, return NO.
[Figures: q inside the ball around p1 — return p1; q outside all balls — return NO]
Reduction from ε-NN to r-PLEB
• The two problems are connected
  – r-PLEB is like a decision problem for ε-NN
Reduction from ε-NN to r-PLEB: Naïve Approach
• Set R = the ratio between the largest and smallest distance between 2 points
• Define r = {(1+ε)^0, (1+ε)^1, …, R}
• For each r_i construct an r_i-PLEB
• Given q, find the smallest r_i which gives a YES
  – Use binary search
[Figure: nested balls of the r1-PLEB, r2-PLEB, r3-PLEB]
Reduction from ε-NN to r-PLEB: Naïve Approach
• Correctness
  – Stopped at r_i = (1+ε)^k; r_(i+1) = (1+ε)^(k+1)
  – (1+ε)^k ≤ r* ≤ (1+ε)^(k+1), so the returned point is within a (1+ε) factor of r*
Reduction from ε-NN to r-PLEB: Naïve Approach
Reduction overhead:
• Space: O(log_(1+ε) R) r-PLEB constructions
  – Size of {(1+ε)^0, (1+ε)^1, …, R} is log_(1+ε) R
• Query: O(log log_(1+ε) R) calls to r-PLEB
• Drawback: dependency on R
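The naïve reduction can be sketched with a generic oracle; here `pleb(r, q)` (a name of our choosing) answers the decision problem by returning a point in whose r-ball q lies, or None, and binary search over the radius grid finds the smallest YES:

```python
import math

def approx_nn_via_pleb(pleb, q, R, eps):
    """Binary search over the grid {(1+eps)^0, ..., >= R} for the smallest
    radius at which pleb(r, q) answers YES; YES answers are monotone in r."""
    m = math.ceil(math.log(R, 1 + eps))
    radii = [(1 + eps) ** i for i in range(m + 1)]
    lo, hi, answer = 0, len(radii) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        p = pleb(radii[mid], q)
        if p is not None:        # YES: try a smaller radius
            answer, hi = p, mid - 1
        else:                    # NO: need a larger radius
            lo = mid + 1
    return answer

# Toy oracle over a fixed point set (a real one would be an r-PLEB structure):
P = [(0.0, 0.0), (10.0, 0.0)]
pleb = lambda r, q: next((p for p in P if math.dist(p, q) <= r), None)
print(approx_nn_via_pleb(pleb, (1.2, 0.0), R=100.0, eps=0.5))  # -> (0.0, 0.0)
```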
Reduction from ε-NN to r-PLEB: Better Approach [Har-Peled 2001]
• Set r_med as the radius which gives n/2 connected components (C.C.)
• Set r_top = 4n·r_med·log(n)/ε
[Figure: balls of radius r_med around the points, and the much larger radius r_top]
Reduction from ε-NN to r-PLEB: Better Approach
• If q ∉ B(p_i, r_med) for every i but q ∈ B(p_i, r_top) for some i: set R = r_top/r_med and perform binary search on r = {(1+ε)^0, (1+ε)^1, …, R}
  – R independent of input points
• If q ∉ B(p_i, r_top) ∀ i then q is "far away"
  – Enough to choose one point from each C.C. and continue recursively with these points (accumulating error ≤ 1+ε/3)
• If q ∈ B(p_i, r_med) for some i then continue recursively on the C.C. (at most half of the points)
• Complexity overhead: how many r-PLEB queries?
  – Binary search: O(log log_(1+ε) R) = O(log(n/ε))
  – Total: O(logn)
(r,ε)-PLEB (Point Location in Equal Balls)
• Given n balls of radius r, for query q:
  – If q resides in a ball of radius r, return the ball.
  – If q doesn't reside in any ball, return NO.
  – If q resides only in the "border" of a ball (within radius r(1+ε) but not r), return either the ball or NO.
[Figures: q inside a ball — return p1; q outside all balls — return NO; q in a border — return YES or NO]
Talk Outline
• Nearest neighbor problem
  – Motivation
• Classical nearest neighbor methods
  – KD-trees
• Efficient search in high dimensions
  – Bucketing method
  – Locality Sensitive Hashing
• Conclusion
Bucketing Method [Indyk and Motwani, 1998 — solves (r,ε)-PLEB]
• Apply a grid of cell size εr/sqrt(d)
• Every ball is covered by at most k cubes
  – Can show that k ≤ C^d/ε^d for some constant C < 5
• kn cubes cover all balls
• Finite number of cubes: can use a hash table
  – Key: cube, Value: a ball it covers
• Space required: O(nk)
Bucketing Method
• Given query q
• Compute the cube it resides in [O(d)]
• Find the ball this cube intersects [O(1)]
• That ball's point is an answer to the (r,ε)-PLEB query for q
[Figure: the grid of cell size εr/sqrt(d) over the balls; depending on q's cube the answer is NO, YES, or YES-or-NO]
Bucketing Method: Complexity
• Space required: O(nk) = O(n·(1/ε)^d)
• Query time: O(d)
• If d = O(logn) [or n = O(2^d)]
  – Space required: n^O(log(1/ε))
• Else use dimensionality reduction in l2 from d to O(ε^(-2)·logn) [Johnson-Lindenstrauss lemma]
  – Space: n^O(1/ε²)
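A toy version of the bucketing structure; the cell size εr/sqrt(d) and the cube→ball hash table follow the slides, everything else (names, enumerating the bounding box of each ball rather than exactly its k covering cubes) is our simplification:

```python
import math
from itertools import product

def build_buckets(P, r, eps):
    """Hash every grid cube near some ball B(p, r) to one such p.
    Cell side is eps*r/sqrt(d), so a cube has diameter eps*r."""
    d = len(P[0])
    cell = eps * r / math.sqrt(d)
    table = {}
    for p in P:
        # enumerate cubes overlapping the bounding box of B(p, r)
        ranges = [range(math.floor((x - r) / cell), math.floor((x + r) / cell) + 1)
                  for x in p]
        for cube in product(*ranges):
            table.setdefault(cube, p)
    return table, cell

def lookup(table, cell, q):
    """O(d) to locate q's cube, O(1) expected for the hash lookup."""
    cube = tuple(math.floor(x / cell) for x in q)
    return table.get(cube)  # a ball's center point, or None (NO)

P = [(0.0, 0.0), (10.0, 10.0)]
table, cell = build_buckets(P, r=1.0, eps=0.5)
print(lookup(table, cell, (0.2, 0.1)))   # inside B((0,0), 1) -> (0.0, 0.0)
print(lookup(table, cell, (5.0, 5.0)))   # far from both balls -> None
```

The cube enumeration is exponential in d, which is exactly why the slide caps d at O(logn) via Johnson-Lindenstrauss.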
Break
Talk Outline
• Nearest neighbor problem
  – Motivation
• Classical nearest neighbor methods
  – KD-trees
• Efficient search in high dimensions
  – Bucketing method
  – Locality Sensitive Hashing
• Conclusion
Locality Sensitive Hashing
• Indyk & Motwani 98; Gionis, Indyk & Motwani 99
• A solution for (r,ε)-PLEB.
• Probabilistic construction; a query succeeds with high probability.
• Use random hash functions g: X → U (some finite range).
• Preserve "separation" of "near" and "far" points with high probability.
Locality Sensitive Hashing
• If ||p-q|| ≤ r, then Pr[g(p)=g(q)] is "high"
• If ||p-q|| > (1+ε)r, then Pr[g(p)=g(q)] is "low"
[Figure: hash tables g1, g2, g3, … — points within distance r tend to share a bucket]
A locality sensitive family
• A family H of functions h: X → U is called (P1,P2,r,(1+ε)r)-sensitive for metric d_X if for any p,q:
  – if ||p-q|| < r then Pr[ h(p)=h(q) ] > P1
  – if ||p-q|| > (1+ε)r then Pr[ h(p)=h(q) ] < P2
• For this notion to be useful we require P1 > P2
Intuition
• if ||p-q|| < r then Pr[ h(p)=h(q) ] > P1
• if ||p-q|| > (1+ε)r then Pr[ h(p)=h(q) ] < P2
[Illustration from Lihi Zelnik-Manor: two hash functions h1, h2 partitioning the plane]
Claim
• If there is a (P1,P2,r,(1+ε)r)-sensitive family for d_X then there exists an algorithm for (r,ε)-PLEB in d_X with
  – Space: O(dn + n^(1+ρ))
  – Query: O(d·n^ρ)
  where ρ = log(1/P1)/log(1/P2)
• When ε = 1: space O(dn + n^(3/2)), query O(d·sqrt(n)) (up to logn factors)
Algorithm – preprocessing
• For i = 1,…,L
  – Uniformly select k functions h1,…,hk from H
  – Set gi(p) = (h1(p), h2(p), …, hk(p))
[Figure: for hi : R^d → {0,1}, gi maps one point to (0,0,…,1) and another to (1,0,…,0)]
Algorithm – preprocessing
• For i = 1,…,L
  – Uniformly select k functions from H
  – Set gi(p) = (h1(p), h2(p), …, hk(p))
  – Compute gi(p) for all p ∈ P
  – Store resulting values in a hash table
Algorithm – query
• S ← ∅, i ← 1
• While |S| ≤ 2L:
  – S ← S ∪ {points in bucket gi(q) of table i}
  – If ∃ p ∈ S s.t. ||p-q|| ≤ (1+ε)r, return p and exit.
  – i++
• Return NO.
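A compact sketch of the whole scheme for binary vectors, using coordinate projections as the hash family (the family for the Hamming cube discussed later in the talk); all names and the parameter values are illustrative, not the k, L of the analysis:

```python
import random

def build_lsh(P, k, L, d, seed=0):
    """L hash tables; g_i concatenates k randomly chosen coordinate projections."""
    rng = random.Random(seed)
    gs = [rng.sample(range(d), k) for _ in range(L)]
    tables = []
    for g in gs:
        t = {}
        for p in P:
            t.setdefault(tuple(p[j] for j in g), []).append(p)
        tables.append(t)
    return gs, tables

def lsh_query(gs, tables, q, r, eps, L):
    """Scan buckets g_i(q); accept any (1+eps)r-near candidate,
    give up after seeing 2L candidates (as on the slide)."""
    seen = 0
    for g, t in zip(gs, tables):
        for p in t.get(tuple(q[j] for j in g), []):
            seen += 1
            if sum(a != b for a, b in zip(p, q)) <= (1 + eps) * r:
                return p
            if seen > 2 * L:
                return None
    return None

P = [(0, 0, 0, 0, 0, 0), (1, 1, 1, 1, 1, 1)]
gs, tables = build_lsh(P, k=3, L=5, d=6)
print(lsh_query(gs, tables, (0, 0, 0, 0, 0, 1), r=1, eps=1, L=5))
```

The query for (0,0,0,0,0,1) succeeds only if some g_i misses the last coordinate, which illustrates why the guarantee is probabilistic.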
Correctness
• Property I: if ||q-p*|| ≤ r then gi(p*) = gi(q) for some i ∈ 1,…,L
• Property II: the number of points p ∈ P s.t. ||q-p|| ≥ (1+ε)r and gi(p) = gi(q) for some i is less than 2L
• We show that Pr[I & II hold] ≥ 1/2 - 1/e
• Choose:
  – k = log_(1/P2)(n)
  – L = n^ρ, where ρ = log(1/P1)/log(1/P2)
Complexity
• k = log_(1/P2)(n), L = n^ρ, where ρ = log(1/P1)/log(1/P2)
• Space: L·n (hash tables) + d·n (data points) = Õ(n^(1+ρ) + dn)
• Query: L hash function evaluations + O(L) distance calculations = Õ(d·n^ρ)
Significance of k and L
[Figure: Pr[g(p) = g(q)] as a function of ||p-q||]
[Figure: Pr[gi(p) = gi(q) for some i ∈ 1,…,L] as a function of ||p-q||]
Application
• Perform NNS in R^d with the l1 distance.
• Reduce the problem to NNS in H^d', the Hamming cube of dimension d'.
• H^d' = binary strings of length d'.
• dHam(s1,s2) = number of coordinates where s1 and s2 disagree.

Embedding l1^d in H^d'
• W.l.o.g. all coordinates of all points in P are positive integers < C.
• Map the integer i ∈ {1,…,C} to (1,1,…,1,0,0,…,0): i ones followed by C-i zeros.
• Map a vector by mapping each coordinate.
• Example: {(5,3,2),(2,4,1)} → {(11111,11100,11000),(11000,11110,10000)}
• Distances are preserved.
• Actual computations are performed in the original space: O(log C) overhead.
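The unary embedding is mechanical; a short sketch reproducing the slide's example (C = 5; function names are ours):

```python
def embed(v, C):
    """Map each coordinate x in {1,...,C} to x ones followed by C-x zeros;
    l1 distances become Hamming distances in the (d*C)-dimensional cube."""
    return "".join("1" * x + "0" * (C - x) for x in v)

def d_ham(s1, s2):
    """Number of coordinates where the two strings disagree."""
    return sum(a != b for a, b in zip(s1, s2))

p, q = (5, 3, 2), (2, 4, 1)
l1 = sum(abs(a - b) for a, b in zip(p, q))
print(embed(p, 5), embed(q, 5))  # 111111110011000 110001111010000
print(l1, d_ham(embed(p, 5), embed(q, 5)))  # distances agree: 5 5
```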
A sensitive family for the Hamming cube
• H_d' = {hi : hi(b1,…,bd') = bi for i = 1,…,d'} — projections onto single coordinates
  – If dHam(s1,s2) < r, what is Pr[h(s1)=h(s2)]? At least 1-r/d'
  – If dHam(s1,s2) > (1+ε)r, what is Pr[h(s1)=h(s2)]? At most 1-(1+ε)r/d'
• So H_d' is (1-r/d', 1-(1+ε)r/d', r, (1+ε)r)-sensitive.
• Question: what are these projections in the original space?
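As a sanity check (ours, not from the slides): for a uniformly random coordinate projection the collision probability is exactly 1 - dHam(s1,s2)/d', which is what the two bounds above instantiate:

```python
from fractions import Fraction

def collision_prob(s1, s2):
    """Pr over a uniformly random coordinate i that s1[i] == s2[i],
    i.e. 1 - dHam(s1, s2)/d'."""
    d = len(s1)
    agree = sum(a == b for a, b in zip(s1, s2))
    return Fraction(agree, d)

s1, s2 = "110010", "100011"  # dHam = 2, d' = 6
print(collision_prob(s1, s2))  # 1 - 2/6 = 2/3
```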
Corollary
• We can bound ρ ≤ 1/(1+ε)
• Space: O(dn + n^(1+1/(1+ε)))
• Query: O(d·n^(1/(1+ε)))
• When ε = 1: O(dn + n^(3/2)) space, O(d·sqrt(n)) query
Recent results
• In Euclidean space
  – ρ ≤ 1/(1+ε)² + O(log log n / log^(1/3) n) [Andoni & Indyk 2008]
  – ρ ≥ 0.462/(1+ε)² [Motwani, Naor & Panigrahy 2006]
• LSH family for l_s, s ∈ [0,2) [Datar, Immorlica, Indyk & Mirrokni 2004]
• And many more.
Conclusion
• NNS is an important problem with many applications.
• The problem can be efficiently solved in low dimensions.
• We saw some efficient approximate solutions in high dimensions, which are applicable to many metrics.