Given by: Erez Eyal, Uri Klein
Lecture Outline
Exact Nearest Neighbor search
- Definition
- Low dimensions
- KD-Trees
Approximate Nearest Neighbor search (LSH based)
- Locality Sensitive Hashing families
- Algorithm for Hamming Cube
- Algorithm for Euclidean space
Summary
(Overview / Detailed)
Nearest “Neighbor” Search for Homer Simpson
Features:
- Home planet distance
- Height
- Weight
- Color
Nearest Neighbor (NN) Search
Given: a set P of n points in R^d (d - dimension)
Goal: a data structure, which given a query point q, finds the nearest neighbor p of q in P (in terms of some distance function D)
Nearest Neighbor Search
Interested in designing a data structure with the following objectives:
- Space: O(dn)
- Query time: O(d log(n))
- Data structure construction time is not important
Simple cases: 1-D (d = 1)
A binary search will give the solution
Space: O(n); Time: O(log(n))
Example: searching for q = 9 among the sorted points 1 4 7 8 13 19 25 32
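The 1-D case can be sketched with Python's `bisect` module (an illustrative helper; the name `nn_1d` is mine, and the points and query are the slide's example):

```python
# Exact 1-D nearest neighbor via binary search: O(log n) per query.
import bisect

def nn_1d(points, q):
    """Return the nearest neighbor of q in a sorted list of numbers."""
    i = bisect.bisect_left(points, q)          # first index with points[i] >= q
    candidates = points[max(0, i - 1):i + 1]   # at most two neighbors to compare
    return min(candidates, key=lambda p: abs(p - q))

points = [1, 4, 7, 8, 13, 19, 25, 32]
print(nn_1d(points, 9))   # 8 (distance 1, vs. 13 at distance 4)
```

Only the two points adjacent to the insertion position can be nearest, so a single binary search plus one comparison suffices.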
Simple cases: 2-D (d = 2)
Using Voronoi diagrams will give the solution
Space: O(n^2); Time: O(log(n))
KD-Trees
A KD-tree is a data structure based on recursively subdividing a set of points with alternating axis-aligned hyperplanes.
The classical KD-tree uses O(dn) space and answers queries in time logarithmic in n on average (worst case O(n)), but exponential in d.
KD-Trees Construction
[Figure: eleven points (1-11) recursively split by lines l1-l10, shown next to the resulting tree whose internal nodes are l1-l10 and whose leaves are the points]
KD-Trees Query
[Figure: the same subdivision and tree, now with a query point q located in one of the cells]
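The construction and query steps in the figures can be written compactly. This is an illustrative sketch, not the lecture's code: the function names, the median-split rule, and the squared-distance bookkeeping are my choices.

```python
# A minimal KD-tree: build by alternating axis-aligned splits at the median,
# query with branch-and-bound pruning. Worst-case query is O(n), as noted above.
def build(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])              # alternate the splitting axis
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def nearest(node, q, best=None):
    if node is None:
        return best
    d2 = sum((a - b) ** 2 for a, b in zip(node["point"], q))
    if best is None or d2 < best[1]:
        best = (node["point"], d2)             # best = (point, squared distance)
    diff = q[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, q, best)              # search the near side first
    if diff ** 2 < best[1]:                    # ball crosses the split plane?
        best = nearest(far, q, best)           # only then visit the far side
    return best

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2))[0])   # (8, 1)
```

The pruning test `diff ** 2 < best[1]` is what makes queries fast in low dimension; in high dimension the ball around q crosses almost every split plane, which is exactly the exponential-in-d behavior the slide mentions.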
A conjecture: “The curse of dimensionality”
“However, to the best of our knowledge, lower bounds for exact NN Search in high dimensions do not seem sufficiently convincing to justify the curse of dimensionality conjecture” (Borodin et al. ’99)
The conjecture: for an exact solution, any algorithm for high dimension must use either n^ω(1) space or have d^ω(1) query time
Why Approximate NN?
- Approximation allows a significant speedup of the calculation (on the order of tens to hundreds)
- Fixed-precision arithmetic on computers causes approximation anyway
- Heuristics are used for mapping features to numerical values (causing uncertainty anyway)
Approximate Nearest Neighbor (ANN) Search
Given: a set P of n points in R^d (d - dimension) and a slackness parameter ε > 0
Goal: a data structure, which given a query point q whose nearest neighbor in P is a, finds any p s.t. D(q, p) ≤ (1+ε)·D(q, a)
[Figure: q, its exact nearest neighbor a, and the ball of radius (1+ε)·D(q, a) of acceptable answers]
Locality Sensitive Hashing
A (r1, r2, P1, P2)-Locality Sensitive Hashing (LSH) family is a family of hash functions s.t. for a random hash function h and for any pair of points a, b we have:
- D(a, b) ≤ r1 ⟹ Pr[h(a) = h(b)] ≥ P1
- D(a, b) ≥ r2 ⟹ Pr[h(a) = h(b)] ≤ P2
(r1 < r2, P1 > P2)
[Indyk-Motwani ’98]
(A common method to reduce dimensionality without losing distance information)
Hamming Cube
A d-dimensional Hamming cube Q_d is the set {0, 1}^d
For any a, b ∈ Q_d we define the Hamming distance H:
H(a, b) = Σ_{i=1..d} |a_i − b_i|
LSH – Example in the Hamming Cube
Take the family {h | h(a) = a_i, i ∈ {1, …, d}} (a random h samples one coordinate)
Pr[h(q) = h(a)] = 1 − H(q, a)/d
Pr is a monotonically decreasing function of H(q, a)
Multi-index hashing: take {g | g(a) = (h_1(a), h_2(a), …, h_k(a))}
Pr[g(q) = g(a)] = (1 − H(q, a)/d)^k
Pr is a monotonically decreasing function of k
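The bit-sampling family and its amplification by concatenation can be checked empirically. This is an illustrative sketch with ad hoc parameters (the helper `make_g` and all constants are mine); sampling the k coordinates independently with replacement gives the collision probability (1 − H(q, a)/d)^k exactly.

```python
# Bit-sampling LSH on the Hamming cube: h(a) = a_i for a random coordinate i,
# so Pr[h(q) = h(a)] = 1 - H(q, a)/d; concatenating k independent samples
# gives Pr[g(q) = g(a)] = (1 - H(q, a)/d)^k.
import random

def make_g(d, k, rng):
    coords = [rng.randrange(d) for _ in range(k)]   # k random bit positions
    return lambda a: tuple(a[i] for i in coords)

rng = random.Random(0)
d, k, trials = 16, 4, 20000
q = [rng.randrange(2) for _ in range(d)]
a = list(q)
a[0] ^= 1
a[1] ^= 1                                           # H(q, a) = 2
hits = 0
for _ in range(trials):
    g = make_g(d, k, rng)                           # fresh random g each trial
    hits += g(q) == g(a)
print(round(hits / trials, 3))   # close to (1 - 2/16)**4 ≈ 0.586
```

Increasing k sharpens the gap between collision probabilities of near and far points, which is exactly the amplification the slide describes.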
LSH – ANN Search: Basic Scheme
Preprocess:
- Construct several such ‘g’ functions for each l ∈ {1, …, d}
- Store each a ∈ P at the place g_i(a) of the corresponding hash table
Query:
- Perform a binary search on l
- In each step retrieve g_i(q) (for the current l, if it exists)
- Return the last non-empty result
ANN Search in the Hamming Cube
β-test: pick a subset C of {1, 2, …, d} by including each index independently, at random, w.p. β; for each i ∈ C, pick independently and uniformly r_i ∈ {0, 1}
For any a ∈ Q_d: σ(a) = Σ_{i∈C} r_i·a_i (mod 2)
(Equivalently, we may pick R ∈ {0, 1}^d s.t. R_i is 1 w.p. β, and the test is an inner product of R and a. Such an R represents a β-test.)
[Kushilevitz et al. ’98]
ANN Search in the Hamming Cube
Define: δ(a, b) = Pr[σ(a) ≠ σ(b)]
For a query q, let H(a, q) ≤ l and H(b, q) > (1+ε)l. Then for β = 1/(2l):
δ(a, q) ≤ δ1 < δ2 ≤ δ(b, q)
where:
δ1 = ½·(1 − (1 − 1/(2l))^l)
δ2 = ½·(1 − (1 − 1/(2l))^((1+ε)l))
and δ2 − δ1 = Ω(1 − e^(−ε/2))
ANN Search in the Hamming Cube
Data structure: S = {S_1, …, S_d}, with positive integer parameters M, T
For any l ∈ {1, …, d}, S_l = {ρ_1, …, ρ_M}
For any j ∈ {1, …, M}, ρ_j consists of a set {t_1, …, t_T} (each t_k is a (1/(2l))-test) and a table A_j of 2^T entries
ANN Search in the Hamming Cube
In each S_l, construct ρ_j as follows:
- Pick {t_1, …, t_T} independently at random
- For v ∈ Q_d, the trace is t(v) = (t_1(v), …, t_T(v)) ∈ {0, 1}^T
- An entry z ∈ {0, 1}^T in A_j contains a point a ∈ P if H(t(a), z) ≤ (δ1 + (1/3)(δ2 − δ1))·T (else it is empty)
The space complexity: O(d·M·(d·T + 2^T·log(n)) + d·n)
ANN Search in the Hamming Cube
For any query q and a, b ∈ P s.t. H(q, a) ≤ l and H(q, b) > (1+ε)l, it can be proven using Chernoff bounds that (writing Δ = δ2 − δ1):
Pr[H(t(q), t(a)) > (δ1 + Δ/3)·T] ≤ e^(−2Δ²T/9)
Pr[H(t(q), t(b)) ≤ (δ2 − Δ/3)·T] ≤ e^(−2Δ²T/9)
[Alon & Spencer ’92]
This gives the result that the trace t functions, in essence, as an LSH family
(When the event presented in these inequalities occurs for some ρ_j in S_l, ρ_j is said to ‘fail’)
ANN Search in the Hamming Cube
Search Algorithm: we perform a binary search on l. In every step:
- Pick ρ_j in S_l uniformly, at random
- Compute t(q) from the list of tests in ρ_j
- Check the entry labeled t(q) in A_j:
  - If the entry contains a point from P, restrict the search to lower l’s
  - Otherwise, restrict the search to greater l’s
Return the last non-empty entry in the search
ANN Search in the Hamming Cube
Search Algorithm, as a flowchart: initialize l = d/2; access S_l, choose ρ_j and calculate t(q); if A_j(t(q)) is empty, continue the search in the upper half of the range of l, otherwise record Res ← A_j(t(q)) and continue in the lower half; stop once the range of l is covered.
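The binary search over l can be sketched end to end with a simplified, ad hoc stand-in for the lecture's structures: plain bit-sampling hash tables per radius instead of the ρ_j test tables, with all parameter choices (table counts, key lengths, data sizes) mine rather than the lecture's M and T.

```python
# Simplified LSH-based ANN on the Hamming cube: per radius l, keep several
# hash tables keyed by concatenated bit samples; binary-search on l and
# keep the best point from the last non-empty bucket.
import random

def build(P, d, tables=8, seed=2):
    rng = random.Random(seed)
    S = {}
    for l in range(1, d + 1):
        k = max(1, d // l)                     # shorter keys for larger radii
        S[l] = []
        for _ in range(tables):
            coords = [rng.randrange(d) for _ in range(k)]
            table = {}
            for p in P:
                table.setdefault(tuple(p[i] for i in coords), []).append(p)
            S[l].append((coords, table))
    return S

def query(S, q, d):
    lo, hi, res = 1, d, None
    while lo <= hi:
        l = (lo + hi) // 2
        hit = None
        for coords, table in S[l]:
            bucket = table.get(tuple(q[i] for i in coords))
            if bucket:                         # scan candidates in the bucket
                hit = min(bucket, key=lambda p: sum(x != y for x, y in zip(p, q)))
                break
        if hit is not None:
            res, hi = hit, l - 1               # found something: try smaller radii
        else:
            lo = l + 1                         # nothing found: try larger radii
    return res

rng = random.Random(5)
d = 16
P = [[rng.randrange(2) for _ in range(d)] for _ in range(50)]
q = [rng.randrange(2) for _ in range(d)]
near = list(q)
near[0] ^= 1                                   # plant a neighbor at Hamming distance 1
P.append(near)
res = query(build(P, d), q, d)
print(sum(x != y for x, y in zip(res, q)))     # small with high probability
```

Random data sits at distance about d/2 from q, so buckets at small radii almost only catch the planted neighbor, and the last non-empty level yields a near point, mirroring the "return the last non-empty entry" rule above.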
ANN Search in the Hamming Cube
Construction of S is said to ‘fail’ if for some l, more than M/log(d) structures ρ_j in S_l ‘fail’
Define (for some γ > 0), with Δ = δ2 − δ1:
T = O(Δ^(−2)·log(n·d/γ))
M = O(log(d)·log(log(d)/γ))
Then S’s construction fails w.p. at most γ. If S does not fail, then for every query the search algorithm fails to find an ANN w.p. at most γ.
ANN Search in the Hamming Cube
Query time complexity: O(log(d)) binary-search steps, each evaluating T tests; overall O(d·poly(log(n/γ)))
Space complexity: O(d·n^O(1))
Complexities are also proportional to ε^(−2)
Euclidean Space
The d-dimensional Euclidean space l_i^d is R^d endowed with the L_i distance
For any a, b ∈ R^d we define the L_i distance:
L_i(a, b) = (Σ_{j=1..d} |a_j − b_j|^i)^(1/i)
The algorithm presented deals with l_2^d, and with l_1^d under minor changes
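The L_i (Minkowski) distance can be written out directly; a small illustrative helper (the name `minkowski` is mine):

```python
# The L_i distance: L_i(a, b) = (sum_j |a_j - b_j|^i)^(1/i).
def minkowski(a, b, i):
    return sum(abs(x - y) ** i for x, y in zip(a, b)) ** (1.0 / i)

print(minkowski((0, 0), (3, 4), 2))   # 5.0 (Euclidean, l_2)
print(minkowski((0, 0), (3, 4), 1))   # 7.0 (Manhattan, l_1)
```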
Euclidean Space
Define: B(a, r) is the closed ball around a with radius r (a subset of R^d); P(a, r) = P ∩ B(a, r)
[Kushilevitz et al. ’98]
LSH – ANN Search: Extended Scheme
Preprocess: prepare a data structure for each ‘hamming ball’ induced by any a, b ∈ P
Query:
- Start with some maximal ball
- In each step calculate the ANN
- Stop according to some threshold
ANN Search in Euclidean Space
For a ∈ P, define a Euclidean-to-Hamming mapping φ: B(a, r) → {0, 1}^(D·F)
Define a parameter L; given a set of D i.i.d. unit vectors z_1, …, z_D:
- For each z_i, the cutting points c_1, …, c_F are equally spaced on [⟨a, z_i⟩ − L, ⟨a, z_i⟩ + L]
- Each z_i and c_j define a coordinate of the DF-dimensional Hamming cube, on which the projection of any b ∈ B(a, r) is 0 iff ⟨b, z_i⟩ ≤ c_j
ANN Search in Euclidean Space
Euclidean-to-Hamming mapping example: d = 3, D = 2, F = 3
[Figure: a = (a_1, a_2, a_3) and b = (b_1, b_2, b_3) projected on z_1 and z_2; the resulting bit strings are φ(a) = 101011 and φ(b) = 011011]
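The mapping can be sketched as follows. This is an illustrative toy: the parameter values, the name `phi`, and the exact placement of the F cutting points inside the interval are my choices, not the lecture's.

```python
# Euclidean-to-Hamming sketch: project points on D random unit vectors;
# along each vector, F cutting points equally spaced in
# [<a, z_i> - L, <a, z_i> + L] each contribute one bit,
# set to 1 iff the projection exceeds the cut.
import math
import random

def unit_vector(d, rng):
    v = [rng.gauss(0, 1) for _ in range(d)]    # Gaussian => uniform direction
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def make_map(a, L, d, D, F, rng):
    zs = [unit_vector(d, rng) for _ in range(D)]
    def phi(b):
        bits = []
        for z in zs:
            center = sum(ai * zi for ai, zi in zip(a, z))
            proj = sum(bi * zi for bi, zi in zip(b, z))
            cuts = [center - L + 2 * L * (j + 1) / (F + 1) for j in range(F)]
            bits.extend(1 if proj > c else 0 for c in cuts)
        return bits
    return phi

rng = random.Random(3)
a = [0.0, 0.0, 0.0]
phi = make_map(a, L=2.0, d=3, D=64, F=8, rng=rng)
near, far = [0.1, 0.0, 0.1], [1.5, -1.0, 0.8]
H = lambda u, v: sum(x != y for x, y in zip(u, v))
print(H(phi(a), phi(near)), H(phi(a), phi(far)))   # near << far (whp)
```

The number of cuts that fall between two projections grows with the distance between the points, so Hamming distance in the image tracks Euclidean distance in the ball, which is what lets the Hamming-cube algorithm take over.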
ANN Search in Euclidean Space
It can be proven that, in expectation, the mapping preserves the relative distances between points in P
The mapping gets more accurate as r grows smaller relative to L
ANN Search in Euclidean Space
Data structure: S = {S_a | a ∈ P}, with positive integer parameters D, F, L
For any a ∈ P, S_a consists of:
- A list of all other elements of P, sorted by increasing distance from a
- A structure S_{a,b} for any b ≠ a (b ∈ P)
ANN Search in Euclidean Space
Let r = L_2(a, b); then S_{a,b} consists of:
- A list of D i.i.d. unit vectors {z_1, …, z_D}
- For each unit vector z_i, a list of F cutting points
- A Hamming cube data structure of dimension D·F, containing φ(P(a, r))
- The size of P(a, r)
ANN Search in Euclidean Space
Search Algorithm (using a positive integer T):
Pick a random a_0 ∈ P; let b_0 be the farthest point from a_0, and start from S_{a_0,b_0} (r_0 = L_2(a_0, b_0))
For any S_{a_j,b_j}:
- Query for the ANN of φ(q) in the Hamming cube data structure, getting a result a'
- If L_2(q, a') > r_j/10, return a'
- Otherwise, pick T points of P(a_j, r_j) at random, and let a'' be the closest to q among them
- Let a_{j+1} be the closest to q of {a_j, a', a''}
ANN Search in Euclidean Space
Search Algorithm (continued):
- Let b' ∈ P be the farthest point from a_{j+1} s.t. L_2(a_{j+1}, b') ≤ 2·L_2(a_{j+1}, q), using a binary search on the sorted list of S_{a_{j+1}}
- If no such b' exists, return a_{j+1}
- Otherwise, let b_{j+1} = b'
ANN Search in Euclidean Space
Invariant: each ball in the search contains q’s (exact) NN
[Figure: the ball B(a_i, L_2(a_i, b_i)) containing both q and its nearest neighbor]
ANN Search in Euclidean Space
- P(a_{j+1}, r_{j+1}) contains only points from P(a_j, r_j)
- W.p. at least 1 − 2^(−T), P(a_{j+1}, r_{j+1}) contains at most 0.5·|P(a_j, r_j)| points
[Figure: consecutive balls around a_{i−1} and a_i, shrinking toward q]
ANN Search in Euclidean Space
Conclusion: since |P(a_{j+1}, r_{j+1})| ≤ 0.5·|P(a_j, r_j)|, the expected number of iterations is O(log(n))
ANN Search in Euclidean Space
Construction of S is said to ‘fail’ if for some S_{a,b}, φ does not preserve the relative distances
Define (for some γ > 0) F and D large enough: F polynomial in 1/ε, and D polylogarithmic in n, d and 1/γ
Then S’s construction fails w.p. at most γ. If S does not fail, then for every query the search algorithm finds an ANN.
ANN Search in Euclidean Space
Query time complexity: O(log(n)) iterations, each costing O(D·F) plus a search in a DF-dimensional cube; overall O(d·poly(log(n/γ)))
Space complexity: O(d·n^O(1))
Complexities are also proportional to ε^(−2)
Remark – Additional Work
Related works:
- Jon M. Kleinberg. “Two Algorithms for Nearest-Neighbor Search in High Dimensions”, 1997
- P. Indyk and R. Motwani. “Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality”, 1998
- A. Gionis, P. Indyk and R. Motwani. “Similarity Search in High Dimensions via Hashing”, 1999
Summary
The Goal: linear space and logarithmic search time
- Approximate nearest neighbor
- Locality Sensitive Hash functions
- Amplify probability by concatenating
- Discretization of values by projecting points on unit vectors
Good Bye (Approximate) Neighbor
[http://www.thesimpsons.com]
For questions feel free to consult your neighbors:
[email protected]@weizmann.ac.il