Given by: Erez Eyal, Uri Klein
Lecture Outline
Exact Nearest Neighbor search
- Definition
- Low dimensions
- KD-Trees
Approximate Nearest Neighbor search (LSH based)
- Locality Sensitive Hashing families
- Algorithm for Hamming Cube
- Algorithm for Euclidean space
Summary
(Overview / Detailed)
Nearest “Neighbor” Search for Homer Simpson
Features:
- Home planet distance
- Height
- Weight
- Color
Nearest Neighbor (NN) Search
Given: a set P of n points in R^d (d - dimension)
Goal: a data structure, which given a query point q, finds the nearest neighbor p of q in P (in terms of some distance function D)
Nearest Neighbor Search
Interested in designing a data structure with the following objectives:
- Space: O(dn)
- Query time: O(d log(n))
- Data structure construction time is not important
Simple cases: 1-D (d = 1)
A binary search will give the solution
Space: O(n); Time: O(log(n))
Example: searching for q = 9 among the sorted points 1 4 7 8 13 19 25 32
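The 1-D case can be sketched with Python's `bisect` module (an illustrative helper; the name `nn_1d` is mine, and the points and query are the slide's example):

```python
# Exact 1-D nearest neighbor via binary search: O(log n) per query.
import bisect

def nn_1d(points, q):
    """Return the nearest neighbor of q in a sorted list of numbers."""
    i = bisect.bisect_left(points, q)          # first index with points[i] >= q
    candidates = points[max(0, i - 1):i + 1]   # at most two neighbors to compare
    return min(candidates, key=lambda p: abs(p - q))

points = [1, 4, 7, 8, 13, 19, 25, 32]
print(nn_1d(points, 9))   # 8 (distance 1, vs. 13 at distance 4)
```

Only the two points adjacent to the insertion position can be nearest, so a single binary search plus one comparison suffices.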
Simple cases: 2-D (d = 2)
Using Voronoi diagrams will give the solution
Space: O(n^2); Time: O(log(n))
KD-Trees
A KD-tree is a data structure based on recursively subdividing a set of points with alternating axis-aligned hyperplanes.
The classical KD-tree uses O(dn) space and answers queries in time logarithmic in n on average (worst case O(n)), but exponential in d.
KD-Trees Construction
[Figure: eleven points (1-11) recursively split by lines l1-l10, shown next to the resulting tree whose internal nodes are l1-l10 and whose leaves are the points]
KD-Trees Query
[Figure: the same subdivision and tree, now with a query point q located in one of the cells]
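The construction and query steps in the figures can be written compactly. This is an illustrative sketch, not the lecture's code: the function names, the median-split rule, and the squared-distance bookkeeping are my choices.

```python
# A minimal KD-tree: build by alternating axis-aligned splits at the median,
# query with branch-and-bound pruning. Worst-case query is O(n), as noted above.
def build(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])              # alternate the splitting axis
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def nearest(node, q, best=None):
    if node is None:
        return best
    d2 = sum((a - b) ** 2 for a, b in zip(node["point"], q))
    if best is None or d2 < best[1]:
        best = (node["point"], d2)             # best = (point, squared distance)
    diff = q[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, q, best)              # search the near side first
    if diff ** 2 < best[1]:                    # ball crosses the split plane?
        best = nearest(far, q, best)           # only then visit the far side
    return best

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2))[0])   # (8, 1)
```

The pruning test `diff ** 2 < best[1]` is what makes queries fast in low dimension; in high dimension the ball around q crosses almost every split plane, which is exactly the exponential-in-d behavior the slide mentions.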
A conjecture: “The curse of dimensionality”
“However, to the best of our knowledge, lower bounds for exact NN Search in high dimensions do not seem sufficiently convincing to justify the curse of dimensionality conjecture” (Borodin et al. ’99)
The conjecture: for an exact solution, any algorithm for high dimension must use either n^ω(1) space or have d^ω(1) query time
Why Approximate NN?
- Approximation allows a significant speedup of the calculation (on the order of tens to hundreds)
- Fixed-precision arithmetic on computers causes approximation anyway
- Heuristics are used for mapping features to numerical values (causing uncertainty anyway)
Approximate Nearest Neighbor (ANN) Search
Given: a set P of n points in R^d (d - dimension) and a slackness parameter ε > 0
Goal: a data structure, which given a query point q whose nearest neighbor in P is a, finds any p s.t. D(q, p) ≤ (1+ε)·D(q, a)
[Figure: q, its exact nearest neighbor a, and the ball of radius (1+ε)·D(q, a) of acceptable answers]
Locality Sensitive Hashing
A (r1, r2, P1, P2)-Locality Sensitive Hashing (LSH) family is a family of hash functions s.t. for a random hash function h and for any pair of points a, b we have:
- D(a, b) ≤ r1 ⟹ Pr[h(a) = h(b)] ≥ P1
- D(a, b) ≥ r2 ⟹ Pr[h(a) = h(b)] ≤ P2
(r1 < r2, P1 > P2)
[Indyk-Motwani ’98]
(A common method to reduce dimensionality without losing distance information)
Hamming Cube
A d-dimensional Hamming cube Q_d is the set {0, 1}^d
For any a, b ∈ Q_d we define the Hamming distance H:
H(a, b) = Σ_{i=1..d} |a_i − b_i|
LSH – Example in the Hamming Cube
Take the family {h | h(a) = a_i, i ∈ {1, …, d}} (a random h samples one coordinate)
Pr[h(q) = h(a)] = 1 − H(q, a)/d
Pr is a monotonically decreasing function of H(q, a)
Multi-index hashing: take {g | g(a) = (h_1(a), h_2(a), …, h_k(a))}
Pr[g(q) = g(a)] = (1 − H(q, a)/d)^k
Pr is a monotonically decreasing function of k
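The bit-sampling family and its amplification by concatenation can be checked empirically. This is an illustrative sketch with ad hoc parameters (the helper `make_g` and all constants are mine); sampling the k coordinates independently with replacement gives the collision probability (1 − H(q, a)/d)^k exactly.

```python
# Bit-sampling LSH on the Hamming cube: h(a) = a_i for a random coordinate i,
# so Pr[h(q) = h(a)] = 1 - H(q, a)/d; concatenating k independent samples
# gives Pr[g(q) = g(a)] = (1 - H(q, a)/d)^k.
import random

def make_g(d, k, rng):
    coords = [rng.randrange(d) for _ in range(k)]   # k random bit positions
    return lambda a: tuple(a[i] for i in coords)

rng = random.Random(0)
d, k, trials = 16, 4, 20000
q = [rng.randrange(2) for _ in range(d)]
a = list(q)
a[0] ^= 1
a[1] ^= 1                                           # H(q, a) = 2
hits = 0
for _ in range(trials):
    g = make_g(d, k, rng)                           # fresh random g each trial
    hits += g(q) == g(a)
print(round(hits / trials, 3))   # close to (1 - 2/16)**4 ≈ 0.586
```

Increasing k sharpens the gap between collision probabilities of near and far points, which is exactly the amplification the slide describes.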
LSH – ANN Search: Basic Scheme
Preprocess:
- Construct several such ‘g’ functions for each l ∈ {1, …, d}
- Store each a ∈ P at the place g_i(a) of the corresponding hash table
Query:
- Perform a binary search on l
- In each step retrieve g_i(q) (for the current l, if it exists)
- Return the last non-empty result
ANN Search in the Hamming Cube
β-test: pick a subset C of {1, 2, …, d} by including each index independently, at random, w.p. β; for each i ∈ C, pick independently and uniformly r_i ∈ {0, 1}
For any a ∈ Q_d: σ(a) = Σ_{i∈C} r_i·a_i (mod 2)
(Equivalently, we may pick R ∈ {0, 1}^d s.t. R_i is 1 w.p. β, and the test is an inner product of R and a. Such an R represents a β-test.)
[Kushilevitz et al. ’98]
ANN Search in the Hamming Cube
Define: δ(a, b) = Pr[σ(a) ≠ σ(b)]
For a query q, let H(a, q) ≤ l and H(b, q) > (1+ε)l. Then for β = 1/(2l):
δ(a, q) ≤ δ1 < δ2 ≤ δ(b, q)
where:
δ1 = ½·(1 − (1 − 1/(2l))^l)
δ2 = ½·(1 − (1 − 1/(2l))^((1+ε)l))
and δ2 − δ1 = Ω(1 − e^(−ε/2))
ANN Search in the Hamming Cube
Data structure: S = {S_1, …, S_d}, with positive integer parameters M, T
For any l ∈ {1, …, d}, S_l = {ρ_1, …, ρ_M}
For any j ∈ {1, …, M}, ρ_j consists of a set {t_1, …, t_T} (each t_k is a (1/(2l))-test) and a table A_j of 2^T entries
ANN Search in the Hamming Cube
In each S_l, construct ρ_j as follows:
- Pick {t_1, …, t_T} independently at random
- For v ∈ Q_d, the trace is t(v) = (t_1(v), …, t_T(v)) ∈ {0, 1}^T
- An entry z ∈ {0, 1}^T in A_j contains a point a ∈ P if H(t(a), z) ≤ (δ1 + (1/3)(δ2 − δ1))·T (else it is empty)
The space complexity: O(d·M·(d·T + 2^T·log(n)) + d·n)
ANN Search in the Hamming Cube
For any query q and a, b ∈ P s.t. H(q, a) ≤ l and H(q, b) > (1+ε)l, it can be proven using Chernoff bounds that (writing Δ = δ2 − δ1):
Pr[H(t(q), t(a)) > (δ1 + Δ/3)·T] ≤ e^(−2Δ²T/9)
Pr[H(t(q), t(b)) ≤ (δ2 − Δ/3)·T] ≤ e^(−2Δ²T/9)
[Alon & Spencer ’92]
This gives the result that the trace t functions, in essence, as an LSH family
(When the event presented in these inequalities occurs for some ρ_j in S_l, ρ_j is said to ‘fail’)
ANN Search in the Hamming Cube
Search Algorithm: we perform a binary search on l. In every step:
- Pick ρ_j in S_l uniformly, at random
- Compute t(q) from the list of tests in ρ_j
- Check the entry labeled t(q) in A_j:
  - If the entry contains a point from P, restrict the search to lower l’s
  - Otherwise, restrict the search to greater l’s
Return the last non-empty entry in the search
ANN Search in the Hamming Cube
Search Algorithm, as a flowchart: initialize l = d/2; access S_l, choose ρ_j and calculate t(q); if A_j(t(q)) is empty, continue the search in the upper half of the range of l, otherwise record Res ← A_j(t(q)) and continue in the lower half; stop once the range of l is covered.
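The binary search over l can be sketched end to end with a simplified, ad hoc stand-in for the lecture's structures: plain bit-sampling hash tables per radius instead of the ρ_j test tables, with all parameter choices (table counts, key lengths, data sizes) mine rather than the lecture's M and T.

```python
# Simplified LSH-based ANN on the Hamming cube: per radius l, keep several
# hash tables keyed by concatenated bit samples; binary-search on l and
# keep the best point from the last non-empty bucket.
import random

def build(P, d, tables=8, seed=2):
    rng = random.Random(seed)
    S = {}
    for l in range(1, d + 1):
        k = max(1, d // l)                     # shorter keys for larger radii
        S[l] = []
        for _ in range(tables):
            coords = [rng.randrange(d) for _ in range(k)]
            table = {}
            for p in P:
                table.setdefault(tuple(p[i] for i in coords), []).append(p)
            S[l].append((coords, table))
    return S

def query(S, q, d):
    lo, hi, res = 1, d, None
    while lo <= hi:
        l = (lo + hi) // 2
        hit = None
        for coords, table in S[l]:
            bucket = table.get(tuple(q[i] for i in coords))
            if bucket:                         # scan candidates in the bucket
                hit = min(bucket, key=lambda p: sum(x != y for x, y in zip(p, q)))
                break
        if hit is not None:
            res, hi = hit, l - 1               # found something: try smaller radii
        else:
            lo = l + 1                         # nothing found: try larger radii
    return res

rng = random.Random(5)
d = 16
P = [[rng.randrange(2) for _ in range(d)] for _ in range(50)]
q = [rng.randrange(2) for _ in range(d)]
near = list(q)
near[0] ^= 1                                   # plant a neighbor at Hamming distance 1
P.append(near)
res = query(build(P, d), q, d)
print(sum(x != y for x, y in zip(res, q)))     # small with high probability
```

Random data sits at distance about d/2 from q, so buckets at small radii almost only catch the planted neighbor, and the last non-empty level yields a near point, mirroring the "return the last non-empty entry" rule above.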
ANN Search in the Hamming Cube
Construction of S is said to ‘fail’ if for some l, more than M/log(d) structures ρ_j in S_l ‘fail’
Define (for some γ > 0), with Δ = δ2 − δ1:
T = O(Δ^(−2)·log(n·d/γ))
M = O(log(d)·log(log(d)/γ))
Then S’s construction fails w.p. at most γ. If S does not fail, then for every query the search algorithm fails to find an ANN w.p. at most γ.
ANN Search in the Hamming Cube
Query time complexity: O(log(d)) binary-search steps, each evaluating T tests; overall O(d·poly(log(n/γ)))
Space complexity: O(d·n^O(1))
Complexities are also proportional to ε^(−2)
Euclidean Space
The d-dimensional Euclidean space l_i^d is R^d endowed with the L_i distance
For any a, b ∈ R^d we define the L_i distance:
L_i(a, b) = (Σ_{j=1..d} |a_j − b_j|^i)^(1/i)
The algorithm presented deals with l_2^d, and with l_1^d under minor changes
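The L_i (Minkowski) distance can be written out directly; a small illustrative helper (the name `minkowski` is mine):

```python
# The L_i distance: L_i(a, b) = (sum_j |a_j - b_j|^i)^(1/i).
def minkowski(a, b, i):
    return sum(abs(x - y) ** i for x, y in zip(a, b)) ** (1.0 / i)

print(minkowski((0, 0), (3, 4), 2))   # 5.0 (Euclidean, l_2)
print(minkowski((0, 0), (3, 4), 1))   # 7.0 (Manhattan, l_1)
```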
Euclidean Space
Define: B(a, r) is the closed ball around a with radius r (a subset of R^d); P(a, r) = P ∩ B(a, r)
[Kushilevitz et al. ’98]
LSH – ANN Search: Extended Scheme
Preprocess: prepare a data structure for each ‘hamming ball’ induced by any a, b ∈ P
Query:
- Start with some maximal ball
- In each step calculate the ANN
- Stop according to some threshold
ANN Search in Euclidean Space
For a ∈ P, define a Euclidean-to-Hamming mapping φ: B(a, r) → {0, 1}^(D·F)
Define a parameter L; given a set of D i.i.d. unit vectors z_1, …, z_D:
- For each z_i, the cutting points c_1, …, c_F are equally spaced on [⟨a, z_i⟩ − L, ⟨a, z_i⟩ + L]
- Each z_i and c_j define a coordinate of the DF-dimensional Hamming cube, on which the projection of any b ∈ B(a, r) is 0 iff ⟨b, z_i⟩ ≤ c_j
ANN Search in Euclidean Space
Euclidean-to-Hamming mapping example: d = 3, D = 2, F = 3
[Figure: a = (a_1, a_2, a_3) and b = (b_1, b_2, b_3) projected on z_1 and z_2; the resulting bit strings are φ(a) = 101011 and φ(b) = 011011]
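The mapping can be sketched as follows. This is an illustrative toy: the parameter values, the name `phi`, and the exact placement of the F cutting points inside the interval are my choices, not the lecture's.

```python
# Euclidean-to-Hamming sketch: project points on D random unit vectors;
# along each vector, F cutting points equally spaced in
# [<a, z_i> - L, <a, z_i> + L] each contribute one bit,
# set to 1 iff the projection exceeds the cut.
import math
import random

def unit_vector(d, rng):
    v = [rng.gauss(0, 1) for _ in range(d)]    # Gaussian => uniform direction
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def make_map(a, L, d, D, F, rng):
    zs = [unit_vector(d, rng) for _ in range(D)]
    def phi(b):
        bits = []
        for z in zs:
            center = sum(ai * zi for ai, zi in zip(a, z))
            proj = sum(bi * zi for bi, zi in zip(b, z))
            cuts = [center - L + 2 * L * (j + 1) / (F + 1) for j in range(F)]
            bits.extend(1 if proj > c else 0 for c in cuts)
        return bits
    return phi

rng = random.Random(3)
a = [0.0, 0.0, 0.0]
phi = make_map(a, L=2.0, d=3, D=64, F=8, rng=rng)
near, far = [0.1, 0.0, 0.1], [1.5, -1.0, 0.8]
H = lambda u, v: sum(x != y for x, y in zip(u, v))
print(H(phi(a), phi(near)), H(phi(a), phi(far)))   # near << far (whp)
```

The number of cuts that fall between two projections grows with the distance between the points, so Hamming distance in the image tracks Euclidean distance in the ball, which is what lets the Hamming-cube algorithm take over.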
ANN Search in Euclidean Space
It can be proven that, in expectation, the mapping preserves the relative distances between points in P
The mapping gets more accurate as r grows smaller relative to L
ANN Search in Euclidean Space
Data structure: S = {S_a | a ∈ P}, with positive integer parameters D, F, L
For any a ∈ P, S_a consists of:
- A list of all other elements of P, sorted by increasing distance from a
- A structure S_{a,b} for any b ≠ a (b ∈ P)
ANN Search in Euclidean Space
Let r = L_2(a, b); then S_{a,b} consists of:
- A list of D i.i.d. unit vectors {z_1, …, z_D}
- For each unit vector z_i, a list of F cutting points
- A Hamming cube data structure of dimension D·F, containing φ(P(a, r))
- The size of P(a, r)
ANN Search in Euclidean Space
Search Algorithm (using a positive integer T):
Pick a random a_0 ∈ P; let b_0 be the farthest point from a_0, and start from S_{a_0,b_0} (r_0 = L_2(a_0, b_0))
For any S_{a_j,b_j}:
- Query for the ANN of φ(q) in the Hamming cube data structure, getting a result a'
- If L_2(q, a') > r_j/10, return a'
- Otherwise, pick T points of P(a_j, r_j) at random, and let a'' be the closest to q among them
- Let a_{j+1} be the closest to q of {a_j, a', a''}
ANN Search in Euclidean Space
Search Algorithm (continued):
- Let b' ∈ P be the farthest point from a_{j+1} s.t. L_2(a_{j+1}, b') ≤ 2·L_2(a_{j+1}, q), using a binary search on the sorted list of S_{a_{j+1}}
- If no such b' exists, return a_{j+1}
- Otherwise, let b_{j+1} = b'
ANN Search in Euclidean Space
Invariant: each ball in the search contains q’s (exact) NN
[Figure: the ball B(a_i, L_2(a_i, b_i)) containing both q and its nearest neighbor]
ANN Search in Euclidean Space
- P(a_{j+1}, r_{j+1}) contains only points from P(a_j, r_j)
- W.p. at least 1 − 2^(−T), P(a_{j+1}, r_{j+1}) contains at most 0.5·|P(a_j, r_j)| points
[Figure: consecutive balls around a_{i−1} and a_i, shrinking toward q]
ANN Search in Euclidean Space
Conclusion: since |P(a_{j+1}, r_{j+1})| ≤ 0.5·|P(a_j, r_j)|, the expected number of iterations is O(log(n))
ANN Search in Euclidean Space
Construction of S is said to ‘fail’ if for some S_{a,b}, φ does not preserve the relative distances
Define (for some γ > 0) F and D large enough: F polynomial in 1/ε, and D polylogarithmic in n, d and 1/γ
Then S’s construction fails w.p. at most γ. If S does not fail, then for every query the search algorithm finds an ANN.
ANN Search in Euclidean Space
Query time complexity: O(log(n)) iterations, each costing O(D·F) plus a search in a DF-dimensional cube; overall O(d·poly(log(n/γ)))
Space complexity: O(d·n^O(1))
Complexities are also proportional to ε^(−2)
Remark – Additional Work
Related works:
- Jon M. Kleinberg. “Two Algorithms for Nearest-Neighbor Search in High Dimensions”, 1997
- P. Indyk and R. Motwani. “Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality”, 1998
- A. Gionis, P. Indyk and R. Motwani. “Similarity Search in High Dimensions via Hashing”, 1999
Summary
The Goal: linear space and logarithmic search time
- Approximate nearest neighbor
- Locality Sensitive Hash functions
- Amplify probability by concatenating
- Discretization of values by projecting points on unit vectors
Good Bye (Approximate) Neighbor
[http://www.thesimpsons.com]
For questions feel free to consult your neighbors:
[email protected]@weizmann.ac.il