Nearest Neighbors Algorithms in Euclidean and Metric Spaces


Transcript of Nearest Neighbors Algorithms in Euclidean and Metric Spaces

Page 1

Nearest Neighbors Algorithms in Euclidean and Metric Spaces

[email protected]

Page 2

Introduction

kd-trees and basic search algorithms

kd-trees and random projection trees: improved search algorithms

kd-trees and random projection trees: diameter reduction

Metric trees and variants

Distance concentration phenomena: an introduction

A metric from optimal transportation: the earth mover distance

Page 3

Introduction

kd-trees and basic search algorithms

kd-trees and random projection trees: improved search algorithms

kd-trees and random projection trees: diameter reduction

Metric trees and variants

Distance concentration phenomena: an introduction

A metric from optimal transportation: the earth mover distance

Page 4

Nearest Neighbors Algorithms in Euclidean and Metric Spaces

Introduction

kd-trees and basic search algorithms

kd-trees and random projection trees: improved search algorithms

kd-trees and random projection trees: diameter reduction

Metric trees and variants

Distance concentration phenomena: an introduction

A metric from optimal transportation: the earth mover distance

Page 5

Applications

. A core problem in the following applications:

I clustering, k-means algorithms

I information retrieval in databases

I information theory: vector quantization encoding

I classification in learning theory

I . . .

Page 6

Nearest Neighbors: Getting Started

. Input: a set of points (aka sites) P in R^d, and a query point q

. Output: nn(q, P), the point of P nearest to q, i.e. the point realizing

d(q, P) = d(q, nn(q, P)). (1)

(Figure: a query point q and its nearest neighbor nn(q).)

Page 7

The Euclidean Voronoi Diagram and its Dual the Delaunay Triangulation

. Voronoi and Delaunay diagrams

. Key properties:

I Voronoi cells of all dimensions

I Voronoi - Delaunay via the nerve construction

I Duality : cells of dim. d − k vs cells of dimension k

I The empty ball property

Page 8

Nearest Neighbors Using Voronoi Diagrams

(Figure: a query point q, a site p, and nn(q).)

. Nearest neighbor by walking:
- start from any point p ∈ P
- while there exists a neighbor n(p) of p in Vor(P) closer to q than p, step to it: p ← n(p)
- done: nn(q) = p

. Argument: the Delaunay neighborhood of a point is complete. With
Vor(p, P) = cell of p in Vor(P),
N(p) = set of neighbors of p in Vor(P),
N′(p) = {p} ∪ N(p),
one has Vor(p, N′(p)) = Vor(p, P).
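To make the walking strategy concrete, a minimal Python sketch follows (assumptions: NumPy and SciPy are available; the Delaunay adjacency comes from scipy.spatial.Delaunay; the final assert is a brute-force check).

import numpy as np
from scipy.spatial import Delaunay

def nn_by_walking(points, q, start=0):
    # Walk in the Delaunay graph: as long as some neighbor of the current site is
    # closer to q, move to the closest such neighbor; since the Delaunay neighborhood
    # of a site is complete, a local minimum is the global one.
    tri = Delaunay(points)
    indptr, indices = tri.vertex_neighbor_vertices
    p = start
    while True:
        neighbors = indices[indptr[p]:indptr[p + 1]]
        best = min(neighbors, key=lambda j: np.linalg.norm(points[j] - q))
        if np.linalg.norm(points[best] - q) < np.linalg.norm(points[p] - q):
            p = best
        else:
            return p

rng = np.random.default_rng(0)
P = rng.random((200, 2))
q = rng.random(2)
assert nn_by_walking(P, q) == np.argmin(np.linalg.norm(P - q, axis=1))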

Page 9

The Nearest Neighbors Problem: Overview

. Strategy: preprocess a point set P of n points in R^d into a data structure (DS) answering nearest neighbor queries fast.

. Ideal wish list:

I The DS should have linear size

I A query should have sub-linear complexity i.e. o(n)

I When d = 1: balanced binary search trees yield O(log n)

. Core difficulties:

I curse of dimensionality: typically the space R^d has a high dimension d, and n ≫ d.

I Interpretation (meaningfulness) of distances in high-dimensional spaces.

Page 10

The Nearest Neighbors Problem: Elementary Options

. The trivial solution: O(dn) space, O(dn) query time

. Voronoi diagram:

I d = 2: O(n) space, O(log n) query time
I d > 2: O(n^⌈d/2⌉) space

→ Under a locally uniform condition on the point distribution, the 1-skeleton Delaunay hierarchy achieves: O(n) space, O(c^d log n) expected query time.

. Spatial partitions based on trees

Page 11

The Nearest Neighbors Problem: Variants

. Variants:

I k-nearest neighbors: find the k points in P that are nearest to q

I given r > 0, find the points in P at distance less than r from q

I Various metrics

I L2, Lp, L∞
I Strings: Hamming distance
I Images, graphs: optimal transport
I Point sets: distances via optimal alignment

. Main contenders:

I Tree like data structures:

I kd-trees
I quad-trees
I metric trees

I Locality Sensitive Hashing

I Hierarchical k-means

Page 12

Main Contenders: Typical Results

. Four main contenders . Winners: size effect

. Randomized kd-trees and forests: data structures which are simple, effective, versatile, and controlled (in terms of quality/performance).

.Ref: Muja and Lowe, VISAPP 2009

.Ref: O’Hara and Draper, Applications of Computer Vision (WACV), 2013

Page 13

Nearest Neighbors Algorithms in Euclidean and Metric Spaces

Introduction

kd-trees and basic search algorithms

kd-trees and random projection trees: improved search algorithms

kd-trees and random projection trees: diameter reduction

Metric trees and variants

Distance concentration phenomena: an introduction

A metric from optimal transportation: the earth mover distance

Page 14

kd-tree for a collection of points (sites) P

. Definition:

I A binary tree

I Any internal node implements a spatial partition induced by a hyperplane H, splitting the point cloud into two subsets of (nearly) equal size

I right subtree: points p on one side of H
I left subtree: remaining points

I The process halts when a node contains ≤ n0 points

Page 15

kd-tree for a collection of points P

. Algorithm build kdTree(S)

n ← newNode
if |S| ≤ n0 then
  Store the points of S into a container of n
  return n
else
  dir ← depth mod d
  Project the points of S along direction dir
  Compute the median m
  {Split into two (nearly) equal subsets}
  n.sample ← sample v realizing the median value
  L ← points of S \ {v} whose dir-th coordinate is < m
  R ← points of S \ {v} whose dir-th coordinate is ≥ m
  n.left ← build kdTree(L)
  n.right ← build kdTree(R)
  return n
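A minimal Python transcription of build kdTree, under the stated assumptions (NumPy; leaves of capacity n0; the point realizing the median is stored at the internal node as n.sample):

import numpy as np

class KDNode:
    def __init__(self, **kw):
        self.__dict__.update(kw)   # leaf: points; internal node: dir, sample, left, right

def build_kdtree(S, depth=0, n0=8):
    if len(S) <= n0:
        return KDNode(points=S)                   # store the points of S in a leaf
    d = S.shape[1]
    direction = depth % d                         # cycle through the coordinate axes
    order = np.argsort(S[:, direction])           # sort along the chosen direction
    mid = len(S) // 2
    v = S[order[mid]]                             # sample realizing the median value
    rest = np.delete(S, order[mid], axis=0)
    L = rest[rest[:, direction] < v[direction]]   # points below the median
    R = rest[rest[:, direction] >= v[direction]]  # remaining points
    return KDNode(dir=direction, sample=v,
                  left=build_kdtree(L, depth + 1, n0),
                  right=build_kdtree(R, depth + 1, n0))

rng = np.random.default_rng(0)
tree = build_kdtree(rng.random((1000, 3)))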

Page 16

kd-tree: search

. Three strategies:

I the defeatist search: simple, but may fail (NB: see later, distance concentration phenomena)

I the descending search: always succeeds, but may take time

I the priority search: strikes a compromise between the defeatist and descending strategies

Page 17

kd-tree search: the defeatist search

. Key idea: recursively visit the subtree containing the query point

. Algorithm defeatist search kdTree: the defeatist search in a kd-tree.

Require: Maintains nn(q) of q, and τ = d(q, nn(q))
n ← root; τ ← d(q, n.sample)
while n ≠ NIL do
  Possibly update nn(q) using n.sample, and τ
  if q ∈ Domain of L then
    defeatist search kdTree(n.left)
  if q ∈ Domain of R then
    defeatist search kdTree(n.right)

. Complexity: assuming leaves of size n0, the depth h satisfies 2^h n0 = n

I search cost: O(n0 + log(n/n0))

. Caveat: failure (Figure: q and nn(q) may fall on opposite sides of a split.)
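A standalone Python sketch of the defeatist search (assumptions: NumPy; a compact dictionary-based kd-tree is rebuilt here so the snippet runs on its own; the last lines compare the answer with a brute-force scan).

import numpy as np

def build(points, depth=0, n0=8):
    if len(points) <= n0:
        return {"leaf": points}
    axis = depth % points.shape[1]
    order = np.argsort(points[:, axis])
    m = len(points) // 2
    return {"axis": axis, "cut": points[order[m], axis],
            "left": build(points[order[:m]], depth + 1, n0),
            "right": build(points[order[m:]], depth + 1, n0)}

def defeatist_search(node, q):
    # Visit only the subtree whose domain contains q: cost O(n0 + log(n/n0)), but may fail.
    if "leaf" in node:
        pts = node["leaf"]
        return pts[np.argmin(np.linalg.norm(pts - q, axis=1))]
    child = node["left"] if q[node["axis"]] < node["cut"] else node["right"]
    return defeatist_search(child, q)

rng = np.random.default_rng(1)
P, q = rng.random((1000, 4)), rng.random(4)
approx = defeatist_search(build(P), q)
exact = P[np.argmin(np.linalg.norm(P - q, axis=1))]
print(np.linalg.norm(q - approx), np.linalg.norm(q - exact))  # equal unless the defeatist search failed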

Page 18

kd-tree search: the descending search

. Key idea: visit one or both subtrees, depending on the current distance d(q, nn(q))

. Algorithm descending search kdTree: the descending search in a kd-tree.

Require: Maintains nn(q) of q, and τ = d(q, nn(q))
Require: Uses the domain of a node n (an intersection of half-spaces)
n ← root; τ ← d(q, n.sample)
while n ≠ NIL do
  Possibly update nn(q) using n.sample
  if Sphere(q, τ) ∩ Domain of L ≠ ∅ then
    descending search kdTree(n.left)
  if Sphere(q, τ) ∩ Domain of R ≠ ∅ then
    descending search kdTree(n.right)

(Figure: the ball Sphere(q, τ) around q. The value of τ ensures that the top cell will be visited.)

Page 19

kd-tree search: the priority search (idea)

. Priority search, key ideas:

I Uses a priority queue to store nodes (regions), with a priority inversely proportional to the distance to q.

I Upon popping a node, the corresponding subtree is descended to visit the node closest to q. Upon descending, nn(q) is updated.

I While descending, the child not visited is possibly enqueued.

Page 20

kd-tree search: the priority search (algorithm)

. Uses a priority queue Q to enumerate nodes by increasing distance to the query q

Ensure: Maintains nn(q) of q, and τ = d(q, nn(q))
nn(q) ← root.sample; Q.insert(root)
while True do
  if Q.empty() then
    return
  {Node with highest priority}
  r ← Q.pop()
  {The nearest box is too far wrt nn(q)}
  if d(bbox(r), q) > τ then
    return
  {Descend into the box nearest to q, and possibly enqueue the second node}
  for nodes n on the path from r to the box nearest to q do
    {Possibly update nn(q) and τ}
    d ← d(q, n.sample)
    if d < τ then
      nn(q) ← n.sample; τ ← d
    {Possibly enqueue the second subtree}
    f ← farther subtree of n w.r.t. q
    if d(bbox(f), q) ≤ τ then
      {Insert with priority inverse to the distance to q}
      Q.insert(f, 1/d(bbox(f), q))

(Figure: the box of the current node r, the visited boxes, the query q, and the farther subtree enqueued with priority 1/d.)
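A standalone Python sketch of the priority (best bin first) search, under the assumptions that NumPy is available, that leaves hold at most n0 points, and that bounding boxes are stored at build time so that d(bbox, q) is cheap; the final assert is a brute-force check.

import heapq
import numpy as np

def build(points, lo, hi, depth=0, n0=8):
    if len(points) <= n0:
        return {"leaf": points, "lo": lo, "hi": hi}
    axis = depth % points.shape[1]
    order = np.argsort(points[:, axis])
    m = len(points) // 2
    cut = points[order[m], axis]
    hi_left, lo_right = hi.copy(), lo.copy()
    hi_left[axis], lo_right[axis] = cut, cut
    return {"axis": axis, "cut": cut, "lo": lo, "hi": hi,
            "left": build(points[order[:m]], lo, hi_left, depth + 1, n0),
            "right": build(points[order[m:]], lo_right, hi, depth + 1, n0)}

def box_dist(node, q):
    # Distance from q to the bounding box of a node (0 if q lies inside it).
    return np.linalg.norm(np.maximum(0.0, np.maximum(node["lo"] - q, q - node["hi"])))

def priority_search(root, q):
    best, tau = None, np.inf
    heap, counter = [(0.0, 0, root)], 1       # (distance to box, tie breaker, node)
    while heap:
        d_box, _, node = heapq.heappop(heap)
        if d_box > tau:                       # the nearest remaining box is too far: done
            break
        while "leaf" not in node:             # descend towards the child containing q
            if q[node["axis"]] < node["cut"]:
                near, far = node["left"], node["right"]
            else:
                near, far = node["right"], node["left"]
            heapq.heappush(heap, (box_dist(far, q), counter, far))
            counter += 1
            node = near
        for p in node["leaf"]:                # scan the leaf, possibly update nn(q) and tau
            d = np.linalg.norm(q - p)
            if d < tau:
                best, tau = p, d
    return best, tau

rng = np.random.default_rng(2)
P, q = rng.random((2000, 5)), rng.random(5)
root = build(P, np.zeros(5), np.ones(5))
_, tau = priority_search(root, q)
assert np.isclose(tau, np.linalg.norm(P - q, axis=1).min())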

Page 21

kd-tree search: priority search (analysis)

. Pros and cons:

I ++ nn always found

I ++ linear storage

I – nn often found at an early stage ... then time spent in useless recursion

I – In the worst-case, all nodes are visited.

I – Maintaining the priority queue Q has a cost

. Improvements:

I Stopping the recursion once a fraction of the nodes has been visited

I Backing up defeatist search with overlapping cells

I Combining multiple randomized kd-trees

Page 22

References

Sam06 H. Samet. Foundations of multidimensional and metric data structures. Morgan Kaufmann, 2006.

SDE05 G. Shakhnarovich, T. Darrell, and P. Indyk (Eds). Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. MIT Press, 2005.

Page 23

Nearest Neighbors Algorithms in Euclidean and Metric Spaces

Introduction

kd-trees and basic search algorithms

kd-trees and random projection trees: improved search algorithms

kd-trees and random projection trees: diameter reduction

Metric trees and variants

Distance concentration phenomena: an introduction

A metric from optimal transportation: the earth mover distance

Page 24

Random projection trees (RPTrees), aka random partition trees (RPTrees!)

. kd-tree: axis parallel splits

. Splitting along a random direction U ∈ S^{d−1}: project onto U and split at the (perturbed) median

(Figure: the split value v along U, and the resulting spatial partition.)

Page 25

Improvements aiming at fixing the defeatist search

. Defeatist search: (early) choice of one side is risky

. Simple improvements:

– Allow overlap between the cells of a node; selected points are stored twice: spill trees

– Use randomization to obtain different partitions: several spatial partitions may rescue the defeatist search

– Use several trees, and pick the best neighbor(s)

Page 26

Random projection trees: generic algorithm

. Below: version where one also jitters the median

. Algorithm build RPTree(S)

Ensure: Build the RPTree of a point set S
if |S| ≤ n0 then
  n ← newNode
  Store S into n
  return n
Pick U uniformly at random from the unit sphere
Pick β uniformly at random from [1/4, 3/4]
Let v be the β-fractile point on the projection of S onto U
Rule(x) = (left if ⟨x, U⟩ < v, otherwise right)
left tree ← build RPTree({x ∈ S : Rule(x) = left})
right tree ← build RPTree({x ∈ S : Rule(x) = right})
return (Rule(·), left tree, right tree)

. Remark: RP trees have the following property (more later): the diameter of the cells decreases down the tree at a rate depending on the intrinsic dimension of the data.
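A minimal Python sketch of build RPTree with the jittered fractile split (assumptions: NumPy; a degenerate split falls back to a leaf so the recursion always terminates):

import numpy as np

def build_rptree(S, n0=10, rng=np.random.default_rng(0)):
    if len(S) <= n0:
        return {"leaf": S}
    U = rng.normal(size=S.shape[1])
    U /= np.linalg.norm(U)                 # U uniform on the unit sphere
    beta = rng.uniform(0.25, 0.75)         # perturbed split point
    proj = S @ U
    v = np.quantile(proj, beta)            # beta-fractile of the projections onto U
    left, right = S[proj < v], S[proj >= v]
    if len(left) == 0 or len(right) == 0:  # guard against degenerate splits
        return {"leaf": S}
    return {"U": U, "v": v,
            "left": build_rptree(left, n0, rng),
            "right": build_rptree(right, n0, rng)}

tree = build_rptree(np.random.default_rng(1).random((1000, 20)))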

Page 27

RPTrees: varying splits and their applications

. Various types of splits are possible:

Randomized partition tree:

• exact split

Randomized partition tree:

• perturbed split

Spill tree with overlapping split:

• regular spill tree

• virtual spill tree

(Figure: split fractions. Exact split: 1/2 and 1/2; perturbed split: β and 1 − β; overlapping split: 1/2 + α on each side.)

. NB: the splits govern the tree structure and the search route

. Spill trees:

– Regular spill trees: overlapping cells yield redundant storage of points

– Virtual spill trees: median splits are used, so no redundant storage; the query is routed to multiple leaves using the overlapping splits

. Summary: tree creation versus search

                     Routing data        Routing queries (defeatist style)
RP tree              Perturbed split     Perturbed split
Regular spill tree   Overlapping split   Median split
Virtual spill tree   Median split        Overlapping split

Page 28

Regular spill trees: size

. Tree depth: assume that

I the number of points transmitted to a child is at most a β = 1/2 + α fraction of its parent's,

I a leaf accommodates up to n0 points

Then: the tree depth l satisfies β^l n ≤ n0, i.e. l = O(log_{1/β}(n/n0)).

. Tree size:

n0 2^l = n0 2^{log_{1/β}(n/n0)}.

. Examples:

I α = 0.05: O(n^{1.159}).

I α = 0.1: O(n^{1.357}).

Page 29

Spill trees: compromising Storage vs NN searches

. Spill trees: overlapping splits yield superlinear storage

. Yet, the search may fail too:

(Figure: an overlapping split with median m(S) and the two 1/2 + α fractiles.)

I p(1) = nn(q): routed to the left subtree only

I query point q: routed to the right subtree only

Page 30

Failure of the defeatist search

. Goal: the probability that a defeatist search does not return the exact nearest neighbor(s)?

. The event to be analyzed, denoted Err:

I k = 1: the NN query does not return p(1)

I k > 1: the NN query does not return p(1),. . . ,p(k)

Page 31

Qualifying the hardness of nearest neighbor queries

. Notations:

I Dataset P = p1, . . . , pn

I Sorted dataset wrt q: p(1), . . . , p(n)

Φ(q, P) = (1/n) Σ_{i=2}^{n} ‖q − p(1)‖ / ‖q − p(i)‖. (2)

. Extreme cases:

I Φ ∼ 0: p(1) is isolated, finding it should be easy

I Φ ∼ 1: points are nearly equidistant from q; finding p(1) should be hard

(Figure: the two extreme configurations of q and p(1).)

. Rationale: when using RP trees and spill trees with the defeatist search, the probability of success should depend upon Φ.

Page 32

Generalizations of the function Φ

. Rationale: the function Φ shall be used for nodes containing a subset of the database

. For a cell containing m points – evaluate the remaining points in that cell:

Φm(q, P) = (1/m) Σ_{i=2}^{m} ‖q − p(1)‖ / ‖q − p(i)‖. (3)

. If one is interested in the k nearest neighbors, evaluate the remaining points too:

Φ_{k,m}(q, P) = (1/m) Σ_{i=k+1}^{m} ( ‖q − p(1)‖ + · · · + ‖q − p(k)‖ ) / ‖q − p(i)‖. (4)
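A direct Python transcription of Eqs. (2)-(4) (assumptions: NumPy; P is an (n, d) array, and m defaults to the whole dataset):

import numpy as np

def phi(q, P, k=1, m=None):
    # Phi_{k,m}(q, P): distances to the k nearest neighbors of q, divided by the
    # distances to the remaining points of the cell, averaged over the cell.
    dists = np.sort(np.linalg.norm(P - q, axis=1))
    m = len(dists) if m is None else m
    numerator = dists[:k].sum()            # ||q - p_(1)|| + ... + ||q - p_(k)||
    return np.sum(numerator / dists[k:m]) / m

# Phi close to 0: the nearest neighbors stand out; Phi close to 1: points nearly equidistant.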

Page 33

Random projections: relative position of three points

. In the sequel: q, x, y are 3 points with ‖q − x‖ ≤ ‖q − y‖

. Collinearity index of q, x, y:

coll(q, x, y) = ⟨q − x, y − x⟩ / (‖q − x‖ ‖y − x‖) (5)

. Event E: ⟨y, U⟩ falls strictly in between ⟨q, U⟩ and ⟨x, U⟩

Lemma 1. Consider q, x, y ∈ R^d with ‖q − x‖ ≤ ‖q − y‖. The probability of E, over random directions U, satisfies:

P[E] = (1/π) arcsin( (‖q − x‖ / ‖q − y‖) √(1 − coll(q, x, y)²) ) (6)

Corollary 2.

(1/π) (‖q − x‖ / ‖q − y‖) √(1 − coll(q, x, y)²) ≤ P[E] ≤ (1/2) ‖q − x‖ / ‖q − y‖ (7)

Page 34

Demo with DrGeo

. Compulsory tools for geometers:

. IPE: http://ipe.otfried.org/

. DrGeo: http://www.drgeo.eu/

(Figure: projections of q, x, and y onto a random direction U.)

Page 35

Random projections: separation of neighbors

Theorem 3. Consider q, p1, . . . , pn ∈ R^d, and a random direction U. The expected fraction of the projected pi that fall between q and p(1) is at most (1/2) Φ(q, P).

. Proof. Let Zi be the event "p(i) falls between q and p(1) in the projection". By Corollary 2, P[Zi] ≤ (1/2) ‖q − p(1)‖ / ‖q − p(i)‖. Then, apply the linearity of expectation.

Theorem 4. Let S ⊂ P with p(1) ∈ S. If U is chosen uniformly at random, then for 0 < α < 1, the probability that an α fraction of the projected points of S fall between q and p(1) is bounded by (1/(2α)) Φ_{|S|}(q, P).

. Proof. Previous Thm + Markov’s inequality.

Page 36

Regular spill trees

. Recap:

I Storage: a point is possibly stored twice, using overlapping splits with parameter α; the depth is O(log(n/n0))

I Query routing: routing to a single leaf

Theorem 5. Let β = 1/2 + α. The error probability is:

P[Err] ≤ (1/(2α)) Σ_{i=0,...,l} Φ_{β^i n}(q, P) (8)

. Proof, steps:

I An internal node at depth i contains β^i n points

I For such a node, bound the probability to have q separated from p(1):
p(1) is transmitted to one side of the split ⇒ a fraction α of the points of the cell falls between q and p(1): this occurs with probability at most (1/(2α)) Φ_{β^i n}(q, P)

I To conclude: union-bound over all levels i

Page 37

Virtual spill trees

. Recap:

I Storage: each point is stored in a single leaf (median splits); the depth is O(log(n/n0))

I Query routing: with overlapping splits of parameter α

Theorem 6. Let β = 1/2. The error probability is:

P[Err] ≤ (1/(2α)) Σ_{i=0,...,l} Φ_{β^i n}(q, P) (9)

. Proof, mutatis mutandis:

I Consider the path root - leaf of p(1)

I For a level, bound the proba. to have q routed to one side only

I Add up for all levels

Page 38

Random projection trees

. Recap:

I Pick a random direction and project points onto it

I Split at the β fractile for β ∈ (1/4, 3/4)

I Storage: each point mapped to a single leaf

I Query routing: query point mapped to a single leaf too

Theorem 7. Consider an RP tree for P. Define β = 3/4 and l = log_{1/β}(n/n0). One has:

P[NN query does not return p(1)] ≤ Σ_{i=0,...,l} Φ_{β^i n} ln(2e / Φ_{β^i n}) (10)

. Proof, key steps:

I F : fraction of points separating q and p(1) in projection

I Since the split is chosen at random in an interval of mass 1/2: it separates q and p(1) with probability F/(1/2)

I Integrating yields the result for one level; then, union bound.

Page 39

Spill trees: probability of NN search failure

Theorem 8. (Spill trees) Consider a spill tree of depth l = log_{1/β}(n/n0), with

I β = 1/2 + α for regular spill trees,

I and β = 1/2 for virtual spill trees.

If this tree is used to answer a query q, then:

P[Err] ≤ (1/(2α)) Σ_{i=0,...,l} Φ_{β^i n}(q, P) (11)

NB: β^i n is the number of data points found in an internal node at depth i

Page 40

Randomization is critical for separation

. The separation property fails when using coordinate axes (kd-trees)

(Figure: the coordinate axes, the point x1, and points of the form (M, 0, . . . , 0)^t, . . . , (0, . . . , 0, M)^t.)

. Consider the following point set:

I x1: the all-ones vector

I For each xi, i > 1: pick a random coordinate and set it to a large value M; set the remaining coordinates to uniform random numbers in (0, 1)

. This example rules out kd-trees: kd-trees separate q and p(1), even though the function Φ is arbitrarily small

I The NN of q (=origin) is x1

I Upon picking a random direction (a coordinate axis), the fraction of points falling in between q and x1 is large:

(1/n)(n − n/d) = 1 − 1/d

I But by growing M, function Φ gets close to 0.

Page 41

Doubling dimension (Assouad dimension) and doubling measures

. Def.: A metric space X is called doubling if any ball B(x, r), with x ∈ X and r > 0, can be covered by at most M balls of radius r/2. The doubling dimension is log2 M.

• Example: a line (M = 2)
• Example: a k-dimensional affine space: O(k)
• Example: a smooth k-manifold: also O(k). NB: the proof uses a bound on the sectional curvatures.

. Def.: A measure µ on a metric space X is called doubling if

µ(B(x, 2r)) ≤ C µ(B(x, r)).

Equivalently, ∃ d0 such that ∀α ≥ 1:

µ(B(x, αr)) ≤ α^{d0} µ(B(x, r)).

The dimension of the doubling measure satisfies d0 = log2 C.

. Remarks:

I A metric space supporting a doubling measure is necessarily a doubling metric space, with dimension depending on C.

I Conversely, any complete doubling metric space supports a doubling measure.

Page 42

Intermezzo: Data and their intrinsic dimension

. Intrinsic dimension: in many real-world problems, features may be correlated or redundant, causing data to have low intrinsic dimension, i.e., the data lies close to a low-dimensional manifold

. Example 1: rotating an image

I Consider an n × n pixel image, with each pixel encoded in the RGB channels: 1 image ∼ one point in dimension d = 3n².

I Consider N rotated versions of this image: N points in R^{3n²}

I But these points intrinsically have one degree of freedom (that of the rotation)

. Example 2: human body motion capture

I N markers attached to the body (typically N = 100).

I Each marker measures a position in 3 dimensions: a 3N-dimensional feature space.

I But the motion is constrained by a dozen-or-so joints and angles in the human body.

.Ref: Verma et al. Which spatial partition trees are adaptive to intrinsic dimension? UAI 2009

Page 43

Bounding the function Φ in specific settings

. Improving the bound Φ ≤ 1

Theorem 9. Let µ be a continuous measure on R^d, a doubling measure of dimension d0 ≥ 2. Assume p1, . . . , pn ∼ µ. With probability ≥ 1 − 3δ, for all 2 ≤ m ≤ n:

Φm(q, P) ≤ 6 ((2/m) ln(1/δ))^{1/d0}.

Theorem 10. Under the same hypotheses:
– For both variants of the spill trees:

P[Err] ≤ (c0 k d0 / α) (8 max(k, ln(1/δ)) / n0)^{1/d0}

– For random projection trees with n0 ≥ c0 (3k)^{d0} max(k, ln(1/δ)):

P[Err] ≤ c0 k (d0 + ln n0) (8 max(k, ln(1/δ)) / n0)^{1/d0}

. Rmk: failure proba. can be made arbitrarily small by taking n0 large enough.

Page 44

References

DS13 S. Dasgupta and K. Sinha. Randomized partition trees for exact nearestneighbor search. JMLR: Workshop and Conference Proceedings, 30:1–21,2013.

V12 S. Vempala. Randomly-oriented kd Trees Adapt to Intrinsic Dimension.FSTTCS, 2012.

VKD09 N. Verma, S. Kpotufe, S. Dasgupta, Which spatial partitions are adaptiveto intrinsic dimension? UAI 2009.

Page 45

Nearest Neighbors Algorithms in Euclidean and Metric Spaces

Introduction

kd-trees and basic search algorithms

kd-trees and random projection trees: improved search algorithms

kd-trees and random projection trees: diameter reduction

Metric trees and variants

Distance concentration phenomena: an introduction

A metric from optimal transportation: the earth mover distance

Page 46

Recursive splits: how many splits are required to halve the diameter of a point set?

. Another set defined along the coordinate axes:

I Consider S = ∪_{i=1,...,d} {t e_i, −1 ≤ t ≤ 1}.
I S ⊂ B(0, 1) and S is covered by 2d balls B(·, 1/2) (this number is minimal)
I The Assouad dimension is log(2d)

(Figure: the set S, a union of segments along the axes e1, e2, e3.)

. Observation: kd-trees require
– d splits / levels to halve the diameter of S
– this in turn requires ≥ 2^d points

. Fact: RPTree will do the job!

Page 47

Local covariance dimension and its multi-scale estimation

. Def.: a set T ⊂ R^D has covariance dimension (d, ε) if the largest d eigenvalues of its covariance matrix satisfy

σ1² + · · · + σd² ≥ (1 − ε) · (σ1² + · · · + σD²).

. Def.: local covariance dimension with parameters (d, ε, r): the previous must hold when restricting T to balls of radius r.

. Multi-scale estimation from a point cloud P:

For each data point x and each scale r:
  Collect the points of P in B(x, r)
  Compute their covariance matrix
  Check how many eigenvalues are required to capture the variance: this yields the dimension
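A Python sketch of the per-point, per-scale estimator (assumptions: NumPy; eps = 0.1 and the radii are user-chosen parameters):

import numpy as np

def local_covariance_dimension(P, x, r, eps=0.1):
    # Smallest d such that the top d covariance eigenvalues of the points of P
    # falling in B(x, r) capture a (1 - eps) fraction of the total variance.
    nbrs = P[np.linalg.norm(P - x, axis=1) <= r]
    if len(nbrs) < 2:
        return None
    eigvals = np.linalg.eigvalsh(np.cov(nbrs.T))[::-1]   # decreasing order
    ratios = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(ratios, 1.0 - eps) + 1)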

Page 48

Partitioning rules that adapt to intrinsic dimension

. Principal component analysis: split the data at the median along the principal direction of covariance.

I Drawback 1: the estimation of the principal component requires a significant amount of data, and only about a 1/2^l fraction of the data remains in a cell at level l

I Drawback 2: computationally too expensive for some applications

. 2-means, i.e. the solution of k-means with k = 2: compute the 2-means solution, and split the data as per the cluster assignment

I Drawback 1: 2-means is an NP-hard optimization problem

I Drawback 2: the best known (1 + ε)-approximation algorithm for 2-means (A. Kumar, Y. Sabharwal, and S. Sen, 2004) would require a prohibitive running time of O(2^{d^{O(1)}} Dn), since we need ε ≈ 1/d.

I An approximate solution can be obtained using Lloyd iterations.

Page 49

Random projections and distances

. In R^D: distances roughly get shrunk by a factor 1/√D

(Figure: a segment of length 1 projects to a segment of length about 1/√D.)

Lemma 11. Fix any vector x ∈ R^D. Pick a random unit vector U on S^{D−1}. One has:

P[ |⟨x, U⟩| ≤ α ‖x‖/√D ] ≤ (2/π) α (12)

P[ |⟨x, U⟩| ≥ β ‖x‖/√D ] ≤ (2/β) exp(−β²/2) (13)
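A quick Monte Carlo illustration of the 1/√D shrinkage in Python (assumptions: NumPy; normalized Gaussian vectors stand in for uniform random unit directions):

import numpy as np

rng = np.random.default_rng(0)
D = 500
x = rng.normal(size=D)
U = rng.normal(size=(100_000, D))
U /= np.linalg.norm(U, axis=1, keepdims=True)            # random unit directions
proj = np.abs(U @ x)
print(np.median(proj), np.linalg.norm(x) / np.sqrt(D))   # both of order ||x|| / sqrt(D)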

Page 50

Random projections and diameter

. Projecting a subset S ⊂ R^D along a random direction: how does the diameter of the projection compare to that of S?

. S full dimensional:

diam(projection) ≤ diam(S)

. S has Assouad dimension d (with high probability):

diam(projection) ≤ diam(S) √(d/D)


. Rmk: cover S with 2^d balls of radius 1/2, 4^d balls of radius 1/4, . . . , (1/ε)^d balls of radius ε.

Page 51

Random projection trees algorithm: rationale

I Keep the good properties of PCA at a much lower cost

I intuition: splitting along a random direction is not that different, since it will have some component in the direction of the principal component

I Generally works, but in some cases fails to reduce diameter

I Think of a dense spherical cluster around the mean containing most of the data, and a concentric shell of points much farther away

I characterized by the average interpoint distance ∆A within the cell being much smaller than its diameter ∆

I ⇒ another split is used, based on distance from the mean

Page 52

Linear versus spherical cuts

. Linear split with jitter:

ChooseRule(S)
  {Split by projection: no outliers}
  choose a random unit direction v
  pick any x ∈ S at random
  let y ∈ S be a point realizing the diameter of S
  choose δ at random in [−1, 1] · ‖x − y‖ / √d
  Rule(x) := x · v ≤ (median_{z∈S}(z · v) + δ)

. Combined split:

ChooseRule(S)
  if ∆²(S) ≤ c · ∆²_A(S) then
    {Split by projection: no outliers}
    choose a random unit direction v
    Rule(x) := x · v ≤ median_{z∈S}(z · v)
  else
    {Spherical cut: remove outliers}
    Rule(x) := ‖x − mean(S)‖ ≤ median_{z∈S}(‖z − mean(S)‖)
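A Python sketch of the combined rule (assumptions: NumPy; c = 10 is an illustrative constant; the identity (1/|S|²) Σ_{x,y∈S} ‖x − y‖² = 2 · mean_x ‖x − mean(S)‖² is used for ∆²_A):

import numpy as np

def choose_rule(S, c=10.0, rng=np.random.default_rng(0)):
    mean = S.mean(axis=0)
    sq_to_mean = np.sum((S - mean) ** 2, axis=1)
    delta2_A = 2.0 * sq_to_mean.mean()                  # average squared interpoint distance
    delta2 = np.max(np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=-1))   # squared diameter
    if delta2 <= c * delta2_A:
        U = rng.normal(size=S.shape[1])
        U /= np.linalg.norm(U)
        thr = np.median(S @ U)
        return lambda x: x @ U <= thr                   # split by projection
    thr = np.median(np.sqrt(sq_to_mean))
    return lambda x: np.linalg.norm(x - mean) <= thr    # spherical cut around the mean

rule = choose_rule(np.random.default_rng(1).random((500, 10)))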

Page 53

Random projection trees algorithm: RPTree-max and RPTree-mean

. Algorithm:

MakeTree(S)
  if |S| < MinSize then
    return (Leaf)
  else
    Rule ← ChooseRule(S)
    LeftTree ← MakeTree({x ∈ S : Rule(x) = true})
    RightTree ← MakeTree({x ∈ S : Rule(x) = false})
    return [Rule, LeftTree, RightTree]

. Two options

I RPTree-max: linear split with jitter

I RPTree-mean: combined split

Page 54

Performance guarantee: amortized (i.e., global) result for RPTree-max

. Def.: the radius of a cell C of an RPTree is the smallest r > 0 such that S ∩ C ⊂ B(x, r) for some x ∈ C.

(Figure: a cell C, a center x, and the radius r of S ∩ C.)

Theorem 12. (RPTree-max) Consider an RPTree-max built for a dataset S ⊂ R^D. Pick any cell C of the tree; assume that S ∩ C has Assouad dimension ≤ d. There exists a constant c1 such that with probability ≥ 1/2, for every descendant C′ more than c1 d log d levels below C, one has radius(C′) ≤ radius(C)/2.

Page 55

Performance guarantee: per-level result for RPTree-mean, with adaptation to the covariance dimension

Theorem 13. (RPTree-mean) There exist constants 0 < c1, c2, c3 < 1 for which the following holds.

I Consider any cell C such that S ∩ C has covariance dimension (d, ε), with ε < c1

I Pick x ∈ S ∩ C at random, and let C′ be the cell containing it at the next level down

I Then, if C is split:

• by projection ( ∆²(S) ≤ c · ∆²_A(S) ):

E[∆²_A(S ∩ C′)] ≤ (1 − c3/d) ∆²_A(S ∩ C)

• by distance, i.e. spherical cut:

E[∆²(S ∩ C′)] ≤ c2 ∆²(S ∩ C)

. NB: the expectation is over the randomization in splitting C and the choice of x ∈ S ∩ C.

Page 56

Empirical results: contenders

. Algorithms:

I dyadic trees: pick a direction and split at the midpoint; cycle through the coordinates.

I kd-tree: split at median along direction with largest spread.

I random projection trees: split at the median along a random direction.

I PD / PCA trees: split at the median along the principal eigenvector of the covariance matrix.

I two-means trees: solve the 2-means problem; pick the direction spanned by the two centroids, and split the data as per the cluster assignment.

Page 57

Real world datasets

. Real world datasets:

I Teapot dataset: rotated images of a teapot (one B&W image: 50×30 pixels); thus, a 1D dataset in ambient dimension 1500.

I Robotic arm: a dataset in R^12; yet, the robotic arm has 2 joints: a (noisy) 2D dataset in ambient dimension 12.

I Images of the digit 1 from the MNIST OCR dataset: 20×20 B&W images, i.e. points in ambient dimension 400.

Page 58

Empirical results: local covariance dimension estimation

. Conventions: bold lines: estimate d(r); dashed lines: std dev; numbers: average number of samples in balls of the given radius

. Observations:

I Swiss roll (ambient dimension 3): failure at small scales (noise dominates) and at large scales (the sheets get blended).

I Teapot: a clear low-dimensional structure at small scales, but rather 3-4 than 1.

.Ref: N. Verma, S. Kpotufe, and S. Dasgupta, 2009

Page 59

Empirical results: performance for NN searches

. Searching p(1): the performance measure is the rank of the NN found, divided by the dataset size

I tree depth: if one stops the recursion at a given depth

I numbers indicate the ratio ‖q − nn(q)‖ / ‖q − p(1)‖

. Observations:

I quality index deteriorates with depth (separation does occur)

I 2M and PD (i.e. PCA) trees consistently yield better nearest neighbors: better adaptation to the intrinsic dimension

.Ref: N. Verma, S. Kpotufe, and S. Dasgupta, 2009

Page 60

Empirical results: regression

. Regression:

I one predicts the rotation angle (response variable)

I performance is l2 error on the response variable

I theory says that the best results are expected for data structures adapting to the intrinsic dimension

. Observations:

I Best results for 2M trees, PD (i.e., PCA) trees, and RP trees.

.Ref: N. Verma, S. Kpotufe, and S. Dasgupta, 2009

Page 61

Diameter reduction again: the revenge of kd-trees

. Diameter reduction property: holds for kd-trees on randomly rotated data

. Rmk: one random rotation suffices

.Ref: Vempala, Santosh. Randomly-oriented kd Trees Adapt to Intrinsic Dimension. FSTTCS, Vol. 18, 2012.

Page 62

References

I Dasgupta, Sanjoy, and Yoav Freund. Random projection trees and low dimensional manifolds. Proceedings of the fortieth annual ACM symposium on Theory of computing. ACM, 2008.

I Verma, Nakul, Samory Kpotufe, and Sanjoy Dasgupta. Which spatial partition trees are adaptive to intrinsic dimension? Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, 2009.

I J. Heinonen, Lectures on analysis on metric spaces, Springer, 2001.

Page 63

Nearest Neighbors Algorithms in Euclidean and Metric Spaces

Introduction

kd-trees and basic search algorithms

kd-trees and random projection trees: improved search algorithms

kd-trees and random projection trees: diameter reduction

Metric trees and variants

Distance concentration phenomena: an introduction

A metric from optimal transportation: the earth mover distance

Page 64

Metric spaces

Definition 14. A metric space is a pair (M, d), with d : M × M → R+, such that:

I (1) Positivity: d(x , y) ≥ 0

I (1a) Self-distance: d(x , x) = 0

I (1b) Isolation: x 6= y ⇒ d(x , y) > 0

I (2) Symmetry: d(x , y) = d(y , x)

I (3) Triangle inequality: d(x , y) ≤ d(x , z) + d(y , z)

. Product metric. Assume that for some k > 1:

M = M1 × · · · ×Mk . (14)

and that each (Mi , di ) is a metric space. For p ≥ 1, the product metric is:

d(x, y) = ( Σ_{i=1}^{k} di(xi, yi)^p )^{1/p} (15)

Some particular cases are:

I (Mi = R, di =| · |): Lp metrics.

I p = 1, di = uniform metric: Hamming distance.

Page 65

A geometric distance: the Hausdorff distance

. Hausdorff distance. Consider a metric space (M, d). The Hausdorff distance of two non-empty subsets X and Y is defined by

dH(X, Y) = max(H(X, Y), H(Y, X)), with H(X, Y) = sup_{x∈X} inf_{y∈Y} d(x, y). (16)

Note that the one-sided distance H is not symmetric, as seen in the figure below.

. Rmk: for closed sets, the distances are realized: inf becomes min and sup becomes max.

(Figure: two sets X and Y for which the one-sided Hausdorff distance is not symmetric.)
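A direct Python computation for finite point sets (assumptions: NumPy; Euclidean ground metric):

import numpy as np

def hausdorff(X, Y):
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)   # all pairwise distances
    return max(D.min(axis=1).max(),    # H(X, Y) = max_x min_y d(x, y)
               D.min(axis=0).max())    # H(Y, X) = max_y min_x d(x, y)

print(hausdorff(np.array([[0.0], [1.0]]), np.array([[0.0], [3.0]])))   # 2.0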

Page 66

A geometric distance for two ordered point clouds: the least Root Mean Square Deviation (lRMSD)

. Data: two point sets A = {ai}_{i=1,...,n} and B = {bi}_{i=1,...,n}, with a 1-to-1 correspondence ai ↔ bi.

. Root Mean Square Deviation:

RMSD(A, B) = √( (1/n) Σ_{i=1}^{n} ‖ai − bi‖² ) (17)

(Figure: two point sets a1, . . . , a5 and b1, . . . , b5 in correspondence.)

. Least Root Mean Square Deviation:

lRMSD(A, B) = min_{g ∈ SE(3)} RMSD(A, g · B). (18)

.Ref: Umeyama, IEEE PAMI 1991

.Ref: Steipe, Acta Crystallographica Section A, 2002
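A Python sketch of the rigid alignment underlying the lRMSD, via the classical SVD-based (Kabsch/Umeyama) procedure (assumptions: NumPy; rotations only, reflections excluded; correspondences ai ↔ bi are given by the ordering):

import numpy as np

def lrmsd(A, B):
    Ac, Bc = A - A.mean(axis=0), B - B.mean(axis=0)   # quotient out translations
    U, _, Vt = np.linalg.svd(Bc.T @ Ac)               # SVD of the cross-covariance
    sign = np.sign(np.linalg.det(Vt.T @ U.T))         # avoid an improper rotation
    D = np.diag([1.0] * (A.shape[1] - 1) + [sign])
    R = Vt.T @ D @ U.T                                # optimal rotation sending B onto A
    return np.sqrt(np.mean(np.sum((Ac - Bc @ R.T) ** 2, axis=1)))

rng = np.random.default_rng(0)
A = rng.random((50, 3))
theta = 0.3
rot = np.array([[np.cos(theta), -np.sin(theta), 0], [np.sin(theta), np.cos(theta), 0], [0, 0, 1]])
print(lrmsd(A, A @ rot.T + 1.0))   # close to 0: B is a rigid motion of A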

Page 67

Using the triangle inequality

Lemma 15. For any three points p, q, s ∈ M, for any r > 0, and for any point set P ⊂ M, one has:

| d(q, p) − d(p, s) | ≤ d(q, s) ≤ d(q, p) + d(p, s) (19)

d(q, s) ≥ dP(q, s) := max_{p∈P} | d(q, p) − d(p, s) | (20)

d(p, s) > d(p, q) + r ⇒ d(q, s) > r;  d(p, s) < d(p, q) − r ⇒ d(q, s) > r. (21)

Figure: Lower bound from the triangle inequality on d(q, s), from the points p, q, s; see Lemma 15

Page 68

Metric tree: definition

. Definition:

I A binary tree

I Any internal node implements a spherical cut defined by a distance threshold µ to a pivot v
I right subtree: points p such that d(pivot, p) ≥ µ
I left subtree: points p such that d(pivot, p) < µ

Figure: Metric tree for a square domain, with pivot v and threshold µ. (A) One step (B) Full tree

Page 69

Metric tree: construction

. Recursive construction:

I Choose a pivot, ideally inducing a partition into subsets of the same size

I Assign points to subtrees and recurse

I Complexity under the balanced subtrees assumption: O(n log n).

Algorithm 1 build MetricTree(S)
if S = ∅ then
  return NIL
n ← newNode
Draw at random Q ⊂ S and v ∈ Q
n.pivot ← v
µ ← median({d(v, p), p ∈ Q \ {v}})
{The pivot splits the points into two subsets}
L ← {s ∈ S \ {v} | d(s, v) < µ}
R ← {s ∈ S \ {v} | d(s, v) ≥ µ}
{For each subtree: min/max distances to the points in that subtree}
n.(d1, d2) ← (min, max) of the distances d(v, p), p ∈ L
n.(d3, d4) ← (min, max) of the distances d(v, p), p ∈ R
{Recursion}
n.L ← build MetricTree(L)
n.R ← build MetricTree(R)
return n
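A minimal Python sketch of build MetricTree (assumptions: NumPy; Euclidean metric; the pivot is drawn at random and the threshold µ is the median distance to it; d1-d4 are cached for the pruning lemma of the next slides):

import numpy as np

def build_metric_tree(S, n0=8, rng=np.random.default_rng(0)):
    if len(S) == 0:
        return None
    if len(S) <= n0:
        return {"leaf": S}
    v = S[rng.integers(len(S))]                 # pivot
    dists = np.linalg.norm(S - v, axis=1)
    mu = np.median(dists)
    L, R = S[dists < mu], S[dists >= mu]
    if len(L) == 0 or len(R) == 0:              # degenerate split: stop
        return {"leaf": S}
    return {"pivot": v, "mu": mu,
            "d1": dists[dists < mu].min(), "d2": dists[dists < mu].max(),
            "d3": dists[dists >= mu].min(), "d4": dists[dists >= mu].max(),
            "left": build_metric_tree(L, n0, rng),
            "right": build_metric_tree(R, n0, rng)}

tree = build_metric_tree(np.random.default_rng(1).random((1000, 8)))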

Page 70

Searching a metric tree: algorithm

Algorithm 2 search MetricTree(T, q)
{The root node of T is denoted n; initially nn(q) ← ∅ and τ ← ∞}
if n = NIL then
  return
{Check whether the pivot is the nn}
l ← d(q, n.pivot)
if l < τ then
  nn(q) ← n.pivot; τ ← l
{Dilate the distance intervals for the left and right subtrees}
Il ← [n.d1 − τ, n.d2 + τ]
Ir ← [n.d3 − τ, n.d4 + τ]
if l ∈ Il then
  search MetricTree(n.L, q)
if l ∈ Ir then
  search MetricTree(n.R, q)

(Figure: the pivot v and the distance intervals, with d1 = min_{p∈L} d(v, p), d2 = max_{p∈L} d(v, p), d3 = min_{p∈R} d(v, p), d4 = max_{p∈R} d(v, p); each interval is dilated by τ.)
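A Python sketch of the exact search with the pruning lemma, continuing the build_metric_tree sketch of the previous slide (assumptions: NumPy; the node layout is the dictionary produced there; the final assert is a brute-force check):

import numpy as np

def search_metric_tree(node, q, best=None, tau=np.inf):
    # Prunes a subtree when l = d(q, pivot) falls outside its dilated interval
    # [d_min - tau, d_max + tau]; otherwise recurses into it.
    if node is None:
        return best, tau
    if "leaf" in node:
        for p in node["leaf"]:
            d = np.linalg.norm(q - p)
            if d < tau:
                best, tau = p, d
        return best, tau
    l = np.linalg.norm(q - node["pivot"])
    if l < tau:
        best, tau = node["pivot"], l
    if node["d1"] - tau <= l <= node["d2"] + tau:       # interval Il
        best, tau = search_metric_tree(node["left"], q, best, tau)
    if node["d3"] - tau <= l <= node["d4"] + tau:       # interval Ir
        best, tau = search_metric_tree(node["right"], q, best, tau)
    return best, tau

rng = np.random.default_rng(2)
P, q = rng.random((1000, 8)), rng.random(8)
_, tau = search_metric_tree(build_metric_tree(P), q)
assert np.isclose(tau, np.linalg.norm(P - q, axis=1).min())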

Page 71

Searching a metric tree: correctness – pruning lemma

Lemma 16. Consider the intervals associated with a node, as defined in Algorithm 2, that is Il = [n.d1 − τ, n.d2 + τ] and Ir = [n.d3 − τ, n.d4 + τ]. Then:
(1) If l ∉ Il, the left subtree can be pruned.
(2) If l ∉ Ir, the right subtree can be pruned.

Proof. We prove (1), as case (2) is similar. Let us denote IL = [d1, d2]. Since l = d(v, q) ∉ Il, we have d(v, q) < d1 − τ or d(v, q) > d2 + τ. We analyze these two conditions in turn.

. Condition on the right hand side. By definition of d2, we have:

∀p ∈ L : d(v , q) > d(v , p) + τ.

Using the triangle inequality for d(v , q) yields

d(v , p) + d(p, q) ≥ d(v , q) > d(v , p) + τ ⇒ d(q, p) > τ.

. Mutatis mutandis.

Page 72

Metric tree: choosing the pivot

. By the pruning lemma: for small τ, and if q is picked uniformly at random, the measure of the boundary of the spheres of radius d1, . . . , d4 determines the probability that no pruning takes place.
⇒ pick the pivot so as to minimize this measure.

. Example in 2D: 3 choices for the pivot, so as to split the unit square (mass: 1) into two regions of equal size (mass: 1/2)

. Choice of pivots (illustrated using µ rather than the di's):

I Best pivot: pc

I Worst pivot: pm

Figure: Metric trees: minimizing the measure of boundaries.

Page 73

From metric trees to metric forests

. Compromising speed versus accuracy:

I The exact search may be replaced by the defeatist search: visit one subtree only, instead of using the pruning lemma.

I Then, using a forest of trees rescues erroneous branching decisions made in the course of the search.

Figure: Metric forest

Page 74

References

AMN+98 S. Arya et al. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM (JACM), 45(6):891–923, 1998.

ML09 Marius Muja and David G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP (1), pages 331–340, 2009.

MSMO03 Francisco Moreno-Seco, Luisa Mico, and Jose Oncina. A modification of the LAESA algorithm for approximated k-NN classification. Pattern Recognition Letters, 24(1):47–53, 2003.

OD13 S. O'Hara and B.A. Draper. Are you using the right approximate nearest neighbor algorithm? In Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pages 9–14. IEEE, 2013.

Yia93 Peter N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA, volume 93, pages 311–321, 1993.

Page 75

Nearest Neighbors Algorithms in Euclidean and Metric Spaces

Introduction

kd-trees and basic search algorithms

kd-trees and random projection trees: improved search algorithms

kd-trees and random projection trees: diameter reduction

Metric trees and variants

Distance concentration phenomena: an introduction

A metric from optimal transportation: the earth mover distance

Page 76

p-norms and Unit Balls

. Notations:

I d: the dimension of the space

I F : a 1d distribution

I X = (X1, . . . , Xd): a random vector such that Xi ∼ F
I P = {p(j)}: a collection of n iid realizations of X

. Generalizations of Lp norms, p > 0:

‖X‖p = ( Σ_i |Xi|^p )^{1/p} (22)

Unit balls: see plots

. Cases of interest in the sequel:

I Minkowski norms: p an integer, p ≥ 1

I fractional p-norms: 0 < p < 1. NB: the triangle inequality is not respected and the balls are not convex; these are sometimes called pre-norms.

. Study the variation of ‖·‖p as a function of d

Page 77

Concentration of the Euclidean Norm: Observations

. Plotting the variation of the following quantities for random points in [0, 1]^d:

min ‖·‖2,  E[‖·‖2] − σ[‖·‖2],  E[‖·‖2],  E[‖·‖2] + σ[‖·‖2],  max ‖·‖2,  M = √d (23)

. Observation:

I The average value increases with the dimension d

I The standard deviation seems to be constant

I For d ≤ 10, i.e. d small: the min and max values are close to the bounds, namely the lower bound 0 and the upper bound M = √d

I For d large, say d ≥ 10: the norm concentrates within a small portion of the domain; the gap wrt the bounds widens as d increases.
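A quick empirical check in Python (assumptions: NumPy; F is the uniform distribution on [0, 1]; 10000 sample points per dimension):

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    norms = np.linalg.norm(rng.random((10_000, d)), axis=1)
    print(f"d={d:5d}  min={norms.min():.3f}  mean={norms.mean():.3f}  "
          f"std={norms.std():.3f}  max={norms.max():.3f}  sqrt(d)={np.sqrt(d):.3f}")
# The mean grows with d while the standard deviation stays essentially constant.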

Page 78

Concentration of the Euclidean Norm: Theorem

Theorem 17. Let X ∈ R^d be a random vector with iid components Xi ∼ F. There exist constants a and b that do not depend on the dimension (they depend on F), such that:

E[‖X‖2] = √(ad − b) + O(1/d) (24)

Var[‖X‖2] = b + O(1/√d). (25)

. Remarks:

I The variance is small wrt the expectation, see plot

I The error made in using E[‖X‖2] instead of ‖X‖2 becomes negligible: it looks like the points lie on a sphere of radius E[‖X‖2].

I The results generalize even if the Xi are not independent; then, d gets replaced by the number of degrees of freedom.

Page 79

Contrast and Relative Contrast: Definition

. Contrast and relative contrast of n iid random draws of X. The annulus centered at the origin and containing the points is characterized by:

Contrast_a := Dmax − Dmin = max_j ‖p(j)‖_p − min_j ‖p(j)‖_p. (26)

and the relative contrast is defined by:

Contrast_r = (Dmax − Dmin) / Dmin. (27)

. Variation of the contrast | Dmax − Dmin | for various p and increasing d :

(Plots for p = 3, 2, 1, 2/3, 2/5.)

Page 80

Contrast and Relative Contrast: the case of Minkowski norms

Theorem 18. Consider n points which are iid realizations of X. There exists a constant Cp such that the absolute contrast of a Minkowski norm satisfies:

Cp ≤ lim_{d→∞} E[ (Dmax − Dmin) / d^{1/p−1/2} ] ≤ (n − 1) Cp. (28)

. Observations:

I The contrast grows as d^{1/p−1/2}

Metric   Contrast Dmax − Dmin
L1       C1 √d
L2       C2
L3       0

I The Manhattan metric is the only one for which the contrast grows with d.

I For the Euclidean metric, the contrast converges to a constant.

I For p ≥ 3, the contrast converges to zero: the distance does not discriminate between the notions of close and far.

I NB: the bounds depend on n; it makes sense to try to exploit the particular coordinates at hand (cf later).

. NB: theorems also exist for the relative contrast and for other p-norms
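An empirical look at the contrast Dmax − Dmin in Python (assumptions: NumPy; n = 100 uniform points per dimension; fractional exponents are evaluated with the same formula even though they do not define norms):

import numpy as np

rng = np.random.default_rng(0)
for d in (10, 100, 1000, 10_000):
    P = rng.random((100, d))
    for p in (0.5, 1, 2, 3):
        D = (P ** p).sum(axis=1) ** (1.0 / p)   # ||p_(j)||_p; the points are non-negative here
        print(f"d={d:6d}  p={p:3}  contrast={D.max() - D.min():.4f}")
# p = 1: the contrast keeps growing with d; p = 2: it stabilizes; p = 3: it shrinks towards 0.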

Page 81

Practical Implications for Exact NN Queries

. Curse of dimensionality: the concentration of distances can be such that a method with expected logarithmic cost performs no better than a linear scan. Recall the non-defeatist search strategies in kd-trees and metric trees: if all distances are comparable, a linear number of nodes gets visited.

. Use less concentrated metrics, with more discriminative power.

. Sanity check: when running a NN query, make sure that distances are meaningful: bi-modality of the distribution of distances is a good sanity check.

. Example: selected features yield two modes rather than a continuum of distances:

Page 82

A wise use of distances

. Distance filtering:

I What is the nearest neighbor in high dimensional spaces?, Hinneburg etal, VLDB 2000.

I Using sketch-map coordinates to analyze and bias molecular dynamics simulations, Parrinello et al, PNAS 109, 2012.

. Feature selection:

I Random Forests, Breiman, Machine learning 2001

I Principal Differences Analysis: Interpretable Characterization of Differences between Distributions, Mueller et al, NIPS 2015

Page 83

References

AHK01 Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim. On the surprising behavior of distance metrics in high dimensional space. Springer, 2001.

CTP11 M. Ceriotti, G. Tribello, and M. Parrinello. Simplifying the representation of complex free-energy landscapes using sketch-map. PNAS, 108(32):13023-13028, 2011.

FWV07 D. Francois, V. Wertz, and M. Verleysen. The concentration of fractional distances. IEEE Trans. on Knowledge and Data Engineering, 19(7):873-886, 2007.

HAK00 Alexander Hinneburg, Charu C. Aggarwal, and Daniel A. Keim. What is the nearest neighbor in high dimensional spaces? VLDB 2000.

Page 84

Nearest Neighbors Algorithms in Euclidean and Metric Spaces

Introduction

kd-trees and basic search algorithms

kd-trees and random projection trees: improved search algorithms

kd-trees and random projection trees: diameter reduction

Metric trees and variants

Distance concentration phenomena: an introduction

A metric from optimal transportation: the earth mover distance

Page 85

Comparing Histograms

. Bin-to-bin methods:

d(H, K) = Σ_i | h_i − k_i | (29)

→ overestimates the distance since neighboring bins are not considered.

. Mixing (e.g. quadratic) methods:

d²(H, K) = (h − k)^t A (h − k) (30)

→ underestimates distances: tends to accentuate the similarity of color distributions without a pronounced mode.

. Illustrations: panels (A)-(D).

Page 86

Transport Plan Between Two Weighted Point Sets

. Weighted point sets:

P = {(p1, wp1), . . . , (pm, wpm)} and Q = {(q1, wq1), . . . , (qn, wqn)}. (31)

NB: nodes from P (resp. Q) are production (resp. demand) nodes. Shorthand for the sums of masses: WP = Σ_i wpi, WQ = Σ_j wqj.

. A metric d(·, ·): distance between two points dij = d(pi , qj).

. A transport plan is a set of non-negative flows fij circulating on the edges of the bipartite graph P × Q.

Figure: Transport plan between two weighted point sets: a flow fij between (pi, wpi) and (qj, wqj)

Page 87

The Earth Mover Distance

. Optimization problem:

Minimize: C_EMD-LP = Σ_{ij} fij dij under the constraints: (32)

(C1) fij ≥ 0
(C2) Σ_j fij ≤ wpi, ∀i
(C3) Σ_i fij ≤ wqj, ∀j
(C4) Σ_i Σ_j fij = min(WP, WQ). (33)

These constraints read as follows:

I (C1) Flows are positive

I (C2,C3) A node cannot export (resp. receive) more than its weight.

I (C4) The total flow neither exceeds the production nor the demand.

. Earth mover distance: defined from the cost by

d_EMD-LP = C_EMD-LP / Σ_{ij} fij = C_EMD-LP / min(WP, WQ) (34)

. Advantages:

I Applies to signatures in general, the histograms being a particular case.

I Embeds the notion of nearness, via the metric in the ground space.

I Allows for partial matches. See, however, the comparison with Mallows' distance below.

I Easy to compute: linear program.

.Ref: Rubner, Tomasi, Guibas, IJCV, 2000
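A Python sketch of the linear program (32)-(34) (assumptions: SciPy's linprog; the small example reuses the 1D configuration of the Mallows distance slide below, with normalized weights, so the value 0.5 matches that computation):

import numpy as np
from scipy.optimize import linprog

def emd(P, wp, Q, wq):
    m, n = len(P), len(Q)
    cost = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1).ravel()   # d_ij, row-major
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0      # (C2): sum_j f_ij <= w_pi
    for j in range(n):
        A_ub[m + j, j::n] = 1.0               # (C3): sum_i f_ij <= w_qj
    b_ub = np.concatenate([wp, wq])
    A_eq = np.ones((1, m * n))                # (C4): total flow = min(W_P, W_Q)
    b_eq = [min(wp.sum(), wq.sum())]
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun / b_eq[0]                  # Eq. (34): cost divided by the total flow

P = np.array([[1.0], [4.0]]); wp = np.array([0.5, 0.5])
Q = np.array([[1.0], [2.0], [3.0], [4.0]]); wq = np.array([0.25] * 4)
print(emd(P, wp, Q, wq))                      # 0.5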

Page 88

Application to image retrieval

. Image coding, two options:

I convert the image to a histogram using a fixed binning of the color space; the mass of a bin is the number of pixels within it.

I cluster the pixels (say with k-means): the mass of a cluster is the fraction of pixels assigned to it

. Search on DB of 20,000 images: (a) L1 (d) Quadratic form (e) EMD

Page 89

Mallows' Distance: Definition

. Consider two random variables in R^d: X ∼ P, Y ∼ Q.

. Mallows distance between X and Y: the minimum of the expected difference between X and Y over all joint distributions F for (X, Y), such that the marginal of X is P and that of Y is Q:

Mp(X, Y) = min_F { (E_F ‖X − Y‖^p)^{1/p} : (X, Y) ∼ F, X ∼ P, Y ∼ Q }. (35)

. Discrete setting: P and Q

P = {(x1, wp1), . . . , (xm, wpm)} (36)

Q = {(y1, wq1), . . . , (yn, wqn)}. (37)

. A joint distribution is specified by probabilities on all pairs, i.e. F = {fij}; the fact that it respects the marginals yields:

Σ_j fij = pi,  Σ_i fij = qj,  Σ_{ij} fij = 1. (38)

. The functional to be minimized becomes:

E_F ‖X − Y‖^p = Σ_{ij} fij ‖xi − yj‖^p. (39)

Page 90

Mallows' Distance versus EMD

. Mallows' distance (WP = WQ = 1): Mp = 1/4 × 0 + 1/4 × 1 + 1/4 × 1 + 1/4 × 0

. EMD, assuming uniform weights on all points, i.e. WP = 2 and WQ = 4: EMD = 0, since a flow of 2 units satisfies all constraints.

Figure: Mallows' distance. X = {(x1 = 1, p1 = 1/2), (x2 = 4, p2 = 1/2)}, Y = {(y1 = 1, q1 = 1/4), (y2 = 2, q2 = 1/4), (y3 = 3, q3 = 1/4), (y4 = 4, q4 = 1/4)}; optimal flows f11 = f12 = f23 = f24 = 1/4.

Page 91

References

LB01 Elizaveta Levina and Peter Bickel. The earth mover's distance is the Mallows distance: Some insights from statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2, pages 251–256. IEEE, 2001.

RTG00 Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.

ZLZ05 Ding Zhou, Jia Li, and Hongyuan Zha. A new Mallows distance based metric for comparing clusterings. In Proceedings of the 22nd international conference on Machine learning, pages 1028–1035. ACM, 2005.

CM14 F. Cazals and D. Mazauric. Mass transportation problems with connectivity constraints, with applications to energy landscape comparison. Submitted, 2014. Preprint: Inria tech report 8611.