
A Nonlinear Approach to Dimension Reduction

Robert Krauthgamer

Weizmann Institute of Science

Joint work with Lee-Ad Gottlieb

A Nonlinear Approach to Dimension Reduction 2

Data As High-Dimensional Vectors

Data is often represented by vectors in R^d:
- Image → color or intensity histogram
- Document → word frequency

A typical goal is Nearest Neighbor Search: preprocess the data so that, given a query vector, one can quickly find the closest vector in the data set. This is common in various data-analysis tasks: classification, learning, clustering.

A Nonlinear Approach to Dimension Reduction 3

Curse of Dimensionality [Bellman'61]: the cost of maintaining data is exponential in the dimension. This observation extends to many useful operations, e.g. Nearest Neighbor Search [Clarkson'94].

Dimension reduction = represent high-dimensional data in a low-dimensional space: map the given vectors into a low-dimensional space while preserving most of the data.

Goal: trade off accuracy for computational efficiency. Common interpretation: preserve pairwise distances.

A Nonlinear Approach to Dimension Reduction 4

The JL Lemma

Theorem [Johnson-Lindenstrauss, 1984]: For every n-point set X ⊆ R^d and every 0 < ε < 1, there is a map φ: X → R^k, for k = O(ε⁻² log n), that preserves all distances within a factor of 1+ε:

  ||x−y||₂ ≤ ||φ(x)−φ(y)||₂ ≤ (1+ε)·||x−y||₂  for all x,y ∈ X.

Can be realized by a simple linear transformation: a random k×d matrix works, with entries from {−1,0,1} [Achlioptas'01] or Gaussian [Dasgupta-Gupta'98, Indyk-Motwani'98].

Applications in a host of problems in computational geometry.

Can we do better? A nearly-matching lower bound was given by [Alon'03], but it is existential…
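A minimal numpy sketch of the Gaussian realization above (the toy data set, ε, and the constant 4 in the choice of k are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 500, 1000, 0.5
X = rng.standard_normal((n, d))               # toy data: n points in R^d
k = int(4 * np.log(n) / eps**2)               # k = O(eps^-2 log n)

A = rng.standard_normal((k, d)) / np.sqrt(k)  # random Gaussian k x d matrix
Y = X @ A.T                                   # the linear map x -> Ax

# Spot-check the distortion on a few random pairs of distinct points;
# the ratios should lie in roughly [1-eps, 1+eps].
for _ in range(5):
    i, j = rng.choice(n, size=2, replace=False)
    print(np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j]))
```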

A Nonlinear Approach to Dimension Reduction 5

Doubling Dimension

Definition: the ball B(x,r) = all points within distance r from x. The doubling constant λ(M) of a metric M is the minimum λ such that every ball can be covered by λ balls of half the radius. First used by [Assouad'83], algorithmically by [Clarkson'97]. The doubling dimension is dim(M) = log₂ λ(M) [Gupta-K.-Lee'03].

Applications:
- Approximate nearest neighbor search [Clarkson'97, K.-Lee'04, …, Cole-Gottlieb'06, Indyk-Naor'06, …]
- Spanners [Talwar'04, …, Gottlieb-Roditty'08]
- Compact routing [Chan-Gupta-Maggs-Zhou'05, …]
- Network triangulation [Kleinberg-Slivkins-Wexler'04, …]
- Distance oracles [Har-Peled-Mendel'06]
- Embeddings [Gupta-K.-Lee'03, K.-Lee-Mendel-Naor'04, Abraham-Bartal-Neiman'08]

(Figure: an example point set with doubling constant λ ≤ 7.)
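To make the definition concrete, here is a small sketch of my own (not from the talk) that estimates the doubling constant of a finite point set: inside a ball B(x,r), greedily pick uncovered points as centers of (r/2)-balls; the resulting cover size, maximized over sampled balls, estimates λ.

```python
import numpy as np

def half_radius_cover(points, x, r):
    """Greedily cover the points of B(x, r) by balls of radius r/2: any
    point not yet covered becomes a new center. The number of centers
    upper-bounds the cover number of this particular ball."""
    ball = points[np.linalg.norm(points - x, axis=1) <= r]
    centers = []
    for p in ball:
        if all(np.linalg.norm(p - c) > r / 2 for c in centers):
            centers.append(p)
    return centers

rng = np.random.default_rng(1)
P = rng.standard_normal((2000, 3))   # toy point set in R^3
lam = max(len(half_radius_cover(P, x, 1.0)) for x in P[:100])
print(f"doubling constant estimate: {lam}, dimension ~ {np.log2(lam):.1f}")
```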

A Nonlinear Approach to Dimension Reduction 6

A Stronger Version of JL?

The near-tight lower bound of [Alon'03] is existential: it holds for X = the uniform metric, where dim(X) = log n.

Open: Extend the lower bound to spaces with doubling dimension ≪ log n?
Open: A JL-like embedding into dimension k = O(dim(X))?
- Even constant distortion would be interesting [Lang-Plaut'01, Gupta-K.-Lee'03].
- It cannot be attained by linear transformations [Indyk-Naor'06], as the following example shows.

Example [Kahane'81, Talagrand'92] (Wilson's helix): x_j = (1,1,…,1,0,…,0) ∈ R^n with j ones, so that ||x_i−x_j|| = |i−j|^{1/2}.
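A quick numeric check of this example (a minimal sketch; the assertion is exactly the identity above):

```python
import numpy as np

# Wilson's helix: row j of X has j+1 leading ones, so x_i - x_j has exactly
# |i-j| nonzero entries of absolute value 1, giving ||x_i - x_j|| = |i-j|^(1/2).
n = 10
X = np.tril(np.ones((n, n)))
for i in range(n):
    for j in range(n):
        assert np.isclose(np.linalg.norm(X[i] - X[j]), abs(i - j) ** 0.5)
```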

Theorem [Johnson-Lindenstrauss, 1984]: For every n-point set X ⊆ ℓ₂ and every 0 < ε < 1, there is a linear embedding φ: X → ℓ₂^k, for k = O(ε⁻² log n), such that for all x,y ∈ X,

  ||x−y||₂ ≤ ||φ(x)−φ(y)||₂ ≤ (1+ε)·||x−y||₂,

i.e. distortion 1+ε.

We present two partial resolutions, using Õ(dim²(X)) dimensions:
1. Distortion 1+ε at a single scale, i.e. for pairs with ||x−y|| ∈ [εr, r].
2. A global embedding of the snowflake metric ||x−y||^{1/2}.
2′. Hence the conjecture is correct whenever ||x−y||² is a Euclidean metric.

A Nonlinear Approach to Dimension Reduction 7

I. Embedding for a Single Scale

Theorem 1: For every finite subset X ⊆ ℓ₂ and all 0 < ε < 1, r > 0, there is an embedding f: X → ℓ₂^k, for k = Õ(log(1/ε)·(dim X)²), satisfying:
1. Lipschitz: ||f(x)−f(y)|| ≤ ||x−y|| for all x,y ∈ X.
2. Bi-Lipschitz at scale r: ||f(x)−f(y)|| ≥ Ω(||x−y||) whenever ||x−y|| ∈ [εr, r].
3. Boundedness: ||f(x)|| ≤ r for all x ∈ X.

Compared to the open question: bi-Lipschitz at only one scale (weaker), but the distortion is an absolute constant (stronger).

This talk illustrates the proof for constant distortion; the 1+ε distortion is attained later, for distances in [ε′r, r]. Overall approach: divide and conquer.

A Nonlinear Approach to Dimension Reduction 8

Step 1: Net Extraction

Extract from the point set X an εr-net, with the two net properties:
- Covering (radius εr): every point of X is within εr of some net point.
- Packing (distance εr): distinct net points are at distance ≥ εr.

Packing ⇒ a ball of radius s contains O((s/εr)^{dim(X)}) net points.

We shall build a good embedding for the net points (a greedy construction is sketched below). Why can we ignore the non-net points?
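A minimal sketch of the standard greedy net construction (points as numpy rows; this quadratic-time version is for illustration only):

```python
import numpy as np

def greedy_net(points, spacing):
    """Greedy net: keep a point iff it lies > spacing away from every point
    kept so far. Kept points are pairwise > spacing apart (packing), and
    every discarded point is within spacing of a kept one (covering)."""
    net = []
    for p in points:
        if all(np.linalg.norm(p - q) > spacing for q in net):
            net.append(p)
    return np.array(net)

# Step 1 uses resolution eps*r (values illustrative):
# net = greedy_net(X, spacing=eps * r)
```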

A Nonlinear Approach to Dimension Reduction 9

Step 1: Net Extraction and Extension

Recall: ||f||_Lip = min { L ≥ 0 : ||f(x)−f(y)|| ≤ L·||x−y|| for all x,y }.

Lipschitz Extension Theorem [Kirszbraun'34]: For every X ⊆ ℓ₂, every map f: X → ℓ₂^k can be extended to f′: ℓ₂ → ℓ₂^k such that ||f′||_Lip ≤ ||f||_Lip.

Therefore, a good embedding just for the net points suffices; a smaller net resolution gives less distortion for the non-net points.

(Figure: the map f on the net points is extended to f′ on all points.)
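Kirszbraun's theorem concerns maps into ℓ₂^k; for intuition, here is a sketch of its simpler real-valued cousin, the McShane-Whitney extension formula. This illustrates how a Lipschitz extension can look; it is not the construction used in the talk:

```python
import numpy as np

def mcshane_extend(X, fX, L):
    """Given f (values fX) on net points X (rows) with Lipschitz constant L,
    return the extension f'(z) = min_x [ f(x) + L*||z - x|| ], which agrees
    with f on X and is again L-Lipschitz (for real-valued f)."""
    def f_ext(z):
        return np.min(fX + L * np.linalg.norm(X - z, axis=1))
    return f_ext

# Example: extend from three net points on the line, then query off-net.
X = np.array([[0.0], [1.0], [2.0]])
f = mcshane_extend(X, np.array([0.0, 1.0, 0.0]), L=1.0)
print(f(np.array([0.5])))  # 0.5
```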

A Nonlinear Approach to Dimension Reduction 10

Step 2: Padded Decomposition

Partition the space probabilistically into clusters, with the following properties [Gupta-K.-Lee'03, Abraham-Bartal-Neiman'08]:
- Cluster diameter: bounded by dim(X)·r.
- Size: by the doubling property, a cluster contains at most (dim(X)/ε)^{dim(X)} net points.
- Padding: each point is r-padded (its entire r-ball lies inside its own cluster) with probability > 9/10.
- Support: O(dim(X)) partitions.

(Figure: one partition; cluster diameter ≤ dim(X)·r, with some points r-padded and others not. A sketch of one standard partitioning scheme follows.)
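A minimal sketch of one standard random-partition scheme for doubling spaces (random center order plus a random radius, in the spirit of [Gupta-K.-Lee'03]); the talk's exact decomposition and parameters differ:

```python
import numpy as np

def random_partition(points, diameter, rng):
    """Assign each point to the first center (in a random order) within a
    random radius in [diameter/4, diameter/2]. Every point gets assigned
    (it is within the radius of itself), and each cluster has diameter
    <= 2*radius <= diameter. A point is t-padded if its whole t-ball
    ended up in the same cluster."""
    n = len(points)
    order = rng.permutation(n)
    radius = rng.uniform(diameter / 4, diameter / 2)
    cluster = np.full(n, -1)
    for c in order:
        dist = np.linalg.norm(points - points[c], axis=1)
        cluster[(cluster == -1) & (dist <= radius)] = c
    return cluster
```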

A Nonlinear Approach to Dimension Reduction 11

Step 3: JL on Individual Clusters

For each partition, consider each individual cluster and reduce its dimension using the JL lemma:
- Constant distortion.
- Target dimension, logarithmic in the cluster size: O(log((dim(X)/ε)^{dim(X)})) = Õ(dim(X)).

Then translate some point of each cluster to the origin.

A Nonlinear Approach to Dimension Reduction 12

The story so far… To review:
- Step 1: Extract net points.
- Step 2: Build a family of partitions.
- Step 3: For each partition, apply JL in each cluster and translate to the origin.

Embedding guarantees for a single partition:
- Intracluster distance: constant distortion.
- Intercluster distance: min distance 0, max distance dim(X)·r.

Not good enough. Let's backtrack…

A Nonlinear Approach to Dimension Reduction 13

The story so far… To review:
- Step 1: Extract net points.
- Step 2: Build a family of partitions.
- Step 3: For each partition, apply a Gaussian transform in each cluster.
- Step 4: For each partition, apply JL in each cluster and translate to the origin.

Embedding guarantees for a single partition:
- Intracluster distance: constant distortion.
- Intercluster distance: min distance 0, max distance dim(X)·r.

Not good enough when ||x−y|| ≈ r. Let's backtrack…

A Nonlinear Approach to Dimension Reduction 14

Step 3: Gaussian Transform

For each partition, apply within each cluster the Gaussian transform (kernel) to the distances [Schoenberg'38]:

  G(t) = (1 − e^{−t²})^{1/2}

and its adaptation to scale r:

  G_r(t) = r·(1 − e^{−t²/r²})^{1/2}

i.e. a map f: ℓ₂ → ℓ₂ such that ||f(x)−f(y)|| = G_r(||x−y||).
- Threshold: the cluster diameter becomes at most r (instead of dim(X)·r), since G_r(t) < r for all t.
- Distortion: small distortion of distances in the relevant range, since G_r(t) ≈ t for t ≤ r.

The transform can increase the dimension… but JL is the next step.
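Such a map f into ℓ₂ exists by Schoenberg's theorem. One concrete finite-dimensional way to approximate it, an assumption of this sketch rather than the talk's construction, is random Fourier features for the kernel e^{−||x−y||²/r²}, via the identity 2 − 2·e^{−t²/r²} = 2·G_r(t)²/r²:

```python
import numpy as np

def gaussian_transform_features(X, r, D, rng):
    """Random-Fourier-feature map f with ||f(x)-f(y)|| ~ G_r(||x-y||).
    Rows of W are random frequencies for the kernel exp(-||x-y||^2/r^2)."""
    d = X.shape[1]
    W = rng.standard_normal((D, d)) * np.sqrt(2.0) / r
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    Z = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)  # E[z(x).z(y)] = kernel value
    return (r / np.sqrt(2.0)) * Z

# Quick check on a toy pair: ||f(x)-f(y)|| vs G_r(||x-y||) with r = 1.
rng = np.random.default_rng(2)
X = rng.standard_normal((2, 5))
F = gaussian_transform_features(X, r=1.0, D=4000, rng=rng)
t = np.linalg.norm(X[0] - X[1])
print(np.linalg.norm(F[0] - F[1]), np.sqrt(1.0 - np.exp(-t**2)))
```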

A Nonlinear Approach to Dimension Reduction 15

Step 4: JL on Individual Clusters

Steps 3 & 4 together (the Gaussian transform followed by JL: smaller diameter → smaller dimension) give new embedding guarantees:
- Intracluster: constant distortion.
- Intercluster: min distance 0, max distance r (instead of dim(X)·r).

Caveat: still not good enough when ||x−y|| ≈ r. Additionally, the map is smoothed near cluster boundaries.

A Nonlinear Approach to Dimension Reduction 16

Step 5: Glue Partitions

We have an embedding for each partition:
- For padded points, the guarantees are perfect.
- For non-padded points, the guarantees are weak.

"Glue" together the embeddings of the different partitions: concatenate all m = O(dim(X)) of them (and scale down). The non-padded case occurs at most 1/10 of the time, so it gets "averaged away":

  ||F(x)−F(y)||² = ||f₁(x)−f₁(y)||² + … + ||f_m(x)−f_m(y)||² ≈ m·||x−y||²

Final dimension = Õ(dim²(X)): O(dim(X)) partitions, times Õ(dim(X)) dimensions per partition.

Example: f₁(x) = (1,7), f₂(x) = (2,8), f₃(x) = (4,9)
  ⇒ F(x) = f₁(x) ∘ f₂(x) ∘ f₃(x) = (1,7,2,8,4,9)
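A minimal sketch of the gluing step (assuming each f_i is given as an n×k_i array of embedded points); dividing by √m turns the sum of squares above into an average:

```python
import numpy as np

def glue_partitions(embeddings):
    """Concatenate the per-partition maps f_1,...,f_m coordinate-wise and
    rescale, so that ||F(x)-F(y)||^2 = (1/m) * sum_i ||f_i(x)-f_i(y)||^2."""
    m = len(embeddings)
    return np.hstack(embeddings) / np.sqrt(m)

# The slide's toy example (shown without the 1/sqrt(m) rescaling):
f1, f2, f3 = np.array([[1, 7]]), np.array([[2, 8]]), np.array([[4, 9]])
print(np.hstack([f1, f2, f3]))  # [[1 7 2 8 4 9]]
```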

A Nonlinear Approach to Dimension Reduction 17

Step 6: Kirszbraun Extension Theorem

Kirszbraun's theorem extends the embedding from the net points to the non-net points, without increasing the dimension.

(Figure: the embedding on the net points, before and after the Kirszbraun extension.)

A Nonlinear Approach to Dimension Reduction 18

Single Scale Embedding – Review

Steps:
1. Net extraction
2. Padded decomposition
3. Gaussian transform
4. JL
5. Glue partitions
6. Extension theorem

Theorem 1: Every finite X ⊆ ℓ₂ admits an embedding f: X → ℓ₂^k, for k = Õ((dim X)²), such that:
1. Lipschitz: ||f(x)−f(y)|| ≤ ||x−y|| for all x,y ∈ X.
2. Bi-Lipschitz at scale r: ||f(x)−f(y)|| ≥ Ω(||x−y||) whenever ||x−y|| ∈ [εr, r].
3. Boundedness: ||f(x)|| ≤ r for all x ∈ X.

A Nonlinear Approach to Dimension Reduction 19

Single Scale Embedding – Strengthened

Steps, refined to yield 1+ε distortion:
1. Net extraction — finer nets.
2. Padded decomposition — padding probability 1−ε.
3. Gaussian transform
4. JL — already 1+ε distortion.
5. Glue partitions — a higher fraction of padded points.
6. Extension theorem

Theorem 1 (strengthened): Every finite X ⊆ ℓ₂ admits an embedding f: X → ℓ₂^k, for k = Õ((dim X)²), such that:
1. Lipschitz: ||f(x)−f(y)|| ≤ ||x−y|| for all x,y ∈ X.
2. Gaussian at scale r: ||f(x)−f(y)|| = (1±ε)·G_r(||x−y||) whenever ||x−y|| ∈ [εr, r].
3. Boundedness: ||f(x)|| ≤ r for all x ∈ X.

A Nonlinear Approach to Dimension Reduction 20

II. Snowflake Embedding

Theorem 2: For all 0 < ε < 1, every finite subset X ⊆ ℓ₂ admits an embedding F: X → ℓ₂^k, for k = Õ(ε⁻⁴·(dim X)²), with distortion 1+ε for the snowflake metric ||x−y||^{1/2}.

We'll illustrate the construction for constant distortion. The snowflake technique is due to [Assouad'83]; we give the first implementation achieving distortion 1+ε.

A Nonlinear Approach to Dimension Reduction 21

II. Snowflake Embedding

Theorem 2: For every 0 < ε < 1 and every finite subset X ⊆ ℓ₂, there is an embedding F: X → ℓ₂^k of the snowflake metric ||x−y||^{1/2}, achieving dimension k = Õ(ε⁻⁴·(dim X)²) and distortion 1+ε, i.e.

  ||x−y||^{1/2} ≤ ||F(x)−F(y)|| ≤ (1+ε)·||x−y||^{1/2}  for all x,y ∈ X.

Compared to the open question: we embed only the snowflake metric (weaker), but achieve distortion 1+ε (stronger).

We generalize [Kahane'81, Talagrand'92], who study the Euclidean embedding of Wilson's helix (the real line with distances |x−y|^{1/2}).

We'll illustrate the construction for constant distortion. The snowflake technique is due to [Assouad'83]; we give the first implementation that achieves distortion 1+ε.

A Nonlinear Approach to Dimension Reduction 22

Assouad's Technique

Basic idea: consider single-scale embeddings for all scales r = 2^i. Fix points x,y ∈ X, and suppose ||x−y|| ≈ s.

Scales: r = 16s, 8s, 4s, 2s, s, s/2, s/4, s/8, s/16.

Recall the single-scale guarantees:
- Lipschitz: ||f(x)−f(y)|| ≤ ||x−y|| = s
- Gaussian: ||f(x)−f(y)|| = (1±ε)·G_r(||x−y||)
- Boundedness: ||f(x)|| ≤ r

A Nonlinear Approach to Dimension Reduction 23

Assouad's Technique

Now scale down each embedding by r^{1/2} (the snowflake):

  scale        ||f(x)−f(y)||     ||f(x)−f(y)||/r^{1/2}
  r = 16s      ≤ s               ≤ s^{1/2}/4
  r = 8s       ≤ s               ≤ s^{1/2}/8^{1/2}
  r = 4s       ≤ s               ≤ s^{1/2}/2
  r = 2s       ≤ s               ≤ s^{1/2}/2^{1/2}
  r = s        ≈ s               ≈ s^{1/2}
  r = s/2      ≤ s/2             ≤ s^{1/2}/2^{1/2}
  r = s/4      ≤ s/4             ≤ s^{1/2}/2
  r = s/8      ≤ s/8             ≤ s^{1/2}/8^{1/2}
  r = s/16     ≤ s/16            ≤ s^{1/2}/4

Combine these embeddings by addition (and staggering)
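A small numeric sanity check of the table above (a sketch that uses G_r(s) as the per-scale contribution): after rescaling by r^{1/2}, the contribution peaks at r ≈ s and decays geometrically in both directions, which is why the combined sum behaves like s^{1/2}.

```python
import numpy as np

def G(t, r):
    """Scale-r Gaussian transform: G_r(t) = r*sqrt(1 - exp(-t^2/r^2))."""
    return r * np.sqrt(1.0 - np.exp(-(t / r) ** 2))

s = 1.0
for i in range(4, -5, -1):              # scales r = 16s down to s/16
    r = s * 2.0**i
    contrib = G(s, r) / np.sqrt(r)      # rescaled contribution of scale r
    print(f"r = 2^{i:+d} * s: {contrib:.3f} * s^(1/2)")
```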

A Nonlinear Approach to Dimension Reduction 24

Snowflake Embedding – Review

Steps:
1. Compute single-scale embeddings for all scales r = 2^i.
2. Scale down the embedding for scale r by r^{1/2}.
3. Combine the embeddings by addition (with some staggering).

By taking more refined scales (powers of 1+ε instead of 2) and further staggering, one can achieve distortion 1+ε for the snowflake.

Theorem 2: For all 0 < ε < 1, every finite subset X ⊆ ℓ₂ embeds into ℓ₂^k, for k = Õ(ε⁻⁴·(dim X)²), with distortion 1+ε to the snowflake.

A Nonlinear Approach to Dimension Reduction 25

Conclusion

We gave two (1+ε)-distortion low-dimensional embeddings for doubling subsets of ℓ₂:
- Single scale
- Snowflake

This framework can be extended to ℓ₁ and ℓ∞, with some obstacles:
- Dimension reduction: can't use JL.
- Lipschitz extension: can't use Kirszbraun.
- Threshold: can't use the Gaussian transform.

Many of the steps in the single-scale embedding are nonlinear, although most "localities" are mapped (nearly) linearly. Could this explain the empirical success of methods like Locally Linear Embedding?

Applications? Clustering is one potential area…