Distance Learning for high dimensional data


Page 1: Distance Learning for high dimensional data


Distance Learning for high dimensional data

Pablo Groisman, with M. Jonckheere and F. Sapienza

University of Buenos Aires and IMAS-CONICET

Page 2: Distance Learning for high dimensional data

Without Science, Technology and Productive Innovation... there is no future.

Page 3: Distance Learning for high dimensional data

The magnitude of the problems...

...compels us to strengthen and promote the production and transmission of knowledge, recognizing it as the principal social and strategic asset of nations for guaranteeing a sustainable improvement in the quality of life of their inhabitants.

CONICET Board of Directors

09-2018

Page 4: Distance Learning for high dimensional data

Motivation: Hands Images

[Figure: images of a hand, arranged along two axes: wrist rotation and fingers extension.]

© J. B. Tenenbaum, V. de Silva, J. C. Langford, Science (2000)

Page 5: Distance Learning for high dimensional data

Motivation: MNIST Dataset

© MNIST Dataset

Page 6: Distance Learning for high dimensional data

Motivation: Speaker Identification

Page 7: Distance Learning for high dimensional data

Motivation: A problem at Aristas SRL

Problem

Clustering of high-dimensional chemical formulas, with a distance between them defined in terms of, e.g., olfactory properties.

Data size

$10^6$ formulas; dimension $D \sim 4000$.

Page 8: Distance Learning for high dimensional data

A curse of dimensionality

Let $\omega_D(r) = \omega_D(1) r^D$ be the volume of the ball of radius $r$ in $\mathbb{R}^D$. The fraction of the unit ball's volume lying in the outer shell of width $\varepsilon$ is

$$\frac{\omega_D(1) - \omega_D(1-\varepsilon)}{\omega_D(1)} = 1 - (1-\varepsilon)^D \xrightarrow{D \to \infty} 1.$$

In high-dimensional Euclidean spaces, every two points of a typical large set are at a similar distance.
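This concentration is easy to observe numerically. Below is a minimal sketch (assuming NumPy and SciPy are available) that samples uniform points in $[0,1]^D$ and shows the relative spread of pairwise distances shrinking as $D$ grows:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for D in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, D))  # 500 points uniform in [0, 1]^D
    d = pdist(X)                    # all pairwise Euclidean distances
    # The relative spread std/mean tends to 0 as D grows:
    # distances concentrate around their mean.
    print(f"D = {D:4d}   std/mean = {d.std() / d.mean():.3f}")
```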


Page 11: Distance Learning for high dimensional data

Clustering: K-means, DBSCAN, etc.

© scikit-learn developers

Page 12: Distance Learning for high dimensional data

Dimensionality Reduction: Isomap

Isomap constructs the k-NN graph and finds shortest paths in it. The weight of the edge between $q_i$ and $q_j$ is $|q_i - q_j|$.

© J. B. Tenenbaum, V. de Silva, J. C. Langford, Science (2000).
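A minimal sketch of the graph-distance computation behind Isomap (assuming scikit-learn and SciPy; `X` below is placeholder data, any point cloud works):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))  # placeholder point cloud

# Sparse k-NN graph with Euclidean edge weights |q_i - q_j|, symmetrized.
G = kneighbors_graph(X, n_neighbors=10, mode="distance")
G = G.maximum(G.T)

# All-pairs shortest-path distances in the graph approximate
# geodesic distances on the underlying surface.
D_graph = shortest_path(G, method="D", directed=False)
```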

Page 13: Distance Learning for high dimensional data

Dimensionality Reduction: Isomap

Theorem. Given $\varepsilon > 0$ and $\delta > 0$, for $n$ large enough,

$$\mathbb{P}\left(1 - \varepsilon \le \frac{d_{\mathrm{geodesic}}(x, y)}{d_{\mathrm{graph}}(x, y)} \le 1 + \varepsilon\right) > 1 - \delta.$$

[Bernstein, de Silva, Langford, Tenenbaum (2000)].

© J. B. Tenenbaum, V. de Silva, J. C. Langford, Science (2000).

Page 14: Distance Learning for high dimensional data

Motivation

In most unsupervised learning tasks, a notion of similarity between data points is both crucial and usually not directly available as an input.

The efficiency of tasks like dimensionality reduction and clustering may depend crucially on the distance chosen.

Since the data lies on an (unknown) lower-dimensional surface, this distance has to be inferred from the data itself.


Page 18: Distance Learning for high dimensional data

Motivation

We look for a distance that takes into account the underlying structure (surface) of the data and the underlying density from which the points are sampled.

Page 19: Distance Learning for high dimensional data

The Problem

Let $\mathcal{M} \subseteq \mathbb{R}^D$ be a $d$-dimensional surface (we expect $d \ll D$).

Consider $n$ independent points on $\mathcal{M}$ with common density $f: \mathcal{M} \to \mathbb{R}_{\ge 0}$.

Can we learn the structure of $\mathcal{M}$? Dimension, distances between points, etc.


Page 23: Distance Learning for high dimensional data

Sample Fermat’s distance

$\alpha \ge 1$ a parameter, $X$ a discrete set of points, $q, x, y \in X$.

$r_{xy} = (q_1, \dots, q_K)$ an $X$-path from $x$ to $y$.

$$T(r_{xy}) = \sum_{i=1}^{K-1} |q_{i+1} - q_i|^\alpha, \qquad D_X(x, y) = \inf_{r_{xy}} T(r_{xy}).$$

[Figure: the optimal path for $\alpha = 1.5$, $\alpha = 3$ and $\alpha = 6$; the density of points $X$ in $\Omega_1$ is higher than in $\Omega_2$.]
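A minimal sketch of the sample Fermat distance (assuming NumPy and SciPy; the helper name `fermat_distances` is illustrative, not the repository's API, and the graph here is complete, so this brute-force version is only practical for small $n$):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import shortest_path

def fermat_distances(X, alpha=3.0):
    """All-pairs sample Fermat distances D_X for a point cloud X of shape (n, D)."""
    # Weight of the edge between q_i and q_j: |q_i - q_j|^alpha.
    W = squareform(pdist(X)) ** alpha
    # With alpha > 1, long edges are penalized, so optimal paths hop
    # through dense regions; Dijkstra computes the infimum over X-paths.
    return shortest_path(W, method="D", directed=False)
```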


Page 27: Distance Learning for high dimensional data

Fermat’s Distance

$f: \mathcal{M} \to \mathbb{R}$ a density.

For $x, y \in \mathcal{M}$ and $\beta \ge 0$ we define Fermat's distance by

$$D(x, y) = \inf_\Gamma \int_\Gamma \frac{1}{f^\beta}\, d\ell,$$

where the minimization is over all curves $\Gamma$ from $x$ to $y$.
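As a sanity check on the definition: if $f$ is constant on $\mathcal{M}$, the integrand is the constant $f^{-\beta}$ and

$$D(x, y) = f^{-\beta}\, d_{\mathrm{geodesic}}(x, y),$$

so Fermat's distance reduces to a rescaled geodesic distance. For non-constant $f$ and $\beta > 0$, paths through high-density regions are cheaper.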


Page 30: Distance Learning for high dimensional data

Fermat’s principle

In optics, the path taken between two points by a ray of light is an extremum of the functional

$$\Gamma \mapsto \int_\Gamma n(x)\, d\ell, \qquad n = \text{refractive index}.$$

$$D(x, y) = \inf_\Gamma \int_\Gamma \frac{1}{f^\beta}\, d\ell, \qquad f^{-\beta} \sim n.$$

© S. Thorgerson - Pink Floyd, The Dark Side of the Moon (1973), Harvest, Capitol.


Page 33: Distance Learning for high dimensional data

Fermat’s distance

Theorem. For $x, y \in \mathcal{M}$ and $X_n$ i.i.d. $\sim f$ we have

$$\lim_{n \to \infty} n^\beta D_{X_n}(x, y) = D(x, y),$$

with $\beta = (\alpha - 1)/d$.

Heuristics: for a path $r = (q_1, \dots, q_k)$,

$$\sum |q_{i+1} - q_i|^\alpha = \sum |q_{i+1} - q_i|^{\alpha - 1}\, |q_{i+1} - q_i|.$$


Page 35: Distance Learning for high dimensional data

Heuristics:

For a path $r = (q_1, \dots, q_k)$,

$$\sum |q_{i+1} - q_i|^\alpha = \sum |q_{i+1} - q_i|^{\alpha - 1}\, |q_{i+1} - q_i|.$$

The typical spacing between consecutive points on an optimal path is governed by

$$n\, c_d\, |q_{i+1} - q_i|^d\, f(q_i) \asymp 1 \iff n^{1/d} |q_{i+1} - q_i| \asymp c\, \frac{1}{f(q_i)^{1/d}},$$

so that

$$n^{(\alpha - 1)/d} |q_{i+1} - q_i|^{\alpha - 1} \asymp c\, \frac{1}{f(q_i)^{(\alpha - 1)/d}},$$

and, summing along the path,

$$\inf_r\, n^{(\alpha - 1)/d} \sum |q_{i+1} - q_i|^\alpha \approx \inf_\Gamma \int_\Gamma \frac{1}{f^\beta}\, d\ell.$$


Page 42: Distance Learning for high dimensional data

Clustering with Fermat K-medoids in the Swiss roll

[Figure: clustering results on the Swiss roll. Panels: (a) 2D data, (b) 3D data, (c) adjusted mutual information, (d) accuracy, (e) adjusted Rand index, (f) F1 score.]

Page 43: Distance Learning for high dimensional data

Clustering with Fermat t-SNE

Page 44: Distance Learning for high dimensional data

Algorithmic considerations

Restricted Fermat's distance:

$$D_X^{(k)}(x, y) = \inf_{\substack{r = (q_1, \dots, q_K) \\ q_{i+1} \in N_k(q_i)}} \sum_{i=1}^{K-1} |q_{i+1} - q_i|^\alpha.$$

Proposition: Given $\varepsilon > 0$, we can choose $k = O(\log(n/\varepsilon))$ such that

$$\mathbb{P}\left(D_{X_n}^{(k)}(x, y) = D_{X_n}(x, y)\right) > 1 - \varepsilon.$$

→ We can reduce the running time from $O(n^3)$ to $O(n^2 (\log n)^2)$.
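A sketch of the restricted variant (assuming scikit-learn and SciPy; the name `restricted_fermat` is illustrative, not the repository's API). Restricting moves to the $k$ nearest neighbors replaces the complete graph with a sparse one, which is where the speed-up comes from:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def restricted_fermat(X, alpha=3.0, k=20):
    """All-pairs restricted Fermat distances D_X^(k) for a point cloud X."""
    # Sparse k-NN graph with Euclidean edge weights |q_i - q_j| ...
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")
    G.data **= alpha      # ... raised to the power alpha
    G = G.maximum(G.T)    # symmetrize: allow steps in both directions
    return shortest_path(G, method="D", directed=False)
```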


Page 47: Distance Learning for high dimensional data

Conclusions

• We have introduced Fermat's distance and a way to estimate it from a sample.
• It defines a notion of distance between sample points that takes into account the geometry of the point cloud, including possible non-homogeneities.
• We have proved that this estimator in fact approximates Fermat's distance, which is a good way to measure distance in this (general) setting.


Page 50: Distance Learning for high dimensional data

Applications

• Clustering
• Dimensionality reduction
• Density estimation
• Regression
• Any learning task that requires a notion of distance as an input
• Any other task that requires a notion of distance as an input
• etc.


Page 57: Distance Learning for high dimensional data

Download

A prototype implementation is available at

https://github.com/facusapienza21/Fermat_distance

Page 58: Distance Learning for high dimensional data

Thanks!