Transcript of "Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA"

Page 1

Learning Eigenfunctions: Links with

Spectral Clustering and Kernel PCA

Yoshua Bengio

Pascal Vincent

Jean-François Paiement

University of Montreal

April 2, Snowbird Learning’2003

Page 2

Learning Modal Structures of the Distribution

Manifold learning and clustering = learning where the main high-density zones are.

Learning a transformation that reveals "clusters" and manifolds:

Cluster = zone of high density separated from other clusters by regions of low density.

Page 3

Spectral Embedding Algorithms

Many learning algorithms, e.g.

• spectral clustering,
• kernel PCA,
• Local Linear Embedding (LLE),
• Isomap,
• Multi-Dimensional Scaling (MDS),
• Laplacian eigenmaps

have at their core the following (or its equivalent):

1. Start from $n$ data points $x_1, \dots, x_n$.
2. Construct an $n \times n$ "neighborhood" or similarity matrix $M$ (with corresponding [possibly data-dependent] kernel $K(x, y)$).
3. Normalize it (and make it symmetric), yielding $\tilde{M}$ (with corresponding kernel $\tilde{K}(x, y)$).
4. Compute the $m$ largest (equivalently, smallest) eigenvalues/eigenvectors.
5. Embedding of $x_i$ = $i$-th element of each of the $m$ eigenvectors (possibly scaled using the eigenvalues).
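This recipe fits in a few lines of NumPy. The sketch below is hypothetical (the function name, the Gaussian similarity, and the divisive normalization are my choices, matching the spectral clustering variant on a later slide), not code from the talk:

```python
import numpy as np

def spectral_embedding(X, m=2, sigma=1.0):
    # Step 2: n x n similarity matrix from a Gaussian kernel.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    M = np.exp(-sq / (2 * sigma ** 2))
    # Step 3: divisive (symmetric) normalization.
    d = M.sum(axis=1)
    M_tilde = M / np.sqrt(np.outer(d, d))
    # Step 4: m largest eigenvalues/eigenvectors (eigh returns ascending order).
    w, V = np.linalg.eigh(M_tilde)
    w, V = w[::-1], V[:, ::-1]
    # Step 5: embedding of x_i = i-th entry of each of the m top eigenvectors,
    # optionally scaled by the eigenvalues.
    return V[:, :m] * np.sqrt(np.maximum(w[:m], 0.0))

X = np.random.RandomState(0).randn(100, 3)
Y = spectral_embedding(X, m=2)   # one 2-D point per input point
```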

Page 4

Kernel PCA

Data $x_i \in \mathbb{R}^d$ are implicitly mapped to the "feature space" points $\phi(x_i)$ of a kernel $K$ s.t.

$K(x, y) = \phi(x) \cdot \phi(y)$

PCA is performed in feature space: projecting points into high dimension may reveal a straight line along which they are almost aligned (if the basis, i.e. the kernel, is "right").

Page 5

Kernel PCA

Eigenvectors $v_k$ of the (generally infinite-dimensional) feature-space covariance matrix $C = \frac{1}{n} \sum_i \phi(x_i)\, \phi(x_i)^T$ are

$v_k = \sum_i \alpha_{ki}\, \phi(x_i)$

where $\alpha_k$ is an eigenvector of the Gram matrix $M_{ij} = K(x_i, x_j)$.

Projection on the $k$-th p.c. = $v_k \cdot \phi(x) = \sum_i \alpha_{ki}\, K(x_i, x)$

N.B. need $\phi$ centered: $\sum_i \phi(x_i) = 0$ → subtractive normalization

$\tilde{K}(x, y) = K(x, y) - \frac{1}{n} \sum_i K(x_i, y) - \frac{1}{n} \sum_j K(x, x_j) + \frac{1}{n^2} \sum_{i,j} K(x_i, x_j)$

(Schölkopf et al., 1996)
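A minimal NumPy sketch of these formulas, assuming a Gaussian kernel (the kernel choice and function name are mine):

```python
import numpy as np

def kernel_pca(X, m=2, sigma=1.0):
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2))
    # Subtractive normalization: center phi(x_i) in feature space,
    # K~ = K - row means - column means + grand mean.
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one
    ell, A = np.linalg.eigh(Kc)              # Gram eigenvalues/eigenvectors
    ell, A = ell[::-1][:m], A[:, ::-1][:, :m]
    # alpha_k = a_k / sqrt(ell_k) so that v_k = sum_i alpha_ki phi(x_i) has
    # unit norm; the projection of x_i on the k-th p.c. is (Kc @ alpha)_ik.
    alpha = A / np.sqrt(ell)
    return Kc @ alpha                        # n x m matrix of projections
```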

Page 6

Laplacian Eigenmaps

• Gram matrix from the Laplace-Beltrami operator, which on finite data (neighborhood graph) gives the graph Laplacian.

• Gaussian kernel, approximated by a k-NN adjacency matrix.

• Normalization: (diagonal of) row averages minus the Gram matrix.

• Laplace-Beltrami operator $\mathcal{L}$: justified as a smoothness regularizer on the manifold $\mathcal{M}$: $\int_{\mathcal{M}} \|\nabla f\|^2$, which equals the eigenvalue of $\mathcal{L}$ for eigenfunctions $f$.

• Successfully used for semi-supervised learning.

(Belkin & Niyogi, 2002)
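A small sketch of the construction summarized above, assuming a symmetric 0/1 k-NN graph and the unnormalized Laplacian $L = D - W$ (names and parameter choices are mine, following the slide's summary of Belkin & Niyogi):

```python
import numpy as np

def laplacian_eigenmaps(X, m=2, k=5):
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # k-NN adjacency matrix approximating the Gaussian kernel's support.
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(sq[i])[1:k + 1]:   # skip self at index 0
            W[i, j] = W[j, i] = 1.0
    L = np.diag(W.sum(1)) - W   # graph Laplacian: degrees minus adjacency
    w, V = np.linalg.eigh(L)
    # Smallest non-trivial eigenvectors give the embedding (index 0 is the
    # constant eigenvector with eigenvalue 0 for a connected graph).
    return V[:, 1:m + 1]
```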

Page 7

Spectral Clustering

• Normalize the kernel or Gram matrix divisively:

$\tilde{M}_{ij} = \frac{M_{ij}}{\sqrt{\left( \sum_k M_{ik} \right) \left( \sum_k M_{jk} \right)}}$

• Embedding of $x_i$ = $(v_{1i}, v_{2i}, \dots, v_{mi})$, where $v_k$ is the $k$-th eigenvector of the Gram matrix.

• Perform clustering on the embedded points (e.g. after normalizing them by their norm).

(Weiss; Ng, Jordan & Weiss; ...)
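A possible end-to-end sketch in the style of Ng, Jordan & Weiss, with a tiny k-means loop included to stay self-contained (all names, and using as many clusters as eigenvectors, are my assumptions):

```python
import numpy as np

def spectral_clustering(M, m=2, iters=20, seed=0):
    d = M.sum(axis=1)
    M_tilde = M / np.sqrt(np.outer(d, d))          # divisive normalization
    _, V = np.linalg.eigh(M_tilde)
    E = V[:, -m:]                                  # top-m eigenvectors
    E /= np.linalg.norm(E, axis=1, keepdims=True)  # project rows on unit sphere
    rng = np.random.default_rng(seed)
    centers = E[rng.choice(len(E), m, replace=False)]
    for _ in range(iters):                         # plain k-means on embedding
        labels = ((E[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(m):
            if (labels == c).any():
                centers[c] = E[labels == c].mean(0)
    return labels
```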

Page 8

Spectral Clustering

[Figure: embedded points on the unit sphere]

Principal eigenfunctions approximate the kernel (= dot product) in the MSE sense:

• $\tilde{K}(x, y)$ large ⟹ embeddings of $x$ and $y$ almost colinear
• $\tilde{K}(x, y) \approx 0$ ⟹ embeddings of $x$ and $y$ almost orthogonal

⟹ points in the same cluster are mapped to points at a small angle, even for a non-blob cluster (global constraint = transitivity of "nearness").

Page 9

Density-Dependent Hilbert Space

Define a Hilbert space with a density-dependent inner product

$\langle f, g \rangle = \int f(x)\, g(x)\, p(x)\, dx$

with density $p(x)$.

A kernel function $K(x, y)$ defines a linear operator in that space:

$(Kf)(x) = \int K(x, y)\, f(y)\, p(y)\, dy$
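For concreteness (a step left implicit here but relied on in the proofs below), when $p$ is the empirical distribution $\frac{1}{n} \sum_{i=1}^n \delta_{x_i}$, the operator reduces to a finite sum:

$(Kf)(x) = \frac{1}{n} \sum_{i=1}^{n} K(x, x_i)\, f(x_i)$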

Page 10

Eigenfunctions of a Kernel

Infinite-dimensional version of the eigenvectors of the Gram matrix:

$(K f_k)(x) = \int K(x, y)\, f_k(y)\, p(y)\, dy = \lambda_k f_k(x)$

(some conditions are needed to obtain a discrete spectrum)

Convergence of the eigenvectors/eigenvalues of the Gram matrix built from $n$ data points sampled from $p(x)$ to the eigenfunctions/eigenvalues of the linear operator with underlying $p(x)$ was proven as $n \to \infty$ (Williams & Seeger, 2000).

Page 11

Link between Spectral Clustering and Eigenfunctions

Equivalence between eigenvectors and eigenfunctions (and corresponding eigenvalues) when $p(x)$ is the empirical distribution:

Proposition 1: If we choose for $p(x)$ the empirical distribution of the data, then the spectral embedding from $\tilde{M}$ is equivalent to the values of the eigenfunctions of the normalized kernel $\tilde{K}$: $f_k(x_i) = \sqrt{n}\, v_{ki}$.

Proof: come and see our poster!

Page 12

Link between Kernel PCA and Eigenfunctions

Proposition 2: If we choose for $p(x)$ the empirical distribution of the data, then the kernel PCA projection is equivalent to scaled values of the eigenfunctions of $\tilde{K}$: $\pi_k(x_i) = \sqrt{\lambda_k}\, f_k(x_i)$.

Proof: come and see our poster!

Consequence: up to the choice of kernel, kernel normalization, and up to scaling by $\sqrt{\lambda_k}$, spectral clustering, Laplacian eigenmaps and kernel PCA give the same embedding. Isomap, MDS and LLE also give eigenfunctions, but from a different type of kernel.
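A quick numerical check of the claimed equivalence is easy to write down (a sanity sketch under my assumptions, not material from the talk): with the same subtractively normalized Gram matrix, the kernel PCA projections coincide with the eigenvector embedding scaled by the square roots of the Gram eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
n = len(X)
one = np.ones((n, n)) / n
Kc = K - one @ K - K @ one + one @ K @ one   # subtractive normalization
ell, A = np.linalg.eigh(Kc)
ell, A = ell[::-1][:2], A[:, ::-1][:, :2]
proj = Kc @ (A / np.sqrt(ell))               # kernel PCA projections
embed = A * np.sqrt(ell)                     # scaled eigenvector embedding
assert np.allclose(proj, embed)              # identical up to round-off
```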

Page 13

From Embedding to General Mapping

• Laplacian eigenmaps, spectral clustering, Isomap, LLE, and MDS only provide an embedding for the given data points.

• Natural generalization to new points: consider these algorithms as learning eigenfunctions of $\tilde{K}$.

• The eigenfunctions $f_k$ provide a mapping for new points, e.g. for empirical $p(x)$ (see the sketch after this list):

$f_k(x) = \frac{\sqrt{n}}{\ell_k} \sum_i v_{ki}\, \tilde{K}(x, x_i)$

where $(\ell_k, v_k)$ is the $k$-th eigenvalue/eigenvector pair of the Gram matrix $\tilde{M}$.

• Data-dependent "kernels" (Isomap, LLE): need to compute $\tilde{K}(x, x_i)$ without changing $\tilde{K}(x_i, x_j)$. Reasonable for Isomap, less clear it makes sense for LLE.
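In NumPy, the out-of-sample formula above is essentially one line. This sketch assumes you already have the eigendecomposition of the $n \times n$ normalized Gram matrix, and `kernel` must evaluate the same normalized kernel $\tilde{K}$ that produced it (names are mine):

```python
import numpy as np

def eigenfunction(x_new, X, kernel, V, ell, k):
    # f_k(x) = (sqrt(n) / ell_k) * sum_i v_ki * K~(x, x_i), where
    # (V[:, k], ell[k]) is the k-th eigenvector/eigenvalue of the Gram matrix.
    n = len(X)
    kx = np.array([kernel(x_new, xi) for xi in X])
    return np.sqrt(n) / ell[k] * V[:, k] @ kx

# At a training point this recovers f_k(x_i) = sqrt(n) * v_ki.
```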

Page 14

Criterion to Learn Eigenfunctions

Proposition 3: Given the first $k - 1$ eigenfunctions $f_1, \dots, f_{k-1}$ of a symmetric function $K(x, y)$, the $k$-th one can be obtained by minimizing w.r.t. $f$ and $\lambda$ the expected value of

$\left( K(x, y) - \lambda f(x) f(y) - \sum_{j=1}^{k-1} \lambda_j f_j(x) f_j(y) \right)^2$

over $x, y \sim p(x)\, p(y)$. Then we get $f = f_k$ (up to sign) and $\lambda = \lambda_k$.

This helps understand what the eigenfunctions are doing (approximating the "dot product" $K(x, y)$) and provides a possible criterion for estimating the eigenfunctions when $p(x)$ is not an empirical distribution; a numerical sanity check follows below.

Kernels such as the Gaussian kernel and nearest-neighbor related kernels force the eigenfunctions to correctly reconstruct $K(x, y)$ only for nearby objects: in high dimension, don't trust the Euclidean distance between far objects.
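A numerical sanity check of the $k = 1$ case (my construction, not the talk's): among norm-1 candidate functions, the principal eigenpair of the Gram matrix attains the smallest mean squared reconstruction error of $K(x, y)$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 2))
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2.0)
n = len(X)

def mse(f):
    # Optimal lambda for a fixed norm-1 f, then the reconstruction error J.
    lam_f = (K * np.outer(f, f)).mean()       # E[K(x,y) f(x) f(y)], empirical
    return ((K - lam_f * np.outer(f, f)) ** 2).mean()

w, V = np.linalg.eigh(K)
f_star = np.sqrt(n) * V[:, -1]   # principal eigenfunction values, (1/n)sum f^2 = 1
for _ in range(100):             # random norm-1 competitors never do better
    g = rng.standard_normal(n)
    g *= np.sqrt(n) / np.linalg.norm(g)
    assert mse(f_star) <= mse(g) + 1e-12
```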

Page 15

Using a Smooth Density to Define Eigenfunctions?

• Use your best estimator $\hat{p}(x)$ of the density of the data, instead of the data, for defining the eigenfunctions.

• A constrained class of e-fns, e.g. neural networks, can force the e-fns to be smooth and not necessarily local.

• Advantage? Better generalization away from training points?

• Advantage? Better scaling with $n$? (no Gram matrix, no e-vectors)

• Disadvantage? Optimization of e-fns may be more difficult?

Page 16

Recovering the Density from the Eigenfunctions?

Visually, the eigenfunctions appear to capture the main characteristics of the density.

Can we obtain a better estimate of the density using the principal eigenfunctions?

• (Girolami, 2001): truncating the orthogonal-series expansion of the density in terms of the eigenfunctions.

• Use ideas similar to (Teh & Roweis, 2003) and other mixtures of factor analyzers: project back into input space, convolving with a model of the reconstruction error as noise.

Page 17

Role of Kernel Normalization?

Subtractive normalization leads to kernel PCA:

$\tilde{M} = \left( I - \tfrac{1}{n} \mathbf{1} \mathbf{1}^T \right) M \left( I - \tfrac{1}{n} \mathbf{1} \mathbf{1}^T \right)$

Thus the corresponding kernel $\tilde{K}$ is expanded:

$\tilde{K}(x, y) = K(x, y) - \frac{1}{n} \sum_i K(x_i, y) - \frac{1}{n} \sum_j K(x, x_j) + \frac{1}{n^2} \sum_{i,j} K(x_i, x_j)$

⟹ the constant function is an eigenfunction (with eigenvalue 0)
⟹ eigenfunctions have zero mean and unit variance

• Double-centering normalization (MDS, Isomap): like the above, applied to squared distances (based on the relation between dot product and distance).

• What can be said about the divisive normalization? It seems better at clustering.

$\tilde{K}(x, y) = \frac{K(x, y)}{\sqrt{\left( \frac{1}{n} \sum_i K(x_i, y) \right) \left( \frac{1}{n} \sum_j K(x, x_j) \right)}}$
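The two normalizations side by side, as a small sketch (function and variable names are mine):

```python
import numpy as np

def subtractive(M):
    # Double centering, as in kernel PCA (and MDS/Isomap on squared distances).
    n = len(M)
    one = np.ones((n, n)) / n
    return M - one @ M - M @ one + one @ M @ one

def divisive(M):
    # Row/column scaling, as in spectral clustering.
    d = M.sum(axis=1)
    return M / np.sqrt(np.outer(d, d))

# With subtractive normalization the constant vector is an eigenvector with
# eigenvalue 0: subtractive(M) @ np.ones(len(M)) is the zero vector.
```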

Page 18

Multi-layer Learning of Similarity and Density?

The learned eigenfunctions capture salient features of the distribution:

abstractions such as clusters and manifolds.

Old AI (and connectionist) idea: build high-level abstractions on top of

lower-level abstractions.

[Diagram: local Euclidean similarity → farther-reaching notion of similarity; empirical density model → improved density model]

Page 19

Density-Adjusted Similarity and Kernel

[Figure: three points A, B, C]

Want A and B "closer" than B and C.

Define a density-adjusted distance as a geodesic w.r.t. a Riemannian metric, with a metric tensor that penalizes low density.

SEE OTHER POSTER (Vincent & Bengio)

Page 20

Density-Adjusted Similarity and Kernel

[Figure, four panels: original spirals; Gaussian kernel spectral embedding; density-adjusted embedding; density-adjusted embedding]

Page 21

Conclusions

• Many unsupervised learning algorithms (kernel PCA, spectral clustering, Laplacian eigenmaps, MDS, LLE, Isomap) are linked: they compute eigenfunctions of a normalized kernel.

• The embedding can be generalized to a mapping applicable to new points.

• The eigenfunctions seem to capture salient features of the distribution by minimizing the kernel reconstruction error.

• Many questions remain open:
  - eigenfunctions → recover an explicit density function?
  - finding e-fns with a smooth $p(x)$?
  - meaning of the various kernel normalizations?
  - multi-layer learning?
  - density-adjusted similarity (see the Vincent & Bengio poster).

Page 22

Proposition 3

The principal eigenfunction of the linear operator

$(Kf)(x) = \int K(x, y)\, f(y)\, p(y)\, dy$

corresponding to kernel $K$ is the (or a, if there are repeated eigenvalues) norm-1 function $f$ that minimizes the reconstruction error

$J = E_{x,y}\!\left[ \left( K(x, y) - \lambda f(x) f(y) \right)^2 \right]$

where $x$ and $y$ are sampled independently from $p$.

Page 23

Proof of Proposition 1

Proposition 1: If we choose for $p(x)$ the empirical distribution of the data, then the spectral embedding from $\tilde{M}$ is equivalent to the values of the eigenfunctions of the normalized kernel $\tilde{K}$: $f_k(x_i) = \sqrt{n}\, v_{ki}$.

(Simplified) proof:

As shown in Proposition 3, finding the function $f$ and scalar $\lambda$ minimizing

$E_{x,y}\!\left[ \left( \tilde{K}(x, y) - \lambda f(x) f(y) \right)^2 \right]$

s.t. $\|f\| = 1$ yields a solution that satisfies

$(\tilde{K} f)(x) = \lambda f(x)$

with $\lambda$ the (possibly repeated) maximum-norm eigenvalue.

Page 24

Proof of Proposition 1

With empirical $p(x)$, the above becomes ($\|f\| = 1$):

$\frac{1}{n} \sum_j \tilde{K}(x_i, x_j)\, f(x_j) = \lambda f(x_i)$

Write $y_i = f(x_i)$ and $\tilde{M}_{ij} = \tilde{K}(x_i, x_j)$; then

$\tilde{M} y = n \lambda\, y$

and, since $\|f\| = 1$ means $\frac{1}{n} \sum_i f(x_i)^2 = 1$ while the eigenvector $v_1$ has unit norm, we obtain for the principal eigenvector:

$f_1(x_i) = \sqrt{n}\, v_{1i}$

For the other eigenvalues, consider the "residual kernel"

$K'(x, y) = \tilde{K}(x, y) - \lambda_1 f_1(x) f_1(y)$

and recursively apply the same reasoning to obtain $f_2(x_i) = \sqrt{n}\, v_{2i}$, $f_3(x_i) = \sqrt{n}\, v_{3i}$, etc.

Q.E.D.

Page 25

Proof of Proposition 2

Proposition 2: If we choose for $p(x)$ the empirical distribution of the data, then the kernel PCA projection is equivalent to scaled values of the eigenfunctions of $\tilde{K}$: $\pi_k(x_i) = \sqrt{\lambda_k}\, f_k(x_i)$.

(Simplified) proof:

Apply the linear operator $\tilde{K}$ on both sides of $\tilde{K} f_k = \lambda_k f_k$:

$\tilde{K} (\tilde{K} f_k) = \lambda_k\, \tilde{K} f_k$

or, changing the order of integrals on the left-hand side and plugging in $\tilde{K}(x, y) = \phi(x) \cdot \phi(y)$ with empirical $p(x)$:

$\frac{1}{n^2} \sum_i \sum_j \left( \phi(x) \cdot \phi(x_i) \right) \left( \phi(x_i) \cdot \phi(x_j) \right) f_k(x_j) = \frac{\lambda_k}{n} \sum_i \left( \phi(x) \cdot \phi(x_i) \right) f_k(x_i)$

Page 26

Proof of Proposition 2

$\phi(x)^T \left( \frac{1}{n} \sum_i \phi(x_i)\, \phi(x_i)^T \right) \left( \frac{1}{n} \sum_j f_k(x_j)\, \phi(x_j) \right) = \lambda_k\, \phi(x)^T \left( \frac{1}{n} \sum_i f_k(x_i)\, \phi(x_i) \right)$

which contains the elements of the covariance matrix $C$:

$C = \frac{1}{n} \sum_i \phi(x_i)\, \phi(x_i)^T$

thus yielding, with $w_k = \frac{1}{n} \sum_i f_k(x_i)\, \phi(x_i)$,

$\phi(x)^T C\, w_k = \lambda_k\, \phi(x)^T w_k$

So, where $\phi(x)$ takes its values,

$C\, w_k = \lambda_k\, w_k$

i.e. $w_k$ is (up to normalization) also the $k$-th eigenvector $v_k$ of $C$.

Page 27

Proof of Proposition 2

Since $\|w_k\|^2 = \frac{1}{n^2} \sum_{i,j} f_k(x_i) f_k(x_j)\, \tilde{K}(x_i, x_j) = \frac{1}{n} \sum_i f_k(x_i)\, \lambda_k f_k(x_i) = \lambda_k$, the PCA projection on $v_k = w_k / \sqrt{\lambda_k}$ is

$\pi_k(x) = \phi(x) \cdot v_k = \frac{1}{\sqrt{\lambda_k}}\, \phi(x) \cdot w_k = \frac{1}{\sqrt{\lambda_k}} \cdot \frac{1}{n} \sum_i f_k(x_i)\, \tilde{K}(x, x_i) = \frac{(\tilde{K} f_k)(x)}{\sqrt{\lambda_k}} = \sqrt{\lambda_k}\, f_k(x)$

Q.E.D.

Page 28

Proof of Proposition 3

Proposition 3: Given the first $k - 1$ eigenfunctions $f_1, \dots, f_{k-1}$ of a symmetric function $K(x, y)$, the $k$-th one can be obtained by minimizing w.r.t. $f$ and $\lambda$ the expected value of $\left( K(x, y) - \lambda f(x) f(y) - \sum_{j=1}^{k-1} \lambda_j f_j(x) f_j(y) \right)^2$ over $x, y \sim p(x)\, p(y)$. Then we get $f = f_k$ (up to sign) and $\lambda = \lambda_k$.

Proof:

Reconstruction error using the approximation $K(x, y) \approx \lambda f(x) f(y) + \sum_{j=1}^{k-1} \lambda_j f_j(x) f_j(y)$:

$J = E_{x,y}\!\left[ \left( K(x, y) - \lambda f(x) f(y) - \sum_{j=1}^{k-1} \lambda_j f_j(x) f_j(y) \right)^2 \right]$

where $f = \sum_m w_m f_m$ with $\sum_m w_m^2 = 1$, and $(f_j, \lambda_j)$ are the first $k - 1$ (eigenfunction, eigenvalue) pairs in order of decreasing absolute value of $\lambda_j$.

Page 29

Proof of Proposition 3

Minimization of $J$ w.r.t. $\lambda$ gives

$\frac{\partial J}{\partial \lambda} = -2\, E_{x,y}\!\left[ f(x) f(y) \left( K(x, y) - \lambda f(x) f(y) - \sum_{j=1}^{k-1} \lambda_j f_j(x) f_j(y) \right) \right] = 0$

$\Rightarrow\quad \lambda = E_{x,y}[K(x, y) f(x) f(y)] - \sum_{j=1}^{k-1} \lambda_j \langle f_j, f \rangle^2 \quad (1)$

Substituting back into $J$:

$J = E_{x,y}\!\left[ \left( K(x, y) - \sum_{j=1}^{k-1} \lambda_j f_j(x) f_j(y) \right)^2 \right] - \lambda^2$

using eq. (1), so $\lambda^2$ should be maximized. Take the derivative of $\lambda$ w.r.t. $f$:

$\frac{\partial \lambda}{\partial f} = 2 \left( K f - \sum_{j=1}^{k-1} \lambda_j \langle f_j, f \rangle f_j \right)$

Page 30

Proof of Proposition 3

and set it equal to zero under the constraint $\|f\| = 1$; introducing a Lagrange multiplier and identifying it with $\lambda$ (take the inner product with $f$ and use eq. (1)), this yields

$\lambda f = K f - \sum_{j=1}^{k-1} \lambda_j \langle f_j, f \rangle f_j \quad (2)$

Using the recursive assumption that the $f_j$ are orthonormal eigenfunctions for $j < k$, the inner product of (2) with $f_j$ gives $\lambda \langle f_j, f \rangle = \lambda_j \langle f_j, f \rangle - \lambda_j \langle f_j, f \rangle = 0$, so for $\lambda \neq 0$ we have $\langle f_j, f \rangle = 0$ and (2) reduces to

$\lambda f = K f$

Write the application of $K$ in terms of the eigenfunctions, with $f = \sum_m w_m f_m$:

$K f = \sum_m \lambda_m w_m f_m$

Page 31

Proof of Proposition 3

we obtain

$\lambda \sum_m w_m f_m = \sum_m \lambda_m w_m f_m$

Applying Parseval's theorem to take the (squared) norm on both sides:

$\lambda^2 \sum_m w_m^2 = \sum_m \lambda_m^2 w_m^2$

If the $\lambda_m$ are distinct, $\lambda_k^2 > \lambda_m^2$ for $m > k$, and $\lambda^2$ is maximal when $w_k^2 = 1$ and $w_m^2 = 0$ for $m \neq k$, giving $\lambda = \lambda_k$ and $w_k = \pm 1$.

Since $f = \sum_m w_m f_m$ and we obtained $w_k = \pm 1$ and $w_m = 0$ for $m \neq k$, we get $f = f_k$ (up to sign) and $\lambda = \lambda_k$.

Q.E.D.