Clustering in kernel embedding spaces and organization of documents Stéphane Lafon Collaborators: Raphy Coifman (Yale), Yosi Keller (Yale), Ioannis G. Kevrekidis (Princeton), Ann B. Lee (CMU), Boaz Nadler (Weizmann)

Transcript of "Clustering in kernel embedding spaces and organization of documents"

  • Clustering in kernel embedding spaces and organization of documents

    Stéphane Lafon

    Collaborators:

    Raphy Coifman (Yale), Yosi Keller (Yale), Ioannis G. Kevrekidis (Princeton), Ann B. Lee (CMU), Boaz Nadler (Weizmann)

  • Motivation

    Data everywhere:

    • Scientific databases: biomedical/physics

    • The web (web search, ads targeting, ads spam/fraud detection…)

    • Books: Conventional libraries + {Amazon, Google Print…}

    • Military applications (Intelligence, ATR, undersea mapping, situational awareness…)

    • Corporate and financial data

    • …

  • Motivation

    Tremendous need for automated ways of

    • Organizing this information (clustering, parametrization of data sets, adapted metrics on the data)

    • Extracting knowledge (learning, clustering, pattern recognition, dimensionality reduction)

    • Making it useful to end-users (dimensionality reduction)

    All of this in the most efficient way.

  • Challenges

    • data sets are high-dimensional: need for dimension reduction

    • data sets are nonlinear: need to learn these nonlinearities

    • data likely to be noisy: need for robustness.

    • data sets are massive.

  • Highlights of this talk:

    • Review the diffusion framework: parametrization of data sets, dimensionality reduction, diffusion metric

    • Present a few results on clustering in the diffusion space

    • Extend these results to non-symmetric graphs: beyond diffusions.

  • From data sets to graphs

    Let $X = \{x_1, x_2, \ldots, x_n\}$ be a data set.

    Construct a graph $G = (X, W)$ where to each point corresponds a node, and every two nodes $x$ and $y$ are connected by an edge with a non-negative weight $w(x, y)$.

  • Choice of the weight matrix

    The quantity $w(x, y)$ should reflect the degree of similarity or interaction between $x$ and $y$. The choice of the weight is crucial and application-driven.

    Examples:

    • Take a correlation matrix, and throw away values that are not close to 1.

    • If we have a distance metric $d(x, y)$ on the data, consider using $w(x, y) = e^{-d(x, y)^2/\varepsilon}$ (a sketch of this choice follows below).

    • Sometimes, the data already comes in the form of a graph (e.g. social networks).

    In practice, this is a difficult problem! Feature selection, dynamical data sets...
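
    A minimal sketch of the Gaussian-kernel choice above, assuming the data is given as the rows of a NumPy array and that the scale $\varepsilon$ is picked by hand (both assumptions, not part of the talk):

    ```python
    import numpy as np

    def gaussian_weights(X, eps):
        """Weight matrix w(x, y) = exp(-d(x, y)^2 / eps), with d the Euclidean distance.

        X   : (n, p) array, one data point per row (assumed layout).
        eps : kernel scale; in practice often chosen from pairwise-distance statistics.
        """
        sq_norms = (X ** 2).sum(axis=1)
        d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T   # squared distances
        np.maximum(d2, 0.0, out=d2)        # clip tiny negatives caused by round-off
        return np.exp(-d2 / eps)           # symmetric, non-negative weights
    ```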

  • Markov chain on the data

    Define the degree of node $x$ as $d(x) = \sum_{z \in X} w(x, z)$.

    Form the $n$ by $n$ matrix $P$ with entries $p(x, y) = \frac{w(x, y)}{d(x)}$. In other words, $P = D^{-1} W$.

    Because $\sum_{y \in X} p(x, y) = 1$ and $p(x, y) \geq 0$, $P$ is the transition matrix of a Markov chain on the graph of the data.

    $I - P$ is called the "normalized graph Laplacian" [Chung'97]. Notation: the entries of $P^t$ are denoted $p_t(x, y)$.
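
    Continuing the sketch above, the normalization $P = D^{-1} W$ is a single row-wise division (a sketch, not the talk's own code):

    ```python
    import numpy as np

    def markov_matrix(W):
        """Row-normalize a weight matrix: P = D^{-1} W, so that every row sums to 1."""
        d = W.sum(axis=1)                  # degrees d(x) = sum_z w(x, z)
        return W / d[:, None]              # p(x, y) = w(x, y) / d(x)
    ```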

  • Powers of P

    Main idea: the behavior of the Markov chain over different time scales will reveal geometric structures of the data.

  • Time parameter t

    $p_t(x, y)$ is the probability of transition from $x$ to $y$ in $t$ time steps. Therefore, it is close to 1 if $y$ is easily reachable from $x$ in $t$ steps. This happens if there are many paths connecting these two points.

    $t$ defines the granularity of the analysis. Increasing the value of $t$ is a way to integrate the local geometric information: transitions are likely to happen between similar data points and occur rarely otherwise.

  • Assumptions on the weights

    Our object of interest: powers of $P$.

    We now assume that the weight function is symmetric, i.e., $w(x, y) = w(y, x)$, so that the graph is symmetric.

  • Spectral decomposition of P

    With these conditions, $P$ has a sequence of eigenvalues $1 = \lambda_0 \geq \lambda_1 \geq \cdots \geq \lambda_{n-1}$ and a collection of corresponding (right) eigenvectors $\{\psi_m\}$ satisfying $P^t \psi_m = \lambda_m^t \psi_m$.

    In addition, it can be checked that $\lambda_0 = 1$ and $\psi_0 \equiv (1, 1, \ldots, 1)^T$.

  • Diffusion coordinates

    Each eigenvector is a function defined on the data points. Therefore, we can think of the (right) eigenvectors $\{\psi_m\}_{m > 0}$ as forming a set of coordinates on the data set $X$.

    For any choice of $t$, define the mapping

    $$\Psi_t : x \mapsto \big(\lambda_1^t \psi_1(x),\; \lambda_2^t \psi_2(x),\; \ldots,\; \lambda_{n-1}^t \psi_{n-1}(x)\big).$$

    [Figure: the mapping $\Psi_t$ from the original space to the diffusion space.]
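
    A sketch of how $\Psi_t$ can be computed, using a dense eigendecomposition (an assumption that only makes sense for small $n$; large data sets would call for sparse or iterative solvers):

    ```python
    import numpy as np

    def diffusion_map(W, t, m):
        """Diffusion coordinates Psi_t(x) = (lam_1^t psi_1(x), ..., lam_m^t psi_m(x)).

        Works through the symmetric conjugate S = D^{-1/2} W D^{-1/2}, which shares
        its spectrum with P = D^{-1} W (W assumed symmetric and non-negative).
        """
        d = W.sum(axis=1)
        d_inv_sqrt = 1.0 / np.sqrt(d)
        S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
        lam, v = np.linalg.eigh(S)             # eigenvalues in ascending order
        lam, v = lam[::-1], v[:, ::-1]         # reorder so that lam_0 = 1 comes first
        psi = d_inv_sqrt[:, None] * v          # right eigenvectors of P (up to scale)
        return (lam[1:m + 1] ** t) * psi[:, 1:m + 1]   # drop the trivial (lam_0, psi_0) pair
    ```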

  • Over the past 5 years, new techniques have emerged for manifold learning

    • Isomap [Tenenbaum-DeSilva-Langford’00]

    • L.L.E. [Roweis-Saul’00]

    • Laplacian eigenmaps [Belkin-Niyogi’01]

    • Hessian eigenmaps [Donoho-Grimes’03]

    • …

    They all aim at finding coordinates on data sets by computing the eigenfunctions of a psd matrix.

  • Diffusion coordinates

    • Data set: collection of images of handwritten digits "1" (the images are unordered).

    • Weight kernel $w(x, y) = e^{-\|x - y\|^2/\varepsilon}$, where $\|x - y\|$ is the $L^2$ distance between the two images $x$ and $y$.

  • Diffusion coordinates

  • Diffusion distance

    Meaning of the mapping $\Psi_t$: two data points $x$ and $y$ are mapped to $\Psi_t(x)$ and $\Psi_t(y)$ so that the distance between them is equal to the so-called "diffusion distance":

    $$\|\Psi_t(x) - \Psi_t(y)\| = D_t(x, y) = \|p_t(x, \cdot) - p_t(y, \cdot)\|_{L^2(X, 1/\pi)},$$

    where $\pi$ denotes the stationary distribution of the Markov chain, $\pi(x) = d(x)/\sum_z d(z)$.

    Two points are close in this metric if they are highly connected in the graph.

    [Figure: Euclidean distance in the diffusion space corresponds to the diffusion metric in the original space.]
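
    A small numerical check of this identity (a sketch only: it forms $P^t$ densely, which is reasonable only for small $n$):

    ```python
    import numpy as np

    def diffusion_distance(W, t, x, y):
        """D_t(x, y)^2 = sum_z (p_t(x, z) - p_t(y, z))^2 / pi(z)."""
        d = W.sum(axis=1)
        P = W / d[:, None]                     # P = D^{-1} W
        Pt = np.linalg.matrix_power(P, t)      # t-step transition probabilities
        pi = d / d.sum()                       # stationary distribution (W symmetric)
        diff = Pt[x] - Pt[y]
        return np.sqrt(np.sum(diff ** 2 / pi))
    ```

    When all $n - 1$ nontrivial coordinates are kept, this value coincides with the Euclidean distance between $\Psi_t(x)$ and $\Psi_t(y)$.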

  • Diffusion distance

    The diffusion metric measures proximity in terms of connectivity in the graph.

    In particular,

    • It is useful to detect and characterize clusters.

    • It allows one to develop learning algorithms based on the preponderance of evidence.

    • It is very robust to noise, unlike the geodesic distance.

    [Figure: points A and B in two noisy samples of the same data set; histograms of the geodesic distance and of the diffusion distance between A and B.]

    Resistance distance, mean commute time, Green’s function…

  • Dimension reduction

    $$\Psi_t : x \mapsto \big(\lambda_1^t \psi_1(x),\; \lambda_2^t \psi_2(x),\; \ldots,\; \lambda_{n-1}^t \psi_{n-1}(x)\big)$$

    Since $1 = \lambda_0 \geq \lambda_1 \geq \cdots \geq \lambda_{n-1} \geq 0$, not all terms are numerically significant in $\Psi_t$. Therefore, for a given value of $t$, one only needs to embed using the coordinates $m$ for which $\lambda_m^t$ is non-negligible.

    The key to dimensionality reduction: decay of the spectrum of the powers of the transition matrix. A sketch of this truncation rule follows the figure below.

    [Figure: decay of the spectra of P, P^2, P^4, P^8, and P^16.]
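
    A sketch of the truncation rule, assuming the eigenvalues of $P$ are available (they can be computed as in the earlier eigendecomposition sketch) and using an arbitrary relative tolerance delta, which is not specified in the talk:

    ```python
    import numpy as np

    def embedding_dimension(lam, t, delta=1e-2):
        """Number of coordinates m whose weight |lam_m|^t is non-negligible.

        lam   : eigenvalues of P sorted in decreasing order, lam[0] = 1.
        t     : diffusion time.
        delta : relative precision (an assumed tolerance, not from the slides).
        """
        lam = np.asarray(lam, dtype=float)
        weights = np.abs(lam[1:]) ** t         # skip the trivial lam_0 = 1
        return int(np.count_nonzero(weights >= delta * weights[0]))
    ```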

  • Example: image analysis

    1. Take an image

    2. Form the set of all 7x7 patches

    3. Compute the diffusion maps

    Parametrization
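
    A sketch of step 2, extracting all 7x7 patches from a grayscale image stored as a 2-D array (the input format is an assumption):

    ```python
    import numpy as np

    def extract_patches(image, size=7):
        """Return every size-by-size patch of a 2-D grayscale image, one patch per row."""
        h, w = image.shape
        patches = [image[i:i + size, j:j + size].ravel()
                   for i in range(h - size + 1)
                   for j in range(w - size + 1)]
        return np.array(patches)   # shape ((h - size + 1) * (w - size + 1), size * size)
    ```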

  • Example: image analysis

    [Figures: 3-D scatter plots of the patch embeddings, comparing regular diffusion maps with density-normalized diffusion maps.]

    Now, if we drop some patches…

  • Example: Molecular dynamics (with Raphy R. Coifman, Ioannis Kevrekidis, Mauro Maggioni and Boaz Nadler)

    Output of the simulation of a molecule of 7 atoms in a solvent. The molecule evolves in a potential, and is also subject to interactions with the solvent.

    Data set: millions of snapshots of the molecule (coordinates of atoms).

  • Graph matching and data alignment (with Yosi Keller)

    Suppose we have two data sets $X$ and $Y$ with approximately the same geometry.

    Question: how can we match/align/register $X$ and $Y$?

    Since we have coordinates for $X$ and $Y$, we can embed both sets, and align them in diffusion space.

  • Graph matching and data alignment

    Illustration: moving heads. The heads of 3 subjects wearing strange masks are recorded in 3 movies.

    We obtain 3 data sets where each frame is a data point.

    Because this is a constrained system (head-neck mechanical articulation), all 3 sets exhibit approximately the same geometry.

  • Graph matching and data alignment

    We compute 3 density-invariant embeddings.

    [Figure: the yellow, red, and black set embeddings.]

    We then align the 3 data sets in diffusion space using a limited number of landmarks.
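
    One possible way to perform such a landmark-based alignment (a sketch only: the talk does not specify the method, and a rigid Procrustes fit is an assumption here):

    ```python
    import numpy as np
    from scipy.linalg import orthogonal_procrustes

    def align_embeddings(emb_src, emb_dst, landmarks_src, landmarks_dst):
        """Map emb_src into the coordinate frame of emb_dst using matched landmark pairs.

        landmarks_src / landmarks_dst : index arrays of corresponding points (assumed known).
        """
        A = emb_src[landmarks_src]
        B = emb_dst[landmarks_dst]
        mu_A, mu_B = A.mean(axis=0), B.mean(axis=0)
        R, _ = orthogonal_procrustes(A - mu_A, B - mu_B)   # best rotation between landmark sets
        return (emb_src - mu_A) @ R + mu_B
    ```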

  • Graph matching and data alignment

  • Science News articles (reloaded ☺)

    • Data set: collection of documents from Science News.

    • Data modeling: we form a mutual information feature matrix $Q$. For each document $i$ and word $j$, we compute the information shared by $i$ and $j$:

    $$Q(i, j) = \log\!\left(\frac{N_{i,j}/N_i}{N_j/N}\right),$$

    where
    $N_{i,j}$: number of occurrences of word $j$ in document $i$,
    $N_i$: total number of words in document $i$,
    $N_j$: total number of occurrences of word $j$ in all documents,
    $N$: total number of words in all documents.

    • We can work on the rows of $Q$ in order to organize documents, or on its columns to organize words. (Row $i$ of $Q$ corresponds to document $i$, column $j$ to word $j$.)
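
    A sketch of this construction, assuming the corpus is summarized by a document-by-word count matrix N (so N[i, j] is the number of occurrences of word j in document i) and reading the formula above as a pointwise mutual information:

    ```python
    import numpy as np

    def mutual_information_matrix(N):
        """Q[i, j] = log((N_ij / N_i) / (N_j / N)), set to 0 where N_ij = 0."""
        N = np.asarray(N, dtype=float)
        N_i = N.sum(axis=1, keepdims=True)     # total words per document
        N_j = N.sum(axis=0, keepdims=True)     # occurrences of each word over all documents
        N_tot = N.sum()
        with np.errstate(divide="ignore", invalid="ignore"):
            Q = np.log((N / N_i) / (N_j / N_tot))
        Q[~np.isfinite(Q)] = 0.0               # entries with N_ij = 0
        return Q
    ```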

  • Clustering and data subsampling

    [TODO: Matlab demo]

    Organizing documents: we work on the rows of $Q$.

    Automatic extraction of themes, and organization of documents according to these themes. We can work at different levels of granularity by changing the number of diffusion coordinates considered.

  • Clustering of documents in diffusion space (with Ann B. Lee)

    In the diffusion space, we can use classical geometric objects:

    • can use hyperplanes

    • can use Euclidean distance

    • can use balls

    • can apply usual geometric algorithms: k-means, coverings…

    We now show that clustering in the diffusion space is equivalent to compressing the diffusion matrix P.
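
    A minimal sketch, reusing the hypothetical diffusion_map helper from the earlier sketch together with scikit-learn's k-means:

    ```python
    from sklearn.cluster import KMeans

    def cluster_in_diffusion_space(W, t, m, k):
        """Embed the data with m diffusion coordinates at time t, then run k-means."""
        coords = diffusion_map(W, t, m)        # hypothetical helper from the earlier sketch
        return KMeans(n_clusters=k, n_init=10).fit_predict(coords)
    ```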

  • Learning nonlinear structures

    [Figure: the same data set clustered with k-means in the original space and with k-means in the diffusion space.]

  • Coarsening of the graph and Markov chain lumping

    Given an arbitrary partition $\{S_i\}_{1 \leq i \leq k}$ of the nodes into $k$ subsets, we coarsen the graph $G$ into a new graph $\tilde G$ with $k$ meta-nodes. Each meta-node is identified with a subset $S_i$.

    The new edge-weight function is defined by

    $$\tilde w(S_i, S_j) = \sum_{x \in S_i} \sum_{y \in S_j} w(x, y).$$

    We form a new Markov chain on this new graph:

    $$\tilde d(S_i) = \sum_{1 \leq j \leq k} \tilde w(S_i, S_j), \qquad \tilde p(S_i, S_j) = \frac{\tilde w(S_i, S_j)}{\tilde d(S_i)}.$$
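
    A sketch of this coarsening step, assuming the partition is given as one integer label per node (labels 0 to k-1):

    ```python
    import numpy as np

    def coarsen(W, labels, k):
        """Coarse weights w~(S_i, S_j) = sum_{x in S_i, y in S_j} w(x, y) and the
        coarse Markov matrix p~(S_i, S_j) = w~(S_i, S_j) / d~(S_i)."""
        n = W.shape[0]
        R = np.zeros((k, n))
        R[labels, np.arange(n)] = 1.0          # R[i, x] = 1 iff node x belongs to S_i
        W_coarse = R @ W @ R.T                 # aggregate edge weights between subsets
        d_coarse = W_coarse.sum(axis=1)
        return W_coarse, W_coarse / d_coarse[:, None]
    ```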

  • Coarsening of the graph and Markov chain lumping

    [Figure: a graph with edge weights w(x, y) is coarsened into a graph on the meta-nodes S1, S2, S3 with edge weights such as w~(S1, S3).]

  • Equivalence with matrix approximation problem

    The new Markov matrix $\tilde P$ is $k \times k$, and it is an approximation of $P$. Are the spectral properties of $P$ preserved by the coarsening operation?

    Define the approximate eigenvectors of $\tilde P$ by

    $$\tilde\psi_l(S_i) = \frac{1}{\tilde\pi(S_i)} \sum_{x \in S_i} \pi(x)\, \psi_l(x), \qquad \text{where } \tilde\pi(S_i) = \sum_{x \in S_i} \pi(x).$$

    These are not actual eigenvectors of $\tilde P$, but only approximations. The quality of the approximation depends on the choice of the partition $\{S_i\}$.

  • Error of approximation

    Proposition: for all $0 \leq l \leq k - 1$,

    $$\|\tilde P \tilde\psi_l - \lambda_l \tilde\psi_l\|_{\tilde\pi} \leq D,$$

    where $D$ is the distortion in the diffusion space:

    $$D = \left( \sum_{1 \leq i \leq k} \sum_{x \in S_i} \sum_{z \in S_i} \frac{\pi(x)\,\pi(z)}{\tilde\pi(S_i)} \left\| \Psi_t(x) - \Psi_t(z) \right\|^2 \right)^{1/2}.$$

    This is the quantization distortion induced by $\{S_i\}$ when the points carry the mass distribution $\pi$. This result justifies the use of k-means in diffusion space (Meila-Shi) and of coverings for data clustering: any clustering scheme that aims at minimizing this distortion preserves the spectral properties of the Markov chain.

  • Clustering of words: automatic lexicon organization

    Organizing words: we work on the columns of $Q$. We can subsample the word collection by quantizing the diffusion space.

    Clusters can be interpreted as "wordlets", or semantic units. Centers of the clusters are representative of a concept.

    Automatic extraction of concepts from a collection of documents.

  • Clustering and data subsampling

  • Clustering and data subsampling

  • Going multiscale

    [Figure: a data set and its diffusion embeddings into the coordinates $(\lambda_1^t \psi_1, \lambda_2^t \psi_2, \lambda_3^t \psi_3)$ at t = 1, t = 16, and t = 64.]

  • Beyond diffusions

    The previous ideas can be generalized to the wider setting of a directed graph, with signed edge-weights $w(x, y)$, and a set of positive node-weights $\pi(x)$.

    • This time, we consider oriented interactions between data points (e.g. source-target). No spectral theory is available!

    • The node-weight can be used to rescale the relative importance of each data point (e.g. a density estimate).

    • This framework contains diffusion matrices as a special case: in that case the node-weight is equal to the out-degree of each node.

  • Coarsening

    This time we have no spectral embedding. However, for a given partitioning $\{S_i\}$ of the nodes, the same coarsening procedure can still be applied to the graph:

    $$\tilde w(S_i, S_j) = \sum_{x \in S_i} \sum_{y \in S_j} w(x, y) \qquad \text{and} \qquad \tilde\pi(S_i) = \sum_{x \in S_i} \pi(x).$$

    [Figure: a directed graph with edge weights w(x, y) is coarsened into a graph on the meta-nodes S1, S2, S3 with edge weights such as w~(S1, S2).]

  • Beyond spectral embeddings

    Idea: form the matrix $A = \Pi^{-1} W$ and view each row $a(x, \cdot)$ as a feature vector representing $x$. The graph $G$ now lives as a cloud of $n$ points in $\mathbb{R}^n$.

    Similarly, for the coarser graph $\tilde G$, form $\tilde A = \tilde\Pi^{-1} \tilde W$ and obtain a cloud of $k$ points in $\mathbb{R}^k$.

    Question: how do these point sets compare?

  • Comparison between graphs

    $A$ is $n \times n$, whereas $\tilde A$ is $k \times k$. They can be related by the $k \times n$ matrix $Q$ whose rows are the indicators of the partitioning.

    A function $\tilde f$ on the coarse graph lifts to $f^T = \tilde f^T Q$ on the fine graph, and the corresponding diagram should approximately commute: $\tilde f^T Q A \approx \tilde f^T \tilde A Q$ for all $\tilde f$. In other words,

    $$Q A \approx \tilde A Q.$$
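
    A sketch of this consistency check, assuming node weights equal to the out-degrees (the diffusion special case mentioned above) and a partition given as integer labels:

    ```python
    import numpy as np

    def commutation_residual(W, labels, k):
        """Frobenius norm of Q A - A~ Q, with A = Pi^{-1} W and A~ = Pi~^{-1} W~."""
        n = W.shape[0]
        Q = np.zeros((k, n))
        Q[labels, np.arange(n)] = 1.0          # rows of Q are indicators of the partition
        pi = W.sum(axis=1)                     # node weights: out-degrees (diffusion case)
        A = W / pi[:, None]                    # A = Pi^{-1} W
        W_coarse = Q @ W @ Q.T                 # w~(S_i, S_j)
        pi_coarse = Q @ pi                     # pi~(S_i) = sum_{x in S_i} pi(x)
        A_coarse = W_coarse / pi_coarse[:, None]
        return np.linalg.norm(Q @ A - A_coarse @ Q)
    ```

    A small residual indicates that the partition preserves the action of $A$, in line with the proposition on the next slide.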

  • Error of approximation

    We now have a similar result as before.

    Proposition:

    $$\sup_{f \,:\, \|f\|_{1/\tilde\pi} = 1} \left\| f^T Q A - f^T \tilde A Q \right\|_{1/\pi} \leq D_\pi,$$

    where

    $$D_\pi = \left( \sum_{1 \leq i \leq k} \sum_{x \in S_i} \sum_{z \in S_i} \frac{\pi(x)\,\pi(z)}{\tilde\pi(S_i)} \left\| a(x, \cdot) - a(z, \cdot) \right\|^2 \right)^{1/2}$$

    is the distortion resulting from quantizing the set of rows of $A$, with mass distribution $\pi$.

  • Antarctic food web

    [Figure: the Antarctic food web. Nodes include the Sun, CO2, phytoplankton, zooplankton, krill, squid, octopus, small and large fish, penguins (Adelie, Gentoo, Emperor, Macaroni, King, Chinstrap), seals (Leopard, Weddell, Ross, Elephant, Crabeater), whales (Blue, Fin, Sei, Minke, Right, Humpback, Sperm, Killer), and sea birds (Gull, Petrel, Sheathbill, Albatross, Tern, Fulmar, Cormorant, Skua).]

  • Partitioning the forward graph

    [Figure: clusters of species obtained by partitioning the forward food-web graph.]

  • Partitioning the reverse graph

    [Figure: clusters of species obtained by partitioning the reverse food-web graph.]

  • Conclusion

    • Very flexible framework providing coordinates on data sets

    • Diffusion distance relevant to data mining and machine learning applications

    • Clustering meaningful in this embedding space

    • Can be extended to directed graphs

    • Multiscale bases and analyses on graphs

    Thank you!