Clustering in kernel embedding spaces and organization of documents Stéphane Lafon Collaborators: Raphy Coifman (Yale), Yosi Keller (Yale), Ioannis G. Kevrekidis (Princeton), Ann B. Lee (CMU), Boaz Nadler (Weizmann)

Transcript of "Clustering in kernel embedding spaces and organization of documents"

  • Clustering in kernel embedding spaces and organization of documents

    Stéphane Lafon

    Collaborators:

    Raphy Coifman (Yale), Yosi Keller (Yale), Ioannis G. Kevrekidis (Princeton), Ann B. Lee (CMU), Boaz Nadler (Weizmann)

  • Motivation

    Data everywhere:

    • Scientific databases: biomedical/physics

    • The web (web search, ads targeting, ads spam/fraud detection…)

    • Books: Conventional libraries + {Amazon, Google Print…}

    • Military applications (Intelligence, ATR, undersea mapping, situational awareness…)

    • Corporate and financial data

    • …

  • Motivation

    Tremendous need for automated ways of

    • Organizing this information (clustering, parametrization of data sets, adapted metrics on the data)

    • Extracting knowledge (learning, clustering, pattern recognition, dimensionality reduction)

    • Making it useful to end-users (dimensionality reduction)

    All of this in the most efficient way.

  • Challenges

    • data sets are high-dimensional: need for dimension reduction

    • data sets are nonlinear: need to learn these nonlinearities

    • data likely to be noisy: need for robustness.

    • data sets are massive.

  • Highlights of this talk:

    • Review the diffusion framework: parametrization of data sets, dimensionality reduction, diffusion metric

    • Present a few results on clustering in the diffusion space

    • Extend these results to non-symmetric graphs: beyond diffusions.

  • From data sets to graphs

    Let $X = \{x_1, x_2, \ldots, x_n\}$ be a data set.

    Construct a graph $G = (X, W)$ where to each point corresponds a node, and every two nodes $x$ and $y$ are connected by an edge with a non-negative weight $w(x, y)$.

  • Choice of the weight matrix

    The quantity $w(x, y)$ should reflect the degree of similarity or interaction between $x$ and $y$. The choice of the weight is crucial and application-driven.

    Examples:

    • Take a correlation matrix, and throw away values that are not close to 1.

    • If we have a distance metric $d(x, y)$ on the data, consider using $w(x, y) = e^{-d(x, y)^2/\varepsilon}$ (a sketch of this choice follows below).

    • Sometimes, the data already comes in the form of a graph (e.g. social networks).

    In practice, this is a difficult problem! Feature selection, dynamical data sets...
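
    A minimal sketch of the Gaussian-kernel choice above, assuming the data is given as the rows of a NumPy array and that the scale $\varepsilon$ is picked by hand (both assumptions, not part of the talk):

    ```python
    import numpy as np

    def gaussian_weights(X, eps):
        """Weight matrix w(x, y) = exp(-d(x, y)^2 / eps), with d the Euclidean distance.

        X   : (n, p) array, one data point per row (assumed layout).
        eps : kernel scale; in practice often chosen from pairwise-distance statistics.
        """
        sq_norms = (X ** 2).sum(axis=1)
        d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T   # squared distances
        np.maximum(d2, 0.0, out=d2)        # clip tiny negatives caused by round-off
        return np.exp(-d2 / eps)           # symmetric, non-negative weights
    ```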

  • Markov chain on the data

    Define the degree of node $x$ as $d(x) = \sum_{z \in X} w(x, z)$.

    Form the $n$ by $n$ matrix $P$ with entries $p(x, y) = \frac{w(x, y)}{d(x)}$. In other words, $P = D^{-1} W$.

    Because $\sum_{y \in X} p(x, y) = 1$ and $p(x, y) \geq 0$, $P$ is the transition matrix of a Markov chain on the graph of the data.

    $I - P$ is called the "normalized graph Laplacian" [Chung'97]. Notation: the entries of $P^t$ are denoted $p_t(x, y)$.
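
    Continuing the sketch above, the normalization $P = D^{-1} W$ is a single row-wise division (a sketch, not the talk's own code):

    ```python
    import numpy as np

    def markov_matrix(W):
        """Row-normalize a weight matrix: P = D^{-1} W, so that every row sums to 1."""
        d = W.sum(axis=1)                  # degrees d(x) = sum_z w(x, z)
        return W / d[:, None]              # p(x, y) = w(x, y) / d(x)
    ```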

  • Powers of P

    Main idea: the behavior of the Markov chain over different time scales will reveal geometric structures of the data.

  • Time parameter t

    $p_t(x, y)$ is the probability of transition from $x$ to $y$ in $t$ time steps. Therefore, it is close to 1 if $y$ is easily reachable from $x$ in $t$ steps. This happens if there are many paths connecting these two points.

    $t$ defines the granularity of the analysis. Increasing the value of $t$ is a way to integrate the local geometric information: transitions are likely to happen between similar data points and occur rarely otherwise.

  • Assumptions on the weights

    Our object of interest: powers of $P$.

    We now assume that the weight function is symmetric, i.e., $w(x, y) = w(y, x)$, so that the graph is symmetric.

  • Spectral decomposition of P

    With these conditions, $P$ has a sequence of eigenvalues $1 = \lambda_0 \geq \lambda_1 \geq \cdots \geq \lambda_{n-1}$ and a collection of corresponding (right) eigenvectors $\{\psi_m\}$ satisfying $P^t \psi_m = \lambda_m^t \psi_m$.

    In addition, it can be checked that $\lambda_0 = 1$ and $\psi_0 \equiv (1, 1, \ldots, 1)^T$.

  • Diffusion coordinates

    Each eigenvector is a function defined on the data points. Therefore, we can think of the (right) eigenvectors $\{\psi_m\}_{m > 0}$ as forming a set of coordinates on the data set $X$.

    For any choice of $t$, define the mapping

    $$\Psi_t : x \mapsto \big(\lambda_1^t \psi_1(x),\; \lambda_2^t \psi_2(x),\; \ldots,\; \lambda_{n-1}^t \psi_{n-1}(x)\big).$$

    [Figure: the mapping $\Psi_t$ from the original space to the diffusion space.]
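
    A sketch of how $\Psi_t$ can be computed, using a dense eigendecomposition (an assumption that only makes sense for small $n$; large data sets would call for sparse or iterative solvers):

    ```python
    import numpy as np

    def diffusion_map(W, t, m):
        """Diffusion coordinates Psi_t(x) = (lam_1^t psi_1(x), ..., lam_m^t psi_m(x)).

        Works through the symmetric conjugate S = D^{-1/2} W D^{-1/2}, which shares
        its spectrum with P = D^{-1} W (W assumed symmetric and non-negative).
        """
        d = W.sum(axis=1)
        d_inv_sqrt = 1.0 / np.sqrt(d)
        S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
        lam, v = np.linalg.eigh(S)             # eigenvalues in ascending order
        lam, v = lam[::-1], v[:, ::-1]         # reorder so that lam_0 = 1 comes first
        psi = d_inv_sqrt[:, None] * v          # right eigenvectors of P (up to scale)
        return (lam[1:m + 1] ** t) * psi[:, 1:m + 1]   # drop the trivial (lam_0, psi_0) pair
    ```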

  • Over the past 5 years, new techniques have emerged for manifold learning

    • Isomap [Tenenbaum-DeSilva-Langford’00]

    • L.L.E. [Roweis-Saul’00]

    • Laplacian eigenmaps [Belkin-Niyogi’01]

    • Hessian eigenmaps [Donoho-Grimes’03]

    • …

    They all aim at finding coordinates on data sets by computing the eigenfunctions of a psd matrix.

  • Diffusion coordinates

    • Data set: collection of images of handwritten digits "1" (the images are unordered).

    • Weight kernel $w(x, y) = e^{-\|x - y\|^2/\varepsilon}$, where $\|x - y\|$ is the $L^2$ distance between the two images $x$ and $y$.

  • Diffusion coordinates

  • Diffusion distance

    Meaning of the mapping $\Psi_t$: two data points $x$ and $y$ are mapped to $\Psi_t(x)$ and $\Psi_t(y)$ so that the distance between them is equal to the so-called "diffusion distance":

    $$\|\Psi_t(x) - \Psi_t(y)\| = D_t(x, y) = \|p_t(x, \cdot) - p_t(y, \cdot)\|_{L^2(X, 1/\pi)},$$

    where $\pi$ denotes the stationary distribution of the Markov chain, $\pi(x) = d(x)/\sum_z d(z)$.

    Two points are close in this metric if they are highly connected in the graph.

    [Figure: Euclidean distance in the diffusion space corresponds to the diffusion metric in the original space.]
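
    A small numerical check of this identity (a sketch only: it forms $P^t$ densely, which is reasonable only for small $n$):

    ```python
    import numpy as np

    def diffusion_distance(W, t, x, y):
        """D_t(x, y)^2 = sum_z (p_t(x, z) - p_t(y, z))^2 / pi(z)."""
        d = W.sum(axis=1)
        P = W / d[:, None]                     # P = D^{-1} W
        Pt = np.linalg.matrix_power(P, t)      # t-step transition probabilities
        pi = d / d.sum()                       # stationary distribution (W symmetric)
        diff = Pt[x] - Pt[y]
        return np.sqrt(np.sum(diff ** 2 / pi))
    ```

    When all $n - 1$ nontrivial coordinates are kept, this value coincides with the Euclidean distance between $\Psi_t(x)$ and $\Psi_t(y)$.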

  • Diffusion distance

    The diffusion metric measures proximity in terms of connectivity in the graph.

    In particular,

    • It is useful to detect and characterize clusters.

    • It allows one to develop learning algorithms based on the preponderance of evidence.

    • It is very robust to noise, unlike the geodesic distance.

    [Figure: points A and B in two noisy samples of the same data set; histograms of the geodesic distance and of the diffusion distance between A and B.]

    Resistance distance, mean commute time, Green’s function…

  • Dimension reduction

    $$\Psi_t : x \mapsto \big(\lambda_1^t \psi_1(x),\; \lambda_2^t \psi_2(x),\; \ldots,\; \lambda_{n-1}^t \psi_{n-1}(x)\big)$$

    Since $1 = \lambda_0 \geq \lambda_1 \geq \cdots \geq \lambda_{n-1} \geq 0$, not all terms are numerically significant in $\Psi_t$. Therefore, for a given value of $t$, one only needs to embed using the coordinates $m$ for which $\lambda_m^t$ is non-negligible.

    The key to dimensionality reduction: decay of the spectrum of the powers of the transition matrix. A sketch of this truncation rule follows the figure below.

    [Figure: decay of the spectra of P, P^2, P^4, P^8, and P^16.]
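
    A sketch of the truncation rule, assuming the eigenvalues of $P$ are available (they can be computed as in the earlier eigendecomposition sketch) and using an arbitrary relative tolerance delta, which is not specified in the talk:

    ```python
    import numpy as np

    def embedding_dimension(lam, t, delta=1e-2):
        """Number of coordinates m whose weight |lam_m|^t is non-negligible.

        lam   : eigenvalues of P sorted in decreasing order, lam[0] = 1.
        t     : diffusion time.
        delta : relative precision (an assumed tolerance, not from the slides).
        """
        lam = np.asarray(lam, dtype=float)
        weights = np.abs(lam[1:]) ** t         # skip the trivial lam_0 = 1
        return int(np.count_nonzero(weights >= delta * weights[0]))
    ```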

  • Example: image analysis

    1. Take an image

    2. Form the set of all 7x7 patches

    3. Compute the diffusion maps

    Parametrization
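
    A sketch of step 2, extracting all 7x7 patches from a grayscale image stored as a 2-D array (the input format is an assumption):

    ```python
    import numpy as np

    def extract_patches(image, size=7):
        """Return every size-by-size patch of a 2-D grayscale image, one patch per row."""
        h, w = image.shape
        patches = [image[i:i + size, j:j + size].ravel()
                   for i in range(h - size + 1)
                   for j in range(w - size + 1)]
        return np.array(patches)   # shape ((h - size + 1) * (w - size + 1), size * size)
    ```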

  • Example: image analysis

    [Figures: 3-D scatter plots of the patch embeddings, comparing regular diffusion maps with density-normalized diffusion maps.]

    Now, if we drop some patches…

  • Example: Molecular dynamics (with Raphy R. Coifman, Ioannis Kevrekidis, Mauro Maggioni and Boaz Nadler)

    Output of the simulation of a molecule of 7 atoms in a solvent. The molecule evolves in a potential, and is also subject to interactions with the solvent.

    Data set: millions of snapshots of the molecule (coordinates of atoms).

  • Graph matching and data alignment (with Yosi Keller)

    Suppose we have two data sets $X$ and $Y$ with approximately the same geometry.

    Question: how can we match/align/register $X$ and $Y$?

    Since we have coordinates for $X$ and $Y$, we can embed both sets, and align them in diffusion space.

  • Graph matching and data alignment

    Illustration: moving heads. The heads of 3 subjects wearing strange masks are recorded in 3 movies.

    We obtain 3 data sets where each frame is a data point.

    Because this is a constrained system (head-neck mechanical articulation), all 3 sets exhibit approximately the same geometry.

  • Graph matching and data alignment

    We compute 3 density-invariant embeddings.

    [Figure: the yellow, red, and black set embeddings.]

    We then align the 3 data sets in diffusion space using a limited number of landmarks.
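
    One possible way to perform such a landmark-based alignment (a sketch only: the talk does not specify the method, and a rigid Procrustes fit is an assumption here):

    ```python
    import numpy as np
    from scipy.linalg import orthogonal_procrustes

    def align_embeddings(emb_src, emb_dst, landmarks_src, landmarks_dst):
        """Map emb_src into the coordinate frame of emb_dst using matched landmark pairs.

        landmarks_src / landmarks_dst : index arrays of corresponding points (assumed known).
        """
        A = emb_src[landmarks_src]
        B = emb_dst[landmarks_dst]
        mu_A, mu_B = A.mean(axis=0), B.mean(axis=0)
        R, _ = orthogonal_procrustes(A - mu_A, B - mu_B)   # best rotation between landmark sets
        return (emb_src - mu_A) @ R + mu_B
    ```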

  • Graph matching and data alignment

  • Science News articles (reloaded ☺)

    • Data set: collection of documents from Science News.

    • Data modeling: we form a mutual information feature matrix $Q$. For each document $i$ and word $j$, we compute the information shared by $i$ and $j$:

    $$Q(i, j) = \log\!\left(\frac{N_{i,j}/N_i}{N_j/N}\right),$$

    where
    $N_{i,j}$: number of occurrences of word $j$ in document $i$,
    $N_i$: total number of words in document $i$,
    $N_j$: total number of occurrences of word $j$ in all documents,
    $N$: total number of words in all documents.

    • We can work on the rows of $Q$ in order to organize documents, or on its columns to organize words. (Row $i$ of $Q$ corresponds to document $i$, column $j$ to word $j$.)
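
    A sketch of this construction, assuming the corpus is summarized by a document-by-word count matrix N (so N[i, j] is the number of occurrences of word j in document i) and reading the formula above as a pointwise mutual information:

    ```python
    import numpy as np

    def mutual_information_matrix(N):
        """Q[i, j] = log((N_ij / N_i) / (N_j / N)), set to 0 where N_ij = 0."""
        N = np.asarray(N, dtype=float)
        N_i = N.sum(axis=1, keepdims=True)     # total words per document
        N_j = N.sum(axis=0, keepdims=True)     # occurrences of each word over all documents
        N_tot = N.sum()
        with np.errstate(divide="ignore", invalid="ignore"):
            Q = np.log((N / N_i) / (N_j / N_tot))
        Q[~np.isfinite(Q)] = 0.0               # entries with N_ij = 0
        return Q
    ```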

  • Clustering and data subsampling

    [TODO: Matlab demo]

    Organizing documents: we work on the rows of $Q$.

    Automatic extraction of themes, and organization of documents according to these themes. We can work at different levels of granularity by changing the number of diffusion coordinates considered.

  • Clustering of documents in diffusion space (with Ann B. Lee)

    In the diffusion space, we can use classical geometric objects:

    • can use hyperplanes

    • can use Euclidean distance

    • can use balls

    • can apply usual geometric algorithms: k-means, coverings…

    We now show that clustering in the diffusion space is equivalent to compressing the diffusion matrix P.
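
    A minimal sketch, reusing the hypothetical diffusion_map helper from the earlier sketch together with scikit-learn's k-means:

    ```python
    from sklearn.cluster import KMeans

    def cluster_in_diffusion_space(W, t, m, k):
        """Embed the data with m diffusion coordinates at time t, then run k-means."""
        coords = diffusion_map(W, t, m)        # hypothetical helper from the earlier sketch
        return KMeans(n_clusters=k, n_init=10).fit_predict(coords)
    ```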

  • Learning nonlinear structures

    [Figure: the same data set clustered with k-means in the original space and with k-means in the diffusion space.]

  • Coarsening of the graph and Markov chain lumping

    Given an arbitrary partition $\{S_i\}_{1 \leq i \leq k}$ of the nodes into $k$ subsets, we coarsen the graph $G$ into a new graph $\tilde G$ with $k$ meta-nodes. Each meta-node is identified with a subset $S_i$.

    The new edge-weight function is defined by

    $$\tilde w(S_i, S_j) = \sum_{x \in S_i} \sum_{y \in S_j} w(x, y).$$

    We form a new Markov chain on this new graph:

    $$\tilde d(S_i) = \sum_{1 \leq j \leq k} \tilde w(S_i, S_j), \qquad \tilde p(S_i, S_j) = \frac{\tilde w(S_i, S_j)}{\tilde d(S_i)}.$$
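
    A sketch of this coarsening step, assuming the partition is given as one integer label per node (labels 0 to k-1):

    ```python
    import numpy as np

    def coarsen(W, labels, k):
        """Coarse weights w~(S_i, S_j) = sum_{x in S_i, y in S_j} w(x, y) and the
        coarse Markov matrix p~(S_i, S_j) = w~(S_i, S_j) / d~(S_i)."""
        n = W.shape[0]
        R = np.zeros((k, n))
        R[labels, np.arange(n)] = 1.0          # R[i, x] = 1 iff node x belongs to S_i
        W_coarse = R @ W @ R.T                 # aggregate edge weights between subsets
        d_coarse = W_coarse.sum(axis=1)
        return W_coarse, W_coarse / d_coarse[:, None]
    ```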

  • Coarsening of the graph and Markov chain lumping

    [Figure: a graph with edge weights w(x, y) is coarsened into a graph on the meta-nodes S1, S2, S3 with edge weights such as w~(S1, S3).]

  • Equivalence with matrix approximation problem

    The new Markov matrix $\tilde P$ is $k \times k$, and it is an approximation of $P$. Are the spectral properties of $P$ preserved by the coarsening operation?

    Define the approximate eigenvectors of $\tilde P$ by

    $$\tilde\psi_l(S_i) = \frac{1}{\tilde\pi(S_i)} \sum_{x \in S_i} \pi(x)\, \psi_l(x), \qquad \text{where } \tilde\pi(S_i) = \sum_{x \in S_i} \pi(x).$$

    These are not actual eigenvectors of $\tilde P$, but only approximations. The quality of the approximation depends on the choice of the partition $\{S_i\}$.

  • Error of approximation

    Proposition: for all $0 \leq l \leq k - 1$,

    $$\|\tilde P \tilde\psi_l - \lambda_l \tilde\psi_l\|_{\tilde\pi} \leq D,$$

    where $D$ is the distortion in the diffusion space:

    $$D = \left( \sum_{1 \leq i \leq k} \sum_{x \in S_i} \sum_{z \in S_i} \frac{\pi(x)\,\pi(z)}{\tilde\pi(S_i)} \left\| \Psi_t(x) - \Psi_t(z) \right\|^2 \right)^{1/2}.$$

    This is the quantization distortion induced by $\{S_i\}$ when the points carry the mass distribution $\pi$. This result justifies the use of k-means in diffusion space (Meila-Shi) and of coverings for data clustering: any clustering scheme that aims at minimizing this distortion preserves the spectral properties of the Markov chain.

  • Clustering of words: automatic lexicon organization

    Organizing words: we work on the columns of $Q$. We can subsample the word collection by quantizing the diffusion space.

    Clusters can be interpreted as "wordlets", or semantic units. Centers of the clusters are representative of a concept.

    Automatic extraction of concepts from a collection of documents.

  • Clustering and data subsampling

  • Clustering and data subsampling

  • Going multiscale

    [Figure: a data set and its diffusion embeddings into the coordinates $(\lambda_1^t \psi_1, \lambda_2^t \psi_2, \lambda_3^t \psi_3)$ at t = 1, t = 16, and t = 64.]

  • Beyond diffusions

    The previous ideas can be generalized to the wider setting of a directed graph, with signed edge-weights $w(x, y)$, and a set of positive node-weights $\pi(x)$.

    • This time, we consider oriented interactions between data points (e.g. source-target). No spectral theory is available!

    • The node-weight can be used to rescale the relative importance of each data point (e.g. a density estimate).

    • This framework contains diffusion matrices as a special case: in that case the node-weight is equal to the out-degree of each node.

  • Coarsening

    This time we have no spectral embedding. However, for a given partitioning $\{S_i\}$ of the nodes, the same coarsening procedure can still be applied to the graph:

    $$\tilde w(S_i, S_j) = \sum_{x \in S_i} \sum_{y \in S_j} w(x, y) \qquad \text{and} \qquad \tilde\pi(S_i) = \sum_{x \in S_i} \pi(x).$$

    [Figure: a directed graph with edge weights w(x, y) is coarsened into a graph on the meta-nodes S1, S2, S3 with edge weights such as w~(S1, S2).]

  • Beyond spectral embeddings

    Idea: form the matrix $A = \Pi^{-1} W$ and view each row $a(x, \cdot)$ as a feature vector representing $x$. The graph $G$ now lives as a cloud of $n$ points in $\mathbb{R}^n$.

    Similarly, for the coarser graph $\tilde G$, form $\tilde A = \tilde\Pi^{-1} \tilde W$ and obtain a cloud of $k$ points in $\mathbb{R}^k$.

    Question: how do these point sets compare?

  • Comparison between graphs

    $A$ is $n \times n$, whereas $\tilde A$ is $k \times k$. They can be related by the $k \times n$ matrix $Q$ whose rows are the indicators of the partitioning.

    A function $\tilde f$ on the coarse graph lifts to $f^T = \tilde f^T Q$ on the fine graph, and the corresponding diagram should approximately commute: $\tilde f^T Q A \approx \tilde f^T \tilde A Q$ for all $\tilde f$. In other words,

    $$Q A \approx \tilde A Q.$$
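
    A sketch of this consistency check, assuming node weights equal to the out-degrees (the diffusion special case mentioned above) and a partition given as integer labels:

    ```python
    import numpy as np

    def commutation_residual(W, labels, k):
        """Frobenius norm of Q A - A~ Q, with A = Pi^{-1} W and A~ = Pi~^{-1} W~."""
        n = W.shape[0]
        Q = np.zeros((k, n))
        Q[labels, np.arange(n)] = 1.0          # rows of Q are indicators of the partition
        pi = W.sum(axis=1)                     # node weights: out-degrees (diffusion case)
        A = W / pi[:, None]                    # A = Pi^{-1} W
        W_coarse = Q @ W @ Q.T                 # w~(S_i, S_j)
        pi_coarse = Q @ pi                     # pi~(S_i) = sum_{x in S_i} pi(x)
        A_coarse = W_coarse / pi_coarse[:, None]
        return np.linalg.norm(Q @ A - A_coarse @ Q)
    ```

    A small residual indicates that the partition preserves the action of $A$, in line with the proposition on the next slide.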

  • Error of approximation

    We now have a similar result as before.

    Proposition:

    $$\sup_{f \,:\, \|f\|_{1/\tilde\pi} = 1} \left\| f^T Q A - f^T \tilde A Q \right\|_{1/\pi} \leq D_\pi,$$

    where

    $$D_\pi = \left( \sum_{1 \leq i \leq k} \sum_{x \in S_i} \sum_{z \in S_i} \frac{\pi(x)\,\pi(z)}{\tilde\pi(S_i)} \left\| a(x, \cdot) - a(z, \cdot) \right\|^2 \right)^{1/2}$$

    is the distortion resulting from quantizing the set of rows of $A$, with mass distribution $\pi$.

  • Antarctic food web

    [Figure: the Antarctic food web. Nodes include the Sun, CO2, phytoplankton, zooplankton, krill, squid, octopus, small and large fish, penguins (Adelie, Gentoo, Emperor, Macaroni, King, Chinstrap), seals (Leopard, Weddell, Ross, Elephant, Crabeater), whales (Blue, Fin, Sei, Minke, Right, Humpback, Sperm, Killer), and sea birds (Gull, Petrel, Sheathbill, Albatross, Tern, Fulmar, Cormorant, Skua).]

  • Partitioning the forward graph

    [Figure: clusters of species obtained by partitioning the forward food-web graph.]

  • Partitioning the reverse graph

    [Figure: clusters of species obtained by partitioning the reverse food-web graph.]

  • Conclusion

    • Very flexible framework providing coordinates on data sets

    • Diffusion distance relevant to data mining and machine learning applications

    • Clustering meaningful in this embedding space

    • Can be extended to directed graphs

    • Multiscale bases and analyses on graphs

    Thank you!