S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels,...

34
S.V.N. Vishwanathan: Graph Kernels, Page 1 Graph Kernels S.V.N. Vishwanathan [email protected] Purdue University Joint work with Karsten Borgwardt, Nic Schraudolph, and Risi Kondor

Transcript of S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels,...

Page 1: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

S.V.N. Vishwanathan: Graph Kernels, Page 1

Graph KernelsS.V.N. Vishwanathan

[email protected]://www.stat.purdue.edu/~vishy

Purdue University

Joint work with Karsten Borgwardt, Nic Schraudolph, andRisi Kondor

Page 2: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Graphs are Everywhere

S.V.N. Vishwanathan: Graph Kernels, Page 2

Two protein molecules The Internet

Comparing Graphs:How similar are two graphs?

Comparing Nodes:How similar are two nodes of a graph?

Page 3: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Adjacency Matrix

S.V.N. Vishwanathan: Graph Kernels, Page 3

1

23

4

5

6

7 8

vertices/nodes edges

Undirected Graph G(V, E)

A =

0 1 0 0 0 0 1 01 0 1 0 0 0 0 00 1 0 1 0 0 0 10 0 1 0 1 1 0 10 0 0 1 0 1 0 00 0 0 1 1 0 1 01 0 0 0 0 1 0 10 0 1 1 0 0 1 0

sub-matrix of A = a subgraph of G

Page 4: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Degree Matrix

S.V.N. Vishwanathan: Graph Kernels, Page 4

1

23

4

5

6

7 8

D =

2 0 0 0 0 0 0 00 2 0 0 0 0 0 00 0 3 0 0 0 0 00 0 0 4 0 0 0 00 0 0 0 2 0 0 00 0 0 0 0 3 0 00 0 0 0 0 0 3 00 0 0 0 0 0 0 3

Normalized Adjacency matrix A = D−1A is a stochastic ma-trix (each row sums to one)

Page 5: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Graph Laplacian

S.V.N. Vishwanathan: Graph Kernels, Page 5

1

23

4

5

6

7 8

L =

2 −1 0 0 0 0 −1 0−1 2 −1 0 0 0 0 0

0 −1 3 −1 0 0 0 −10 0 −1 4 −1 −1 0 −10 0 0 −1 2 −1 0 00 0 0 −1 −1 3 −1 0−1 0 0 0 0 −1 3 −1

0 0 −1 −1 0 0 −1 3

L = D − A

Normalized version

L = D−12(D − A)D−

12

Spectrum bounded between 0 and 2

Page 6: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Random Walk

S.V.N. Vishwanathan: Graph Kernels, Page 6

1

23

4

5

6

7 8

?

From a vertex i randomly jump to any adjacent vertex j

Probability of jumping to j proportional to Aij

Page 7: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Walks of Length 2

S.V.N. Vishwanathan: Graph Kernels, Page 7

1

23

4

5

6

7 8

A2 =

2 0 1 0 0 1 0 10 2 0 1 0 0 1 11 0 3 1 1 1 1 10 1 1 4 1 1 2 10 0 1 1 2 1 1 11 0 1 1 1 3 0 20 1 1 2 1 0 3 01 1 1 1 1 2 0 3

Entries of A2 = number of length 2 walks

Entries of A2 = probability of length 2 walks

Page 8: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Idea!

S.V.N. Vishwanathan: Graph Kernels, Page 8

1

23

4

5

6

7 8

Count walks?

Count number ofwalks between twonodes

Two nodes are similarif they are connectedby many walks

Does not work :(

If graph has cycles then number of walks goes to∞

Page 9: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

A Better Idea!

S.V.N. Vishwanathan: Graph Kernels, Page 9

1

23

4

5

6

7 8

Count walks? Discount contributionof longer walks

Count number ofwalks between twonodes

Two nodes are similarif they are connectedby many walks

Works if discounting factor chosen appropriately!

Page 10: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Diffusion Kernels

S.V.N. Vishwanathan: Graph Kernels, Page 10

Discounting Factor:Discount a k length walk by λk/k! for 0 ≤ λ ≤ 1

Similarity:Similarity defined as

k(i, j) =

[∑k

λk

k!Ak

]ij

= [exp(λA)]ij

Kondor and Lafferty:Work with diffusion and hence the graph Laplacian

k(i, j) =

[∑k

λk

k!Lk

]ij

= [exp(λL)]ij

They show that this is a valid p.s.d kernel

Page 11: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Extensions

S.V.N. Vishwanathan: Graph Kernels, Page 11

Laplacian as a regularizer:For any real-valued function f on the vertices of a graph

〈f, Lf〉 = f>Lf = −1

2

∑i∼j

(fi − fj)2

Can regularize differently if we replace L by

r(L) :=∑i

r(ρi)lil>i

Any monotonically increasing function of ρ admissible

Page 12: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Smola and Kondor

S.V.N. Vishwanathan: Graph Kernels, Page 12

Other Kernels:

r(ρ) = 1 + σ2ρ, K = (I + σ2L)−1 regularized Laplacianr(ρ) = (1− λρ)−p, K = (I − λL)p p-step random walk

Page 13: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Comparing Graphs

S.V.N. Vishwanathan: Graph Kernels, Page 13

3

4

5

6

1

2

7 8

Count matching walks?

3'1'

2'

4'5'

Count number ofmatching walks intwo graphs

Discount contributionof longer walks

Two graphs are simi-lar if many walks arematching

Three Questions:How to formalize this intuition?How to compute this efficiently?How is this related to diffusion kernels?

Page 14: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Direct Product Graph

S.V.N. Vishwanathan: Graph Kernels, Page 14

3'1'

2'

4'

31

2

11' 21'31'

12'

32'

22'

13'23'

33'14'

24'

34'

X

Formal Definition

V×(G×G′) ={(v, v′) : v ∈ V, v′ ∈ V ′}E×(G×G′) ={((v, v′), (w,w′)) : (v, w) ∈ E, (v′, w′) ∈ E ′}

Page 15: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Key Insight

S.V.N. Vishwanathan: Graph Kernels, Page 15

Random Walk on Product Graph:Equivalent to simultaneous random walk on input graphs

Kernel Definition:

k(G,G′) =1

|G| |G′|∑k

λk

k!e>Ak

× e =1

|G| |G′|e> exp(λA×) e

Page 16: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Extensions

S.V.N. Vishwanathan: Graph Kernels, Page 16

Different Decay Factor (Gärtner et al.):Using a λk decay

k(G,G′) =1

|G||G′|∑k

λk e>Ak× e

=1

|G||G′|e>(I−λA×)−1 e

Taking expectations:Instead of summing, take expectations

k(G,G′) =∑k

λk q>×Ak×p× = q>×(I−λA×)−1p×

p× and q× are initial and stopping probabilities resp.

Page 17: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Efficient Computation

S.V.N. Vishwanathan: Graph Kernels, Page 17

Product Graph is Huge:If G and G′ have n vertices then product graph has n2

verticesAdjacency matrix A× is of size n2 × n2

Houston we have a problem:Kernel computation involves

k(G,G′) = q>× exp(λA×)︸ ︷︷ ︸O(n6) !

or

k(G,G′) = q>× (I−λA×)−1︸ ︷︷ ︸O(n6) !

Page 18: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Kronecker Products

S.V.N. Vishwanathan: Graph Kernels, Page 18

Definition (by example):

A =

[1 00 1

]and B =

2 5 25 2 51 5 2

then

A⊗B =

2 5 2 0 0 05 2 5 0 0 01 5 2 0 0 00 0 0 2 5 20 0 0 5 2 50 0 0 1 5 2

Page 19: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Key Insight

S.V.N. Vishwanathan: Graph Kernels, Page 19

The adjacency matrix of the product graph

A× = A⊗ A′

Can compute exp(A×) as

exp(A×) = exp(A)︸ ︷︷ ︸O(n3)

⊗ exp(A′)︸ ︷︷ ︸O(n3)

Computing (I−λA)−1 involves a bit more work . . .

Page 20: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Sylvester Equations

S.V.N. Vishwanathan: Graph Kernels, Page 20

Claim:Computing the Gärtner et. al kernel is no harder thansolving

X = A′XA> + P

Sylvester Equations:The above equation is called a Sylvester equationWell studied in control theoryEfficiently solvable in O(n3) timedylap method in Matlab

Page 21: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Before the proof . . .

S.V.N. Vishwanathan: Graph Kernels, Page 21

vec operator:

B =

2 5 25 2 51 5 2

and vec(B) =

251525252

Key Equation:

vec(ABC)= (C> ⊗ A) vec(B)

Page 22: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

The Proof

S.V.N. Vishwanathan: Graph Kernels, Page 22

Rewrite the Sylvester equation as

vec(X) = vec(A′XA>) + vec(P )

Apply key equation

vec(X) = (A⊗ A′) vec(X) + vec(P )

Rearrange

(I−A⊗ A′) vec(X) = vec(P )

or equivalently

vec(X) = (I−A⊗ A′)−1 vec(P )

Let p× = vec(P ) and multiply both sides by q×

q>× vec(X) = q>×(I−A⊗ A′)−1p× = K(G,G′)

Page 23: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Other Schemes

S.V.N. Vishwanathan: Graph Kernels, Page 23

Basic Idea:

vec(A′XA>)︸ ︷︷ ︸O(n3)

= (A⊗ A′) vec(X)︸ ︷︷ ︸O(n4)

Can exploit sparsity of A and A′ to speed up things

Fixed Point Iteration:Solve for a fixed point (Kashima et. al):

(I−A⊗ A′) vec(X∞) = vec(X∞)

Conjugate Gradient:Fast matrix-vector multiplication to speed up CG solverConvergence depends on spectrum of A and A′

Page 24: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Relation to Diffusion Kernels

S.V.N. Vishwanathan: Graph Kernels, Page 24

Laplacian of the Direct Product Graph:In general L× 6= L1 ⊗ L2 :(But there is a fix ...

Cartesian Product of Graphs:

V� ={(v, v′) : v ∈ V, v′ ∈ V ′}E� ={((v, v′), (w,w′)) : (v, w) ∈ E, (v′, w′) ∈ E ′}

For Cartesian products

A� = A1 ⊕ A2 := A1 ⊗ I + I ⊗ A2

L� = L1 ⊕ L2

All our efficient computation tricks apply!Is the kernel PSD?

Page 25: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Experiments

S.V.N. Vishwanathan: Graph Kernels, Page 25

Scaling Behavior - I:Begin with empty graphs of size 2k where k = 1, . . . 10

Randomly insert edges untilavg. degree at least 2 orgraph is full

Generate 10 random graphs and compute kernel matrix

Page 26: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Experiments

S.V.N. Vishwanathan: Graph Kernels, Page 26

Scaling Behavior - II:Begin with empty graphs of size 32

Randomly insert edges untilavg. fill-in of adjacency matrix is 10% . . . 100% andGraph is connected

Generate 10 random graphs and compute kernel matrix

Page 27: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Experiments

S.V.N. Vishwanathan: Graph Kernels, Page 27

Impact of the vec-trick:Same graphs as the runtime vs nodes experimentUse the vec trick in the fixed point iterationCompare to original fixed point iteration

Page 28: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Experiments

S.V.N. Vishwanathan: Graph Kernels, Page 28

Unlabeled Graphs: We computed graph kernels on fourdatasets for molecular function prediction: Mutag andPtc (chemical compounds), Enzyme and Protein (proteinstructures). We report runtimes for computing a 100 × 100kernel matrix.

dataset Mutag Ptc Enzyme Proteinnodes/graph 17.7 26.7 32.6 38.6edges/node 2.2 1.9 3.8 3.7

Direct 18’09" 142’53" 31h* 36d*Sylvester 25.9" 73.8" 48.3" 69’15"

Conjugate 42.1" 58.4" 44.6" 55.3"Fixed-Point 12.3" 32.4" 13.6" 31.1"

Page 29: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Experiments

S.V.N. Vishwanathan: Graph Kernels, Page 29

Labeled Graphs: We repeated the above graph kernel com-putation, now using either a linear or delta kernel betweennode labels as well.

kernel delta lineardataset Mutag Ptc Enzyme Protein

Direct 7.2h 1.4d* 2.4d* 5.3d*Sylvester 3.9d* 2.7d* 89.8" 25’24"

Conjugate 2’35" 3’20" 124.4" 3’01"Fixed Point 1’05" 1’31" 50.1" 1’47"

Page 30: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

What I did not talk about

S.V.N. Vishwanathan: Graph Kernels, Page 30

Random walks on other semirings e.g. (min,+)

Why (min,+) does not yield p.s.d kernels

Differences between A, A, L, and L

Kernels on vertices (yields marginal graph kernels ofKashima et. al)

Extensions to trajectories of ARMA models (joint work withRené Vidal and Alex Smola)

General theory using Binet-Cauchy theorem (joint workwith Alex Smola)

Connections to Rational kernels of Cortes et. al

Connections to R-Convolution kernels of Haussler

Page 31: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Overview of my Research

S.V.N. Vishwanathan: Graph Kernels, Page 31

Structured Input:StringsGraphsARMA models

Structured Output:Exponential families in feature space

Optimization for Machine Learning:Bundle methodssubBFGS

TheoryFundamental limitations of kernelsRates of convergence of boosting algorithms

Page 32: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Conclusion

S.V.N. Vishwanathan: Graph Kernels, Page 32

First unifying view of

Diffusion kernelsRegularization on graphsGeometric and random walk kernelsMarginal graph kernels

Efficient computation by exploiting Kronecker products

Papers at http://www.stat.purdue.edu/~vishy

Page 33: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

Big Open Question

S.V.N. Vishwanathan: Graph Kernels, Page 33

Comparing paths in two different graphs is polynomial

Subgraph isomorphism is known to be NP-hard

Computing the so-called universal graph kernel whichcounts all common subgraphs of two graphs is harder thansubgraph isomorphism

When we compare any other subgraphs e.g.

simple paths (where vertices do not repeat)cyclestrees

we seem to lose polynomial run-time

Are there other subgraphs for which efficient computationis possible?

Page 34: S.V.N. Vishwanathan - Purdue Universityvishy/talks/Graphs.pdf · S.V.N. Vishwanathan:Graph Kernels, Page 28 Unlabeled Graphs: We computed graph kernels on four datasets for molecular

References

S.V.N. Vishwanathan: Graph Kernels, Page 34

Journal Papers[1] S. V. N. Vishwanathan, Karsten Borgwardt, Nicol N. Schraudolph, and Imre Risi Kondor. On graph kernels. J. Mach. Learn. Res.,

2008. submitted.

[2] S. V. N. Vishwanathan, A. J. Smola, and R. Vidal. Binet-Cauchy kernels on dynamical systems and its application to the analysis ofdynamic scenes. International Journal of Computer Vision, 73(1):95–119, 2007.

Conference Papers[1] S. V. N. Vishwanathan, Karsten Borgwardt, and Nicol N. Schraudolph. Fast computation of graph kernels. Technical report, NICTA,

2006.

[2] S. V. N. Vishwanathan and A. J. Smola. Binet-Cauchy kernels. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in NeuralInformation Processing Systems 17, pages 1441–1448, Cambridge, MA, 2005. MIT Press.

Applications to Bioinformatics[1] Karsten M. Borgwardt, H.-P. Kriegel, S. V. N. Vishwanathan, and N. Schraudolph. Graph kernels for disease outcome prediction from

protein-protein interaction networks. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany Murray, and Teri E Klein, editors,Proceedings of the Pacific Symposium of Biocomputing 2007, Maui Hawaii, January 2007. World Scientific.

[2] Karsten M. Borgwardt, S. V. N. Vishwanathan, and H.-P. Kriegel. Class prediction from time series gene expression profiles us-ing dynamical systems kernels. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany Murray, and Teri E Klein, editors,Proceedings of the Pacific Symposium of Biocomputing 2006, pages 547–558, Maui Hawaii, January 2006. World Scientific.

[3] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. N. Vishwanathan, A. J. Smola, and H.P. Kriegel. Protein function prediction viagraph kernels. In Proceedings of Intelligent Systems in Molecular Biology (ISMB), Detroit, USA, 2005.