
An Attributed Graph Kernel from The Jensen-Shannon Divergence

Lu Bai*, Horst Bunke^, and Edwin R. Hancock*

*Department of Computer Science, University of York, UK
^Institute of Computer Science and Applied Mathematics, University of Bern, Switzerland

Contribution

In previous work we reported a new graph kernel based on the Jensen-Shannon divergence, computed either between graph von Neumann entropies or between the Shannon entropies of random-walk probability distributions.

It is difficult to estimate the entropy of the composite (overlap) structure.

These kernels ignore node label and attribute information.

Here we present an information-theoretic way to extend the Jensen-Shannon graph kernel using tree indexing and label strengthening.

Outline

Background and Motivation

Attributed Jensen-Shannon Diffusion Kernel

Jensen-Shannon divergence

Tree-index for label strengthening

Shannon label entropy

Attributed Jensen-Shannon diffusion kernel

Experiments

Conclusions

Background and Motivation (Graph Kernels)

Graph Kernels: Why use graph kernels?

Kernels offer an elegant way to avoid the cost of explicit computation in a high-dimensional feature space [K. Riesen and H. Bunke, 2009, Pattern Recognition].

Existing Graph Kernels from the R-convolution [Haussler, 1999]

Random walk based kernels

Product graph kernels [Gartner et al., 2003, ICML]

Marginalized kernels on graphs [Kashima et al., 2003, ICML]

Path based kernels

Shortest path kernel [Borgwardt and Kriegel, 2005, ICDM]

Restricted subgraphs or subtrees based kernels

A Weisfeiler-Lehman subtree kernel [Shervashidze et al., 2010, JMLR]

A graphlet count kernel [Shervashidze et al., 2009, ICML]

A neighborhood subgraph kernel [Costa and De Grave, 2010, ICML]

Graph kernels from the classical and quantum Jensen-Shannon (JS) divergence: 1) the JS kernel [Bai and Hancock, 2013, JMIV], 2) the quantum JS kernel [Bai et al., 2014, Pattern Recognition], and 3) the fast JS subgraph kernel [Bai and Hancock, 2013, ICIAP]

Hypergraph kernels from the random walk [Wachman et al., 2009, ICML] and from the JSD

Background and Motivation (Graph Kernels)

Drawbacks of the existing R-convolution kernels

1) Definition of R-convolution kernels: for a pair of graphs Gp and Gq, let {sp} and {sq} be their respective substructure sets; the R-convolution kernel is k_R(Gp, Gq) = Σ_sp Σ_sq δ(sp, sq), where δ(sp, sq) = 1 if the substructures sp and sq are isomorphic, and 0 otherwise.

2) Such kernels neglect substructures that are non-isomorphic but structurally similar.
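
As a minimal sketch (not from the slides), the counting form of an R-convolution kernel can be written as follows, assuming each substructure has already been reduced to a canonical string so that string equality stands in for the isomorphism test; the substructure codes in the example are hypothetical:

```python
from collections import Counter

def r_convolution_kernel(substructures_p, substructures_q):
    """Generic R-convolution kernel: count pairs of matching substructures.

    Each substructure is assumed to be given as a canonical string (e.g. a
    canonical code for a subtree or a shortest-path triple), so that string
    equality stands in for the isomorphism test delta(sp, sq)."""
    count_p = Counter(substructures_p)
    count_q = Counter(substructures_q)
    # Sum over all pairs (sp, sq) with delta(sp, sq) = 1.
    return sum(count_p[s] * count_q[s] for s in count_p if s in count_q)

# Example: the two copies of "A-B" in the first graph each match the single
# "A-B" in the second graph, giving a kernel value of 2.
print(r_convolution_kernel(["A-B", "B-C", "A-B"], ["A-B", "C-D"]))
```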

Graph Kernels from Jensen-Shannon Divergence

Classical Jensen-Shannon Divergence (JSD)

Definition: the classical JSD is a non-extensive mutual information measure between probability distributions over structured data. It is related to the Shannon entropy, and it is always well defined, symmetric, negative definite and bounded. If P and Q are two probability distributions, the JSD is D_JS(P, Q) = H((P + Q)/2) - (H(P) + H(Q))/2, where H(·) denotes the Shannon entropy.
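
As a quick worked example (illustrative, not from the slides), take P = (1, 0) and Q = (1/2, 1/2) with base-2 logarithms:

```latex
% Worked example of the classical JSD with base-2 logarithms.
% P = (1, 0), Q = (1/2, 1/2), so H(P) = 0 and H(Q) = 1.
\[
  M = \tfrac{P+Q}{2} = \left(\tfrac{3}{4}, \tfrac{1}{4}\right), \qquad
  H(M) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.811
\]
\[
  D_{JS}(P, Q) = H(M) - \tfrac{H(P) + H(Q)}{2} \approx 0.811 - 0.5 = 0.311 \ \text{bits}
\]
```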

Jensen-Shannon graph kernel [Bai and Hancock, JMIV, 2013]

For a pair of graphs Gp(Vp, Ep) and Gq(Vq, Eq), the Jensen-Shannon divergence is D_JS(Gp, Gq) = H(Gpq) - (H(Gp) + H(Gq))/2, where Gpq is a composite structure graph formed from the pair of (sub)graphs using the disjoint union or graph product, and H(·) is the graph entropy (von Neumann or Shannon).

Advantages: more efficient to compute than the R-convolution kernels. Drawbacks: a) restricted to un-attributed graphs, b) cannot reflect the interior topological information of graphs, and c) lacks pairwise correspondence information between vertices.

Tree Index Strengthening

Tree-index (TI) label strengthening

Example: each strengthened label corresponds to a subtree of height h = 2.

Each strengthened vertex label corresponds to a subtree rooted at that vertex. A pair of strengthened vertex labels correspond if their subtrees are isomorphic.

Drawbacks: the TI method leads to a rapid explosion of the label length, and strengthening a vertex label by taking the union of only the neighbouring label lists ignores the original label of the vertex.

Jensen-Shannon Diffusion Kernel (TI Method)

To overcome these problems, at each iteration h we strengthen a vertex label by taking the union of the original vertex label and its neighbouring vertex labels. A sketch of the procedure is given below.
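
A minimal Python sketch of this strengthening step (an illustrative assumption, not the paper's exact encoding): graphs are given as adjacency lists with string labels, and each expanded label is compressed back to a short dictionary code:

```python
def strengthen_labels(adjacency, labels, iterations):
    """Tree-index style label strengthening: at each iteration, replace every
    vertex label by the union of its own label and its neighbours' labels,
    then compress the expanded labels back to short codes via a dictionary."""
    labels = dict(labels)
    for _ in range(iterations):
        expanded = {
            v: (labels[v], tuple(sorted(labels[u] for u in neighbours)))
            for v, neighbours in adjacency.items()
        }
        # Compress so that the label length does not explode; in a kernel
        # computation this codebook would be shared by both graphs.
        codebook = {}
        labels = {v: codebook.setdefault(e, "L%d" % len(codebook))
                  for v, e in expanded.items()}
    return labels

# Example: a path a-b-c; vertices a and c end up with identical strengthened
# labels because their rooted subtrees are isomorphic.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(strengthen_labels(adjacency, {"a": "X", "b": "Y", "c": "X"}, iterations=1))
```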

Tree Index Jensen-Shannon Divergence

Shannon label entropy: assume the label set L = {l1, l2, …, li, …, lI} contains all the possible labels of the two graphs. The label frequency probability distribution of a graph G(V, E) is P = {p(l1), …, p(lI)}, where p(li) is the fraction of vertices in V that carry label li.

The resulting Shannon label entropy is defined as H_L(G) = -Σ_{i=1..I} p(li) log p(li).

JSD between discrete probability distributions: assume the two discrete probability distributions are P = {p1, …, pm, …, pM} and Q = {q1, …, qm, …, qM}; then the JSD between P and Q is D_JS(P, Q) = H((P + Q)/2) - (H(P) + H(Q))/2, where H(·) is the Shannon entropy.
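
These quantities are straightforward to compute; a hedged Python sketch follows (natural logarithms; the label set must be shared by both graphs so that the two distributions are aligned component-wise):

```python
import math
from collections import Counter

def label_distribution(labels, label_set):
    """Label frequency distribution of a labelled graph over a fixed, shared
    label set (so two graphs yield aligned probability vectors)."""
    counts = Counter(labels.values())
    total = len(labels)
    return [counts[l] / total for l in label_set]

def shannon_entropy(p):
    """Shannon entropy of a discrete distribution (0 log 0 taken as 0)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def jensen_shannon_divergence(p, q):
    """JSD between two aligned discrete distributions."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return shannon_entropy(m) - (shannon_entropy(p) + shannon_entropy(q)) / 2
```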

Jensen-Shannon Diffusion Kernel

The Jensen-Shannon diffusion kernel: for a pair of graphs G and G', let their label probability distributions over the shared strengthened label set be P and Q. The JSD between G and G' is D_JS(G, G') = H((P + Q)/2) - (H(P) + H(Q))/2.

The Jensen-Shannon diffusion kernel is then defined as k(G, G') = exp{-D_JS(G, G')}.

The Jensen-Shannon diffusion kernel is positive definite (pd).

Because the JSD is a symmetric, negative definite dissimilarity measure, the diffusion kernel k = exp{-D_JS(G, G')} associated with such a measure is pd.
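
Putting the pieces together, a minimal sketch of the kernel value, reusing the helper functions sketched above; the exp(-JSD) form and the single-iteration, single-pair setting are illustrative assumptions rather than the paper's full definition:

```python
import math

def js_diffusion_kernel(labels_g, labels_h):
    """Attributed Jensen-Shannon diffusion kernel between two labelled graphs,
    sketched as exp(-JSD) between their (strengthened) label distributions."""
    label_set = sorted(set(labels_g.values()) | set(labels_h.values()))
    p = label_distribution(labels_g, label_set)
    q = label_distribution(labels_h, label_set)
    return math.exp(-jensen_shannon_divergence(p, q))

# Example: identical label multisets give JSD = 0 and hence kernel value 1.0.
print(js_diffusion_kernel({"a": "X", "b": "Y"}, {"u": "Y", "v": "X"}))
```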

Advantages

The new attributed diffusion kernel overcomes some shortcomings arising in the R-convolution kernels and our previous Jensen-Shannon kernel [Bai and Hancock, 2013, JMIV]

Correspondence between the discrete probabilities: because the label distributions are defined over a shared label set, their components are aligned. There is no such correspondence information in our previous Jensen-Shannon kernel.

The Shannon label entropy represents the ambiguity of the compressed strengthened labels at iteration h. Each label corresponds to a subtree rooted at the vertex that carries it, so all the subtrees are taken into account in the computation of the new Jensen-Shannon diffusion kernel.

Identical strengthened labels correspond to the same isomorphic subtrees, so the correspondence between the probability distributions reflects the correspondence between pairs of isomorphic subtrees. The new kernel therefore reflects more of the interior topological information of the graphs.

The new kernel is not restricted to un-attributed graphs.

Experiments

Standard Graph Datasets: MUTAG, NCI1, NCI109, ENZYMES, PPIs, and PTC(MR)

Alternative state-of-the-art kernels for comparison: the Jensen-Shannon graph kernel (JSGK) [Bai and Hancock, JMIV, 2013]

The Weisfeiler-Lehman subtree kernel (WLSK) [Shervashidze et al., JMLR, 2010]

The shortest path kernel (SPGK) [Borgwardt and Kriegel, ICDM, 2005]

The graphlet count kernel with graphlet of size 3 (GCGK) [Shervashidze et al., ICML, 2009]

The backtrackless walk kernel using cycles identified by the Ihara zeta function (BRWK) [Aziz et al., TNNLS, 2013]

Experiments

Experimental results

Conclusion

We showed how to incorporate attribute information into the Jensen-Shannon graph kernel.

Based on label strengthening via tree indexing.

The strengthened labels have an information-theoretic characterisation.

Kernel proves effective on bioinformatics datasets and outperforms a number of alternatives.

Future

Hypergraphs via oriented line graphs.

Directed graphs via directed-graph entropies (Cheng et al., Phys. Rev. E, 2014).

Acknowledgments

Prof. Edwin R. Hancock is supported by a Royal Society Wolfson Research Merit Award.

We thank Prof. Karsten Borgwardt and Dr. Nino Shervashidze for providing the Matlab implementations of the various graph kernel methods, and Dr. Geng Li for providing the graph datasets.

Thank you!