Mining Coherent Dense Subgraphs across Multiple Biological Networks

Mining Coherent Dense Subgraphs across Multiple Biological Networks Vahid Mirjalili CSE 891


Transcript of Mining Coherent Dense Subgraphs across Multiple Biological Networks

Page 1: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

Mining Coherent Dense Subgraphs across Multiple Biological Networks

Vahid Mirjalili, CSE 891

Page 2: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

• Motivation:
• Finding patterns across multiple networks, to identify biological modules and predict function
• Current algorithms are too costly
• Developed a novel algorithm: CODENSE
– Scalable in the number and size of networks
– Adjustable for exact or approximate pattern mining

Page 3: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

• Clustering can detect meaningful biological modules
– e.g. a dense protein interaction sub-network may correspond to a protein complex
– A dense co-expression sub-network may represent a co-expression cluster
• Biological modules are expected to be active across multiple conditions
• One idea: aggregate all the networks and identify dense sub-graphs in the aggregated network
– Risk of false positive detection

Page 4: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

Aggregated graph: False positive in the aggregated graph

• Adding six graphs together and deleting the edges that occur fewer than 3 times gives the resulting summary graph

Page 5: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

Solution to false positives in the summary graph

• Frequent sub-graphs
• Mine the dense sub-graphs directly in each original network
• A sub-graph is frequent if it occurs multiple times in a set of graphs
• In biological networks, each gene occurs only once in a graph, so there is no isomorphism problem
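Because every gene appears at most once per network, checking whether a pattern occurs in a network reduces to a set-containment test on labeled edges. The following is a minimal sketch of that idea, not the authors' implementation; the helper names `edge` and `support` and the toy graphs are assumptions made for illustration.

```python
def edge(u, v):
    """Undirected edge between two uniquely labeled genes (order-independent)."""
    return frozenset((u, v))


def support(pattern_edges, graphs):
    """Count the graphs whose edge set contains every edge of the pattern.

    No isomorphism test is needed: node labels are unique, so a pattern
    can only match a graph in one way.
    """
    pattern = {edge(u, v) for u, v in pattern_edges}
    return sum(1 for g in graphs if pattern <= g)


# Toy data (hypothetical): three networks over genes a, b, c, d.
graphs = [
    {edge("a", "b"), edge("b", "c"), edge("a", "c")},
    {edge("a", "b"), edge("b", "c")},
    {edge("a", "b"), edge("c", "d")},
]

triangle = [("a", "b"), ("b", "c"), ("a", "c")]
path = [("a", "b"), ("b", "c")]

print(support(triangle, graphs))  # 1
print(support(path, graphs))      # 2
```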

Page 6: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

Frequent dense sub-graph

• A frequent dense sub-graph doesn’t show accurate information
– Some edges in the frequent sub-graph shown above do not occur in the original set
– It is more meaningful to divide this into two sub-graphs

Page 7: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

Coherent Dense Sub-graphs

• All edges in a coherent sub-graph should have correlated occurrences in the original graph set
• CODENSE divides the networks into 2 meta-graphs and performs clustering on these two graphs only (instead of individual networks)
– CODENSE can distinguish the two modules
– Good scalability
– Discovery of overlapping clusters

Page 8: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

Overlapping Sub-graphs

• Partition-based clustering algorithms fail to identify overlapping sub-graphs

• Mining Overlapping Dense Sub-graphs (MODES)

Page 9: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

Application
• Identify frequent co-expression clusters across multiple microarray datasets

Microarray dataset:
– Un-weighted, undirected graph
– Each node represents a gene
– Two genes are connected by an edge if they show high expression correlation
• A densely connected sub-graph corresponds to a tight co-expression cluster
• Clusters from a single microarray dataset include spurious links, and may not be homogeneous in function and regulation

Page 10: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

Problem Formulation

• A relation graph dataset $D = \{G_1, G_2, \ldots, G_n\}$ contains n simple graphs, $G_i = (V, E_i)$
– A common vertex set V is shared by the graphs
• Support(G): the number of graphs in the relation graph dataset (D) in which G occurs
• A graph is frequent if support(G) > threshold
• Summary graph: an un-weighted graph extracted from D, where an edge exists only if it occurs in more than k graphs in D
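The summary graph definition can be sketched in a few lines. This is an illustrative sketch only: the function name `summary_graph` and the six toy graphs are assumptions, chosen so that `k = 2` matches the earlier "fewer than 3 times" example.

```python
from collections import Counter


def edge(u, v):
    """Undirected edge between two labeled vertices."""
    return frozenset((u, v))


def summary_graph(graphs, k):
    """Keep an edge only if it occurs in more than k of the graphs."""
    # Each graph is a set of edges, so an edge is counted once per graph.
    counts = Counter(e for g in graphs for e in g)
    return {e for e, c in counts.items() if c > k}


# Hypothetical dataset of six graphs over the shared vertex set {a, b, c}.
graphs = [
    {edge("a", "b"), edge("a", "c")},
    {edge("a", "b")},
    {edge("a", "b")},
    {edge("b", "c")},
    {edge("b", "c")},
    {edge("b", "c")},
]

# Edges must appear in more than 2 graphs (i.e. at least 3) to survive.
summary = summary_graph(graphs, k=2)
print(sorted(sorted(e) for e in summary))  # [['a', 'b'], ['b', 'c']]
```

In this toy dataset the path a-b-c appears in the summary graph even though a-b and b-c never occur in the same network, which is the kind of false positive the slides warn about.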

Page 11: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

Problem Formulation

• Edge Support Vector $w(e)$: $w_i(e)$ is the weight of edge $e$ in graph $i$ (for an un-weighted graph it would be 0 or 1)

$w(a{-}b) = [1,1,1,0,0,0]$, $w(b{-}c) = [0,0,0,1,1,1]$
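For un-weighted graphs, the edge support vector is just a presence/absence indicator per graph. A small sketch, assuming six hypothetical graphs in which edge a-b occurs only in the first three and edge b-c only in the last three; `edge_support_vector` is an illustrative name, not one from the original work.

```python
def edge(u, v):
    """Undirected edge between two labeled vertices."""
    return frozenset((u, v))


def edge_support_vector(e, graphs):
    """Return [w_1(e), ..., w_n(e)] with w_i(e) = 1 if e is in graph i, else 0."""
    return [1 if e in g else 0 for g in graphs]


# Hypothetical relation graph dataset with n = 6 graphs.
graphs = [
    {edge("a", "b")},
    {edge("a", "b")},
    {edge("a", "b")},
    {edge("b", "c")},
    {edge("b", "c")},
    {edge("b", "c")},
]

print(edge_support_vector(edge("a", "b"), graphs))  # [1, 1, 1, 0, 0, 0]
print(edge_support_vector(edge("b", "c"), graphs))  # [0, 0, 0, 1, 1, 1]
```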

Page 12: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

• Second-Order Graph: $S = (V \times V, E_s)$, where each node represents an edge from the relation graph dataset (D) and an edge between nodes u and v exists if w(u) and w(v) are highly correlated
• For efficiency, only construct the S graph for a sub-graph of the summary graph
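One way to picture the second-order graph is to connect two first-order edges whenever their support vectors are similar enough. The sketch below uses Pearson correlation with a cutoff `min_corr`; both the similarity measure and the cutoff value are assumptions for illustration, since the slides do not fix them here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def second_order_graph(vectors, min_corr):
    """Nodes are first-order edges; link two of them if their support vectors correlate."""
    return {
        frozenset((u, v))
        for u, v in combinations(vectors, 2)
        if pearson(vectors[u], vectors[v]) >= min_corr
    }


# Hypothetical support vectors over six graphs, keyed by first-order edge.
vectors = {
    ("a", "b"): [1, 1, 1, 0, 0, 0],
    ("a", "c"): [1, 1, 1, 0, 0, 0],
    ("b", "c"): [0, 0, 0, 1, 1, 1],
    ("c", "d"): [0, 0, 0, 1, 1, 0],
}

S = second_order_graph(vectors, min_corr=0.7)
for link in sorted(sorted(pair) for pair in S):
    print(link)
```

With these toy vectors, a-b is linked to a-c and b-c is linked to c-d, while edges from the two groups stay unconnected in S.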

Page 13: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

• Coherent Graph: a sub-graph extracted from the summary graph is coherent if
– All its edges have support > k
– Its second-order graph is dense

$\forall e \in \mathrm{sub}(G):\ \mathrm{support}(e) > k$

• Graph Density:

$\mathrm{density}(G) = \frac{2m}{n(n-1)}$

m: number of edges
n: number of nodes
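The density formula, 2m / (n(n-1)), is the fraction of possible undirected edges that are present. A short worked sketch; returning 0.0 for graphs with fewer than two nodes is an assumption made here to avoid division by zero.

```python
def density(num_nodes, num_edges):
    """Density of a simple undirected graph: 2m / (n (n - 1))."""
    if num_nodes < 2:
        return 0.0  # assumed convention: no possible edges, so density 0
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


print(density(4, 6))  # 1.0  (a complete graph on 4 nodes has 6 edges)
print(density(5, 5))  # 0.5  (5 of the 10 possible edges)
```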

Page 14: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

Two facts:
• If a frequent sub-graph is dense, then it must be dense in the summary graph as well, but the reverse does not always hold
• If a sub-graph is coherent (its edges have high correlation across the dataset), then its second-order sub-graph is dense

Page 15: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks
Page 16: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

• Aggregate the graphs into a summary graph
• Eliminate infrequent edges

Page 17: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

MODES: Mining Overlapping DEnse Subgraphs

• Developed based on HCS: Highly Connected Sub-graphs
• Can efficiently identify dense sub-graphs
• Can mine overlapping sub-graphs
• Two approaches:
– Minimum cut
– Normalized cut (Shi, Malik 2000)
• Apply the normalized cut in the initial steps of the HCS algorithm, then, if the partitions are small, proceed with minimum cut

Page 18: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks
Page 19: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks


Page 20: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

CODENSE analysis

• Simplifies the identification of coherent dense sub-graphs across n graphs into mining in two special graphs: summary graph + second-order graph
• Can mine network modules
• Can mine both exact and approximate patterns (by modifying the similarity threshold)
• Can be extended to weighted graphs (using Pearson correlation instead of Euclidean distance)

Page 21: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

Experimental Study: co-expression network

• 39 yeast microarray datasets
• 6661 genes
• Calculate the Pearson correlation between the expression levels (r)
• Construct the relation graph (connectivity of two genes determined by the Pearson correlation)

$\sqrt{\frac{(n-2)\,r^2}{1-r^2}}$

n: number of measurements
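Assuming the expression on this slide is the usual t-statistic for a Pearson correlation, sqrt((n-2) r^2 / (1 - r^2)), it can be computed as below. This is a sketch under that assumption; the function name is hypothetical, and the significance cutoff used to decide whether two genes are connected is not given in the slides, so none is applied here.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t-statistic magnitude for a Pearson correlation r over n measurements."""
    if n < 3:
        raise ValueError("need at least 3 measurements")
    if abs(r) >= 1.0:
        return float("inf")  # perfect correlation: statistic is unbounded
    return sqrt((n - 2) * r * r / (1.0 - r * r))


# Example: r = 0.6 measured over n = 20 arrays.
# (20 - 2) * 0.36 / 0.64 = 10.125, and sqrt(10.125) is about 3.182.
print(round(correlation_t_statistic(0.6, 20), 3))  # 3.182
```

The same correlation counts for more when it is backed by more measurements, which is why the statistic, rather than r itself, is a natural basis for deciding connectivity across datasets of different sizes.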

Page 22: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

• Create the summary graph G, removing edges that occur fewer than 6 times across the 39 graphs
• Apply MODES to identify dense sub-graphs sub(G) with cutoff density d1
• For each sub(G), construct the second-order graph S
• Apply MODES to S to identify sub-graphs with density > d2
• Transform the edges back to vertices, and apply MODES again to identify the dense sub-graphs with density > d3

Page 23: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

Functional Module Discovery: MODES vs CODENSE

• MODES identified 366 clusters, but only 151 were functionally homogeneous (42%)
• CODENSE identified 770 clusters, of which 76% were homogeneous
• The improvement is due to the second-order graph, which eliminates edges that do not show co-occurrence across all networks

Page 24: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

• Example of MODES false positive: MODES identified 5 genes (MSF1, PHB1, CBP4, NDI1, SCO2) which are not functionally homogeneous

Functional categories shown: protein biosynthesis, replicative cell aging, mitochondrial electron transfer

Page 25: Mining Coherent Dense  Subgraphs  across Multiple  Biological Networks

Functional prediction:

• CODENSE identified this 6-node sub-graph
• 5 genes belong to the “protein biosynthesis” category
• Predict: ASC1 must be involved in protein biosynthesis as well

Test with 448 known genes: 50% accuracy