Network motifs: discovery and applications
Guy Zinman
Seminar in Bioinformatics
Technion, Spring 2005
Outline
• Theory of network motifs
  • Definition, algorithm
• Application to E. coli transcription network
  • The dynamic behavior of the motifs
• Finding active subnetworks
  • Simulated annealing
  • Experiments
Network
• Dictionary definition:
  • A group or system of (electric) components and connecting circuitry designed to function in a specific manner.
• Network is the backbone of a complex system
• Studies of networks are similar to paleontology: learning about an animal from its backbone
Network motifs
• The notion of motif, widely used for sequence analysis, is generalized to the level of networks.
• Network Motifs are defined as patterns of interconnections that recur in many different parts of a network at frequencies much higher than those found in randomized networks.
Network motifs (cont.)
Such motifs are found in networks from:
• Biochemistry
  • Transcriptional regulation networks
• Neurobiology
  • Neuron connectivity
• Ecology
  • Food webs
• Engineering
  • Electronic circuits
  • World Wide Web
Network motifs (cont.)
Schematic view of motif detection
• Occurrence of the FFL motif:
Random vs designed/evolved features
• Large networks may contain information about design principles and/or evolution of the complex system
• Which features are there for a reason:
  • design principles (e.g. feed-forward loops)
  • constraints (e.g. all nodes on the Internet must be connected to each other)
  • evolution, growth dynamics (e.g. network growth is mainly due to gene duplication)
Network motifs
• Milo R. et al.: “Network Motifs: Simple Building Blocks of Complex Networks”; Science, 2002.
• Different motifs were found in different classes of network.
• The motifs reflect the underlying processes that generate each type of network.
Motifs detected
• Two significant motifs:
Both appeared numerous times in non-homologous gene systems that perform diverse biological functions
Main tasks for detecting network motifs
There are two main tasks in detecting network motifs:
(1) generating an ensemble of proper random networks
(2) counting the subgraphs in the real network and in random networks.
The algorithm
• Starting point: graph with directed edges
• Scan for n-node subgraphs (n=3,4) and count number of occurrences
• Compare to randomized graphs
  • (the randomization preserves the in-, out- and in+out- degree of each node, so it is more constrained than a plain Erdős-Rényi random graph)
All 3-node connected subgraphs
• 13 non-isomorphic types of 3-node connected subgraphs
• There are 199 4-node subgraphs, 9364 5-node subgraphs, …
Generation of randomized network
• Algorithm A
  • Employ a Markov-chain algorithm: start with the real network and repeatedly swap randomly chosen pairs of connections (X1 => Y1, X2 => Y2 is replaced by X1 => Y2, X2 => Y1) until the network is well randomized.
  • Switching is prohibited if either of the connections X1 => Y2 or X2 => Y1 already exists.
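The switching step of Algorithm A can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `switch_randomize`, the attempt cap and the extra rule that rejects self-loops are assumptions made here.

```python
import random

def switch_randomize(edges, n_swaps, seed=0):
    """Randomize a directed edge list by repeated edge switching.

    Each accepted switch replaces X1->Y1, X2->Y2 by X1->Y2, X2->Y1,
    so every node keeps its in-degree and out-degree.
    """
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    done, attempts = 0, 0
    max_attempts = 100 * n_swaps  # assumed safety cap, not from the slides
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (x1, y1), (x2, y2) = edge_list[i], edge_list[j]
        # Assumption: do not create self-loops.
        if x1 == y2 or x2 == y1:
            continue
        # Switching is prohibited if either new connection already exists.
        if (x1, y2) in edge_set or (x2, y1) in edge_set:
            continue
        edge_set -= {(x1, y1), (x2, y2)}
        edge_set |= {(x1, y2), (x2, y1)}
        edge_list[i], edge_list[j] = (x1, y2), (x2, y1)
        done += 1
    return edge_list
```

How many switches count as "well randomized" is left to the caller; the slides do not give a number.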
Generation of randomized network
• Algorithm B
  • Each network is represented as a connectivity matrix M, such that $M_{ij} = 1$ if there is a connection directed from node i to node j, and 0 otherwise.
  • The goal is to create a randomized connectivity matrix $M^{rand}$, which has the same number of nonzero elements in each row and column as the corresponding row and column of the real connectivity matrix.
Generation of randomized network
• Row and column sums are preserved: $R_i = \sum_j M^{rand}_{ij} = \sum_j M_{ij}$ and $C_j = \sum_i M^{rand}_{ij} = \sum_i M_{ij}$.
• To generate the randomized networks, we start with an empty matrix $M^{rand}$.
• We then repeatedly randomly choose a row n according to the weights $p_i = R_i / \sum_i R_i$ and a column m according to the weights $q_j = C_j / \sum_j C_j$.
• If $M^{rand}_{nm} = 0$, we set $M^{rand}_{nm} = 1$. We then set $R_n = R_n - 1$ and $C_m = C_m - 1$.
• If the entry (n, m) was previously entered to the randomized matrix, that is, if $M^{rand}_{nm} = 1$, or if n = m, we choose a new (n, m).
• This process is repeated until all $R_i = 0$ and $C_j = 0$.
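A minimal sketch of Algorithm B, under stated assumptions: M is a square 0/1 list of lists without self-loops, and the function name `randomize_matrix` and the restart logic are illustrative additions. Drawing a free entry with weight proportional to (remaining row count) × (remaining column count) is equivalent to drawing the row and the column independently and redrawing on a rejected entry.

```python
import random

def randomize_matrix(M, seed=0, max_restarts=1000):
    """Build a random 0/1 matrix with the same row and column sums as M."""
    rng = random.Random(seed)
    n = len(M)
    row_target = [sum(row) for row in M]
    col_target = [sum(M[i][j] for i in range(n)) for j in range(n)]
    for _ in range(max_restarts):
        R, C = row_target[:], col_target[:]
        Mrand = [[0] * n for _ in range(n)]
        while sum(R) > 0:
            # Entries that may still be filled: off-diagonal, empty,
            # and with remaining capacity in both row and column.
            free = [(i, j) for i in range(n) for j in range(n)
                    if i != j and R[i] > 0 and C[j] > 0 and Mrand[i][j] == 0]
            if not free:
                break  # dead end: restart with an empty matrix (assumption)
            weights = [R[i] * C[j] for i, j in free]
            i, j = rng.choices(free, weights=weights)[0]
            Mrand[i][j] = 1
            R[i] -= 1
            C[j] -= 1
        if sum(R) == 0:
            return Mrand
    raise RuntimeError("no valid randomized matrix found")
```

The restart on a dead end is a design choice made here; the slides do not say how such cases are handled.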
Network motif detection
• For each nonzero element (i, j) of M:
  • Loop through all elements connected to it, i.e. all k with $M_{ik} = 1$, $M_{ki} = 1$, $M_{jk} = 1$ or $M_{kj} = 1$.
  • Repeat this recursively with the elements (i, k), (k, i), (j, k) and (k, j) until an n-node subgraph is obtained.
• A table is formed that counts the number of appearances of each type of subgraph in the network, correcting for the fact that multiple submatrices of M can correspond to one isomorphic architecture owing to symmetries.
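The counting table can be illustrated with a small sketch. It swaps in brute-force enumeration of node subsets for the recursive expansion described above, so it is only practical for small graphs; the names `canonical_form` and `count_subgraphs` are hypothetical. Taking the smallest adjacency code over all node orderings handles the symmetry correction.

```python
from collections import Counter
from itertools import combinations, permutations

def canonical_form(nodes, edge_set):
    """Smallest adjacency code over all node orderings (one id per isomorphism type)."""
    return min(
        tuple(int((a, b) in edge_set) for a in perm for b in perm if a != b)
        for perm in permutations(nodes)
    )

def is_connected(subset, neighbours):
    """Weak connectivity of the subgraph induced on `subset`."""
    subset = set(subset)
    start = next(iter(subset))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in neighbours[v] & subset:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == subset

def count_subgraphs(edges, n=3):
    """Count connected induced n-node subgraphs by isomorphism type."""
    edge_set = set(edges)
    nodes = sorted({v for e in edge_set for v in e})
    neighbours = {v: set() for v in nodes}
    for a, b in edge_set:
        neighbours[a].add(b)
        neighbours[b].add(a)
    counts = Counter()
    for subset in combinations(nodes, n):
        if is_connected(subset, neighbours):
            counts[canonical_form(subset, edge_set)] += 1
    return counts
```

For example, the key of the feed-forward loop is `canonical_form((0, 1, 2), {(0, 1), (0, 2), (1, 2)})`, and `count_subgraphs(edges)[key]` gives the number of node triples that form an FFL.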
Network motif detection
• This process is repeated for each of the randomized networks. The number of appearances of each type of subgraph in the random ensemble is recorded, to assess its statistical significance.
• The present concepts and algorithms are easily generalized to nondirected or directed graphs with several “colors” of edges and nodes, multipartite graphs, and so forth.
Criteria for Network Motif Selection
• The probability that it appears in a randomized network an equal or greater number of times than in the real network is smaller than P = 0.01.
Reminder: the p-value is the probability of obtaining a result at least as extreme as the given one when the tested subject is not affected by the experiment (the null hypothesis).
If the p-value < 0.01, then the null hypothesis is rejected and the subject is considered to be affected.
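The selection criterion can be written as an empirical p-value over the random ensemble. This is a sketch; the function name `empirical_p_value` is an assumption.

```python
def empirical_p_value(real_count, random_counts):
    """Fraction of randomized networks in which the subgraph appears
    at least as many times as in the real network."""
    hits = sum(1 for c in random_counts if c >= real_count)
    return hits / len(random_counts)

def is_motif(real_count, random_counts, threshold=0.01):
    """Apply the P < 0.01 criterion from the slide."""
    return empirical_p_value(real_count, random_counts) < threshold
```

With an ensemble of 1000 random networks, the smallest nonzero value this estimate can take is 0.001.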
Run time complexity
• The performance of this algorithm scales with the total number of n-node subgraphs in the network.
• The number of subgraphs and the algorithm runtime also increase dramatically for subgraphs with n ≥ 5.
Sampling method for subgraph counting
• Kashtan et al.: “Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs”; Bioinformatics, 2004.
• This algorithm samples subgraphs in order to estimate their relative frequency.
• The runtime of the algorithm asymptotically does not depend on the network size.
• Surprisingly, few samples are needed to detect network motifs reliably.
Subgraph sampling
Procedure description:
• Pick a random edge from the network and then expand the subgraph iteratively by picking random neighboring edges until the subgraph reaches n nodes.
• For each random choice of an edge, in order to pick an edge that will expand the subgraph size by one, prepare a list of all such candidate edges and then randomly choose an edge from the list.
Subgraph sampling
• Finally, the sampled subgraph is defined by the set of n nodes and all the edges that connect between these nodes in the original network.
• Finding n-node subgraphs for n ≥ 5 is now much easier.
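A sketch of one sampling step as described above. The function name `sample_subgraph` is hypothetical, and the published method additionally corrects for the fact that different subgraphs are sampled with different probabilities; that correction is omitted here.

```python
import random

def sample_subgraph(edges, n, rng):
    """Sample one connected n-node subgraph by random edge expansion.

    Returns (nodes, induced_edges), or None if the subgraph cannot grow.
    """
    edge_list = list(set(edges))
    a, b = rng.choice(edge_list)
    nodes = {a, b}
    while len(nodes) < n:
        # Candidate edges touch the current node set and add exactly one new node.
        candidates = [(u, v) for (u, v) in edge_list
                      if (u in nodes) != (v in nodes)]
        if not candidates:
            return None
        nodes.update(rng.choice(candidates))
    # The sampled subgraph keeps ALL original edges among the chosen nodes.
    induced = {(u, v) for (u, v) in edge_list if u in nodes and v in nodes}
    return frozenset(nodes), frozenset(induced)
```

Repeating this call many times and tallying the isomorphism type of each result gives an estimate of relative subgraph frequencies.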
Comparing sampling method results with exhaustive enumeration
Transcriptional Regulation Network of Escherichia coli
• Operon – a group of contiguous genes that are transcribed into a single mRNA molecule.
• The transcriptional network is represented as a directed graph: each operon represents a node and edges represent direct transcriptional interactions.
Application to E. coli
Shen-Orr S. et al.: “Network motifs in the transcriptional regulation network of Escherichia coli”; Nature Genetics, 2002.
• Database: RegulonDB contains interactions between transcription factors (TFs) and the operons they regulate
  • Contains 577 interactions, 424 operons and 116 TFs
  • 35 more TFs were added from literature
  • The previously described algorithm was run on this data (1000 random networks)
Significant motifs
Feedforward loop (FFL)
• Found in 22 different systems, 10 TFs and 40 operons
• P-value = 0.001
Concentration of FFL
Same in the yeast regulatory network
• Young et al.: “Transcriptional Regulatory Networks in Saccharomyces cerevisiae”; Science, 2002.
• Can you think of a possible role for this motif?
Dynamics for the FFL
• Mangan et al., “Structure and function of the feed-forward loop”; PNAS, 2003.
Consider Sx and Sy as input signals: small molecules that activate or inhibit the activity of X and Y.
Coherency of FFLs
• The FFL is ‘coherent’ if the direct effect of the general TF on the effector has the same sign as its net indirect effect through the specific TF.
• 85% of the FFL found were coherent.
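The coherence rule can be checked with signs. In this sketch, +1 stands for activation and -1 for repression; X is the general TF, Y the specific TF and Z the effector. The function name and the sign encoding are assumptions for illustration.

```python
def is_coherent_ffl(sign_xy, sign_yz, sign_xz):
    """True if the direct effect X->Z has the same sign as the
    indirect effect X->Y->Z (the product of the two signs)."""
    return sign_xz == sign_xy * sign_yz
```

For example, an FFL in which all three interactions are activations is coherent, while one in which X activates Y and Z but Y represses Z is incoherent.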
Significant motif
Single Input Motif (SIM)
• A single transcription factor controls a set of operons.
• All operons in a SIM are regulated with the same sign.
• Appeared in 24 different systems
Dynamics for the SIM
Significant motif
Dense Overlapping Regulon (DOR):
a layer of overlapping interactions between operons and a group of TFs, much denser than the corresponding structures in an Erdős-Rényi random graph
E. coli network
DOR detection
Briefly…
• Define a (nonmetric) distance measure between operon k and j.
• The operons were clustered.
• DORs corresponded to clusters with more than C=10 connections, with a ratio of connections to TFs greater than R=2.
mFinder
• A software tool for estimating subgraph concentrations and detecting network motifs.
• www.weizmann.ac.il/mcb/UriAlon/
Discussion
• The concept of homology between genes based on sequence motifs has been crucial for understanding the function of uncharacterized genes.
• Likewise, the notion of similarity between connectivity patterns in networks, based on network motifs, may be helpful in gaining insight into the dynamic behavior of newly identified gene circuits.
Discussion
• Until now we considered only transcription interactions specifically manifested by transcription factors that bind regulatory sites.
• This transcriptional network can be thought of as the ‘slow’ part of the cellular regulation network (time scale of minutes).
Discussion
• An additional layer of faster interactions, which include interaction between proteins (often subsecond timescale), contributes to the full regulatory behavior.
Finding active subnetworks
• Ideker T. et al.: “Discovering regulatory and signaling circuits in molecular interaction networks”; Bioinformatics, 2002.
• Integrates protein-protein and protein-DNA interactions with mRNA expression data, with the goal of better understanding the molecular mechanisms behind the observed gene expression.
• Searches the network for ‘active subnetworks’, i.e., connected sets of genes with unexpectedly high levels of differential expression under one or more perturbations.
Methodology
• Using a molecular interaction network to analyze changes in expression over 20 perturbations to the yeast galactose utilization (GAL) pathway.
• Determining which conditions significantly affected the gene expression in each active subnetwork.
The means
• Combining a rigorous statistical measure for scoring subnetworks with a search algorithm for identifying subnetworks with high score.
• To rate the biological activity of a particular subnetwork, begin with assessing the significance of differential expression for each gene.
• The error model is provided by the VERA (Variability and ERror Assessment) program.
  • VERA estimates the parameters of a statistical model using the method of maximum likelihood.
• Output: p-values (pi), representing the significance of expression change.
Basic z-score calculation
• Each $p_i$ is converted to a z-score: $z_i = \Phi^{-1}(1 - p_i)$
  • $\Phi^{-1}$ is the inverse normal CDF (cumulative distribution function)
  • Smaller p-values correspond to larger z-scores
• A z-score quantifies how far a given value x lies from the mean $\mu_x$, in units of the standard deviation $\sigma_x$: $Z = \frac{x - \mu_x}{\sigma_x}$
• Aggregate z-score for an entire subnetwork A of k genes: $z_A = \frac{1}{\sqrt{k}} \sum_{i \in A} z_i$
Notice:
• If the $z_i$ are independent standard normal variables, $z_A$ is also distributed according to the standard normal.
• Subnetworks of all sizes are comparable under this scoring system, independent of k.
• A high $z_A$ indicates a biologically active subnetwork.
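The two formulas can be sketched with the Python standard library. The function names are assumptions; p-values must lie strictly between 0 and 1 for the inverse CDF to be defined.

```python
from math import sqrt
from statistics import NormalDist

def z_from_p(p):
    """z_i = Phi^{-1}(1 - p_i): smaller p-values give larger z-scores."""
    return NormalDist().inv_cdf(1.0 - p)

def aggregate_z(z_scores):
    """z_A = (1 / sqrt(k)) * sum of z_i over the k genes of the subnetwork."""
    return sum(z_scores) / sqrt(len(z_scores))
```

Dividing by the square root of k rather than by k is what keeps the aggregate on the standard normal scale for any subnetwork size.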
Scoring of Subnetworks
Calibrating z against background distribution
• Randomly sample gene sets of size k using a Monte Carlo approach, compute their scores $z_A$, and calculate the mean $\mu_k$ and standard deviation $\sigma_k$ for each k.
• The corrected subnetwork score $S_A$ is: $S_A = \frac{z_A - \mu_k}{\sigma_k}$
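A sketch of the calibration, assuming `all_z` is the list of z-scores of every gene in the network. The function names and the number of Monte Carlo samples are illustrative choices, not values from the slides.

```python
import random
from math import sqrt
from statistics import mean, pstdev

def background_stats(all_z, k, n_samples=1000, seed=0):
    """Estimate the mean and standard deviation of z_A over random gene sets of size k."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        sample = rng.sample(all_z, k)  # k distinct genes, ignoring network structure
        scores.append(sum(sample) / sqrt(k))
    return mean(scores), pstdev(scores)

def corrected_score(z_a, mu_k, sigma_k):
    """S_A = (z_A - mu_k) / sigma_k."""
    return (z_a - mu_k) / sigma_k
```

A corrected score of 0 then means "as active as a typical random gene set of the same size".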
Scoring an example subnetwork
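[Figure: example subnetwork with gene scores $z_a$, $z_b$, $z_c$, $z_d$, the aggregate score $z_A$ and the corrected score $S_A$]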
Scoring over multiple conditions
• Starting with a matrix of p-values (genes vs. conditions) and corresponding z-scores.
• Producing m different aggregate scores, one for each condition, and sorting them.
• Finding the probability that at least j of the m conditions had scores above zA(j)
• Monte Carlo technique is used for estimating the mean and the standard deviation from random gene set of size k.
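The probability in the third bullet can be sketched as a binomial tail, assuming that under the null hypothesis the m condition scores are independent standard normal variables. The function names are hypothetical.

```python
from math import comb
from statistics import NormalDist

def prob_at_least_j(z_j, j, m):
    """Probability that at least j of m null scores exceed z_j."""
    p = 1.0 - NormalDist().cdf(z_j)  # chance that one condition exceeds z_j
    return sum(comb(m, h) * p**h * (1.0 - p)**(m - h) for h in range(j, m + 1))

def most_significant_rank(z_scores):
    """Sort the m aggregate scores and return (j, probability) for the
    rank j whose probability is smallest."""
    ordered = sorted(z_scores, reverse=True)
    m = len(ordered)
    probs = [prob_at_least_j(z, j, m) for j, z in enumerate(ordered, start=1)]
    best = min(range(m), key=probs.__getitem__)
    return best + 1, probs[best]
```

The rank returned tells how many conditions contribute to the subnetwork's activity under this simplified reading.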
Finding the maximal-scoring subnetwork
• Problem: finding the maximal-scoring connected subgraph is NP-hard.
The Difficulty in Searching Global Optima
[Figure: significance score plotted over subnetworks, with one global maximum and several local maxima. Caption: Rugged landscapes and the local maxima problem]
Monte Carlo random search
• Also known as the ‘Metropolis algorithm’
  • A simulation technique for conformational sampling and optimization based on a random search for energetically favourable conformations
• Finding a global (or at least “good” local) maximum by a biased random walk may take some luck …
[Figure: significance score plotted over subnetworks, with one global maximum and several local maxima]
Climbing mountains easier: simulated annealing
[Figure: significance score plotted over subnetworks, with one global maximum and several local maxima]
To escape a local maximum, one needs to allow locally unfavorable moves.
Introduction to simulated annealing
Simulated annealing (Kirkpatrick et al., 1983).
A mathematical method developed together with Monte Carlo techniques to avoid false maxima.
The method simulates the slow cooling of a solidifying solution to form a single crystal.
Origin: the annealing process of heated solids.
Intuition: by allowing occasional descents in the search process, we might be able to escape the trap of local maxima.
In our context:
Allow nodes to be removed from the subsets, even if the resulting subnetwork’s score is a (little) lower.
• What can be an adverse effect of this method?
Consequences of the Occasional Descents
• Desired effect: they help escape local optima.
• Adverse effect: the search might leave the global optimum after reaching it.
So the result is not guaranteed to be optimal. But here we don’t care: any high-scoring subnetwork is suspected to be biologically significant.
Climbing mountains easier: simulated annealing
• Defining a “temperature” function.
• Increasing the effective “temperature” means a higher probability of accepting unfavorable moves (moves that increase the energy, i.e. lower the score).
• Thus, the likelihood of escaping from a local maximum may be tuned.
Control of Annealing Process
Acceptance of a search step (Metropolis criterion):
• Assume the performance change in the search direction is $\Delta$.
• Accept a descending step ($\Delta < 0$) only if it passes a random test, i.e. with probability $p = e^{\Delta / T}$.
• Always accept an ascending step, i.e. $\Delta \geq 0$.
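The acceptance rule can be sketched as a single function. The name `accept_step` and the `rng` parameter (any object with a `random()` method returning a number in [0, 1)) are assumptions; the temperature must be positive.

```python
import math
import random

def accept_step(delta, temperature, rng=random):
    """Metropolis criterion for a maximization problem."""
    if delta >= 0:
        return True  # ascending (or neutral) steps are always accepted
    # Descending step: accept with probability exp(delta / T), which is below 1.
    return rng.random() < math.exp(delta / temperature)
```

At high T the exponent is close to 0 and most descending steps pass; at low T almost none do.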
Control of Annealing Process
Cooling Schedule:
T, the annealing temperature, is the parameter that controls the frequency of acceptance of descending steps.
We gradually reduce the temperature T(k) from 1 towards 0. The probability of accepting a descending step is $e^{\Delta / T}$, so it shrinks as T decreases.
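One common way to lower the temperature is a geometric schedule, sketched below. The function name and the example end temperature are assumptions; a geometric schedule can approach 0 but never reach it, so the end value must be a small positive number.

```python
def geometric_schedule(t_start, t_end, n_steps):
    """Temperatures T(0)..T(n_steps-1), each a constant factor below the previous one."""
    ratio = (t_end / t_start) ** (1.0 / (n_steps - 1))  # requires n_steps >= 2
    return [t_start * ratio**i for i in range(n_steps)]
```

For instance, `geometric_schedule(1.0, 0.01, 5)` starts at 1.0 and ends close to 0.01, multiplying by the same factor at every step.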
In our context
• Input:
  • Graph G = (V, E) of molecular interactions
  • N – number of iterations
  • $T_i$ – temperature function which decreases from $T_{start}$ to $T_{end}$
• Output:
  • $G_w$ – subgraph of G
• Initialize $G_w$ by setting each node to an ‘active/inactive’ state randomly (with p = ½).
Simulated Annealing Algorithm
• For i = 1 to N do:
  • Randomly pick a node v from V and toggle its state.
  • Compute the score $s_i$ for the working subgraph $G_w$.
  • If $s_i > s_{i-1}$, keep v toggled;
  • else keep v toggled with probability $p = e^{(s_i - s_{i-1}) / T_i}$.
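The loop above can be sketched as follows. The `score` argument is a hypothetical callable that maps the active/inactive state to a number; in the method described here it would be the score of the working subgraph. The geometric cooling and its default end temperature are assumptions.

```python
import math
import random

def anneal(nodes, score, n_iter, t_start=1.0, t_end=0.01, seed=0):
    """Simulated annealing over node on/off states (maximizes `score`)."""
    rng = random.Random(seed)
    nodes = list(nodes)
    # Initialize each node as active or inactive with probability 1/2.
    active = {v: rng.random() < 0.5 for v in nodes}
    current = score(active)
    for i in range(n_iter):
        # Assumed geometric cooling from t_start to t_end.
        temp = t_start * (t_end / t_start) ** (i / max(n_iter - 1, 1))
        v = rng.choice(nodes)
        active[v] = not active[v]           # toggle the node
        new = score(active)
        if new > current or rng.random() < math.exp((new - current) / temp):
            current = new                   # keep v toggled
        else:
            active[v] = not active[v]       # undo the toggle
    return active, current
```

A toy call could use `score = lambda state: sum(w[v] for v in state if state[v])` with a dictionary `w` of per-node weights; this ignores connectivity and is only meant to exercise the loop.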
Heuristics for improved annealing
• Look for M active subnetworks simultaneously.
  • M is a user-defined variable
• Maintaining multiple components can improve the efficiency of annealing.
• Can be done by:
  • multiple annealing runs
  • or by extending the annealing approach to maintain a graph state vector of the top M component scores.
Galactose metabolic flow
Results:
Experiment #1
A small network of 362 interactions. Expression data for 2 conditions: gal80 deletion vs. wild type (WT).
5 significant subnetworks were found, including 41 out of 77 significant genes.
Score and temperature vs. number of iterations
Temperature cooling is geometric from 1 to 0.
• N = $1 \times 10^5$
• By the end of the run, each of the 5 subnetworks reaches a (local) maximum.
Evaluation of the subnetworks
Z-score distribution with real data
Z-score distribution with random data (scrambled node z-scores)
Z-score distribution of the top 5 active networks.
Experiment #2
• Network consists of all known interactions:
  • 7145 protein-protein interactions from BIND
  • 317 regulation interactions from TRANSFAC
• Expression data includes 20 perturbations to genes in the galactose pathway.
• 7 active subnetworks found. The biggest consists of 340 genes.
• Repeating the annealing within the subnetwork above generated 5 significant sub-subnetworks.
• All results were evaluated with methods similar to what we have seen.
Results:
Discussion
Cytoscape
• www.cytoscape.org
Summary
• Theory of network motifs
  • Definition, algorithm
• Application to E. coli transcription network
  • The dynamic behavior of the motifs
• Finding active subnetworks
  • Simulated annealing
  • 2 experiments
References
• S. Shen-Orr, R. Milo, S. Mangan and U. Alon. Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics 31:64-68 (2002).
• R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii and U. Alon. Network motifs: simple building blocks of complex networks. Science 298:824-827 (2002).
• T. Ideker, O. Ozier, B. Schwikowski and A. Siegel. Discovering regulatory and signaling circuits in molecular interaction networks. Bioinformatics 18:S233 (2002).
• S. Mangan and U. Alon. Structure and function of the feed-forward loop network motif. PNAS 100:11980-11985 (2003).
• N. Kashtan, S. Itzkovitz, R. Milo and U. Alon. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20:1746-1754 (2004).
• S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi. Optimization by simulated annealing. Science 220:671-680 (1983).
Thank you