Post on 14-Jan-2016
Microarrays, Expression, and Regulatory Networks
Thanks to Prof. Mehmet Koyuturk, Case Western Reserve University.
Central Dogma
3. DNA Microarrays
A functional protein (or sometimes, RNA) that is coded by a particular gene is often called the product of that gene
Gene Expression
Gene expression is the process of synthesizing a functional gene product (protein or RNA) from a segment of DNA that specifies inheritable information (a gene)
In a multicellular organism, all cells contain identical genomes, but different genes are expressed in different types of cells
Regulation of gene expression
  Development, response to environmental signals
Studying Genome-wide Expression
The types and expression levels of the genes expressed in a cell are linked to the phenotype of the cell
  The concentration of mRNA corresponding to each gene in the genome provides a measure of gene expression
Gene expression is regulated at various stages, so mRNA concentration is not necessarily a perfect indicator of the activity of a gene (in terms of the activity of the gene product)
  Splice variants
  Post-translational modification
  Proteomics
Research Questions
Knowledge of genome-wide expression makes it possible to study fundamental questions related to gene expression
  What genes are expressed in what cell types (e.g., different tissues)?
  How does gene expression change over time (e.g., cell cycle)?
  How does a certain type of disease influence the expression of one or more genes to alter phenotype?
  How do the expression levels of different groups of genes change under different conditions?
  How do genes regulate each other’s expression?
DNA Microarray Technology
Measure the amount of mRNA (corresponding to each gene) present in a given cell in bulk (for thousands of genes)
The major tool in transcriptomics
  It is possible to measure the expression of a large number of genes (often, the entire genome) in a single sample
  Makes it possible to compare the expression levels of several genes in one sample
Large-scale application of traditional techniques
  Hybridization-based methods
Hybridization
The process of joining two complementary strands of DNA or one each of DNA and RNA to form a double-stranded molecule
Key idea in microarray technology
What is a DNA Microarray?
A DNA microarray is a slide onto which a regular pattern of spots is deposited
  Each spot contains many copies of a specified single-stranded DNA sequence (i.e., multiple biologically identical sequences)
  All sequences are chemically bonded to the surface of the slide
  There is a different DNA sequence at each spot (i.e., the sequences at different spots are biologically different)
Spots are small
  Thousands of spots fit on a single slide a few centimeters across
How do DNA Microarrays work?
The DNA sequences in spots act as probes that hybridize with complementary sequences
  If a complementary sequence exists in the sample, the corresponding DNA sequence in a spot will hybridize
Solutions extracted from tissue samples contain large numbers of mRNAs of many different types that happen to be present in the cells at the time of the experiment
  The amount of mRNA hybridized in a spot provides an estimate of the concentration of the mRNA in the sample
  Each spot targets a different type of mRNA (gene)
Measuring mRNA Concentration
The spots in which hybridization takes place can be visualized using fluorescence techniques
  Sequences in the sample are fluorescently labeled
  The DNA that hybridizes is visually identifiable as glowing spots on the array
  Spots that have nothing hybridized are not visible
  The intensity of fluorescence at each spot is proportional to the amount of the corresponding type of mRNA in the sample
DNA microarrays can detect the presence of sequences corresponding to all spots simultaneously
Types of Microarrays
1. Oligonucleotide arrays
2. cDNA arrays
Oligonucleotide Arrays
Oligo: Just a few, scanty
Oligonucleotide arrays use short DNA sequences (in the spots)
  Usually 25 nucleotides
  Several spots correspond to one gene
Oligonucleotides are synthesized in situ based on the sequences
  One base at a time
Sometimes called a chip
  Commercially manufactured by Affymetrix
Production of Oligonucleotide Arrays
Photolithography & combinatorial chemistry
Hybridization Specificity
Each oligonucleotide should hybridize to a specific gene in the organism
  Short sequences => Cross-hybridization is more probable
  There are a lot of genes in an organism with related sequences
Perfect match/Mismatch (PM/MM) probe strategy
  Two spots for each oligonucleotide
  PM: Identical to the target
  MM: Differs only at the base in the middle of the sequence
  Assumption: non-specific binding is identical for PM and MM probes
  PM-MM provides a measure of specific hybridization
Use of Oligonucleotide Arrays
cDNA Arrays
A cDNA is a DNA strand synthesized using a reverse transcriptase enzyme, which makes a DNA sequence that is complementary to an RNA template
  Reverse of what happens in transcription
  It is possible to synthesize cDNAs from mRNAs present in cells
  There are cDNA libraries that contain sequences of genes known to be expressed in particular cell types
Use cDNA sequences as probe sequences on microarrays
  Knowledge of sequence is not necessary
  Experimentally identifying a set of suitable cDNAs is sufficient
cDNA Arrays
cDNAs are quite long: 500-2000 bases
  Hybridization is much more specific
  A cDNA contains a large fraction of a gene sequence, but not necessarily the entire gene
  Generally, one spot is adequate to recognize a single gene
The process of array manufacture is less reproducible
  It is not easy to control the amount of DNA at each spot
  It is not usually possible to compare absolute intensities of spots from different slides
  Use two samples on one array!
Two-color Hybridization
Two samples
  One test sample
  One control (reference) sample
Prepare RNA extracts from each sample separately
Make cDNA from each sample using nucleotides labeled with a different color
  Reference sample: Green (Cy3)
  Test sample: Red (Cy5)
Mix the labeled populations and let the mixture hybridize with the array
  cDNAs from different samples should bind to a spot in proportion to their concentrations
Two-Color Hybridization
Red spot: The gene is expressed significantly more in the test sample
Green spot: The gene is expressed significantly more in the reference sample
Yellow spot: The gene has about the same expression level in both samples
The ratio of red intensity to green intensity provides a measure of relative expression
Use of cDNA Arrays
Use of Two-Color Hybridization
Compare cell/tissue samples
  Cells before and after an experimental perturbation
  Successive times during a temporally staged process
  Between stages of differentiation
  Mutant cell vs. wild type
How do we compare multiple samples?
  Time-course experiments
Comparing Multiple Samples
Choose a single reference sample
  Need not be related to samples being examined
  Time course experiments: Initial sample
Since the concentration of each mRNA in the reference sample is fixed, the relative expression with respect to the reference sample provides a fair comparison between all other samples
Reference sample should provide a hybridization signal for each gene (should have non-zero mRNA concentration)
  Approximation to ideal reference sample: Equal mixture of material from all samples
Oligonucleotide vs. cDNA Arrays
cDNA does not require probe design
cDNA provides higher specificity due to longer sequences of targets
  However, cDNA may contain repetitive sequences that are often found in various genes
  Techniques like PM/MM enhance specificity of oligonucleotides
cDNA arrays are more useful on a global level
  Screening steady-state mRNA expression levels
Oligonucleotide arrays are more useful when more precise analysis is required
  SNPs
Relative Expression
$R_i$: Red intensity (test sample)
$G_i$: Green intensity (reference sample)
Intensity ratio: $T_i = R_i / G_i$
  If $T_i > 1$, the gene is up-regulated in the test sample
  If $T_i < 1$, the gene is down-regulated in the test sample
  Eliminates spot-to-spot variability to a certain extent
Channel Normalization
There are millions of individual mRNA molecules in one sample
  It can be assumed that the average mass of each molecule is approximately the same
  It can be assumed that arrayed elements represent a random sampling of the genes in the organism
We use two samples of equal mass, so the total hybridization intensities should be the same
$$N_t = \frac{\sum_{i=1}^{n} R_i}{\sum_{i=1}^{n} G_i}$$
$$\hat{T}_i = \frac{1}{N_t} \cdot \frac{R_i}{G_i}$$
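The total-intensity normalization above can be sketched in a few lines. This is a minimal illustration, assuming the red and green intensities are given as plain lists of positive numbers; the function name `normalize_channels` is made up for this example.

```python
def normalize_channels(red, green):
    """Rescale intensity ratios so that both channels have equal total intensity."""
    # N_t: ratio of total red intensity to total green intensity
    n_total = sum(red) / sum(green)
    # Normalized ratio for each spot: (R_i / G_i) / N_t
    ratios = [(r / g) / n_total for r, g in zip(red, green)]
    return n_total, ratios


# Toy data: the red channel is uniformly twice as bright as the green channel
n_total, ratios = normalize_channels([200.0, 400.0, 600.0], [100.0, 200.0, 300.0])
print(n_total, ratios)
```

In this toy case the brightness difference is purely a channel effect, so every normalized ratio comes out as 1.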
log Ratio
$M_i = \log_2(R_i / G_i)$
  log-transformation makes the distribution closer to a normal distribution
  $M_i = 1$ => gene i’s expression level is doubled
  $M_i = -1$ => gene i’s expression level is halved
  $M_i = 0$ => gene i’s expression level is unchanged
Average Intensity of a Spot
$$A_i = \log_2 \sqrt{R_i G_i} = \frac{1}{2}\left(\log_2 R_i + \log_2 G_i\right)$$
log-scaled geometric average of the intensities for the test and reference samples
A measure of the overall expression of a gene
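As a small illustration, the log ratio M and the average log intensity A of each spot can be computed as follows. This is a sketch assuming strictly positive intensities; `ma_values` is a hypothetical helper name.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists for paired red/green spot intensities (all > 0)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]        # change in expression
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]  # overall expression
    return m, a


# Three toy spots: doubled, halved, and unchanged expression
M, A = ma_values([800.0, 100.0, 400.0], [400.0, 200.0, 400.0])
print(M)
print(A)
```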
Ratio/Intensity Plot
x axis: overall expression of a gene
y axis: change in expression of a gene (across samples)
Mean Relative Intensity
For a gene whose expression level has not changed, we expect that $R_i / G_i = 1$, so that $M_i = 0$
  Most genes should have unchanged expression level
  In our example, most points are below the horizontal axis
  This is likely to be because of a systematic bias, rather than suggesting that most genes are down-regulated in the experiment
Dye bias
  Efficiency of labeling in two DNA populations may be different
  Binding between DNA and probe may be affected by the dye in a systematic way
  Efficiency of detecting fluorescent signal may be different
Array Normalization
Used to minimize systematic variations in the gene expression levels of the two samples hybridized to the array, and to allow comparison of gene expression levels across multiple slides
Main assumption: After log-transformation, the distribution of relative intensity values approaches a normal distribution
Housekeeping Genes
Normalize using housekeeping genes
  A housekeeping gene is one that is assumed to be expressed at a constant level that does not change between reference and test samples
  Shift data so that we will have $M_i = 0$ for housekeeping genes
  It is not easy to find genes whose expression will surely remain unchanged
Global Normalization
Subtract the mean relative intensity over all spots from all spots so that the mean will be zero
$$\hat{M}_i = M_i - \bar{M}, \qquad \bar{M} = \frac{1}{n}\sum_{i=1}^{n} M_i$$
All these methods are global in the sense that they only change the position of the cloud of points in the M/A plot, not the shape
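Global normalization amounts to mean-centering the log ratios. A minimal sketch, assuming the M values of one array are held in a list; `global_normalize` is an illustrative name.

```python
def global_normalize(m_values):
    """Shift all log ratios so that their mean over the array is zero."""
    mean_m = sum(m_values) / len(m_values)
    return [m - mean_m for m in m_values]


# Toy array whose log ratios are biased downward by about one unit
print(global_normalize([-1.0, 0.0, -2.0]))
```

Only the position of the point cloud changes; the differences between spots are preserved.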
Self-Normalization
Dye-flip experiments
  Another way of eliminating dye bias
  Perform a second experiment in which the red and green labeling of samples is done in reverse
  Subtract $M_i$ values from each other
  Result will be twice the unbiased $M_i$ value, since the term that corresponds to bias will be canceled out
The normalized value of each spot depends only on the measured intensity ratios for that spot
  Bias is assumed to be independent in all spots
  Bias is assumed to be reproducible between arrays
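To make the cancellation concrete: if the forward array measures M + b and the dye-flipped array measures -M + b for the same spot, their difference is 2M. The sketch below assumes exactly this additive bias model, and `dye_flip_normalize` is a hypothetical name.

```python
def dye_flip_normalize(m_forward, m_reverse):
    """Combine a forward and a dye-flipped array, spot by spot.

    Assumes forward = M + bias and reverse = -M + bias for each spot,
    so (forward - reverse) / 2 recovers the unbiased M.
    """
    return [(f - r) / 2 for f, r in zip(m_forward, m_reverse)]


# Toy spots with true M = [1.0, -0.5] and a bias of 0.25 on both arrays
print(dye_flip_normalize([1.25, -0.25], [-0.75, 0.75]))
```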
Intensity-Dependent Bias
Bias may depend on the average intensity on a spot
In our example, there is an upward trend in the $M_i$ values for higher values of $A_i$
Whether a gene (in a global sense) is up- or down-regulated should not depend on its average expression level
  Fluorescence detector may be saturated at high intensity
LOWESS
LOcally WEighted Scatterplot Smoothing
  Fit a smooth curved function m(A) through the data points
  This is an estimate of bias as a function of average intensity
  Correct values as
$$\hat{M}_i = M_i - m(A_i)$$
The shift depends on the average intensity on the spot, but the function that determines the shift is global
  Neither global nor self-normalization
Normalization by LOWESS
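A simplified, standard-library-only sketch of intensity-dependent normalization follows. It fits a local linear regression with tricube weights around each spot and subtracts the fitted value from M. It omits the robustness iterations of full LOWESS, and the names `lowess_fit`, `lowess_normalize` and the `frac` parameter are assumptions made for this example.

```python
def lowess_fit(a, m, frac=0.5):
    """Estimate m(A_i) for every spot by tricube-weighted local linear regression."""
    n = len(a)
    k = max(2, int(round(frac * n)))        # number of neighbours per local fit
    fitted = []
    for x0 in a:
        dists = sorted(abs(x - x0) for x in a)
        h = dists[k - 1] or 1e-12           # bandwidth: distance to the k-th nearest spot
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, a)) / sw
        my = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def lowess_normalize(a, m, frac=0.5):
    """Subtract the fitted intensity-dependent bias from each log ratio."""
    return [mi - fi for mi, fi in zip(m, lowess_fit(a, m, frac))]


# Toy data: M drifts linearly with A, which is pure bias in this example
A = [6 + 0.5 * i for i in range(12)]
M = [0.1 * x - 1 for x in A]
print(lowess_normalize(A, M))
```

Because the toy bias is exactly linear in A, the local fits reproduce it and the corrected values are all close to zero.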
Gene Normalization
Array normalization makes arrays cross-comparable
Two identically expressed genes in terms of Cy5 intensities may end up having different log ratios
Solution: Center expression values for each gene so that each gene will have mean (or median) expression value of 0
Example (on blackboard)
Gene Expression Matrix
[Figure: gene expression matrix, with genes as rows and samples as columns]
Now, we are ready to analyze our data!
4. Gene Expression Data Analysis
Analyzing Gene Expression Data
Clustering
  How are genes related in terms of their expression under different conditions?
Differential gene expression
  Which genes are affected by change in condition, tissue, disease?
Classification (supervised analysis)
  Given expression profile for a gene, can we assign a function?
  Given the expression levels of several genes in a sample, can we characterize the type of sample (e.g., cancerous or normal)?
Regulatory network inference
  How do genes regulate each other’s expression to orchestrate cellular function?
Clustering
Group similar items together
Clustering genes based on their expression profiles
  We can measure the expression of multiple genes in multiple samples
  Genes that are functionally related should have similar expression profiles
Gene expression profile
  A vector (or a point) in multi-dimensional space, where each dimension corresponds to a sample
Clustering of multi-dimensional real-valued data is a well-studied problem
Motivating Example
Expression levels of 2,000 genes in 22 normal and 40 tumor colon tissues (Alon et al., PNAS, 1999)
Applications of Clustering
Functional annotation
  If a gene with unknown function is clustered together with genes that perform a particular function, then that gene is likely to be associated with that function
Identification of regulatory motifs
  If a group of genes are co-regulated, then it is likely that their regulation is modulated by similar transcription factors. By looking for common elements in the neighborhood of the coding sequences of genes in a cluster, we can identify regulatory motifs and their location (promoters)
Modular analysis
Gene Expression Matrix
[Figure: gene expression matrix with m genes (rows) and n samples (columns)]
Generally, m >> n
  m = O(10^3)
  n = O(10^1)
$$E = [e_{ij}], \quad 1 \le i \le m, \ 1 \le j \le n$$
Each row is an n-dimensional vector
  Expression profile
$$e_i = [e_{i1}, e_{i2}, \ldots, e_{in}]^T$$
Proximity Measures
How do we decide which genes are similar to each other?
Euclidian distance
$$\mathrm{Euclidian}(e_i, e_j) = \|e_i - e_j\|_2 = \sqrt{\sum_{k=1}^{n} (e_{ik} - e_{jk})^2}$$
Manhattan distance
$$\mathrm{Manhattan}(e_i, e_j) = \|e_i - e_j\|_1 = \sum_{k=1}^{n} |e_{ik} - e_{jk}|$$
Distance
Minkowski distance
  General version of Euclidian, Manhattan etc.
  p is a parameter
$$\mathrm{Minkowski}_p(e_i, e_j) = \|e_i - e_j\|_p = \left(\sum_{k=1}^{n} |e_{ik} - e_{jk}|^p\right)^{1/p}$$
$$\|e_i - e_j\|_\infty = \max_{1 \le k \le n} |e_{ik} - e_{jk}|$$
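The distance family can be sketched with one function: p = 1 gives Manhattan, p = 2 gives Euclidian, and the maximum coordinate difference is the limiting case for large p. The function names here are illustrative.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two expression profiles."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def max_distance(x, y):
    """Limit of the Minkowski distance as p grows: largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


e_i, e_j = [0.0, 0.0], [3.0, 4.0]
print(minkowski(e_i, e_j, 1))   # Manhattan
print(minkowski(e_i, e_j, 2))   # Euclidian
print(max_distance(e_i, e_j))
```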
Normalization
If we want to measure the distance between directions rather than absolute magnitude, it may be necessary to standardize the mean and variation of expression levels for each gene
$$\mu_i = \frac{1}{n}\sum_{k=1}^{n} e_{ik}$$
$$\sigma_i = \sqrt{\frac{1}{n}\sum_{k=1}^{n} (e_{ik} - \mu_i)^2}$$
$$e'_{ik} = \frac{e_{ik} - \mu_i}{\sigma_i}, \qquad e'_i = [e'_{i1}, e'_{i2}, \ldots, e'_{in}]^T$$
Correlation
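The similarity between the variation of two random variables
A vector is treated as a sampling of a random variable
Covariance (where $\mu_i$ is the mean and $\sigma_i$ the standard deviation of profile $e_i$)
$$\mathrm{Cov}[e_i, e_j] = \frac{1}{n}\sum_{k=1}^{n} (e_{ik} - \mu_i)(e_{jk} - \mu_j)$$
$$\mathrm{Var}[e_i] = \mathrm{Cov}[e_i, e_i] = \sigma_i^2$$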
Pearson Correlation Coefficient
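Pearson correlation coefficient
$$\mathrm{Pearson}(e_i, e_j) = \frac{\mathrm{Cov}[e_i, e_j]}{\sqrt{\mathrm{Var}[e_i]\,\mathrm{Var}[e_j]}} = \frac{\sum_{k=1}^{n} (e_{ik} - \mu_i)(e_{jk} - \mu_j)}{n\,\sigma_i\,\sigma_j}$$
Pearson correlation is equal to the cosine of the angle between (or, up to scaling, the inner product of) normalized expression profiles
Pearson correlation is normalized
$$-1 \le \mathrm{Pearson}(e_i, e_j) \le 1$$
$$\mathrm{Pearson}(e_i, e_j) = \mathrm{Pearson}(e'_i, e'_j)$$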
Euclidian Distance & Correlation
Euclidian distance (normalized) and Pearson correlation coefficient are closely related
$$\mathrm{Euclidian}(e'_i, e'_j) = \sqrt{2n\left(1 - \mathrm{Pearson}(e_i, e_j)\right)}$$
These are the two most commonly used proximity measures in gene expression data analysis
Without loss of generality, we will use $\delta_{ij} = \delta(e_i, e_j)$ to denote the distance between two expression profiles
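The relation between normalized Euclidian distance and Pearson correlation can be checked numerically. This sketch assumes profiles are standardized with the 1/n (population) variance, and the two toy profiles are made up for illustration.

```python
import math


def standardize(e):
    """Shift a profile to mean 0 and scale it to unit (population) variance."""
    n = len(e)
    mu = sum(e) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in e) / n)
    return [(v - mu) / sigma for v in e]


def pearson(x, y):
    """Pearson correlation as the scaled inner product of standardized profiles."""
    return sum(a * b for a, b in zip(standardize(x), standardize(y))) / len(x)


def euclidian(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


e_i = [1.0, 3.0, 2.0, 5.0]
e_j = [2.0, 2.5, 4.0, 4.5]
n = len(e_i)
lhs = euclidian(standardize(e_i), standardize(e_j))
rhs = math.sqrt(2 * n * (1 - pearson(e_i, e_j)))
print(lhs, rhs)  # the two values agree up to rounding
```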
Other Measures of correlation
Pearson is vulnerable to outliers
  If two genes have very high expression in a single sample, that sample might dominate and make the two expression profiles appear highly correlated
  Jackknife correlation: Estimate n correlations by taking each dimension (sample) out, take the minimum among them
Pearson is not robust for non-Gaussian distributions
  Spearman’s rank order correlation coefficient: Rank expression levels, replace each expression level with its rank
  More robust against outliers
  A lot of loss of information
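Both alternatives can be sketched on top of a plain Pearson implementation. The code below is an illustration only: ties in the ranking are given their average rank (an assumption, since the slide does not say how ties are handled), and the toy profiles are invented to show the outlier effect.

```python
import math


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def jackknife(x, y):
    """Minimum Pearson correlation over all leave-one-sample-out profiles."""
    return min(pearson(x[:k] + x[k + 1:], y[:k] + y[k + 1:]) for k in range(len(x)))


def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            r[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return r


def spearman(x, y):
    """Pearson correlation computed on ranks instead of raw expression levels."""
    return pearson(ranks(x), ranks(y))


# Two anti-correlated profiles that share one extreme sample
x = [1.0, 2.0, 1.0, 2.0, 50.0]
y = [2.0, 1.0, 2.0, 1.0, 50.0]
print(pearson(x, y), jackknife(x, y), spearman(x, y))
```

The single shared high value pushes Pearson close to 1, while the jackknife estimate reveals the anti-correlation of the remaining samples.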
Clustering Methods
Hierarchical clustering
  Group genes into a tree (a.k.a. dendrogram), so that each branch of the tree corresponds to a cluster
  Higher branches correspond to coarser clusters
Partitioning
  Partition genes into several groups so that similar genes will be in the same partition
Hierarchical clustering
Direction of clustering
  Bottom-up (agglomerative): Start from individual genes, join them into groups until only one group is left
  Top-down (divisive): Start with one group consisting of all genes, keep partitioning groups until each group contains exactly one gene
  Agglomerative clustering is computationally less expensive
    Why?
Hierarchical clustering methods are greedy
  Once a decision is made, it cannot be undone
Agglomerative clustering
Start with m clusters: Each cluster contains one gene
At each step, choose two clusters that are closest (or most correlated), merge them
How do we evaluate the distance between two clusters?
  Single-linkage: If clusters contain two very close genes, then the clusters are close to each other
$$\delta(C_k, C_l) = \min_{i \in C_k,\, j \in C_l} \delta_{ij}$$
Agglomerative Clustering
Complete linkage: Two clusters are close to each other only if all genes inside them are close to each other
$$\delta(C_k, C_l) = \max_{i \in C_k,\, j \in C_l} \delta_{ij}$$
Group average: Two clusters are close to each other if their genes are close to each other on average
$$\delta(C_k, C_l) = \frac{1}{|C_k|\,|C_l|} \sum_{i \in C_k} \sum_{j \in C_l} \delta_{ij}$$
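A naive sketch of agglomerative clustering with the three linkage rules is given below. It assumes Euclidian distance between profiles and stops when a requested number of clusters remains, rather than building the full tree. The function names and the `method` strings are chosen for this example.

```python
import math


def euclidian(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def linkage_distance(ck, cl, profiles, method):
    """Distance between two clusters (lists of gene indices) under a linkage rule."""
    d = [euclidian(profiles[i], profiles[j]) for i in ck for j in cl]
    if method == "single":
        return min(d)
    if method == "complete":
        return max(d)
    return sum(d) / len(d)  # group average


def agglomerate(profiles, n_clusters, method="average"):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = linkage_distance(clusters[a], clusters[b], profiles, method)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # greedy merge, never undone
        del clusters[b]
    return clusters


profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
print(agglomerate(profiles, 2, method="single"))
```

Recomputing all pairwise cluster distances at every step keeps the sketch short but is far slower than what dedicated libraries do.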
Divisive Clustering
Recursive bipartitioning
  Find an “optimal” partitioning of the genes into two clusters
  Recursively work on each partition
  Since the number of clusters is an issue for partitioning-based clustering algorithms, the magic number 2 solves a lot of problems
May be computationally expensive
  The problem is “global”
  At every level of the tree, we have to work on all of the genes
  If the tree is imbalanced, there might be as many as m levels
With a reasonable stopping criterion, may be considered a partition-based clustering as well
Partition Based Clustering
Find groups of genes such that genes in each group are similar to each other, while being somewhat less similar to those in other clusters
Easily interpretable
  Especially for large datasets (as compared to hierarchical)
Number of Clusters
Clustering is “unsupervised”, so generally we do not have prior knowledge on how many clusters underlie the data
  It is very difficult to partition data into an “unknown” number of clusters
Most algorithms assume that K (number of clusters) is known
Try different values of K, find the one that results in best clustering
  Very expensive
Overlapping vs. Disjoint Clusters
Genes do not have a single function
  Most genes might be involved in different processes, so their expression profiles might demonstrate similarities with different genes in different contexts
  Can we allow a gene to be included in more than one cluster?
Allowing overlaps between clusters poses additional challenges
  To what extent do we allow overlaps? (We definitely don’t want to identify two identical clusters)
Fuzzy Clustering
Assign weights to each gene-cluster pair, showing the extent (or likelihood) of the gene belonging to the cluster
  Difficult interpretation
  Partitioning is a special case of fuzzy clustering, where the weights are restricted to binary values
  Hierarchical clustering is also “fuzzy” in some sense
  Continuous relaxation might alleviate computational complexity as well
K-Means Clustering
The most famous clustering algorithm
Given K, find K disjoint clusters such that the total intracluster variation is minimized
Cluster mean:
$$\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} e_i$$
Intracluster variation:
$$\Delta_k = \sum_{i \in C_k} \delta(e_i, \mu_k)$$
Total intracluster variation:
$$\Delta = \sum_{k=1}^{K} \Delta_k$$
K-Means Algorithm
K-Means is an iterative algorithm that alternately updates cluster assignments and cluster centers based on each other’s values until no improvement is possible
1. Choose K expression profiles randomly, designate each of them as the center of one of the K clusters
2. Assign each gene to a cluster
  2.1. Each gene is assigned to the cluster whose center is closest to its profile
3. Redetermine cluster centers
4. If any gene was moved, go back to Step 2, else stop
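The four steps above can be sketched as follows. This is a minimal illustration, assuming squared Euclidian distance, a fixed random seed for reproducibility, and that a cluster that becomes empty simply keeps its previous center; none of these choices are specified on the slide.

```python
import random


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean_profile(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def k_means(profiles, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick K profiles at random as the initial cluster centers
    centers = [list(p) for p in rng.sample(profiles, k)]
    assignment = [None] * len(profiles)
    for _ in range(max_iter):
        # Step 2: assign each gene to the cluster with the closest center
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in profiles
        ]
        # Step 4: stop when no gene moved
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:
                centers[c] = mean_profile(members)
    return assignment, centers


profiles = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
assignment, centers = k_means(profiles, k=2)
print(assignment)
```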
Sample Run of K-Means
Self Organizing Maps
Just like K-means, we have K clusters, but this time they are organized into a map
  Often a 2D grid
  We want to organize clusters so that similar clusters will be in proximity in the map
  A way of visualizing in low-dimensional (2D) space
Just like K-means, each cluster is associated with a weight vector
  It was the cluster center in K-means
Each weight vector is first initialized randomly to some gene’s expression profile
SOM Algorithm
At each step, a gene is selected at random
The distance between the gene’s expression profile and each cluster’s weight vector is calculated, and the cluster with the closest weight vector becomes the winner
The winner’s and its neighbors’ (according to the 2D mapping) weight vectors are adjusted to represent the gene’s expression profile better
$$w_k(t+1) = w_k(t) + \alpha(t)\,\theta(C_k, C_j)\,\left(e_i - w_k(t)\right)$$
  $C_j$ is the winner cluster for gene i at time t
  α is a decreasing function of time, θ is the neighborhood function
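A compact sketch of the update rule on a 2D grid follows. The slide does not fix the form of α or θ, so this example assumes a linearly decaying learning rate and a Gaussian neighborhood whose radius shrinks over time; `som_train` and all of its parameters are hypothetical.

```python
import math
import random


def som_train(profiles, grid_w, grid_h, steps=500, alpha0=0.5, radius0=1.0, seed=0):
    rng = random.Random(seed)
    nodes = [(x, y) for x in range(grid_w) for y in range(grid_h)]
    # Each weight vector starts as a randomly chosen gene's expression profile
    weights = [list(rng.choice(profiles)) for _ in nodes]
    for t in range(steps):
        frac = 1.0 - t / steps
        alpha = alpha0 * frac                  # assumed: linearly decreasing alpha(t)
        radius = max(radius0 * frac, 1e-3)     # assumed: shrinking neighborhood radius
        e = rng.choice(profiles)               # pick a gene at random
        winner = min(
            range(len(nodes)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(e, weights[k])),
        )
        for k, (gx, gy) in enumerate(nodes):
            grid_d2 = (gx - nodes[winner][0]) ** 2 + (gy - nodes[winner][1]) ** 2
            theta = math.exp(-grid_d2 / (2 * radius ** 2))  # assumed Gaussian theta
            # w_k(t+1) = w_k(t) + alpha(t) * theta * (e_i - w_k(t))
            weights[k] = [w + alpha * theta * (x - w) for w, x in zip(weights[k], e)]
    return nodes, weights


profiles = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
nodes, weights = som_train(profiles, grid_w=2, grid_h=2)
print(weights)
```

Since every update moves a weight vector part of the way toward a data point, the weights always stay within the range spanned by the data.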
Sample SOM Output
Gene Co-expression Network
Nodes represent genes
Weighted edges between nodes represent proximity (correlation) between genes’ expression profiles
This is indeed a way of predicting interactions between genes
Graph Theoretical Clustering
Partition the graph into heavy subgraphs
  Maximize total weight (number of edges) inside a cluster
  Minimize total weight (number of edges) between clusters
Heuristic algorithms
  CLICK: Recursive min-cut
  CAST: Iterative improvement one by one for each cluster
Loss of information?
Model Based Clustering
Generating model
  Each cluster is associated with a distribution (that generates expression profiles for associated genes) specified by model parameters
  The probability that a gene belongs to a cluster is specified by hidden parameters
Expectation Maximization (EM) algorithm
  Start with a guess of model parameters
  E-step: Compute expected values of hidden parameters based on model parameters
  M-step: Based on hidden parameters, estimate model parameters to maximize the likelihood of observing the data at hand, iterate
K-means is a special case
Evaluation of Clusters
In general, we want to maximize intra-cluster similarity, while minimizing inter-cluster similarity
Homogeneity, separation
  Based on the proximity metric
Reference partition
  Information on “true clusters” that comes from a different source (apart from expression data)
  Molecular annotation (e.g., Gene Ontology)
  Jaccard coefficient, sensitivity, specificity
Cluster annotation
  Processes that are significantly enriched in a cluster
Homogeneity & Separation
Heterogeneity (or homogeneity in reverse direction)
  How similar are the genes in one cluster?
$$H(C_k) = \frac{2}{|C_k|\,(|C_k| - 1)} \sum_{i, j \in C_k,\ i < j} \delta_{ij}$$
Separation
  How dissimilar are different clusters?
$$S(C_k, C_l) = \frac{1}{|C_k|\,|C_l|} \sum_{i \in C_k} \sum_{j \in C_l} \delta_{ij}$$
Good clustering: low heterogeneity, high separation (when $\delta_{ij}$ is a distance)
Overall Quality
Overall heterogeneity
$$H = \frac{1}{m} \sum_{C_k} |C_k|\, H(C_k)$$
Overall separation
$$S = \frac{1}{\sum_{C_k \ne C_l} |C_k|\,|C_l|} \sum_{C_k \ne C_l} |C_k|\,|C_l|\, S(C_k, C_l)$$
How do these change with respect to the number of clusters?
  Can we optimize these values to choose the best number of clusters?
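The size-weighted averages can be computed directly from a clustering. The sketch below assumes Euclidian distance as the proximity measure and treats the heterogeneity of a single-gene cluster as 0, which the formulas leave undefined; function names are illustrative.

```python
import math
from itertools import combinations


def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def heterogeneity(cluster, profiles):
    """Average pairwise distance inside one cluster (0 for a singleton, by assumption)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(dist(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)


def separation(ck, cl, profiles):
    """Average distance between genes of two different clusters."""
    total = sum(dist(profiles[i], profiles[j]) for i in ck for j in cl)
    return total / (len(ck) * len(cl))


def overall_quality(clusters, profiles):
    """Size-weighted overall heterogeneity H and overall separation S."""
    m = sum(len(c) for c in clusters)
    h = sum(len(c) * heterogeneity(c, profiles) for c in clusters) / m
    num = den = 0.0
    for ck, cl in combinations(clusters, 2):
        w = len(ck) * len(cl)
        num += w * separation(ck, cl, profiles)
        den += w
    return h, (num / den if den else 0.0)


profiles = [[0.0], [1.0], [10.0], [11.0]]
print(overall_quality([[0, 1], [2, 3]], profiles))   # tight, well separated clusters
print(overall_quality([[0, 2], [1, 3]], profiles))   # a poor clustering of the same genes
```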
Bayesian Information Criterion
A statistical criterion for evaluating a model
  Penalizes model complexity (number of free parameters to be estimated)
k is the number of free parameters in the model, which increases with the number of clusters
RSS is the “total error” in the model
Trade off the number of clusters against the optimization function to choose the best number of clusters
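One commonly used form of the criterion in terms of RSS is shown below. It assumes independent Gaussian errors, and $N$ (the number of observed data points) is a symbol introduced here for illustration; lower values indicate a better trade-off.

```latex
\mathrm{BIC} = N \ln\!\left(\frac{\mathrm{RSS}}{N}\right) + k \ln N
```

The first term decreases as more clusters reduce the total error, while the second term grows with the number of free parameters k.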
Reference Partitioning
If there is information about “ground truth” from an independent source, we can compare our clustering to such reference partitioning
Pairwise assessment
  Let $C_{ij} = 1$ if gene i and gene j are assigned to the same cluster by the clustering algorithm, 0 otherwise
  Let $R_{ij} = 1$ if gene i and gene j are in the same cluster according to the reference partition, 0 otherwise
$$n_{11} = |\{(i,j) : C_{ij} = 1,\ R_{ij} = 1\}|, \qquad n_{00} = |\{(i,j) : C_{ij} = 0,\ R_{ij} = 0\}|$$
$$n_{10} = |\{(i,j) : C_{ij} = 1,\ R_{ij} = 0\}|, \qquad n_{01} = |\{(i,j) : C_{ij} = 0,\ R_{ij} = 1\}|$$
Comparing Partitions
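Rand index (symmetric)
$$\mathrm{Rand} = \frac{n_{11} + n_{00}}{n_{11} + n_{00} + n_{10} + n_{01}}$$
Jaccard coefficient (sparse)
$$\mathrm{Jaccard} = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}$$
Minkowski measure (sparse)
$$\mathrm{Minkowski} = \sqrt{\frac{n_{10} + n_{01}}{n_{11} + n_{01}}}$$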
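A short sketch of the pairwise comparison: count gene pairs by whether the clustering and the reference put them together, then compute the three scores. It assumes each partition is given as a list of cluster labels per gene, with the first index of each count referring to the clustering and the second to the reference; the helper names are made up.

```python
import math
from itertools import combinations


def pair_counts(clustering, reference):
    """Count gene pairs by (together in clustering?, together in reference?)."""
    n = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for i, j in combinations(range(len(clustering)), 2):
        c = int(clustering[i] == clustering[j])
        r = int(reference[i] == reference[j])
        n[(c, r)] += 1
    return n


def compare_partitions(clustering, reference):
    n = pair_counts(clustering, reference)
    n11, n10, n01, n00 = n[(1, 1)], n[(1, 0)], n[(0, 1)], n[(0, 0)]
    rand = (n11 + n00) / (n11 + n00 + n10 + n01)
    jaccard = n11 / (n11 + n10 + n01)
    minkowski = math.sqrt((n10 + n01) / (n11 + n01))
    return rand, jaccard, minkowski


# Four genes: cluster labels from the algorithm and from a reference partition
print(compare_partitions([0, 0, 1, 1], [0, 0, 0, 1]))
```

Higher Rand and Jaccard values mean better agreement, whereas a lower Minkowski measure is better.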
Cluster Annotation
Clustering results in groups of genes that are co-expressed (or co-regulated)
  For each group, can we tell something about the biological phenomena that underlie our observation (their co-expression)?
We have partial knowledge on the function of many individual genes
  Gene Ontology, COG (Clusters of Orthologous Groups), PFAM (Protein Domain Families)
Taking a statistical approach, we can assign function to each group of genes
  A function popular in a cluster is associated with that cluster
Gene Ontology
Ontology: Study of being (e.g., conceptualization)
Gene Ontology is an attempt to develop a standardized library of cellular function
  Unified view of life: Processes, structures, and functions recur in diverse organisms
Three concepts of Gene Ontology
  Biological process: A recognized series of events or molecular functions (e.g., cell cycle, development, metabolism)
  Molecular function: What does a gene’s product do? (e.g., binding, enzyme activity, receptor activity)
  Cellular component: Localization within the cell (e.g., membrane, nucleus, ubiquitin ligase complex)
Hierarchy in Gene Ontology
Gene Ontology is hierarchical
  A process might have subprocesses
    Seed maturation is part of seed development
  A process might be described at different levels of detail
    Seed dormancy is a(n example of) seed maturation
  Same for function and component
Gene Ontology terms are related to each other via “is a” and “part of” relationships
  If process A is part of process B, then A is B’s child (B is A’s parent); B involves A
  If function C is a function D, then C is D’s child; C is a more detailed specification of D
GO Hierarchy is a DAG
Gene Ontology is hierarchical, but the hierarchy is not represented by a tree; it is represented by a directed acyclic graph (DAG)
  A GO term can have multiple parents (and obviously a GO term might (should?) have multiple children)
Annotation
GO-based annotation assigns GO terms to a gene
  A gene might have multiple functions, can be involved in multiple processes
  Multiple genes might be associated with the same function, multiple genes take part in a process
True-path rule
  If a gene is annotated with a term, then it is also annotated by its parents (consequently, all ancestors)
  How does the number of genes associated with each term change as we go down on the GO DAG?
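The true-path rule can be sketched as a closure over the parent links of the DAG. The tiny term graph and the gene names below are toy data invented for illustration; they are not taken from the actual Gene Ontology files.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links in the DAG."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents):
    """Apply the true-path rule: add every ancestor of each annotated term."""
    full = {}
    for gene, terms in annotations.items():
        closed = set(terms)
        for term in terms:
            closed |= ancestors(term, parents)
        full[gene] = closed
    return full


# Toy DAG (child -> list of parents); a term may have several parents
parents = {
    "seed dormancy": ["seed maturation"],
    "seed maturation": ["seed development"],
}
annotations = {"geneA": {"seed dormancy"}, "geneB": {"seed development"}}
full = propagate(annotations, parents)
print(full)
```

After propagation, terms near the root collect at least as many genes as any of their descendants.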
GO Annotation of Gene Clusters
There are |C| genes in a cluster C
|T| genes are associated with GO term t
|C ∩ T| genes are in C and are associated with t
What is the association between cluster C and term t?
  If we chose random clusters, would we be able to observe that at least this many (|C ∩ T|) of the |C| genes in C are associated with t?
  What is the probability of this observation?
Statistical significance based on hypergeometric distribution
Hypergeometric Distribution
We have n items, m of which are good
If we choose r items from the entire set of items at random, what is the probability that at least k of them will be good?
$$p = P[K \ge k] = \sum_{i=k}^{\min(m, r)} \frac{\binom{m}{i}\binom{n-m}{r-i}}{\binom{n}{r}}$$
n is the number of genes in the organism
m=|T|, r=|C|, k=|C ∩ T|
The lower p is, the more likely that there is an underlying association between the term and the cluster (the term is significantly enriched in the cluster)
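The tail probability can be computed exactly with integer binomial coefficients. A minimal sketch using `math.comb` (available from Python 3.8); the function name `enrichment_p_value` and the numbers in the example are illustrative.

```python
from math import comb


def enrichment_p_value(n, m, r, k):
    """P[K >= k] when r of n genes are drawn at random and m genes carry the term."""
    tail = sum(comb(m, i) * comb(n - m, r - i) for i in range(k, min(m, r) + 1))
    return tail / comb(n, r)


# Toy numbers: 1000 genes, 40 annotated with term t, cluster of 20 genes, 5 of them annotated
print(enrichment_p_value(n=1000, m=40, r=20, k=5))
```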
GO Hierarchy & Cluster Annotation
How specific (general) is the annotation we attach to a cluster?
  If a cluster is larger, then it might correspond to a more general process
  Some processes might be over-represented in the study set
  How do we find the best location of a cluster in GO hierarchy?
Parent-child annotation
  Condition the probability of enrichment of a term in a cluster on the enrichment of its parent terms in the cluster
  The gene space is defined as the set of genes that are associated with t’s parents
Parent-Child Annotation
Multiple Hypotheses Testing
The p-value for a single term provides an estimate of the probability of having the observed number of genes attached to that particular term
  We have many terms; even if the likelihood of enrichment is small for a particular term, it might be very probable that one term will be enriched as much as observed in the cluster
  We have to account for all hypotheses being tested simultaneously
Bonferroni correction: Apply union rule (with N terms tested, the chance that at least one of them reaches p-value p is at most N·p)
Which terms should we consider while correcting for multiple hypotheses for a single term?
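In practice the Bonferroni correction is often applied by multiplying each p-value by the number of tests and capping the result at 1. A minimal sketch, assuming all tested terms are counted; `bonferroni` is an illustrative name.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Three terms tested against the same cluster
print(bonferroni([0.01, 0.04, 0.5]))
```

The correction is conservative, which matters when terms are strongly dependent, as ancestor and descendant terms in GO are.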
Representativity of Terms
How well does a significantly enriched term represent a cluster?
  How many of the genes in the cluster are attached to the term?
  How many of the genes attached to the term are in the cluster?
For term t that is significantly enriched in cluster C
  Specificity: |C ∩ T|/|C|, a.k.a. precision
  Sensitivity: |C ∩ T|/|T|, a.k.a. recall
Biclustering
A particular process might be active in certain conditions
  A group of genes might be expressed (or up-regulated, suppressed, co-regulated, etc.) in only a subset of samples
  They might behave almost independently under other conditions
Clustering vs. Biclustering
Clustering is a global approach
  Each gene is a point in the space defined by all samples
  How about points that are clustered in a subspace?
Biclustering: While clustering genes, also choose a set of dimensions (samples) that provides best clustering and vice versa
  a.k.a. co-clustering, subspace clustering…
  This is a much harder problem, because you are not only trying to find groups of points that are close to each other in multi-dimensional space, but also trying to identify a subspace in which groups are more evident
Biclustering Applications
Sample/tissue classification for diagnosis
The samples with leukemia show specific characteristics for a subset of genes
Identification of co-regulated genes
Certain sets of genes exhibit coherent activation under specific conditions (while behaving more or less arbitrarily with respect to each other under other conditions)
Functional annotation
Biological processes and functional classes are overlapping
Different sets of samples reveal different functional relationships
Biclustering Principles
A cluster of genes is defined with respect to a cluster of samples and vice versa
The clusters are not necessarily exclusive or exhaustive
A gene/condition may belong to more than one cluster
A gene/condition may not belong to any cluster at all
Biclusters are not “perfect”
Noise
Statistical inference becomes particularly important
Biclustering Formulation
Given a gene expression matrix A with gene set G and sample set S, a bicluster is defined by a subset of genes I and a subset of samples J
General idea: A bicluster is a “good” one if A_IJ, the submatrix defined by I and J, has some coherence (low variance, low rank, similar ordering of rows, etc.)
The biclustering problem can be defined as one of finding a single bicluster in the entire gene expression matrix, or as one of extracting all biclusters (with some restriction on the relationship between biclusters)
Coherence of a Submatrix
Distribution of Biclusters
Bipartite Graph Model
Just like symmetric matrices, which can be modeled as arbitrary graphs, rectangular matrices can be modeled using bipartite graphs
With proper definition of edge weights, biclustering can be posed as the problem of finding “heavy” subgraphs
Row, Column, Matrix Means
Objective Function
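Low-variance (constant) bicluster
Ideal bicluster: $a_{ij} = a_{IJ}$ for all $i \in I$, $j \in J$
Minimize bicluster variance: $\sum_{i \in I, j \in J} (a_{ij} - a_{IJ})^2$
Low-rank (constant row, constant column, coherent values) bicluster
Ideal constant row: $a_{ij} = a_{iJ}$
Ideal constant column: $a_{ij} = a_{Ij}$
General rank-one bicluster: $a_{ij} = a_{iJ} + a_{Ij} - a_{IJ}$
Define residue for each value: $r_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$
Minimize mean squared residue: $H(I,J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} r_{ij}^2$
Here $a_{iJ}$, $a_{Ij}$, and $a_{IJ}$ denote the row, column, and bicluster means of A_IJ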
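The mean squared residue can be sketched as follows. The formula used is the common Cheng and Church form, residue = value - row mean - column mean + bicluster mean, which is an assumption here because the slide's own equations did not survive extraction. The function name is hypothetical.

```python
def mean_squared_residue(A, rows, cols):
    """Mean squared residue of the bicluster (rows, cols) of matrix A.

    A is a list of lists; rows and cols are lists of indices.
    """
    row_mean = {i: sum(A[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(A[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    total = 0.0
    for i in rows:
        for j in cols:
            residue = A[i][j] - row_mean[i] - col_mean[j] + all_mean
            total += residue * residue
    return total / (len(rows) * len(cols))
```

A submatrix whose rows differ only by an additive shift scores (numerically) zero, which is why this score rewards coherent rather than merely constant values.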
Missing Values
Not all expression levels are available for each gene/sample pair
A solution is to replace missing values (random values, gene mean, sample mean, regression)
Generalize the definitions of row, column, and bicluster means to handle missing values implicitly
Occupancy threshold: A bicluster is one with an adequate number of (non-missing) values in each row and column
Overlapping Biclusters
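The expression of a gene in one sample may be thought of as a superposition of contributions from multiple biclusters
Plaid model: $a_{ij} = \sum_{k} \theta_{ijk} \rho_{ik} \kappa_{jk}$
$\theta_{ijk}$: contribution of bicluster k to the expression value of the ith gene in the jth sample
$\rho_{ik}$ and $\kappa_{jk}$ (generally binary) specify the membership of row i and column j in the kth bicluster, respectively
Minimize $\sum_{i,j} \left( a_{ij} - \sum_{k} \theta_{ijk} \rho_{ik} \kappa_{jk} \right)^2$
$\theta_{ijk}$ is defined to reflect “bicluster type”: $\mu_k$, $\mu_k + \alpha_{ik}$, $\mu_k + \beta_{jk}$, $\mu_k + \alpha_{ik} + \beta_{jk}$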
Discrete Coherence
A bicluster is defined to be one with coherent ordering of the values on rows and/or columns (as compared to the values themselves)
Order-preserving submatrix (OPSM)
A submatrix is order preserving if there is an ordering of its columns such that the sequence of values in every row is increasing
Gene expression motifs (xMOTIFs)
The expression level of a gene is conserved across a subset of conditions if the gene is in the same “state” in each of the conditions
An xMOTIF is a subset of genes that are simultaneously conserved across a subset of samples
Binary Biclusters
Quantize gene expression matrix to binary values
SAMBA: A 1 corresponds to a significant change in the expression value
PROXIMUS: A 1 means that the gene is “expressed” in the corresponding sample
A bicluster is a “dense submatrix”, i.e., one with significantly more 1’s than one would expect
Bipartite graph model: Bicliques, heavy subgraphs
It is possible to statistically quantify the density of a submatrix
Log-likelihood:
p-value:
Biclustering Algorithms
Enumeration
Go for it!
Greedy algorithms
Make a locally optimal choice at every step
Divide and conquer
Solve the problem recursively
Alternating iterative heuristics
Fix one dimension, solve for the other, alternate iteratively
Model based
Parameter estimation, e.g., EM algorithm
Enumerating Biclusters
m rows, n columns in the matrix
2^m × 2^n possible biclusters in total
Not doable in realistic amounts of time
Is it really necessary?
Put some restriction on the size of biclusters
SAMBA models the problem as one of finding heavy subgraphs in a bipartite graph
Key assumption is sparsity: Nodes of the bipartite graph have bounded degree
Find K heavy bipartite subgraphs (biclusters) with bounded-degree enumeration
Refine them to optimize overlap and add/remove nodes that improve bicluster quality
Greedy Algorithms
Basic idea: Refine existing biclusters by adding/removing genes/samples to improve the objective function
Generally, quite fast
How to choose initial biclusters?
How to jump over bad local optima? (Global awareness, hill-climbing)
Optimization function: mean squared residue
Node deletion: Start with a large bicluster, keep removing the genes/samples that contribute most to the total residue
Node addition: Start with a small bicluster, keep adding the genes/samples that contribute least to the total residue
Repeat these alternately to improve global awareness
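The node-deletion step can be sketched as below. This is a simplified single-node-deletion loop in the spirit of Cheng and Church, not necessarily the exact procedure the slides refer to; the threshold `delta` and both function names are assumptions for illustration.

```python
def msr_parts(A, rows, cols):
    """Return (mean squared residue, per-row score, per-column score)."""
    rmean = {i: sum(A[i][j] for j in cols) / len(cols) for i in rows}
    cmean = {j: sum(A[i][j] for i in rows) / len(rows) for j in cols}
    mean = sum(rmean.values()) / len(rows)
    sq = {(i, j): (A[i][j] - rmean[i] - cmean[j] + mean) ** 2
          for i in rows for j in cols}
    row_score = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    col_score = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    return sum(sq.values()) / len(sq), row_score, col_score

def greedy_delete(A, delta):
    """Start from the whole matrix; repeatedly drop the row or column that
    contributes most to the residue, until the score is at most delta."""
    rows = list(range(len(A)))
    cols = list(range(len(A[0])))
    while len(rows) > 1 and len(cols) > 1:
        msr, row_score, col_score = msr_parts(A, rows, cols)
        if msr <= delta:
            break
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        if row_score[worst_row] >= col_score[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols
```

Because each step is locally optimal, the result depends on the starting bicluster; a node-addition pass would follow the same pattern with the comparison reversed.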
Finding All Biclusters
If biclusters are identified one by one, we should make sure that we do not identify the same bicluster again and again
Masking discovered biclusters: Fill the bicluster with random values
First identify disjoint biclusters, then grow them to capture overlaps
Flexible Overlapped Biclustering (FLOC)
Generate K initial biclusters
Make decisions from the gene/sample perspective (as compared to the bicluster perspective): Choose the best (maximum gain) action for each gene
Generalizing K-Means to Biclustering
Assume K gene clusters, L sample clusters
Notice that this is a little counter-intuitive: we do not have well-defined biclusters; rather, we have clusters of genes and clusters of samples, and each pair of gene and sample clusters defines a bicluster
R: m×k gene clustering matrix, C: n×l sample clustering matrix
R(i,k) = 1 if gene i belongs to cluster k (actually, columns are normalized to have unit norm)
Minimize total residue:
KL-Means Algorithm
We can show that
Batch iteration
Given R, compute an m×l matrix that serves as a prototype for column clusters
For each column, find the column of the prototype matrix that is closest to that column, and update the corresponding entry of C accordingly
Once C is fixed, repeat the same for rows to compute R from the corresponding row prototype matrix
Converges to a local minimum of the objective function
OPSM Algorithm
Recall that an order-preserving submatrix (OPSM) is one such that all rows have their entries in the same order
Growing partial models
Fix the extremes first
The idea: Columns with very high or low values are more informative for identifying rows that support the assumed linear order
Start with all (1,1) partial models, i.e., only consider the preservation of the first and last elements; keep the best ones
Expand these to obtain (2,1) models, then (2,2) models, and so on, until we have (s/2, s/2) models, s being the number of columns in the target bicluster
Divide and Conquer Algorithms
Block clustering (a.k.a. direct clustering)
Recursive bipartitioning
Sort rows according to their mean, choose a row such that the total variance above and below the row is minimized
Do the same for columns
Pick the row or column that results in minimum intra-cluster variance, split the matrix into two based on that row or column
Continue splitting recursively
One problem is that once two rows/columns go to different biclusters, they can never come together
Gap statistics: Find a large number of biclusters, then recombine
Binormalization
Normalize the matrix on both dimensions
Independent scaling of rows and columns
Here, R and C are diagonal matrices that contain row and column means, respectively
Bistochastization
Goal: Rows will add up to a constant (or will have constant norm), columns will add up to a separate constant
Repeat independent scaling of rows and columns until stability is reached
The residual of the entire matrix is also normalized in the sense that both rows and columns have zero mean
Spectral Biclustering
Singular value decomposition
The nonzero eigenvalues of the matrices A^T A and A A^T (say, σ^2) are the same
Each σ is called a singular value of A, and the corresponding eigenvectors of A A^T and A^T A are called the left and right singular vectors
If σ_1 is the largest singular value of A, such that A^T A v_1 = σ_1^2 v_1 and A A^T u_1 = σ_1^2 u_1, then σ_1 u_1 v_1^T is the best rank-one approximation to A, i.e., ||A − σ u v^T||_2 is minimized by σ_1, u_1, and v_1 (over all pairs of vectors u, v with unit norm)
Consequently, the entries of u and v are ordered in such a way that similar rows have similar values on u, and similar columns have similar values on v
Split the matrix based on u and v
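To make the leading singular triplet concrete, here is a minimal power-iteration sketch in plain Python. Power iteration is a swapped-in technique chosen for brevity (in practice one would call a library SVD); the function name, the fixed iteration count, and the uniform starting vector are assumptions, and the sketch presumes the start vector is not orthogonal to the leading right singular vector.

```python
from math import sqrt

def leading_singular_triplet(A, iters=200):
    """Approximate (sigma_1, u_1, v_1) of A by alternating u = Av, v = A^T u."""
    m, n = len(A), len(A[0])
    v = [1.0 / sqrt(n)] * n
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        norm_u = sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        v = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]
        sigma = sqrt(sum(x * x for x in v))   # ||A^T u|| converges to sigma_1
        v = [x / sigma for x in v]
    return sigma, u, v
```

Sorting the rows by their entries in `u` and the columns by their entries in `v` gives the ordering on which the matrix is then split.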
6. Gene Regulatory Networks
Regulation of Gene Expression
Transcriptional Regulation of telomerase protein component gene hTERT
Genetic Regulation & Cellular Signaling
Organization of Genetic Regulation
Figure: Genetic network that controls flowering time in A. thaliana (Blazquez et al., EMBO Reports, 2001). Labels: gene, up-regulation, down-regulation, negative ligand-independent repression at chromatin level
Gene Regulatory Networks
Transcriptional Regulatory Networks
Nodes with outgoing edges are limited to transcription factors
Can be reconstructed by identifying regulatory motifs (through clustering of gene expression & sequence analysis) and finding transcription factors that bind to the corresponding promoters (through structural/sequence analysis)
Gene Regulatory Networks
Gene expression networks
General model of genetic regulation
Identify the regulatory effects of genes on each other, independent of the underlying regulatory mechanism
Can be inferred from correlations in gene expression data, time-series gene expression data, and/or gene knock-out experiments
Observation Inference
Boolean Network Model
Binary model: a gene has only two states
ON (1): The gene is expressed
OFF (0): The gene is not expressed
Each gene’s next state is determined by a boolean function of the current states of a subset of other genes
A boolean network is specified by two sets
Set of nodes (genes)
State of a gene: $x_i \in \{0, 1\}$
Collection of boolean functions
Logic Diagram
Cell cycle regulation
Retinoblastoma (Rb) inhibits DNA synthesis
Cyclin Dependent Kinase 2 (cdk2) & cyclin E inactivate Rb to release cell into S phase
Up-regulated by CAK complex and down-regulated by p21/WAF1
p53
Wiring Diagram
Dynamics of Boolean Networks
Gene activity profile (GAP)
Collection of the states of individual genes in the genome (network)
The number of possible GAPs is 2^n
The system ultimately transitions into attractor states
Steady state (point) attractors
Dynamic attractors: state cycle
Each transient state is associated with an attractor (basins of attraction)
In practice, only a small number of GAPs correspond to attractors
What is the biological meaning of an attractor?
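The attractor structure can be illustrated by brute-force enumeration of all 2^n gene activity profiles under synchronous updates. The three-gene rules below are hypothetical and chosen only so that both point attractors and a state cycle appear; they are not taken from the slides.

```python
from itertools import product

def step(state, functions):
    """Synchronous update: every gene reads the current state."""
    return tuple(int(f(state)) for f in functions)

def attractors(functions):
    """Map each of the 2^n states to the attractor (set of states) it reaches."""
    n = len(functions)
    basin = {}
    for start in product((0, 1), repeat=n):
        path, seen = [], {}
        state = start
        while state not in seen and state not in basin:
            seen[state] = len(path)
            path.append(state)
            state = step(state, functions)
        if state in basin:
            cycle = basin[state]                    # joined a known basin
        else:
            cycle = frozenset(path[seen[state]:])   # closed a new cycle
        for s in path:
            basin[s] = cycle
    return basin

# Hypothetical 3-gene network (illustrative rules only)
rules = [
    lambda s: s[1],           # gene 0 copies gene 1
    lambda s: s[0],           # gene 1 copies gene 0
    lambda s: s[0] and s[1],  # gene 2 is ON if genes 0 and 1 are ON
]
basin = attractors(rules)
```

Enumeration is only feasible for small n, which is the practical reason the state space is usually explored by sampling trajectories instead.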
State Space of Boolean Networks
Equate cellular states with attractors
Attractor states are stable under small perturbations
Most perturbations cause the network to flow back to the attractor
Some genes are more important, and changing their activation can cause the system to transition to a different attractor
This slide is taken from the presentation by I. Shmulevich
Identification of Boolean Networks
We have the “truth table” available
Binarize time-series gene expression data
REVEAL
Use mutual information to derive logical rules that determine each variable
If the mutual information between a set of variables and the target variable is equal to the entropy of the target variable, then that set of variables completely determines the target variable
For each variable, consider functions consisting of 1 variable, then 2, then 3, …, then i, …, until one is found
Once the minimum set of variables that determine a variable is found, we can infer the function from the truth table
In general, the in-degrees of genes in the network are small
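A rough sketch of the search described above, assuming the data are given as (state, next state) pairs of binary tuples. The function names, the `max_k` cap, and the numerical tolerance are assumptions for illustration and are not taken from the REVEAL paper.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(values):
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * log2(c / total) for c in counts.values())

def mutual_information(xs, ys):
    """M(X, Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def smallest_determining_set(transitions, target, max_k=3, tol=1e-9):
    """Smallest gene set whose current state determines `target` at the next step."""
    n = len(transitions[0][0])
    outputs = [nxt[target] for _, nxt in transitions]
    h_out = entropy(outputs)
    for k in range(1, max_k + 1):
        for genes in combinations(range(n), k):
            inputs = [tuple(cur[g] for g in genes) for cur, _ in transitions]
            if abs(mutual_information(inputs, outputs) - h_out) < tol:
                return genes
    return None
```

Trying sets of size 1, then 2, then 3 keeps the search cheap precisely because in-degrees are small; note that a target that never changes has zero entropy, so any single gene trivially "determines" it.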
REVEAL
Limitations of Boolean Networks
The effect of intermediate gene expression levels is ignored
It is assumed that the transitions between states are synchronous
A model incorporates only a partial description of a physical system
Noise
Effects of other factors
One may wish to model an open system
A particular external condition may alter the parameters of the system
Boolean networks are inherently deterministic
Probabilistic Models
Stochasticity can account for
Noise
Variability in the biological system
Aspects of the system that are not captured by the model
Random variables include
Observed attributes
Expression level of a particular gene in a particular sample
Hidden attributes
The boolean function assigned to a gene?
Probabilistic Boolean Networks
Each gene is associated with multiple boolean functions
Each function is associated with a probability
Can characterize the stochastic behavior of the system
Bayesian Networks
A Bayesian network is a representation of a joint probability distribution
A Bayesian network B = (G, Θ) is specified by two components
A directed acyclic graph G, in which directed edges represent the conditional dependence between expression levels of genes (represented by nodes of the graph)
A function Θ that specifies the conditional distribution of the expression level of each gene, given the expression levels of its parents
Gene A is gene B’s parent if there is a directed edge from A to B
P(B | Pa(B)) = Θ(B, Pa(B))
Conditional Independence
In a Bayesian network, if there is no direct edge between two genes, then these genes are said to be conditionally independent
The probability of observing a cellular state (configuration of expression levels) can be decomposed into product form: $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Pa(x_i))$
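The product form can be made concrete with a tiny discrete network: the probability of a full ON/OFF configuration is the product of one conditional probability per gene, each looked up from that gene's parents. The three genes and all probabilities below are hypothetical numbers chosen for illustration.

```python
def joint_probability(assignment, parents, cpt):
    """P(assignment) = product over genes of P(gene | its parents).

    cpt[gene][parent_values] is the probability that the gene is ON (1).
    """
    p = 1.0
    for gene, value in assignment.items():
        pa_values = tuple(assignment[q] for q in parents[gene])
        p_on = cpt[gene][pa_values]
        p *= p_on if value == 1 else 1.0 - p_on
    return p

# Hypothetical network: A -> B and A -> C
parents = {"A": (), "B": ("A",), "C": ("A",)}
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.8},
    "C": {(0,): 0.5, (1,): 0.9},
}
p = joint_probability({"A": 1, "B": 1, "C": 0}, parents, cpt)
```

Here B and C share no edge, so once A is fixed their factors multiply independently.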
Variables in Bayesian Network
Discrete variables
Again, genes’ expression levels are modeled as ON and OFF (or more discrete levels)
If a gene has k parents in the network, then the conditional distribution is characterized by r^k parameters (r is the number of discrete levels)
Continuous variables
Real-valued expression levels
We have to specify multivariate continuous distribution functions
Hybrid networks
Equivalence Classes of Bayesian Nets
Observe that each network structure implies a set of independence assumptions
More than one graph can imply exactly the same set of independencies (e.g., X->Y and Y->X)
Such graphs are said to be equivalent
By looking at observations of a distribution, we cannot distinguish between equivalent graphs
An equivalence class can be uniquely represented by a partially directed graph (some edges are undirected)
Learning Bayesian Networks
Given a training set D = {x_1, x_2, …, x_m} of m independent instances of the n random variables, find an equivalence class of networks B = (G, Θ) that best matches D
The x’s are the gene expression profiles
Based on Bayes’ formula, the posterior probability of a network given the data can be evaluated as
$S(G : D) = \log P(G \mid D) = \log P(D \mid G) + \log P(G) + C$
where C is a constant (independent of G) and
$P(D \mid G) = \int P(D \mid G, \Theta) P(\Theta \mid G) \, d\Theta$
is the marginal likelihood that averages the probability of data over all possible parameter assignments to G
Learning Algorithms
The Bayes score S(G : D) depends on the particular choice of priors P(G) and P(Θ | G)
The priors can be chosen to be
structure equivalent, so that equivalent networks will have the same score
decomposable, so that the score can be represented as the superposition of contributions of each gene
The problem becomes finding the optimal structure (G)
We can estimate the gain associated with addition, removal, and reversal of an edge
Then, we can use greedy-like heuristics (e.g., hill climbing)
Causal Patterns
Bayesian networks model dependencies between multiple measurements
How about the mechanism that generated these measurements?
Causal network model: Flow of causality
Model not only the distribution of observations, but also the effect of interventions
If gene X codes for a transcription factor of gene Y, manipulating X will affect Y, but not vice versa
But in Bayesian networks, X->Y and Y->X are equivalent
Intervention experiments (as compared to passive observation): Knock X out, then measure Y
Dynamic Bayesian Networks
Dependencies do not uncover temporal relationships
Gene expression varies over time
Dynamic Bayesian Networks model the dependency between a gene’s expression level at time t and the expression levels of its parent genes at time t-1
Topology of Biological Networks
Topological Characteristics of Networks
Local characteristics
Subgraphs, motifs
Clustering
Global characteristics
Degree distribution
Reachability
Hierarchy, assortativity
Topology & Function
Robustness: Degree distribution, reachability, hierarchy
Modularity: Motifs, clustering, hierarchy
Dynamics: Do general topological properties determine behavior?
8. Topology of Biological Networks
Real-World Networks
Biological networks at different scales
Population, tissue, cell
Cellular networks
Metabolic pathways, transcriptional networks, protein-protein interactions
Other networks
Internet, social networks (Erdős number, Kevin Bacon network, friendship), electronic circuits, parallel computers
Understanding Networks
Are there commonalities in the topological characteristics of different networks?
Turns out to be yes
Do these have anything to do with function, origin, growth of these networks?
How about differences?
Internet vs. S. cerevisiae PPI network
Graphs vs. Networks
A network is a “functional” structure, in which nodes and links are “active”
Information flow
Underlying dynamics
A graph is an abstraction of a network (or, in general, pairwise relationships between entities)
The two terms are commonly used interchangeably
Modeling Networks
Graph representation of metabolic pathways (a)
Edges may represent substrate-product relationships between metabolites (b), or producer-consumer relationships between enzymes
We can drop “common” metabolites (c)
Directed vs. Undirected Graphs
An edge (or link) is directed if there is a specified directionality (such as cause and effect) in the relationship between the two objects represented by the nodes
Metabolic pathways are directed, because many reactions are irreversible
Protein-protein interactions are generally undirected, because in most cases all we know is that they bind to each other
The semantics of topological properties (and/or motifs) may be different for directed and undirected graphs
Cycle
Connectivity
Degree of a node in the network
How many links does a node have to other nodes?
Social networks: How “social” or “active” is a person?
Protein-protein interactions: How “sticky” or “functional” is a protein?
Directed graphs
In-degree and out-degree
Internet: How popular is a website?
Metabolic pathways: How many reactions use a metabolite as substrate?
Degree Distribution
Define P(k) as the probability (relative frequency) that a selected node has exactly k links
N(k) = Number of nodes with degree k
P(k) = N(k)/N, where N is the total number of nodes
Degree is a local property, degree distribution is a global property
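A minimal sketch of P(k) = N(k)/N for an undirected graph given as an edge list. The function name is hypothetical, and nodes that appear in no edge (degree 0) are not counted in this simplified version.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: P(k)} for an undirected edge list of (u, v) pairs."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())   # N(k)
    n = len(degree)                     # N, nodes with at least one edge
    return {k: counts[k] / n for k in sorted(counts)}

# Star graph: one hub linked to three leaves
dist = degree_distribution([("h", "a"), ("h", "b"), ("h", "c")])
```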
Average degree distribution of metabolic networks of 43 organisms
Reachability
Path
Sequence of nodes, each linked to the next, that connects two specified nodes
Shortest path
The path between two nodes that contains the minimum number of edges (length)
Quantifies the reachability between two molecules; the length of the shortest path is a.k.a. distance
Network’s overall navigability
Diameter: Maximum distance in a network
Mean path length: Average distance in a network
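Distance, diameter, and mean path length can be sketched with breadth-first search on an unweighted graph. The adjacency-dictionary input format and the function names are assumptions; unreachable pairs are simply skipped, so the sketch is most meaningful on a connected graph.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path length (number of edges) from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_mean_path_length(adj):
    """Maximum and average distance over all ordered pairs of distinct nodes."""
    lengths = []
    for s in adj:
        d = bfs_distances(adj, s)
        lengths.extend(d[t] for t in d if t != s)
    return max(lengths), sum(lengths) / len(lengths)
```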
Small World Effect
First identified on social networks
People were sent letters and asked to forward the letter to a specified target person if they knew that person personally; if not, they were to send it to a friend who was likely to know the target
Result: “Six degrees of separation” (average)
Most natural networks demonstrate the small-world phenomenon
Neural networks, WWW
Metabolism
Paths of three or four reactions can link most metabolite pairs
Local perturbations in metabolite concentrations can reach the whole network very quickly
Clustering
A network is clustered if we can say that
If A and B are connected and B and C are connected, then it is likely that A and C are connected
Clustering coefficient
The fraction of observed triangles among all possible triangles around a node
$C_I = \frac{2 n_I}{k (k - 1)}$, where k is the degree of node I, and $n_I$ is the number of pairs of neighbors of I that are connected to each other
Distribution of clustering coefficients
The function C(k): Average clustering coefficient of nodes with degree k
Diameter and average degree depend on the total number of nodes, but P(k) and C(k) do not
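A short sketch of the clustering coefficient of a single node: the number of connected neighbor pairs divided by the k(k-1)/2 possible pairs. The adjacency format (a dictionary of neighbor sets) and the convention of returning 0 for nodes with fewer than two neighbors are assumptions.

```python
from itertools import combinations

def clustering_coefficient(adj, node):
    """Fraction of neighbor pairs of `node` that are themselves linked."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0   # no neighbor pairs; convention assumed here
    linked = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2 * linked / (k * (k - 1))

# Triangle a-b-c with a pendant node d attached to a
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
```

Averaging this value over all nodes of degree k gives the function C(k) mentioned above.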
Random Graphs
Known as Erdős-Rényi graphs
Mark N nodes, draw an edge between any pair of nodes with fixed probability p
Mean path length is proportional to log(N)
Degree distribution peaks around the average degree, clustering coefficient does not depend on degree
Random network, degree distribution, clustering coefficient distribution
Scale-Free Networks
The degree distribution follows a power law
$P(k) \sim k^{-\gamma}$, where $\gamma$ is the degree exponent parameter
In other words, the number of nodes with degree k is inversely proportional to a power of k
Many low-degree nodes, a few hubs
Mean path length is proportional to log(log(N))
Terminology: Absence of a typical node in the network
Scale-free network, degree distribution, clustering coefficient distribution
Scale-Free Networks in Nature
Degree exponent γ: Metabolic network 2.26, Actor collaboration 2.3, WWW 2.1, Power grid 4
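Mathematical Model for Power Law
There are y nodes of degree x, where $\log y = \alpha - \beta \log x$, i.e., $y = e^{\alpha} / x^{\beta}$
The maximum degree in the graph is $e^{\alpha/\beta}$
The number of vertices, n, is: $n = \sum_{x=1}^{e^{\alpha/\beta}} \frac{e^{\alpha}}{x^{\beta}} \approx \zeta(\beta) e^{\alpha}$ (for $\beta > 1$), where $\zeta$ is the Riemann zeta function
The number of edges, E, is $E = \frac{1}{2} \sum_{x=1}^{e^{\alpha/\beta}} x \frac{e^{\alpha}}{x^{\beta}} \approx \frac{1}{2} \zeta(\beta - 1) e^{\alpha}$ (for $\beta > 2$)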
Role of Degree Exponent
The smaller the value of γ, the more important the hubs are
If γ > 3, then the hubs are not relevant
For 2 < γ < 3, there is a hierarchy of hubs, with the most connected hub being in contact with a small fraction of all nodes
For γ = 2, a star-like network emerges, with the largest hub being in contact with a large fraction of nodes
Scale-free networks are generally interesting for 2 < γ < 3
Unusual properties emerge for this regime
This is the range that is observed in most biological (as well as non-biological) networks
Hierarchical Networks
“General” scale-free networks still do not capture one observed property of cellular networks
Hierarchical networks are clusters of clusters of clusters of… connected through local hubs, less local hubs, …, global hubs
Average clustering coefficient distribution for the metabolic networks of 43 organisms: $C(k) \sim 1/k$
Hubs in Cellular Networks
PPI networks
Kinases form the core of the network
Genetic regulation
Most transcription factors regulate a few genes, a few general transcription factors interact with many genes
Recall that the gene expression matrix of the yeast cell cycle contains a few strong principal components
However, the incoming degree distribution is better approximated by an exponential function
Most genes are regulated by only one to three transcription factors
Growth Models
How do these networks gain these properties?
Preferential attachment
At each time point, a node is added to the network and connected to an existing node with probability proportional to the current degree of that node
1st order:
2nd order:
This growth model generates scale-free networks with degree exponent 3
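A minimal sketch of growth by preferential attachment, under the simplifying assumptions that the network starts from a single edge and each new node brings exactly one link. The function name and the `seed` parameter are hypothetical.

```python
import random

def preferential_attachment(n_nodes, seed=None):
    """Grow a network one node at a time; each new node links to an existing
    node chosen with probability proportional to its current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    endpoints = [0, 1]   # node i appears here exactly degree(i) times
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)   # degree-proportional choice
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges
```

Sampling uniformly from the list of edge endpoints is what makes the choice degree-proportional: a node of degree d occupies d slots in that list.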
Duplication/Divergence
Gene duplications are considered one of the driving forces of molecular evolution
When a gene is duplicated, the corresponding protein has two copies, so an additional node with the same neighbors is added to the network
Proteins with already high degree are likely to have their neighbors duplicated => Preferential attachment
Network Evolution & Topology
The scale-free model predicts that the nodes that appeared early in the history of the network are the most connected nodes
Remnants of the RNA world, such as coenzyme A, NAD, and GTP, are among the most connected substrates in the metabolic network
Elements of the most ancient metabolic pathways, such as glycolysis and the tricarboxylic acid cycle
In PPI networks, cross-genome comparisons indicate that, on average, there is a positive correlation between evolutionary history and the number of links a protein has
Assortativity
Social networks are generally assortative
People who know many people also know each other
Productive authors do write papers together
Most cellular networks are observed to be disassortative
In general, hubs avoid linking directly to each other
Metabolic pathways, PPI networks, as well as WWW
Function of disassortativity?
Selective value of disassortativity?
Evolution of disassortativity?
Do existing models generate disassortativity?
Modules & Clustering
Modularity
Groups of physically or functionally linked molecules that work together to perform a (relatively) distinct function
Friend groups in social networks, labs in co-authorship
Protein complexes
Temporally co-regulated groups of genes
High clustering in cellular networks
Average clustering coefficient is independent of network size for metabolic pathways
For an arbitrary scale-free network, average clustering coefficient decreases with network size
PPI and DDI networks also have high clustering coefficients
Subgraphs as Elementary Units
Subgraphs capture specific patterns of interconnections that characterize a given network at the local level
Not all subgraphs are equally significant
The abundance of squares and the absence of triangles can tell us something fundamental about the architecture of the square lattice
Figure: in the square lattice, triangles do not exist at all, while squares are abundant
Network Motifs
A network motif is a subgraph (in topological terms, i.e., ignoring the identity of nodes) that occurs much more frequently in the network of interest than in a random network that has similar global properties
Generating Random Graphs
Random graphs are used to assess the statistical significance of the frequency of a motif
As degree distribution is the key characteristic of scale-free networks, generally the graphs are randomized to preserve degree distribution
Simulation
Edge switching algorithm
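One common form of the edge switching idea, sketched for a simple undirected graph: pick two edges a-b and c-d and rewire them to a-d and c-b, rejecting any swap that would create a self-loop or a duplicate edge. Every node keeps its degree. The function name, the attempt cap, and the `seed` parameter are assumptions for illustration.

```python
import random

def edge_switch(edges, n_swaps, seed=None):
    """Degree-preserving randomization of a simple undirected edge list."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c                      # consider both orientations
        if len({a, b, c, d}) < 4:
            continue                         # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue                         # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges
```

Counting a motif in many such randomized copies gives the background frequency against which the count in the real network is compared.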
Analytical Methods
Arbitrary degree distribution
The probability of existence of an edge between u and v is defined as
$P(u, v) = \frac{d_u d_v}{\sum_{w} d_w}$
where $d_u$ and $d_v$ are specified “expected degrees” of u and v
Observe that $E[D_u] = d_u$, where $D_u$ is the corresponding random variable
However, in order for P to be a well-defined probability function, we must have
$d_{\max}^2 \le \sum_{w} d_w$, where $d_{\max} = \max_u d_u$
In general, this is not the case for PPI networks
These models are generally useful for multigraphs rather than simple graphs, because of dependencies
Common Network Motifs
PPI and Transcription
Integrate protein-protein interactions and transcriptional regulation
Motifs might reveal how these two types of interaction work together for regulation of cellular processes
Possible interaction patterns between a pair of proteins
Red directed arrows represent transcriptional regulation
Black bidirectional arrows represent protein-protein interaction
Motifs in Integrated TRI-PPI Network
Conservation of Motifs
Is there a “selective value” of motifs?
If motifs are conserved, then one might expect that proteins that are parts of motifs will also be conserved
Motif Constituents Conserved Together
Motif Clusters
Motifs generally tend to form clusters
Hierarchical modularity
On the left, 209 bi-fan motifs on the E. coli TRN are shown altogether
Shared edges are in blue, others in red
Topological Robustness
Scale-free networks are robust to random attacks
It is not easy to disconnect the network via random node deletions
In random (Erdős-Rényi) graphs, the network falls apart when the number of accidental node failures reaches a certain threshold
Scale-free networks do not have such a threshold: Even if 80% of randomly chosen nodes fail, the remaining 20% are still connected
Attack vulnerability
Dependence on hubs
If a key hub fails, the network turns into a collection of small isolated node clusters
Robustness, Lethality, & Redundancy
Lethal proteins
Only about 10% of nodes with fewer than 5 interactions are essential
This rate is 60% for proteins with more than 15 interactions
Redundancy: Only 18.7% of S. cerevisiae proteins are lethal when deleted individually
Evolution of robustness
Highly connected yeast genes have a smaller evolutionary distance to their orthologs in C. elegans
The structure of important proteins is subject to more selective pressure
Functional and Dynamical Robustness
Nodes have different biological functions
Network topology is not the sole indicator of lethality
Experimentally identified protein complexes tend to be composed of uniformly essential or non-essential proteins
Dispensability of the whole complex determines the importance of subunits