ARACNE: An Algorithm for the Reconstruction of Gene Regulatory
Aracne Jorge Viveros Summer 2006 Workshop June 29th, 2006.
Aracne
Jorge Viveros
Summer 2006 Workshop
June 29th, 2006
Contents
1. Overview (the problem, the alternatives, the central idea of ARACNE’s algorithm)
2. Demo (reconstruction of gene regulatory networks from Affymetrix gene expression data)
3. Algorithm details (approximating the mutual information, comparative study results, ARACNE vs Bayesian and Relevance Networks)
4. Conclusions
5. Bibliography
1. Overview: ARACNE
Algorithm for the Reconstruction of Accurate Cellular Networks
“Reverse engineering” or “deconvolution” problem:
[Figure: expression samples for genes ga, gb, gc, gd, ge are turned into a gene regulatory network over the same genes using information-theory + max entropy methods.]
(overview, cont’d) Authors
A.A. Margolin [1,2], I. Nemenman [2], K. Basso [3], C. Wiggins [2,4], G. Stolovitzky [5], R. Dalla-Favera [3], A. Califano [1,2]
[1] Dept. of Biomedical Informatics, [2] Joint Centers for Systems Biology, [3] Institute for Cancer Genetics, [4] Dept. of Appl. Physics and Appl. Math., Columbia University
[5] IBM T.J. Watson Research Center
Main reference: http://www.arxiv.org/abs/q-bio/0410037
BMC Bioinformatics 2006, 7(Suppl 1):S7
(overview, cont’d) Goal
Understand normal mammalian cell physiology and complex pathologic phenotypes by elucidating gene transcriptional regulatory networks.
Thesis
Statistical associations between mRNA abundance levels help to uncover gene regulatory mechanisms.
(overview: alternatives) ARACNE vs Clustering
ARACNE recovers specific transcriptional interactions but does not attempt to
recover all of them (too complex a problem).
Genome-wide clustering of gene expression profiles: cannot discern direct
(irreducible) from “cascade” transcriptional gene interactions.
[Figure: side-by-side comparison for genes ga, gb, gc, gd, ge. Clustering: groups (ga, gb), (gc, gd), (ge). ARACNE: a network on nodes a, b, c, d, e.]
(central idea) Gene network inference
nodes = genes
edge = (direct) statistical dependency = direct regulatory interaction
Temporal gene expression data for higher eukaryotes are difficult to obtain, so only steady-state statistical dependencies are studied.
[Figure: nodes gi and gj.]
Accounting for dependence: definition and measurement
Gene expression values are samples from a joint probability distribution (JPD).
Consider the multi-information: the average log-deviation of the JPD from the product of its marginals (also a “Kullback-Leibler divergence” (KL-div)).
Use maximum entropy methods to approximate the JPD by an element of its “m-way” marginal Frechet class (the m-way maximum-entropy estimate, m-MEE).
Use the m-MEE to define the mth-order connected information (m-cinfo), which accounts for m-way statistical dependencies (only!).
Multi-info = sum of all m-cinfo’s.
The multi-information
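Variables: $\{X_i : i = 1, \dots, M\}$, the “nodes”, “expressions” or “genes”.
Multi-information (KL-div) of the JPD $P(x)$:
$I[P] = \sum_x P(x) \log \frac{P(x)}{\prod_{i=1}^{M} P_i(x_i)} = \sum_{i=1}^{M} S[P_i] - S[P]$
(an integral in the continuous case, a sum in the discrete case), where $P_i$ is the marginal of $X_i$ and $S[P]$ is the entropy of $P(x)$.
The JPD is not known: approximate it!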
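As a small illustration of the definition on this slide, the sketch below computes the multi-information of a discrete JPD as the sum of the marginal entropies minus the joint entropy. The function names and the toy distribution are made up for the example; they are not part of ARACNE.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy (in nats) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def marginal(joint, index):
    """Marginal distribution of the variable stored at position `index`."""
    out = defaultdict(float)
    for state, p in joint.items():
        out[state[index]] += p
    return dict(out)

def multi_information(joint):
    """Sum of marginal entropies minus joint entropy, for {state_tuple: prob}."""
    n_vars = len(next(iter(joint)))
    marginal_entropy = sum(entropy(marginal(joint, i)) for i in range(n_vars))
    return marginal_entropy - entropy(joint)

# Toy JPD (an assumption for the example): genes 1 and 2 are perfectly
# correlated, gene 3 is independent of both.
joint = {
    (0, 0, 0): 0.25, (0, 0, 1): 0.25,
    (1, 1, 0): 0.25, (1, 1, 1): 0.25,
}
print(multi_information(joint))
```

Only the dependence between the first two genes contributes here; an independent set of genes would give zero.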
m-way max entropy estimate of JPD
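The m-MEE, $P^{(m)}$, has the same m-way marginals as $P(x)$.
The m-MEE is a product of factors, one for each set of m variables, whose parameters are Lagrange multipliers enforcing the marginal constraints.
The Lagrange multipliers have no analytical solution, BUT they can be obtained via an iterative proportional fitting procedure (IPFP).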
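A minimal sketch of the IPFP for the case m = 2, assuming a small discrete JPD that is strictly positive on its full state space. It starts from the uniform distribution and repeatedly rescales it so that each pairwise marginal matches the target. The names `pair_marginal` and `ipfp_pairwise` and the toy target are illustrative only.

```python
import itertools

def pair_marginal(joint, i, j):
    """2-way marginal over positions i and j of a joint {state_tuple: prob}."""
    out = {}
    for state, p in joint.items():
        key = (state[i], state[j])
        out[key] = out.get(key, 0.0) + p
    return out

def ipfp_pairwise(target, n_sweeps=200):
    """Fit a distribution with the same pairwise marginals as `target`,
    starting from uniform (the maximum-entropy point with no constraints)."""
    states = list(target)
    n_vars = len(states[0])
    pairs = list(itertools.combinations(range(n_vars), 2))
    goals = {pair: pair_marginal(target, *pair) for pair in pairs}
    q = {s: 1.0 / len(states) for s in states}
    for _ in range(n_sweeps):
        for i, j in pairs:
            current = pair_marginal(q, i, j)
            for s in states:
                key = (s[i], s[j])
                if current[key] > 0:
                    # Rescale so the (i, j) marginal of q matches the target.
                    q[s] *= goals[(i, j)][key] / current[key]
    return q

# Toy target (an assumption): a noisy parity rule on three binary genes.
# Its pairwise marginals are all uniform, so the 2-MEE is uniform and the
# whole dependency is a genuine three-way interaction.
target = {
    s: (0.2 if sum(s) % 2 == 0 else 0.05)
    for s in itertools.product((0, 1), repeat=3)
}
q = ipfp_pairwise(target)
print(q)
```

The fixed number of sweeps is a simplification; a convergence check on the marginals would be the more careful choice.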
Connected and Multi informations
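mth-order connected information: $I_C^{(m)}[P] = S[P^{(m-1)}] - S[P^{(m)}]$, where $P^{(m)}$ is the m-MEE (the entropy drop from adding the m-way marginal constraints)
Multi-information: $I[P] = \sum_{m=2}^{M} I_C^{(m)}[P]$
Compensate for the lack of knowledge of the JPD by using the (truncated!) multi-info to establish and quantify statistical dependencies.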
Detecting a particular m-way interaction
An m-way interaction among the variables $\{X_{i_1}, \dots, X_{i_m}\}$ contributes to the multi-info iff the minimum of the interaction multi-information (inter multi-info) over the $\{i_1, \dots, i_m\}$-specific Frechet class is positive.
The inter multi-info compares two distributions, $Q$ and $Q^*$: both are m-MEEs sharing the same m-way marginals except for, perhaps, the marginal over $\{i_1, \dots, i_m\}$.
Positivity of the minimal inter multi-info indicates an irreducible (direct) interaction. Thus draw edges coming from nodes $X_{i_1}, \dots, X_{i_m}$ and meeting at an m-edge vertex.
Examples
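Regulatory cascade (Markov chain): $X_1 \to X_2 \to X_3$
$P(x_1, x_2, x_3) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_2)$
$I^{\{1,2\}} = I(X_1, X_2) + I(X_2, X_3)$, $I^{*\{1,2\}} = I(X_1, X_3) + I(X_2, X_3)$
Information processing inequality: $I(X_1, X_2) \ge I(X_1, X_3)$, so $\Delta I^{*\{1,2\}} \equiv I^{\{1,2\}} - I^{*\{1,2\}} > 0$: $X_1, X_2$ generically dependent (similarly, $X_2, X_3$)
$\Delta I^{*\{1,3\}} = 0$: $X_1, X_3$ generically independent
$I^{\{1,2,3\}} = I(X_1, X_2) + I(X_2, X_3) = I^{*\{1,2,3\}}$: no triplet interactions (coregulation)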
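The information processing inequality for the regulatory cascade can be checked numerically. The sketch below builds a binary Markov chain X1 -> X2 -> X3 with made-up flip probabilities and compares the three pairwise mutual informations; every name in it is hypothetical.

```python
import math

def mutual_information(pxy):
    """MI (in nats) of a two-variable joint given as {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

def chain_joint(p1, flip12, flip23):
    """Binary chain X1 -> X2 -> X3: each gene copies its parent and flips
    with the given probability."""
    joint = {}
    for x1 in (0, 1):
        for x2 in (0, 1):
            for x3 in (0, 1):
                p = p1 if x1 == 1 else 1 - p1
                p *= flip12 if x2 != x1 else 1 - flip12
                p *= flip23 if x3 != x2 else 1 - flip23
                joint[(x1, x2, x3)] = p
    return joint

def pair(joint, i, j):
    """Marginal of the joint over positions i and j."""
    out = {}
    for s, p in joint.items():
        out[(s[i], s[j])] = out.get((s[i], s[j]), 0.0) + p
    return out

joint = chain_joint(p1=0.5, flip12=0.1, flip23=0.2)
i12 = mutual_information(pair(joint, 0, 1))
i23 = mutual_information(pair(joint, 1, 2))
i13 = mutual_information(pair(joint, 0, 2))
print(i12, i23, i13)  # the indirect pair (X1, X3) has the smallest MI
```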
(examples, cont’d) Other dependencies
2 regulates 1 and 3 OR 1 and 3 regulate 2 jointly
$P_{123}$ does not factor, but the pairwise marginals do
2. Demo
Platforms
1. caWorkBench2.0 (downloadable through web site) (JAVA)
Most developed features: microarray data analysis, pathway analysis and reverse engineering, sequence analysis, transcription factor binding site analysis, pattern discovery.
http://amdec-bioinfo.cu-genome.org/html/caWorkBench.htm
2. Cygwin (for Windows). Windows and Linux versions are available on the web site.
(Demo) Sample input data file
Input_file_name.exp
N = 3 genes, M = 2 microarrays
The input file has N+1 = 4 lines: one header line plus one line per gene.
The header line has M+2 fields; each gene line has 2M+2 fields.
AffyID HG_U95Av2 SudHL6.CHP ST486.CHP
G1 G1 16.477367 0.69939363 20.150969 0.5297595
G2 G2 7.6989274 0.55935365 26.04019 0.5445875
G3 G3 8.8098955 0.5445875 21.554955 0.31372303
Header line: AffyID, annotation name, then the microarray chip names.
Gene lines: AffyID, annotation name, then one (value, p-value) pair per chip.
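A minimal reader for the layout on this slide might look as follows. It assumes whitespace-separated fields and annotation names without embedded spaces, which the real .exp format may not guarantee; `read_exp` is a hypothetical helper, not part of the ARACNE distribution.

```python
import io

def read_exp(handle):
    """Parse the .exp layout shown on the slide.

    Returns (chip_names, {gene_id: [(value, p_value), ...]}).
    """
    header = handle.readline().split()
    chip_names = header[2:]  # skip the two annotation columns
    profiles = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        gene_id = fields[0]
        numbers = [float(f) for f in fields[2:]]
        if len(numbers) != 2 * len(chip_names):
            raise ValueError(f"{gene_id}: expected {2 * len(chip_names)} numeric fields")
        # Even positions are expression values, odd positions are p-values.
        profiles[gene_id] = list(zip(numbers[0::2], numbers[1::2]))
    return chip_names, profiles

sample = """AffyID HG_U95Av2 SudHL6.CHP ST486.CHP
G1 G1 16.477367 0.69939363 20.150969 0.5297595
G2 G2 7.6989274 0.55935365 26.04019 0.5445875
G3 G3 8.8098955 0.5445875 21.554955 0.31372303
"""
chips, profiles = read_exp(io.StringIO(sample))
print(chips, profiles["G1"])
```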
(Demo, cont’d) Syntax (Cygwin)
ARACNE: algorithm for gene regulatory network computation given
microarray data.
Usage:
aracne GeneExpressionFile [-a | -k | -s | -t | -e | -f]
aracne -adj GeneExpressionFile AdjacencyFile [-t | -e]
-a accurate | fast [default: accurate]
-k gaussian kernel width [accurate method only; default: 0.15]
-s Averaging Window step size [fast method only; default: 6]
-t Mutual Info. threshold [default: 0]
-e DPI tolerance (btw 0 and 1) [default: 1]
-f mean stdev [default: no filtering]
(Demo, cont’d) Sample output data file
input_data_file_name[non-default_param_vals].adj
# lines = N = # genes
G1:0 8 0.064729
G2:1 2 0.0298643 7 0.0521425
G3:2 1 0.0298643
G4:3 8 0.0427217
G5:4 5 0.403516
G6:5 4 0.403516 6 0.582265
G7:6 5 0.582265 9 0.38039
G8:7 1 0.0521425 8 0.743262
G9:8 0 0.064729 3 0.0427217 7 0.743262 9 0.333104
G10:9 6 0.38039 8 0.333104
Each line: AffyID:ID#, followed by (associated gene ID#, MI value) pairs.
[Figure: the resulting network drawn on nodes 1 to 10.]
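To turn the adjacency listing on this slide into an undirected edge list, a sketch along these lines could be used. It assumes whitespace-separated fields exactly as shown; `read_adj` is a hypothetical helper, not part of the ARACNE distribution.

```python
def read_adj(lines):
    """Return {(i, j): mi} with i < j from lines like 'G2:1 2 0.0298643 7 0.0521425'."""
    edges = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        _affy_id, node = fields[0].rsplit(":", 1)
        node = int(node)
        # Remaining fields alternate: associated gene ID#, MI value.
        for other, mi in zip(fields[1::2], fields[2::2]):
            key = tuple(sorted((node, int(other))))
            edges[key] = float(mi)
    return edges

sample = """G1:0 8 0.064729
G2:1 2 0.0298643 7 0.0521425
G3:2 1 0.0298643
G4:3 8 0.0427217
G5:4 5 0.403516
G6:5 4 0.403516 6 0.582265
G7:6 5 0.582265 9 0.38039
G8:7 1 0.0521425 8 0.743262
G9:8 0 0.064729 3 0.0427217 7 0.743262 9 0.333104
G10:9 6 0.38039 8 0.333104
""".splitlines()
edges = read_adj(sample)
print(len(edges), max(edges, key=edges.get))
```

Each interaction appears twice in the file (once per endpoint); sorting the pair collapses both entries into one edge.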
3. Algorithm details
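Incorporate information-theoretic ideas (Markov networks) to model statistical dependencies (cf. [2]):
$P(\{g_i\}) = \frac{1}{Z} \exp\left[-\sum_{i} \phi_i(g_i) - \sum_{i,j} \phi_{ij}(g_i, g_j) - \sum_{i,j,k} \phi_{ijk}(g_i, g_j, g_k) - \dots\right] \equiv e^{-H(\{g_i\})}$
$P(\{g_i\})$ = joint probability distribution function of the stationary expressions of all genes ($i = 1, \dots, N$)
$N$ = # genes, $Z$ = partition function (normalization factor), $H$ = Hamiltonian,
$\phi_i$, $\phi_{ij}$, $\phi_{ijk}$, … = interaction potentials (e.g., genes $i$, $j$, $k$ do not interact in the model iff $\phi_{ijk} = 0$).
Aim: identify the nonzero potentials.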
(Algorithm details) Aracne’s model
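First-order approximation: genes are independent, $H(\{g_i\}) = \sum_i \phi_i(g_i)$, where $\phi_i$ is the one-gene potential of the Hamiltonian $H$.
1st-order potentials are obtained from the marginal probabilities (estimated experimentally).
ARACNE’s approximation: truncate the joint probability distribution function to pairwise potentials, $H(\{g_i\}) = \sum_i \phi_i(g_i) + \sum_{i,j} \phi_{ij}(g_i, g_j)$.
In this model $\phi_{ij} = 0$ for non-interacting genes (this includes statistically independent genes and genes that do not interact directly, i.e., $\phi_{ij} = 0$ but $P(g_i, g_j) \neq P(g_i)P(g_j)$).
Reduce the number of potential pairwise interactions via realistic biological assumptions.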
(algorithm details, cont’d) MI estimation
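Assume two-way interaction: pairwise potentials determine all statistical dependencies.
Mutual information (MI) = measure of relatedness: $I(g_i, g_j) = 0$ iff $P(g_i, g_j) = P(g_i)P(g_j)$
MI approximation (Gaussian kernel estimator over the $M$ samples): $I(\{x_i\}, \{y_i\}) \approx \frac{1}{M} \sum_i \log \frac{f(x_i, y_i)}{f(x_i)\,f(y_i)}$, where $f$ is a kernel density estimate built from $G$
G = bivariate standard Gaussian density
h = kernel width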
(algorithm details, cont’d)
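Some details and technicalities:
Transform $x$, $y$ so that $0 \le x, y \le 1$ and their marginal distributions seem uniform.
With $M$ samples and kernel widths $d_1$ (one-dimensional) and $d_2$ (two-dimensional):
$I(x, y) = \frac{1}{M} \sum_i \log \frac{p(x_i, y_i)}{p(x_i)\,p(y_i)}$
$p(x_i) = \frac{1}{\sqrt{2\pi}\, M d_1} \sum_j \exp\left(-\frac{(x_i - x_j)^2}{2 d_1^2}\right)$
$p(x_i, y_i) = \frac{1}{2\pi M d_2^2} \sum_j \exp\left(-\frac{(x_i - x_j)^2 + (y_i - y_j)^2}{2 d_2^2}\right)$
There is no universal way of choosing the kernel widths (h); however, the ranking of the MIs depends only weakly on them.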
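The Gaussian kernel estimate described on this slide can be sketched in plain Python. This is an illustration under stated assumptions, not ARACNE's implementation: the profiles are rank-transformed to (0, 1], a single kernel width `d` is used for both the one- and two-dimensional densities, and 0.15 is borrowed from the default of the `-k` option in the demo. The function names are hypothetical.

```python
import math
import random

def rank_transform(values):
    """Map samples to (0, 1] by rank so that the marginal looks uniform."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank / len(values)
    return ranks

def kernel_mi(x, y, d=0.15):
    """Average of log p(x_i, y_i) / (p(x_i) p(y_i)) with Gaussian kernels."""
    x, y = rank_transform(x), rank_transform(y)
    m = len(x)
    total = 0.0
    for i in range(m):
        kx = [math.exp(-(x[i] - x[j]) ** 2 / (2 * d * d)) for j in range(m)]
        ky = [math.exp(-(y[i] - y[j]) ** 2 / (2 * d * d)) for j in range(m)]
        p_x = sum(kx) / (math.sqrt(2 * math.pi) * m * d)
        p_y = sum(ky) / (math.sqrt(2 * math.pi) * m * d)
        # The 2-D Gaussian kernel factorizes into the product of the 1-D ones.
        p_xy = sum(a * b for a, b in zip(kx, ky)) / (2 * math.pi * m * d * d)
        total += math.log(p_xy / (p_x * p_y))
    return total / m

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y_dependent = [v + 0.1 * rng.gauss(0, 1) for v in x]
y_independent = [rng.gauss(0, 1) for _ in range(200)]
print(kernel_mi(x, y_dependent), kernel_mi(x, y_independent))
```

The cost is quadratic in the number of samples for each gene pair, which is where the M-dependent term of the complexity comes from.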
(algorithm details, cont’d) Establishing the network
Define a threshold $I_0$ to discard small MIs (a lower bound for an interaction).
Shuffle genes across microarray profiles and evaluate the MIs for these seemingly independent genes; choose $I_0$ based on what fraction of those MIs falls below the threshold.
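One way to picture the shuffling procedure: permute one profile of a randomly chosen gene pair, compute the MI, repeat, and take a high quantile of the resulting null values as the threshold. The sketch below does this with a crude binned MI as a stand-in for any estimator; the bin count, the 0.99 quantile, the number of shuffles and all names are assumptions made for the example.

```python
import math
import random

def binned_mi(x, y, bins=4):
    """Crude equal-frequency binned MI (nats); a stand-in for any MI estimator."""
    def to_bins(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        out = [0] * len(v)
        for rank, i in enumerate(order):
            out[i] = rank * bins // len(v)
        return out
    bx, by = to_bins(x), to_bins(y)
    m = len(x)
    joint, px, py = {}, {}, {}
    for a, b in zip(bx, by):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    return sum(c / m * math.log(c * m / (px[a] * py[b]))
               for (a, b), c in joint.items())

def null_threshold(profiles, n_shuffles=200, quantile=0.99, seed=0):
    """Quantile of the MI between pairs whose dependence was destroyed by shuffling."""
    rng = random.Random(seed)
    genes = list(profiles)
    null = []
    for _ in range(n_shuffles):
        a, b = rng.sample(genes, 2)
        shuffled = profiles[b][:]
        rng.shuffle(shuffled)  # break any dependence between a and b
        null.append(binned_mi(profiles[a], shuffled))
    null.sort()
    return null[min(len(null) - 1, int(quantile * len(null)))]

rng = random.Random(1)
g1 = [rng.gauss(0, 1) for _ in range(100)]
g2 = [v + 0.1 * rng.gauss(0, 1) for v in g1]
g3 = [rng.gauss(0, 1) for _ in range(100)]
profiles = {"g1": g1, "g2": g2, "g3": g3}
i0 = null_threshold(profiles)
print(i0, binned_mi(g1, g2))
```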
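Data processing inequality (DPI): if genes g1 and g2 interact only through g3, then $I(g_1, g_2) \le \min[I(g_1, g_3), I(g_2, g_3)]$.
ARACNE starts with a network in which every pair with $I_{ij} > I_0$ is joined by an edge; then, for every gene triplet whose three edges are all present, it removes the edge with the smallest MI.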
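The pruning step can be sketched as follows. Edges are first thresholded at I_0, then every fully connected triplet marks its weakest edge for removal, and the marked edges are dropped at the end so that the outcome does not depend on the order in which triplets are visited. The way the tolerance `eps` enters (the weakest edge is removed only if it is below (1 - eps) times the smaller of the other two) is an assumption for this example, as are all names.

```python
from itertools import combinations

def apply_dpi(mi, i0=0.0, eps=0.0):
    """mi: {(i, j): value} with i < j. Returns the edges that survive pruning."""
    edges = {pair: v for pair, v in mi.items() if v > i0}
    nodes = sorted({n for pair in edges for n in pair})
    doomed = set()
    for a, b, c in combinations(nodes, 3):
        triplet = [(a, b), (a, c), (b, c)]
        if all(t in edges for t in triplet):
            weakest = min(triplet, key=edges.get)
            others = [edges[t] for t in triplet if t != weakest]
            if edges[weakest] < min(others) * (1.0 - eps):
                doomed.add(weakest)
    return {pair: v for pair, v in edges.items() if pair not in doomed}

# Cascade 0 - 1 - 2: the indirect pair (0, 2) has the smallest MI.
mi = {(0, 1): 0.5, (1, 2): 0.4, (0, 2): 0.2}
print(apply_dpi(mi))            # the (0, 2) edge is removed
print(apply_dpi(mi, eps=1.0))   # with full tolerance nothing is removed
```

Looping over all triplets is cubic in the number of genes.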
(algorithm details, cont’d) Establishing the network
N = number of genes, M = number of samples
ARACNE’s algorithm complexity: $O(N^3 + N^2 M^2)$
$N^3$: DPI analysis; $N^2 M^2$: MI estimation (the number of pairwise interactions is of order $N^2$)
Perfect network reconstruction theorems
Thm 1: If the MIs are estimated with no errors and the true underlying interaction network is a tree with only pairwise interactions, then ARACNE will reconstruct it.
Thm 2: If the Chow-Liu maximum mutual information tree is a subnetwork of ARACNE’s network, then this is the true network.
Thm 3: “ARACNE will reconstruct tree-network topologies exactly.”
Comparative study results
Reconstruction of a class of synthetic transcriptional networks by Mendes et al. (cf. [1]) and of a human B lymphocyte genetic network from gene expression profile data.
Performance of ARACNE is compared against Bayesian Networks (using the LibB package) and Relevance Networks (similar to ARACNE, but with a less accurate MI estimation procedure and a less developed method of assigning statistical significance).
(results) Synthetic networks
100 genes, 200 interactions organized in two types of networks
1. Erdős-Rényi: each vertex is equally likely to be connected to any other vertex
2. Scale-free topology: the distribution of vertex connections obeys a power law
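To make the two topologies concrete, the sketch below draws one random graph of each kind with 100 genes and about 200 interactions. It is only an illustration: the uniform edge sampling and the simple preferential-attachment rule used here are stand-ins and are not the generator of Mendes et al. [1].

```python
import random
from itertools import combinations

def erdos_renyi(n_genes=100, n_edges=200, seed=0):
    """Pick n_edges gene pairs uniformly at random: every pair is equally likely."""
    rng = random.Random(seed)
    return set(rng.sample(list(combinations(range(n_genes), 2)), n_edges))

def preferential_attachment(n_genes=100, links_per_gene=2, seed=0):
    """Each new gene links to existing genes with probability proportional to
    their current degree, which yields a heavy-tailed degree distribution."""
    rng = random.Random(seed)
    edges = {(0, 1)}
    endpoints = [0, 1]  # each gene appears once per incident edge
    for new in range(2, n_genes):
        targets = set()
        while len(targets) < min(links_per_gene, new):
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.add((t, new))
            endpoints += [t, new]
    return edges

er = erdos_renyi()
sf = preferential_attachment()
print(len(er), len(sf))
```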
(results) Performance metrics
A pairwise gene interaction is
a “(true) positive” if the two genes are directly linked by a regulatory interaction;
a “(true) negative” if their interaction is not direct.
Precision = $N_{TP} / (N_{TP} + N_{FP})$: fraction of true interactions among all inferred ones (expected success rate in experimental validation of predicted interactions)
Recall = $N_{TP} / (N_{TP} + N_{FN})$: fraction of true interactions correctly inferred
Performance to be assessed via Precision-Recall curves (PRCs)
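For a single inferred network, both metrics follow directly from the counts of true positives, false positives and false negatives. A small sketch with made-up edge lists (edges are treated as undirected; the function name is hypothetical):

```python
def precision_recall(true_edges, inferred_edges):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN) for undirected edges."""
    true_set = {frozenset(e) for e in true_edges}
    inferred = {frozenset(e) for e in inferred_edges}
    n_tp = len(true_set & inferred)
    n_fp = len(inferred - true_set)
    n_fn = len(true_set - inferred)
    precision = n_tp / (n_tp + n_fp) if inferred else 0.0
    recall = n_tp / (n_tp + n_fn) if true_set else 0.0
    return precision, recall

true_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
inferred_edges = [(1, 0), (1, 2), (0, 2)]   # two correct, one false
precision, recall = precision_recall(true_edges, inferred_edges)
print(precision, recall)
```

Sweeping the MI threshold and recomputing the pair at each value traces out a precision-recall curve.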
(results cont’d) PRCs for synthetic data
[Figure: precision-recall curves for (1) the Erdős-Rényi and (2) the scale-free synthetic networks.]
ARACNE’s performance above 40% for both models
(results, cont’d) Quantitative results on synthetic data
ARACNE recovers far more true connections and predicts far fewer false ones than Bayesian Networks or Relevance Networks.
(results cont’d) Results on Human B cells
Assembled an expression profile data set of ~340 B lymphocyte samples from normal, tumor-related and experimentally manipulated populations.
The data set was deconvoluted by ARACNE to generate a B-cell-specific regulatory network of ~129,000 interactions.
The network’s quality was validated by comparing inferred interactions with those identified through biochemical methods.
See [3].
Conclusions and Discussions
1. The algorithm is robust enough to be applied to other network reconstruction problems in biology and in the social and engineering fields.
2. Because the model is pairwise, higher-order potential interactions will not be accounted for (ARACNE’s algorithm will open 3-gene loops).
3. A two-gene interaction will be detected iff there are no alternate paths.
4. To keep three-gene loops, relax the edge-removal criterion by introducing a tolerance parameter (the DPI tolerance).
5. ARACNE’s performance deteriorates as the local (true) network topology deviates from a tree (tight loops may be a problem).
6. ARACNE achieved high precision and substantial recall, even with few data points, when compared to BN and RN (synthetic data).
7. ARACNE cannot predict the orientation of the edges of the networks.
8. The algorithm is suited for more complex (mammalian) networks.
Bibliography
1. P. Mendes, W. Sha, K. Ye. Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics 2003, 19 Suppl 2: II122-II129.
2. I. Nemenman. Information theory, multivariate dependence and genetic network inference. Technical report: arXiv:q-bio/0406015; 2004.
3. K. Basso, A.A. Margolin, G. Stolovitzky, U. Klein, R. Dalla-Favera, A. Califano. Reverse engineering of regulatory networks in human B cells. Nature Genetics, 2005, 37(4):382-390.
Main web site
• Important documentation and relevant publications, application download and support.
AMDeC Bioinformatics Core Facility at the Columbia Genome Center
AMDeC (Academic Medicine Development Company)
http://amdec-bioinfo.cu-genome.org/html/ARACNE.htm