ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context
Adam A. Margolin, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo Stolovitzky, Riccardo Dalla Favera and Andrea Califano

Reverse engineering of regulatory networks in human B cells
Katia Basso, Adam Margolin, Gustavo Stolovitzky, Ulf Klein, Riccardo Dalla-Favera, and Andrea Califano

550.635 TOPICS IN BIOINFORMATICS

Instructor: Dr. Donald Geman
Student: Francisco Sánchez Vega

Baltimore, 28 November 2006

THE JOHNS HOPKINS UNIVERSITY

http://www.cis.jhu.edu/~sanchez/ARACNE_635_sanchez.ppt

Outline

• A very short introduction to MRF

• The ARACNE approach
  • Theoretical framework
  • Technical implementation

• Experiments and results
  • Synthetic data
  • Human B cells (paper by Basso et al.)

• Discussion

What is a Markov Random Field?
• Intuitive idea: concept of Markov chain

… → Xt0 → Xt1 → … → X(t-1) → Xt → X(t+1) → …

P(Xt | Xt0, Xt1, …, X(t-1)) = P(Xt | X(t-1))

A very short introduction to MRFs

" The world is a huge Markov chain "

(Ronald A. Howard)

What is a Markov Random Field?

• A Markov random field is a model of the joint probability distribution over a set X of random variables.

• A generalization of Markov chains to a process indexed by the sites of a graph and satisfying a Markov property with respect to a neighborhood system.


MRF: a constructive definition

• Let X={X1,…,XN} be the set of variables (the stochastic process) whose JPD we want to model.

• Start by considering a set of sites or vertices V={V1,…,VN}.

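• Define a neighborhood system: $\mathcal{N}=\{\mathcal{N}_v,\ v\in V\}$

where

(1) $\mathcal{N}_v\subset V$
(2) $v\notin\mathcal{N}_v$
(3) $v\in\mathcal{N}_u \Leftrightarrow u\in\mathcal{N}_v$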


MRF: a constructive definition
• Example: “N times N” square lattice

$$V=\{(i,j): 1\le i,j\le N\},\qquad \mathcal{N}_{(i,j)}=\{(k,l)\in V: 1\le (i-k)^2+(l-j)^2\le c\}$$

[Figure: neighborhoods of a lattice site for c=1, c=2 and c=8]


MRF: a constructive definition
• At this point, we can build an undirected graph G=(V, E)
  • Each vertex $v\in V$ is associated with one of the random variables in X
  • The set of edges E is given by the chosen neighborhood system.

[Figure: lattice graphs for c=1 and c=2]


MRF: a constructive definition
• Clique: induced complete subgraph

[Figure: cliques of the lattice graph for c=1 and c=2]


• We say that X is a MRF on $(V,\mathcal{N})$ if

(a) $P(X=x)>0$ for all $x\in S$

(b) $P(X_v=x_v \mid X_u=x_u,\ u\neq v)=P(X_v=x_v \mid X_u=x_u,\ u\in\mathcal{N}_v)$ for all $v\in V$


• Imagine that each variable $X_v$ in X={X1,…,XN} can take one of a finite number $L_v$ of values:

$$X_v\in\{0,1,\dots,L_v-1\},\quad v\in V$$

• Define the configuration space S as

$$S=\prod_{v\in V}\{0,1,\dots,L_v-1\}$$

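• Let us now associate a specific function, that we will call potential, to each clique $c\in C$, so that:

$$\phi_c(x)=\phi_c(x')\quad\text{for all } x'\in\{x': x'_v=x_v,\ v\in c\}$$

• We say that a probability distribution $p\in\mathcal{P}_S$ is a Gibbs distribution wrt $(V,\mathcal{N})$ if it is of the form:

$$p(x)=Z^{-1}\exp\Big(-\sum_{c\in C}\phi_c(x)\Big)$$

where Z is the partition function:

$$Z=\sum_{x\in S}\exp\Big(-\sum_{c\in C}\phi_c(x)\Big)$$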


Correspondence Theorem

X is a MRF with respect to $(V,\mathcal{N})$ if and only if

p(x)=P(X=x) is a Gibbs distribution with respect to $(V,\mathcal{N})$

(i.e. every given Markov Random Field has an associated Gibbs distribution and every given Gibbs distribution has an associated Markov Random Field).



Theoretical framework

The name of the game: Microarray data → Graphical model

• We need a strategy to deal with uncertainty under small samples:

Maximum entropy models

• Philosophical basis: model all that is known and assume nothing about that which is unknown.

• Given a certain dataset, choose a model which is consistent with it, but otherwise as “uniform” as possible


MaxEnt: toy example (adapted from A. Berger)
• Paul, Mary, Jane, Bryan and Emily are five grad students who work in the same research lab.
• Let us try to model the probability of the discrete random variable X = “first person to arrive at the lab in the morning”.

p(X=Paul)+p(X=Mary)+p(X=Jane)+p(X=Bryan)+p(X=Emily)=1

• If we do not know anything about them:

p(X=Paul) = 1/5

p(X=Mary) = 1/5

p(X=Jane) = 1/5

p(X=Bryan) = 1/5

p(X=Emily) = 1/5

The most “uniform” model is the one that maximizes the entropy H(p):

$$H(p(X))=-E[\log p(X)]$$


MaxEnt: toy example (adapted from A. Berger)
• Imagine that we know that Mary or Jane are the first ones to arrive 30% of the time.
• In that case, we know:

p(X=Paul)+p(X=Mary)+p(X=Jane)+p(X=Bryan)+p(X=Emily)=1
p(X=Mary)+p(X=Jane)=3/10

• Again, we may want to choose the most “uniform” model, although this time we need to respect the new constraint:

p(X=Paul) = (7/10)(1/3) = 7/30
p(X=Mary) = (3/10)(1/2) = 3/20
p(X=Jane) = (3/10)(1/2) = 3/20
p(X=Bryan) = (7/10)(1/3) = 7/30
p(X=Emily) = (7/10)(1/3) = 7/30

Maximize H(P(X)) under the given constraints
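The toy example can be checked numerically. The sketch below compares the entropy of the unconstrained uniform model, the constrained solution from the slide, and a third distribution (`skewed`, a made-up example that is not in the slides) which satisfies the same constraints but is less uniform.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a dict {outcome: probability}."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

people = ["Paul", "Mary", "Jane", "Bryan", "Emily"]

# No knowledge at all: the uniform model.
uniform = {name: 1 / 5 for name in people}

# Model from the slide, respecting p(Mary) + p(Jane) = 3/10.
maxent = {"Paul": 7 / 30, "Mary": 3 / 20, "Jane": 3 / 20,
          "Bryan": 7 / 30, "Emily": 7 / 30}

# Hypothetical alternative that satisfies the same constraints.
skewed = {"Paul": 0.5, "Mary": 0.2, "Jane": 0.1, "Bryan": 0.1, "Emily": 0.1}

for label, p in [("uniform", uniform), ("maxent", maxent), ("skewed", skewed)]:
    print(f"{label:8s} H = {entropy(p):.4f} nats")
```

Adding the constraint lowers the best achievable entropy below log 5, and any other feasible distribution such as `skewed` has lower entropy still.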


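MaxEnt extensions ~ constrained optimization problem

• $S$: countable set

• $\mathcal{P}_S$: set of prob. measures on S, $\mathcal{P}_S=\{p(x): x\in S\}$.

• $\Gamma$: set of K constraints for the problem
• $\mathcal{P}_\Gamma$: subset of $\mathcal{P}_S$ that satisfies all the constraints in $\Gamma$

• Let $\{\phi_1,\dots,\phi_K\}$ be a set of K functions $S\to\mathbb{R}$ [feature functions]

• We choose these functions so that each given constraint can be expressed as:

$$\sum_{x\in S}\phi_m(x)\,p(x)=\mu_m,\quad m=1,\dots,K$$

• If $x_1,\dots,x_M$ are samples from a r.v. X, then $E[\phi_m(X)]=\mu_m$, where $\mu_m$ is estimated by the sample mean:

$$\mu_m=\frac{1}{M}\sum_{i=1}^{M}\phi_m(x_i)$$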


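• Ex. Let x be a sample from X = “person to arrive first in the lab”

In this case, S={Paul, Mary, Jane, Bryan, Emily}, $\Gamma=\{\gamma_1\}$, with $\gamma_1$: p(X=Mary)+p(X=Jane)=3/10

• We can model $\gamma_1$ as follows. Define

$$\phi_1(x)=\begin{cases}1 & \text{if } x=\text{Mary or } x=\text{Jane}\\ 0 & \text{otherwise}\end{cases}\qquad \mu_1=3/10$$

So that

$$E[\phi_1(X)]=\sum_{x\in S}\phi_1(x)\,p(x)=3/10=\mu_1$$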



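Find $p^*(x)=\arg\max_{p\in\mathcal{P}_\Gamma} H(p(x))$

where $\mathcal{P}_\Gamma$ is the subset of $\mathcal{P}_S$ that satisfies all the constraints in $\Gamma$:

$$\sum_{x\in S}\phi_m(x)\,p(x)=\mu_m,\quad m=1,\dots,K$$

Of course, since p(x) is a probability measure:

$$\sum_{x\in S}p(x)=1$$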

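• Use Lagrange multipliers:

$$\Lambda(p;\lambda_0,\dots,\lambda_K)=\underbrace{-\sum_{x\in S}p(x)\log p(x)}_{H(p)}-\sum_{i=0}^{K}\lambda_i\Big(\sum_{x\in S}\phi_i(x)\,p(x)-\mu_i\Big)$$

where the term i=0 stands for the normalization constraint ($\phi_0(x)=1$, $\mu_0=1$).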

• This leads to a solution of the form:

$$p(x)=\frac{\exp\big(-\sum_{i=0}^{K}\lambda_i\phi_i(x)\big)}{\sum_{x\in S}\exp\big(-\sum_{i=0}^{K}\lambda_i\phi_i(x)\big)}=\frac{1}{Z}\exp\Big(-\sum_{i=0}^{K}\lambda_i\phi_i(x)\Big)$$

But this is a Gibbs distribution, and therefore, by the correspondence theorem, we can think in terms of the underlying Markov Random Field!

- We can profit from previous knowledge of MRFs

- We have found the graphical model paradigm that we were looking for!
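As a sanity check, the exponential form can be fitted to the toy example with its single feature (1 if x is Mary or Jane, 0 otherwise). The sketch below assumes the sign convention p(x) ∝ exp(−λ·φ1(x)) and finds λ by bisection; the helper names `gibbs` and `solve_lambda` are illustrative and not part of the slides.

```python
import math

people = ["Paul", "Mary", "Jane", "Bryan", "Emily"]

def phi1(x):
    """Feature function of the toy example."""
    return 1.0 if x in ("Mary", "Jane") else 0.0

def gibbs(lam):
    """Distribution p(x) proportional to exp(-lam * phi1(x))."""
    weights = {x: math.exp(-lam * phi1(x)) for x in people}
    z = sum(weights.values())  # partition function
    return {x: w / z for x, w in weights.items()}

def solve_lambda(target, lo=-20.0, hi=20.0, iters=100):
    """Bisection on lam so that E[phi1] equals the target value."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        mean = sum(p * phi1(x) for x, p in gibbs(mid).items())
        # E[phi1] decreases as lam grows, so move right when it is too large.
        if mean > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

lam = solve_lambda(3 / 10)
p = gibbs(lam)
print(lam, p)
```

The fitted distribution assigns 3/20 to Mary and Jane and 7/30 to each of the others, which is the solution obtained by hand in the toy example.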

Technical implementation

The name of the game: Microarray data → Graphical model

• Consider the discrete, binary case.

Approximation of the interaction structure

First order constraints (one per variable, N constraints in total):

$$\phi_k(x)=\begin{cases}1 & \text{if } x_k=1\\ 0 & \text{if } x_k=0\end{cases};\qquad E[\phi_k(x)]=\frac{1}{M}\sum_{i=1}^{M}x_{k,i}=\hat\mu_k,\qquad k=1,2,\dots,N$$

[Figure: graph with nodes X1, X2, X3, X4, X5]

Second order constraints (shown for the pair $(X_1,X_2)$):

$$\phi_{1,2,[a,b]}(x)=\begin{cases}1 & \text{if } x_1=a \text{ and } x_2=b\\ 0 & \text{otherwise}\end{cases},\qquad [a,b]\in\{[0,0],[0,1],[1,0],[1,1]\}$$

$$E[\phi_{1,2,[a,b]}(x)]=\frac{1}{M}\sum_{i=1}^{M}I\{x_{1,i}=a,\ x_{2,i}=b\}=\hat\mu_{1,2,a,b}$$

$4\binom{N}{2}$ constraints



For jth order: $2^j\binom{N}{j}$ constraints
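To get a feeling for how fast the number of constraints grows, the count for order j can be tabulated directly. This is a small sketch; `n_constraints` is an illustrative helper, and 10,000 genes is only an assumed order of magnitude for a microarray.

```python
from math import comb

def n_constraints(n_vars, order):
    """Number of order-j constraints for binary variables: 2^j * C(N, j)."""
    return 2 ** order * comb(n_vars, order)

for order in (2, 3, 4):
    print(order, n_constraints(10_000, order))
```

With only a few hundred samples available, this is why the expansion has to be truncated at low order.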

• The higher the order, the more accurate our approximation will be…

… provided we observe enough data !!

(from The Elements of Statistical Learning, by Hastie, Tibshirani and Friedman)


“…for M → ∞ (where M is sample set size) the complete form of the JPD is restored. In fact, M > 100 is generally sufficient to estimate 2-way marginals in genomics problems, while P(gi, gj, gk) requires about an order of magnitude more samples…”

(from Dr. Munos lectures, Master MVA)

• Therefore, the model they adopt is of the form:

$$P(\{g_i\})=\frac{1}{Z}\exp\Big[-\sum_{i}^{N}\phi_i(g_i)-\sum_{i,j}^{N}\phi_{ij}(g_i,g_j)\Big]$$

• All gene pairs for which $\phi_{ij}=0$ are declared mutually non-interacting:

  • Some of these genes are statistically independent, i.e. $P(g_i,g_j)\approx P(g_i)P(g_j)$

  • Some are genes that do not interact directly but are statistically dependent due to their interaction via other genes, i.e. $P(g_i,g_j)\neq P(g_i)P(g_j)$, but $\phi_{ij}=0$

How can we extract this information ??


[Figure: diagram classifying pairs of genes by whether $\phi_{ij}=0$ and whether $P(g_i,g_j)=P(g_i)P(g_j)$, with regions for:]

• Pairs of genes that interact through a third gene

• Pairs of genes whose direct interaction is balanced out by a third gene

• Pairs of genes for which the MI is a rightful indicator of dependency

• The mutual information between two random variables is a measure of the amount of information that one random variable contains about another.

• It is defined as the relative entropy (or Kullback-Leibler distance) between the joint distribution and the product of the marginals:

$$I(X,Y)=D\big(p(x,y)\,\|\,p(x)p(y)\big)=\sum_x\sum_y p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$$

• Alternative formulations:

$$I(X,Y)=S(X)+S(Y)-S(X,Y)=S(X)-S(X|Y)$$

Mutual Information

Another toy example

X  Y
1  0
1  1
0  0
0  0
0  1
1  1
1  0
0  1
0  1
0  1

P(X=0)=6/10; P(X=1)=4/10;

P(Y=0)=4/10; P(Y=1)=6/10;

P(X=0,Y=0)=2/10
P(X=0,Y=1)=4/10
P(X=1,Y=0)=2/10
P(X=1,Y=1)=2/10

$$I(X,Y)=\sum_{x,y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}=0.2\log\frac{0.2}{0.6\cdot 0.4}+0.4\log\frac{0.4}{0.6\cdot 0.6}+0.2\log\frac{0.2}{0.4\cdot 0.4}+0.2\log\frac{0.2}{0.4\cdot 0.6}\approx 0.0138\ \text{nats}\ (\approx 0.0200\ \text{bits})$$

[Figure: histograms of P(X), P(Y) and P(X,Y)]
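The toy computation can be reproduced from the ten (X, Y) samples with a plug-in estimate, i.e. by replacing each probability with its empirical frequency. A minimal sketch follows; `mutual_information` is an illustrative name.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) of two discrete samples."""
    m = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log( p(x,y) / (p(x) p(y)) ), written with raw counts.
    return sum((n / m) * math.log(n * m / (cx[x] * cy[y]))
               for (x, y), n in cxy.items())

# The ten samples of the toy example.
xs = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
ys = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

mi_nats = mutual_information(xs, ys)
print(f"{mi_nats:.4f} nats = {mi_nats / math.log(2):.4f} bits")
```

The value is small but not zero, which is expected for two weakly dependent variables observed on only ten samples.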

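Mutual Information estimation

The Gaussian Kernel estimator

$$f(\vec z)=\frac{1}{M}\sum_i h^{-2}\,G\big(h^{-1}|\vec z-\vec z_i|\big)$$

$$I(\{x_i\},\{y_i\})=\frac{1}{M}\sum_i\log\frac{f(x_i,y_i)}{f(x_i)\,f(y_i)}$$

where $\vec z_i=(x_i,y_i)$, G is the bivariate standard normal density and h is the kernel width.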
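The estimator can be sketched in a few lines of plain Python. This version assumes a single fixed width h for both the marginal and the joint densities and uses the fact that the bivariate standard normal factorizes into two univariate kernels; `gaussian_kernel_mi` and the synthetic data are illustrative and not taken from the papers.

```python
import math
import random

def gaussian_kernel_mi(xs, ys, h):
    """Kernel estimate of I(X;Y) in nats with one fixed kernel width h."""
    m = len(xs)
    total = 0.0
    for i in range(m):
        fx = fy = fxy = 0.0
        for j in range(m):
            kx = math.exp(-0.5 * ((xs[i] - xs[j]) / h) ** 2)
            ky = math.exp(-0.5 * ((ys[i] - ys[j]) / h) ** 2)
            fx += kx
            fy += ky
            fxy += kx * ky  # bivariate Gaussian kernel = product of the two
        # Normalize: 1/(M h sqrt(2 pi)) in 1D, 1/(M h^2 2 pi) in 2D.
        fx /= m * h * math.sqrt(2 * math.pi)
        fy /= m * h * math.sqrt(2 * math.pi)
        fxy /= m * h * h * 2 * math.pi
        total += math.log(fxy / (fx * fy))
    return total / m

# Synthetic check: a dependent pair versus an independent pair.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(200)]
ys_dep = [x + 0.3 * rng.gauss(0, 1) for x in xs]
ys_ind = [rng.gauss(0, 1) for _ in range(200)]

mi_dep = gaussian_kernel_mi(xs, ys_dep, h=0.3)
mi_ind = gaussian_kernel_mi(xs, ys_ind, h=0.3)
print(mi_dep, mi_ind)
```

The double loop costs O(M²) per gene pair, which is where the M² factor in the overall complexity of the algorithm comes from.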

Another toy example

[Figure: kernel density estimates f(x), f(y) and f(x,y) of the toy data, next to the histogram P(X,Y)]

[Figure: f(x,y) for different kernel widths: reference (h=1), h’ = 4h and h’’ = h/2]



• Every pair of variables is copula-transformed:

$$(X_1,X_2)\to[F_{X_1}(X_1),\ F_{X_2}(X_2)],\quad\text{where } F_X(x)=P(X\le x)$$

• By the Probability Integral Transform Theorem, we know that $F_X(X)\sim U[0,1]$.

• Thus, the resulting variables have range between 0 and 1, and marginals are uniform.
• Transformation is one-to-one ⇒ H and MI unaffected
• Reduction of the influence of arbitrary transformations from microarray data preprocessing.
• No need for position dependent kernel widths (h).
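In practice the true CDF is unknown, so a common stand-in is the empirical CDF, which amounts to replacing each value by its rank divided by M. The sketch below makes that assumption and ignores ties; `copula_transform` is an illustrative name.

```python
import math

def copula_transform(values):
    """Map each value to its empirical CDF value rank/M (no ties assumed)."""
    m = len(values)
    order = sorted(range(m), key=lambda i: values[i])
    u = [0.0] * m
    for rank, i in enumerate(order, start=1):
        u[i] = rank / m
    return u

expr = [3.0, 1.0, 2.0, 7.5]
u = copula_transform(expr)

# A monotone preprocessing step (here exp) leaves the transformed data unchanged.
u_after_exp = copula_transform([math.exp(v) for v in expr])
print(u, u_after_exp)
```

Because only ranks survive, any monotone normalization applied to the raw expression values has no effect on the subsequent MI estimate.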


• The choice of h is critical for the accuracy of the MI estimate, but not so important for estimation of MI ranks.


• At this point, we have a procedure to construct an undirected graph from our data.

• In an “ideal world” (infinite samples, assumptions hold), we would be done.

• Unfortunately, things are not so simple !

Technical implementation

• Finite random samples
• Possible higher-order interactions

For each pair:
• H0: the two genes are mutually independent (no edge)

• HA: the two genes interact with each other (edge)

• We reject the null hypothesis H0 (i.e. we “draw an edge”) when the MI between two genes is large enough.

Need to choose a statistical threshold I0

Use hypothesis testing


Choice of the statistical threshold

• Chosen approach: Random permutation analysis
  • Randomly shuffle gene expression values and labels

         Sample 1  Sample 2  Sample 3  …  Sample M
Gene 1   g11       g12       g13       …  g1M
Gene 2   g21       g22       g23       …  g2M
…        …         …         …         …  …
Gene N   gN1       gN2       gN3       …  gNM

• The resulting variables are supposed to be mutually independent, but their MI need not be equal to zero.

• Each threshold I0 can then be assigned a p-value (which measures the probability of getting a MI value higher or equal to I0 just “by chance”).

• Thus, by fixing a p-value, we obtain the desired I0.
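The permutation procedure can be sketched as follows. To keep the example short it assumes discretized expression values and a plug-in MI estimate instead of the Gaussian kernel estimator; `plugin_mi`, `permutation_threshold` and the random binary data are illustrative only.

```python
import math
import random
from collections import Counter

def plugin_mi(xs, ys):
    """Plug-in mutual information (nats) of two discrete samples."""
    m = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((n / m) * math.log(n * m / (cx[x] * cy[y]))
               for (x, y), n in cxy.items())

def permutation_threshold(expr, p_value, n_null=1000, seed=0):
    """Return I0 such that about a fraction p_value of null MIs reach it."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_null):
        gi, gj = rng.sample(range(len(expr)), 2)
        ys = expr[gj][:]
        rng.shuffle(ys)  # shuffling breaks any pairing between the two genes
        null.append(plugin_mi(expr[gi], ys))
    null.sort()
    n_above = max(1, round(p_value * n_null))
    return null[n_null - n_above]

# Hypothetical data: 20 genes, 100 samples, binary expression levels.
rng = random.Random(1)
expr = [[rng.randint(0, 1) for _ in range(100)] for _ in range(20)]
i0 = permutation_threshold(expr, p_value=0.05)
print(i0)
```

A pair of genes is then connected only if its MI reaches `i0`; a perfectly coupled pair (identical profiles) clears the threshold easily.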


[Figure: histogram of the MI values of the shuffled data (number of pairs vs. MI value), with the threshold I0 marked]

Parameter fitting: use a known fact from large deviation theory:

$$p\text{-value}=p(I\ge I_0\mid H_0)\sim e^{-M I_0}$$

(so that log p can be fitted as a linear function of I0)


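• If we have a real distribution of the type:

$$P(\{g_i\})=\frac{1}{Z}\exp\Big[-\sum_{i}^{N}\phi_i(g_i)-\sum_{i,j,k}^{N}\big(\phi_{ij}(g_i,g_j)+\phi_{jk}(g_j,g_k)+\phi_{ik}(g_i,g_k)+\phi_{ijk}(g_i,g_j,g_k)\big)\Big]$$

with $\phi_{ij}=0$, $\phi_{jk}=0$, $\phi_{ik}=0$,

ARACNE will get it wrong, because of our assumption of pairwise interactions only.

[Figure: real network vs. ARACNE’s output]

The extension of the algorithm to higher order interactions is a possible object of future research.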



$I_{ij}=0$

$$P(\{g_i\})=\frac{1}{Z}\exp\Big[-\sum_{i}^{N}\phi_i(g_i)-\phi_{ij}(g_i,g_j)-\phi_{jk}(g_j,g_k)-\phi_{ik}(g_i,g_k)\Big],\qquad \phi_{ij}\neq 0$$

ARACNE will not identify the edge between gi and gj

However, this situation is considered “biologically unrealistic”



$I_{ij}\neq 0$

$$P(\{g_i\})=\frac{1}{Z}\exp\Big[-\sum_{i}^{N}\phi_i(g_i)-\phi_{ij}(g_i,g_j)-\phi_{jk}(g_j,g_k)-\phi_{ik}(g_i,g_k)\Big],\qquad \phi_{ij}=0$$

As it is, ARACNE will put an edge between gi and gj

The algorithm can be improved using the Data Processing Inequality


• The DPI is a well known theorem within the Information Theory community.

Let X, Y, Z be 3 random variables that form a Markov chain X-Y-Z, then

$$I(X;Y)\ge I(X;Z)$$

Proof.
I(X; Y,Z) = I(X;Z) + I(X;Y|Z) = I(X;Y) + I(X;Z|Y)
X and Z are conditionally independent given Y ⇒ I(X;Z|Y) = 0
Thus, I(X;Y) = I(X;Z) + I(X;Y|Z) ≥ I(X;Z), since I(X;Y|Z) ≥ 0

Similarly, we can prove I(Y;Z) ≥ I(X;Z), and therefore:

$$I(X;Z)\le\min[I(X;Y),\ I(Y;Z)]$$

The Data Processing Inequality

• A consequence of the DPI is that no transformation of Y, however clever, can increase the information that Y contains about X.

(consider the MC of the form X-Y-[Z=g(Y)] )

• From an intuitive point of view:

[Figure: chain of three people, Mike - Francisco - Jean-Paul]
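The inequality is easy to verify numerically on a small Markov chain X → Y → Z, where the exact joint distributions follow from the transition matrices. The numbers below are arbitrary illustrative values, and `mi_from_joint` and `chain_mis` are hypothetical helper names.

```python
import math

def mi_from_joint(joint):
    """Exact mutual information (nats) from a joint probability table."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for a, row in enumerate(joint)
               for b, p in enumerate(row) if p > 0)

def chain_mis(px, t_xy, t_yz):
    """Return I(X;Y), I(Y;Z), I(X;Z) for the Markov chain X -> Y -> Z."""
    nx, ny, nz = len(px), len(t_xy[0]), len(t_yz[0])
    joint_xy = [[px[a] * t_xy[a][b] for b in range(ny)] for a in range(nx)]
    py = [sum(joint_xy[a][b] for a in range(nx)) for b in range(ny)]
    joint_yz = [[py[b] * t_yz[b][c] for c in range(nz)] for b in range(ny)]
    # Z depends on X only through Y: sum over the intermediate state.
    joint_xz = [[sum(joint_xy[a][b] * t_yz[b][c] for b in range(ny))
                 for c in range(nz)] for a in range(nx)]
    return mi_from_joint(joint_xy), mi_from_joint(joint_yz), mi_from_joint(joint_xz)

i_xy, i_yz, i_xz = chain_mis(
    px=[0.5, 0.5],
    t_xy=[[0.9, 0.1], [0.2, 0.8]],  # P(Y | X)
    t_yz=[[0.7, 0.3], [0.1, 0.9]],  # P(Z | Y)
)
print(i_xy, i_yz, i_xz)
```

Whatever transition matrices are chosen, the indirect pair (X, Z) never carries more information than either of the two direct links.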


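• Going back to our case study:

[Figure: three genes gi, gj, gk whose pairwise MI values are 0.1, 0.2 and 0.3]

$$P(\{g_i\})=\frac{1}{Z}\exp\Big[-\sum_{i}^{N}\phi_i(g_i)-\sum_{i,j}^{N}\phi_{ij}(g_i,g_j)\Big]$$

So I assume that $\phi_{ij}=0$, even though $P(g_i,g_j)\neq P(g_i)P(g_j)$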


• But, what if the underlying network is truly a three-gene loop?

[Figure: three-gene loop gi, gj, gk]

…Then ARACNE will break the loop at the weakest edge! (known flaw of the algorithm)

Philosophy: “An interaction is retained iff there exist no alternate paths which are a better explanation for the information exchange between two genes”

Claim: In practice, looking at the TP vs. FP tradeoff, it pays to simplify


• Theorem 1. If MIs can be estimated with no errors, then ARACNE reconstructs the underlying interaction network exactly, provided this network is a tree and has only pairwise interactions.

• Theorem 2. The Chow-Liu (CL) maximum mutual information tree is a subnetwork of the network reconstructed by ARACNE.

• Theorem 3. Let πik be the set of nodes forming the shortest path in the network between nodes i and k. Then, if MIs can be estimated without errors, ARACNE reconstructs an interaction network without false positive edges, provided: (a) the network consists only of pairwise interactions, (b) for each j ∈ πik, Iij ≥ Iik. Further, ARACNE does not produce any false negatives, and the network reconstruction is exact iff (c) for each directly connected pair (ij) and for any other node k, we have Iij ≥ min(Ijk, Iik).


• Proof of Theorem 1.

(a) MIs can be estimated with no errors

(b) Network is a tree

(c) Only pairwise interactions

⇒ ARACNE reconstructs the true network without errors

- (c) ⇒ no problem with higher order interactions
- (a) ⇒ blue area boundary is ok
- (b) ⇒ red area is contained in yellow area
- (a), (b), DPI ⇒ yellow area is ok (every edge with $\phi=0$ is removed and only edges with $\phi=0$ are removed)

The Chow-Liu maximum mutual information tree (1968)

• Method for approximating the JPD of a set of discrete variables using products of distributions involving no more than pairs of variables.

• The Chow-Liu method approximates a distribution P(x) by T(x), a tree structured MRF.

• They proved that minimizing the Kullback-Leibler distance between P and T amounts to maximizing the total mutual information of the edges of T.
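Since the best tree is the one with the largest total MI over its edges, it can be found with any maximum-weight spanning tree algorithm. The sketch below uses Kruskal's algorithm with a small union-find; the function name and the three-gene MI values are illustrative.

```python
def chow_liu_tree(n_nodes, mi):
    """Maximum-MI spanning tree; mi is a dict {(i, j): value} with i < j."""
    parent = list(range(n_nodes))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    # Kruskal: scan edges from strongest to weakest, skip those closing a cycle.
    for (i, j), _ in sorted(mi.items(), key=lambda kv: kv[1], reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

mi = {(0, 1): 0.3, (1, 2): 0.2, (0, 2): 0.1}
print(chow_liu_tree(3, mi))
```

On a three-gene loop the tree drops the weakest edge, which is the same edge ARACNE removes when it applies the DPI to that triplet.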


• In practice, the variance of the MI estimator may lead to the use of a tolerance $\varepsilon$ so that the DPI relations become of the form:

$$I_{ij}\le I_{ik}(1-\varepsilon)$$

[Figure: panels for the Erdös-Rényi and scale-free topologies]

The final algorithm

1. Choice of a threshold I0

2. Compute all pairwise MIs

3. Draw an edge when MI ≥ I0

4. Look at all the three-gene loops and prune the edge with the lowest MI.
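Steps 3 and 4 can be sketched on top of a precomputed MI matrix. This is an illustrative sketch rather than the authors' implementation: it assumes a symmetric matrix `mi`, an optional tolerance `eps`, and that every triplet is examined against the thresholded graph, with edges only marked during the scan and removed at the end.

```python
from itertools import combinations

def aracne_prune(mi, i0, eps=0.0):
    """Threshold the MI matrix at i0, then apply the DPI to every triplet."""
    n = len(mi)
    # Step 3: keep the pairs whose MI reaches the threshold.
    edges = {(i, j) for i, j in combinations(range(n), 2) if mi[i][j] >= i0}
    to_remove = set()
    # Step 4: in every fully connected triplet, mark the weakest edge.
    for i, j, k in combinations(range(n), 3):
        trio = [(i, j), (i, k), (j, k)]
        if all(e in edges for e in trio):
            weakest = min(trio, key=lambda e: mi[e[0]][e[1]])
            others = [e for e in trio if e != weakest]
            w = mi[weakest[0]][weakest[1]]
            if all(w <= mi[a][b] * (1 - eps) for a, b in others):
                to_remove.add(weakest)
    return edges - to_remove

# Hypothetical MI values for three genes forming a loop.
mi = [[0.0, 0.3, 0.1],
      [0.3, 0.0, 0.2],
      [0.1, 0.2, 0.0]]
print(sorted(aracne_prune(mi, i0=0.05)))
```

With a positive tolerance, an edge is removed only when it is clearly weaker than both alternatives, so fewer loops are opened.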

Technical implementation: summary

Complexity of the final algorithm:

• Step 2: $\binom{N}{2}$ pairs × M samples
• Step 4: at most $\binom{N}{3}$ triplets
• Overall: $O(N^2M^2+N^3)$


• Two different sets of experiments are presented:
(a) Reconstruction of synthetic networks

(b) Reconstruction of a human B lymphocyte genetic network from microarray data.

• ARACNE’s performance is compared to that of two competing methodologies:
  • Bayesian networks
  • Relevance networks

Experiments and results

• Model proposed by Mendes et al. as a platform for comparison of reverse engineering algorithms.

• Simplification of real biological networks…

• …but reasonably complex to model some aspects of transcriptional regulation.

“An algorithm that does not perform well on this model is unlikely to perform well in a more complex case”

Synthetic data: AGNs

• Transcriptional interactions are approximated as:

where

$x_i$: level of expression of the i-th gene

$N_I$: number of upstream inhibitors

$N_A$: number of activators

$I_j$: concentration of upstream inhibitors

$A_l$: concentration of activators…


• In order to generate M samples, the parameters for gene i in imaginary microarray k are:

$$a_{i,k}=\alpha_{i,k}\,a_i,\qquad b_{i,k}=\beta_{i,k}\,b_i$$

where $a_i$, $b_i$ are some original constant values of the parameters and $\alpha_{i,k}$, $\beta_{i,k}$ are uniform random variables in [0,2].

This simulates the sampling of a population of distinct phenotypes at random time points, where the efficiency of biochemical reactions may also be distinct.


• Two possible network topologies are considered:

Erdös-Rényi

Each vertex of the graph is equally likely to be connected to any other vertex.

i.e. The presence of an edge between each possible pair of genes is modeled as a Bernoulli random variable with parameter p.
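An Erdös-Rényi topology is straightforward to generate: draw one Bernoulli(p) variable per pair of genes. A minimal sketch follows; the function name, the network size and the value of p are illustrative.

```python
import random
from itertools import combinations

def erdos_renyi_edges(n_genes, p, seed=None):
    """Include each of the C(n, 2) possible edges independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i, j in combinations(range(n_genes), 2)
            if rng.random() < p]

edges = erdos_renyi_edges(100, p=0.02, seed=0)
print(len(edges), "edges out of", 100 * 99 // 2, "possible pairs")
```

The expected number of edges is p·N(N−1)/2, and the node degrees concentrate around p·(N−1), so large hubs are unlikely under this model.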



Scale-free

The distribution of the number of connections k associated with each vertex follows a power law:

$$p(k)\sim k^{-\gamma},\quad\gamma>0$$

This gives rise to large interaction hubs.

[Figure: example networks with Erdös-Rényi and scale-free topologies]


• Two performance indicators:

• Recall: NTP/(NTP+NFN)

(fraction of the true edges that are identified)

• Precision: NTP/(NTP+NFP)

(fraction of the identified edges that are true)
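Both indicators can be computed by comparing the set of inferred edges with the set of true edges. In the sketch below edges are treated as undirected, so (i, j) and (j, i) count as the same edge; the function name and the small example are illustrative.

```python
def precision_recall(true_edges, inferred_edges):
    """Return (precision, recall) for undirected edge lists."""
    truth = {frozenset(e) for e in true_edges}
    found = {frozenset(e) for e in inferred_edges}
    n_tp = len(truth & found)   # true edges that were identified
    n_fp = len(found - truth)   # identified edges that are not true
    n_fn = len(truth - found)   # true edges that were missed
    precision = n_tp / (n_tp + n_fp) if found else 0.0
    recall = n_tp / (n_tp + n_fn) if truth else 0.0
    return precision, recall

true_edges = [(0, 1), (1, 2), (2, 3)]
inferred_edges = [(1, 0), (1, 2), (0, 3)]
print(precision_recall(true_edges, inferred_edges))
```

Here two of the three inferred edges are correct and two of the three true edges are recovered, so both indicators equal 2/3.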


• ARACNE outperforms its competitors for sufficiently small choices of p-values

• When the p-value is not small enough, non-statistically significant MI values are accepted and the algorithm's performance breaks down.


[Figure: results for the Erdös-Rényi and scale-free topologies]

• ARACNE’s performance depends on MI being high for directly interacting genes and decreasing rapidly with the interaction distance.


• Performance is stable in terms of kernel width as long as it is not too narrow.


• Better performance for Erdös-Rényi (fewer loops)
• High precision and substantial recall, even for small samples


• Results for the case of human B cells are described in detail in the second paper:

“Reverse engineering of regulatory networks in human B cells”

Basso et al., Nature Genetics, 37, 382-390, 2005;

Real data: Human B cells

• Large gene expression datasets such as those derived from systematic perturbations to simple organisms (yeast) are not that easily obtained for mammalian cells.

• The more complex the organism becomes, the more difficult it is to distinguish between physical and functional interactions.


• Authors assume that an equivalent dynamic richness can be obtained for a given type of human cell.

• They choose to work with 340 B lymphocytes derived from normal, tumor-related and experimentally manipulated populations.

• They feed the data to ARACNE, which generates a regulatory network containing approx. 129,000 interactions.


• Interesting observation:
  • The results show a power-law tail in the relationship between the number of genes n and their number of interactions.

  • This is suggestive of a scale-free underlying network structure.

  • Important because the evidence of scale-free topology in higher-order eukaryotic cells is still scarce.


• They use Gene Ontology to analyze the biological processes affected by the most relevant (top 5%) hubs.

• Focus on the c-MYC proto-oncogene:
  • Because it is well characterized as a transcription factor, which helps validate the results.
  • Subnetwork with 2,063 genes (56 directly connected to MYC).
  • The network is handicapped by certain limitations:
    • The edges are undirected
    • Some direct connections are due to certain intermediates which are not even represented on the microarray
    • Some direct relations are incorrectly removed by the DPI


• Some results of interest can be summarized as follows:
  • 29/56 (51.8%) of the first neighbors presumed “correct” (either reported in literature or ChIP validated in lab)

  • This is statistically significant wrt the expected 11% of background c-MYC targets among randomly selected genes.

  • Furthermore, c-MYC target genes are significantly more enriched among first neighbors than among second neighbors (51.8% vs 19.4%).



• ARACNE provides a provably exact network reconstruction under a controlled set of approximations.

• Some limitations:
  • ARACNE opens all three-gene loops along the weakest edge (i.e. false negatives for triplets of truly interacting genes)

  • Only statistical dependencies expressed as pairwise interaction potentials can be inferred (solution: extend the model to higher orders)

  • Edges are undirected (inevitable with non-temporal data?)

  • Some ambiguities concerning the interpretation of the inferred irreducible statistical dependencies.

  • Beware of loopy underlying topologies (go for tree-like)

Discussion

• Some virtues:
  • Low computational complexity

  • No need to discretize expression levels

  • Does not rely on unrealistic network models or a priori assumptions (excuse-me ?? :-|)

  • Avoidance of heuristic/stochastic search procedures

  • High precision and recall on synthetic data

  • Ability to infer genetic interactions on a genome-wide scale from gene-expression profiles of mammalian cells.

Discussion

• There is no “free lunch”: in order to effectively deal with the small sample regime some simplifying assumptions need to be made.

• Yet, it seems that this is the way to go: “divide and conquer paradigm”
The big problem of inferring gene interaction networks should be broken down into smaller subproblems that can be addressed by relatively simple classes of models. Each model will then rely on strong assumptions, but they should be able to deal with complex scenarios when carefully chosen and properly combined.

• No magic recipes, a long and winding road ahead… (remember Minsky?)

My point of view

“…A well-known anecdote relates how, sometime in 1966, the legendary Artificial Intelligence pioneer Marvin Minsky directed an undergraduate student to solve "the problem of computer vision" as a summer project. This anecdote is often resuscitated to illustrate how egregiously the difficulty of computational vision has been underestimated. Indeed, nearly forty years later, the discipline continues to confront numerous unsolved (and perhaps unsolvable) challenges, particularly with respect to high-level "image understanding" issues such as pattern recognition and feature recognition. Nevertheless, the intervening decades of research have yielded a great wealth of well-understood, low-level techniques that are able, under controlled circumstances, to extract meaningful information from a camera scene. These techniques are indeed elementary enough to be implemented by novice programmers at the undergraduate or even high-school level…”

(from “Computer Vision for Artists and Designers” by Golan Levin)

My point of view

Thanks for your attention
