ARACNE: An Algorithm for the Reconstruction of Gene Regulatory


Page 1: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context
Adam A Margolin, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo Stolovitzky, Riccardo Dalla Favera and Andrea Califano.

Reverse engineering of regulatory networks in human B cells
Katia Basso, Adam Margolin, Gustavo Stolovitzky, Ulf Klein, Riccardo Dalla-Favera, and Andrea Califano

550.635 TOPICS IN BIOINFORMATICS

Instructor: Dr. Donald Geman
Student: Francisco Sánchez Vega

Baltimore, 28 November 2006

THE JOHNS HOPKINS UNIVERSITY

http://www.cis.jhu.edu/~sanchez/ARACNE_635_sanchez.ppt

Page 2: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

Outline

• A very short introduction to MRFs

• The ARACNE approach
  • Theoretical framework
  • Technical implementation

• Experiments and results
  • Synthetic data
  • Human B cells (paper by Basso et al.)

• Discussion

Page 3: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

What is a Markov Random Field?

• Intuitive idea: concept of Markov chain

$X_{t_0} \to X_{t_1} \to \dots \to X_{t-1} \to X_t \to X_{t+1} \to \dots$

$P(X_t \mid X_{t_0}, X_{t_1}, \dots, X_{t-1}) = P(X_t \mid X_{t-1})$

A very short introduction to MRFs

"The world is a huge Markov chain"

(Ronald A. Howard)

Page 4: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

What is a Markov Random Field?

• A Markov random field is a model of the joint probability distribution over a set X of random variables.

• It generalizes Markov chains to a process indexed by the sites of a graph and satisfying a Markov property with respect to a neighborhood system.

A very short introduction to MRFs

Page 5: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

MRF: a constructive definition

• Let X={X1,…,XN} be the set of variables (the stochastic process) whose joint probability distribution (JPD) we want to model.

• Start by considering a set of sites or vertices V={V1,…,VN}.

• Define a neighborhood system $\mathcal{N} = \{N_v,\ v \in V\}$, where

(1) $N_v \subset V$

(2) $v \notin N_v$

(3) $v \in N_u \iff u \in N_v$

A very short introduction to MRFs

Page 6: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

MRF: a constructive definition

• Example: “N times N” square lattice

$V = \{(i,j) : 1 \le i, j \le N\}$

$N_{(i,j)} = \{(k,l) \in V : 1 \le (i-k)^2 + (l-j)^2 \le c\}$

[Figure: lattice neighborhoods for c = 1, c = 2 and c = 8]

A very short introduction to MRFs
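
The lattice neighborhood system above can be enumerated directly. The following is a minimal sketch, assuming 1-based site coordinates and the squared-distance rule from the slide; the function name `lattice_neighbors` is hypothetical and only meant for illustration.

```python
def lattice_neighbors(n, c):
    """Neighborhood system on an n-by-n lattice.

    Site (k, l) is a neighbor of (i, j) when
    1 <= (i - k)^2 + (l - j)^2 <= c.
    """
    sites = [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]
    return {
        (i, j): {
            (k, l)
            for (k, l) in sites
            if 1 <= (i - k) ** 2 + (l - j) ** 2 <= c
        }
        for (i, j) in sites
    }


# Neighbors of the central site of a 5x5 lattice for the three values of c.
for c in (1, 2, 8):
    print(c, len(lattice_neighbors(5, c)[(3, 3)]))
```

The lower bound 1 excludes the site itself, and the symmetric distance makes the neighbor relation symmetric, which are exactly the requirements placed on a neighborhood system.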

Page 7: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

MRF: a constructive definition

• At this point, we can build an undirected graph G=(V, E)
• Each vertex $v \in V$ is associated with one of the random variables in X
• The set of edges E is given by the chosen neighborhood system.

[Figure: lattice graphs for c = 1 and c = 2]

A very short introduction to MRFs

Page 8: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

MRF: a constructive definition

• Clique: induced complete subgraph

[Figure: cliques for c = 1 and c = 2]

A very short introduction to MRFs

Page 9: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

MRF: a constructive definition

• Imagine that each variable Xv in X={X1,…,XN} can take one of a finite number Lv of values:

$X_v \in \{0, 1, \dots, L_v - 1\}, \quad v \in V$

• Define the configuration space S as

$S = \prod_{v \in V} \{0, 1, \dots, L_v - 1\}$

• We say that X is an MRF on $(V, \mathcal{N})$ if

(a) $P(X = x) > 0$ for all $x \in S$

(b) $P(X_v = x_v \mid X_u = x_u,\ u \ne v) = P(X_v = x_v \mid X_u = x_u,\ u \in N_v)$ for all $v \in V$

A very short introduction to MRFs

Page 10: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

MRF: a constructive definition

[Figure: panels for c = 1 and c = 2]

A very short introduction to MRFs

Page 11: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

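• Let us now associate a specific function, which we will call a potential, to each clique $c \in C$, so that:

$\phi_c(x) = \phi_c(x') \quad \text{whenever } x_v = x'_v \text{ for all } v \in c$

• We say that a probability distribution $p \in P_S$ is a Gibbs distribution with respect to $(V, \mathcal{N})$ if it is of the form:

$p(x) = Z^{-1} \exp\Big( \sum_{c \in C} \phi_c(x) \Big)$

where Z is the partition function:

$Z = \sum_{x \in S} \exp\Big( \sum_{c \in C} \phi_c(x) \Big)$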

A very short introduction to MRFs

Page 12: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

Correspondence Theorem

X is an MRF with respect to $(V, \mathcal{N})$ if and only if

p(x)=P(X=x) is a Gibbs distribution with respect to $(V, \mathcal{N})$

(i.e. every given Markov Random Field has an associated Gibbs distribution and every given Gibbs distribution has an associated Markov Random Field).

A very short introduction to MRFs

Page 13: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

Outline

• A very short introduction to MRFs

• The ARACNE approach
  • Theoretical framework
  • Technical implementation

• Experiments and results
  • Synthetic data
  • Human B cells (paper by Basso et al.)

• Discussion

Page 14: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

Microarray data

Graphical model

Theoretical framework

The name of the game

Page 15: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

Maximum entropy models

• We need a strategy to deal with uncertainty under small samples.

• Philosophical basis:
  • Model all that is known and assume nothing about that which is unknown.
  • Given a certain dataset, choose a model which is consistent with it, but otherwise as “uniform” as possible.

Theoretical framework

Page 16: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

MaxEnt: toy example (adapted from A. Berger)

• Paul, Mary, Jane, Bryan and Emily are five grad students who work in the same research lab.
• Let us try to model the probability of the discrete random variable X = “first person to arrive at the lab in the morning”.

p(X=Paul)+p(X=Mary)+p(X=Jane)+p(X=Bryan)+p(X=Emily)=1

• If we do not know anything about them:

p(X=Paul) = 1/5

p(X=Mary) = 1/5

p(X=Jane) = 1/5

p(X=Bryan) = 1/5

p(X=Emily) = 1/5

The most “uniform” model is the one that maximizes the entropy H(p(X)):

$H(p(X)) = -E[\log p(X)]$

Theoretical framework

Page 17: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

MaxEnt: toy example (adapted from A. Berger)

• Imagine that we know that either Mary or Jane is the first to arrive 30% of the time.
• In that case, we know:

p(X=Paul)+p(X=Mary)+p(X=Jane)+p(X=Bryan)+p(X=Emily)=1
p(X=Mary)+p(X=Jane)=3/10

• Again, we may want to choose the most “uniform” model, although this time we need to respect the new constraint:

p(X=Paul) = (7/10)(1/3) = 7/30
p(X=Mary) = (3/10)(1/2) = 3/20
p(X=Jane) = (3/10)(1/2) = 3/20
p(X=Bryan) = (7/10)(1/3) = 7/30
p(X=Emily) = (7/10)(1/3) = 7/30

Maximize H(p(X)) under the given constraints

Theoretical framework
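
As a quick numerical check of the toy example, the sketch below compares the entropy of three distributions: the uniform one, the constrained solution from the slide, and another distribution that also satisfies p(Mary)+p(Jane)=3/10. The `skewed` distribution is an arbitrary illustration, not taken from the slides.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a distribution given as {outcome: prob}."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)


names = ["Paul", "Mary", "Jane", "Bryan", "Emily"]

# No knowledge at all: the uniform model.
uniform = {name: 1 / 5 for name in names}

# The slide's solution under p(Mary) + p(Jane) = 3/10.
constrained = {"Paul": 7 / 30, "Mary": 3 / 20, "Jane": 3 / 20,
               "Bryan": 7 / 30, "Emily": 7 / 30}

# Another model that respects the same constraint (illustrative only).
skewed = {"Paul": 0.5, "Mary": 0.2, "Jane": 0.1, "Bryan": 0.1, "Emily": 0.1}

for label, p in [("uniform", uniform), ("constrained", constrained), ("skewed", skewed)]:
    print(label, round(entropy(p), 4))
```

The constrained solution has lower entropy than the uniform model (it encodes more knowledge) but higher entropy than the other feasible model, which is what "as uniform as possible" means here.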

Page 18: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

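MaxEnt extensions ~ constrained optimization problem

• $S$: countable set

• $P_S$: set of probability measures on $S$, $P_S = \{p(x) : x \in S\}$

• $\Gamma$: set of K constraints for the problem

• $P_\Gamma$: subset of $P_S$ that satisfies all the constraints in $\Gamma$

• Let $\{\phi_1, \dots, \phi_K\}$ be a set of K functions $S \to \mathbb{R}$ [feature functions]

• We choose these functions so that each given constraint can be expressed as:

$\sum_{x \in S} \phi_m(x)\, p(x) = \alpha_m, \quad m = 1, \dots, K$

• If $x_1, \dots, x_M$ are samples from a r.v. X, then $E[\phi_m(X)] = \alpha_m$, estimated by

$\alpha_m \approx \hat{\alpha}_m = \frac{1}{M} \sum_{i=1}^{M} \phi_m(x_i)$

Theoretical framework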

Page 19: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

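MaxEnt extensions ~ constrained optimization problem

• Ex. Let x be a sample from X = “person to arrive first in the lab”

In this case, S = {Paul, Mary, Jane, Bryan, Emily} and $\Gamma = \{\gamma_1\}$, with $\gamma_1$: p(X=Mary)+p(X=Jane)=3/10

• We can model $\gamma_1$ as follows. Define

$\phi_1(x) = \begin{cases} 1 & \text{if } x = \text{Mary or } x = \text{Jane} \\ 0 & \text{otherwise} \end{cases}, \qquad \alpha_1 = 3/10$

So that

$E[\phi_1(x)] = \sum_{x \in S} \phi_1(x)\, p(x) = 3/10$

Theoretical framework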

Page 20: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

MaxEnt extensions ~ constrained optimization problem

Find $p^*(x) = \arg\max_{p \in P_\Gamma} H(p(x))$

$P_\Gamma$: subset of $P_S$ that satisfies all the constraints in $\Gamma$:

$\sum_{x \in S} \phi_m(x)\, p(x) = \alpha_m, \quad m = 1, \dots, K$

Of course, since p(x) is a probability measure:

$\sum_{x \in S} p(x) = \alpha_0 = 1$

Theoretical framework

Page 21: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

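• Use Lagrange multipliers:

$A(p; \lambda_0, \dots, \lambda_K) = \underbrace{-\sum_{x \in S} p(x) \log p(x)}_{H(p)} + \sum_{i=0}^{K} \lambda_i \Big( \sum_{x \in S} \phi_i(x)\, p(x) - \alpha_i \Big)$

where the term i = 0 is the normalization constraint ($\phi_0(x) \equiv 1$, $\alpha_0 = 1$).

Theoretical framework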
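
The step from the Lagrangian to the exponential form can be spelled out as follows. This is a sketch under assumed notation: $\phi_i$ are the feature functions, $\alpha_i$ their target expectations, $\lambda_i$ the multipliers, and index 0 carries the normalization constraint ($\phi_0 \equiv 1$, $\alpha_0 = 1$).

```latex
\begin{align*}
A(p;\lambda) &= -\sum_{x \in S} p(x)\log p(x)
  + \sum_{i=0}^{K} \lambda_i \Big( \sum_{x \in S} \phi_i(x)\,p(x) - \alpha_i \Big) \\
\frac{\partial A}{\partial p(x)} &= -\log p(x) - 1 + \sum_{i=0}^{K} \lambda_i \phi_i(x) = 0 \\
\Rightarrow\quad p(x) &= \exp\Big( \sum_{i=0}^{K} \lambda_i \phi_i(x) - 1 \Big)
  \;\propto\; \exp\Big( \sum_{i=1}^{K} \lambda_i \phi_i(x) \Big)
\end{align*}
```

Since $\phi_0$ is constant, the factor $e^{\lambda_0 - 1}$ does not depend on $x$; imposing $\sum_x p(x) = 1$ turns it into the $1/Z$ normalization.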

Page 22: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• This leads to a solution of the form:

$p(x) = \frac{\exp\big( \sum_{i=0}^{K} \lambda_i \phi_i(x) \big)}{\sum_{x' \in S} \exp\big( \sum_{i=0}^{K} \lambda_i \phi_i(x') \big)} = \frac{1}{Z} \exp\Big( \sum_{i=0}^{K} \lambda_i \phi_i(x) \Big)$

But this is a Gibbs distribution, and therefore, by the correspondence theorem, we can think in terms of the underlying Markov Random Field!

- We can profit from previous knowledge of MRFs

- We have found the graphical model paradigm that we were looking for!

Theoretical framework
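
To make the exponential (Gibbs) form concrete, the sketch below applies it to the "first person to arrive" example: a single feature function that equals 1 for Mary or Jane, and one multiplier `lam` found by bisection so that the expected feature value is 3/10. The function names and the bisection solver are illustrative choices, not part of the original slides.

```python
import math

NAMES = ["Paul", "Mary", "Jane", "Bryan", "Emily"]


def phi1(x):
    """Feature function: 1 if Mary or Jane arrives first, 0 otherwise."""
    return 1.0 if x in ("Mary", "Jane") else 0.0


def gibbs(lam):
    """Distribution p(x) = exp(lam * phi1(x)) / Z over the five students."""
    weights = {x: math.exp(lam * phi1(x)) for x in NAMES}
    z = sum(weights.values())  # partition function
    return {x: w / z for x, w in weights.items()}


def expected_phi1(lam):
    p = gibbs(lam)
    return sum(phi1(x) * p[x] for x in NAMES)


def solve_lambda(target, lo=-20.0, hi=20.0, iters=200):
    """Bisection on lam; expected_phi1 is increasing in lam."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_phi1(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


lam = solve_lambda(3 / 10)
p = gibbs(lam)
print(lam, p)
```

The fitted distribution matches the values obtained by hand in the toy example (3/20 for Mary and Jane, 7/30 for the others), with the multiplier equal to log(9/14).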

Page 23: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

Microarray data

Graphical model

Theoretical framework

Technicalimplementation

The name of the game

Page 24: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

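• Consider the discrete, binary case.

First order constraints:

$\phi_1(x) = \begin{cases} 1 & \text{if } x_1 = 1 \\ 0 & \text{if } x_1 = 0 \end{cases}; \qquad E[\phi_1(x)] \approx \hat{\alpha}_1 = \frac{1}{M} \sum_i x_{1,i}$

$\phi_2(x) = \begin{cases} 1 & \text{if } x_2 = 1 \\ 0 & \text{if } x_2 = 0 \end{cases}; \qquad E[\phi_2(x)] \approx \hat{\alpha}_2 = \frac{1}{M} \sum_i x_{2,i}$

$\vdots$

$\phi_5(x) = \begin{cases} 1 & \text{if } x_5 = 1 \\ 0 & \text{if } x_5 = 0 \end{cases}; \qquad E[\phi_5(x)] \approx \hat{\alpha}_5 = \frac{1}{M} \sum_i x_{5,i}$

N constraints

[Figure: graph with nodes X1, X2, X3, X4, X5]

Approximation of the interaction structure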

Page 25: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

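• Consider the discrete, binary case.

Second order constraints, for the pair (1,2):

$\phi_{1,2,[0,0]}(x) = \begin{cases} 1 & \text{if } x_1 = 0 \text{ and } x_2 = 0 \\ 0 & \text{otherwise} \end{cases}$

$\phi_{1,2,[0,1]}(x) = \begin{cases} 1 & \text{if } x_1 = 0 \text{ and } x_2 = 1 \\ 0 & \text{otherwise} \end{cases}$

$\phi_{1,2,[1,0]}(x) = \begin{cases} 1 & \text{if } x_1 = 1 \text{ and } x_2 = 0 \\ 0 & \text{otherwise} \end{cases}$

$\phi_{1,2,[1,1]}(x) = \begin{cases} 1 & \text{if } x_1 = 1 \text{ and } x_2 = 1 \\ 0 & \text{otherwise} \end{cases}$

$E[\phi_{1,2,[a,b]}(x)] \approx \hat{\alpha}_{1,2,a,b} = \frac{1}{M} \sum_i I\{x_{1,i} = a,\ x_{2,i} = b\}$

$4 \binom{N}{2}$ constraints

Approximation of the interaction structure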

Page 26: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Consider the discrete, binary case.

For jth order: $2^j \binom{N}{j}$ constraints

• The higher the order, the more accurate our approximation will be…

… provided we observe enough data!

Approximation of the interaction structure
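
To see how quickly the number of j-th order constraints grows in the binary case, the count 2^j · C(N, j) can be evaluated for a few orders. This is a small sketch; the value N = 1000 is an arbitrary illustration, not a figure from the slides.

```python
from math import comb


def n_constraints(n_genes, order):
    """Number of order-j marginal constraints for binary variables: 2^j * C(N, j)."""
    return 2 ** order * comb(n_genes, order)


for order in (2, 3, 4):
    print(order, n_constraints(1000, order))
```

Each additional order multiplies the number of quantities to estimate by several hundred for N = 1000, which is why the sample size limits the order that can be modeled reliably.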

Page 27: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

(from The Elements of Statistical Learning, by Hastie, Tibshirani and Friedman)

Approximation of the interaction structure

“…for M → ∞ (where M is sample set size) the complete form of the JPD is restored. In fact, M > 100 is generally sufficient to estimate 2-way marginals in genomics problems, while P(gi, gj, gk) requires about an order of magnitude more samples…”

(from Dr. Munos lectures, Master MVA)

Page 28: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Therefore, the model they adopt is of the form:

$P(\{g_i\}) = \frac{1}{Z} \exp\Big[ \sum_{i}^{N} \phi_i(g_i) + \sum_{i,j}^{N} \phi_{ij}(g_i, g_j) \Big]$

• All gene pairs for which $\phi_{ij} = 0$ are declared mutually non-interacting:

  • Some of these genes are statistically independent, i.e. P(gi, gj) ≈ P(gi)P(gj)

  • Some are genes that do not interact directly but are statistically dependent due to their interaction via other genes, i.e. P(gi, gj) ≠ P(gi)P(gj), but $\phi_{ij} = 0$

How can we extract this information?

Approximation of the interaction structure

Page 29: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Therefore, the model they adopt is of the form:

$P(\{g_i\}) = \frac{1}{Z} \exp\Big[ \sum_{i}^{N} \phi_i(g_i) + \sum_{i,j}^{N} \phi_{ij}(g_i, g_j) \Big]$

Approximation of the interaction structure

[Figure: two overlapping sets, $\phi_{ij} = 0$ and $P(g_i, g_j) = P(g_i)P(g_j)$, delimiting three regions:]

• Pairs of genes that interact through a third gene

• Pairs of genes whose direct interaction is balanced out by a third gene

• Pairs of genes for which the MI is a rightful indicator of dependency

Page 30: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• The mutual information between two random variables is a measure of the amount of information that one random variable contains about another.

• It is defined as the relative entropy (or Kullback-Leibler distance) between the joint distribution and the product of the marginals:

$I(X,Y) = D\big( p(x,y) \,\|\, p(x)p(y) \big) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$

• Alternative formulations:

$I(X,Y) = S(X) + S(Y) - S(X,Y) = S(X) - S(X|Y)$

Mutual Information

Page 31: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

Another toy example

X  Y
1  0
1  1
0  0
0  0
0  1
1  1
1  0
0  1
0  1
0  1

P(X=0)=6/10; P(X=1)=4/10

P(Y=0)=4/10; P(Y=1)=6/10

P(X=0,Y=0)=2/10; P(X=0,Y=1)=4/10; P(X=1,Y=0)=2/10; P(X=1,Y=1)=2/10

$I(X,Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} = 0.2 \log \frac{0.2}{0.6 \cdot 0.4} + 0.4 \log \frac{0.4}{0.6 \cdot 0.6} + 0.2 \log \frac{0.2}{0.4 \cdot 0.4} + 0.2 \log \frac{0.2}{0.4 \cdot 0.6} \approx 0.0138 \text{ nats} \approx 0.020 \text{ bits}$

[Figure: histograms of P(X), P(Y) and P(X,Y)]

Mutual Information estimation
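
The toy computation can be reproduced from the ten (X, Y) samples with a plug-in estimate, i.e. by replacing the probabilities in the MI definition with empirical frequencies. This is a minimal sketch; the function name is hypothetical and natural logarithms are assumed.

```python
import math
from collections import Counter

# The ten samples of the toy example.
X = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
Y = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1]


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) from paired discrete samples."""
    m = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (n / m) * math.log((n / m) / ((px[x] / m) * (py[y] / m)))
        for (x, y), n in joint.items()
    )


mi = mutual_information(X, Y)
print(mi)                 # approx. 0.0138 nats
print(mi / math.log(2))   # approx. 0.0200 bits
```

The small value reflects that X and Y are almost independent in this sample: each joint frequency is close to the product of its marginals.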

Page 32: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

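The Gaussian Kernel estimator

$\hat{f}(z) = \frac{1}{M} \sum_i h^{-2}\, G\big( h^{-1} |z - z_i| \big)$

where G is the bivariate standard normal density and h is the kernel width.

Another toy example

Mutual Information estimation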

Page 33: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

$\hat{f}(z) = \frac{1}{M} \sum_i h^{-2}\, G\big( h^{-1} |z - z_i| \big)$

$\hat{I}(\{x_i\}, \{y_i\}) = \frac{1}{M} \sum_i \log \frac{\hat{f}(x_i, y_i)}{\hat{f}(x_i)\, \hat{f}(y_i)}$

[Figure: kernel density estimates f(x), f(y) and f(x,y), compared with the histogram P(X,Y)]

Another toy example

Mutual Information estimation
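
A compact version of the Gaussian kernel MI estimator can be sketched with NumPy as follows. It assumes a product Gaussian kernel with a single width `h` for both variables and uses the same `h` for the one-dimensional marginals; the function name and the test data are illustrative, not taken from the papers.

```python
import numpy as np


def gaussian_kernel_mi(x, y, h):
    """Kernel estimate of I(X;Y) in nats: mean of log f(x,y) / (f(x) f(y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise 1-D Gaussian kernels, each normalised to integrate to one.
    kx = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2) / (np.sqrt(2 * np.pi) * h)
    ky = np.exp(-0.5 * ((y[:, None] - y[None, :]) / h) ** 2) / (np.sqrt(2 * np.pi) * h)
    f_x = kx.mean(axis=1)          # marginal density estimate at each x_i
    f_y = ky.mean(axis=1)          # marginal density estimate at each y_i
    f_xy = (kx * ky).mean(axis=1)  # joint density estimate at each (x_i, y_i)
    return float(np.mean(np.log(f_xy / (f_x * f_y))))


rng = np.random.default_rng(0)
x = rng.normal(size=400)
y_dep = x + 0.3 * rng.normal(size=400)   # strongly dependent on x
y_ind = rng.normal(size=400)             # independent of x

mi_dep = gaussian_kernel_mi(x, y_dep, h=0.3)
mi_ind = gaussian_kernel_mi(x, y_ind, h=0.3)
print(mi_dep, mi_ind)
```

The product of two one-dimensional Gaussians is the isotropic bivariate Gaussian, so `kx * ky` corresponds to the $h^{-2} G(h^{-1}|z - z_i|)$ term. The estimate for the independent pair is small but generally not zero, which is the reason a significance threshold is needed later.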

Page 34: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

[Figure: joint density estimate f(x,y) for three kernel widths: reference (h = 1), h’ = 4h and h’’ = h/2]

Another toy example

Mutual Information estimation

Page 35: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Every pair of variables is copula-transformed:

$(X_1, X_2) \to [F_{X_1}(X_1), F_{X_2}(X_2)]$, where $F_X(x) = P(X \le x)$

• By the Probability Integral Transform Theorem, we know that $F_X(X) \sim U[0,1]$.

• Thus, the resulting variables have range between 0 and 1, and marginals are uniform.
• Transformation is one-to-one ⇒ H and MI unaffected
• Reduction of the influence of arbitrary transformations from microarray data preprocessing.
• No need for position-dependent kernel widths (h).

Mutual Information estimation
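
In practice the unknown distribution function is replaced by the empirical one, so the copula transform amounts to replacing each value by its normalized rank. The sketch below assumes there are no ties in the data; the function name is hypothetical.

```python
import numpy as np


def copula_transform(x):
    """Map samples to (0, 1] through the empirical CDF, i.e. rank / M."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x)) + 1  # rank 1 for the smallest value
    return ranks / len(x)


rng = np.random.default_rng(0)
expr = rng.normal(size=50)
u = copula_transform(expr)

# Any strictly increasing preprocessing step leaves the result unchanged.
u_exp = copula_transform(np.exp(expr))
print(np.allclose(u, u_exp))
```

This invariance is what removes the influence of monotone preprocessing transformations of the microarray data.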

Page 36: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• The choice of h is critical for the accuracy of the MI estimate, but not so important for estimation of MI ranks.

Mutual Information estimation

Page 37: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• At this point, we have a procedure to construct an undirected graph from our data.

• In an “ideal world” (infinite samples, assumptions hold), we would be done.

• Unfortunately, things are not so simple !

Technical implementation

Page 38: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Finite random samples
• Possible higher-order interactions

Use hypothesis testing

For each pair:
• H0: the two genes are mutually independent (no edge)
• HA: the two genes interact with each other (edge)

• We reject the null hypothesis H0 (i.e. we “draw an edge”) when the MI between two genes is big enough.

Need to choose a statistical threshold I0

Technical implementation

Page 39: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Chosen approach: Random permutation analysis
  • Randomly shuffle gene expression values and labels

Choice of the statistical threshold

          Sample 1   Sample 2   Sample 3   …   Sample M
Gene 1    g11        g12        g13        …   g1M
Gene 2    g21        g22        g23        …   g2M
…         …          …          …          …   …
Gene N    gN1        gN2        gN3        …   gNM

Page 40: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Chosen approach: Random permutation analysis
  • Randomly shuffle gene expression values and labels

• The resulting variables are supposed to be mutually independent, but their estimated MI need not be equal to zero.

• Each threshold I0 can then be assigned a p-value (which measures the probability of getting an MI value greater than or equal to I0 just “by chance”).

• Thus, by fixing a p-value, we obtain the desired I0.

Choice of the statistical threshold
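
The permutation procedure can be sketched as follows: shuffle one gene of a randomly chosen pair across samples, estimate the MI of the now independent pair, repeat, and take the quantile of the resulting null distribution that corresponds to the chosen p-value. The kernel estimator, the number of permutations and the synthetic data are all assumptions made for this illustration.

```python
import numpy as np


def kernel_mi(x, y, h=0.3):
    """Gaussian-kernel MI estimate (nats); normalising constants cancel in the ratio."""
    kx = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    ky = np.exp(-0.5 * ((y[:, None] - y[None, :]) / h) ** 2)
    fx, fy, fxy = kx.mean(axis=1), ky.mean(axis=1), (kx * ky).mean(axis=1)
    return float(np.mean(np.log(fxy / (fx * fy))))


def mi_threshold(data, p_value, n_perm=200, h=0.3, seed=0):
    """Threshold I0 from a permutation null; data has shape (genes, samples)."""
    rng = np.random.default_rng(seed)
    n_genes = data.shape[0]
    null = np.empty(n_perm)
    for t in range(n_perm):
        i, j = rng.choice(n_genes, size=2, replace=False)
        # Shuffling gene j across samples destroys any dependence on gene i.
        null[t] = kernel_mi(data[i], rng.permutation(data[j]), h)
    return float(np.quantile(null, 1.0 - p_value))


rng = np.random.default_rng(1)
data = rng.normal(size=(4, 100))
data[1] = data[0] + 0.2 * rng.normal(size=100)  # gene 1 depends on gene 0

i0 = mi_threshold(data, p_value=0.05)
print(i0, kernel_mi(data[0], data[1]))
```

With only a few hundred permutations, very small p-values cannot be read off the empirical null directly, which is presumably why the tail of the null distribution is fitted and extrapolated.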

Page 41: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

[Figure: histogram of MI values (x-axis) against number of pairs (y-axis) for the permuted data, with the threshold I0 marked]

• Chosen approach: Random permutation analysis

Choice of the statistical threshold

Page 42: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Chosen approach: Random permutation analysis
  Parameter fitting: use a known fact from large deviation theory

$\text{p-value} = p(I \ge I_0 \mid H_0) \propto e^{-\alpha M I_0}$

Choice of the statistical threshold

Page 43: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

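• If we have a real distribution of the type:

$P(\{g_i\}) = \frac{1}{Z} \exp\Big[ \sum_{i}^{N} \phi_i(g_i) + \sum_{i,j,k}^{N} \Big( \underbrace{\phi_{ij}(g_i, g_j)}_{=0} + \underbrace{\phi_{jk}(g_j, g_k)}_{=0} + \underbrace{\phi_{ik}(g_i, g_k)}_{=0} + \phi_{ijk}(g_i, g_j, g_k) \Big) \Big]$

ARACNE will get it wrong, because of our assumption

[Figure: real network vs. ARACNE’s output]

The extension of the algorithm to higher-order interactions is a possible object of future research.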

Technical implementation

Page 44: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

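• If we have a real distribution of the type:

$P(\{g_i\}) = \frac{1}{Z} \exp\Big[ \sum_{i}^{N} \phi_i(g_i) + \underbrace{\phi_{ij}(g_i, g_j)}_{\ne 0} + \phi_{jk}(g_j, g_k) + \phi_{ik}(g_i, g_k) \Big]$, with $I_{ij} = 0$

ARACNE will not identify the edge between gi and gj

[Figure: real network vs. ARACNE’s output]

However, this situation is considered “biologically unrealistic”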

Technical implementation

Page 45: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• If we have a real distribution of the type:

$P(\{g_i\}) = \frac{1}{Z} \exp\Big[ \sum_{i}^{N} \phi_i(g_i) + \underbrace{\phi_{ij}(g_i, g_j)}_{=0} + \phi_{jk}(g_j, g_k) + \phi_{ik}(g_i, g_k) \Big]$, with $I_{ij} \ne 0$

As it is, ARACNE will put an edge between gi and gj

[Figure: real network vs. ARACNE’s output]

The algorithm can be improved using the Data Processing Inequality

Technical implementation

Page 46: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• The DPI is a well-known theorem within the information theory community.

Let X, Y, Z be 3 random variables that form a Markov chain X-Y-Z. Then

$I(X;Y) \ge I(X;Z)$

Proof.
$I(X; Y,Z) = I(X;Z) + I(X;Y \mid Z) = I(X;Y) + I(X;Z \mid Y)$
X and Z are conditionally independent given Y $\Rightarrow I(X;Z \mid Y) = 0$
Thus, $I(X;Y) = I(X;Z) + I(X;Y \mid Z) \ge I(X;Z)$, since $I(X;Y \mid Z) \ge 0$

Similarly, we can prove $I(Y;Z) \ge I(X;Z)$, and therefore:

$I(X;Z) \le \min[I(X;Y), I(Y;Z)]$

The Data Processing Inequality

Page 47: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• A consequence of the DPI is that no transformation of Y, however clever, can increase the information that Y contains about X.

(consider the Markov chain of the form X-Y-[Z=g(Y)])

• From an intuitive point of view:

[Figure: chain of people Mike - Francisco - Jean-Paul]

The Data Processing Inequality

Page 48: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

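• Going back to our case study

[Figure: three-gene loop gi, gj, gk with pairwise MI values 0.1, 0.2 and 0.3]

$P(\{g_i\}) = \frac{1}{Z} \exp\Big[ \sum_{i}^{N} \phi_i(g_i) + \sum_{i,j}^{N} \phi_{ij}(g_i, g_j) \Big]$

So I assume that $\phi_{ij} = 0$, even though P(gi,gj) ≠ P(gi)P(gj)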

The Data Processing Inequality

Page 49: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• But what if the underlying network is truly a three-gene loop?

[Figure: three-gene loop gi, gj, gk]

… Then ARACNE will break the loop at the weakest edge! (known flaw of the algorithm)

Philosophy: “An interaction is retained iff there exist no alternate paths which are a better explanation for the information exchange between two genes”

Claim: In practice, looking at the TP vs. FP tradeoff, it pays to simplify

The Data Processing Inequality

Page 50: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Theorem 1. If MIs can be estimated with no errors, then ARACNE reconstructs the underlying interaction network exactly, provided this network is a tree and has only pairwise interactions.

• Theorem 2. The Chow-Liu (CL) maximum mutual information tree is a subnetwork of the network reconstructed by ARACNE.

• Theorem 3. Let $\pi_{ik}$ be the set of nodes forming the shortest path in the network between nodes i and k. Then, if MIs can be estimated without errors, ARACNE reconstructs an interaction network without false positive edges, provided: (a) the network consists only of pairwise interactions, (b) for each $j \in \pi_{ik}$, $I_{ij} \ge I_{ik}$. Further, ARACNE does not produce any false negatives, and the network reconstruction is exact iff (c) for each directly connected pair (ij) and for any other node k, we have $I_{ij} \ge \min(I_{jk}, I_{ik})$.

The Data Processing Inequality

Page 51: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Proof of theorem 1.

(a) MIs can be estimated with no errors

(b) Network is a tree

(c) Only pairwise interactions

⇒ ARACNE reconstructs true network without errors

- (c) ⇒ no problem with higher-order interactions
- (a) ⇒ blue area boundary is ok
- (b) ⇒ red area is contained in yellow area
- (a), (b), DPI ⇒ yellow area is ok (every edge with $\phi_{ij} = 0$ is removed and only edges with $\phi_{ij} = 0$ are removed)

The Data Processing Inequality

Page 52: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

The Chow-Liu maximum mutual information tree (1968)

• Method for approximating the JPD of a set of discrete variables using products of distributions involving no more than a pair of variables.

• The Chow-Liu method approximates a distribution P(x) by T(x), a tree-structured MRF.

• They proved that minimizing the Kullback-Leibler distance between P and T amounts to maximizing the total mutual information of the edges of T.

The Data Processing Inequality

Page 53: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• In practice, the variance of the MI estimator may lead to the use of a tolerance $\varepsilon$ so that the DPI relations become of the form:

$I_{ij} \le I_{ik}(1 - \varepsilon)$

The Data Processing Inequality

[Figure: results for two topologies, Erdös-Rényi and scale-free]

Page 54: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

The final algorithm

1. Choice of a threshold I0

2. Compute all pairwise MIs

3. Draw an edge when MI ≥ I0

4. Look at all the three-gene loops and prune the edge with the lowest MI.

Technical implementation: summary
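
Steps 3 and 4 can be sketched on a precomputed, symmetric MI matrix as follows. This is an illustrative sketch rather than the authors' implementation: the function name, the strict inequality used for the DPI test (so that exact ties are kept) and the choice to mark all edges first and remove them afterwards are assumptions.

```python
import itertools

import numpy as np


def aracne_prune(mi, i0, eps=0.0):
    """Threshold a symmetric MI matrix at i0, then apply DPI pruning.

    In every fully connected gene triplet, the weakest edge is marked for
    removal when its MI is below (1 - eps) times the second weakest MI.
    Returns a boolean adjacency matrix.
    """
    n = mi.shape[0]
    adj = mi >= i0
    np.fill_diagonal(adj, False)
    to_remove = set()
    for i, j, k in itertools.combinations(range(n), 3):
        if adj[i, j] and adj[j, k] and adj[i, k]:
            edges = sorted(
                [((i, j), mi[i, j]), ((j, k), mi[j, k]), ((i, k), mi[i, k])],
                key=lambda e: e[1],
            )
            (weak, w), (_, second), _ = edges
            if w < second * (1.0 - eps):
                to_remove.add(weak)
    pruned = adj.copy()
    for a, b in to_remove:  # remove only after all triplets were examined
        pruned[a, b] = pruned[b, a] = False
    return pruned


# Three-gene loop with MI values 0.3, 0.2 and 0.1.
mi = np.array([[0.0, 0.3, 0.1],
               [0.3, 0.0, 0.2],
               [0.1, 0.2, 0.0]])
print(aracne_prune(mi, i0=0.05))
```

In this example the edge with MI 0.1 is removed and the other two are kept. With a large tolerance such as `eps=0.6` the same edge survives, because 0.1 is not below 0.2 × 0.4.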


Page 58: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

The final algorithm

1. Choice of a threshold I0

2. Compute all pairwise MIs: $\binom{N}{2}$ pairs x M samples

$\hat{f}(z) = \frac{1}{M} \sum_i h^{-2}\, G\big( h^{-1} |z - z_i| \big)$

$\hat{I}(\{x_i\}, \{y_i\}) = \frac{1}{M} \sum_i \log \frac{\hat{f}(x_i, y_i)}{\hat{f}(x_i)\, \hat{f}(y_i)}$

3. Draw an edge when MI ≥ I0

4. Look at all the three-gene loops and prune the edge with the lowest MI: $\binom{N}{3}$ maximum triplets

Overall complexity: $O(N^2 M^2 + N^3)$

Page 59: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

Outline

• A very short introduction to MRFs

• The ARACNE approach
  • Theoretical framework
  • Technical implementation

• Experiments and results
  • Synthetic data
  • Human B cells (paper by Basso et al.)

• Discussion

Page 60: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Two different sets of experiments are presented:
  (a) Reconstruction of synthetic networks
  (b) Reconstruction of a human B lymphocyte genetic network from microarray data.

• ARACNE’s performance is compared to that of two competing methodologies:
  • Bayesian networks
  • Relevance networks

Experiments and results

Page 61: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Model proposed by Mendes et al. as a platform for comparison of reverse engineering algorithms.

• Simplification of real biological networks…

• …but reasonably complex to model some aspects of transcriptional regulation.

“An algorithm that does not perform well on this model is unlikely to perform well in a more complex case”

Synthetic data: AGNs

Page 62: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Transcriptional interactions are approximated as:

[Equation shown on the slide]

where

$x_i$: level of expression of the i-th gene

$N_I$: number of upstream inhibitors

$N_A$: number of activators

$I_j$: concentration of upstream inhibitors

$A_l$: concentration of activators…

Synthetic data: AGNs

Page 63: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Transcriptional interactions are approximated as:

[Equation shown on the slide]

• In order to generate M samples, the parameters for gene i in imaginary microarray k are:

$a_{i,k} = \mu_{i,k}\, a_i$

$b_{i,k} = \nu_{i,k}\, b_i$

where $a_i, b_i$ are some original constant values of the parameters and $\mu_{i,k}, \nu_{i,k}$ are uniform random variables in [0,2].

This simulates the sampling of a population of distinct phenotypes at random time points, where the efficiency of biochemical reactions may also be distinct.

Synthetic data: AGNs

Page 64: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Two possible network topologies are considered:

Erdös-Rényi

Each vertex of the graph is equally likely to be connected to any other vertex.

i.e. The presence of an edge between each possible pair of genes is modeled as a Bernoulli random variable with parameter p.

Synthetic data: AGNs
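
The Erdös-Rényi construction described above can be sketched in a few lines: every possible pair of vertices receives an edge independently with probability p. The function name and the seed handling are illustrative choices.

```python
import random


def erdos_renyi_edges(n, p, seed=0):
    """Undirected Erdos-Renyi graph: each pair (i, j), i < j, is an edge with probability p."""
    rng = random.Random(seed)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if rng.random() < p  # one Bernoulli(p) trial per pair
    ]


edges = erdos_renyi_edges(n=20, p=0.1)
print(len(edges), "edges out of", 20 * 19 // 2, "possible pairs")
```

Because all pairs are treated identically, vertex degrees concentrate around (n - 1)p and large hubs are unlikely, in contrast with the scale-free topology.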

Page 65: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Two possible network topologies are considered:

Scale-free

The distribution of the number of connections k associated with each vertex follows a power law:

$p(k) \sim k^{-\gamma}, \quad \gamma > 0$

This gives rise to large interaction hubs.

Synthetic data: AGNs

Page 66: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Two possible network topologies are considered:

Erdös-Rényi Scale-free

Synthetic data: AGNs

Page 67: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Two performance indicators:

• Recall: $N_{TP} / (N_{TP} + N_{FN})$

(fraction of the true edges that are identified)

• Precision: $N_{TP} / (N_{TP} + N_{FP})$

(fraction of the identified edges that are true)

Synthetic data: AGNs
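
Both indicators are straightforward to compute once the true and the inferred edge lists are available. The sketch below treats edges as undirected pairs; the function name and the example edges are hypothetical.

```python
def precision_recall(true_edges, predicted_edges):
    """Precision and recall of an inferred undirected network."""
    true_set = {frozenset(e) for e in true_edges}
    pred_set = {frozenset(e) for e in predicted_edges}
    n_tp = len(true_set & pred_set)   # true edges that were identified
    n_fp = len(pred_set - true_set)   # identified edges that are not true
    n_fn = len(true_set - pred_set)   # true edges that were missed
    precision = n_tp / (n_tp + n_fp) if pred_set else 0.0
    recall = n_tp / (n_tp + n_fn) if true_set else 0.0
    return precision, recall


true_edges = [(1, 2), (2, 3), (3, 4)]
predicted_edges = [(2, 1), (3, 4), (1, 4)]
precision, recall = precision_recall(true_edges, predicted_edges)
print(precision, recall)
```

Here two of the three predicted edges are true and two of the three true edges are found, so both values equal 2/3.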

Page 68: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• ARACNE outperforms its competitors for sufficiently small choices of p-values

• When the p-value is not small enough, MI values that are not statistically significant are accepted and the algorithm's performance collapses.

Synthetic data: AGNs

(Erdös-Rényi) (Scale-free)

Page 69: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• ARACNE’s performance depends on MI being high for directly interacting genes and decreasing rapidly with the interaction distance.

Synthetic data: AGNs

Page 70: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Performance is stable in terms of kernel width as long as it is not too narrow.

Synthetic data: AGNs

Page 71: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Better performance for Erdös-Rényi (fewer loops)
• High precision and substantial recall, even for small samples

Synthetic data: AGNs

Page 72: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Results for the case of human B cells are described in detail in the second paper:

“Reverse engineering of regulatory networks in human B cells”

Basso et al., Nature Genetics, 37, 382-390, 2005;

Real data: Human B cells

Page 73: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Large gene expression datasets such as those derived from systematic perturbations of simple organisms (yeast) are not as easily obtained for mammalian cells.

• The more complex the organism becomes, the more difficult it is to distinguish between physical and functional interactions.

Real data: Human B cells

Page 74: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• The authors assume that an equivalent dynamic richness can be obtained for a given type of human cell.

• They choose to work with 340 B lymphocytes derived from normal, tumor-related and experimentally manipulated populations.

• They feed the data to ARACNE, which generates a regulatory network containing approx. 129,000 interactions.

Real data: Human B cells

Page 75: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Interesting observation:
  • The results show a power-law tail in the relationship between the number of genes n and their number of interactions.

• This is suggestive of a scale-free underlying network structure.

• Important because the evidence of scale-free topology in higher-order eukaryotic cells is still scarce.

Real data: Human B cells

Page 76: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• They use Gene Ontology to analyze the biological processes affected by the most relevant (top 5%) hubs.

• Focus on the c-MYC proto-oncogene:
  • Because it is well characterized as a transcription factor, which helps validate the results.
  • Subnetwork with 2,063 genes (56 directly connected to MYC).
  • The network is handicapped by certain limitations:
    • The edges are undirected
    • Some direct connections are due to intermediates which are not even represented on the microarray
    • Some direct relations are incorrectly removed by the DPI

Real data: Human B cells

Page 77: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

Real data: Human B cells

Page 78: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

Real data: Human B cells

Page 79: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• Some results of interest can be summarized as follows:

• 29/56 (51.8%) of the first neighbors presumed “correct” (either reported in literature or ChIP validated in lab)

• This is statistically significant wrt the expected 11% of background c-MYC targets among randomly selected genes.

• Furthermore, c-MYC target genes are significantly more enriched among first neighbors than among second neighbors (51.8% vs 19.4%).

Real data: Human B cells
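
One way to get a feeling for the significance of 29 validated targets among 56 first neighbors is a simple binomial tail computation against the 11% background rate. The binomial null model is an assumption made for this sketch; the slides do not state which test the authors used.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 29 of the 56 first neighbors are presumed correct; background rate is 11%.
p_val = binomial_tail(56, 29, 0.11)
print(29 / 56, p_val)
```

Under this simplified null model, observing 29 or more targets by chance is extremely unlikely, since only about 6 would be expected.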

Page 80: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

Outline

• A very short introduction to MRF

• The ARACNE approach
  • Theoretical framework
  • Technical implementation

• Experiments and results
  • Synthetic data
  • Human B cells (paper by Basso et al.)

• Discussion

Page 81: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• ARACNE provides a provably exact network reconstruction under a controlled set of approximations.

• Some limitations:

• ARACNE opens all three-gene loops along the weakest edge (i.e. false negatives for triplets of truly interacting genes)

• Only statistical dependencies expressed as pairwise interaction potentials can be inferred. (solution: extend the model to higher orders)

• Edges are undirected (inevitable with non-temporal data?)

• Some ambiguities concerning the interpretation of the inferred irreducible statistical dependencies.

• Beware of loopy underlying topologies (go for tree-like)

Discussion

Page 82: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• ARACNE provides a provably exact network reconstruction under a controlled set of approximations.

• Some virtues:

• Low computational complexity

• No need to discretize expression levels

• Does not rely on unrealistic network models or a priori assumptions (excuse-me ?? :-|)

• Avoidance of heuristic/stochastic search procedures

• High precision and recall on synthetic data

• Ability to infer genetic interactions on a genome-wide scale from gene-expression profiles of mammalian cells.

Discussion

Page 83: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

• There is no “free lunch”: in order to effectively deal with the small sample regime some simplifying assumptions need to be made.

• Yet, it seems that this is the way to go: “divide and conquer paradigm”. The big problem of inferring gene interaction networks should be broken down into smaller subproblems that can be addressed by relatively simple classes of models. Each model will then rely on strong assumptions, but they should be able to deal with complex scenarios when carefully chosen and properly combined.

• No magic recipes, a long and winding road ahead… (remember Minsky?)

My point of view

Page 84: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

“…A well-known anecdote relates how, sometime in 1966, the legendary Artificial Intelligence pioneer Marvin Minsky directed an undergraduate student to solve "the problem of computer vision" as a summer project. This anecdote is often resuscitated to illustrate how egregiously the difficulty of computational vision has been underestimated. Indeed, nearly forty years later, the discipline continues to confront numerous unsolved (and perhaps unsolvable) challenges, particularly with respect to high-level "image understanding" issues such as pattern recognition and feature recognition. Nevertheless, the intervening decades of research have yielded a great wealth of well-understood, low-level techniques that are able, under controlled circumstances, to extract meaningful information from a camera scene. These techniques are indeed elementary enough to be implemented by novice programmers at the undergraduate or even high-school level…”

(from “Computer Vision for Artists and Designers” by Golan Levin)

My point of view

Page 85: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory

Thanks for your attention