Hernández-Lemus & Rangel-Escareño, 2011. The Role of Information Theory in Gene Regulatory Network...
In: Information Theory: New Research
Editors: P. Deloumeaux et al., pp. 137-184
ISBN: 978-1-62100-325-0
© 2011 Nova Science Publishers, Inc.
Chapter 4
THE ROLE OF INFORMATION THEORY
IN GENE REGULATORY NETWORK INFERENCE

Enrique Hernández-Lemus and Claudia Rangel-Escareño
Computational Genomics Department,
National Institute of Genomic Medicine, Mexico
Abstract
One important problem in contemporary computational biology is that of reconstructing the best possible set of regulatory interactions between genes (a so-called gene regulatory network, GRN) from partial knowledge, as given for example by means of gene expression analysis experiments. Since only highly noisy data are available, doing this represents a challenge to common probabilistic modeling approaches. However, a variety of algorithms rooted in information theory and maximum entropy methods have been developed, and they have coped with the problem successfully (to a certain degree). Mutual information maximization, Markov random fields, use of the data processing inequality, minimum description length, Kullback-Leibler divergence and information-based similarity are some of these. Another approach to modeling gene regulatory networks combines information theory and machine learning techniques. Monte Carlo methods and variational methods can also be used to measure data information content. Hidden Markov models (HMMs) use time series data to represent information of a state sequence about the past through a discrete random variable called the hidden state. Similarly, stochastic linear dynamical systems represent information about the past but through a real-valued hidden state vector. Common to these models is the fact that, conditioned on the hidden state vector, the past, present and future observations are statistically independent. State-space models, also known as Linear Dynamical Systems (LDS) or Kalman filter models, are a subclass of dynamic Bayesian networks used for modeling time series data. Expressing time series models in state-space form allows for unobserved components, an important factor when modeling gene expression data. Unobserved variables can model biological effects that are not taken into account by the observables. They could model the effects of genes that have not been included in the experiment, levels of regulatory proteins or possible effects of mRNA degradation. Work presented here shows the use of these models to reverse engineer regulatory networks from high-throughput data sources such as microarray gene expression profiling. In this review we will also describe the basic theoretical foundations common to such methods and briefly outline their virtues and limitations.

E-mail address: [email protected]
Keywords: Information theory, Network inference, probabilistic modeling
1. Introduction
A common situation in several emerging fields of science and technology, such as bioinformatics and computational biology, high energy physics and as-
tronomy, to name a few, is that researchers are confronted with datasets having
thousands of variables, large noise levels, non-linear statistical dependencies
and a very reduced sampling universe. The detection of functional and struc-
tural relationships of the data when confronted with such situations is always
a major challenge. In particular, the construction of dynamic maps of gene in-
teractions (also called genetic regulatory networks) relies on understanding the
interplay between thousands of genes. Several issues arise in the analysis of
data related to gene function: the measurement processes generate highly noisy
signals; there are far more variables involved (number of genes and interactions
among them) than experimental samples. Another source of complexity is the
highly nonlinear character of the underlying biochemical dynamics.
Hence two important milestones in the analysis of genomic regulation are
variable selection (also called feature selection) and network inference. The former is a machine learning topic whose goal is to select, from amongst thousands of input variables, those that lead to the best predictive model. Feature selection methods applied to genomic data allow, for instance, improving molecular
diagnosis and prognosis in complex diseases (such as cancer) by identifying a
set (called a molecular signature) of features or variables that best represent
the phenomenon. Network inference, in turn, consists in representing the
(in general non-linear) set of statistical dependencies between variables on a set
(that can be the whole input dataset or a feature-selected subset of it) by means
of a graph. When applied to genomic expression data (e.g. from microarray
experiments), network inference is able to reverse-engineer the transcriptional
gene regulatory network (GRN) of the related cell. Knowledge of this GRN
would allow, for instance, the discovery of new drug targets to cure diseases.
Information theory (IT) has proved to be a powerful theoretical foundation
to develop algorithms and computational techniques to deal both with feature
selection and with network inference problems applied to real data. There are
however goals and challenges involved in the application of IT to genomic analysis. The applied algorithms should return intelligible models (i.e., they must be understandable); they must also rely on little a priori knowledge, deal with thousands of variables, and detect non-linear dependencies, all of this starting from tens (or at most a few hundred) of highly noisy samples. As we will show in this chapter, IT has provided approaches to deal with these problems.
Some of these approaches are based on machine learning techniques, basically
by modeling a target function connecting the variables of a system. Here, the
output or target variable is the one to be predicted and the input variables are the
predictors.
As a means to produce intelligible models we perform feature-selection pro-
cedures. The goal of these procedures is to select inputs among a set of variables
which lead to the best predictive model. In the vast majority of cases, feature
selection is a preprocessing step prior to the actual machine learning stage. This
is a somewhat critical part of the whole inference process. On the one hand, variable or feature elimination can lead to information losses. On the other, feature selection is a means to improve the accuracy of a model, to improve the generalizability of such a model, as well as its intelligibility, and at the same time to
decrease the computational burden for the training and inference stages. Com-
putational methods for feature selection usually consist in a search algorithm
that explores different combinations of variables, supplemented with a measure
of performance (or score) for these combinations. There are several ways to accomplish this task; in our opinion, the best benchmarking options for the GRN inference scenario are sequential search algorithms (as opposed to stochastic search) and performance measures based on IT, since these make feature selection fast and efficient, and also provide an easy means to communicate
the results to non-specialists (e.g. molecular biologists, geneticists and physi-
cians).
GRNs are graph-theoretical constructs that describe the integrated state of
a cell (or a small population of similar cells to be more precise) under certain
biological conditions at a given time. GRNs are means for identifying gene
interactions from experimental data through the use of theoretical models and
computational analysis. The inference of such an interaction connectivity net-
work involves the solution of an inverse problem (a deconvolution) that aims to uncover the interactions from the properties and dynamics of observable behav-
ior in the form of, for example, RNA transcription levels in a characteristic gene
expression profile. A growing number of deconvolution methods (also called
reverse engineering methods) have been proposed in the past [6, 62]. Their
goal is to provide a well-defined representation of the cellular network topol-
ogy from the transcriptional interactions as revealed by gene expression mea-
surements that are then treated as samples from a joint probability distribution.
The goal of deconvolution methods is the discovery of GRNs based on statisti-
cal dependencies within this joint distribution [13]. One major shortcoming is
that, surprisingly, there is still no conceptual agreement as to what the depen-
dencies are within these multivariate settings and about the role of noise and
stochastic dynamics in the problem. The special case of conditional statistical
dependence has gained, however, a certain place as a somewhat useful criterion
in most biomedical applications. The central aim is to find a way to decom-
pose the Statistical Dependency Matrix (SDM), that is, the deviation of a joint probability distribution from the product of its marginals, into a series of well
defined contributions coming from interactions of several orders of complex-
ity. IT is therefore the right setting to do so. Typical means to reach this goal
consist in the quantification of the new information content that arises when we
look at the full joint probability distribution compared to a series of successive
independence approximations.
In GRNs each variable of the dataset is represented by a node (or vertex) in
the graph. There is a link joining two variable nodes if these variables exhibit
a particular form of dependency (the particular form of dependency depends
explicitly on the inference method chosen). Some genes can produce a protein
(or other biomolecules, such as a microRNA) that is able to activate or repress the production of another gene's protein. There are thus circuits coded in the DNA of a cell. A useful way to represent these circuits is a graph
where the nodes represent the genes and the links or arcs are the interactions
between them. Here we will be dealing with reverse engineering methods for
GRNs using whole-genome gene expression data as input data. This problem
is very general and useful in contemporary research in computational molecular
biology; however, it is a question that remains open to date due to its combina-
torial nature and the poor information content of the data. Validation of networks against available real-life data will thus be an important stage in the discovery
of reliable GRNs.
As we have seen there are two major shortcomings related to the feature
selection and network inference procedures: i) non-linearity and ii) large num-
ber of variables. IT methods are often efficient techniques to deal with issues
i) and ii) [52, 22, 21, 38, 26]. It can be seen that most of these methods rely
on some form of mutual information metric. Mutual information (MI) is an
information-theoretic measure of dependency which is model independent and
has been used to define (and quantify) relevance, redundancy and interaction in
such large noisy datasets. MI has the enormous advantage that it captures non-linear dependencies [38, 26]. Finally, MI is rather fast to compute, hence it can be calculated a large number of times in a still reasonable amount of time,
an explicit requirement in whole-genome transcription analysis.
2. Information Theoretical Measures and Probability
Measures
We will introduce here the essential notions of IT that will be used, such as entropy, mutual information and other measures. In order to do so, let $X$ and $Y$ denote two discrete random variables having the following features:

- finite alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively;
- joint probability mass distribution $p(X, Y)$;
- marginal probability mass distributions $p(X)$ and $p(Y)$.

Let also $\tilde{X}$ and $\tilde{Y}$ denote two additional discrete random variables defined on $\mathcal{X}$ and $\mathcal{Y}$ respectively; the associated probability mass distributions will be $\tilde{p}(X)$ and $\tilde{p}(Y)$, their joint probability mass distribution $\tilde{p}(X, Y)$, defined on $\mathcal{J}$, the joint probability sampling space, $\mathcal{J} = \mathcal{X} \times \mathcal{Y}$. For particular realizations, we have $p(x) = P(X = x)$ and $p(y) = P(Y = y)$.
Following Shannon [58], for every discrete probability distribution $X$ it is possible to define the information theoretical entropy $H$ of such a distribution as follows:

$$H = -K_s \sum_{x \in \mathcal{X}} p(x) \log p(x) \qquad (1)$$

Here $H$ is called the Shannon-Weaver entropy, $K_s$ is a constant that determines the units in which entropy is measured, and $p(x)$ is the probability mass for the state of the random variable given by $X = x$. Entropy was originally developed to serve as a measure of the amount of uncertainty associated with the value of $X$, hence relating the predictability of an outcome with the probability distribution.
The Kullback-Leibler divergence, $KL[\,\cdot\,;\,\cdot\,]$, is a non-commutative measure of the difference between two discrete probability distributions [33]:

$$KL[p(Y); \tilde{p}(Y)] = \sum_{y \in \mathcal{Y}} p(y) \log \frac{p(y)}{\tilde{p}(y)} \qquad (2)$$
The joint Kullback-Leibler divergence between two probability mass distributions $p(X, Y)$ and $\tilde{p}(X, Y)$ is given by:

$$KL[p(X, Y); \tilde{p}(X, Y)] = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(x, y)}{\tilde{p}(x, y)} \qquad (3)$$
In a similar way, it is possible to define the conditional Kullback-Leibler divergence between $p(Y|X)$ and $\tilde{p}(Y|X)$ as follows:

$$KL[p(Y|X); \tilde{p}(Y|X)] = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(y|x)}{\tilde{p}(y|x)} \qquad (4)$$
Equation 4 means that a conditional Kullback-Leibler divergence can also
be defined as the expected value of the Kullback-Leibler divergence of the con-
ditional probability mass functions averaged over the conditioning random vari-
ables.
Recalling equation 2 we notice that it can be rephrased as follows:
$$KL[p(Y); \tilde{p}(Y)] = \sum_{y \in \mathcal{Y}} p(y) \log p(y) - \sum_{y \in \mathcal{Y}} p(y) \log \tilde{p}(y) \qquad (5)$$

We can see that the first term on the right hand side of equation 5 is precisely the negative of the entropy $H(Y)$ as given by equation 1. Shannon's entropy depends on the distribution $p(Y)$ and, as Shannon himself showed [58], it is maximum for a uniform distribution $u(Y)$: $H[u(Y)] = \log |\mathcal{Y}|$. If we replace $\tilde{p}(y)$ with $u(Y)$ in equation 5 we get:

$$H[p(Y)] = \log |\mathcal{Y}| - KL[p(Y); u(Y)] \qquad (6)$$
As we can see, equation 6 states that the entropy of a random variable $Y$ is the logarithm of the size of the support set minus the Kullback-Leibler divergence between the probability distribution of $Y$ and the uniform distribution over the same domain $\mathcal{Y}$. Thus, the closer the probability distribution is to a uniform distribution, the higher the entropy. Hence, entropy measures randomness and unpredictability of a distribution.
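The identity in equation 6 is easy to check numerically. The sketch below (plain Python with natural logarithms, i.e. $K_s = 1$, and an arbitrary toy distribution) computes the entropy of equation 1 and the divergence of equation 2, and verifies that the entropy equals $\log |\mathcal{Y}|$ minus the divergence from the uniform distribution:

```python
import math

def entropy(p):
    """Shannon entropy (equation 1) with K_s = 1, i.e. measured in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """Kullback-Leibler divergence KL[p; q] (equation 2)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # an arbitrary distribution on 4 states
u = [0.25] * 4                  # the uniform distribution on the same support

# Equation 6: H[p] = log|Y| - KL[p; u]
assert abs(entropy(p) - (math.log(4) - kl(p, u))) < 1e-12
```

The closer `p` is to `u`, the smaller the divergence term and the larger the entropy, exactly as stated above.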
Now, let us consider a pair of discrete random variables $(Y, X)$ with a Joint Probability Distribution (JPD) $p(Y, X)$. For these random variables the joint entropy $H(Y, X)$ is given in terms of the JPD as:

$$H(Y, X) = -\sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p(y, x) \log p(y, x) \qquad (7)$$
We can notice that the maximal joint entropy is attained under independence of the random variables $Y$ and $X$, that is, when the JPD factorizes as $p(Y, X) = p(Y)\, p(X)$; in this case the entropy of the JPD is just the sum of the respective entropies. An inequality theorem can be stated as an upper bound for the joint entropy:

$$H(Y, X) \leq H(Y) + H(X) \qquad (8)$$

Equality only holds if $X$ and $Y$ are statistically independent.
Also, given a Conditional Probability Distribution (CPD), the corresponding conditional entropy of $Y$ given $X$ can be defined as:

$$H(Y|X) = -\sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p(y, x) \log p(y|x) \qquad (9)$$
Conditional entropies are useful to measure the uncertainty of a random variable once another one (the conditioner) is known. It can be proved [12] that:

$$H(Y, X) = H(X) + H(Y|X) \leq H(Y) + H(X) \qquad (10)$$

Or, in other words:

$$H(Y|X) \leq H(Y) \qquad (11)$$

Equality only holds when $X$ and $Y$ are statistically independent. Expression 11 is extremely useful in the inference/prediction scenario: if $Y$ is a target variable and $X$ is a predictor, conditioning on additional variables can only decrease the uncertainty on the target $Y$. This will prove almost essential for IT methods of GRN inference.
Entropy reduction by conditioning can be accounted for in a formal way if we consider a measure called the mutual information, $I(Y, X)$, which is a symmetrical measure (i.e. $I(Y, X) = I(X, Y)$) that is written as:

$$I(Y, X) = H(Y) - H(Y|X) \quad \text{or} \quad I(X, Y) = H(X) - H(X|Y) \qquad (12)$$

If we resort to Shannon's definition of entropy (equation 1) [58] and substitute it into equation 12 we get:

$$I(Y, X) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \qquad (13)$$
Mutual information can be written as the Kullback-Leibler divergence between the JPD and the product of the marginal distributions:

$$I(Y, X) = KL[p(X, Y); p(X)\, p(Y)] \qquad (14)$$

Mutual information is also given by the Kullback-Leibler divergence between the conditional distribution $p(X|Y)$ and the marginal distribution $p(X)$:

$$I(Y, X) = KL[p(X|Y); p(X)] \qquad (15)$$
Mutual information and Kullback-Leibler divergences are two of the most widely used IT measures to solve the GRN inference problem.

A comprehensive catalogue of algorithms to calculate diverse information theoretical measures has been developed for [R], the statistical scientific computing environment [27].
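The equivalence of the entropy-based form (equation 12) and the divergence-based form (equations 13-14) of mutual information can be verified on a toy joint distribution; the numbers below are arbitrary illustrative values, not data from any experiment:

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(v * math.log(v) for v in p if v > 0)

# An arbitrary joint probability mass function p(x, y) on a 2 x 2 alphabet.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = {x: sum(v for (a, _), v in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (_, b), v in joint.items() if b == y) for y in (0, 1)}

# Equations 13-14: I as the KL divergence between the JPD and the
# product of the marginals.
mi = sum(v * math.log(v / (px[x] * py[y])) for (x, y), v in joint.items())

# Equation 12 in its symmetric form: I = H(X) + H(Y) - H(X, Y).
assert abs(mi - (entropy(px.values()) + entropy(py.values())
                 - entropy(joint.values()))) < 1e-12
```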
3. Methods in Regulatory Network Inference

The deconvolution of a GRN can be based on a maximum entropy optimization of the JPD of gene-gene interactions as given by gene expression experimental data, and could be implemented as follows [26]. The JPD for the stationary expression of all genes, $P(\{g_i\})$, $i = 1, \ldots, N$, may be written as follows [38]:

$$P(\{g_i\}) = \frac{1}{Z} \exp(-H_{gen}) \qquad (16)$$

$$H_{gen} = -\left[\sum_i^N \phi_i(g_i) + \sum_{i,j}^N \phi_{i,j}(g_i, g_j) + \sum_{i,j,k}^N \phi_{i,j,k}(g_i, g_j, g_k) + \ldots\right] \qquad (17)$$
Here $N$ is the number of genes, $Z$ is a normalization factor (the partition function), and the $\phi$'s are interaction potentials. A truncation procedure in equation 17 is used to define an approximate hamiltonian $H_p$ that aims to describe statistical properties of the system. A set of variables (genes) interacts if and only if the potential among such a set of variables is non-zero. The relative contribution of $\phi$ is taken as proportional to the strength of the interaction within this set. Equation 17 does not define the potentials uniquely; thus, additional constraints should be provided in order to avoid ambiguity. A usual approach is to specify the $\phi$'s using maximum entropy (MaxEnt) approximations consistent with the available information on the system in the form of marginals. Information theory provides a set of useful criteria for setting up probability distribution functions (PDFs) on the basis of partial knowledge.
The MaxEnt estimate of a PDF is the least biased estimate possible, given
the information, i.e. the PDF that is maximally non-committal with regard to
missing information [28]. It is not possible to constrain the system via the
specification of all possible N-way potentials when N is large, hence one has
to approximate the interaction structure. According to the current genomics
literature, sample sizes of order $10^2$ (the usual maximum size available in most present-day studies) are generally sufficient to estimate 2-way marginals, whereas 3-way marginals (e.g. triplet interactions $\phi_{i,j,k}(g_i, g_j, g_k)$) require about an order of magnitude more samples, a sample size unattainable under present circumstances. This being the case, one is usually confronted with a
2-way hamiltonian of the form:
Figure 1. A set of genes $i$ interacts with another set of genes $k$ by means of a potential $\phi \neq 0$, and is non-interacting with another set of genes $j$ since the corresponding potential functional is equal to zero.
$$H_{approx} = -\left[\sum_i^N \phi_i(g_i) + \sum_{i,j}^N \phi_{i,j}(g_i, g_j)\right] \qquad (18)$$

Under that approximation, the reconstruction (or deconvolution) of the associated GRN consists in the inverse problem of determining the complete set of relevant 2-way interactions $\phi_{i,j}(g_i, g_j)$ consistent with the JPD (equations 16 and 17) that defines all known constraints, e.g. the values of the stationary expression of genes $g_i$ as given by the set of $\phi_i(g_i)$'s, and non-committal with every other restriction in the form of a marginal. The modeling of a GRN de-
pends on the description of the interactions in the form of several correlation
functions. A great deal of work has been done within the framework of the
Bayesian Network (BN) approach [51, 23]. BN models, both static and dynamic, have provided a better understanding of the problem in terms of solvabil-
ity, noise reduction and algorithmic complexity. Since BNs are a form of the
Directed Acyclic Graph (DAG) problem, there are several instances (e.g. feed-
forward loops, feed-back cycles, etc.) in which the DAG formalism of BNs
falls short. It has been noted [6] that BNs require a larger number of data points (samples) to infer the probability density distributions, whereas information the-
oretical approaches perform well for steady-state data and can be applied even
when few experiments (compared to the number of genes) are available. A re-
cently developed approach is the use of statistical and information theoretical
models to describe the interactions [36].
If we consider a 2-way interaction hamiltonian, all gene pairs $i, j$ for which $\phi_{i,j} = 0$ are said to be non-interacting. This is true for genes that are statistically independent, $P(g_i, g_j) \approx P(g_i)\, P(g_j)$, but it is also valid for genes that do not have a direct interaction but are connected via other genes, i.e. $\phi_{i,j} = 0$ but $P(g_i, g_j) \neq P(g_i)\, P(g_j)$. Several metrics such as Pearson correlation, squared correlation and Spearman rank coefficients over the sampling universe have been used, but the performance of these methods is usually poor, as it suffers from a large number of false positive predictions.
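The structure of equations 16-18 can be made concrete with a toy computation. The sketch below (plain Python; the three binary genes and all potential values are hypothetical choices, not taken from the text) builds the JPD of equation 16 from a 2-way hamiltonian and checks that a gene whose pairwise potentials all vanish is indeed statistically independent of the others:

```python
import itertools
import math

# Toy pairwise ("2-way") hamiltonian (equation 18) for three binary genes.
# The potential values below are arbitrary and purely illustrative.
phi1 = {0: 0.2, 1: -0.1, 2: 0.3}                  # single-gene potentials
phi2 = {(0, 1): 0.8, (0, 2): 0.0, (1, 2): 0.0}    # only genes 0 and 1 interact

def H(g):
    """Approximate hamiltonian; the overall minus sign follows equation 17."""
    return -(sum(phi1[i] * g[i] for i in phi1)
             + sum(phi2[i, j] * g[i] * g[j] for i, j in phi2))

states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(-H(g)) for g in states)           # partition function
P = {g: math.exp(-H(g)) / Z for g in states}       # equation 16

def marginal(indices, values):
    return sum(p for g, p in P.items()
               if all(g[i] == v for i, v in zip(indices, values)))

# Gene 2 has zero pairwise potential with every other gene, so it is
# statistically independent of gene 0: P(g0, g2) = P(g0) P(g2).
for a in (0, 1):
    for c in (0, 1):
        assert abs(marginal((0, 2), (a, c))
                   - marginal((0,), (a,)) * marginal((2,), (c,))) < 1e-12
```

The non-zero potential between genes 0 and 1, by contrast, makes their joint marginal deviate from the product of the individual ones, which is exactly the signal a deconvolution method tries to detect.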
3.1. Information Theoretical Methods
3.1.1. Mutual Information
An information theoretical measure that has been used successfully to infer
2-way interactions in GRNs is mutual information (MI) [38, 37, 3, 4]. MI for a pair of random variables $X$ and $Y$ is defined as $I(X, Y) = H(X) + H(Y) - H(X, Y)$. Here $H$ is the information theoretical entropy (Shannon's entropy), $H(X) = \langle -\log p(x_i) \rangle = -\sum_i p(x_i) \log p(x_i)$. MI measures the degree of statistical dependency between two random variables. From the definition one can see that $I(X, Y) = 0$ if and only if $X$ and $Y$ are statistically independent. Estimating MI between gene expression profiles under high-throughput exper-
imental setups typical of today's research in the field is a computational and
theoretical challenge of considerable magnitude. One possible approximation
is the use of estimators. Under a Gaussian kernel approximation [60], the JPD
of a 2-way measurement $X_i = (x_i, y_i)$, $i = 1, 2, \ldots, M$ is given as [38]:

$$\hat{f}(X) = \frac{1}{M} \sum_i \frac{1}{h^2}\, G\!\left(\frac{|X - X_i|}{h}\right) \qquad (19)$$

$G$ is the bivariate standard normal density and $h$ is the associated kernel width [38]. The mutual information can then be evaluated as follows:
$$I(\{x_i\}, \{y_i\}) = \frac{1}{M} \sum_i \log \frac{\hat{f}(x_i, y_i)}{\hat{f}(x_i)\, \hat{f}(y_i)} \qquad (20)$$

Hence, two genes with expression profiles $g_i$ and $g_j$ for which $I(g_i, g_j) \neq 0$ are said to interact with each other with a strength $I(g_i, g_j) \propto \phi(g_i, g_j)$, whereas two genes for which $I(g_i, g_j)$ is zero are declared non-directly interacting, to within the given approximations. Since MI is reparametrization invariant, one usually calculates the normalized mutual information. In this case $I(g_i, g_j) \in [0, 1]$, $\forall\, i, j$.
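A minimal sketch of the kernel estimator of equations 19-20 follows (plain Python; the fixed bandwidth `h = 0.3` and the synthetic data are illustrative assumptions, and a serious implementation such as that of [38] chooses the bandwidth carefully). The estimate comes out larger for a dependent pair of profiles than for an independent one:

```python
import math
import random

def gauss_kde(data, x, h):
    """1-D Gaussian kernel density estimate with bandwidth h."""
    n = len(data)
    return (sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in data)
            / (n * h * math.sqrt(2 * math.pi)))

def gauss_kde2(pairs, x, y, h):
    """2-D Gaussian product-kernel density estimate (equation 19)."""
    n = len(pairs)
    c = 1.0 / (n * h * h * 2 * math.pi)
    return c * sum(math.exp(-0.5 * ((x - a) ** 2 + (y - b) ** 2) / (h * h))
                   for a, b in pairs)

def mi_kernel(xs, ys, h=0.3):
    """Equation 20: average log ratio of the joint estimate to the
    product of the marginal estimates, evaluated at the sample points."""
    pairs = list(zip(xs, ys))
    total = sum(math.log(gauss_kde2(pairs, xi, yi, h)
                         / (gauss_kde(xs, xi, h) * gauss_kde(ys, yi, h)))
                for xi, yi in pairs)
    return total / len(pairs)

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(200)]
y_dep = [x + 0.3 * random.gauss(0, 1) for x in xs]   # dependent profile
y_ind = [random.gauss(0, 1) for _ in range(200)]     # independent profile

assert mi_kernel(xs, y_dep) > mi_kernel(xs, y_ind)
```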
Figure 2. Panel i shows a bivariate interaction between gene A and genes B and C; panel ii shows an indirect interaction of gene A on gene C mediated by gene B; panel iii depicts two independent interactions, between genes A and B and genes A and C.
A highly customizable set of algorithms for mutual information inference
of gene regulatory networks has been implemented in the [R]/BioConductor
scheme [43, 42] and is called minet. The inference proceeds in two steps.
First, the Mutual Information Matrix (MIM) is computed, a square matrix whose
$MIM_{ij}$ term is the mutual information between genes $x_i$ and $x_j$. Secondly, an inference algorithm takes the MIM matrix as input and attributes a score to
each edge connecting a pair of nodes. Different entropy estimators are imple-
mented in this package as well as different inference methods, namely aracne,
clr and mrnet; finally, the package integrates accuracy assessment tools, like
PR-curves and ROC-curves, to compare the inferred network with a reference
one [41]. The approach used there is also based on techniques from information theory; it is called the maximum relevance/minimum redundancy algorithm (MRMR) [17] and is a highly effective information-theoretic technique for
feature (or variable) selection in supervised learning. The MRMR principle consists in selecting, among the least redundant variables, the ones that have the
highest mutual information with the target.
MRNET [41] extends this feature selection principle to networks in order to
infer gene-dependence relationships from microarray data. The MRMR method
[17] used in conjunction with a best-first search strategy for performing filter
selection in supervised learning problems can be performed as follows. Consider a supervised learning task where the output is denoted by $Y$ and $V$
is the set of input variables. MRMR ranks the set V of inputs according to
a score that is the difference between the mutual information with the output
variable Y (maximum relevance) and the average mutual information with the
previously ranked variables (minimum redundancy). Hence direct interactions should be well ranked, whereas indirect interactions should be badly ranked by the method. A greedy search algorithm then starts by selecting the variable $X_i$ that shows the highest mutual information with the target $Y$. The next selected variable $X_j$ will be the one with a high information $I(X_j; Y)$ with the target and, at the same time, a low information $I(X_j; X_i)$ with the previously selected variable. In the following steps, given a set $S$ of selected variables, the criterion updates $S$ by choosing the variable that maximizes the score. At each step of the algorithm, the selected variable is expected to allow an efficient trade-off between relevance and redundancy. The MRMR criterion is therefore an optimal pairwise approximation (a proxy) of the conditional mutual information between any gene $X_j$ and $Y$ given the set $S$ of selected variables, $I(X_j; Y|S)$. MRNET (and minet also) works by repeating such an MRMR algorithm for every target gene (or, in any case, for every gene in which to search for de novo transcriptional interactions).
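The greedy MRMR ranking described above can be sketched in a few lines (plain Python; the gene names and MI values in `mim` are hypothetical illustrative numbers, not an actual minet computation):

```python
def mrmr(mim, target, k):
    """Greedy MRMR ranking.  mim is a symmetric mutual-information matrix
    (dict of dicts), target is the output variable, k is the number of
    predictors to select.  Score = relevance minus mean redundancy."""
    candidates = [v for v in mim if v != target]
    selected = []
    while candidates and len(selected) < k:
        def score(v):
            relevance = mim[v][target]
            redundancy = (sum(mim[v][s] for s in selected) / len(selected)
                          if selected else 0.0)
            return relevance - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical MI values: A is highly relevant to the target T; B is almost
# as relevant but largely redundant with A; C is mildly relevant but novel.
mim = {
    "T": {"A": 0.9, "B": 0.8, "C": 0.4},
    "A": {"T": 0.9, "B": 0.85, "C": 0.1},
    "B": {"T": 0.8, "A": 0.85, "C": 0.1},
    "C": {"T": 0.4, "A": 0.1, "B": 0.1},
}
# A wins on relevance; C then beats B because B's redundancy with A
# (0.85) outweighs its slightly higher relevance.
assert mrmr(mim, "T", 2) == ["A", "C"]
```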
MRNET reverse engineers networks by means of a forward selection strat-
egy that aims to identify a maximally-independent set of neighbors for every
variable. A known limitation of algorithms based on forward selection, how-
ever, is that the quality of the selected subset strongly depends on the first vari-
able selected (dependence on initial conditions). A modified version, called mrnetb [43], is an improved version of MRNET that overcomes this shortcoming by using backward selection followed by a sequential replacement, and it can be implemented with about the same computational burden as
the original forward selection strategy. The optimization problem of MRNET is
a form of binary quadratic optimization for which backward elimination com-
bined with a sequential search is known to perform well. Backward elimination
starts with a set containing all the variables and then selects the variable $X_i$ whose removal induces the highest increase of the objective function. The pro-
cedure is enhanced by an iterative sequential replacement which, at each step,
swaps the status of a selected and a non-selected variable such that the largest
increase in the objective function is achieved. The sequential replacement is
stopped when no further improvement is met [43]. Forward selection, backward
elimination, and sequential replacement all have an algorithmic complexity of
$O(n^2)$, so that the network built by backward elimination followed by sequential replacement has the same asymptotic computational cost as the one based on a
forward selection strategy alone.
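The backward-elimination-plus-sequential-replacement idea can be sketched as follows (plain Python; the quadratic objective and the MI values are schematic stand-ins for illustration, not the exact mrnetb formulation):

```python
from itertools import combinations

def objective(S, mim, target):
    """Schematic quadratic objective: total relevance of the selected set S
    minus its total internal redundancy."""
    rel = sum(mim[v][target] for v in S)
    red = sum(mim[a][b] for a, b in combinations(sorted(S), 2))
    return rel - red

def backward_select(mim, target, k):
    """Backward elimination followed by sequential replacement."""
    universe = {v for v in mim if v != target}
    S = set(universe)
    # Backward elimination: drop the variable whose removal most increases
    # (or least decreases) the objective, until k variables remain.
    while len(S) > k:
        S.remove(max(S, key=lambda v: objective(S - {v}, mim, target)))
    # Sequential replacement: swap a selected / non-selected pair as long
    # as doing so improves the objective.
    improved = True
    while improved:
        improved = False
        swaps = [(S - {s}) | {o} for s in S for o in universe - S]
        best = max(swaps, key=lambda c: objective(c, mim, target),
                   default=None)
        if best is not None and objective(best, mim, target) > objective(S, mim, target):
            S, improved = best, True
    return S

# Hypothetical MI values: A is highly relevant; B is relevant but redundant
# with A; C is mildly relevant but carries independent information.
mim = {
    "T": {"A": 0.9, "B": 0.8, "C": 0.4},
    "A": {"T": 0.9, "B": 0.85, "C": 0.1},
    "B": {"T": 0.8, "A": 0.85, "C": 0.1},
    "C": {"T": 0.4, "A": 0.1, "B": 0.1},
}
assert backward_select(mim, "T", 2) == {"A", "C"}
```

Both search directions are hill-climbs on the same objective, which is why, as noted above, they share the $O(n^2)$ asymptotic cost; backward elimination simply avoids committing to an unfortunate first pick.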
As one may further notice, the inference of GRNs by means of such high-performance IT methods entails a large computational complexity. The limiting condition in these approaches is the time-consuming step of computing the MI matrix. A method has been proposed by Qiu and colleagues [53] to
reduce this computation time. It is based on the application of spectral analy-
sis to re-order the genes, so that genes that share regulatory relationships are
more likely to be placed close to each other. Then, using a sliding window ap-
proach with appropriate window size and step size, the MI for the genes within
the sliding window is computed, and the remainder is assumed to be zero.
Qiu's method does not incur performance loss in regions of high-precision and
low-recall, while the computational time is significantly lowered. The essence
of Qiu's method is as follows: to determine the new gene ordering, a Lapla-
cian matrix is derived from the correlation matrix of the gene expression data, assuming the correlation matrix provides an adequate approximation to the adjacency matrix for this purpose; the Fiedler vector [11] is then computed, which
is the eigenvector associated with the second smallest eigenvalue of the Lapla-
cian matrix. Since the Fiedler vector is smooth with respect to the connectivity
described by the Laplacian matrix, the elements of the Fiedler vector are then
sorted to obtain the desired gene ordering. The computational complexity of ob-
taining the gene ordering is negligible compared to the computation of the MI
matrix. The reduction in computational complexity is the result of computing
only the diagonal part of the reshuffled MI matrix. Because the remaining en-
tries of the MI matrix are set to be zeros, there is potential loss of reconstruction
accuracy although due to Fielder minimization [53] this effect is not expected
to be significant. In fact, according with a benchmark of the method [53] in the
high-precision low-recall regime, applying the sliding window does cause a per-
formance loss. In some cases, applying the sliding window yields slightly better
-
8/10/2019 Hernndez-Lemus & Rangel-Escareo, 2011. The Role of Information Theory in Gene Regulatory Network Inference
15/48
The Role of Information Theory... 151
performance. In the low-precision regime, however, the windowed version haslower recall but this regime is dismissed, because one is not able to distinguish
biologically meaningful links from false positive ones.
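As an illustration, the reordering step can be sketched in a few lines of NumPy; the function name and the use of the absolute correlation matrix as a surrogate adjacency matrix are our own illustrative choices, and the subsequent windowed MI computation is omitted:

```python
import numpy as np

def fiedler_ordering(expr):
    """Reorder genes so that strongly correlated genes become adjacent.

    expr: (n_genes, n_samples) expression matrix.
    Returns a permutation of gene indices sorted by the Fiedler vector
    of the graph Laplacian built from the absolute correlation matrix.
    """
    w = np.abs(np.corrcoef(expr))           # surrogate adjacency matrix
    np.fill_diagonal(w, 0.0)
    lap = np.diag(w.sum(axis=1)) - w        # graph Laplacian L = D - W
    vals, vecs = np.linalg.eigh(lap)
    fiedler = vecs[:, np.argsort(vals)[1]]  # second-smallest eigenvalue
    return np.argsort(fiedler)

# Demo: two interleaved co-expression groups, (0, 2, 4) and (1, 3, 5).
rng = np.random.default_rng(0)
a, b = rng.normal(size=50), rng.normal(size=50)
expr = np.vstack([a, b, a, b, a, b]) + 0.01 * rng.normal(size=(6, 50))
order = fiedler_ordering(expr)
print(order)  # members of each group end up adjacent
```

Sorting by the Fiedler vector places genes belonging to the same tightly connected block next to each other, so a band around the diagonal of the reshuffled MI matrix captures most candidate interactions.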
3.1.2. Markov Random Fields
A Markov random field is an n-dimensional random process defined on a discrete lattice. Usually the lattice is a regular 2-dimensional grid in the plane, either finite or infinite. Assuming that X_n is a Markov chain taking values in a finite set,

P(X_n = x_n | X_k = x_k, k ≠ n) = P(X_n = x_n | X_{n-1} = x_{n-1}, X_{n+1} = x_{n+1})   (21)

Hence, the full conditional distribution of X_n depends only on the neighbors X_{n-1} and X_{n+1}. In the 2-D setting, S = {1, 2, . . . , N} × {1, 2, . . . , N} is the set of N² points, called sites or states. The analogous conditional structure on this lattice defines a conditional Markov random field [32].
Markov random field (MRF) models have been applied in several scenarios within the computational molecular biology setting, for instance with regard
to functional prediction of proteins in protein-protein interaction networks [14,
15, 35], in the discovery of molecular pathways for protein interaction and gene
expression data [56] and in general network-based analysis for genomic data
[63, 64]. In the case of reverse engineering methods for network inference, an MRF model could be stated as follows [63]:

An arbitrary state assignment for a gene set will be denoted by x = (x_1, x_2, . . . , x_p), where x_i is the expression state (equally or differentially expressed, coded 0 or 1 respectively) of gene i; the true gene expression state is unknown. We can interpret x as a particular realization of a random vector X = (X_1, X_2, . . . , X_p), where X_i assigns an expression state to gene i. Let y_i stand for the experimentally observed mRNA expression level of gene i, and y for the corresponding vector, here interpreted as a particular realization of a random vector Y = (Y_1, Y_2, . . . , Y_p). Each y_i is itself a vector y_i = (y_{i,1}, . . . , y_{i,m}, y_{i,m+1}, . . . , y_{i,m+n}), containing m replicates under one condition and n replicates under the other. The joint distribution of Y could be given in terms of an MRF; to write down this joint probability we need to know the conditional dependence/independence structure. Information theory could then be used to determine such conditional dependencies from the distributions. One way to do so is by means of the so-called Iterative Conditional Mode (ICM) algorithm [63], but other IT-based alternatives could also be used.
Conditional dependencies are not the only application of IT and MRFs in transcriptional network inference. To study functional robustness in GRNs, Emmert-Streib and Dehmer [20] modeled the information processing within the network as a first-order Markov chain and studied the influence of single-gene perturbations on the global, asymptotic communication among genes. Differences were accounted for by an information-theoretic measure that allowed them to predict genes that are fragile with respect to single-gene knockouts. The information-theoretic measure used to capture the asymptotic behavior of information processing evaluates the deviation of the unperturbed (or normal, n) state from the perturbed (p) state caused by the perturbation of gene k. The relative entropy or Kullback-Leibler (KL) divergence was used to quantify this deviation:
KL_{i,k} = KL(p_{p,i,k}; p_{n,i}) = Σ_m p_{p,i,k}(m) log [ p_{p,i,k}(m) / p_{n,i}(m) ]   (22)
In equation 22 the stationary distributions p_{p,i,k} and p_{n,i} are given by:

p_{p,i,k} = lim_{t→∞} T_k^t p_i^0   (23)

p_{n,i} = lim_{t→∞} T^t p_i^0   (24)
The Markov chain given by T_k corresponds to the process obtained by perturbing gene k in the network. By means of this Markov chain model, supplemented with the information-theoretic KL measure, Emmert-Streib and Dehmer [20] were able to study the asymptotic behavior of the transcriptional regulatory network of yeast regarding information propagation under the influence of single-gene perturbations. Hence not only static network properties (such as structure) of transcriptional regulation networks but also dynamic features (such as robustness) can be analyzed from the standpoint of IT. The study concludes that knocked-out genes destroy some communication paths and, hence, can still have a strong impact on the information processing within the cell. It seems reasonable to assume that the further away the knocked-out gene is from the starting gene (say, in Dijkstra distance [16]), the smaller the impact will be. This is strong evidence that information processing at a systems level depends crucially on the information processing in the local environment of the gene that sends the information.
From a perspective of information processing, the connection between asymptotic information change and local network structure, represented by node degrees, is interesting because it indicates that a local subgraph may be sufficient to study information processing in the overall network. This finding is truly interesting because it would allow reducing the computational complexity (and the computational burden) that arises when studying large genomes on a systems scale.
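The comparison in equations 22-24 can be sketched numerically. In the sketch below, the perturbation scheme (making the knocked-out gene forward information indiscriminately) and all function names are illustrative assumptions of ours, not the exact scheme of [20]:

```python
import numpy as np

def stationary(T, p0, iters=500):
    """Iterate p_{t+1} = p_t T toward the stationary distribution."""
    p = p0.copy()
    for _ in range(iters):
        p = p @ T
    return p

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence between two discrete distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def knockout_impact(T, k, p0):
    """Deviation of the perturbed from the normal stationary distribution
    when gene k is perturbed (here: its outgoing row made uniform)."""
    Tk = T.copy()
    Tk[k, :] = 1.0 / T.shape[0]
    return kl(stationary(Tk, p0), stationary(T, p0))

# A small 4-gene communication network (row-stochastic transition matrix).
T = np.array([[0.10, 0.80, 0.05, 0.05],
              [0.10, 0.10, 0.70, 0.10],
              [0.25, 0.25, 0.25, 0.25],
              [0.70, 0.10, 0.10, 0.10]])
p0 = np.full(4, 0.25)
print(knockout_impact(T, 1, p0))  # > 0: perturbing gene 1 shifts the stationary state
```

A gene whose outgoing row is already uniform (gene 2 above) has zero impact under this scheme, mirroring the idea that fragility depends on the local communication structure around the perturbed gene.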
3.1.3. Data Processing Inequality
In engineering and information theory, the data processing inequality (DPI) is a simple but useful theorem stating that, no matter what processing is applied to some data, one cannot get more information (in the sense of Shannon [58]) out of the data than was there to begin with. In a sense, it provides a bound on how much can be accomplished with signal processing [12]. More quantitatively, consider two random variables, X and Y, whose mutual information is I(X; Y). Now consider a third random variable, Z, that is a (probabilistic) function of Y only. The only qualifier means P_{Z|XY}(z|x, y) = P_{Z|Y}(z|y), which in turn implies that P_{X|YZ}(x|y, z) = P_{X|Y}(x|y), as is easy to show using Bayes' theorem. The DPI states that Z cannot have more information about X than Y has about X; that is, I(X; Z) ≤ I(X; Y). This inequality, which again is a property that Shannon's information should have, can be proved as follows: I(X; Z) = H(X) − H(X|Z) ≤ H(X) − H(X|Y, Z) = H(X) − H(X|Y) = I(X; Y). The inequality follows because conditioning on an extra variable (in this case Y as well as Z) can only decrease entropy, and the second-to-last equality follows because P_{X|YZ}(x|y, z) = P_{X|Y}(x|y). This same principle is applicable both to engineering control systems and to biological signal processing such as the one present in GRNs [38, 57].
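The inequality can be checked numerically on a toy Markov chain X → Y → Z of binary variables; the flip probabilities below are arbitrary illustrative values:

```python
import numpy as np

def mutual_info(pxy):
    """Mutual information (in nats) from a joint probability table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

def flip_channel(eps):
    """Binary symmetric channel: flips a bit with probability eps."""
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

px = np.array([0.5, 0.5])              # X uniform
pxy = np.diag(px) @ flip_channel(0.1)  # joint P(X, Y)
pxz = pxy @ flip_channel(0.2)          # joint P(X, Z), marginalizing over Y
print(mutual_info(pxz) <= mutual_info(pxy))  # True: I(X;Z) <= I(X;Y)
```

Each processing step (here, another noisy channel) can only discard information about X, never create it.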
In reference [38] the DPI states that if genes g_1 and g_3 interact only through a third gene, g_2, then I(g_1, g_3) ≤ min[I(g_1, g_2), I(g_2, g_3)]. Hence the least of the three MIs can come from indirect interactions only, so the proposed algorithm (ARACNe) examines each gene triplet for which all three MIs are greater than a threshold I_0 and removes the edge with the smallest value. The DPI is thus useful to quantify efficiently the dependencies among a large number of genes. The ARACNe algorithm eliminates those statistical dependencies that are likely of an indirect nature, such as between two genes that are separated by intermediate steps in a transcriptional cascade. Such genes will very likely have non-linearly correlated expression profiles, which may result in high MI, and would otherwise be selected as candidate interacting genes. Given a transcription factor, application of the DPI will generate predictions about other genes that may be its direct transcriptional targets or its upstream transcriptional regulators [39, 25].
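An ARACNe-style pruning pass over an MI matrix can be sketched as follows (a naive O(n³) loop for clarity; the function name is ours, and edge removals are applied after scanning all triplets, which is one reasonable reading of the procedure):

```python
import numpy as np

def dpi_prune(mi, i0=0.0):
    """Remove the weakest edge of every gene triplet whose three mutual
    informations all exceed the threshold i0 (data processing inequality)."""
    n = mi.shape[0]
    keep = mi > i0
    np.fill_diagonal(keep, False)
    remove = np.zeros_like(keep)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k == i or k == j:
                    continue
                if keep[i, j] and keep[i, k] and keep[j, k]:
                    # Edge (i, j) is the weakest of the triplet: likely indirect.
                    if mi[i, j] <= min(mi[i, k], mi[j, k]):
                        remove[i, j] = remove[j, i] = True
    return keep & ~remove

# Transcriptional cascade g1 -> g2 -> g3: the g1-g3 MI is high but indirect.
mi = np.array([[0.0, 0.8, 0.4],
               [0.8, 0.0, 0.7],
               [0.4, 0.7, 0.0]])
print(dpi_prune(mi, i0=0.1).astype(int))
```

The indirect edge g1-g3 is the smallest of its triplet and is dropped, while the two direct cascade edges survive.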
The use of the DPI may result not only in a better assessment of the results but also in a significant reduction of the computational burden associated with network inference. Zola et al. [67] presented a parallel method integrating mutual information, the data processing inequality, and statistical testing to detect significant dependencies between genes, efficiently exploiting the parallelism inherent in such computations. They developed a method to carry out permutation testing for assessing the statistical significance of interactions while reducing its computational complexity by a factor of O(n²), where n is the number of genes. They addressed the problem of inference at the whole-genome level (usually consuming thousands of computation hours) by constructing a 15,222-gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in 30 minutes on a 2,048-CPU IBM Blue Gene/L, and in 2 hours and 25 minutes on an 8-node Cell blade cluster [67].
3.1.4. Minimum Description Length
One of the major drawbacks of information-theoretic models for inferring GRNs is that of setting up a threshold which defines the regulatory relationships between genes. The minimum description length (MDL) principle has been implemented to overcome this problem [10, 19]. The description length used by the MDL principle is the sum of the model length and the data encoding length. A user-specified fine-tuning parameter is used as a control mechanism between model and data encoding, but it is difficult to find the optimal parameter. A new inference algorithm has been proposed which incorporates mutual information (MI), conditional mutual information (CMI) [defined in terms of the associated conditional entropies] and the predictive minimum description length (PMDL) principle to infer gene regulatory networks from DNA microarray data.
It is also noticeable that the MDL principle helps to achieve a good trade-off between network model complexity and the accuracy of data fitting, since given a network and a dataset, the MDL principle simultaneously evaluates the goodness of fit of the network to the data. Intuitively, the more complicated the network is, the better the data would be fitted. However, models that are over-fitted relative to the actual system are very often selected, which gives rise to numerous errors. MDL aims to achieve a good trade-off between model complexity and fitness to the data. A general criterion is thus obtained for constructing the network so as to contain only direct interactions. The convergence of the proposed MDL-based network inference algorithms can be assessed by the recovery of the topology of artificial networks and through the error-rate plots obtained from extensive simulations on datasets produced by synthetic networks [66].
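The two-part trade-off can be made concrete with a toy description-length score for candidate networks; the Gaussian coding of residuals and the edge-naming cost below are illustrative simplifications of the MDL idea devised for this sketch, not the cited algorithms:

```python
import numpy as np

def description_length(expr, parents):
    """Toy MDL score: model bits (naming each regulator) plus data bits
    (Gaussian code length of each gene's residuals given its parents)."""
    n_genes, n_samples = expr.shape
    model_bits = sum(len(p) for p in parents.values()) * np.log2(n_genes)
    data_bits = 0.0
    for g, pa in parents.items():
        y = expr[g]
        if pa:
            X = np.vstack([expr[list(pa)], np.ones(n_samples)]).T
            resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        else:
            resid = y - y.mean()
        var = max(resid.var(), 1e-12)
        data_bits += 0.5 * n_samples * np.log2(2 * np.pi * np.e * var)
    return model_bits + data_bits

# A true regulation g0 -> g1 is cheaper to encode than independence:
# the extra model bits are repaid by a much shorter code for the residuals.
rng = np.random.default_rng(1)
g0 = rng.normal(size=200)
g1 = 2.0 * g0 + 0.1 * rng.normal(size=200)
expr = np.vstack([g0, g1])
print(description_length(expr, {0: [], 1: [0]}) <
      description_length(expr, {0: [], 1: []}))  # True
```

An over-fitted network, by contrast, keeps paying model bits for edges that barely shrink the residuals, so its total description length grows.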
3.1.5. Kullback-Leibler Divergence

The Kullback-Leibler divergence [33] (as well as its symmetrized version, the Jensen-Shannon measure) is, as it turns out, a very commonly used information measure in GRN inference and other problems in computational molecular biology. It has been used either as the sole measure [45, 44] or in conjunction with other indicators, such as spectral metrics [29], Markov fields [20], minimum description lengths [19], Bayesian networks [50, 31, 46, 48] and multivariate analysis [40].
However, by far the most general use of the KL divergence within the GRN information setting is in playing the role of the multi-information: it is known [40] that for two variables, X_1 and X_2, independence is well defined via decomposition of the bivariate joint probability distribution (JPD), P(X_1, X_2) = P(X_1)P(X_2), and the mutual information I(X_1; X_2) = ⟨log_2 P(X_1, X_2)/[P(X_1)P(X_2)]⟩, which is the only measure of dependence [58]. Along the same lines, the total interaction (i.e. the deviation from independence) in a multivariate JPD P(X_i), i = 1, . . . , N, can be measured by the multi-information as follows:

I(X_1; X_2; . . . ; X_N) = KL[P(X_1, X_2, . . . , X_N), P̃] = KL[P(X_1, X_2, . . . , X_N), ∏_i P(X_i)]   (29)

Here P(X_1, X_2, . . . , X_N) is the full JPD and P̃ = ∏_i P(X_i) is the probability distribution approximated under the independence assumption. Since P̃ is the maximum entropy (MaxEnt) distribution [28] that has the same univariate marginals as P but no statistical dependencies among the variables, the multi-information is given by the KL divergence between the JPD and its MaxEnt approximation under univariate marginal constraints. This KL divergence measures the gain in information from knowing the complete JPD as opposed to assuming total independence. In a similar fashion, MaxEnt distributions consistent with various multivariate marginals of the JPD introduce no statistical interactions apart from the corresponding marginals. By comparing the JPD to its MaxEnt approximations under various marginal constraints, one expects to separate the dependencies included in the low-order statistics from those not present in them [40].
Assuming that we have an N-variable GRN and we know a set of marginal distributions of all variable subsets up to size k, one can ask what is the JPD P_k that captures all multivariate interactions prescribed by these marginals but introduces no additional dependencies. This is of course equivalent to searching for the minimum multi-information I(X_1; X_2; . . . ; X_N) or, conversely, the maximum entropy H(X_1; X_2; . . . ; X_N), turning our inference problem into a MaxEnt problem:

P_k ≡ arg max_{P, {λ_M}} [ H(P) − Σ_M λ_M (P_k^M − P^M) ]   (30)

where M runs over the set of constrained marginals, P^M denotes the corresponding marginal of P, and the λ_M are Lagrange multipliers.
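For small discrete systems, equation 29 can be evaluated directly from the joint probability table; the function below (our own sketch) compares the JPD with the product of its univariate marginals:

```python
import numpy as np

def multi_information(joint):
    """Multi-information I(X1;...;XN) = KL(P || prod_i P(Xi)), in bits,
    for a joint probability table with one axis per variable."""
    n = joint.ndim
    indep = np.ones_like(joint)
    for axis in range(n):
        marginal = joint.sum(axis=tuple(a for a in range(n) if a != axis))
        shape = [1] * n
        shape[axis] = -1
        indep = indep * marginal.reshape(shape)  # MaxEnt product of marginals
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / indep[mask])))

# Two perfectly coupled binary genes carry 1 bit of multi-information;
# two independent ones carry none.
coupled = np.array([[0.5, 0.0], [0.0, 0.5]])
independent = np.full((2, 2), 0.25)
print(multi_information(coupled), multi_information(independent))
```

For N = 2 this reduces to the ordinary mutual information; for larger N it measures the total deviation from independence against the univariate-marginal MaxEnt approximation.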
3.1.6. Information Based Similarity
A promising approach considers that the interactivity of the system is based on communication channels (either real or abstract) for the bio-signals. Thus, Information Theory (IT) could play a useful role in identifying entropic measures between pairs {g_i, g_j} of genes within the sampling universe as potential interactions. IT can also provide means to test for the MaxEnt distribution, by considering, for example, the Kullback-Leibler (KL) divergence (in the sense of multi-information) or the connected information as criteria of iterative convergence to the MaxEnt PDF, in the same sense that the cumulative distribution leads to the specification of usual PDFs [61].
One possible approach, which we outline below, is based on the quantification of the so-called Information-Based Similarity index (IBS) [65], initially developed to work out the complex structure generated by the human heart-beat time series. Nevertheless, IBS has proved to be a very powerful tool in the comparison of the dynamics of highly non-linear processes. Within the present context [26], the symbolic sequence represents the expression values of a single gene (say gene k) all along the sampling universe (of size M), as given by a vector Ω = g^k = (g^k_1, g^k_2, . . . , g^k_M). Let us consider such a series, which could well represent a gene expression vector. It is possible to classify each pair of successive points into one of two binary states B_n: if (Ω_{n+1} − Ω_n) < 0 then B_n = 0; otherwise B_n = 1. This procedure maps the M-step real-valued series Ω(i) into an (M − 1)-step binary-valued series B(i). It is now possible to define a binary sequence of length m (called an m-bit word). Each of the m-bit words w_k represents a unique pattern in a given series. By shifting one position at a time, the algorithm collects a set W of m-bit words over the whole series, W = {w_1, w_2, . . . , w_n}. It is expected that the frequency of occurrence of these m-bit words will reflect the underlying dynamics of the original (real-valued) series. We then seek to write down a probability distribution function in the rank-frequency representation (RF-PDF). This RF-PDF represents the statistical hierarchy of symbolic words of the original series [65]. Two symbolic sequences are said to be similar if they give rise to similar probability distribution functions.
Following the same order of ideas, Yang and collaborators [65] defined a measure of similarity (akin to statistical equivalence) between two series by plotting the rank number of every m-bit word in the first series against the rank of the same m-bit word in the second series. Of course, since the series are finite, the m-bit words are not equally likely to appear. The method introduces the likelihood of each word by defining a weighted distance Δ_m between two given symbolic sequences Ω_1 and Ω_2 as follows:

Δ_m(Ω_1, Ω_2) = [1 / (2^m − 1)] Σ_{k=1}^{2^m} |R_1(w_k) − R_2(w_k)| F(w_k)   (31)

F(w_k) is the normalized likelihood of the m-bit word w_k, weighted by its Shannon entropy, i.e.:

F(w_k) = (1/Z) [−p_1(w_k) log p_1(w_k) − p_2(w_k) log p_2(w_k)]   (32)
Here p_i(w_k) and R_i(w_k) represent the probability and rank of a given word w_k in the i-th series. The normalization factor in equation 32 is the total Shannon entropy of the ensemble, calculated as Z = Σ_k [−p_1(w_k) log p_1(w_k) − p_2(w_k) log p_2(w_k)]. Δ_m(Ω_1, Ω_2) is called the Information-Based Similarity index (IBS) between series Ω_1 and Ω_2 (e.g. expression vectors g^1 and g^2 for genes 1 and 2, respectively). One notices that Δ_m(Ω_1, Ω_2) ∈ [0, 1] for all Ω_1, Ω_2 and m; in fact, one is able to consider Δ_m(Ω_1, Ω_2) as a probability measure. If Δ_m(Ω_1, Ω_2) → 1 the series are absolutely dissimilar, whereas in the opposite case (Δ_m(Ω_1, Ω_2) → 0) the two series become equivalent (in the statistical sense). One can then approximate the value of the interaction potential between genes g_i and g_j as follows. If interaction is taken to be given by correlation or information flow, one notices that high values of Δ_m imply stronger dissimilarity, hence lower correlation; and since Δ_m is a probability measure, one can define the complementary measure Θ_m = 1 − Δ_m and then approximate the interaction potential by Θ_m(g_i, g_j).
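A direct transcription of equations 31 and 32 can be sketched as follows (with ties in the word ranking broken arbitrarily, an implementation choice of ours):

```python
import numpy as np
from itertools import product

def word_probs(series, m):
    """Probability of each m-bit up/down word along a real-valued series."""
    b = (np.diff(series) >= 0).astype(int)          # B_n: 0 down, 1 up
    vocab = ["".join(map(str, w)) for w in product((0, 1), repeat=m)]
    counts = dict.fromkeys(vocab, 0)
    for i in range(len(b) - m + 1):
        counts["".join(map(str, b[i:i + m]))] += 1
    total = max(sum(counts.values()), 1)
    return {w: c / total for w, c in counts.items()}

def ibs(s1, s2, m=3):
    """Information-based similarity index Delta_m between two series [65]."""
    p1, p2 = word_probs(s1, m), word_probs(s2, m)
    rank = lambda p: {w: r for r, w in
                      enumerate(sorted(p, key=p.get, reverse=True))}
    r1, r2 = rank(p1), rank(p2)
    h = lambda p: -p * np.log(p) if p > 0 else 0.0  # Shannon entropy terms
    F = {w: h(p1[w]) + h(p2[w]) for w in p1}        # unnormalized F(w_k)
    Z = sum(F.values())
    if Z == 0.0:
        return 0.0
    return sum(abs(r1[w] - r2[w]) * F[w] for w in F) / (Z * (2**m - 1))

rng = np.random.default_rng(2)
s, t = rng.normal(size=200), rng.normal(size=200)
print(ibs(s, s), ibs(s, t))  # 0 for identical series; in [0, 1] otherwise
```

Applied to a pair of expression vectors, 1 − ibs(g_i, g_j) then plays the role of the interaction-potential approximation described above.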
4. Bayesian and Machine Learning Methods
Systems biology aims to understand biological processes in living systems by developing mathematical models capable of integrating both experimental and theoretical knowledge, and it works both ways: given a pre-specified mathematical framework, the behavior of a set of genes in a specific GRN can be simulated under a variety of biological conditions and used to test hypotheses; conversely, given a particular pre-specified mathematical framework, the observation of gene behavior under specific conditions may be used to infer the underlying GRN. Generally speaking, the reconstruction of a GRN based on experimental data is known as a reverse engineering approach.

In the context of information theory combined with systems biology, there are two well-known information extraction approaches, characterized as top-down and bottom-up; both have been used to infer GRNs from high-throughput data sources such as microarray gene expression measurements. A top-down approach mainly breaks down a system in order to gain insight into it. On the other side, bottom-up approaches seek to construct synthetic gene networks.

The simplest network in an information-theoretic approach is the correlation network. This is an undirected graph whose edges are weighted by correlation coefficients. It is simple, computationally manageable, and has small data requirements. The drawback is that these models are static and do not infer the causality of gene regulation.
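A minimal correlation network can be built in a few lines (the threshold value is an arbitrary illustrative choice):

```python
import numpy as np

def correlation_network(expr, threshold=0.7):
    """Undirected graph: edge between genes i, j when |Pearson r| >= threshold."""
    r = np.corrcoef(expr)
    adj = np.abs(r) >= threshold
    np.fill_diagonal(adj, False)
    return adj, r

# Two co-expressed genes and one unrelated gene.
rng = np.random.default_rng(3)
g0 = rng.normal(size=100)
expr = np.vstack([g0,
                  g0 + 0.05 * rng.normal(size=100),
                  rng.normal(size=100)])
adj, r = correlation_network(expr)
print(adj.astype(int))
```

The resulting adjacency matrix is symmetric, which illustrates the limitation noted above: the graph says nothing about which gene regulates which.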
4.1. Bayesian Networks

A Bayesian network (BN) is a probabilistic graphical network model described by a directed acyclic graph (DAG). In the model each node represents a random variable and edges define conditional independence relations between these random variables. These relationships, e.g. gene-gene interactions, can be seen in a directed graph without cycles. "Without cycles" means a gene may have no direct or indirect interaction with itself. In order to reverse engineer a gene network using this approach, one would need to find the directed acyclic graph that best describes the gene expression data. This particular limitation of a directed acyclic graph can be overcome by using a dynamic Bayesian network.
4.2. Dynamic Bayesian Networks
Bayesian networks that model sequences of variables are called dynamic
Bayesian networks (DBNs). Murphy and Mian [47] first introduced the use of
DBNs to model gene expression time series data. The benefits of DBNs include
the ability to handle latent variables and missing data (such as transcription
factor protein concentrations, which may have an effect on mRNA steady state
levels) and to model stochasticity. Friedman et al. [23] explored experimental
applications to microarray data analysis. Dynamic Bayesian networks may also
use continuous measurements rather than discrete ones. Feedback loops can also be unfolded with respect to time, by explicitly modeling the influence of gene g_1 at time t_1 on another gene g_2 at time t_2, where t_2 > t_1. An appropriate model for gene expression microarray data belongs to the class of linear state-space models, widely used in estimation and control problems arising in system modeling. These models consist of a state variable that is either unobserved or partially observed, an observable that evolves in a linear relation to the state variable, and a structural specification, i.e. a set of parameters in the linear and distributional relationships between state variables, observables, and noise terms.
4.3. State-Space Models
State-space models, also known as Linear Dynamical Systems (LDS), are a subclass of dynamic Bayesian networks. A state-space model is a mathematical model for a process that accepts inputs, which are the drivers of the process, and generates outputs, which are interpreted as observable manifestations of what is going on inside the process and of how this internal behavior is affected by the inputs. These models are suitable for modeling time-series data where we have
a series of observations related to a series of unobserved variables changing over time. Time-series models in state-space representation can be thought of as unobserved-component models. The state vector represents those unobserved (hidden or missing) variables, and their dynamics over time are governed by a state transition equation. In the very general setting of a state-space model, the state vector determines the future evolution of the dynamic system, given future time paths of all of the variables affecting the system. The variables are not restricted: they can be either discrete, with a countable number of possible values, or continuous, with an associated density. For example, modeling gene expression data assumes continuous variables and requires the inclusion of hidden states. Hidden variables could model the effects of genes that have not been included in the experiment; they could also model levels of regulatory proteins as well as possible effects of mRNA or protein degradation. One goal is to infer the characteristics and properties of the unobserved variables based on the observations. In linear state-space models, a sequence of p-dimensional real-valued observation vectors {y_1, . . . , y_T} is modeled by assuming that at each time step y_t was generated from a K-dimensional real-valued hidden (i.e. unobserved) state variable x_t, and that the sequence of x's is governed by a first-order Markov process. This type of model is shown pictorially in Figure 3.
A linear-Gaussian state-space model of the time series {y_t} is specified by the matrices A and C, called system matrices, and is described by a pair of equations:

x_{t+1} = A x_t + w_t   (33)

y_t = C x_t + v_t   (34)
These two equations represent the most basic form of a state-space model. The vector x_t ∈ R^K is called the state vector at time t. The state equation (33) shows how this vector evolves with time. A is the dynamic or state transition matrix, and its eigenvalues are important in determining the way the data behave. The observation equation (34) specifies the relationship between the observed data and the newly introduced vector x_t. C describes the relation between state and observation, and w_t and v_t are zero-mean random noise vectors.
Figure 3. State-Space model.

In the most general case the noise vectors could be mutually correlated, although serially uncorrelated. In the particular linear-Gaussian case they are mutually independent and independent of the initial state value x_0. Assuming that the initial state x_0 is fixed or Gaussian distributed, and that the noise vectors are jointly Gaussian, the state and output of the system are also Gaussian. That is, all future hidden states x_t and observations y_t generated from those hidden states will be Gaussian distributed.
This model has been extensively used in state-space modeling. Brockwell and Davis [7] develop the state-space model described by (33) and (34) as well as the associated Kalman filter recursions, and apply these to representing ARMA (autoregressive moving average) and ARIMA (autoregressive integrated moving average) processes. The Kalman filter recursions define recursive estimators for the state vector x_t given observations up to the present time t. Stoffer and Shumway [59] present a similar development and apply it to representing ARMAX (autoregressive moving average with exogenous terms) models. Stoffer and Shumway also develop the recursive smoother, which gives estimators of the state variable x_t given observations prior to and after time t, and develop state-space models that include exogenous inputs in the state equation, the observation equation, or both. State-space models can be written in different ways. The structure of the model used in this work includes exogenous variables in both equations, and its derivation is detailed in the next section.
4.4. LDS Model for Gene Expression

Fluorescent intensities are measures of gene expression levels. Values of some of these variables influence the values of others through the regulatory proteins they express, including the possibility that the expression of a gene at one time point may, in various circumstances, influence the expression of that same gene at a later time point.

To model the influence of the expression of one gene at a previous time point on another gene and its associated hidden variables, we modify the structure of the LDS model to include inputs, as follows. We let the observations be y_t^{(i)} = g_t^{(i)}, the expression level of gene i at time point t, and the inputs h_t = g_t and u_t = g_{t−1}, to give the model shown in Figure 4.
Figure 4. Bayesian network representation of the model for gene expression.
This model is described by the following equations:

x_{t+1} = A x_t + B g_t + w_t   (35)

g_t = C x_t + D g_{t−1} + v_t   (36)
Model Assumptions

The vector u_t ∈ R^{p_u} is the exogenous input observation vector, and h_t ∈ R^{p_h} represents the exogenous influence on the hidden states. As before, the state and observation vectors x_t and y_t have dimensions K and p, respectively. A is the state transition matrix, B is the input-to-state matrix in the state transition equation, C is the state-to-observation matrix, and D is the input-to-observation matrix. The state and observation noise vectors, w_t and v_t respectively, are random vectors, serially independent and identically distributed, independent of the initial values of x and y, and independent of one another.
Remarks

The system matrices A, B, C, D are taken to be constant in this work, but they may also vary over time, in which case it is appropriate to add a subscript indicating this.

When the sequence {x_1, w_1, . . . , w_T} is independent, the distribution of x_{t+1}|x_t, . . . , x_1 is the same as the distribution of x_{t+1}|x_t; hence the state vector x_t evolves with a first-order Markov property, with A as the transition matrix. The noise vectors can also be viewed as hidden variables. Here the matrix D in the observation equation captures gene-gene expression-level influences at consecutive time points, whilst the matrix C captures the influence of the hidden variables on gene expression levels at each time point. Matrix B models the influence of gene expression values from previous time points on the hidden states, and A is the state transition matrix. However, our interest focuses on CB + D, which captures not only the direct gene-to-gene interactions but also the gene-to-gene interactions mediated by the hidden states over time. This is the matrix on which we will concentrate the analysis, since it captures all of the information related to gene-gene interaction over time.
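To make the role of CB + D concrete, consider a toy example with hypothetical fitted matrices for p = 3 genes and K = 2 hidden states (all numeric values are made up for illustration):

```python
import numpy as np

# Hypothetical fitted system matrices (p = 3 genes, K = 2 hidden states).
A = np.array([[0.5, 0.1],
              [0.0, 0.6]])          # hidden-state transition
B = np.array([[0.2, 0.0, 0.1],
              [0.0, 0.3, 0.0]])     # gene -> hidden state
C = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])          # hidden state -> gene
D = np.array([[0.0, 0.4, 0.0],
              [0.0, 0.0, 0.0],
              [0.2, 0.0, 0.0]])     # direct gene -> gene

# Net one-step gene-to-gene influence: the direct route (D) plus the
# route through the hidden states (C B).
interaction = C @ B + D
print(np.round(interaction, 2))
```

For instance, entry (2, 0) of the result combines a direct contribution D[2, 0] = 0.2 with an indirect one through both hidden states, 0.5 × 0.2 = 0.1, giving a total influence of gene 0 on gene 2 of 0.3.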
5. Constrained LDS
Mathematically speaking, the idea of adding constraints to the model is basically to reduce the number of parameters to estimate. Narrowing down the range of parameters to estimate by adding extra constraints reduces dimensionality, which can considerably simplify the search for the parameters that best describe the model. At all times during modeling with constraints, diagnostics should be run to make sure the model still fits well after taking the constraints into account. How precisely to include these forms of information in the inference process is not a straightforward task; this, however, is the true art of modeling.
From the biological point of view, the current application to gene expression data is already complex. Data generation, low-level analyses, and classification are known to be crucial in obtaining gene expression levels, and different algorithms can lead to different sets of genes. Hence, biological mining should be present in any machine learning approach; in this sense, any knowledge about gene behavior and regulatory interactions is helpful. If this additional information can be included and modeled, estimation becomes more realistic not only because of the reduction of parameters but also because of a more biologically grounded approach.

Given either a-priori or newly hypothesized information leading to a set of plausible models, the LDS model is re-trained based on this knowledge about the parameters. The a-priori information would be supplied by past experiments or biological knowledge, while the newly hypothesized information is obtained from the bootstrap analysis.
5.1. Model Definition
Two competing motivations must be kept in mind when defining a model: fidelity and tractability. The model's fidelity describes how closely it corresponds to reality. On the other hand, the model's tractability concerns the ease with which it can be mathematically described as well as analyzed and validated statistically from observations and measurements. Understandably, increasing one (either fidelity or tractability) usually comes at the expense of the other. Consequently, the ideal model should be developed in close cooperation between the science governing the application and feasible mathematical and statistical methods. One common assumption that aids tractability is that model errors are normally (Gaussian) distributed. Indeed, a large number of existing algorithms and methods of statistical inference are based on jointly Gaussian observables. Though rarely satisfied exactly in practice, this assumption is often justified because it makes the analysis of the model tractable, and the resulting statistical inferences are robust in the sense of being insensitive to small departures from normality. The model used in this work adopts the Gaussian assumption only insofar as it makes the analysis more straightforward and tractable. However, for statistical inference and validation of the model, no essential use of the Gaussian assumption is made. Instead, more general methods such as bootstrapping are employed.
5.2. Structural Specification
We will concentrate here on incorporating a-priori information, and for this the emphasis is on constraining elements of the matrix D. The reason is simple: D describes the direct gene-to-gene interactions over time, and therefore seems the most suitable place to incorporate a-priori information. Recall that the gene regulatory network is constructed from the estimate of CB + D, and thus also incorporates the influence of hidden variables (e.g. the influence of missing genes, proteins, etc.). The hypothesized form of this DAG entails that some elements of the matrix CB + D are zero. The idea is to impose those constraints on CB + D, re-estimate the structural parameters of the model under these constraints, and verify that the model still fits the data well. Imposing constraints reduces the dimensionality of the unknown-parameter space, and thus creates a new estimation problem (one for which the remaining unconstrained parameters can be estimated more precisely). Solving this new estimation problem (and performing diagnostics) could therefore expose shortcomings in how well the constrained model describes the data, or reveal parts of the model structure that were obscured by the larger number of parameters in the unconstrained model.
5.3. Estimation
With the structural specification known, the objective is to estimate, in a least-squares sense, the unknown or unobserved state variables from the available observations. The so-called Kalman filter solves this problem, and variations of the filter yield interpolation, extrapolation, and smoothing estimators of the state variables (see the book by Aoki [5], for example). The resulting estimators are optimal in the least-squares sense, given that one is restricted to estimators that are linear functions of the observables. Their derivation can be accomplished in full generality by casting the problem as approximation in a Hilbert space of random variables possessing finite second-order moments. This reduces the problem to computing projections onto the subspaces spanned by the observables, but the derivations and machinery of that theoretical approach are tedious. However, in the special case when the states and observables are jointly Gaussian, the least-squares estimators of the state are given by conditional expectations (conditioned on the observables), which are in turn linear functions of the observables. Moreover, the conditional-expectation operator has all the essential properties of the subspace-projection operator in the Hilbert-space context. As a consequence, the shorter and more elegant analysis of the problem in the Gaussian context leads to exactly the same state estimators as the more general Hilbert-space treatment. Thus, in terms of formulating the state estimators, there is no loss of generality in assuming Gaussian joint distributions.

Regarding the estimation of the structural parameters, in the absence of assumptions about the joint distributions of the state variables and observables, or of any other pertinent information, a weighted least-squares approach would be reasonable and justified. If the state variables and observables are assumed jointly Gaussian, the method of maximum likelihood leads to parameter estimators that are essentially equivalent to those yielded by the weighted least-squares approach. Thus, again, there is no loss of generality in making the Gaussian assumption when constructing estimators of the structural parameters.
5.4. Derivation
To model the effects of the influence of the expression of one gene at a previous time point on another gene and its associated hidden variables, we consider the state-space model

x_{t+1} = A x_t + B y_t + w_t    (37)
y_t = C x_t + D y_{t-1} + v_t    (38)

The column vector x_t is the state vector of hidden variables for the system, u_t is the input observation vector (here the previous expression values, u_t = y_{t-1}), and C is the state-to-observation matrix, which captures the influence of the hidden variables on gene expression levels at each time point.

The matrix D describes the gene-to-gene interactions at consecutive time points. From this matrix we obtain the Bayesian network representation of the gene regulatory network.
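For intuition, the generative model (37)-(38) can be simulated forward. This sketch assumes zero initial conditions and isotropic Gaussian noise with illustrative variances q and r (all names are ours):

```python
import numpy as np

def simulate_lds(A, B, C, D, T, rng, q=0.01, r=0.01):
    """Simulate the state-space model of Eqs. (37)-(38):
        x_{t+1} = A x_t + B y_t + w_t
        y_t     = C x_t + D y_{t-1} + v_t
    The previous expression vector y_{t-1} plays the role of the input."""
    n_hidden, n_genes = A.shape[0], C.shape[0]
    x = np.zeros(n_hidden)
    y_prev = np.zeros(n_genes)
    ys = []
    for _ in range(T):
        v = rng.normal(0.0, np.sqrt(r), n_genes)   # observation noise v_t
        y = C @ x + D @ y_prev + v                 # Eq. (38)
        w = rng.normal(0.0, np.sqrt(q), n_hidden)  # state noise w_t
        x = A @ x + B @ y + w                      # Eq. (37)
        ys.append(y)
        y_prev = y
    return np.array(ys)
```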
In terms of these matrices, the (negative log-)likelihood to be minimized takes the form

-2L(C, D, R) = NT log|R|
    + tr(R^{-1}(S_yy - S_yx C' - S_yu D' - C S_yx' + C P C'
    + C S_xu D' - D S_yu' + D S_xu' C' + D S_uu D'))    (39)

where

S_yy = Σ_{j=1}^{N} Σ_{t=1}^{T} y_t^{(j)} y_t^{(j)'}
S_yx = Σ_{j=1}^{N} Σ_{t=1}^{T} y_t^{(j)} x_t^{(j)'}
S_yu = Σ_{j=1}^{N} Σ_{t=1}^{T} y_t^{(j)} u_t^{(j)'}
S_xu = Σ_{j=1}^{N} Σ_{t=1}^{T} x_t^{(j)} u_t^{(j)'}
S_uu = Σ_{j=1}^{N} Σ_{t=1}^{T} u_t^{(j)} u_t^{(j)'}
P = Σ_{j=1}^{N} Σ_{t=1}^{T} E[x_t x_t' | y_1, ..., y_T]

Taking partial derivatives of (39) and setting them equal to zero, we solve for C, D and R. In other words, we find the unconstrained estimators that minimize the likelihood function (39):

D = (S_yu - S_yx P^{-1} S_xu)(S_uu - S_xu' P^{-1} S_xu)^{-1}    (40)
C = (S_yx - D S_xu') P^{-1}    (41)
R = (1/NT)(S_yy - C S_yx' - D S_yu')    (42)
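Written directly from the sufficient statistics, the unconstrained estimators (40)-(42) can be sketched as follows (primes in the text become transposes; the function name is ours). The accompanying check builds noiseless data, for which P reduces to Σ x_t x_t' and the estimators recover the generating matrices exactly:

```python
import numpy as np

def unconstrained_estimates(Syy, Syx, Syu, Sxu, Suu, P, NT):
    """Unconstrained estimators of Eqs. (40)-(42)."""
    Pinv = np.linalg.inv(P)
    M = Suu - Sxu.T @ Pinv @ Sxu                       # S_uu - S_xu' P^{-1} S_xu
    D = (Syu - Syx @ Pinv @ Sxu) @ np.linalg.inv(M)   # Eq. (40)
    C = (Syx - D @ Sxu.T) @ Pinv                      # Eq. (41)
    R = (Syy - C @ Syx.T - D @ Syu.T) / NT            # Eq. (42)
    return C, D, R
```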
To obtain the constrained estimators (D_cons, C_cons, and R_cons) we need to solve the following.
Constrained Minimization Problem

Minimize

-2L(C, D, R) = NT log|R| + tr(R^{-1}(S_yy - S_yx C' - S_yu D' - C S_yx' + C P C' + C S_xu D' - D S_yu' + D S_xu' C' + D S_uu D'))

subject to the constraint DF - G = 0.

Solution: We introduce the method of Lagrange multipliers to minimize the likelihood function (39) subject to the constraint DF - G = 0. Let Λ denote the array of Lagrange multipliers (λ_1, λ_2, ..., λ_n). The likelihood function and the constraints associated with it define our objective function as:

M(C, D, R, Λ) = NT log|R| + tr(R^{-1}(S_yy - S_yx C' - S_yu D' - C S_yx' + C P C' + C S_xu D' - D S_yu' + D S_xu' C' + D S_uu D')) + tr[Λ'(DF - G)]    (43)

Necessary conditions for a minimum of M(C, D, R, Λ) are that the elements of C, D, R, and Λ be chosen to give

∂M/∂C = 0,  ∂M/∂D = 0,  and  ∂M/∂Λ = constraints = 0

The third expression implies that a minimum for M is also a minimum for the likelihood function (39).
∂M/∂C = ∂/∂C tr(R^{-1}(-S_yx C' - C S_yx' + C P C' + C S_xu D' + D S_xu' C'))
      = 2R^{-1}(C_cons P + D_cons S_xu' - S_yx) = 0    (44)

∂M/∂D = ∂/∂D {tr(R^{-1}(-S_yu D' + C S_xu D' - D S_yu' + D S_xu' C' + D S_uu D')) + tr[Λ'(DF - G)]}
      = 2R^{-1}(C_cons S_xu + D_cons S_uu - S_yu) + Λ F' = 0    (45)

∂M/∂Λ = D_cons F - G = 0    (46)
From (44) and (45) we get the constrained estimators for C and D:

C_cons = (S_yx - D_cons S_xu') P^{-1}    (47)
D_cons = (S_yu - C_cons S_xu - (1/2) R_cons Λ F') S_uu^{-1}

Using the expressions (40) and (41) for the unconstrained estimators, we get the constrained D matrix

D_cons = D - (1/2) R_cons Λ F' (S_uu - S_xu' P^{-1} S_xu)^{-1}

Substituting this back into (46) and solving for Λ gives:

(1/2) R_cons Λ = (DF - G)(F'(S_uu - S_xu' P^{-1} S_xu)^{-1} F)^{-1}

Putting the expression above back into the equation for D_cons and solving, we finally obtain the constrained estimators for C and D in terms of the unconstrained ones:

D_cons = D - (DF - G)(F'(S_uu - S_xu' P^{-1} S_xu)^{-1} F)^{-1} F'(S_uu - S_xu' P^{-1} S_xu)^{-1}

C_cons = C + (DF - G)(F'(S_uu - S_xu' P^{-1} S_xu)^{-1} F)^{-1} F'(S_uu - S_xu' P^{-1} S_xu)^{-1} S_xu' P^{-1}
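The closed form for D_cons above can be checked numerically: the corrected matrix satisfies D_cons F = G exactly, for any positive-definite statistics. A sketch (function name ours):

```python
import numpy as np

def constrain_D(D, F, G, Suu, Sxu, P):
    """Project the unconstrained D onto {D : D F = G} using the
    closed-form constrained estimator derived above."""
    M = np.linalg.inv(Suu - Sxu.T @ np.linalg.inv(P) @ Sxu)
    # D_cons = D - (D F - G)(F' M F)^{-1} F' M
    return D - (D @ F - G) @ np.linalg.inv(F.T @ M @ F) @ F.T @ M
```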
Similarly, the constrained covariance matrix R_cons is obtained by differentiating with respect to R and solving. Setting ∂M/∂R = 0 gives

NT R_cons = S_yy - S_yx C_cons' - S_yu D_cons' - C_cons S_yx' + C_cons P C_cons' + C_cons S_xu D_cons' - D_cons S_yu' + D_cons S_xu' C_cons' + D_cons S_uu D_cons'    (48)
which leads to

R_cons = R + (1/NT)(C_cons S_xu + D_cons S_uu - S_yu) D_cons'    (49)
       = R - (1/NT)(1/2) R_cons Λ F' D_cons'
       = R - (1/NT)(DF - G)(F'(S_uu - S_xu' P^{-1} S_xu)^{-1} F)^{-1} G'    (50)

(using (45) in the second line and the constraint D_cons F = G in the third).
Unfortunately, these constraints cannot be implemented directly in the model used for this research: selecting the matrices F and G that zero out specific elements of D becomes difficult as the size of the matrix increases. However, by rewriting the constrained problem using the vec operator we can easily handle any matrix size.
5.5. Vec Formulation
The vec operator vectorizes a matrix by stacking its columns. That is, suppose we want to vectorize a 2x2 matrix M:

M = [ m11  m12 ]
    [ m21  m22 ] ,    vec(M) = (m11, m21, m12, m22)'

The Kronecker product of two matrices plays an important role when using the vec operator, and there are important relationships that will be used in the development of the constrained minimization problem in vec formulation.

Definition: The Kronecker product of two matrices A and B, where A is m x n and B is p x q, is defined as

A ⊗ B = [ A11 B  A12 B  ...  A1n B ]
        [ A21 B  A22 B  ...  A2n B ]
        [  ...    ...   ...   ...  ]
        [ Am1 B  Am2 B  ...  Amn B ]

which is an mp x nq matrix.
Important Operator Relationships

vec(AXB) = (B' ⊗ A) vec(X)    (51)
(AC ⊗ BD) = (A ⊗ B)(C ⊗ D)    (52)
(A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}    (53)
d(x'Ax)/dx = x'(A + A')    (54)
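Relationships (51) and (53) are easy to verify numerically, keeping in mind that vec stacks columns (column-major, i.e. Fortran order):

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one long vector (column-major order)."""
    return M.reshape(-1, order="F")

# Identity (51): vec(A X B) = (B' kron A) vec(X)
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
X = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
```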
To show the application of the vec operator in the constraint setting, let us look at the following example.

EXAMPLE: Let us consider a 2x2 matrix D and suppose we want to constrain it to be diagonal. Select the matrices F and G to be

D = [ d11  d12 ]      F = [ 0 1 0 0 ]      G = [ 0 ]
    [ d21  d22 ] ,        [ 0 0 1 0 ] ,        [ 0 ]

Then, applying the constraint F vec(D) = G, we get that the elements d21 and d12 are zero, and the matrix D becomes:

D = [ d11   0  ]
    [  0   d22 ]
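The same construction extends to any n x n matrix: one constraint row per off-diagonal entry of vec(D). A sketch (function name ours):

```python
import numpy as np

def diagonal_constraint(n):
    """Build F and G so that F vec(D) = G forces every off-diagonal
    entry of an n x n matrix D to zero (vec stacks columns)."""
    rows = []
    for j in range(n):              # column index of D
        for i in range(n):          # row index of D
            if i != j:              # one constraint row per off-diagonal
                row = np.zeros(n * n)
                row[j * n + i] = 1.0    # position of d_ij inside vec(D)
                rows.append(row)
    return np.array(rows), np.zeros(len(rows))
```

For n = 2 this reproduces exactly the F and G of the example above.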
In general, for any n x n matrix D we can find matrices F and G and solve the constrained minimization problem using the vec formulation as follows:
Constrained Minimization Problem 2

Minimize

-2L(C, D, R) = NT log|R| + tr(R^{-1}(S_yy - S_yx C' - S_yu D' - C S_yx' + C P C' + C S_xu D' - D S_yu' + D S_xu' C' + D S_uu D'))
subject to the constraint F vec(D) - G = 0.

Solution: We introduce the method of Lagrange multipliers to minimize the objective function

M(C, D, R, λ) = NT log|R| + tr(R^{-1}(S_yy - S_yx C' - S_yu D' - C S_yx' + C P C' + C S_xu D' - D S_yu' + D S_xu' C' + D S_uu D')) + λ'(F vec(D) - G)    (55)

subject to the constraint F vec(D) - G = 0, where λ = (λ_1, λ_2, ..., λ_n)' is a real-valued column vector of Lagrange multipliers.

∂M/∂C = ∂/∂C tr(R^{-1}(-S_yx C' - C S_yx' + C P C' + C S_xu D' + D S_xu' C'))
      = 2R^{-1}(C_cons P + D_cons S_xu' - S_yx) = 0    (56)

∂M/∂vec(D) = -2 vec(R_cons^{-1} S_yu) + 2 vec(R_cons^{-1} C_cons S_xu) + 2 vec(R_cons^{-1} D_cons S_uu) + F'λ = 0    (57)

∂M/∂λ = F vec(D_cons) - G = 0    (58)

∂M/∂R = 0, which gives

NT R_cons = S_yy - S_yx C_cons' - S_yu D_cons' - C_cons S_yx' + C_cons P C_cons' + C_cons S_xu D_cons' - D_cons S_yu' + D_cons S_xu' C_cons' + D_cons S_uu D_cons'    (59)
From (57) and the following expressions

vec(R_cons^{-1} D_cons S_uu) = (S_uu ⊗ R_cons^{-1}) vec(D_cons)
vec(R_cons^{-1} C_cons S_xu) = (S_xu' ⊗ R_cons^{-1}) vec(C_cons)
vec(C_cons) = vec(S_yx P^{-1}) - (P^{-1} S_xu ⊗ I) vec(D_cons)

we have that

vec(D_cons) = vec(D) - (1/2)((S_uu - S_xu' P^{-1} S_xu)^{-1} ⊗ R_cons) F'λ

We still need to work out the value of λ. Hence, substituting (57) into (58) and solving for λ gives:

λ = [(1/2) F((S_uu - S_xu' P^{-1} S_xu)^{-1} ⊗ R_cons) F']^{-1} (F vec(D) - G)    (60)

Now, putting this expression for λ back into the equation for vec(D_cons) above, we obtain

vec(D_cons) = vec(D) - V^{-1} F' [F V^{-1} F']^{-1} (F vec(D) - G)    (61)

where

V^{-1} = (S_uu - S_xu' P^{-1} S_xu)^{-1} ⊗ R_cons

Finally, from (59) we obtain the expression for R_cons implicitly, in the form R_cons = R + f(R_cons), for which we will need to iterate, reshaping the matrix D_cons at each iteration (compare (49)):

R_cons = R + (1/NT)(C_cons S_xu + D_cons S_uu - S_yu) D_cons'    (62)
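Equation (61) can be implemented directly with a Kronecker product; by construction the result satisfies F vec(D_cons) = G regardless of the particular V^{-1}, as long as F V^{-1} F' is invertible (sketch, names ours):

```python
import numpy as np

def vec_constrained_D(D, F, G, Suu, Sxu, P, Rcons):
    """Constrained estimator in vec form, Eq. (61), with
    V^{-1} = (S_uu - S_xu' P^{-1} S_xu)^{-1} kron R_cons."""
    vecD = D.reshape(-1, order="F")
    M = np.linalg.inv(Suu - Sxu.T @ np.linalg.inv(P) @ Sxu)
    Vinv = np.kron(M, Rcons)
    resid = F @ vecD - G
    vecDc = vecD - Vinv @ F.T @ np.linalg.solve(F @ Vinv @ F.T, resid)
    return vecDc.reshape(D.shape, order="F")
```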
5.6. Constraints Implementation - EM Procedure
In order to apply the EM algorithm, we require initial values of the state
and covariance as well as the parameters which are initialized using linear
regression. Then the EM procedure operates as follows:
E-step
Given the initial estimatorsx0, P0 and initial estimators ofA ,B,C ,D,Q andRuse the Kalman filter equations to compute the estimates forx+t andPt.
M-step
Re-estimate the unconstrainedA, B,C, D, Q, and R using the values for x+t
-
8/10/2019 Hernndez-Lemus & Rangel-Escareo, 2011. The Role of Information Theory in Gene Regulatory Network Inference
40/48
176 Enrique Hernandez-Lemus and Claudia Rangel-Escareno
and Pt in the formulas for a,b, c, d, e, and P (* Here, is where we add theconstraintsFvec(D) -G= 0*)
ALGORITHM:

1. Start with the unconstrained estimates of C, D, and R, Equations (40)-(42).

2. The vec expressions for the constrained C_cons and D_cons are in fact functions of R_cons, which is in turn a function of the unconstrained R and the previous R_cons, and has to be calculated by iteration. That is,

vec(D_cons) = vec(D) - V^{-1} F' [F V^{-1} F']^{-1} (F vec(D) - G)    (63)
vec(C_cons) = vec(C) + (P^{-1} S_xu ⊗ I) V^{-1} F' [F V^{-1} F']^{-1} (F vec(D) - G)    (64)

where V^{-1} = (S_uu - S_xu' P^{-1} S_xu)^{-1} ⊗ R_cons(r), as in (61), and

R_cons = R + f(R_cons), with R_cons(0) = R and
R_cons(r+1) = R + f(R_cons(r)), r = 0, 1, 2, ..., until ||R_cons(r+1) - R_cons(r)|| < tol

Hence,

R_cons = R_cons(r+1),
C_cons = C_cons(R_cons(r+1)), and D_cons = D_cons(R_cons(r+1))

3. Now, in the iteration process,

R_cons(r+1) = (1/NT)[a - C_cons(R_cons(r)) b - D_cons(R_cons(r)) c + (C_cons(R_cons(r)) d + D_cons(R_cons(r)) e - c') D_cons(R_cons(r))']

(here a, b, c, d, e denote the sufficient statistics computed in the M-step; comparing with (48), a = S_yy, b = S_yx', c = S_yu', d = S_xu and e = S_uu). So, for each iteration r we need to reshape vec(D_cons) and vec(C_cons), put them back into matrix form, and compute a new R_cons(r+1). Continue this until convergence and, once we have the final R_cons, use it one more time to find vec(D_cons) and vec(C_cons) and reshape them.
4. Then D_cons and C_cons are the matrices that go back to the E-step to be used (along with the other parameters) to find updated and more accurate estimates of x_t^+ and P_t.
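The fixed-point loop of step 2 can be sketched generically. Here f is a caller-supplied map standing in for the R_cons-dependent update derived above (in a real run it would recompute and reshape D_cons at each iteration; the test below uses a simple contraction with a known fixed point):

```python
import numpy as np

def iterate_Rcons(R, f, tol=1e-10, max_iter=500):
    """Fixed-point iteration of step 2: R_cons(0) = R and
    R_cons(r+1) = R + f(R_cons(r)) until successive iterates agree."""
    Rc = R.copy()
    for _ in range(max_iter):
        Rc_new = R + f(Rc)
        if np.linalg.norm(Rc_new - Rc) < tol:
            return Rc_new
        Rc = Rc_new
    return Rc
```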
6. Conclusions
Information theory as such is concerned with the quantification, analysis and forecasting of information processing in systems under incomplete and/or noisy data acquisition. As we discussed in this chapter, the problem of the inference and analysis of gene regulatory networks from experimental data on gene expression at a genome-wide scale is closely related to the foundational tenets of information theory. In fact, given the current biological understanding of gene regulation as an extremely complex signal-processing phenomenon, information-theoretical tools and concepts are a natural choice for the task of inference and analysis of such GRNs. We presented several instances in which information theory, either on its own or combined with probabilistic graphical models, Bayesian statistics and machine-learning techniques, has been used in the inference and assessment of GRNs.

Purely information-theoretical approaches are based on complex graph renderings (i.e. both cyclic and acyclic probabilistic models are allowed) and are able to describe the system using either continuous or discrete probability density functions. They deal with incomplete or noisy data by quantifying interactions, usually valued by means of statistical dependence measures such as mutual information and Kullback-Leibler divergences, in either a marginal or a conditional setting. The use of minimum description length as a measure of algorithmic complexity, of the data processing inequality to discriminate between direct and indirect interactions, and of Shannon's signal-processing theorems to establish thresholds or bounds of confidence, is usually supplemented with optimization based on maximum entropy (MaxEnt) techniques.

On the other hand, Bayesian/machine-learning implementations of information-theoretical models are usually based on directed acyclic graphs (DAGs); these also allow either discrete or continuous probability distribution functions