
Int. J. Data Mining and Bioinformatics, Vol. 1, No. 2, 2006 111

Copyright © 2006 Inderscience Enterprises Ltd.

Dynamic algorithm for inferring qualitative models of Gene Regulatory Networks

Yun Zheng* and Chee Keong Kwoh
Bioinformatics Research Center, School of Computer Engineering,
Nanyang Technological University, Nanyang Avenue, Singapore 639798
E-mail: [email protected]
E-mail: [email protected]
*Corresponding author

Abstract: We introduce a novel algorithm, DFL (Discrete Function Learning), for reconstructing qualitative models of Gene Regulatory Networks (GRNs) from gene expression data. We analyse its average-case complexity of O(k ⋅ N ⋅ n^2) and its data requirements. Experiments on synthetic Boolean networks show that the DFL algorithm is more efficient than current algorithms without loss of prediction performance. The results on yeast cell cycle gene expression data show that the DFL algorithm can identify biologically significant models with reasonable accuracy, reasonable sensitivity and high precision with respect to the literature evidence.

Keywords: algorithm; Gene Regulatory Networks (GRNs); qualitative models; mutual information; entropy.

Reference to this paper should be made as follows: Zheng, Y. and Kwoh, C.K. (2006) ‘Dynamic algorithm for inferring qualitative models of Gene Regulatory Networks’, Int. J. Data Mining and Bioinformatics, Vol. 1, No. 2, pp.111–137.

Biographical notes: Yun Zheng is currently pursuing his PhD at the Bioinformatics Research Center of Nanyang Technological University in Singapore. His research interests include learning theory, information theory, modelling gene regulatory networks, classification, feature selection, Bayesian networks and statistical methods for the analysis of microarray gene expression data.

Chee Keong Kwoh is an Associate Professor of Computer Engineering at Nanyang Technological University in Singapore. His research interests include applying statistical learning theory in bioinformatics research, particularly to protein-protein interaction networks; probabilistic inference and numerical expert systems, particularly in the area of probabilistic reasoning; and augmented reality systems and image processing in biomedical applications. He was the Deputy Director of the Biomedical Engineering Research Centre, NTU (1997–2003) and has been the Director of the MSc in Bioinformatics since 2002.


1 Introduction

With the availability of genome-wide gene expression data (DeRisi et al., 1997; Spellman et al., 1998), much interest has been devoted to modelling GRNs (Bolouri and Davidson, 2002; de Jong, 2002; D'haeseleer et al., 2000; Endy and Brent, 2001; Hasty et al., 2001, 2002; Smolen et al., 2000), which are assumed to be the underlying mechanisms that regulate different gene expression patterns.

Because very little data is available about the quantitative concentrations of messenger RNA molecules and the strengths of interactions between proteins and DNA, the traditional methods for simulating dynamic systems, such as ordinary differential equations, cannot easily be applied to biological systems. Therefore, qualitative models, such as the Generalised Logical Formalism (GLF) (Thomas and d'Ari, 1990; Thomas et al., 1995) and the Piecewise Linear Differential Equation (PLDE) (Glass and Kauffman, 1973; Mestl et al., 1995), have been introduced to address this problem.

However, it is not easy to build such GLF and PLDE models of GRNs. Currently, almost all GLF and PLDE models are built from the literature (Alur et al., 2001; de Jong et al., 2001, 2002; Ghosh and Tomlin, 2001; Mendoza et al., 1999; Sanchez and Thieffy, 2001; Sanchez et al., 1997). The manual extraction of knowledge from the literature clearly hampers the applicability of these models and the speed of building them. With the recent development of microarray technology, the expression levels of thousands of genes can be obtained simultaneously at discrete time points. It is therefore worthwhile to use these data to accelerate the building of qualitative models of GRNs.

Our aim is to learn qualitative models of GRNs from discretised microarray gene expression data. A qualitative model of a GRN is a set of discrete functions, which describe the regulatory relations between the genes under consideration. In our method, the expression data are assumed to be the products of these functions. We then use a reverse engineering method based on information theory to find these functions from the gene expression data.

In the identification of functional relations, it is still an open problem to develop an o(N ⋅ n^k) time algorithm for any domain (Akutsu et al., 2000). In the following sections, we introduce an algorithm called DFL with an expected complexity of O(k ⋅ N ⋅ n^2) to solve this open problem. The DFL algorithm is more efficient and versatile than current algorithms for reconstructing qualitative GRN models, such as the REVEAL algorithm (Liang et al., 1998) for reconstructing Boolean Networks (BLNs) from binary transition pairs, without loss of prediction performance. In addition, the DFL algorithm is automatic and requires no prior information about the regulatory relations between the genes under consideration.

Gene expression data are always noisy. We further introduce a method called the ε function to deal with noise in the data sets. The experimental results show that some regulatory relations that cannot be found by the DFL algorithm alone are successfully identified with the ε function method.

Some probabilistic models, like Bayesian Networks (Friedman et al., 2000; Hartemink et al., 2002; Segal et al., 2003; Friedman, 2004), Dynamic Bayesian Networks (DBNs) (Murphy and Mian, 1999; Ong et al., 2002) and Probabilistic Boolean Networks (PBNs) (Shmulevich et al., 2002), have also been proposed to model GRNs. In these models, the gene expression data sets are assumed to be generated from a joint distribution. Then, various learning algorithms are used to learn the models which encode


the joint distribution from the gene expression data sets (Heckerman et al., 1995; Heckerman, 1995; Friedman et al., 1998, 1999). In this study, we will focus on the learning of deterministic models. The relation between our method and the probabilistic models will be discussed in Section 6.

The rest of this paper is organised as follows. In the next section, we introduce the theoretical foundation of learning functional relations from data. We introduce the DFL algorithm and analyse its complexities in Section 3. We perform experiments on both synthetic data sets and yeast cell cycle gene expression data to validate the DFL algorithm in Section 4. Then, we propose a new concept called the ε function to deal with noise in the data sets in Section 5. The relation between the DFL algorithm and related work is discussed in Section 6. Finally, we summarise the work of this paper in the last section.

2 Foundation of information theory

We first introduce the notation. We use capital letters to represent random variables, such as X and Y; lower case letters to represent an instance of a random variable, such as x and y; bold capital letters, like X, to represent a set of variables; and lower case bold letters, like x, to represent an instance of X. The cardinality of X is represented by |X|. In Boolean functions, we use '+', '·' and '¬' to represent the logic OR, AND and INVERT (also called SWITCH) operations respectively.

Our approach is based on information theory, so we first introduce some of its fundamental concepts. The entropy of a random variable X is defined in terms of the probability of observing a particular value x of X as

$$H(X) = -\sum_{x} P(X = x)\,\log P(X = x).$$

Hereafter, for the purpose of simplicity, we represent P(X = x) with p(x), P(Y = y) with p(y) and so on. For two variables, the joint entropy is defined in terms of the probabilities of all possible configurations of the tuple (X, Y) as

$$H(X, Y) = -\sum_{x}\sum_{y} p(x, y)\,\log p(x, y).$$

Once we have observed X = x, the uncertainty in Y is the entropy of the posterior distribution,

$$H(Y \mid X = x) = -\sum_{y} p(y \mid x)\,\log p(y \mid x). \quad (1)$$

The average value of equation (1) over all possible values of X is the conditional entropy of Y given X (Cover and Thomas, 1991),

$$H(Y \mid X) = \sum_{x} p(x)\,H(Y \mid X = x).$$

If X becomes a set of variables X = {X1, X2, …, Xn}, the conditional entropy of Y given X is defined as


$$H(Y \mid \mathbf{X}) = -\sum_{\mathbf{x}}\sum_{y} p(\mathbf{x}, y)\,\log p(y \mid \mathbf{x}). \quad (2)$$

Obviously, there are two conditional entropies which capture the relationships between H(X) and H(Y): H(X|Y) and H(Y|X). As illustrated in Figure 1, these are related by equation (3) (Shannon and Weaver, 1963):

$$H(X, Y) = H(Y \mid X) + H(X) = H(X \mid Y) + H(Y). \quad (3)$$

Figure 1 The relationship of entropy and mutual information. The circles represent the entropy of the variables. The intersection between the circles stands for the mutual information between the variables: (a) the normal case and (b) when Y = f(X)


In words, the uncertainty of X and the remaining uncertainty of Y given knowledge of X, i.e., the information contained in Y that is not shared with X, sum to the entropy of the combination of X and Y. We can now find an expression for the shared or ‘mutual information’, I(X; Y), also referred to as ‘rate of transmission’ between an input-output pair (Shannon and Weaver, 1963):

I(X; Y) = H(Y) – H(Y|X) = H(X) – H(X|Y). (4)

The shared information between X and Y corresponds to the remaining information of X if we remove the information of X that is not shared with Y. In other words, mutual information is the measure of the amount of information that one random variable contains about another random variable. From equations (3) and (4), the mutual information can also be represented as

I(X; Y) = H(X) + H(Y) – H(X, Y). (5)

Similar to equation (2), the mutual information between a vector X and Y is defined as

$$I(\mathbf{X}; Y) = H(Y) - H(Y \mid \mathbf{X}) = H(\mathbf{X}) - H(\mathbf{X} \mid Y) = H(\mathbf{X}) + H(Y) - H(\mathbf{X}, Y) = \sum_{\mathbf{x}}\sum_{y} p(\mathbf{x}, y)\,\log\frac{p(\mathbf{x}, y)}{p(\mathbf{x})\,p(y)}. \quad (6)$$

Next, we introduce the following theorem, which is the theoretical foundation of our algorithm. The proof of this theorem is given in Appendix A.

Theorem 2.1: If the mutual information between X and Y is equal to the entropy of Y, i.e., I(X; Y) = H(Y), then Y is a function of X.


The entropy H(Y) represents the diversity of the variable Y. The mutual information I(X; Y) represents the relation between the vector X and Y. From this point of view, Theorem 2.1 actually says that the relation between X and Y is so strong that there is no remaining diversity in Y once X is known. In other words, the value of X fully determines the value of Y.

More intuitively, as shown in Figure 1(b), if the mutual information between a set of variables X and another variable Y, I(X; Y), is equal to the entropy of Y, H(Y), then Y is fully determined by X, i.e., X gives all the information needed to decide the state of Y. That is to say, Y is a function of X.
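To make these quantities concrete, the following sketch (our illustration, not the authors' implementation; all names are hypothetical) estimates H and I from paired discrete samples and checks the condition of Theorem 2.1 on a small deterministic example.

```python
# Illustrative sketch: empirical entropy and mutual information for discrete
# samples, and a check of Theorem 2.1 for a deterministic relation Y = f(X).
from collections import Counter
from math import log2

def entropy(samples):
    """Empirical entropy H of a sequence of hashable values."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Y = X1 AND X2 is a function of X = (X1, X2), so I(X; Y) equals H(Y)
xs = [(a, b) for a in (0, 1) for b in (0, 1)] * 4   # uniform over the inputs
ys = [a & b for (a, b) in xs]
assert abs(mutual_information(xs, ys) - entropy(ys)) < 1e-12
```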

3 Methods

In this section, we begin with a formal definition of the problem of reconstructing qualitative GRN models from state transition pairs. Then, we discuss how much data is sufficient to solve this problem. Next, we introduce the DFL algorithm to solve it. Finally, we analyse the complexities of the DFL algorithm.

3.1 Problem definition

In qualitative models of GRNs, the genes are represented by a set of discrete variables, V = {X1, …, Xn}. In GRNs, the expression level of a gene X at time step t + 1 is controlled by the expression levels of its regulatory genes, which encode the regulators of the gene X, at time step t. Hence, in qualitative models of GRNs, the genes at the same time step are assumed to be independent of each other, which is a standard assumption in learning GRNs under qualitative models, as assumed by Liang et al. (1998), Akutsu et al. (1999, 2000, 2003), Lahdesmaki et al. (2003) and Zheng and Kwoh (2004). Formally, ∀ 1 ≤ i, j ≤ n, Xi(t) and Xj(t) are independent. The regulatory relationships between the genes are expressed by a discrete function assigned to each variable. Formally, a GRN G(V, F) with indegree k (the number of inputs) consists of a set V = {X1, …, Xn} of nodes representing genes and a set F = {f1, …, fn} of discrete functions, where the discrete function fi(Xi1, …, Xik), with inputs from specified nodes Xi1, …, Xik at time step t, is assigned to the node Xi at time step t + 1, as shown in the following equation

$$X_i(t + 1) = f_i(X_{i1}(t), \ldots, X_{ik}(t)), \quad (7)$$

where 1 ≤ i ≤ n. The state of the GRN is expressed by the state vector of its nodes. We use v(t) = (x1, …, xn) to represent the state of the GRN at time t, and v(t + 1) = (x′1, …, x′n) to represent the state of the GRN at time t + 1, where x′1, …, x′n are calculated from x1, …, xn with equation (7). A state transition pair is v(t) → v(t + 1). The inputs of fi are called the parent nodes of X′i, and are represented with Pa(X′i) = {Xi1, …, Xik}.

When using BLNs to model GRNs, genes are represented by binary variables with two values, ON (1) and OFF (0), meaning that a gene is turned on or off respectively. In addition, the fi's in equation (7) are Boolean functions. When using GLFs and PLDEs to simulate GRNs, the fi's in equation (7) are multi-value discrete functions.

The problem of inferring the qualitative model of the GRN from input-output transition pairs (time series of gene expression) is defined as follows.


Definition 3.1: Let V = {X1, …, Xn}. Given a transition table T = {v(t) → v(t + 1)}, where v(t) is the state vector of the GRN model at time t, find a set of discrete functions F = {f1, f2, …, fn}, so that Xi(t + 1) (X′i hereafter) is calculated from fi as follows

$$X_i(t + 1) = f_i(X_{i1}(t), \ldots, X_{ik}(t)), \quad (8)$$

where t goes from 1 to a limited constant N. If the functions in F are Boolean, then the GRN model is a BLN; otherwise the GRN model is a GLF or PLDE model.

3.2 Data quantity

We discuss how much data is necessary to successfully infer F in this section. Akutsu et al. (1999) proved that Ω(2^k + k log_2 n) transition pairs are the theoretical lower bound to infer BLNs, where n is the number of genes and k is the maximum indegree of these genes.

Theorem 3.1 (Akutsu et al., 1999): Ω(2^k + k log_2 n) transition pairs are necessary in the worst case to identify the Boolean network of maximum indegree ≤ k.

To meet the requirement of multi-state discrete functions in GLF and PLDE models, we introduce the following theorem, which is a generalisation of Theorem 3.1. The proof of the theorem is also given in Appendix A.

Theorem 3.2: Ω(b^k + k log_b n) transition pairs are necessary in the worst case to identify the qualitative GRN models of maximum indegree ≤ k and maximum number of discrete levels per variable ≤ b.

Hereafter, we use 'b (base)' to denote the number of discrete levels of the variables. In the DFL algorithm, we introduce a coefficient c to determine the actual size of the synthetic data sets as follows,

$$N = c \times (b^k + k \log_b n). \quad (9)$$

That is to say, the parameter t in Definition 3.1 goes from 1 to N.
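As a worked example of equation (9), the following sketch (our naming, not the paper's code) computes the data set size:

```python
# Sketch of equation (9): the number of transition pairs used for a
# synthetic data set, N = c * (b^k + k * log_b(n)).
from math import ceil, log

def sample_size(c, b, k, n):
    return ceil(c * (b ** k + k * log(n, b)))

# The setting of Section 4.3.1: binary networks (b = 2), k = 3, c = 3
print(sample_size(c=3, b=2, k=3, n=100))   # 3 * (8 + 3 * log2(100)) -> 84
```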

3.3 Search method

From Theorem 2.1, the problem in Definition 3.1 is converted to finding, for each gene Xi in the GRN, a set of input genes whose mutual information with X′i is equal to the entropy of X′i.

For n discrete variables V = {X1, …, Xn}, there are 2^n subsets in total. Clearly, it is intractable to examine all subsets of V exhaustively. However, in GRNs, each gene is estimated to interact with four to eight other genes on average (Arnone and Davidson, 1997). Therefore, by restricting the indegree of a gene to a limited integer k, the problem can be solved in polynomial time. Even with this compromise, the problem remains very difficult. As mentioned before, it is still an open problem to develop an o(N ⋅ n^k) time algorithm for identifying functional relations in any domain (Akutsu et al., 2000). In the following, we introduce an algorithm called DFL with


the expected complexity of O(k ⋅ N ⋅ n^2) to solve this open problem. The main steps of the DFL algorithm are listed in Figure 2. In the DFL algorithm, we use the following definition of ∆ supersets.

Figure 2 The DFL algorithm

*The Sub() is a subroutine listed in Figure 5.

Definition 3.2: Let X be a subset of V = {X1, …, Xn}. The ∆i(X) of X are the supersets of X such that X ⊂ ∆i(X) and |∆i(X)| = |X| + i, where |X| denotes the cardinality of X.

To clarify the searching method of the DFL algorithm, let us consider a BLN consisting of four genes, as shown in Figure 3. In this example, the function of each gene is listed in Table 1. The set of all genes is V = {A, B, C, D}, and we use X to denote subsets of V.

Figure 3 The wiring diagram of a BLN model, where n = 4 and kmax = 4. X′ denotes the state of X in the next time step

Table 1 Boolean functions of the example, where ‘+’ is the logical OR operation, and ‘•’ is the logical AND operation

Gene   Rule
A      A′ = B
B      B′ = A + C
C      C′ = (B ⋅ C) + (C ⋅ D) + (B ⋅ D)
D      D′ = (A ⋅ B) + (C ⋅ D)
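For concreteness, a minimal sketch (our encoding, not the paper's software) of the network in Table 1, performing one synchronous update v(t) → v(t + 1):

```python
# The four-gene Boolean network of Table 1; '|' is OR, '&' is AND.
def step(a, b, c, d):
    """One synchronous update of the example BLN."""
    a_next = b
    b_next = a | c
    c_next = (b & c) | (c & d) | (b & d)
    d_next = (a & b) | (c & d)
    return a_next, b_next, c_next, d_next

print(step(1, 1, 0, 0))   # v(t) = (1,1,0,0) -> v(t+1) = (1,1,0,1)
```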

One of the commonly used algorithms to infer BLNs from data is the REVEAL algorithm (Liang et al., 1998). As shown in Figure 4, the REVEAL algorithm uses an exhaustive search method: it first searches the subsets with only one gene, then the subsets with two genes, and so on. If the REVEAL algorithm finds a subset X that satisfies


I(X; Y) = H(Y), it stops its search and builds the model for this gene with X. Compared with the REVEAL algorithm, the DFL algorithm uses a better method for finding the target combination. Firstly, the DFL algorithm searches the first layer, then it sorts all subsets on the first layer according to their mutual information with D′. It finds that {A} shares the largest mutual information with D′ among the subsets on the first layer. Then, the DFL algorithm searches through ∆1({A}) = {{A, B}, {A, C}, {A, D}}. Next, it searches ∆2({A}), …, ∆k–1({A}); however, it always decides the search order of ∆i+1 based on the calculation results for ∆i.

Figure 4 Search procedures of the DFL algorithm and the REVEAL algorithm when finding the Boolean function of D′ in Figure 3. The solid line is for the DFL algorithm, the dashed line for the REVEAL algorithm. The combinations with a black dot under them are the subsets which share the largest mutual information with D′ on their layers. The REVEAL algorithm first searches the first layer (the subsets with one gene), then the second layer, and so on. Finally, it finds the target subset {A, B, C, D} at the fourth layer. The DFL algorithm uses a different heuristic. Firstly, it searches the first layer, then finds that {A}, with a black dot under it, shares the largest mutual information with D′ among the subsets on the first layer. Then, it continues to search ∆1({A}) on the second layer. These calculations continue until the target combination {A, B, C, D} is found on the fourth layer

From the Chain Rule of mutual information (Cover and Thomas, 1991, p.41),

$$I(X'_i; X_j, \mathrm{Pa}(X'_i)) = I(X'_i; X_j \mid \mathrm{Pa}(X'_i)) + I(X'_i; \mathrm{Pa}(X'_i)),$$

where Pa(X′i) is the set of parent nodes selected so far and Xj ∈ V \ Pa(X′i). Since I(X′i; Pa(X′i)) does not change when trying different Xj, maximising I(X′i; Xj, Pa(X′i)) is actually maximising I(X′i; Xj | Pa(X′i)), i.e., the information about X′i that is not carried by Pa(X′i) but is carried by the new variable Xj. This means that in each layer the DFL algorithm adds the best new variable to Pa(X′i).

While checking these subsets, the DFL algorithm tests whether I(Pa(X′i); X′i) = H(X′i). After finding a target subset Pa(X′i) which satisfies I(Pa(X′i); X′i) = H(X′i), the DFL algorithm removes irrelevant variables and duplicate rows of Pa(X′i) → X′i to build the model fi for X′i. At the same time, the count values of the different instances of Pa(X′i) → X′i are stored. For this example, the


DFL algorithm finds that {A, B, C, D} satisfies I(A, B, C, D; D′) = H(D′), and then builds the model f for D′ with {A, B, C, D}.

If the DFL algorithm cannot find a target subset Pa(D′) satisfying I(Pa(D′); D′) = H(D′) in ∆2({A}), …, ∆k–1({A}), then it continues to search ∆2({B}), …, ∆k–1({B}), ∆2({C}), …, ∆k–1({C}), ∆2({D}), …, ∆k–1({D}) in line 8 of Figure 5. Therefore, the DFL algorithm checks all subsets with ≤ k variables in the worst case.
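The following sketch captures our reading of this greedy search, reusing the entropy() and mutual_information() helpers from the sketch in Section 2. It is deliberately simplified (it follows only the best branch and omits the fallback search over the other branches and the ε function); dfl_greedy and its arguments are our names, not the paper's.

```python
def dfl_greedy(columns, target, k):
    """columns: dict gene -> list of states at time t;
    target: list of states of X' at time t + 1;
    returns a parent set satisfying Theorem 2.1, or None."""
    h = entropy(target)
    parents, pool = [], set(columns)

    def joint(genes):
        return [tuple(columns[g][i] for g in genes)
                for i in range(len(target))]

    for _ in range(k):
        # the Delta-1 step: add the gene maximising I(parents + {g}; X'),
        # equivalently I(X'; g | parents) by the Chain Rule above
        best = max(pool, key=lambda g: mutual_information(
            joint(parents + [g]), target))
        parents.append(best)
        pool.discard(best)
        if abs(mutual_information(joint(parents), target) - h) < 1e-12:
            return parents            # I(Pa(X'); X') = H(X')
    return None                       # no subset of size <= k on this branch
```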

Figure 5 The subroutine of the DFL algorithm

*By deleting unrelated variables and duplicate rows in T.

3.4 Analysis of complexity

There are two computationally expensive factors in the DFL algorithm. The first is the number of subsets. The DFL algorithm guarantees the check of every subset whose cardinality is not larger than k. There are

$$\sum_{i=1}^{k} \binom{n}{i} \approx n^k$$

subsets whose cardinalities are smaller than or equal to k. The second is the computation of I(X; Y). The DFL algorithm uses equation (6) to compute I(X; Y). It takes O(b^k + k log_b n) steps to compute I(X; Y), since the length N of the transition table T is O(b^k + k log_b n) by Theorem 3.2 and two scans of T are sufficient to compute H(X) and H(X, Y). Since there are n genes in the network, the complexity of the DFL algorithm is O((b^k + k log_b n) n^(k+1)) in the worst case. Owing to the sort step in line 7 of the subroutine, the algorithm makes the best choice in the current layer of subsets. There are (n – 1) ∆1 supersets for a given single-element subset, (n – 2) ∆1 supersets for a given two-element subset, and so on, so the DFL algorithm only considers

$$\sum_{i=0}^{k-1} (n - i) \approx kn$$

subsets in the best case. Thus, the


expected time complexity of the DFL algorithm is approximately O(k ⋅ n^2 ⋅ (N + log n)), where log n accounts for the sort step in line 7 of Figure 5 and N for the length of the input table T. Expanding N and ignoring the minor terms, the time complexity of the DFL algorithm becomes O((k b^k + k^2 log_b n) n^2). The expected complexity of the DFL algorithm thus depends on the three parameters k, b and n. The complexity O((k b^k + k^2 log_b n) n^2) grows quasi-squarely with n when k, b ≪ n, polynomially with b, and exponentially with k.

To store the information needed in the search process, the DFL algorithm uses two data structures. The first is a linked list, which stores the state table of every gene during the calculation process. Every gene has two sequences representing its states at the current and next time steps respectively. According to equation (9), the space complexity of this first data structure is O((b^k + k log_b n) n).

The second is a two-dimensional linked list of length k called the ∆Tree, in which each node of the first dimension is itself a linked list. This data structure is used to store the ∆ supersets during the calculation procedures. More precisely, the first node of this data structure is used to store the single-element subsets. If the DFL algorithm is processing Xi and its ∆ supersets, the second to the kth nodes are used to store the ∆1 to ∆k–1 supersets of {Xi}. If there are n genes, there are

$$\sum_{i=0}^{k-1} (n - i) \approx kn$$

subsets in the ∆Tree. Storing the ∆Tree takes O(kn) space, since only the indexes of the genes are stored for each subset. Therefore, the total space complexity of the DFL algorithm is O((b^k + k log_b n) n).

4 Results

In this section, we first introduce the synthetic data sets of BLN models that we use. Then, we discuss the evaluation criteria for the learning of qualitative models of GRNs. Next, we discuss two kinds of experiments on BLNs that validate the efficiency of the DFL algorithm in comparison with the REVEAL algorithm. After that, we conduct experiments on synthetic data sets to show the sensitivity of the DFL algorithm, and perform experiments on a synthetic data set of a GLF model. Finally, we run experiments on the yeast cell cycle gene expression profile of Cho et al. (1998).

We implement the DFL algorithm and the REVEAL algorithm in the Java language, version 1.4.1. The implementations are included in our software called Discrete Function Learner. The software and the data sets are available at the supplementary website of this paper. We perform our experiments on an HP AlphaServer SC computer, with one EV68 1 GHz CPU and 1 GB memory, running the Tru64 Unix operating system.

4.1 Synthetic data sets of BLNs

We present the synthetic data sets of BLNs in this section. For a BLN consisting of n genes, the total state space is 2^n. The v of a transition pair is randomly chosen from the 2^n possible instances of V with the discrete uniform distribution, i.e., p(i) = 1/2^n, where i is randomly chosen from 0 to 2^n – 1 inclusive. Since the DFL algorithm examines the subsets in the kth layer of the ∆Tree in lexicographic order, its run time may be affected by the position of the target subsets in that layer. Therefore, we select the first k and the last k variables in V as the inputs for all X′i. The data sets generated from the first k and the last k variables are named 'head' and 'tail' data sets respectively. There are 2^(2^k) different Boolean functions when the indegree is k. We then use the OR function (OR), or one Boolean function randomly selected from the 2^(2^k) possible functions (RANDOM), to generate v′, with f1 = f2 = ⋯ = fn. If a data set is generated by the OR function defined on the first k variables, we name it an OR-h data set, and so on.
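A sketch of this generation process under our own naming (not the released Discrete Function Learner code); it produces an OR-h data set with uniformly drawn input states:

```python
# Generate N transition pairs for an OR-h data set: v(t) drawn uniformly at
# random, every output computed as the OR of the first k genes.
import random

def or_head_dataset(n, k, N, seed=0):
    rng = random.Random(seed)
    pairs = []
    for _ in range(N):
        v = [rng.randint(0, 1) for _ in range(n)]
        v_next = [int(any(v[:k]))] * n    # f_1 = f_2 = ... = f_n
        pairs.append((v, v_next))
    return pairs
```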

4.2 Evaluation measures

The accuracy, sensitivity and precision are defined in equations (10)–(12). The TP, FP, TN and FN in these equations are defined in Table 2. In the learning of GRN models, TP and FP are the numbers of true and false edges of the original networks, respectively, that are predicted as edges (positive regulatory relations). Similarly, TN and FN are the numbers of non-edges and edges of the original networks, respectively, that are predicted as non-edges (negative regulatory relations).

$$\text{Accuracy} = \frac{\text{No. of correct predictions}}{\text{No. of predictions}} = \frac{TP + TN}{TP + FP + TN + FN} \quad (10)$$

$$\text{Sensitivity} = \frac{\text{No. of correct positive predictions}}{\text{No. of positives}} = \frac{TP}{TP + FN} \quad (11)$$

$$\text{Precision} = \frac{\text{No. of correct positive predictions}}{\text{No. of positive predictions}} = \frac{TP}{TP + FP}. \quad (12)$$

Table 2 The prediction measures

              Predicted as positive    Predicted as negative
Positive      TP                       FN
Negative      FP                       TN
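A small helper illustrating equations (10)–(12) (ours, for illustration only):

```python
# Accuracy, sensitivity and precision from the confusion counts of Table 2.
def edge_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)      # equation (10)
    sensitivity = tp / (tp + fn)                    # equation (11)
    precision = tp / (tp + fp)                      # equation (12)
    return accuracy, sensitivity, precision
```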

4.3 Experiments for time complexity

In this section, we use OR data sets to examine the complexity of the DFL algorithm.


4.3.1 Experiments when k is fixed, n increases

In this section, the indegree of each gene k is fixed to 3, and the number of transition pairs is calculated with equation (9) where c is 3. The number of genes goes from 20 to 100.

The experimental results are shown in Figure 6(a), where each time is the average over 5 experiments on 'head' data sets and 5 experiments on 'tail' data sets. The run time values are plotted on a logarithmic scale. In all experiments of this kind, both the DFL algorithm and the REVEAL algorithm find the original BLNs correctly. However, the DFL algorithm is significantly faster than the REVEAL algorithm, as shown in Figure 6(a).

Figure 6 The run time of the DFL algorithm and the REVEAL algorithm. (a) When n increases alone and (b) when k increases alone


4.3.2 Experiments when n is fixed, k increases

In this section, the number of genes n is fixed to 20, and k is increased from 2 to 6. Similar to the results of the prior section, both the DFL algorithm and the REVEAL algorithm find the original BLNs correctly. However, the search times are again significantly different, as shown in Figure 6(b), where each time is the average over five experiments on 'head' data sets and five experiments on 'tail' data sets. The run time values are again plotted on a logarithmic scale.

As the time complexity O((k b^k + k^2 log_b n) n^2) of the DFL algorithm indicates, the run time grows exponentially with k. In Figure 6(b), the run time of the DFL algorithm grows approximately linearly in logarithmic coordinates, which means exponential growth in ordinary coordinates.

Also, as indicated by Figure 6(b), the run times of the DFL algorithm are significantly smaller than those of the REVEAL algorithm in all cases.

4.3.3 Experiments for large n

In this section, we do experiments in which the number of genes n goes from 1000 to 6000, the upper end being approximately the number of genes in a yeast genome. As discussed in Section 3, each gene is estimated to interact with four to eight other genes on average (Arnone and Davidson, 1997). Another kind of experiment on k is also done, where n is fixed to 1000 and k goes from 2 to 8. Here, we do not run the REVEAL algorithm, which is already computationally infeasible at this scale.


Again, the DFL algorithm correctly identifies the original BLNs in all experiments. From Figure 7(a), we see that the run time of DFL grows quasi-squarely with n. In Figure 7(b), the run time of the DFL algorithm grows approximately linearly with k in logarithmic coordinates, which means exponential growth in ordinary coordinates.

Figure 7 The run time of the DFL algorithm. (a) When n increases alone, where k = 3, c = 4 and (b) when k increases alone, where n = 1000


4.4 Experiments of sensitivity

In this section, we do experiments to show the sensitivity of the DFL algorithm with the OR data sets. We vary the number of learning instances (N) in these experiments, since when the data are insufficient the algorithm may fail to identify the original BLNs. The results are shown in Figure 8, where we run experiments for BLNs of n = 50 and n = 100, and each point is the average over five 'head' and five 'tail' data sets.

Figure 8 The sensitivity of the DFL algorithm. The horizontal axis, N (the number of learning instances), is log-scaled. The rightmost points of the two curves correspond to the numbers of learning instances obtained with c = 3 in equation (9)


Sensitivity measures the percentage of correct positive predictions by the DFL algorithm. Figure 8 shows that the sensitivity of the DFL algorithm grows linearly with the logarithm of the number of learning instances. The sensitivity reaches one and stops increasing once the number of learning instances exceeds a certain value. That means that, given enough data, the DFL algorithm can correctly identify the original BLNs.

4.5 Experiment on data of a GLF model

In this section, we use the DFL algorithm to find a GLF model discussed in Thieffry and Thomas (1998) and shown in Figure 9.

Figure 9 A simple GLF model of a GRN. The genes are represented by circles. The directed edges represent the regulatory relations between genes. The '+' or '–' beside an edge indicates that the regulatory relation is activation or repression respectively

Thieffry and Thomas (1998) provided the transition table of the model, which consists of 18 lines. We use this transition table as the input table T of the DFL algorithm. The DFL algorithm correctly finds that A′ = f1(A, B, C), B′ = f2(A, C) and C′ = f3(A, B, C). The learned fi's are truth tables, since we still lack tools for simplifying multi-value discrete functions analogous to the Karnaugh maps for Boolean functions. In addition, the activation or repression relations in the graph can be obtained by analysing the correlation coefficients between genes (Bar-Joseph et al., 2003). The correlation coefficient matrix of the example in Figure 9 is listed in Table 3.

Table 3 The correlation coefficient matrix of the GLF example in Figure 9

        A′      B′      C′
A       0.2     0.6    –0.7
B       0.3     0       0.3
C      –0.7     0.6     0.2

In the correlation coefficient matrix, positive, negative and zero values indicate activation, repression and no direct interaction respectively. In our example, the 0.2 in the first column of the first row of Table 3 means that A activates A′, and so on. We see that the activation and repression relations in Figure 9 are correctly identified by the correlation coefficient matrix in Table 3.
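A sketch of this sign analysis, assuming numpy and a (time × genes) expression matrix; the function name and layout are our assumptions, not the paper's:

```python
# Correlate each gene's series at time t with each gene's series at t + 1;
# positive entries suggest activation, negative entries repression.
import numpy as np

def lagged_correlation(expr):
    """expr: array of shape (T, n); entry (i, j) of the result is
    corr(X_i(t), X_j(t + 1))."""
    x_t, x_next = expr[:-1], expr[1:]
    n = expr.shape[1]
    return np.array([[np.corrcoef(x_t[:, i], x_next[:, j])[0, 1]
                      for j in range(n)] for i in range(n)])
```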


4.6 Experiments on yeast gene expression data

In this section, we use the gene expression data of the yeast Saccharomyces cerevisiae cell cycle from Cho et al. (1998), which covers approximately two full cell cycles. Lee et al. (2002) reported a GRN related to the cell cycle of yeast. The GRN consists of 11 well-known yeast cell cycle regulators: Mbp1, Swi4, Swi6, Mcm1, Fkh1, Fkh2, Ndd1, Swi5, Ace2, Skn7 and Stb1. Mcm1, Swi5, Ace2 and Stb1 are relatively loosely related to the other genes, so we only consider the remaining seven genes. We discretise the data set of Cho et al. (1998) to three and four levels with the equal-width binning unsupervised discretisation method (Dougherty et al., 1995), then rearrange these expression values into state-transition pairs such that the expression values at the current time step are the products of the expression values at the prior time step. Finally, we apply the DFL algorithm to the obtained transition table. The learned models are shown in Figure 10.
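A sketch of this preprocessing (our reconstruction of the described pipeline, not the authors' code):

```python
# Equal-width binning to b levels, then rearrangement of the discretised
# profiles into state-transition pairs.
import numpy as np

def equal_width_discretise(series, b):
    """Map a 1-D expression profile to integer levels 0 .. b - 1."""
    edges = np.linspace(series.min(), series.max(), b + 1)[1:-1]
    return np.digitize(series, edges)

def to_transition_pairs(levels):
    """levels: (T, n) discretised matrix -> list of (v(t), v(t + 1))."""
    return [(tuple(levels[t]), tuple(levels[t + 1]))
            for t in range(len(levels) - 1)]
```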

Figure 10 The learned GRN model. (a) The number of discrete levels for gene expression value is 3 and the indegree of the GRN is set to 5 and (b) idem, where the base for gene expression value is 4. The regulators are represented by ovals. The directed edge from Gene A to Gene B means that Gene A is a regulator of Gene B. The solid edges represent regulatory relations that have been verified by other approaches. The dashed edges represent regulatory relations that have not been verified


The DFL algorithm is automatic and requires no prior knowledge of the regulatory relations between the genes under consideration. It is also quite efficient, needing less than 0.2 seconds for all the experiments performed.

The literature evidence for the regulatory relations represented in Figure 10 is shown in Table 4. For instance, Swi4 transcription is regulated in late G1 by both SBF (Swi4/Swi6) and MBF (Mbp1/Swi6) (Simon et al., 2001). These regulatory relations are identified in Figure 10(a) and (b) respectively.

From Table 4, we obtain the accuracy, sensitivity and precision of the DFL algorithm, and tabulate them in Table 5. Table 5 shows that approximately 83% of the regulatory relations which have literature evidence are found by the DFL algorithm when we combine the results of Figures 10 and 13. It also shows that the precision of the DFL algorithm is quite high regardless of the base used for the expression values: over 90% of the regulatory relations found by the DFL algorithm are biologically significant. That means it is quite probable that the regulatory relations found by the DFL algorithm are biologically meaningful.


Table 4 The literature evidence for the GRN model in Figures 10 and 13

                      Gene
Regulator (protein)   M1    S4    S6    F1    F2    N1    S7
MBP1                  *3    *     *34   –     *34   *34   –
SWI4                  *34   *3    *34   –     *3    *34   *
SWI6                  *4    *     *34   3     *34   *     *4
FKH1                  *4    *3    *4    *     *34   *34   3
FKH2                  *4    *34   *3    *3    *34   *4    *3
NDD1                  *34   *     *34   *     *34   *4    *3
SKN7                  34    *3    *34   –     *3    *34   *34

'*' marks regulatory relations that have been verified in the literature (Lee et al., 2002; Simon et al., 2001). For example, the '*' in the first cell of the first row means that Mbp1 autoregulates the MBP1 gene (Lee et al., 2002). '3' and '4' mark regulatory relations found by the DFL algorithm when the bases for expression values are 3 and 4 respectively. M1, S4, S6, F1, F2, N1 and S7 stand for Mbp1, Swi4, Swi6, Fkh1, Fkh2, Ndd1 and Skn7 respectively.

Table 5 The accuracy, sensitivity and precision (%) of the DFL algorithm and the K2 algorithm

                 Accuracy    Sensitivity    Precision
DFL (b = 3)      65          67             90
DFL (b = 4)      63          60             96
DFL (combined)   80          83             92
K2 (b = 3)       27          17             88
K2 (b = 4)       22          12             83
K2 (combined)    33          24             91

To compare with another commonly used model, Bayesian networks, we apply the K2 algorithm (Cooper and Herskovits, 1992) to the same data sets. We then calculate the accuracy, sensitivity and precision of the K2 algorithm with respect to the literature evidence, and also list them in Table 5. Table 5 shows that the measures of the K2 algorithm are substantially lower than those of the DFL algorithm. Another important point is that autoregulations cannot be represented by Bayesian networks, because their structures are Directed Acyclic Graphs (DAGs) (Pearl, 1988). Therefore, autoregulations are missed by design, whatever algorithm for learning Bayesian networks is used. However, autoregulations are very common in GRNs, as shown in Table 4, in which the diagonal from the upper-left to the lower-right corner is fully occupied with autoregulation evidence.

Some regulatory relations which have literature evidence are not found by the DFL algorithm, as shown in Table 4. This is also reflected in the sensitivity values in Table 5. There are mainly two reasons for this discrepancy. First, the size of the data set is small. Second, there is noise in the gene expression data. It is reasonable to expect that the model obtained by the DFL algorithm will improve when the input data are larger and more precise. Further, there are also some regulatory relations (represented by dashed edges) yet to be verified. When we calculate the measures in Table 5, we count these relations as false positives.


5 The ε function method

In this section, we introduce the concept of the ε function to address the issues caused by noise in the data sets.

5.1 The definition of the ε function method

When there is noise in the data sets, the requirement of Theorem 2.1 cannot be satisfied strictly. In these cases, we can relax the requirement to obtain a best estimate. As shown in Figure 11, by defining a significance factor ε, if the difference between I(X; Y) and H(Y) is less than ε × H(Y), then the DFL algorithm stops the search and builds the function for Y with X at the significance level ε.

Figure 11 The Venn diagram of H(X), H(Y) and I(X; Y) when Y = f(X). (a) The noiseless case, where the mutual information between X and Y equals the entropy of Y. (b) The noisy case, where the entropy of Y is not strictly equal to the mutual information between X and Y. The shaded region results from the noise. The ε function method means that if the area of the shaded region is smaller than or equal to ε × H(Y), then the DFL algorithm stops the searching process and builds the function for Y with X


Because H(Y) may be quite different for different problems, it is not appropriate to use an absolute value to decide whether to stop the searching process. Therefore, we use the relative value ε × H(Y) as the stopping criterion.

Formally, we define the ε function as follows.

Definition 5.1: If H(Y) – I(X; Y) ≤ ε × H(Y), then Y = fε(X), where ε is a significance factor.

Correspondingly, line 4 of Figure 5 is modified accordingly. The ε function is also useful for avoiding the worst-case complexity of the DFL algorithm. When data sets are noisy, it is impossible to satisfy the condition I(X; Y) = H(Y) required by Theorem 2.1. However, the major diversity of Y is still governed by its inputs X, so it is reasonable to expect I(X; Y) to remain close to H(Y). Hence, if we have found a subset X with H(Y) – I(X; Y) ≤ ε × H(Y), it is reasonable to stop the search of the DFL algorithm and to build models with X. Consequently, the worst-case complexity of the DFL algorithm is avoided.
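The modified stopping test amounts to the following check (a one-line sketch in our notation):

```python
# Relaxed stopping criterion of Definition 5.1, replacing the exact test
# I(X; Y) = H(Y) in line 4 of the subroutine of Figure 5.
def epsilon_satisfied(i_xy, h_y, eps):
    """True if H(Y) - I(X; Y) <= eps * H(Y)."""
    return h_y - i_xy <= eps * h_y
```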

5.2 Selection of ε value

The choice of ε value should be decided based on the noise level of the data sets. For a given noisy data set, the missing part of H(Y), as demonstrated in Figure 11, is


determined, i.e., there exists a specific minimum ε value, εm, with which the DFL algorithm can find the original BLN models. If the ε value is smaller than εm, the DFL algorithm will not find the original BLNs. Here, we introduce two methods to find εm efficiently.

First, εm can be found automatically by a restricted learning process. To find the minimum ε efficiently, we restrict the maximum number of subsets to be checked for each X′i to k × n. The expected range of ε is specified a priori. If the DFL algorithm cannot find a BLN for a noisy data set with the specified minimum ε value, then ε is increased in steps of 0.01. The restricted learning is repeated until the DFL algorithm finds a BLN at some threshold value of ε, which is εm. Since only k × n subsets are checked for each X′i in the restricted learning process, the time to find εm is O(k ⋅ n^2).
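A sketch of this restricted search (find_bln stands in for a restricted run of the DFL algorithm and is hypothetical):

```python
# Raise eps in steps of 0.01 until the restricted DFL run finds a network;
# the first successful eps is eps_m.
def find_epsilon_m(find_bln, eps_start=0.0, step=0.01):
    eps = eps_start
    while eps < 1.0:
        if find_bln(eps) is not None:   # restricted learning succeeded
            return eps                  # this is eps_m
        eps = round(eps + step, 2)
    return None
```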

Second, εm can also be found with a manual binary search. Since ε ∈ [0, 1), ε is set to 0.5 in the first try. If the DFL algorithm finds a BLN with ε = 0.5, then ε is set to 0.25 in the second try. Otherwise, if the DFL algorithm cannot find a BLN within a long time, say 10 minutes, it is stopped and ε is set to 0.75 in the second try. The selection process is carried out until an εm value is found such that the DFL algorithm can find a BLN model with ε = εm but cannot with ε = εm – 0.01. This selection process is also efficient: since ε ∈ [0, 1), only five to six tries are needed on average to find εm.

The optimal estimate of the original BLN model is obtained by setting ε to εm or to a slightly larger value.

5.3 Obtaining correct truth table from noisy data sets

When data sets are noisy, the DFL algorithm, combined with the ε function method, can still find the correct structure of BLNs, but it is probable that there are two different instances of Pa(X′i) → X′i for an instance of Pa(X′i). Recall that the counts of the different instances of Pa(X′i) → X′i are obtained during the learning process, as discussed in Section 3.3. For noisy data sets, these count values can be used to obtain the correct truth tables for X′i. In detail, the DFL algorithm finds all the instances of Pa(X′i) → X′i for each instance of Pa(X′i) and chooses the one with the largest count value.
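A sketch of this majority-count selection (our helper, for illustration):

```python
# For each observed instance of Pa(X'), keep the output with the largest
# count; pairs is a list of (parent_state_tuple, output_value) observations.
from collections import Counter

def majority_truth_table(pairs):
    counts = Counter(pairs)
    best = {}
    for (state, y), c in counts.items():
        if state not in best or c > best[state][1]:
            best[state] = (y, c)
    return {state: y for state, (y, _) in best.items()}
```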

5.4 Experiments of noisy data sets

The noisy data sets are generated by randomly selecting λ% of the samples and inverting their output values v′. To examine the performance of the DFL algorithm on noisy data, we generate 100 OR/RANDOM data sets (50 'head' and 50 'tail' data sets) with noise levels λ from 1% to 20%. The expected cardinality k of the DFL algorithm is still set to the k of the generating BLNs in the experiments of this section.

Then, we run the DFL algorithm with different ε values for these data sets, chosen with the methods introduced in Section 5.2. The relation between εm and λ for the OR/RANDOM data sets is shown in Figure 12(a) and (b), where each εm value is the average over the 10 data sets for that λ value. The relation between εm and λ for AND data sets is similar to that for OR data sets. As demonstrated in Figure 12(a), εm increases with the λ value. This means that the missing part of the entropy of Y, as demonstrated by the


shaded region of Figure 11(b), tends to increase as there is more and more noise in the data sets. The variances of εm in the RANDOM data sets are larger than those in the OR data sets, as demonstrated in Figure 12(a) and (b). This is reasonable, since I(Pa(X′i); X′i) and H(X′i) are more diverse in the RANDOM data sets than in the OR data sets.

The sensitivity of the DFL algorithm remains 1 for all noisy OR data sets, as demonstrated in Figure 12(c). In other words, for all noisy OR data sets, the DFL algorithm correctly finds the original BLNs. As shown in Figure 12(c), the sensitivity of the DFL algorithm on the RANDOM data sets does not decrease significantly even when λ increases to 20%. The DFL algorithm correctly finds the original BLNs for over 98% of the noisy RANDOM data sets, and finds 2/3 of the correct edges of the BLNs for the remaining RANDOM data sets.

Figure 12 The setting and performance of the DFL algorithm for noisy data sets, where k = 3, n = 100 and N = 1000. The values shown are averages over 100 data sets. (a) The minimum ε value, εm, vs. noise level λ (%) in the OR data sets. (b) The minimum ε value, εm, vs. noise level λ (%) in the RANDOM data sets. (c) The sensitivity of the DFL algorithm vs. λ for noisy OR/RANDOM data sets. The curves marked with circles and diamonds are for the OR and RANDOM data sets respectively


The DFL algorithm also correctly finds the truth tables with the method introduced in Section 5.3 for all noisy data sets where the sensitivity is one. For instance, the rules obtained for one RANDOM-h data set are shown in Table 6.

As shown in Table 6, the rules on the left have significantly larger counts than their counterparts on the right. Using the method introduced in Section 5.3, the rules on the right are eliminated and the truth table for this data set consists of the eight rules on the left. Applying a Karnaugh map to the obtained truth table gives exactly the Boolean function X′i = ¬X1 ⋅ X2 + X1 ⋅ ¬X2 ⋅ X3 of the original Boolean network. Thus, the original Boolean network has been successfully identified by the DFL algorithm from this data set with 10% noise. In addition, the total count of the rules on the right is 100, which is exactly 10% × 1000. This means that the rules on the right come from the 10% noise in the data set.

Table 6 The Boolean rules obtained from one noisy RANDOM-h data set with 1000 samples and 10% noise. In the original BLN, X′i = ¬X1 ⋅ X2 + X1 ⋅ ¬X2 ⋅ X3

Rules from noiseless samples        Rules from noisy samples
X1  X2  X3  X′i  Count              X1  X2  X3  X′i  Count
0   0   0   0    137                0   0   0   1    10
0   0   1   0    123                0   0   1   1    8
0   1   0   1    97                 0   1   0   0    10
0   1   1   1    105                0   1   1   0    12
1   0   0   0    98                 1   0   0   1    19
1   0   1   1    118                1   0   1   0    15
1   1   0   0    107                1   1   0   1    12
1   1   1   0    115                1   1   1   1    14
Total count      900                                 100

The run times of the DFL algorithm do not change much for data sets with different noise levels. In other words, the DFL algorithm remains efficient when the data sets are noisy.

5.5 ε function for gene expression data

We also apply the ε function method to the yeast cell cycle data from Cho et al. (1998). The results are shown in Figure 13.

As shown in Figure 13(a) and (b), some regulatory relations that are not found in Figure 10(a) are identified with the ε function method. For example, the autoregulations of Mbp1 and Swi4 are successfully found in Figure 13(a). In Figure 13(b), the regulation of Fkh1 by Mbp1 is identified. In addition, the regulation of Fkh2 by Fkh1 is identified in the experiments with b = 3 and ε = 0.25 (not shown in Figure 13).

However, some regulatory relations also disappear when we make this compromise in the ε function method. For instance, the regulation of Fkh1 by Fkh2 disappears in Figure 13(a). Generally, the GRN model tends to become sparser (contain fewer edges) as the value of ε increases, because fewer genes can satisfy the requirement of the ε function.


Figure 13 The learned GRN model for yeast cell cycle with the ε function method. (a) The base for gene expression value is 3, the indegree of the GRN is 5, and the ε is 0.2 and (b) the base for gene expression value is 4, the indegree of the GRN is 5, and the ε is 0.15. The legends are the same as those of Figure 10


Finally, we give unified models in Figure 14, in which Figure 14(a) combines the results of Figures 10 and 13, and Figure 14(b) is a combined Bayesian network model learned by the K2 algorithm with the base for expression values set to 3 and 4. Figure 14 shows that the model found with the DFL algorithm is more significant than that learned with the K2 algorithm. As mentioned before, no autoregulations are found in Figure 14(b), but 6 out of 7 autoregulations are found in Figure 14(a).

Figure 14 The combined GRN models. (a) Combined model of Figures 10 and 13 and (b) combined Bayesian network structure learned with the K2 algorithm with the base for expression values set to 3 and 4 respectively. The legends are the same as those of Figure 10


6 Related work

In this section, we discuss the related models, PBNs and DBNs. We also show that the DFL algorithm combined with the ε function method can be used to infer PBNs and DBNs. More discussion of the relationships between BLNs, PBNs and DBNs is given by Murphy and Mian (1999) and Shmulevich et al. (2002).


6.1 Probabilistic Boolean Networks (PBNs)

To cope with uncertainty, Shmulevich et al. (2002) introduced PBNs. The basic idea is to extend the BLN to accommodate more than one possible function for each node (Shmulevich et al., 2002). A PBN G(V, F) is defined by a set of variables V = {X1, …, Xn} and a function matrix F = (F1, …, Fn), where Fi is defined by equation (13).

$$F_i = \{f^{(i)}_j\}, \quad j = 1, \ldots, l(i), \quad (13)$$

where each f^(i)_j is a possible function determining the value of gene Xi and l(i) is the number of possible functions for gene Xi. The functions f^(i)_j are referred to as predictors, since the process of inferring these functions from measurements, or equivalently, of producing a minimum-error estimate of the value of a gene at the next time point, is known as prediction in estimation theory (Shmulevich et al., 2002).

The major difference between PBNs and standard BLNs lies in F. As shown in Table 6, the rules learned from noisy data sets contain two truth tables of the three input variables, shown on the left and right sides respectively. In BLNs, the functional relations are deterministic; hence, only the rules on the left side are used as the estimated truth table of X_i′.

However, if we consider this situation from another aspect, the rules on the right side can also be seen as alternative gene expression patterns. That is to say, we can consider building a PBN model with the rules in Table 6. Formally, we let F_i = {f_1, f_2}, where f_1 = ¬X_1·X_2 + X_1·¬X_2·X_3 and f_2 = X_1·X_2 + ¬X_1·X_2 + ¬X_2·¬X_3. The probability that f_j is selected is estimated from the total counts of its rules, i.e., p(f_1) = 900/1000 = 0.9 and p(f_2) = 0.1. Hence, the DFL algorithm combined with the ε function method can be used to build PBNs efficiently.
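As a small illustration of this estimation step (the function and data layout are ours; the counts 900 and 100 reproduce the example above), the selection probabilities can be computed directly from the rule counts:

```python
def predictor_probabilities(rule_counts):
    """Share of the observed transitions explained by each predictor f_j."""
    total = sum(rule_counts)
    return [count / total for count in rule_counts]

# f1 covers the 900 majority rules and f2 the 100 alternative rules (Table 6).
print(predictor_probabilities([900, 100]))  # [0.9, 0.1]
```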

6.2 Dynamic Bayesian Networks (DBNs)

A Bayesian network for V is the tuple B(G, Θ). G is a DAG whose nodes are in one-to-one correspondence with the variables in V and whose edges encode the conditional dependences between variables (Pearl, 1988). In particular, X_i is independent of its non-descendants given its parents Pa_i in G (Pearl, 1988). The second component, Θ, is a set of parameters which quantify the network. In particular, Θ = ∪_i Θ_i, where Θ_i is the Conditional Probability Table (CPT) of node X_i. The Bayesian network B encodes the joint probability over V by

$$p_B(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mathbf{Pa}_i, \Theta_i) \qquad (14)$$

where Pa_i is the set of parents of node X_i in G.

Bayesian networks have been used to model GRNs (Friedman et al., 2000; Hartemink et al., 2002; Segal et al., 2003; Friedman, 2004). However, since directed cycles are not allowed in standard Bayesian networks, Murphy and Mian (1999) and Ong et al. (2002) used DBNs to model GRNs. The BLN can be regarded as a special case of DBN, where the relations between variables are deterministic (Murphy and Mian, 1999).
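As a toy illustration of equation (14), the sketch below evaluates the factorised joint probability for a three-node network X1 → X3 ← X2; the network structure, CPT values and encoding are invented for the example, not taken from the paper.

```python
# Conditional probability tables, indexed by the parent assignment.
cpts = {
    "X1": {(): {0: 0.6, 1: 0.4}},                 # p(X1), no parents
    "X2": {(): {0: 0.5, 1: 0.5}},                 # p(X2), no parents
    "X3": {(0, 0): {0: 0.9, 1: 0.1},              # p(X3 | X1, X2)
           (0, 1): {0: 0.3, 1: 0.7},
           (1, 0): {0: 0.2, 1: 0.8},
           (1, 1): {0: 0.1, 1: 0.9}},
}
parents = {"X1": (), "X2": (), "X3": ("X1", "X2")}

def joint_probability(assignment):
    """Multiply the local conditional probabilities, as in equation (14)."""
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpts[var][pa_values][assignment[var]]
    return p

print(joint_probability({"X1": 1, "X2": 0, "X3": 1}))  # 0.4 * 0.5 * 0.8 = 0.16
```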


Similar to the discussion in Section 6.1, the DFL algorithm combined with the ε function method can also be used to learn DBNs, even when the relations between variables are probabilistic. In DBNs, the expression data are assumed to be generated from a joint distribution, rather than from a deterministic function as in BLNs. The CPT of a variable can be estimated from the frequencies of rules that share the same instance of Pa(X_i′). For example, for (0, 0, 0) of (X_1, X_2, X_3), we have p(0|000) = 137/(137 + 10) = 0.932 and p(1|000) = 0.068 as estimates of P(X_i′ | Pa(X_i′)) based on Table 6.
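The CPT estimation just described is plain frequency counting over transition pairs that share the same parent instantiation. A minimal sketch (data layout ours; the counts reproduce the 137/10 example above):

```python
from collections import Counter, defaultdict

def estimate_cpt(transitions):
    """Estimate p(x_i' | Pa(X_i')) from (parent_values, next_value) pairs."""
    counts = defaultdict(Counter)
    for pa_values, next_value in transitions:
        counts[pa_values][next_value] += 1
    return {pa: {v: c / sum(ctr.values()) for v, c in ctr.items()}
            for pa, ctr in counts.items()}

# 137 transitions take (0, 0, 0) to 0 and 10 take it to 1, as in the text.
data = [((0, 0, 0), 0)] * 137 + [((0, 0, 0), 1)] * 10
print(estimate_cpt(data)[(0, 0, 0)])  # {0: 0.9319..., 1: 0.0680...}
```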

There is one limitation when learning DBNs with the DFL algorithm. Recall that it is assumed in Section 3.1 that, ∀1 ≤ i, j ≤ n, X_i(t) and X_j(t) are independent. That means that in the DBNs learned with the DFL algorithm, the parents Pa(X_i′) all come from the previous time step.

7 Discussions and conclusions

In the Bayesian network field, Friedman et al. (1999) proposed an algorithm for learning Bayesian networks, called the Sparse Candidate algorithm. The Sparse Candidate algorithm chooses parent nodes for the node under consideration with a similar idea of selecting those having strong relations. One of the statistics used for evaluating a new parent node in the Sparse Candidate algorithm, I(X_i; X_j, Pa(X_i)), is similar to the I(X_i′; X_j, Pa(X_i′)) used in the DFL algorithm. However, there are two fundamental differences between the Sparse Candidate algorithm and the DFL algorithm. First, the Sparse Candidate algorithm does not compare the statistic I(X_i; X_j, Pa(X_i)) with the entropy of X_i, H(X_i), but the DFL algorithm compares I(X_i′; X_j, Pa(X_i′)) with H(X_i′). As discussed in Section 3, this comparison is critical: by comparing I(X_i′; X_j, Pa(X_i′)) with H(X_i′), the DFL algorithm knows which subset of variables is sufficient to determine X_i′, based on Theorem 2.1. Consequently, the DFL algorithm avoids the exhaustive search of all subsets of V, which is NP-hard. Second, forward selection is used in the Sparse Candidate algorithm, whereas the DFL algorithm uses a better search scheme, which guarantees an exhaustive search of all subsets of V with ≤ k features.
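For illustration, the sketch below renders this kind of bounded subset search in Python: it examines all subsets of the candidate variables with at most k members and accepts the first one whose mutual information with the target reaches the target's entropy, per Theorem 2.1. It is our own simplified rendering, reusing the entropy and mutual_information helpers from the sketch in Section 5.5, and is not the paper's ∆Tree-based implementation.

```python
from itertools import combinations

def find_sufficient_parents(candidates, target, k, tol=1e-12):
    """Search all subsets of candidate variables with <= k members and
    return the first U with I(U; target) = H(target) (Theorem 2.1).
    `candidates` maps a variable name to its list of observed values."""
    h_y = entropy(target)
    names = sorted(candidates)
    for size in range(1, k + 1):
        for subset in combinations(names, size):
            # One tuple of the subset's values per observation.
            rows = list(zip(*(candidates[name] for name in subset)))
            if mutual_information(rows, target) >= h_y - tol:
                return subset
    return None
```

For noisy data, the equality test would be replaced by the relaxed passes_epsilon criterion sketched earlier.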

The contributions of this paper are threefold.

First, we systematically analyse a way to find functional relations from an information-theoretic approach: if the mutual information between X and Y is equal to the entropy of Y, then Y is a function of X.

Second, we introduce a new algorithm, called DFL, to learn qualitative models of GRNs from microarray gene expression data. The DFL algorithm is a general method for finding discrete functional relations. A strength of the DFL algorithm is that the base for gene expression data is adjustable. This makes it possible to find GRN models of binary values (BLNs) and multi-state values (GLF, etc.) with a universal tool. The experimental results show that it can correctly find the original model of the synthetic data set, and identify biologically significant models from a very limited gene expression data set. In addition, we analyse the theoretical lower bound on the size of data sets needed to accomplish the task of finding these discrete functions. The DFL algorithm is superior to currently existing algorithms, with an expected time complexity of O((kb^k + k^2 log_b n)n^2), although its worst-case complexity is O((b^k + k log_b n)n^(k+1)). We also do experiments on synthetic data sets to validate our analysis of the complexity


of the DFL algorithm. In our experiments, we also find that the sensitivity of the DFL algorithm grows linearly with the logarithm of the number of learning instances.

Finally, we introduce a new concept, called the ε function, to deal with noise in data sets. The ε function method is useful for finding GRN models from noisy data sets. The experiments on yeast cell cycle expression data show that the ε function method is a good supplement to the DFL algorithm. The ε function method is also useful for avoiding the worst-case complexity of the DFL algorithm.

As indicated by the dashed edges in Figures 10, 13 and 14(a), the DFL algorithm finds some regulatory relations that have not yet been experimentally verified. This suggests that the DFL algorithm can be used to guide biological research in deciphering GRNs.

In the future, there are at least two ways to extend the DFL algorithm. First, it is advisable to incorporate other kinds of data, such as genome-wide location data, into the learning procedure of qualitative models. Second, we can automatically explore whether a regulator is an activator or a repressor by calculating the correlation between the regulator and the regulated gene, as shown in Section 4.5.

Acknowledgements

The authors thank Prasanna R. Kolatkar and Ng See-Kiong for their reviews of an early version of this paper.

References

Akutsu, T. et al. (1999) 'Identification of genetic networks from a small number of gene expression patterns under the Boolean network model', Proceedings of Pacific Symposium on Biocomputing '99, Hawaii, HI, Vol. 4, pp.17–28.
Akutsu, T. et al. (2000) 'Algorithm for identifying Boolean networks and related biological networks based on matrix multiplication and fingerprint function', Journal of Computational Biology, Vol. 7, Nos. 3–4, pp.331–343.
Akutsu, T. et al. (2003) 'A simple greedy algorithm for finding functional relations: efficient implementation and average case analysis', Theoretical Computer Science, Vol. 292, No. 2, pp.481–495.
Alur, R. et al. (2001) 'Hybrid modeling and simulation of biomolecular networks', Lecture Notes in Computer Science, Vol. 2034, pp.19–32.
Arnone, M. and Davidson, E. (1997) 'The hardwiring of development: organization and function of genomic regulatory systems', Development, Vol. 124, pp.1851–1864.
Bar-Joseph, Z. et al. (2003) 'Computational discovery of gene modules and regulatory networks', Nature Biotechnology, Vol. 21, pp.1337–1342.
Bolouri, H. and Davidson, E. (2002) 'Modeling transcriptional regulatory networks', BioEssays, Vol. 24, pp.1119–1129.
Cho, R.J. et al. (1998) 'A genome-wide transcriptional analysis of the mitotic cell cycle', Molecular Cell, Vol. 2, pp.65–73.
Cooper, G.F. and Herskovits, E. (1992) 'A Bayesian method for the induction of probabilistic networks from data', Machine Learning, Vol. 9, pp.309–347.
Cover, T.M. and Thomas, J.A. (1991) Elements of Information Theory, John Wiley & Sons, Inc., New York, NY.
de Jong, H. (2002) 'Modeling and simulation of genetic regulatory systems: a literature review', Journal of Computational Biology, Vol. 9, No. 1, pp.67–103.
de Jong, H. et al. (2002) 'Qualitative simulation of the initiation of sporulation in B. subtilis', Tech. Rep. 4527, INRIA.
de Jong, H. et al. (2001) 'Qualitative simulation of genetic regulatory networks: method and application', in Nebel, B. (Ed.): Proceedings of the 17th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, San Mateo, CA, pp.67–73.
DeRisi, J. et al. (1997) 'Exploring the metabolic and genetic control of gene expression on a genomic scale', Science, Vol. 278, No. 5338, pp.680–686.
D'haeseleer, P. et al. (2000) 'Genetic network inference: from co-expression clustering to reverse engineering', Bioinformatics, Vol. 16, No. 8, pp.707–726.
Dougherty, J. et al. (1995) 'Supervised and unsupervised discretization of continuous features', Proceedings of the 12th International Conference on Machine Learning, Tahoe City, CA, pp.194–202.
Endy, D. and Brent, R. (2001) 'Modelling cellular behavior', Nature, Vol. 409, No. 6818, pp.391–395.
Friedman, N. (2004) 'Inferring cellular networks using probabilistic graphical models', Science, Vol. 303, No. 5659, pp.799–805.
Friedman, N. et al. (2000) 'Using Bayesian networks to analyse expression data', Journal of Computational Biology, Vol. 7, Nos. 3–4, pp.601–620.
Friedman, N. et al. (1998) 'Learning the structure of dynamic probabilistic networks', Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI-1998), Morgan Kaufmann Publishers, San Francisco, CA, pp.139–147.
Friedman, N. et al. (1999) 'Learning Bayesian network structure from massive datasets: the "sparse candidate" algorithm', Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), Morgan Kaufmann Publishers, San Francisco, CA, pp.206–215.
Ghosh, R. and Tomlin, C.J. (2001) 'Lateral inhibition through Delta-Notch signaling: a piecewise affine hybrid model', Lecture Notes in Computer Science, Vol. 2034, pp.232–246.
Glass, L. and Kauffman, S. (1973) 'The logical analysis of continuous non-linear biochemical control networks', Journal of Theoretical Biology, Vol. 39, pp.103–129.
Hartemink, A. et al. (2002) 'Bayesian methods for elucidating genetic regulatory networks', IEEE Intelligent Systems, Vol. 17, No. 2, pp.37–43.
Hasty, J. et al. (2002) 'Engineered gene circuits', Nature, Vol. 420, pp.224–230.
Hasty, J. et al. (2001) 'Computational studies of gene regulatory networks: in numero molecular biology', Nature Reviews Genetics, Vol. 2, No. 4, pp.268–279.
Heckerman, D. (1995) 'A tutorial on learning Bayesian networks', Tech. Rep. MSR-TR-95-06, Microsoft Research.
Heckerman, D. et al. (1995) 'Learning Bayesian networks: the combination of knowledge and statistical data', Machine Learning, Vol. 20, No. 3, pp.197–243.
Lahdesmaki, H. et al. (2003) 'On learning gene regulatory networks under the Boolean network model', Machine Learning, Vol. 52, Nos. 1–2, pp.147–167.
Lee, T.I. et al. (2002) 'Transcriptional regulatory networks in Saccharomyces cerevisiae', Science, Vol. 298, No. 5594, pp.799–804.
Liang, S. et al. (1998) 'REVEAL, a general reverse engineering algorithm for genetic network architectures', Proceedings of Pacific Symposium on Biocomputing '98, Maui, HI, Vol. 3, pp.18–29.
Mendoza, L. et al. (1999) 'Genetic control of flower morphogenesis in Arabidopsis thaliana: a logical analysis', Bioinformatics, Vol. 15, Nos. 7–8, pp.593–606.
Mestl, T. et al. (1995) 'A mathematical framework for describing and analysing gene regulatory networks', Journal of Theoretical Biology, Vol. 176, pp.291–300.
Murphy, K. and Mian, S. (1999) 'Modelling gene expression data using dynamic Bayesian networks', Tech. Rep., Computer Science Division, University of California, Berkeley, CA.
Ong, I.M. et al. (2002) 'Modeling regulatory pathways in E. coli from time series expression profiles', Bioinformatics, Vol. 18, pp.S241–S248.
Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, CA.
Sanchez, L. and Thieffry, D. (2001) 'A logical analysis of the Drosophila gap genes', Journal of Theoretical Biology, Vol. 211, pp.115–141.
Sanchez, L. et al. (1997) 'Establishment of the dorso-ventral pattern during embryonic development of Drosophila melanogaster: a logical analysis', Journal of Theoretical Biology, Vol. 189, pp.377–389.
Segal, E. et al. (2003) 'Discovering molecular pathways from protein interaction and gene expression data', Bioinformatics, Vol. 19, Suppl. 1, pp.i264–i272.
Shannon, C. and Weaver, W. (1963) The Mathematical Theory of Communication, University of Illinois Press, Urbana, IL.
Shmulevich, I. et al. (2002) 'Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks', Bioinformatics, Vol. 18, No. 2, pp.261–274.
Simon, I. et al. (2001) 'Serial regulation of transcriptional regulators in the yeast cell cycle', Cell, Vol. 106, pp.697–708.
Smolen, P. et al. (2000) 'Modeling transcriptional control in gene networks: methods, recent results, and future directions', Bulletin of Mathematical Biology, Vol. 62, pp.247–292.
Spellman, P.T. et al. (1998) 'Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization', Molecular Biology of the Cell, Vol. 9, pp.3273–3297.
Thieffry, D. and Thomas, R. (1998) 'Qualitative analysis of gene networks', Proceedings of Pacific Symposium on Biocomputing '98, Maui, HI, Vol. 3, pp.77–88.
Thomas, R. and d'Ari, R. (1990) Biological Feedback, CRC Press, Boca Raton, FL.
Thomas, R. et al. (1995) 'Dynamical behaviour of biological regulatory networks: I. Biological role of feedback loops and practical use of the concept of the loop-characteristic state', Bulletin of Mathematical Biology, Vol. 57, No. 2, pp.247–276.
Yeung, R.W. (2002) A First Course in Information Theory, Kluwer Academic/Plenum Publishers, New York.
Zheng, Y. and Kwoh, C.K. (2004) 'Dynamic algorithm for inferring qualitative models of gene regulatory networks', Proceedings of the 3rd Computational Systems Bioinformatics Conference, CSB 2004, IEEE Computer Society Press, Stanford, CA, pp.353–362.

Notes

1 Except the ∆1 supersets, only a part of the other ∆i (i = 2, …, k – 1) supersets is stored in the ∆Tree.

2 The supplements of this paper are available at http://www.ntu.edu.sg/home5/pg04325488/ijdmb05.htm.

Appendix

A The proofs

The proof for Theorem 2.1 is given below. Theorem 2.1 has been proposed or proved in the literature. Yeung (2002) gave a proof for the theorem that H(Y|X) = 0 if and only if Y is a function of X, from which it is straightforward to obtain Theorem 2.1. In Cover and Thomas (1991), Theorem 2.1 is also proposed as an exercise.


Theorem A.1: If the mutual information between X and Y is equal to the entropy of Y, i.e., I(X; Y) = H(Y), then Y is a function of X.

Proof: From the definition of mutual information, we obtain I(X; Y) = H(Y) – H(Y|X) = H(Y). Thus, H(Y|X) = 0.

From the definition of conditional entropy,

$$H(Y \mid X) = -\sum_{x}\sum_{y} p(x, y)\log p(y \mid x) = -\sum_{x}\sum_{y} p(x, y)\log\frac{p(x, y)}{p(x)} = 0$$

Since every term in this sum is non-negative, each term must be zero, i.e., p(y|x) = 1 whenever p(x, y) > 0. Hence, p(x, y) = p(x). That is to say, for every value x of X with p(x) > 0, there exists only one possible value y of Y that satisfies p(x, y) = p(x) > 0. In other words, Y is a function of X.
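As a quick numerical illustration of Theorem A.1 (our own check, reusing the entropy and mutual_information helpers sketched in Section 5.5): when Y is a deterministic function of X, the empirical I(X; Y) equals H(Y), and perturbing some outputs breaks the equality.

```python
# Y = X mod 2 is a function of X, so I(X; Y) equals H(Y).
xs = [0, 1, 2, 3] * 250
ys = [x % 2 for x in xs]
print(mutual_information([(x,) for x in xs], ys), entropy(ys))  # 1.0 1.0

# Flipping every tenth output destroys the functional relation.
noisy = ys[:]
noisy[::10] = [1 - y for y in noisy[::10]]
print(mutual_information([(x,) for x in xs], noisy) < entropy(noisy))  # True
```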

The proof for Theorem 3.2 is given as follows.

Theorem A.2: Ω(b^k + k log_b n) transition pairs are necessary in the worst case to identify the qualitative GRN models of maximum indegree ≤ k and maximum base for variables ≤ b.

Proof: Firstly, we consider the number of mutually distinct qualitative GRN models of maximum indegree ≤ k and the maximum base for variables ≤ b.

There are $\binom{n}{k} \approx n^k$ possible combinations of inputs for a given gene, and $b^{b^k}$ possible discrete functions of base b for each gene. Thus, there are on the order of $(b^{b^k} \cdot n^k)^n$ mutually distinct qualitative GRN models of maximum indegree ≤ k and maximum base for variables ≤ b.

Therefore, Ω(b^k · n log_2 b + nk log_2 n) bits are required to represent a GRN model of maximum indegree ≤ k and maximum base for variables ≤ b.

Finally, we consider the number of transition pairs. Each transition pair carries at most n log_2 b bits of information when the maximum base for variables is ≤ b. Dividing the number of required bits by the information per pair gives (b^k · n log_2 b + nk log_2 n)/(n log_2 b) = b^k + k log_b n. Hence, Ω(b^k + k log_b n) transition pairs are required in the worst case.
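For a sense of the magnitude of this bound, a small numeric check with example values of our own choosing:

```python
import math

def min_transition_pairs(b, k, n):
    """Worst-case lower bound b^k + k * log_b(n) on the number of
    transition pairs needed (Theorem A.2), rounded up to whole pairs."""
    return math.ceil(b ** k + k * math.log(n, b))

# A Boolean network (b = 2) of n = 100 genes with maximum indegree k = 3:
print(min_transition_pairs(2, 3, 100))  # 8 + 3 * log2(100), i.e. 28 pairs
```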