eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving...

14
Computers in,Biology and Medicine 43 (2013) 1120-1133 Contents lists available at SciVerse ScienceDirect Computers in Biology and Medicine journal homepage: www.elsevier.com/locate/cbm Multi-stage filtering for improving confidence level and determining cp CrossMark dominanf clusters in clustering algorithms of gene expression data ~hahre&l<asim "J". jafaai Deris '. Razib M. 0thman a Software Multimedia Center, F lty of Computer Science and Information Technology. Universiti Tun Hussein Onn Malaysia, 86400 Parit Raja. Batu Paha Yff aloysia Laboratory of Computational Intelligence and Biotechnology, Faculty of Computing, Universiti Teknologi Moloysia. 81310 IJTM Skudai, Malaysia 'Artificial Intelligence and Bioinformatics Research Croup. Faculty of Computing, Universiti Teknologi Malaysia, 81310 IJTM Skudai, Malaysia ARTICLE INFO ABSTRACT Article history: Received 12 December 2010 Accepted 15 May 2013 Keywords: Confidence level Dominant cluster Fuzzy clustering Gene expression A drastic improvement in the analysis of gene expression has lead to new discoveries in bioinformatics research. In order to analyse the gene expression data, fuzzy clustering algorithms are widely used. However, the resulting analyses from these specific types of algorithms may lead to confusion in hypotheses with regard to the suggestion of dominant function for genes of interest. Besides that, the current fuzzy clustering algorithms do not conduct a thorough analysis of genes with low membership values. Therefore, we present a novel computational framework called the "multi-stage filtering- Clustering Functional Annotation" (msf-CluFA) for clustering gene expression data. The framework consists of four components: fuzzy c-means clustering (msf-CluFA-0), achieving dominant cluster (msf- CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2 (msf-CluFA-3). By employing double filtering in msf-CluFA-1 and aprion algorithms in msf-CluFA-2, our new framework is capable of determining the dominant clusters and improving the confidence level of genes with lower membership values by means of which the unknown genes can be predicted. o 2013 Elsevier Ltd. All rights resewed. 1. Introduction natural behaviour of gene expression datasets 1291. the fuzzy c-means clustering has become an appropriate choice 1241. This Microarray technology has become a significant tool in func- is due to the ability of fuzzy c-means algorithms to cluster genes tional genomics and biomedical research. This technology allows for simultaneous measurement of gene expression for thousands of genes. The resulting data can then be used in various ways, such as in diagnosing tumours [25], drug-effect profiling 1351 and identification of genes that contribute to common functions by grouping genes with similar expression of patterns using either clustering or classification techniques [22.28]. In discovering similar patterns in gene expression datasets, clustering has been used extensively and this may lead to an insight into significant connections within the gene regulatory networks. In order to understand the pattern of these genes, many contributions have been made by clinical 123,301, biological 147,381, toxicological 114,271 and pharmacological [37.17] studies. There are many clustering algorithms currently used for clustering gene expression datasets such as the k-means 191, hierarchical clustering 1481. Self-Organising Maps (SOM) 1161. graph theoretical algorithms 1431, Genetic Algorithms (GA) 141 and fuzzy c-means 126,411. Since imprecision and uncertainty are considered to be the *Corresponding author at: Software Multimedia Center, Faculty of Computer Science and Information Technology. Universiti Tun Hussein Onn Malaysia. 86400 Parit Raja. Batu Pahat, Malaysia. Tel.: +60 7 453 3776; fax: +60 7 453 6023. E-mail address: [email protected],~ny (S. Kasim). into more than one group thus providing a systematic, unbiased method to change precise values into several descriptors of cluster membership 1321. Furthermore, the fuzzy c-means clustering provides more information on the degree of similarity 1111 among genes. Therefore, the fuzzy c-means algorithm is applied in the clustering process to partition a given gene expression dataset. However, clustering alone, without paying attention to the coher- ence of biological functions within the clusters, brings no mean- ingful results. Coherence can be seen when one gene is involved in multiple biological functions. In order to capture the coherence of biological functions in the clusters, biological knowledge is applied during the clustering process. One of the many popular sources of biological knowledge is the Gene Ontology (GO) [I]. The popular- ity of GO is due to its capability to provide a standard species- independent controlled vocabulary for describing genes in terms of their biological processes, cellular components and molecular functions. Related research incorporating the GO in gene expres- sion clustering is found in numerous studies [15,34.7,21.39]. However, there are still some other issues associated with the acquisition and analysis of gene expression datasets which can impose a profound influence on the interpretation of the results. One of these problems is the degrading performance of clustering results due to certain situations in which a gene can have multiple 0010-4825/$-see front matter o 2013 Elsevier Ltd. All rights resewed. http://dx.doi.org/l0.1016/j.compbiomed.2013.05.011

Transcript of eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving...

Page 1: eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2

Computers in,Biology and Medicine 43 (2013) 1120-1133

Contents lists available at SciVerse ScienceDirect

Computers in Biology and Medicine

journal homepage: www.elsevier.com/locate/cbm

Multi-stage filtering for improving confidence level and determining cp CrossMark

dominanf clusters in clustering algorithms of gene expression data

~hahre&l<asim "J". jafaai Deris '. Razib M. 0thman a Software Multimedia Center, F lty of Computer Science and Information Technology. Universiti Tun Hussein Onn Malaysia, 86400 Parit Raja. Batu Paha Yff aloysia

Laboratory of Computational Intelligence and Biotechnology, Faculty of Computing, Universiti Teknologi Moloysia. 81310 IJTM Skudai, Malaysia 'Artificial Intelligence and Bioinformatics Research Croup. Faculty of Computing, Universiti Teknologi Malaysia, 81310 IJTM Skudai, Malaysia

A R T I C L E I N F O A B S T R A C T

Article history: Received 12 December 2010 Accepted 15 May 2013

Keywords: Confidence level Dominant cluster Fuzzy clustering Gene expression

A drastic improvement in the analysis of gene expression has lead to new discoveries in bioinformatics research. In order to analyse the gene expression data, fuzzy clustering algorithms are widely used. However, the resulting analyses from these specific types of algorithms may lead to confusion in hypotheses with regard to the suggestion of dominant function for genes of interest. Besides that, the current fuzzy clustering algorithms do not conduct a thorough analysis of genes with low membership values. Therefore, we present a novel computational framework called the "multi-stage filtering- Clustering Functional Annotation" (msf-CluFA) for clustering gene expression data. The framework consists of four components: fuzzy c-means clustering (msf-CluFA-0), achieving dominant cluster (msf- CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2 (msf-CluFA-3). By employing double filtering in msf-CluFA-1 and aprion algorithms in msf-CluFA-2, our new framework is capable of determining the dominant clusters and improving the confidence level of genes with lower membership values by means of which the unknown genes can be predicted.

o 2013 Elsevier Ltd. All rights resewed.

1. Introduction natural behaviour of gene expression datasets 1291. the fuzzy c-means clustering has become an appropriate choice 1241. This

Microarray technology has become a significant tool in func- is due to the ability of fuzzy c-means algorithms to cluster genes tional genomics and biomedical research. This technology allows for simultaneous measurement of gene expression for thousands of genes. The resulting data can then be used in various ways, such as in diagnosing tumours [25], drug-effect profiling 1351 and identification of genes that contribute to common functions by grouping genes with similar expression of patterns using either clustering or classification techniques [22.28].

In discovering similar patterns in gene expression datasets, clustering has been used extensively and this may lead to an insight into significant connections within the gene regulatory networks. In order to understand the pattern of these genes, many contributions have been made by clinical 123,301, biological 147,381, toxicological 114,271 and pharmacological [37.17] studies. There are many clustering algorithms currently used for clustering gene expression datasets such as the k-means 191, hierarchical clustering 1481. Self-Organising Maps (SOM) 1161. graph theoretical algorithms 1431, Genetic Algorithms (GA) 141 and fuzzy c-means 126,411. Since imprecision and uncertainty are considered to be the

*Corresponding author at: Software Multimedia Center, Faculty of Computer Science and Information Technology. Universiti Tun Hussein Onn Malaysia. 86400 Parit Raja. Batu Pahat, Malaysia. Tel.: +60 7 453 3776; fax: +60 7 453 6023.

E-mail address: [email protected],~ny (S. Kasim).

into more than one group thus providing a systematic, unbiased method to change precise values into several descriptors of cluster membership 1321. Furthermore, the fuzzy c-means clustering provides more information on the degree of similarity 1111 among genes. Therefore, the fuzzy c-means algorithm is applied in the clustering process to partition a given gene expression dataset. However, clustering alone, without paying attention to the coher- ence of biological functions within the clusters, brings no mean- ingful results. Coherence can be seen when one gene is involved in multiple biological functions. In order to capture the coherence of biological functions in the clusters, biological knowledge is applied during the clustering process. One of the many popular sources of biological knowledge is the Gene Ontology (GO) [I]. The popular- ity of GO is due to its capability to provide a standard species- independent controlled vocabulary for describing genes in terms of their biological processes, cellular components and molecular functions. Related research incorporating the GO in gene expres- sion clustering is found in numerous studies [15,34.7,21.39].

However, there are still some other issues associated with the acquisition and analysis of gene expression datasets which can impose a profound influence on the interpretation of the results. One of these problems is the degrading performance of clustering results due to certain situations in which a gene can have multiple

0010-4825/$-see front matter o 2013 Elsevier Ltd. All rights resewed. http://dx.doi.org/l0.1016/j.compbiomed.2013.05.011

Page 2: eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2

S. Kasim et al. / Computers in Biology and Medicine 43 (2013) 1120-1133 1121

functions. It has been observed that whenever a gene belongs to multiple functions, it will create confusions in choosing the dominant function of that particular gene. Another issue that frequently occurs is the ability to assign with high confidence genes that have low membership values in the cluster. This issue arises when some of the genes are considered as being clustered with high confidence due to their high membership values. Concurrently, some other genes that belong to the same cluster but have lower membership values are assigned with low con- fidence. This low confidence resulted from the presence of genes that are near the cluster border line or are slightly far from the centroid.

Based upon the observations stated above, we propose for an improved fuzzy c-means algorithm named the msf-CluFA (multi- stage filtering-Clustering Functional Annotation), as shown in Fig. 1, which not only takes into account the observation of genes with high membership values, but also handles genes with low membership values with more serious consideration. Our msf-CluFA has three main stages, referred to as the fuzzy c-means clustering stage (msf-CluFA-0). the achieving dominant cluster stage (msf-CluFA-1) and the improving confidence level stage (msf-CluFA-2). In msf-CluFA-0, the enhancement of traditional fuzzy c-means took place as the GO and Saccharomyces Genome Database (SGD: [lo]) functional annotation databases were incor- porated into the fuzzy c-means algorithm. Next, there were two filtering stages in which the genes with high membership values were initially filtered into the first stage filtering (msf-CluFA-1). At this stage, genes were assigned into their dominant cluster by filtering based on the membership values and degree of specificity. Concurrently, the genes with low membership values were filtered

into the second stage filtering (msf-CluFA-2) with the intention of improving their confidence level and ability to be included in the cluster with confidence. This was done by applying an apriori algorithm [3] to detect the co-occurrences of annotations. From the co-occurrences, the genes were then filtered based on the ranking system. The details of our msf-CluFA are explained in the next section. The difference of ms-CLuFA compared to the method proposed by Bandyopadhyay et al. 141 is that their work used the genetic algorithm in the first stage clustering. Furthermore, a multi-objective genetic clustering together with the nearest neigh- bour criterion has been used in their second stage clustering.

2. Materials and methods

2.1. Datasets

The yeast gene expression datasets from Eisen et al. 1131 and Gasch et al. 1191 were used in order to test the new framework of our msf-CluFA. There were 6221 expression profiles corresponding to four experiments on cell cycle, sporulation. temperature shock and diauxic shift processes in the Eisen dataset. On the other hand, in the Gasch dataset, there were 6152 expression profiles gathered over 173 of various experiments tested.

In order to cluster and identify the similarity between our msf-CluFA and the GO. we downloaded the GO slim yeast data and GO terms data from an updated version from September 2005. There were 76 terms in the GO slim data and 19,458 terms in the MySQL GO term data. The SGD compiled in September 2005,

Fuzzy C-Means Clustering Stage (msf-CluFA-0) Achieving Dominant Cluster Stage (msf-CluFA-1)

I For each gene, trace the I GO datasets (GOSlim, Number of cluster initializations clusters to which it belongs

I

Dominant cluster I Fuzzy membership initialization

Centroid calculation

I I I Fuzzy membership update I I I

I I

Improving Confidence Level Stage (msf-CluFA-2)

Group into one list clusters to which it belongs

v

Combine (msf-CluFA-3)

Apply genes with apriori algorithm

Evaluation by Dominant cluster HD, z-score

distribution (HD) achieved

Fig. 1. The flowchart of msf-CluFA.

Page 3: eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2

1122 5. Kasim e t al. / Compurers in Biology and Medicine 43 (2013) 1120-1133

which contained 33.651 genes. was also used to extract the GO terms.

2.2. Fuzzy c-means clustering

During the clustering stage (msf-CluFA-0). we employed the fuzzy c-means algorithm. There are four steps in the algorithm. They are: (i) initialisation of cluster number. (ii) initialisation of fuzzy member- ship. (iii) calculation of centroid and (iv) the fuzzy membership update as illustrated in Fig. 1. In the first step, we used the 76 terms in the GO slim dataset, GOsli, to form 76 clusters, clj where j = 1, ..., l and I is the number of clusters. For the gene expression dataset GE, each gene in the dataset is represented as gi where gieGE. In order to assign each gene gi to clj, the term for gi can be mapped to t which is a descendant term of clj in a GO hierarchy.

Then, in the second step, the initial membership value, mv:) for gene gi is defined by using the following formula:

mv;)=rsii(l-a)+ar, i = l , ..., g,, j = l , ..., c4 (1)

where rsij is the reliability score for g to be in a cluster cl. For gene gi that is no evidence code, thus the initial membership value of the following equation is used for calculation:

The use of the constant a is to give some variation in the iteration and r is a small constant to denote the level of reliability when gi has no GO annotation evidence code. Both a and r values need to be assigned within the range ofOra, r < 1 .

Next, the centroid calculation where ei is a vector expression value for gene gi, the fuzzy centroid. FC(~) = [fc,] for k = number of iteration is defined as

where ms(1, W) is the fuzzy parameter. In the final step, fuzzy membership is updated by

where

Then, the optimal cluster is achieved by the compactness and separation (CS) which are determined by the minimum CS value after a pre-defined number of iterations as defined below:

During the iteration, if the current CS value (CS) is less than the minimum CS value (CS*), then CS* = CS, optimal cluster cl* = clk and optimal membership mv* = mvk. The steps from Eq. (4) to Eq. (5) are repeated until they reach the pre-defined number of iterations. An illustrated example for the output of this clustering is presented in Fig. 2(a). We took an example of five clusters which included cluster 17, which consists of 320 genes. cluster 33. which consists of 486 genes, cluster 39, which consists of 813 genes, cluster 43, which consists of 2520 genes and cluster 49, which consists of 1789 genes. For instance, in cluster 17 we looked into a particular gene YDIUIOS which holds a 0.089 membership value and gene YALOOlC which holds a 0.098 membership value. The gene YDIUlOC also exists in cluster 33 with the same membership value. The Gene YAUIOIC exists in cluster 33 with a 0.094 membership value and it also exists in clusters 39.43 and 49 with the same membership values (0.092).

The output of this stage categorised the results into high membership and low membership values, as shown in Fig. 3. The genes in the optimal cluster are considered to have high membership values while genes which are located outside of the optimal cluster are considered to have low membership values. During the initial clustering stage, the genes were clustered according to the GO annotation file. Once the genes were assigned to their optimal cluster, there were genes that were left out due to their slightly lower membership values than those in the highly compact clusters. For genes with high membership values, we proceeded with filtering steps in order to assign them into their dominant cluster (msf-CluFA-1). In addition, we also took into consideration the filtering of the genes with low membership values in order to cluster them back into their particular cluster by upgrading their confidence level (msf-CluFA-2).

2.3. Achieving a dominant cluster

The process began by grouping the membership values, mvv, for each cluster. ck, to which the gene belonged in order to compare their membership values, which are calculated by the fuzzy c-means clustering. Next, the genes were analysed based on their membership values. If the gene had only one high member- ship value among all clusters then the gene would be assigned to a cluster that contains the highest membership values. The gene would be assigned to the dominant cluster when mv*<mvv, mv*=mv@ CIC=Ck. In this component, we left out genes that already appeared in only one cluster, while we processed those genes that appeared in multiple clusters. In cases where a gene has multiple clusters holding it with the highest membership values. the gene would be further filtered using a specificity definition [31] as shown below:

where @is calculated from f number of GO terms divided by the total number of genes in cluster i and is calculated from f number of GO terms within the cluster i divided by the total number of genes with f GO terms in all clusters.

A higher value of SC indicates a better degree of specification of gene function in a cluster compared to others. Filtering the genes with SC enabled the assignment of the genes that still existed in multiple functions to their respective dominant cluster. This process is continued until all the genes in all clusters have gone through the filtration. Then, the results for dominant clusters were combined with the results from the next component (msf-CluFA-3), as explained below.

Once the dominant clusters are achieved, the hypergeometric distribution (HD) calculation was performed in order to evaluate the clusters based on the probable number of genes involved in a cluster. An illustrated example for the output of this filtering is demonstrated in Fig. 2(b). For instance, the gene YDR3lOC still remained in clusters 17 and 33 after being filtered by the highest membership values. However, once the gene had been filtered by the SC, the gene YDR310C was assigned to cluster 17. Mean- while, the gene YALOOlC was assigned to cluster 17 after being filtered by the highest membership values. Defuzzification gives minor implications to the overall results. Instead of removing genes that do not belong to the dominant cluster, situations in which clusters with a balanced number of genes still exist.

2.4. Improving confidence level for low membership values

In this stage, genes below the threshold value (0.05) which are considered as low membership values, gi, in each of the clusters. clj, were traced and grouped into a list, LM. Then, the GENECODIS

Page 4: eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2

S. Kasim et 01. / Computers in Biology and Medicine 43 (2013) 1120-1133 1123

a Cluster 17 (320) Cluster 33 (486) Cluster 39 (813) Cluster 43 (2520) Cluster 49 (1789)

b Cluster 17 (320) I A I

Cluster 17 Cluster 17 (98) <-z-> (0.098) <=--> (0.098)

I I

Cluster 33 (486) I Cluster 33 I I Cluster 33 (53) I I I I I I I I I Cluster 39 (295) Cluster 39 (813)

I I Cluster 39

YALOOlC Filter by SC

I Cluster 43 (2520) I Cluster 43 I Cluster 43 (866)

I I

(0.091)

YALOOlC

YCR061W (0.128)

Cluster 49 (1789) Cluster 49 Cluster 49 (510)

YALOOlC 4-3 YCRO61W

(0.128)

Cluster 17 (98)

Cluster 33 (86)

.-------------. I Cluster 17 (0) L - - - - - - - - - - - - - l

-------------- j cluster 33 (33) j

_ _ _ _ _ _ _ _ _ _ _ _ _ _ I

Cluster 39 (295)

YAUJ62W

Cluster 43 (997)

I Cluster 39 (0) L - - - - - - - - - - - - - l

Legend:

Cluster # (number of high membership value or low membership value)

&- Genes with low membership values

Cluster 49 (601)

Fig. 2. An illustrated example for (a) the fuzzy c-means clustering stage, (b) achieving the dominant cluster, and (c) improving the confidence level.

Page 5: eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2

5. Icasim et al. / CompuFers in Biology and Medicine 43 (2013) 1120-1133

Optimal cluster \ Initial cluster

Genes with law

Fig. 3. The illustrated example of high me

[S] that applied the apriori algorithm is used which allows for integrative extraction of frequently co-occurring annotations in a given gene list across the GO terms. The apriori algorithm is the most well-ltnown association rule algorithm which was initially introduced by Agrawal et al. 121. Given that I = (il , i2, ..., in] is a set of literals called items. T is the transaction that contains a set of items such that T E I and D is a database with different transaction records T,. An association rule is an implication of the form of X+Y, where X and Ycl are sets of items called itemsets and XnY = 4. Here, X is called antecedent and Y consequent. Based on the association rule, the apriori algorithm generates sets of elements that frequently co-occur in a database of transactions. Briefly, the CENECODlS procedure begins by determining the set of all single annotations (itemset) that appear in at least x genes (also known as the support threshold) from the list of interest and establishes the frequent b itemsets, where b=l. In the second iteration (b=2), the set of frequent annotations found in the previous step is used to produce a new set of candidates of size 2 (2-itemset) and the database is scanned again to explore each gene and to count on the frequency of each pair of annotations. However, if the set of annotations does not satisfy the minimum support constraint, that is, they do not occur in at least x genes then they are not further considered to generate larger itemsets. The procedure continues until no additional combinations are possible. At the end of this search, all itemsets that contain the collection of annotations that co-occur in at least x genes are obtained. Then, each gene is filtered by comparing its hypergeo- metric distribution for each concurrent annotation. A gene is assigned to a dominant cluster when it has the lowest value of hypergeometric distribution. Meanwhile, when a gene has already appeared in one cluster and does not have co-occurrence with other annotations, it is directly assigned to the dominant cluster to which it belongs. This process continues until all the genes on the list have been filtered and assigned to their dominant clusters.

Once the process is completed, the results are combined at the achieved dominant cluster stage. Subsequently, the results are evaluated with HD and z-score values in order to check on the biological significance of the clusters. An illustrated example of the output of this filtering is shown in Fig. 2(c). In Fig. 2(c), those five clusters consisted of low membership values. The cluster 17 has three genes, cluster 33 has 2 genes, cluster 39 has four genes, cluster 43 has eleven genes and cluster 49 has seven genes with low membership values. After these genes were grouped and the

lmbership values and low membership values.

apriori algorithm was applied, the genes with low membership values in cluster 17 did not have an improved confidence level; therefore. cluster 17 remained with 98 genes.

2.5. Evaluation measurement

The results from our msf-CluFA were evaluated through several measurements, the compactness and separation 1461, HD, z-score and cluster profile. These evaluations were employed to evaluate the results from the proposed framework and thus reflecting the value of the resulting clusters. The measures were as follows:

(1) Compactness and separation: This measurement is to deter- mine the ratio of the compactness within the cluster to the separation of cluster among other clusters [44,12.46]. in which the smallest value of SC denotes the minimum intra-cluster and the maximum inter-cluster (as defined by Eq. (6)).

(2) Hypergeomem'c distribution: This measurement was to deter- mine the probability of the number of genes involved in a cluster. For a given cluster, the probability of HD can be defined as

where a list of gene n, for every gene, i, M is the number of genes annotated with GO term and N is the total number of genes in the genome with GO annotations. This evaluation determines that the closer the HD to 0, the chance of the genes in the cluster to be associated with the GO term is higher.

(3) z-score: This measurement was used to check on the mutual information relating to the clustering results and the SGD gene annotation data which is better compared with other classical measures such as the Rand index and Adjusted Rand index 140.423. A higher z-score indicates a clustering result that is further from random. The z-score is computed using the ClusterJudge 1201. In particular, the ClusterJudge first deter- mines a set of gene attributes among those provided by the GO that are independent and significant; it then computes the mutual information of the proposed clustering and that of a reference random clustering. Finally it returns the z-score.

Page 6: eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2

S. Kasim ec al. / Coquters in Biology and Medicine 43 (2013) 1120-1133 1125

Mlreal-Mlrandom/~random. where MIrandom is the mean of the mutual information score for the random clustering used and MIrandom is the standard deviation. Given the randomized nature of the test, different runs produce slightly different numerical values. However, the ranking of the method is stable and consistent across different applications of the evaluation tool. For this reason, for each dataset used, we repeated the evaluation of the output for all the different algorithms ten times, reported here as the average z-score.

(4) Cluster profile: The cluster profile shows the normalised gene expression values with respect to the time points in its experiment.

3. Results and discussion

3.7. Comparison of clustering results annotation

Table 1 shows a comparison of the msf-CluFA-1 and msf-CluFA- 3. Fang's and Eisen's cluster results. It could be observed that 9 genes in the msf-CluFA-1 and msf-CluFA-3 cluster number 72 were matched with the genes in Eisen's cluster number 6. compared with 8 genes in Fang's cluster number 75. Furthermore, there were 51 more genes included in the msf-CluFA-1 and 53 genes in the msf-CluFA-3. As seen in Table 2, all genes in the msf- CluFA-1 and msf-CluFA-3 cluster number 45 matched all 27 genes in Eisen's cluster number 2 compared with 24 genes in Fang's cluster number 335. Moreover. 31 more genes were included in the msf-CluFA-1 and msf-CluFA-3. The comparison of HD also showed that our msf-CluFA-1 and msf-CluFA-3 have the lowest HD among the other two clusters. This proved that the extra genes included in Tables 1 and 2 really belonged to the particular clusters. This is based on Boyle et al. [6] where they stated that the lower the HD value, the more significant the GO term is in its association to the group of genes. In Fig. 4. the profiles of these clusters are shown. The cluster profiles illustrate the expression pattern that spans a set of conditions (x-axis), while the y-axis holds the expression values varying from negative to positive. From these profiles. one can see that expression profiles for different clusters differ from one another, while expression pro- files within a cluster are reasonably similar. These cluster profiles have also been shown in Fang et al. [15] and Eisen et al. [13] to demonstrate the expression pattern for their cluster results.

3.2. Resultsfrom achieving dominant clusters and improving the confidence level of low membership values

One of the objectives of this study was to examine genes with multiple functions in order to put these genes into their dominant cluster. In Fig. 5, due to space limitations, we present the msf- CluFA-0, msf-CluFA-1 and msf-CluFA-3 for 76 clusters in two sub- figures. The x-axis represents the number of clusters whereas the y-axis represents the number of genes. There was a significant decrease in the fraction of genes in msf-CluFA-1 and msf-CluFA-3, whereas in the msf-CluFA-0 clusters, the number of genes was significantly higher; this effect was clearer in clusters 0-37 than in clusters 38-75. The results shown in this figure conformed to the gene assignment in which the gene is assigned only to their dominant cluster (in the stage of msf-CluFA-1 and msf-CluFA-3). thus reflecting the decreasing number of genes in the clusters. We also compared HD with the five clusters employed by Fang in his method, which could be associated to our four clusters; GO:0006950 (response to stress). G0:0045333 (cellular respiration). GO:0019725 (cellular homoeostasis) and GO:0006810 (electron transport), as shown in Table 3. Not surprisingly, lower HD was

Table 1 Comparison of Eisen's cluster number 6. Fang's cluster number 75, msf-CluFA-1 cluster number 72 and msf-CluFA-3 cluster number 72.

Gene names in Gene names in Gene names in Gene names in Eisen's cluster 6 Fang's cluster 75 msf-CluFA-1 cluster msf-CLuFA-3 (HD: 2.91e-23) (HD: 4.50e-22) 72 (GO:0006091) cluster 72

(HD: 1.51e-48) (G0:0006091) (HD: 1.51e-48)

YBL099W YDR298C YJRl21 W

YPLO78C YDR377W YLR295C YBR039W YDL004W YKL016C

YGROOBC YDLO66W YDLO78C YDL181W YDR074W YDR298C YEMll W YELOll W YER065C YER177W YFROl5C YGL187C YGL253W YGR240C YGR244C YHR051W YJL103C YJRO48W YKL148C YKL150W YKR058W YLR174W YLR258W YLR377C YLR395C YML054C YMLlOOW YMLl2OC YMR205C YMR261C YMR303C YNL009W YNL037C YNLO52W YNL117W YOLO86C YOL126C YOL136C YOR136W YOR142W YOR178C YOR344C YPLO31C YPL075W YPL262W YPRO2OW YPRO74C YPRl9lW YAL06OW YBR126C YCROO5C

YBLQ99W YDR298C YJRl2lW

YPL078C YDR377W YLR295C YBR039W YDLO04W YKL016C

YGROOBC YDM66W YDL078C YDLlBlW YDR074W YDR298C YELOll W YELOll W YER065C YER177W YFROl5C YGL187C YGL253W YGR240C YGR244C YHRO5lW YJL103C YJR048W YKL148C YKL150W YKR058W YLR174W YLR258W YLR377C YLR395C YML054C YMLlOOW YMLl2OC YMR205C YMR261C YMR303C YNL009W YNLO37C YNL052W YNLll7W YOL086C YOL126C YOL136C YOR136W YOR142W YOR178C YOR344C YPLO31C YPL075W YPL262W YPRO2OW YPR074C YPRl9lW YAMSOW YBR126C YCROO5C YER177W YEL024W

obtained for our msf-CluFA-3 clusters. A comparison of z-score and

Page 7: eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2

1126 S. Kasim et al. / Compufers in Biology ' and Medicine 43 (2013) 1120-1133

Table 2 Comparison of Eisen's cluster number 2. Fang's cluster number 335, msf-CluFA-1 cluster number 45 and msf-CluFA-3 cluster number 45.

Gene names in Gene names in Gene names in Gene names in Eisen's cluster 2 Fang's cluster 335 msf-CluFA-1 msf-CluFA-3 (HD: 1.22e-42) (HD: 2.29e-32) cluster 45 cluster 45

(G0:0030163) (G0:0030163) (HD: 1.728-89) (HD: 1.72E-89)

YGR048W YDR427W YKL145W

YFR050C YDL097C YOR259C YPRlO8W YERO2l W YGR253C YGMllC YMR314W YGR135W YEROl2W YPR103W YJMOlW YOR362C YOR157C YOLO38W YBL041 W YHRZOOW YDR394W YORll7W

YDL147W YOR261C YDLOO7W YDM2OC YER094C YGLO17W YILO75C YML092C YMR022W

processing time for several fuzzy algorithms is presented in Table 4. The msf-CluFA-0 z-score results performed better when compared to the GOFuzzy [39], followed by the msf-CluFA-1, msf- CluFA-3. FuzzyK [ls]. FuzzySOM 1331, biclustering, k-means and hierarchical. Meanwhile, the processing time of COFuzzy is the shortest compared to msf-CluFA-0. FuzzySOM, msf-CluFA-1. FuzzylC, msf-CluFA-3, k-means, hierarchical and biclustering. By

looking the overall time processing, our msf-CluFA still in an acceptable form. We configured GOFuzzy by setting the number of clusters to 76 with a threshold value of 0.05 while the member- ship cutoff is 0.08 in FuzzyIC. On the other hand, for FuzzySOM, we used WEIW [45] by setting the number of clusters to 76 with a fuzzy parameter of 1.2. and the maximum iterations is set at 500. In addition. we configured k-means and hierarchical algorithms by using the default setting of software named CLUSTER [13]. Mean- while, the implementations used in our experiments for the biclustering algorithm were obtained from Biclustering Analysis Toolbox (BicAT: 15)). The default setting for the maximum number of iterations for It-means and hierarchical are 100,100,000, and 20 respectively. We also used the default setting 0.06 for the member- ship cutoff value for biclustering, which is the same value reported in Barltow et al. [5] which also been used by Tari et al. [39] and gave the optimal biclustering results.

Although the z-score value from the msf-CluFA-0 result is significantly better than the msf-CluFA-1 and msf-CluFA-3, the purpose of the algorithm in msf-CluFA-0 and both msf-CluFA-1 and rnsf-CluFA-3 was different; thus it changes the value of the z-score due to the decrement in the number of genes in the cluster. As presented in Tables 5 and 6, the confidence level improvement for low membership values was slightly better for each cluster. It has been proven that with the employment of an apriori algo- rithm, the confidence level of genes with low membership values may be improved and, consequently, genes could be put back into their dominant clusters.

3.3. Gene function prediction

In this experiment, the purpose of using the SGD functional annotation database version 2005 was to determine the capability of our method to predict unknown genes. Surprisingly, once we obtain the dominant cluster for the genes, our msf-CluFA-3 successfully predicts the unknown genes. We then compare the predicted results from msf-CluFA-3 with the current SGD func- tional annotations for the unknown genes. The results for the gene function prediction for the unknown gene functions in their dominant cluster are shown in Table 7, for the Eisen dataset. From this table, there are 22 genes that were originally unknown; however, we were able to predict these genes and assigned them to their dominant clusters. As observed, there are 13 genes that were not annotated either in their molecular function (1 gene) or cellular component (12 genes), but were assigned as No Biological Data Available (ND). Interestingly, our msf-CluFA-3 managed to assign the function of these genes to our function with the dominant cluster (59.09%). thus enabling us to predict the function of the unltnown genes. The YNR034W gene in column number eleven had an 'ND' annotation in its molecular function; however, we were able to predict it as a G0:0016787 (hydrolase activity) annotation function. Hydrolase activity is defined as the catalysis of the hydrolysis of various bonds in which hydrolase is the systema- tic name for any enzyme of the Enzyme Commission (EC) class 3. The gene YNR034W has a standard name Sollp, which appears to function in transferring Ribonucleic acid (tRNA) nuclear export (361. The details and graphical view of proteins that share common domains/motifs with Sollp are shown in Fig. 6. The prediction of C0:0016787 (hydrolase activity) for YNR034W was also supported when we conducted further investigation with other genes in the same cluster, thus sharing the same annotation in the current SGD annotation database. For example, in molecular function ontology. gene YMR054W has G0:0046961 (proton-transporting ATPase activity. rotational mechanism), gene YOROllW has G0:0042626 (ATPase activity, coupled to hnnsmembmne movement ofsubstances) and gene YDL126C has G0:0016887 (ATPase activity). These annotations belong to the general GO Slim term which is GO:0016787 (hydrolase activity).

Page 8: eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2

S. Kasirn et al. / Computers in Biology and Medicine 43 (2013) 1120-1133 1127

Eisen's cluster number 6 Eisen'scluster number 2

Fang's cluster number 75 Fang's cluster number 335

msf-CluFA- 1 'scluster number 72 msf-CluFA-1 'scluster number 45

msf-CluFA-3'scluster number 72 msf-CluFA-3'scluster number 45

Fig. 4. Cluster profile plots: Eisen's cluster number 6, Fang's cluster number 75, Eisen's cluster number 2. Fang's cluster number 335, msf-CluFA-1's cluster number 72, msf- CluFA-1's cluster number 45, msf-CluFA-3's cluster number 72, and msf-CluFAJ's cluster number 45.

which thus proved our prediction. Furthermore, the 12 genes YBR213W. YFR055W. YGL184C YILIGNV, YKL218C YOR393W, WL281C. YBL067C. YGL259W. YI<R098C, YOR391C and YPL280W were assigned with an 'ND' in cellular component ontology. However. our msf-CluFA was able to predict their cellular component function in which the genes YBR213W, YFR055W, YGL184C, ML167W. YI<L218C YOR393W and YPL281C were predicted into GO:0016829 (lyase activity) while the genes YBL067C, YGL259W, YICR098C, YOR391C and WI280W were predicted into GO:0008233 (peptidase activity). The remaining 9 genes were predicted to be similar to the current (December 2009) SGD annotations available at: www.yeast genome.org. The same observation was achieved for the Gasch's dataset in which we were able to predict 156 genes that were originally unltnown and successfully assigned to their dominant cluster. From these genes. 110 genes were successfully predicted with our function (70.51%) in which 72 genes were in molecular functions,

23 genes in cellular components and 15 genes in biological processes. However, due to space limitations, we only show 20 genes as provided in Table 8. For this table, we randomly picked 12 genes from molecular functions and 8 genes from cellular components in which they originally have had 'ND' annotations and we successfully predicted their functions.

4. Conclusion

In this study, we have observed the yeast gene expression benchmark datasets in obtaining their dominant clusters and upgrading their confidence levels. Although there were genes that have the ability to perform multiple functions, it was better to identify which function was more dominant. We have also paid more attention to genes with low membership values in order to

Page 9: eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2

1128 5. Kasim et of. /Computers in Biology and Medicine 43 (2013) 1120-1133

Cluster 0 to 37

0 1 2 3 4 5 6 7 8 9 1 0 11 1 2 1 3 1 4 15 1 6 1 7 1 8 1 9 20 2 1 2 2 23 2 4 25 26 27 28 2 9 3 0 3 1 3 2 3 3 34 3 5 3 6 3 7

Cluster No.

Cluster 38 to 75

3 7 38 3 9 4 0 4 1 42 4 3 44 45 4 6 47 4 8 4 9 5 0 5 1 52 5 3 5 4 5 5 5 6 57 5 8 5 9 6 0 6 1 6 2 6 3 6 4 65 6 6 6 7 6 8 6 9 7 0 7 1 7 2 73 7 4 7 5

Cluster No.

Fig. 5. The comparative number of genes in each cluster for msf-CluFA-0, rnsf-CluFA-1 and msf-CluFA-3.

Table 3 Comparison with the results of Fang's clusters results. The value in bold denotes the best value of HD.

- -

Fang's Cluster GO Slim. msf-CluFA-1 msf-CluFA-3 Fang's Method (GO ID. Definition) Definition

Cluster No. of Cluster HD No. of Cluster HD Cluster No. of Cluster HD no. gene frequency gene frequency no. gene frequency

(%I (%I (%I

G0:0006298, mismatch repair G0:0006752, group transfer coenzyme metabolism G0:0042773. ATP synthesis coupled electron transport G0:0006461, protein complex assembly

G0:0006810, electron transport

-

G0:0006950. 26 40 95.00 1.298-38 83 97.50 5A4E-84 295 14 100.00 2.498-14 response to stress G0:0006950. 26 40 95.00 1.298-38 83 97.50 5.44E-84 133 12 75.00 1.798-21 response to stress G0:0045333. 5 17 100.00 4.068-31 17 100.00 4.068-31 74 11 100.00 9.148-27 cellular respiration G0:0019725. 52 20 100.00 9.808-34 20 100.00 9.808-34 79 10 100.00 2.00E-13 cellular homoeostasis GO:0006810. 2 60 91.50 1.678-38 120 94.10 2.01E-86 19 14 85.70 2.048-27 electron transport

Table 4 Comparison ofz-score values with different clustering algorithms in the Eisen and Gasch datasets.

Clustering Algorithm z-score: Eisen dataset z-score: Gasch Dataset

upgrade their confidence levels. The process of achieving domi- nant clusters begins by filtering the membership values of all genes and calculating the genes' specificity. Meanwhile, the apnori

algorithm was used in order to increase the confidence level for genes with low membership values. In our experiment evaluation, through the process in msf-CluFA, we were able to determine the dominant cluster with lower HD and z-score values. Furthermore, we were able to increase the confidence level for genes with low membership values while maintaining the best values of HD and z- score. Our msf-CluFA has also shown promising results in gene function prediction for unknown gene functions. Therefore, it is not an exaggeration to state that the msf-CluFA has the potential to help biologists identify and characterise potentially informative genes of interest for further investigations. In future, this frame- work may be well extended to include other biological knowledge, for example, biological pathways. This framework can also be applied to living gene disease databases in order to detect the dominant genes that have the capacity to cure disease and thus prevent the outbreak of an epidemic.

Page 10: eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2

S. Icasim et 01. / Computers in Biology and Medicine 43 (2013) 1120-1133

Table 5 Confidence level improvement for low membership values in Eisen's dataset.

Cluster Initial msf-CluFA-0 No. of low No. of upgraded Cluster Initial msf-CluFA-0 No. of low No. of upgraded no. no. of membership low membership no. no. of membership low membership

genes values values genes values values

Table 6 Confidence level improvement for low membership values in Gasch's dataset.

--

Cluster Initial msf-CluFA-0 No. of low No. of upgraded Cluster Initial msf-CluFA-0 No. of low No. of upgraded no. no. of membership low membership no. no. of membership low membership

genes values values genes values values

0 97 97 0 0 38 83 83 0 0 1 51 50 1 0 39 811 803 8 7 2 858 8 53 5 4 40 243 241 2 0 3 190 188 2 0 41 65 64 1 1 4 129 124 5 0 42 705 694 11 1 5 78 77 1 1 43 2520 2503 17 13 6 434 425 9 0 44 239 212 27 25 7 200 200 0 0 45 195 192 3 1 8 135 131 4 0 46 568 565 3 2 9 112 105 7 6 47 219 191 28 25 10 19 19 0 0 48 151 149 2 1 11 62 56 6 6 49 1784 1777 7 5 12 450 399 51 46 50 207 207 0 0 13 478 312 166 95 51 289 286 3 3 14 177 176 1 1 52 113 111 2 0 15 423 421 2 0 53 335 328 7 6 16 54 54 0 0 54 153 148 5 5 17 317 316 1 0 55 249 247 2 2 18 29 28 1 0 56 248 246 2 0 19 77 77 0 0 57 45 44 1 1 20 147 142 5 5 58 97 94 3 3 21 82 76 6 6 59 120 119 1 1 22 297 288 9 5 60 93 93 0 0 23 56 53 3 3 61 229 228 1 0 24 449 444 5 4 62 326 191 137 127

Page 11: eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2

1130 S. Kasim et al. / Computers in Biology and Medicine 43 (2013) 1120-1133

Table 6 (continued)

Cluster Initial msf-CluFA-0 No. of low No. of upgraded Cluster Initial msf-CluFA-0 No. of low No. of upgraded no. no. of membership low membership no. no. of membership low membership

genes values values genes values values

Table 7 Gene function prediction for the unknown gene functions from Eisen's dataset. The 'MF indicates a molecular function, 'BP' indicates a biological process, and 'CC' indicates a cellular component.

No. Gene name

Other name

Cluster no.

GO ID Definition msf-CluFA-3 function

Current SGD Annotation

MF Evidence BP Evidence CC Evidence code code code

G0:0015205 ISS GO:0055085 ISS GO:0016021 ISS GO:0006810 Electron transport

G0:0005515 Protein binding G0:0005515 IDA CO:0030001 IMP GO:0006623 IMP GO:0006511 IMP GO:0008150 ND

G0:0005783 IDA G0:0000329 IDA

PNSl

MET8

G0:0005215 Transporter activity

G0:0016829 Lyase activity

G0:0015220 IMP G0:0003674 ND GO:0004325 IDA G0:0043115 IDA

G0:0005887 ISS G0:0001950 IDA G0:0005575 ND G0:0019354 IMP

GO:0000103 IMP G0:0042493 IMP CO:0006878 IEP GO:0006312 IMP GO:0006790 ISS G0:0009086 IMP GO:0019346 IMP GO:0009069 ISS GO:0042219 IDA. IMP

G0:0016829 Lyase activity G0:0004121 ISS

G0:0016829 Lyase activity GO:0004121 IMP, ISS

SDLl SRY1

G0:0016829 Lyase activity GO:0016829 Lyase activity

GO:0003941 1SS GO:0030378 IDA, IMP G0:0030848 IDA. IMP G0:0004634 ISS G0:0004634 ISS G0:0017057 IDA GO:0003674 ND

G0:0016829 Lyase activity G0:0016829 Lyase activity G0:0016787 Hydrolase activity

GO:0008150 ND GO:0008150 ND GO:0005409 IGI. IMP

G0:0005575 ND G0:0005575 ND G0:0005634 IDA G0:0005737 IDA G0:0005634 IDA G0:0005743 IDA SOMl G0:0016787 Hydrolase activity G0:0004175 IPI G0:0033108 IMP

G0:0006508 IGI. IMP. G0:0042720 IPI 1PI. ISS

G0:0003674 ND GO:0042138 IGI G0:0000794 IDA G0:0005634 Nucleus ME14

ME14

NEMl

G0:0005694 Chromosome GO:0003674 ND G0:0042138 IGI G0:0000794 IDA

G0:0004721 Phosphoprotein phosphatase activity

G0:0004721 IDA GO:0030437 IMP GO:0016021 IDA. ISM

G0:0006998 IMP G0:0042175 IDA G0:0046890 IG1 G0:0005739 IDA

G0:0004721 IDA G0:0030437 IMP GO:0016021 IDA. ISM G0:0004721 Phosphoprotein phosphatase activity

GO:0007126 IMP GO:0006998 IMP GO:0046890 IGI GO:0008150 ND G0:0030476 IMP

G0:0042175 IDA

G0:0008233 Peptidase activity G0:0008233 Peptidase activity

G0:0004843 TAS G0:0008236 ISS

G0:0005575 ND G0:0005619 IDA G0:0005635 IDA G0:0005575 ND G0:0005575 ND G0:0005575 ND

YPS5 UBPll HSP33

G0:0008233 Peptidase activity G0:0008233 Peptidase activity GO:0008233 Peptidase activity

G0:0004190 ISS G0:0004843 IDA GO:0008234 ISS GO:0051082 ISS G0:0008234 ISS GO:0051082 ISS

G0:0008233 Peptidase activity

Page 12: eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2

S. Kasim et al. / Corr~puters in Biology and Medicine 43 (2013) 1120-1133

Protein information Expression ~TiaMIoJIU < ' ' : ' " ' ' . . . , . : . . . . . , : +

lmha &a

Interactions (BioPIXIE) 47 total interaction(s) for 42 unique geneslfeatures.

Physical Interactions: Genetic Interactions:

Af£inity Capture-MS: 10 Dosage Rescue. 3 . Biochemical Activity: 2 Negative Genetic: 16 PCA: I Positive Genetic: 11 Two-hybrid: 3 Synthetic Growth Defect: 1

Fig. 6. The details and graphical view of proteins that share domains/motifs in common with YNR034W/Sollp. Source: http://www.yeastgenome.org/cgi-bin/locus.fpl? locus=YNR034W.

Table 8 Gene function prediction for the unknown gene functions from Gasch's dataset. The 'MF' indicates a molecular function, 'BP' indicates a biological process, and 'CC indicates a cellular component.

No. Gene Other Cluster GO ID Definition msf-CluFA-3 Current SGD annotation name name no. Function

MF Evidence BP Evidence CC Evidence code code code

1 YIW37C PRM2

2 YlLll7C PRM5

3 YJL108C PRMlO

4 YJL170C ASG7

5 YJR084W CSNl2

6 YLRllOC CCWl2

G0:0016740 Transferase activity

GO:0016740 Transferase activity

G0:0016740 Transferase activity

GO:0016740 Transferase activity

G0:0016740 Transferase activity

G0:0016740 Transferase activity

G0:0016740 Transferase activity

G0:0016740 Transferase activity

GO:0016740 Transferase activity

G0:0016740 Transferase activity

GO:0016740 Transferase activity

GO:0016829 Lyase activity

G0:0016829 Lyase activity

G0:0004325 IDA G0:0043115 IDA

G0:0004451 IMP, ISS

G0:0000742 IMP

GO:0008150 ND

GO:0008150 ND

G0:0000747 IEP

G0:0000754 IMP

G0:0000398 IMP G0:0000752 IMP

G0:0000747 IMP GO:0031505 IMP GO:0000321 IG1

GO:0008150 ND

GO:0000754 IMP

G0:0000338 IMP GO:0000321 IGI, IMP

G0:0000755 IGI. IMP

G0:0000742 IMP G0:0031385 IMP

G0:0019354 IMP GO:0000103 IMP G0:0042493 IMP G0:0006097 TAS

GO:0016021 ISS

G0:0016021 1SS

GO:0016021 ISS

G0:0005886 IDA

GO:0008180 IDA

G0:0009277 IDA

G0:0005783 IDA

G0:0016021 ISS

GO:0008180 IDA. TAS

G0:0005783 IDA

G0:0005737 IDA

G0:0043332 IDA G0:0005634 IDA G0:0005739 IDA G0:0005575 ND

Page 13: eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2

1132 S. Kasim et al. / Computers in Biology and Medicine 43 (2013) 1120-1133

Table 8 (continued)

No. Gene Other Cluster GO ID Definition msf-CluFA-3 Current SCD annotation name name no. Function

MF Evidence BP Evidence CC Evidence code code code

14 YFR055W IRC7 38 GO:0016829 Lyase activity CC GO:0004121 ISS GO:0006878 IEP G0:0005575 ND GO:0006312 IMP G0:0006790 ISS

15 YGL184C STR3 38 G0:0016829 Lyase activity CC GO:0004121 IMP, ISS GO:0009086 IMP G0:0005575 ND CO:0019346 IMP

16 YILl67W SDLI 38 GO:0016829 Lyase activity CC G0:0003941 ISS GO:0009069 ISS G0:0005575 ND 17 YKL218C SRYl 38 GO:0016829 Lyase activity CC G0:0030378 IDA. IMP GO:0042219 IDA. IMP G0:0005575 ND

G0:0030848 IDA. IMP 18 YOR393W ERR1 38 GO:0016829 Lyase activity CC G0:0004634 1SS GO:0008150 ND G0:0005575 ND 19 YPL281C ERR2 38 G0:0016829 Lyase activity CC G0:0004634 ISS GO:0008150 ND G0:0005575 ND 20 YLR287C YLR287C 62 G0:0005198 Structural MF G0:0003674 ND GO:0008150 ND G0:0005737 IDA

molecule activity

Conflict of interest statement

None declared.

Acknowledgements

This project is funded by the Malaysian Ministry of Science. Technology and Innovation (MOSTI) under ScienceFund Grant no. 02-01-06-SF0068. Thanks are also extended to Professor Richard L. Spear, who, though not a geneticist was kind enough to read through this paper as a long-time native speaker of English. The comments and suggestions from the anonymous reviewers greatly improved the paper.

References

[I] M. Ashburner. C.A. Ba1l.J.A. Blake, D. Botstein. H. Butler. J.M. Cherry. A.P. Davis. K. Dolinski. S.S. Dwight, J.T. Eppig, M.A. Harris. D.P. Hill. L Issel-Tarver. A. Kasarskis. S. Lewis. J.C. Matese. J.E. Richardson. M. Ringwald. G.M. Rubin. G. Sherlock. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium, Nat. Genet. 25 (2000) 25-29.

[2] R. Agrawal, T. Imielinski. A. Swami. Mining association rules between sets of items in large databases, in: Proceedings of the ACM SIGMOD International Conference on Management of Data. May 2628.1993. ACM, Washington, DC. pp. 207-216.

[3] R. Agrawal. R. Sriltant. Fast Algorithms for mining association rules in large databases, in: Proceedings of the 20th International Conference on Very Large Databases. September 12-15.1994, ACM. Santiago. Chile, pp. 487499.

141 S. Bandyopadhyay, A. Multhopadhyay, U. Maulik, An improved algorithm for clustering gene expression data. Bioinformatics 23 (2007) 2859-2865.

[5] S. Barkow, S. Bleuler, A. Prelic, P. Zimmermann. E. Zitzler, Bicat: a biclustering analysis toolbox, Bioinformatics 22 (10) (2006) 1282-1283.

[6] E.I. Boyle. S. Weng. J. Gollub. H. Jin. D. Botstein. J.M. Cherry. G. Sherlock, GO: Termfinder--open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformatics 20 (18) (2004) 3710-3715.

[7] M. Brameier. C. Wiuf, Co-clustering and visualization of gene expression data and gene ontology terms for Saccharomyces cerevisiae using self-organizing maps. J. Biomed. Inform. 40 (2007) 160-173.

[8] P Carmona-Saez. M. Chagoyen, F. Tirado, J.M. Carazo. A. Pascual-Montano, GENECODIS: a web-based tool for finding significant concurrent annotations in gene lists. Genome Biol. 8 (2007) R3.

[9] Z.S.H. Chan, L. Collins. N. Kasabov. An efficient greedy k-means algorithm for global gene trajectory clustering, Expert Syst. Appl. 30 (2006) 137-141.

[lo] J.M. Cherry. C. Adler. C. Ball. S.A. Chervitz. S.S. Dwight. E.T. Hester. Y. Jia, G. Juvik, T. Roe, M. Schroeder, S. Weng. D. Botstein. SGD: Soccharomyces genome database. Nucleic Acids Res. 26 (1998) 73-79.

[ I l l Y. Ding. H. Han. F. Liu, Intelligent integrated data processing model for oceanic warning system. Knowl.-Based Syst 23 (2010) 61-69.

[12] R.O. Duda. P.E. Hart. D.G. Stork. Pattern Classification. 2nd ed.. John Wiley and Sons. New York. USA, 2001.

[I31 M.B. Eisen. P.T. Spellman. P.O. Brown. D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95 (1998) 14863-14868.

1141 M.G. Elferink, P. Olinga, A.L Draaisma. M.T. Merema. S. Bauerschmidt. J. Polman. W.G. Schoonen, G.M. Groothuis. Microarray analysis in rat liver

slices correctly predicts in vivo hepatotoxicity. Toxicol. Appl. Pharmacol. 229 (2008) 300-309.

1151 Z. Fang. J. Yang. Y. Li, Q. Luo. L. Liu. Knowledge guided analysis of microarray data, J. Biomed. Inform. 39 (2006) 401411.

[I61 E.A. Fernandez. M. Balzarini, Improving cluster visualization in self-organizing maps: application in gene expression data analysis. Comput. Biol. Med. 37 (2007) 1677-1689.

[I71 D. Frank. C. Kuhn. B. Br0rs.C. Hanselmann, M. Liidde. H.A. Katus. N. Frey. Gene expression pattern in biomechanically stretched cardiomyocytes: evidence for a stretch-specific gene program, Hypertension 51 (2008) 309-318.

[18] A.P. Gasch. M.B. Eisen. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biol. 3 (2002) RESEARCH0059.

1191 A.P. Casch. P.T. Spellman. C.M. Kao. 0. Carmel-Harel. M.B. Eisen, G. Storz. D. Botstein. P.O. Brown. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell 11 (2000) 42414257.

[20] F.D. Gibbons. F.P. Roth. Judging the quality of gene expression-based clustering methods using gene annotation, Genome Res. 12 (2002) 1574-1581.

[21] A. Gruca, M. Kozielski. M. Sibora. Fuzzy clustering and gene ontology based decision rules for identification and description of gene groups, in: K.A. Cyran, e t al.. (Eds.). Man-Machine Interactions. AlSC 59. Springer-Verlag. Berlin. Heidelberg. 2009, pp. 141-149.

[22] J. Guan. Y. Gan. H. Wang. Discovering pattern-based subspace clusters by pattern tree. Knowl.-Based Syst. 22 (2009) 569-579.

[23] B. Herszberg. X. Mata. E. Giulotto. P. Decaunes. F.M. Piras. B.P. Chowdhary. S. Chaffaux. G. Guerin, Characterization of the equine glycogen debranching enzyme gene (AGL): genomic and cDNA structure, localization, polymorphism and expression, Gene 404 (2007) 1-9.

[24] S. Kasim, S. Deris. R. Othman, A new computational framework for gene expression clustering. Adv. Data Min. Appl. (2010) 603-610.

[25] E. Lerat. M. Semon, Influence of the transposable element neighborhood on human gene expression in normal and tumor tissues. Gene 396 (2007) 303-311.

[26] J. Liu. M. Xu. Kernelized fuzzy attribute c-means clustering algorithm. Fuzzy Sets Syst. 159 (2008) 2428-2445.

[27] E.K. Lobenhofer. J.T. Auman. P.E. Blackshear. G.A. Boorman. P.R. Bushel M.L. Cunningham, J.M. Fostel. K. Gerrish. A.N. Heinloth. R.D. Irwin. D. E. Malarkey, B.A. Merrick. S.O. Sieber. C.J. Tucker. S.M. Ward, R.E. Wilson, P. Hurban. R.W. Tennant. R.S. Paules. Gene expression response in target organ and whole blood varies as a function of target organ injury phenotype. Genome Biol. 9 (2008) R100.

[28] P. Luultlta. Similarity classifier using similarities based on modified probabil- istic equivalence relations. Knowl.-Based Syst. 22 (2009) 57-62.

[29] U. Maulik. A. Mukhopadhyay. Simulated annealing based automatic fuzzy clustering combined with ANN classification for analyzing microarray data. Comput. Oper. Res. 37 (2010) 1369-1380.

[30] F. Mollinedo. R. Lhpez-Perez, C. Gajate, Differential gene expression patterns coupled to commitment and acquisition of phenotypic hallmarks during neutrophil differentiation of human leukaemia HL-60 cells. Gene 419 (2008) 16-26.

[31] Y. Okada, T. Sahara. H. Mitsubayashi. S. Ohgiya. T. Nagashima. Knowledge- assisted recognition of cluster boundaries in gene expression data. Artif. Intell. Med. 35 (1-2) (2005) 171-183.

[32] H.S. Park. S.B. Cho, Evolutionary fuzzy cluster analysis with Bayesian validation of gene expression profiles. J. Intell. Fuzzy Syst.: Appl. Eng. Technol. 18 (2007) 543-559.

[33] R.D. Pascual-Marqui. A.D. Pascual-Montano. K. Kochi, J.M. Carazo. Smoothly distributed fuzzy c-means: a new self-organizing map. Pattern Recognition 34 (2001 ) 2395-2402.

[34] M. Popescu.J.M. Ke1ler.J.A. Mitchell, Fuzzy measures on the gene ontology for gene product similarity, IEEEIACM Trans. Comput. Biol. Bioinform. 3 (2006) 263-274.

Page 14: eprints.uthm.edu.myeprints.uthm.edu.my/id/eprint/4108/1/shahreen_kasim.pdf · CluFA-I), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2

S. Kasim et al. / Computers in Biology and Medicine 43 (2013) 1120-1133 1133

135) J. Shan. L Yuan. D.R. Budman, H.P. Xu, WTH3, a new member of the Rab6 gene family. and multidrug resistance. Biochim. Biophys. Acta 1589 (2002) 112-123.

[36] D.R. Stanford. M.L. Whitney. R.L. Hurto. D.M. Eisaman, W.C. Shen. A.K. Hopper. Division of labor among the yeast Sol proteins implicated in tRNA nuclear export and carbohydrate metabolism, Genetics 168 (2004) 117-127.

[371 Y. Takahara. T. Kobayashi, K. Takemoto. T. Adachi, K. Osaki. K. Kawahara. G. Tsujimoto. Pharmacogenomics of cardiovascular pharmacology: develop- ment of an informatics system for analysis of DNA microarray data with a focus on lipid metabolism, J. Pharmacol. Sci. 107 (2008) 1-7.

1381 M.P. Tan, E.N. Smith, J.R. Broach. C.A. Floudas, Microarray data mining: a novel optimization-based approach to uncover biologically coherent structures. BMC Bioinform. 9 (2008) 268.

1391 L Tari. C. Baral. S. Kim, Fuzzy c-means clustering with prior biological knowledge. J. Biomed. Inform. 42 (2009) 74-81.

[40] A.L. Traud. E.D. Kelsic. P.J. Mucha. M.A. Porter. Comparing community structure to characteristics in online collegiate social networks. SlAM Rev. 53 (3) (2011) 526-543.

[41] P. Valarmathie. M.V. Srinath. T. Ravichandran. K. Dinakaran. Hybrid fuzzy c-means clustering technique for gene expression data. Int. J. Res. Rev. Appl. Sci. 1 (2009) 33-37.

1421 N.X. Vinh. J. Epps. J. Bailey, Information theoretic measures for clusterings comparison: is a correction for chance necessary7 in: Proceedings of the 26th International Conference on Machine Learning. June 14-18. 2009. ACM Press. Quebec. Canada. pp. 1073-1080.

[43] B.H. Voy, J.A. Scharff, A.D. Perkins. A.M. Saxton, B. Borate. E.J. Chesler L.K. Branstetter. M.A. Langston. Extracting gene networks for low-dose radiation using graph theoretical algorithms. PLoS Comput. Biol. 2 (2006) e89.

1441 W. Wang. Y. Zhang. On fuzzy cluster validity indices. Fuzzy Sets Syst. 158 (19) (2007) 2095-2117.

[45] I.H. Witten, E. Frank, Data mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers, San Francisco. USA. 2005.

[46] X.L. Xie, G. Beni. A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 841-847.

[47] D. Yang. Y. Li. H. Xiao. Q. Liu. M. Zhang. J. Zhu. W. Ma. C. Yao, J. Wang. D. Wang. 2. Guo. B. Yang. Gaining confidence in biological interpretation of the microarray data: the functional consistence of the significant GO categories. Bioinformatics 24 (2008) 265-271.

[48] Q, Zhang, Y. Zhang. Hierarchical clustering of gene expression profiles with graphics hardware acceleration, Pattern Recognition Lett. 27 (2006) 676-681.