[IEEE 2014 International Conference on Information Science and Applications (ICISA) - Seoul, South...

Gene Function Prediction using Improved Fuzzy c-means Algorithm

Shahreen Kasim1, Mohd Farhan Md. Fudzee1, Safaai Deris2, and Razib M. Othman3

1 Software and Multimedia Center, Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, 86400 Batu Pahat, Malaysia.

2 Laboratory of Computational Intelligence and Biotechnology, Universiti Teknologi Malaysia, 81310 UTM Skudai, Malaysia.

3 Artificial Intelligence and Bioinformatics Research Group, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 UTM Skudai, Malaysia.

Abstract—Gene expression analysis continues to yield many new discoveries, and fuzzy clustering algorithms are widely used to analyze gene expression data. However, common clustering algorithms do not provide a comprehensive approach that covers all three categories of annotation (biological process, molecular function, and cellular component), and they have not been tested with different functional annotation database formats. Furthermore, they do not identify the dominant gene among the clusters. In this paper, we present a new computational framework for clustering gene expression data. Our experiment indicates that the framework is capable of determining the dominant gene and of predicting unknown genes.

Keywords—gene expression; clustering

I. INTRODUCTION

First and foremost, we need to understand how genes are analyzed. Gene structure can be examined starting from any kind of living cell. The human body contains around 75 trillion cells, most of which have a nucleus bounded by a nuclear membrane and surrounded by cytoplasm. Within each nucleus there is a set of 46 chromosomes made up of chromatin, which consists of an extremely long strand of DNA interwoven around structural proteins known as histones. Most human genes are divided into exons and introns, and only the exons carry the information required for protein synthesis. This process is called expression.

Gene expression datasets have been used in many areas of bioinformatics, such as classification, feature selection, and clustering, in order to predict gene functions. Classification applies when the categories of interest are known, while clustering applies when they are unknown. Feature selection, in turn, is the process of choosing the most appropriate features when building a model. The use of clustering methods for the discovery of cancer subtypes has attracted a great deal of attention in the scientific community. Therefore, when we need to find groups of closely related genes but have no prior information or phenotypic data to guide the classification, clustering is the most suitable approach.

Despite the enormous potential of these works, challenging problems remain in the acquisition and analysis of gene expression data, and these problems can profoundly influence the interpretation of the results. One of them is the degraded performance of clustering in situations where a gene can belong to multiple functions. When a gene has several degrees of belongingness to multiple functions, it becomes unclear which dominant function the gene should be assigned to.

Therefore, we propose a new computational framework for clustering gene expression data that improves the fuzzy c-means algorithm.

II. RELATED WORKS

This research concentrates on fuzzy c-means clustering; in this section, we therefore discuss fuzzy clustering and its applications in gene function prediction in depth. The clustering process divides data into groups (clusters) such that similar data objects belong to the same cluster and dissimilar data objects belong to different clusters. The main difference between traditional hard clustering and fuzzy clustering is that, in hard clustering, an entity belongs to only one cluster. Research has shown that most hard clustering algorithms are unable to identify genes whose expression is similar to multiple, distinct gene groups. In fuzzy clustering, however, entities are allowed to belong to multiple clusters at the same time with different degrees of membership. In addition, fuzzy clustering is appropriate for analyzing genes, since a single gene in a set of gene expression profiles may have multiple functions in many cases [15], and it is able to handle noisy data [16]. However, cluster analysis of this kind is purely syntactical, in the sense that it does not take advantage of existing knowledge during the learning process.

Based on these studies, the gene-based fuzzy c-means algorithm is more suitable for gene function prediction than other gene expression clustering algorithms [32]. Fuzzy c-means works by constructing a fuzzy partition of a dataset in which the strength of the association of each element with a particular cluster is described by a continuous membership value between 0 and 1. Most importantly, when clustering data, fuzzy c-means constructs a membership matrix that gives the membership value of each element of the dataset in each cluster of the partition [17]. Furthermore, fuzzy c-means clustering can find the most characteristic data point in each cluster, which can be considered the center of the cluster, and then the degree of membership of each gene in the cluster [16]. The fuzzy c-means clustering algorithm has been widely used in gene function prediction, for example by Valarmathie et al. [18], Berget et al. [19], Li et al. [20], and Dembélé and Kastner [21].

978-1-4799-4441-5/14/$31.00 ©2014 IEEE

TABLE 2: Comparison of z-score values for different clustering algorithms on the Eisen and Gasch datasets.

Clustering Algorithm    z-score: Eisen Dataset    z-score: Gasch Dataset
FuzzyK                  102.33                    108.10
FuzzySOM                68.56                     81.48
GOFuzzy                 323.10                    316.60
Our proposed method     204.00                    201.11

However, the fuzzy c-means algorithm initializes the memberships of genes to clusters randomly, which produces inconsistent clustering results. Furthermore, some input data may have different membership values depending upon their location in the input dataset. In addition, the number of clusters needs to be specified in order to run the algorithm. Therefore, there have been many attempts to combine the fuzzy c-means algorithm with biological knowledge in order to increase the accuracy of gene function prediction.
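The standard fuzzy c-means procedure discussed above can be sketched as follows. This is a minimal illustration of the textbook algorithm, not the improved algorithm proposed in this paper; the function name and the parameters m (fuzzifier), tol, and seed are assumptions made for the example. It makes the two limitations concrete: the initial memberships come from a random number generator, and the number of clusters c must be supplied by the caller.

```python
import random


def fuzzy_c_means(data, c, m=2.0, max_iter=100, tol=1e-6, seed=None):
    """Textbook fuzzy c-means; returns (centroids, membership matrix)."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-12 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centroids = []
    for _ in range(max_iter):
        # Centroid j is the mean of all points weighted by u_ij ** m.
        centroids = []
        for j in range(c):
            weights = [u[i][j] ** m for i in range(n)]
            wsum = sum(weights)
            centroids.append([
                sum(w * x[d] for w, x in zip(weights, data)) / wsum
                for d in range(dim)
            ])
        # Membership update from the distances to every centroid.
        new_u = []
        for x in data:
            dists = [
                sum((a - b) ** 2 for a, b in zip(x, cen)) ** 0.5
                for cen in centroids
            ]
            if any(d == 0.0 for d in dists):
                # The point coincides with a centroid: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                total = sum(hits)
                new_u.append([h / total for h in hits])
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                total = sum(inv)
                new_u.append([v / total for v in inv])
        # Stop when no membership value changes by more than tol.
        shift = max(abs(a - b) for ra, rb in zip(u, new_u) for a, b in zip(ra, rb))
        u = new_u
        if shift < tol:
            break
    return centroids, u


# Toy expression profiles (hypothetical values): two well-separated groups.
profiles = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centroids, memberships = fuzzy_c_means(profiles, c=2, seed=0)
for row in memberships:
    print([round(v, 3) for v in row])
```

Fixing the seed makes a run repeatable. Changing it may swap the cluster order and, on less cleanly separated data, may change the partition itself, which is the inconsistency noted above.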

With its wide range of applications in biological knowledge, the Gene Ontology (GO) is a popular source of prior knowledge for associating each cluster with appropriate functions. GO has been used to improve gene expression clustering algorithms based on k-means [22,23], SOM [24,25], biclustering [26,27], and fuzzy c-means [28,29,30,31,33].

III. RESULTS AND DISCUSSION

In this experiment, we used the 2005 version of the SGD functional annotation database to analyze the capability of our method in predicting unknown genes, and we compared our method's results with the current annotation. Table 1 lists 17 genes. For each gene, some annotations were still marked ND (No Biological Data Available). For example, gene YBR213W had GO:0004325 and GO:0043115 as its molecular function and GO:0019354 and GO:0000103 as its biological process, while its cellular component, GO:0005575, was marked ND. Our method predicted that this gene belongs to its dominant cluster, GO:0016829 (lyase activity), in the cellular component category. We also compared z-scores with other algorithms. The z-score measures the mutual information between a clustering result and the SGD gene annotation data; a higher z-score indicates a clustering result that is further from random. The z-score was computed using ClusterJudge [33]. As presented in Table 2, GOFuzzy [27] performed better than our method, followed by FuzzyK [34] and FuzzySOM [35]. Although GOFuzzy showed better results in this z-score comparison, its genes still belong to multiple clusters. Restricting each gene to a single cluster would reduce the number of genes in the clusters and thus change the z-score. Our method, in contrast, has already identified the dominant cluster of each gene.
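The idea behind the z-score can be illustrated with a short sketch. It is a simplified stand-in and not the ClusterJudge implementation: it assumes one categorical annotation per gene and uses random shuffles of the cluster labels as the reference distribution. The function names, the number of shuffles, and the toy labels are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same genes."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), n_ab in joint.items():
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


def mi_z_score(clusters, annotations, n_random=200, seed=0):
    """Distance of the observed MI from shuffled clusterings, in standard deviations."""
    rng = random.Random(seed)
    observed = mutual_information(clusters, annotations)
    shuffled = list(clusters)
    random_mi = []
    for _ in range(n_random):
        rng.shuffle(shuffled)  # keeps cluster sizes, breaks the gene-to-cluster link
        random_mi.append(mutual_information(shuffled, annotations))
    mean = sum(random_mi) / n_random
    std = math.sqrt(sum((x - mean) ** 2 for x in random_mi) / (n_random - 1))
    if std == 0.0:
        raise ValueError("random clusterings show no variation in mutual information")
    return (observed - mean) / std


# Hypothetical example: a clustering that matches the annotation exactly.
clusters = [0] * 10 + [1] * 10
annotations = ["lyase"] * 10 + ["peptidase"] * 10
print(mi_z_score(clusters, annotations))
```

A clustering that agrees with the annotation yields a large positive z-score, while a clustering unrelated to the annotation yields a value near zero.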

IV. METHOD

The proposed algorithm is shown in Figure 1. In this computational framework, we improved the fuzzy c-means algorithm.

if g_i has GO annotation evidence code then
    Initialize U_ij(0) = rs_ij (1 - α) + α · r
else
    Initialize U_ij(0) = α · r
end-if
for k = 1 to max number of iterations do
    Calculate fuzzy centroids C(k)
    Update fuzzy membership U(k)
    Calculate cluster compactness and separation CS
    if CS < CS* then
        CS* = CS
        C* = C(k)
        U* = U(k)
    end-if
end-for
Calculate cluster consistency CT
Calculate cluster precision, recall, and F-measure
for j = 1 to max number of clusters do
    for i = 1 to max number of genes in a cluster do
        if U* < u_ij then
            U* = u_ij ; CL* = CL_j
            // U* is the highest membership value of a gene among all clusters.
            // CL* is the cluster to which the gene with the highest membership value belongs.
        else-if U* = u_ij then
            Filter by specificity value
        end-if
    end-for
end-for
for i = 1 to max number of genes with low membership value do
    Apply gene with apriori algorithm
    Filter by Hypergeometric Distribution (HD)
end-for
end

Fig. 1 The proposed algorithm
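Two steps of Fig. 1 can be sketched in code: the membership initialization and the selection of the dominant cluster. The figure does not define rs_ij, r, or α, so this sketch assumes that rs_ij is a GO-derived score in [0, 1] for gene i and cluster j, that r is a uniform random number drawn per entry, and that α is a weight in [0, 1]. The function names and the example values are hypothetical.

```python
import random


def init_membership(rs, has_evidence, alpha=0.5, seed=None):
    """Initial memberships U(0) following the two branches of Fig. 1.

    rs[i][j]        assumed GO-derived score of gene i for cluster j, in [0, 1]
    has_evidence[i] True if gene i has a GO annotation evidence code
    alpha           assumed weight of the random component, in [0, 1]
    """
    rng = random.Random(seed)
    u0 = []
    for scores, annotated in zip(rs, has_evidence):
        row = []
        for rs_ij in scores:
            r = rng.random()  # assumption: a fresh uniform draw for every entry
            if annotated:
                row.append(rs_ij * (1.0 - alpha) + alpha * r)
            else:
                row.append(alpha * r)
        u0.append(row)
    return u0


def dominant_clusters(u):
    """For each gene, the highest membership value and the cluster(s) attaining it.

    More than one index is returned on a tie; Fig. 1 resolves ties with a
    specificity filter, which is not implemented in this sketch.
    """
    result = []
    for row in u:
        best = max(row)
        tied = [j for j, value in enumerate(row) if value == best]
        result.append((best, tied))
    return result


# Hypothetical scores for three genes and three clusters.
rs = [[0.9, 0.1, 0.0], [0.2, 0.2, 0.6], [0.0, 0.0, 0.0]]
u0 = init_membership(rs, has_evidence=[True, True, False], alpha=0.3, seed=1)
print(dominant_clusters(u0))
```

With α = 0 the annotated genes start exactly at their GO-derived scores, and with α = 1 the initialization is purely random, so α controls how strongly prior knowledge steers the starting point.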

V. SUMMARY

The demand for gene expression analysis has led to the emergence of various datasets, for example SGD, BGED, and MGD, which has given momentum to a variety of experiments. Currently, computational co-expression analysis is the favored technique for analyzing gene expression datasets. Three approaches to co-expression have been identified: gene-based, sample-based, and subspace. Based on the suitability of the dataset, the gene-based method, particularly fuzzy c-means, has been a popular choice. However, similar results can be achieved by statistical-based or knowledge-based methods. The current research trend in gene function prediction from gene expression datasets is to incorporate biological knowledge. However, this trend has led only to the interpretation and visualization of the cluster results; GO and SGD have not been applied as the guidance mechanism during the clustering itself.

These trends have also led to a tendency to incorporate a knowledge-based approach into the prediction of gene functions. Although there are many techniques for gene function prediction in computational analyses, employing either direct or indirect methods, the fuzzy c-means algorithm combined with GO and SGD appears to be the ideal combination for predicting gene function: it can handle gene expression datasets and thus avoids computational problems. This is because GO data continue to evolve and are now publicly available, well defined, based on a consistent biological terminology, and associated with a large number of gene products that are supported by frequent citation and convincing evidence. In addition, SGD is the primary functional annotation database for yeast genes. Further, looking at current directions, most of which are publicly available through the internet, we find that most researchers are unable to determine the most dominant function while maintaining accuracy. These trends will hopefully lead to new drug discoveries for cancer research, because current studies are able to cluster gene expression according to biological and expressional similarity and thus reflect a coherent annotation of the clusters.

TABLE 1: Gene function prediction for the unknown gene functions.

                                                                                                Current SGD Annotation (GO ID and Evidence Code)
No.  Gene Name  Other Name  Cluster No.  GO ID       Definition            Predicted Function*  MF                     BP                     CC
1    YOR161C    PNS1        15           GO:0005215  transporter activity  MF                   GO:0015220 IMP         GO:0008150 ND          GO:0005887 ISS
                                                                                                GO:0003674 ND                                 GO:0001950 IDA
2    YBR213W    MET8        38           GO:0016829  lyase activity        CC                   GO:0004325 IDA         GO:0019354 IMP         GO:0005575 ND
                                                                                                GO:0043115 IDA         GO:0000103 IMP
                                                                                                                       GO:0042493 IMP
3    YFR055W    IRC7        38           GO:0016829  lyase activity        CC                   GO:0004121 ISS         GO:0006878 IEP         GO:0005575 ND
                                                                                                                       GO:0006312 IMP
                                                                                                                       GO:0006790 ISS
4    YGL184C    STR3        38           GO:0016829  lyase activity        CC                   GO:0004121 IMP, ISS    GO:0009086 IMP         GO:0005575 ND
                                                                                                                       GO:0019346 IMP
5    YIL167W    SDL1        38           GO:0016829  lyase activity        CC                   GO:0003941 ISS         GO:0009069 ISS         GO:0005575 ND
6    YKL218C    SRY1        38           GO:0016829  lyase activity        CC                   GO:0030378 IDA, IMP    GO:0042219 IDA, IMP    GO:0005575 ND
                                                                                                GO:0030848 IDA, IMP
7    YOR393W    ERR1        38           GO:0016829  lyase activity        CC                   GO:0004634 ISS         GO:0008150 ND          GO:0005575 ND
8    YPL281C    ERR2        38           GO:0016829  lyase activity        CC                   GO:0004634 ISS         GO:0008150 ND          GO:0005575 ND
9    YNR034W    SOL1        46           GO:0016787  hydrolase activity    MF                   GO:0017057 IDA         GO:0006409 IGI, IMP    GO:0005634 IDA
                                                                                                GO:0003674 ND                                 GO:0005737 IDA
                                                                                                                                              GO:0005634 IDA
10   YER044C-A  MEI4        49           GO:0005634  nucleus               CC                   GO:0003674 ND          GO:0042138 IGI         GO:0000794 IDA
11   YER044C-A  MEI4        50           GO:0005694  chromosome            CC                   GO:0003674 ND          GO:0042138 IGI         GO:0000794 IDA
12   YBL067C    UBP13       66           GO:0008233  peptidase activity    CC                   GO:0004843 TAS         GO:0008150 ND          GO:0005575 ND
13   YCR045C    RRT12       66           GO:0008233  peptidase activity    CC                   GO:0008236 ISS         GO:0030476 IMP         GO:0005619 IDA
                                                                                                                                              GO:0005635 IDA
14   YGL259W    YPS5        66           GO:0008233  peptidase activity    CC                   GO:0004190 ISS         GO:0008150 ND          GO:0005575 ND
15   YKR098C    UBP11       66           GO:0008233  peptidase activity    CC                   GO:0004843 IDA         GO:0008150 ND          GO:0005575 ND
16   YOR391C    HSP33       66           GO:0008233  peptidase activity    CC                   GO:0008234 ISS         GO:0008150 ND          GO:0005575 ND
                                                                                                GO:0051082 ISS
17   YPL280W    HSP32       66           GO:0008233  peptidase activity    CC                   GO:0008234 ISS         GO:0008150 ND          GO:0005575 ND
                                                                                                GO:0051082 ISS

*'MF' indicates molecular function, 'BP' indicates biological process, and 'CC' indicates cellular component.

ACKNOWLEDGMENT

This project is funded by the Universiti Tun Hussein Onn Malaysia under Research Acculturation Grant Scheme (RAGS) vote no. R0001. The comments and suggestions from the anonymous reviewers greatly improved the paper.

REFERENCES

[1] Cai, J., Xie, D., Fan, Z., Chipperfield, H., Marden, J., Wong, W. H., Zhong, S. (2010). Modeling Co-Expression Across Species for Complex Traits: Insights to the Difference of Human and Mouse Embryonic Stem Cells. PLoS Computational Biology. 6(3): e1000707.

[2] Adler, P., Kolde, R., Kull, M., Tkachenko, A., Peterson, H., Reimand, J., Vilo, J. (2009). Mining for Coexpression Across Hundreds of Datasets Using Novel Rank Aggregation and Visualization Methods. Genome Biology. 10(12): R139.

[3] Choi, Y., Kendziorski, C. (2009). Statistical Methods for Gene Set Co-Expression Analysis. Bioinformatics. 25(21): 2780-2786.

[4] Nayak, R. R., Kearns, M., Spielman, R. S., Cheung, V. G. (2009). Coexpression Network Based on Natural Variation in Human Gene Expression Reveals Gene Interactions and Functions. Genome Research. 19(11): 1953-1962.

[5] Oti, M., Van Reeuwijk, J., Huynen, M. A., Brunner, H. G. (2008). Conserved Co-Expression for Candidate Disease Gene Prioritization. BMC Bioinformatics. 9(1): 208.

[6] Stuart, J. M., Segal, E., Koller, D., Kim, S. K. (2003). A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules. Science. 302(5643): 249-255.

[7] Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., Church, G. M. (1999). Systematic Determination of Genetic Network Architecture. Nature Genetics. 22(3): 281-285.

[8] Cho, R. J., Campbell, M. J., Winzeler, E. A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T. G., Gabrielian, A. E., Landsman, D., Lockhart, D. J., Davis, R. W. (1998). A Genome-Wide Transcriptional Analysis of the Mitotic Cell Cycle. Molecular Cell. 2(1): 65-73.

[9] Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J. D., Jacq, C., Johnston, M., Louis, E. J., Mewes, H. W., Murakami, Y., Philippsen, P., Tettelin, H., Oliver, S. G. (1996). Life with 6000 Genes. Science. 274(5287): 563-567.

[10] Cherry, J. M., Adler, C., Ball, C., Chervitz, S. A., Dwight, S. S., Hester, E. T., Jia, Y., Juvik, G., Roe, T., Schroeder, M., Weng, S., Botstein, D. (1998). SGD: Saccharomyces Genome Database. Nucleic Acids Research. 26(1): 73-80.

[11] Mewes, H. W., Heumann, K., Kaps, A., Mayer, K., Pfeiffer, F., Stocker, S., Frishman, D. (2002). MIPS: A Database for Genomes and Protein Sequences. Nucleic Acids Research. 30(1): 31-34.

[12] Schuler, G. D., Epstein, J. A., Ohkawa, H., Kans, J. A. (1996). Entrez: Molecular Biology Database and Retrieval System. Methods Enzymol. 266(1): 141-162.

[13] Costanzo, M. C., Hogan, J. D., Cusick, M. E., Davis, B. P., Fancher, A. M., Hodges, P. E., Kondu, P., Lengieza, C., Lew-Smith, J. E., Lingner, C., Roberg-Perez, K. J., Tillberg, M., Brooks, J. E., Garrels, J. I. (2000). The Yeast Proteome Database (YPD) and Caenorhabditis Elegans Proteome Database (WormPD): Comprehensive Resources for the Organization and Comparison of Model Organism Protein Information. Nucleic Acids Research. 28(1): 73-76.

[14] Kumar, A., Cheung, K. H., Tosches, N., Masiar, P., Liu, Y., Miller, P., Snyder, M. (2002). The TRIPLES Database: A Community Resource for Yeast Molecular Biology. Nucleic Acids Research. 30(1): 73-75.

[15] Bolshakova, N., Azuaje, F. (2003). Cluster Validation Techniques for Genome Expression Data. Signal Process. 83(1): 825-833.

[16] Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001). On Clustering Validation Techniques. Journal of Intelligent Information Systems. 17(2-3): 107-145.

[17] Bezdek, J. C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. Norwell, USA: Kluwer Academic Publishers.

[18] Valarmathie, P., Srinath, M. V., Ravichandran, T., Dinakaran, K. (2009). Hybrid Fuzzy C-Means Clustering Technique for Gene Expression Data. International Journal of Research and Reviews in Applied Sciences. 1(1): 33-37.

[19] Berget, I., Mevik, B. H., Næs, T. (2008). New Modifications and Applications of Fuzzy C-Means Methodology. Computational Statistics and Data Analysis. 52(5): 2403-2418.

[20] Li, X. L., Tan, Y. C., Ng, S. K. (2006). Systematic Gene Function Prediction from Gene Expression Data by Using a Fuzzy Nearest-Cluster Method. BMC Bioinformatics. 7(Suppl 4): S23.

[21] Dembélé, D., Kastner, P. (2003). Fuzzy-C-Means Method for Clustering Microarray Data. Bioinformatics. 19(8): 973-980.

[22] Bereta, M., Burczyński, T. (2009). Immune K-Means and Negative Selection Algorithms for Data Analysis. Information Sciences. 179(10): 1407-1425.

[23] Tseng, G. C. (2007). Penalized and Weighted K-Means for Clustering with Scattered Objects and Prior Information in High-Throughput Biological Data. Bioinformatics. 23(17): 2247-2255.

[24] Brameier, M., Wiuf, C. (2007). Co-Clustering and Visualization of Gene Expression Data and Gene Ontology Terms for Saccharomyces Cerevisiae Using Self-Organizing Maps. Journal of Biomedical Informatics. 40(2): 160-173.

[25] McGarry, K., Sarfraz, M., Macintyre, J. (2007). Integrating Gene Expression Data from Microarrays Using the Self-Organising Map and the Gene Ontology. In Rajapakse, J. C., Schmidt, B., Volkert, G. (Eds.) Pattern Recognition in Bioinformatics. Lecture Notes in Computer Science. 4774. (pp. 206-217). Berlin, Germany: Springer-Verlag.

[26] Angiulli, F., Cesario, E., Pizzuti, C. (2008). Random Walk Biclustering for Microarray Data. Information Sciences. 178(1): 1479-1497.

[27] DiMaggio Jr, P. A., Mcallister, S. R., Floudas, C. A., Feng, X. J., Rabinowitz, J. D., Rabitz, H. A. (2008). Biclustering via Optimal Re-Ordering of Data Matrices in Systems Biology: Rigorous Methods and Comparative Studies. BMC Bioinformatics. 9(1): 458.

[28] Tari, L., Baral, C., Kim, S. (2009). Fuzzy C-Means Clustering with Prior Biological Knowledge. Journal of Biomedical Informatics. 42(1): 74-81.

[29] Zhang, M., Therneau, T., Mckenzie, M. A., Li, P., Yang, P. (2008). A Fuzzy C-Means Algorithm Using a Correlation Metrics and Gene Ontology. Proceedings of the International Conference on Pattern Recognition. 15-17 October. Melbourne, Australia: IEEE, 1-4.

[30] Bandyopadhyay, S., Mukhopadhyay, A., Maulik, U. (2007). An Improved Algorithm for Clustering Gene Expression Data. Bioinformatics. 23(21): 2859-2865.

[31] Kim, D.W., Kang, B.Y. (2006). Iterative Clustering Analysis for Grouping Missing Data in Gene Expression Profiles. In Ng, W. K., Kitsuregawa, M., Li, J. (Eds.) Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science. 3918. (pp. 129-138). Berlin, Germany: Springer-Verlag.

[32] Kasim, S., Deris, S., Othman, R. M. A New Computational Framework for Gene Expression Clustering. Advanced Data Mining and Applications. 603-610.

[33] Kasim, S., Deris, S., Othman, R. M. Multi-Stage Filtering for Improving Confidence Level and Determining Dominant Clusters in Clustering Algorithms of Gene Expression Data. Computers in Biology and Medicine. 43(9): 1120-1133.
