An effective graph-based clustering technique to identify coherent patterns from gene expression



Int. J. Bioinformatics Research and Applications, Vol.

An effective graph-based clustering technique to identify coherent patterns from gene expression data

G. Priyadarshini, R. Sarmah*, B. Chakraborty* and D.K. Bhattacharyya*

Department of Computer Science and Engineering,
Tezpur University,
Tezpur, India
E-mail: [email protected]
E-mail: [email protected]
E-mail: [email protected]
E-mail: [email protected]

J.K. Kalita*

Department of Computer Science,
University of Colorado,
Colorado Springs, CO 809933, USA
E-mail: [email protected]
*Corresponding authors

Abstract: This paper presents an effective parameter-less graph-based clustering technique (GCEPD). GCEPD produces highly coherent clusters in terms of various cluster validity measures. The technique finds highly coherent patterns containing genes with high biological relevance. Experiments with real life datasets establish that the method produces clusters that are significantly better than other similar algorithms in terms of various quality measures.

Keywords: cluster; affinity; homogeneity; separation; z-score; embedded patterns; coherent patterns.

Reference to this paper should be made as follows: Priyadarshini, G., Sarmah, R., Chakraborty, B., Bhattacharyya, D.K. and Kalita, J.K. (2012) ‘An effective graph-based clustering technique to identify coherent patterns from gene expression data’, Int. J. Bioinformatics Research and Applications, Vol.

Biographical notes: Gargi Priyadarshini has received her BTech in computer science and is currently pursuing her MTech (IT) from the Department of Computer Science and Engineering, Tezpur University, Tezpur, India. Her areas of research include bioinformatics.

Copyright © 2012 Inderscience Enterprises Ltd.


Rosy Sarmah (Das) is an Assistant Professor in the Department of Computer Science and Engineering, Tezpur University, Tezpur, India. Her areas of research include data mining and bioinformatics.

Bhaskar Chakraborty has received his BTech (CS) from the Department of Computer Science and Engineering, Tezpur University, Tezpur, India and is currently working in EMC Corporation.

Dhruba K. Bhattacharyya is a Professor in the Department of Computer Science and Engineering, Tezpur University, Tezpur, India. His research interests include data mining and network security.

Jugal K. Kalita is a Professor in the Department of Computer Science at the University of Colorado, Colorado Springs, USA. His research interests include artificial intelligence, bioinformatics and natural language processing.

This paper is a revised and expanded version of a paper entitled ‘Highly Coherent Pattern Identification Using Graph-based Clustering’ presented at Biotechnology and Bioinformatics Symposium (BIOT 2010), 14–15 October, Lafayette, Louisiana, pp.29–38, 2010.

1 Introduction

In recent years, microarray technology has enabled the monitoring of expression levels of thousands of genes simultaneously, across different conditions and over different time points (Stekel, 2003). Analysis of microarray data provides useful insight into characteristic functions of genes and regulatory life cycle mechanisms. The critical bottleneck is to derive knowledge from such large datasets and extract the patterns hidden in them. Cluster analysis of gene expression patterns aims at finding and grouping genes that manifest similar traits and serves as a fundamental step in addressing this challenge.

Clustering of gene expression data groups co-expressed and similar genes into the same cluster. Different gene expression data clustering techniques have been developed. These include K-means (Stekel, 2003) and Self-Organising Maps (SOMs) (Tamayo et al., 1999). However, these approaches require the number of clusters to be specified in advance. Quality Threshold (QT) (Heyer et al., 1999) clustering begins by choosing a user-specified maximum diameter for clusters. The Self-Organising Tree Algorithm (SOTA) (Herrero et al., 2005) combines the concepts of both hierarchical clustering and SOMs. Gene expression data often contain embedded and intersecting clusters, which are very difficult to identify (Jiang et al., 2003). The Density-based Hierarchical Clustering method (DHC) (Jiang et al., 2003) can identify embedded clusters in the dataset even in the presence of outliers and can effectively visualise the internal structure of the dataset. In Das et al. (2010), a density-based method (DGC) is presented for clustering gene expression data using a two-objective function. The method uses regulation information as well as a suitable dissimilarity measure to cluster genes into regions of higher density separated by sparser regions. In Das et al. (2008), the authors propose a Frequent Itemset Nearest Neighbour (FINN) based approach for clustering genes. FINN is also capable of obtaining the embedded clusters in the dataset. The GenClus algorithm presented in Sarmah et al. (2010) is a density-based clustering approach which finds useful subgroups of highly coherent genes within a cluster and obtains a hierarchical structure of the dataset where the sub-clusters give the finer clustering of the dataset. An effective tree-based clustering technique (GeneClusTree) for finding clusters in gene expression data is presented in Sarmah et al. (2011). GeneClusTree finds all the clusters over subspaces using a tree-based density approach by scanning the whole database in the minimum possible number of scans and is free from the restrictions of using a normal proximity measure.

Graph-based clustering algorithms usually do not need the number of clusters to be specified in advance, but they do require some parameter to be provided by the user (Foggia et al., 2007). CLuster Identification via Connectivity Kernels (CLICK) (Sharan and Shamir, 2000) is appropriate for subspace and high dimensional data clustering. A novel algorithm for cluster analysis that is based on graph theoretic techniques is presented in Hartuv et al. (2000). Unlike other methods, it does not assume that the clusters are hierarchically structured and does not require prior knowledge of the number of clusters. The Cluster Affinity Search Technique (CAST) by Ben-Dor et al. (1999) relies on the concept of a clique graph and is a divisive approach. In CAST, non-hierarchical clustering is obtained where cluster boundaries are determined without user intervention and the number of clusters is determined by the algorithm instead of being an input parameter. However, CAST uses a fixed initial threshold value to start clustering and the algorithm requires a final cleaning step to relocate the data points among the existing clusters. E-CAST (Bellaachia et al., 2002) is an enhanced version of CAST which uses a dynamic threshold. In Das et al. (2009), a new graph-based clustering method is presented to cluster the genes on the basis of a repulsion factor. In Foggia et al. (2007), a Minimum Spanning Tree (MST) is constructed from the graph representing the samples, by removing inconsistent edges from the graph, which are identified on the basis of the associated distances of the edges and a threshold determined by the Fuzzy C-Means algorithm. Breitling et al. (2004) introduce a statistical method known as Graph-based Iterative Group Analysis (GIGA) to detect active subgraphs in the biological knowledge graph built by connecting genes sharing an annotation. Wang et al. (2009) use complex relationships of genes to perform gene-based clustering from the perspective of topological features. This approach represents a gene regulatory network as a graph and exposes the complex relationships of genes by extracting topological features of the graph. A graph-based clustering method is presented in Kim and Choi (2009), which uses linear programming to decompose a graph into disjoint r-regular graphs, followed by further refinement through optimisation of the normalised cluster utility. In Baya et al. (2011), a Penalised k-Nearest-Neighbour-Graph (PKNNG) based metric (a new tool for evaluating distances) has been used for clustering arbitrary manifolds, as could be the case for some gene expression datasets. The PKNNG metric is based on a two-step method: first it constructs the k-Nearest-Neighbour-Graph of the dataset using a low k-value and then it adds edges with a highly penalised weight for connecting the sub-graphs produced by the first step.


2 Discussion

Most clustering algorithms depend on input parameters such as the number of clusters or certain threshold value(s). However, algorithms based on graph theory are able to detect clusters of various shapes and sizes, at least when the clusters are well separated, and are suitable for data that do not follow a Gaussian or spherical distribution (Foggia et al., 2007). Also, such algorithms do not need the number of clusters to be specified a priori and are very suitable for clustering large datasets of high dimensions such as gene expression data. In this paper, we present a Graph based Clustering technique with Embedded Pattern Detection (GCEPD) that extracts highly coherent patterns in derived coarse clusters to keep the biological relevance of the genes intact. Our automated method can identify embedded patterns in the dataset, and its cluster quality is comparable to that of other, less automated methods.

3 The proposed technique

The Graph based Clustering technique with Embedded Pattern Detection (GCEPD) works in two steps. In the first step, the gene expression data is clustered using a graph theoretic approach as in Priyadarshini et al. (2010) and in the second step, the resulting clusters from the first step are checked for the presence of highly coherent embedded subclusters. These two steps are discussed in detail in the next subsections. GCEPD is formulated on the basis of the E-CAST (Enhanced Cluster Affinity Search Technique) (Bellaachia et al., 2002) algorithm. Following are some of the key concepts that have been used in GCEPD.

Definition 3.1 (Affinity): The affinity of a gene gi to a cluster Ci is defined as:

$$a(g_i) = \sum_{g_k \in C_i} S(g_i, g_k) \qquad (1)$$

where S(gi, gk) is the similarity between a pair of genes gi and gk and S is any similarity measure. Here, S was chosen as Pearson's Correlation Coefficient (Stekel, 2003).
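Equation (1) can be illustrated with a minimal Python sketch. The helper names `pearson` and `affinity`, the dictionary layout of `expr`, and the choice to score a flat (zero-variance) profile as 0 are assumptions made for illustration only; they are not taken from the authors' C implementation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return cov / (sx * sy)

def affinity(gene, cluster, expr):
    """a(g_i): sum of similarities between `gene` and every member of `cluster`."""
    return sum(pearson(expr[gene], expr[k]) for k in cluster)

# Toy expression profiles (hypothetical values).
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with g1
    "g3": [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with g1
}
print(affinity("g1", ["g2"], expr))        # close to +1
print(affinity("g1", ["g2", "g3"], expr))  # the +1 and -1 terms cancel
```

Because the affinity is a plain sum, it grows with cluster size, which is why it is later compared against a threshold scaled by the cluster's cardinality.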

Definition 3.2 (Affinity Threshold): The affinity threshold T is an input parameter used to determine the membership of a node (gene) in a cluster. It is calculated dynamically as

$$T = \left( \frac{\sum_{g_i, g_j \in U',\; g_i \neq g_j,\; S(g_i, g_j) \geq \alpha} \bigl( S(g_i, g_j) - \alpha \bigr)}{\left| \{ g_i, g_j \in U' : S(g_i, g_j) \geq \alpha \} \right|} \right) + \alpha \qquad (2)$$

where $\beta = \frac{\sum_{g_i, g_j \in U'} S(g_i, g_j)}{|U'|}$, U′ is the set of unclustered genes and U is the set of all genes. In our experiments, α = 80% of β gave the best results.
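The threshold computation can be sketched as follows, under several stated assumptions: pairs are taken as unordered pairs of distinct genes, β is divided by |U′| exactly as the definition reads, and when no pair reaches α the function simply falls back to α. The function name `dynamic_threshold` and the `frozenset`-keyed similarity table are hypothetical.

```python
from itertools import combinations

def dynamic_threshold(sim, unclustered, alpha_fraction=0.8):
    """Return (T, alpha, beta) for the unclustered genes, following Eq. (2).

    sim maps frozenset({gi, gj}) to the similarity S(gi, gj).
    """
    pairs = [frozenset(p) for p in combinations(unclustered, 2)]
    if not pairs:
        return 0.0, 0.0, 0.0  # assumption: nothing to compare
    # Literal reading of the definition: divide by the number of genes |U'|.
    beta = sum(sim[p] for p in pairs) / len(unclustered)
    alpha = alpha_fraction * beta
    strong = [sim[p] for p in pairs if sim[p] >= alpha]
    if not strong:
        return alpha, alpha, beta  # assumption: fall back to alpha
    t = sum(s - alpha for s in strong) / len(strong) + alpha
    return t, alpha, beta

# Toy similarities between three unclustered genes (hypothetical values).
sim = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.1,
    frozenset(("b", "c")): 0.5,
}
T, alpha, beta = dynamic_threshold(sim, ["a", "b", "c"])
print(T, alpha, beta)
```

Algebraically, subtracting α inside the average and adding it back outside means T equals the mean similarity of the pairs that reach α; in the toy data those pairs are (a, b) and (b, c).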

Definition 3.3 (Connectivity Threshold): The connectivity threshold δ of a cluster Ci is defined as δ = T |Ci|, where |Ci| is the cardinality of Ci.


Definition 3.4 (High Connectivity Node): A high connectivity node gi is a node that will be added to a cluster if its affinity satisfies the following: a(gi) ≥ δ.

Definition 3.5 (Low Connectivity Node): A low connectivity node gi is a node that will be removed from a cluster if its affinity satisfies the following: a(gi) < δ.

Definition 3.6 (Cluster): A cluster C consists of a subset of genes gk (k = 1, 2, · · · , c), where c is the number of genes in the subset, such that a(gk) ≥ δ.

The above definitions are used in describing Step 1 of GCEPD, where it detects clusters. Step 2 uses the cluster results of Step 1 and detects the embedded patterns in the clusters.

3.1 Step 1: Cluster detection

Step 1 of the GCEPD algorithm starts by setting the affinity of all genes to zero. Then, a gene gi with the maximum similarity among all genes gx ∈ U′ is selected as the seed for the current cluster (Copen) under consideration and gi is added to Copen. After the initial steps of choosing the seed and adding it to Copen, the algorithm iteratively performs the addition steps followed by the removal steps. In the addition step, the genes having high affinity are selected in decreasing order of their affinity and added to Copen one at a time, until the maximum affinity amongst the unclassified genes falls below the connectivity threshold (δ = T |Copen|). Similarly, in the removal steps, the genes having low affinity are selected individually in increasing order of their affinity and removed from Copen until the minimum affinity amongst the genes in Copen is above the connectivity threshold. The addition and removal steps are iterated until no more unclassified genes can be added to or removed from the current cluster Copen. The process then restarts with another unclassified gene having the maximum similarity w.r.t. all genes gx ∈ U′. The pseudo-code for the first step of the method, i.e., the clustering of gene expression data, is provided in Figures 1 and 2.
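The alternating addition and removal steps can be sketched in Python as below. This is an illustrative reading of the description, not the authors' pseudo-code from Figures 1 and 2: the seed rule (largest total similarity to the other unclustered genes), the `threshold_fn` callback that supplies T, and the `max_rounds` safety cap are all assumptions.

```python
def gcepd_step1(genes, sim, threshold_fn, max_rounds=100):
    """Partition genes into clusters by alternating addition and removal steps."""
    unclustered = list(genes)
    clusters = []
    while unclustered:
        T = threshold_fn(unclustered)  # affinity threshold for this cluster
        # Assumed seed rule: largest total similarity to the other unclustered genes.
        seed = max(unclustered,
                   key=lambda g: sum(sim(g, h) for h in unclustered if h != g))
        c_open = [seed]
        # Affinity of every unclustered gene to the open cluster (Eq. 1).
        aff = {g: sim(g, seed) for g in unclustered}
        for _ in range(max_rounds):  # cap guards against oscillation
            changed = False
            # Addition: add the outside gene with the highest affinity while
            # that affinity reaches the connectivity threshold T * |C_open|.
            while True:
                outside = [g for g in unclustered if g not in c_open]
                if not outside:
                    break
                best = max(outside, key=aff.get)
                if aff[best] < T * len(c_open):
                    break
                c_open.append(best)
                for g in unclustered:
                    aff[g] += sim(g, best)
                changed = True
            # Removal: drop the member with the lowest affinity while it is
            # below the connectivity threshold (never empty the cluster).
            while len(c_open) > 1:
                worst = min(c_open, key=aff.get)
                if aff[worst] >= T * len(c_open):
                    break
                c_open.remove(worst)
                for g in unclustered:
                    aff[g] -= sim(g, worst)
                changed = True
            if not changed:
                break
        clusters.append(c_open)
        unclustered = [g for g in unclustered if g not in c_open]
    return clusters

# Toy similarity table (hypothetical values); sim(g, g) is taken as 1.
pairs = {
    frozenset(("a", "b")): 0.9, frozenset(("a", "c")): 0.2,
    frozenset(("a", "d")): 0.1, frozenset(("b", "c")): 0.1,
    frozenset(("b", "d")): 0.0, frozenset(("c", "d")): 0.8,
}

def sim(g, h):
    return 1.0 if g == h else pairs[frozenset((g, h))]

result = gcepd_step1(["a", "b", "c", "d"], sim, lambda unclustered: 0.5)
print(result)
```

In GCEPD the threshold is recomputed dynamically from the unclustered genes; the constant 0.5 returned by the lambda above only keeps the toy example easy to trace by hand.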

3.2 Step 2: Highly coherent embedded pattern identification

Gene expression datasets usually have embedded patterns within them (Jiang et al., 2003). We therefore check for the existence of highly coherent patterns in the clusters obtained in the first step of GCEPD. The second step of GCEPD focuses on detecting embedded patterns from these clusters. We use a regulation matching approach to determine the match between two genes. Here, the regulation, i.e., up- or down-regulation in each of the conditions for a particular gene, plays a key role. Suppose U is the set of all genes and S′ is the set of all conditions. Let gi ∈ U be the ith gene and j ∈ S′ be the jth condition. Let the total number of genes be N and the total number of conditions be T. The expression value of gene gi at condition j is given by ξi,j. The regulation pattern of each gene is obtained as follows:

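$$R_{i,j} = \begin{cases} 0 & \text{for } j = 1, \text{ or } \xi_{i,j} = \xi_{i,j-1} \text{ for } j = 2, \dots, T \\ 1 & \text{for } \xi_{i,j} > \xi_{i,j-1},\; j = 2, \dots, T \\ 2 & \text{for } \xi_{i,j} < \xi_{i,j-1},\; j = 2, \dots, T. \end{cases}$$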
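As a small illustration, the regulation pattern of a single gene can be derived from its expression profile as follows. The function name `regulation_pattern` and the sample profile are hypothetical.

```python
def regulation_pattern(profile):
    """Encode a profile as 0 (first condition or unchanged), 1 (up), 2 (down)."""
    if not profile:
        return []
    pattern = [0]  # the first condition has no predecessor, so it is coded 0
    for prev, cur in zip(profile, profile[1:]):
        if cur > prev:
            pattern.append(1)   # up-regulated relative to the previous condition
        elif cur < prev:
            pattern.append(2)   # down-regulated relative to the previous condition
        else:
            pattern.append(0)   # unchanged
    return pattern

print(regulation_pattern([0.5, 0.9, 0.9, 0.2]))  # [0, 1, 0, 2]
```

The pattern keeps only the direction of change between consecutive conditions, so two genes with different magnitudes but the same rises and falls receive identical patterns.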


Figure 1 The algorithm for Step 1 of GCEPD

Figure 2 The algorithm for calculating threshold, T and β


Each gene will now have a regulation pattern (Ri) of 0, 1, and 2 across the conditions or time points. This regulation pattern is used to find the match between two genes gi and gk as follows.

Definition 3.7 (Match): Let Ri and Rk be the regulation patterns of two genes gi and gk. The match (M) between gi and gk is given by the number of agreements (No_Agreements), i.e., the number of condition-wise common regulation values, between the two regulation patterns:

$$M(g_i, g_k) = \text{No\_Agreements}(R_{i,j}, R_{k,j}) \quad \text{for } j = 1, 2, \dots, T.$$
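The match can be computed by counting position-wise agreements, as in the sketch below. The helper `meets_matching_factor`, which applies the 80% matching factor used later in the paper, and the example patterns are illustrative assumptions.

```python
def match(ri, rk):
    """M(g_i, g_k): number of conditions where two regulation patterns agree."""
    if len(ri) != len(rk):
        raise ValueError("regulation patterns must cover the same conditions")
    return sum(1 for a, b in zip(ri, rk) if a == b)

def meets_matching_factor(ri, rk, factor=0.8):
    """True when the match reaches `factor` times the number of conditions."""
    return match(ri, rk) >= factor * len(ri)

r1 = [0, 1, 0, 2]
r2 = [0, 1, 2, 2]
print(match(r1, r2))                  # 3 agreements out of 4 conditions
print(meets_matching_factor(r1, r2))  # False, since 3 < 0.8 * 4
```

Note that the first position always agrees, because every regulation pattern starts with 0 by definition.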

Embedded pattern detection starts with a cluster Ci, i = 1, 2, . . . , Nc (where Nc is the total number of clusters detected in Step 1), and finds a gene, gi ∈ Ci, with the maximum affinity. The process then checks all other genes in Ci and finds the next maximum affinity gene with which gi satisfies the conditions in Definition 3.8.

Definition 3.8 (Embedded patterns in a cluster): Two genes gi, gk ∈ ECi, where ECi is the set of embedded patterns of cluster Ci, iff

1 a(gi) ≥ δ and a(gk) ≥ δ

2 M(gi, gk) ≥ 80% of T (matching factor)

3 |ECi| ≥ 2.

If the genes satisfy the conditions, they are grouped into an embedded cluster, say ECi. The process continues until no more genes can be added to ECi. The whole process then restarts by selecting the next maximum affinity gene in Ci. When no more embedded clusters can be formed in Ci, the process checks for the presence of embedded clusters in another cluster Cj. The algorithm for Step 2 is given in Figure 3.

Figure 3 The algorithm for Step 2 of GCEPD

We note that the two steps are not completely independent of each other; rather, the output of one step is the input to the next step. The first step proceeds by alternating between two sub-steps, namely the addition step, where a gene is added to a cluster, and the removal step, where a gene is removed from a cluster (Figure 1). During the formation of a cluster, the affinity value of every gene is updated whenever it is added as a member of a cluster or removed from a cluster. So after a cluster has been formed, every member gene of that cluster has an affinity value of its own. To detect coherent patterns (embedded patterns) in such a cluster formed in the first step, these affinity values of genes are used to choose the candidate genes for the identification of embedded patterns in the second step. The second step thus works on the output generated by the first step, avoiding the overhead of computing the affinity values during the second step. The second step therefore chooses coherent genes on the basis of their affinity values and forms the set of embedded patterns within a single cluster, taking into account the matching factor criterion given in Definition 3.8.
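The second step can be sketched as a greedy grouping inside one cluster. This is one possible reading of the description and not the algorithm of Figure 3: it assumes that candidates are compared with the seed gene only, that a seed which attracts no partner is discarded, and that affinities and regulation patterns are passed in as dictionaries. All names and values are hypothetical.

```python
def agreements(ri, rk):
    """Number of conditions where two regulation patterns agree."""
    return sum(1 for a, b in zip(ri, rk) if a == b)

def embedded_patterns(cluster, affinity, patterns, delta, factor=0.8):
    """Group coherent genes of one cluster into embedded clusters."""
    # Condition 1: only genes whose affinity reaches delta are candidates,
    # visited in decreasing order of affinity.
    remaining = sorted((g for g in cluster if affinity[g] >= delta),
                       key=lambda g: affinity[g], reverse=True)
    found = []
    while remaining:
        seed = remaining.pop(0)  # current maximum-affinity gene
        needed = factor * len(patterns[seed])  # condition 2: matching factor
        group = [seed] + [g for g in remaining
                          if agreements(patterns[seed], patterns[g]) >= needed]
        if len(group) >= 2:  # condition 3: at least two genes
            found.append(group)
            remaining = [g for g in remaining if g not in group]
    return found

# Affinities as left behind by Step 1, and regulation patterns (toy values).
affinity = {"g1": 3.2, "g2": 3.0, "g3": 2.9, "g4": 1.0}
patterns = {
    "g1": [0, 1, 1, 2, 2],
    "g2": [0, 1, 1, 2, 2],
    "g3": [0, 2, 2, 1, 1],
    "g4": [0, 1, 1, 2, 2],
}
ecs = embedded_patterns(["g1", "g2", "g3", "g4"], affinity, patterns, delta=2.0)
print(ecs)  # [['g1', 'g2']]
```

Here g4 shares the regulation pattern of g1 but is excluded by its low affinity, while g3 has sufficient affinity but a mismatching pattern.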

4 Performance evaluation

GCEPD was implemented in C in a Linux environment running on an HP workstation. We have exhaustively compared the results of GCEPD with those of several other clustering techniques on various real life gene expression datasets. To assess the quality of our method, we need an objective external criterion as a quality measure. We have used four different quality measures in this paper, namely average homogeneity (Sharan and Shamir, 2000), separation (Sharan and Shamir, 2000), z-score (Gibbons and Roth, 2002) and p-value (Berriz et al., 2003).

Average homogeneity reflects the compactness of the clusters while separation reflects the overall distance between clusters. It should be noted that higher average homogeneity and lower average separation suggest an improvement in clustering. Note also that, even though homogeneity and separation are precisely defined, they have opposing objectives: the better the homogeneity, the poorer the separation, and vice versa.
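The two measures are not spelled out in this section. The sketch below assumes one common formulation: homogeneity is the average similarity between each gene and the mean profile (centroid) of its cluster, and separation is the size-weighted average similarity between the centroids of distinct clusters, with Pearson's correlation as the similarity. The function names and toy data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def centroid(cluster):
    """Condition-wise mean profile of a cluster (a list of profiles)."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def average_homogeneity(clusters):
    """Mean similarity between each gene and the centroid of its cluster."""
    total = sum(pearson(p, centroid(c)) for c in clusters for p in c)
    return total / sum(len(c) for c in clusters)

def average_separation(clusters):
    """Size-weighted mean similarity between centroids of distinct clusters."""
    if len(clusters) < 2:
        raise ValueError("separation needs at least two clusters")
    cents = [centroid(c) for c in clusters]
    num = den = 0.0
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            weight = len(clusters[i]) * len(clusters[j])
            num += weight * pearson(cents[i], cents[j])
            den += weight
    return num / den

clusters = [
    [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]],   # rising profiles
    [[3.0, 2.0, 1.0], [6.0, 4.0, 2.0]],   # falling profiles
]
print(average_homogeneity(clusters))  # close to 1: very compact clusters
print(average_separation(clusters))   # close to -1: centroids are dissimilar
```

Under this formulation separation is itself a similarity between clusters, which is why lower values indicate better separated clusters.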

We have used Gibbons' ClusterJudge tool (Gibbons and Roth, 2002) to calculate the z-score, by investigating the relation between the clusters obtained in an experiment and the functional annotation of the genes in the cluster. The larger the z-score, the better the clusters.

We have obtained the p-value using the software FuncAssociate (Berriz et al., 2003), which is a web-based tool that accepts as input a list of genes, and returns a list of GO attributes that are over- (or under-) represented among the genes in the input list. The p-value represents the probability of observing the number of genes from a specific GO functional category within each cluster. A low p-value indicates that the genes belonging to the enriched functional categories are biologically significant in the corresponding clusters.
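To make the idea of an enrichment p-value concrete, the sketch below computes a hypergeometric upper-tail probability: the chance of seeing at least the observed number of annotated genes in a cluster drawn at random. This is a generic illustration under that assumption and is not claimed to reproduce FuncAssociate's exact procedure (for example, no multiple-testing adjustment is applied). The function name and the numbers are hypothetical.

```python
from math import comb

def enrichment_p_value(total, annotated, cluster_size, hits):
    """P(X >= hits) when cluster_size genes are drawn from total genes,
    of which `annotated` carry the GO category of interest."""
    upper = min(annotated, cluster_size)
    tail = sum(comb(annotated, x) * comb(total - annotated, cluster_size - x)
               for x in range(hits, upper + 1))
    return tail / comb(total, cluster_size)

# 10 genes in total, 4 annotated; a cluster of 3 genes contains 3 annotated ones.
p = enrichment_p_value(total=10, annotated=4, cluster_size=3, hits=3)
print(p)  # 4 / 120, i.e. about 0.033
```

The smaller this probability, the less plausible it is that the category's presence in the cluster arose by chance.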

GCEPD was compared with several competitive algorithms on four real life datasets given in Table 1. All datasets are first normalised to have mean 0 and standard deviation 1. No other normalisation was performed on the datasets. Genes that have small variance over time were filtered out.
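The preprocessing can be sketched as follows, assuming that normalisation is applied per gene across conditions, that the population standard deviation is used, and that low-variance genes are dropped before scaling. The cut-off `min_variance` is a hypothetical parameter; the paper does not state the value used.

```python
from statistics import mean, pstdev

def filter_and_standardise(expr, min_variance):
    """Drop low-variance genes, then scale each profile to mean 0 and sd 1."""
    out = {}
    for gene, profile in expr.items():
        sd = pstdev(profile)
        if sd == 0 or sd ** 2 < min_variance:
            continue  # flat or nearly flat profile: filtered out
        m = mean(profile)
        out[gene] = [(v - m) / sd for v in profile]
    return out

expr = {
    "g1": [1.0, 2.0, 3.0],
    "flat": [5.0, 5.0, 5.0],
}
normalised = filter_and_standardise(expr, min_variance=0.1)
print(normalised)  # only g1 remains, rescaled to mean 0 and sd 1
```

Filtering has to precede scaling in this sketch, because after scaling every retained gene has unit variance.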


Table 1 List and source of datasets used

| Serial No. | Dataset used | No. of genes | No. of samples | Source |
|---|---|---|---|---|
| 1 | Subset of yeast cell cycle (Cho et al., 1998) | 384 | 17 | http://yscdp.stanford.edu/yeast_cell_cycle/full_data.html |
| 2 | Rat CNS (Wen et al., 1998) | 112 | 9 | http://faculty.washington.edu/kayee/cluster |
| 3 | Human fibroblasts (Iyer et al., 1999) | 517 | 19 | http://genomewww.stanford.edu/serum/data/ |
| 4 | Colon Cancer (Alon et al., 1999) | 2000 | 62 | http://www.sph.uth.tmc.edu/hgc |

We have used MeV (Howe et al., 2010) and EXPANDER (Sharan et al., 2003) to test the performance of the different clustering techniques on real life gene expression datasets. The k-means algorithm was implemented in C in a Linux environment as well as using EXPANDER. SOM and CLICK were also tested on EXPANDER in a Windows environment. Apart from these algorithms, the CAST, SOTA and QT clustering methods were tested on MeV in a Linux environment. Since we could not obtain the source code for E-CAST, we have implemented it for comparison with our GCEPD. The performance of our method (Step 1) has been compared with that of k-means, SOM, CLICK, CAST, E-CAST, QT clustering and SOTA.

The clusters detected by GCEPD in Dataset 1 and in Dataset 2 are shown in Figures 4 and 5, respectively. The clusters obtained from Dataset 2 on executing E-CAST are given in Figure 6. Figures 7(a), 7(b), 8(a) and 8(b) show the results of k-means, CAST, SOM and QT clustering, respectively, using Dataset 1. It can be seen visually that our method gave much more compact clusters in comparison with its counterparts. This fact is supported by the homogeneity and separation values in Table 2. We can see from the table that only E-CAST had better homogeneity values than GCEPD, but E-CAST gave many singleton clusters. Figures 9 and 10 show the cluster results of GCEPD on Dataset 3 and Dataset 4, respectively. Figure 11(b)–(f) show some of the embedded clusters detected by GCEPD from a cluster (Figure 11(a)) of Dataset 1. Figure 12 shows the embedded clusters detected by GCEPD on Dataset 4. To restrict the size of the article, we have shown only two clusters from two different datasets and their corresponding embedded clusters. The figures show that the embedded patterns are the highly coherent patterns in the data. The homogeneity and separation values for various algorithms on Dataset 2 and Dataset 3 are given in Tables 3 and 4, respectively.

We note here that most of the algorithms require an input parameter to be specified (the number of clusters in k-means, threshold values in CLICK and CAST, and so on) and that this input parameter severely affects clustering, as demonstrated by the variation in cluster validity measures as the parameter values change. On the other hand, the GCEPD method is completely parameter independent and yields comparable results. From the performance of the various clustering algorithms shown in Tables 2–4, we infer that GCEPD (first step) yields quite satisfactory results. The E-CAST algorithm results in a slightly higher value of homogeneity than GCEPD. However, E-CAST gives a much higher number of clusters, which also includes a number of singleton clusters. Obtaining a number of singleton clusters defeats the actual purpose of clustering. On the other hand, GCEPD ensures a better distribution of genes in the clusters with a comparable value of homogeneity. In addition, GCEPD gives better values for separation between clusters than E-CAST, as observed from Tables 2–4, indicating that the clusters are well separated (i.e., inter-cluster similarity is low). Another merit of GCEPD is that it is fully automated and also gives the embedded clusters with high coherency in the dataset. That the embedded patterns are highly coherent can be inferred from Table 5, where we see that the embedded clusters have high homogeneity values.

Figure 4 Clusters obtained from Dataset 1 using GCEPD

Figure 5 Some clusters obtained from the first step of GCEPD on Dataset 2

Figure 6 The clusters obtained by E-CAST on Dataset 2

Figure 7 (a) Result of k-means (10 clusters) on Dataset 1 and (b) result of CAST (threshold 0.8) on Dataset 1 (see online version for colours)

Figure 8 (a) Result of SOM (3 × 3 grid) on Dataset 1 and (b) result of QT clustering (0.5 diameter) on Dataset 1 (see online version for colours)

Figure 9 Some clusters obtained from the first step of GCEPD on Dataset 3

Figure 10 Some clusters obtained from the first step of GCEPD on Dataset 4

Figure 11 (a) Original clusters obtained by the first step of GCEPD from Dataset 1, (b)–(f) embedded patterns obtained from (a) by the second step of GCEPD (see online version for colours)

Figure 12 (a) Cluster obtained by the first step of GCEPD on Dataset 4, (b) and (c) embedded patterns obtained from (a) by the second step of GCEPD

Table 2 Performance results of various clustering algorithms on Dataset 1

| Method applied | Number of clusters | Threshold value | Homogeneity | Separation | Percentage of singleton clusters |
|---|---|---|---|---|---|
| k-means | 7 | NA | 0.725 | -0.127 | 0% |
| k-means | 10 | NA | 0.820 | -0.076 | 0% |
| SOM | 9 | 3 × 3 | 0.810 | -0.079 | 0% |
| CLICK | 4 | Default value | 0.549 | -0.212 | 25% |
| CAST | 37 | 0.8 | 0.860 | -0.164 | 24% |
| E-CAST | 42 | Dynamic | 0.891 | -0.035 | 21% |
| QT clustering | 9 | 0.5 | 0.864 | -0.139 | 0% |
| SOTA | 11 | Default | 0.763 | -0.057 | 0% |
| GCEPD | 12 | Dynamic | 0.87 | -0.157 | 0% |

Table 3 Performance results of various clustering algorithms on Dataset 2

| Method applied | Number of clusters | Threshold value | Homogeneity | Separation | Percentage of singleton clusters |
|---|---|---|---|---|---|
| k-means | 5 | NA | 0.785 | -0.171 | 0% |
| k-means | 10 | NA | 0.855 | -0.022 | 0% |
| SOM | 9 | 3 × 3 | 0.798 | -0.144 | 0% |
| CLICK | 27 | Default value | 0.683 | -0.538 | 89% |
| CAST | 18 | 0.8 | 0.816 | -0.099 | 38.8% |
| E-CAST | 19 | Dynamic | 0.897 | -0.017 | 5.3% |
| QT clustering | 6 | 0.5 | 0.619 | 0.408 | 0% |
| SOTA | 11 | Default | 0.664 | 0.227 | 0% |
| GCEPD | 10 | Dynamic | 0.849 | -0.137 | 20% |

Table 4 Performance results of various clustering algorithms on Dataset 3

| Method applied | Number of clusters | Threshold value | Homogeneity | Separation | Percentage of singleton clusters |
|---|---|---|---|---|---|
| k-means | 8 | NA | 0.551 | -0.040 | 0% |
| SOM | 9 | 3 × 3 | 0.685 | -0.046 | 0% |
| SOM | 22 | 5 × 5 | 0.740 | -0.009 | 0% |
| CLICK | 47 | Default | 0.750 | -0.107 | 85% |
| CAST | 39 | 0.8 | 0.831 | -0.0166 | 31% |
| E-CAST | 51 | Dynamic | 0.843 | -0.0094 | 29.4% |
| QT clustering | 11 | 0.5 | 0.814 | -0.037 | 0% |
| SOTA | 11 | Default | 0.757 | 0.073 | 0% |
| GCEPD | 14 | Dynamic | 0.818 | -0.133 | 0% |

Table 5 Homogeneity values for embedded clusters (EC) detected using GCEPD

| Dataset used | EC1 | EC2 | EC3 | EC4 | EC5 | EC6 | EC7 | EC8 | EC9 | EC10 | EC11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dataset 1 | 0.981 | 0.987 | 0.937 | 0.975 | 0.945 | 0.944 | 0.959 | 0.973 | 0.958 | 0.919 | 0.982 |
| Dataset 2 | 0.969 | 0.964 | 0.972 | 0.970 | 0.989 | 0.908 | 0.951 | 0.869 | 0.911 | 0.923 | 0.852 |


The clusters obtained in Dataset 1 were compared in terms of the z-score measure (Gibbons and Roth, 2002) and the result is reported in Table 6. We see that the z-score value of GCEPD is much better than that of the other comparable algorithms. In order to show the biological relevance of our work, we have used the FuncAssociate (Berriz et al., 2003) tool to find the p-values. We report the functional categories of only one cluster, with p-values below 1e-08, in order to restrict the size of the paper. The enriched functional categories for cluster 1 obtained by GCEPD on Dataset 1 are listed in Table 7. The functional enrichment of each GO category in each of the clusters is calculated by its p-value. The low p-values shown in Table 7 indicate that the genes categorised in the corresponding clusters by this algorithm are biologically significant in the respective clusters. The cluster C1 contains several enriched categories on ‘DNA’. A highly enriched category in C1 is the ‘DNA metabolic process’ with a p-value of 1.34e-26. The GO categories ‘DNA replication’ and ‘DNA repair’ are also highly enriched in this cluster, with p-values of 1.63e-25 and 1.23e-24, respectively. A comparison of the p-values of CAST, QT clustering and E-CAST is given in Table 8 for cluster C1 of Table 7. We have reported the p-values below 1e-08 and we see from Table 8 that CAST and QT gave far fewer highly functionally related genes. E-CAST gave a result almost similar to that of GCEPD for that cluster. However, the GO category ‘chromosomal part’ has the lowest p-value (2.98e-27) in GCEPD while for E-CAST it is 5.71e-26. Therefore GCEPD has the lowest p-value w.r.t. the GO categories.

Table 6 z-scores for GCEPD, CAST, SOM and E-CAST for Dataset 1

| Algorithm used | No. of clusters | z-score | Total no. of genes |
|---|---|---|---|
| CAST | 11 | 2.41 | 384 |
| SOM | 9 | 3.12 | 384 |
| E-CAST | 42 | 4.37 | 384 |
| GCEPD | 12 | 6.98 | 384 |

Table 7 p-value of Dataset 1 on executing GCEPD

| Cluster | p-value | GO number | GO category |
|---|---|---|---|
| C1 | 1.16e-11 | GO:0006273 | Lagging strand elongation |
| | 5.17e-15 | GO:0005657 | Replication fork |
| | 5.46e-09 | GO:0000079 | Regulation of cyclin-dependent protein kinase activity |
| | 8.2e-16 | GO:0007064 | Mitotic sister chromatid cohesion |
| | 1.23e-13 | GO:0007062 | Sister chromatid cohesion |
| | 7.26e-10 | GO:0006271 | DNA strand elongation involved in DNA replication |
| | 7.26e-10 | GO:0022616 | DNA strand elongation |
| | 6.40e-15 | GO:0006302 | Double-strand break repair |
| | 6.25e-09 | GO:0000724 | Double-strand break repair via homologous recombination |
| | 4.11e-09 | GO:0000725 | Recombinational repair |
| | 1.63e-25 | GO:0006260 | DNA replication |
| | 4.84e-10 | GO:0071103 | DNA conformation change |
| | 9.88e-10 | GO:0006310 | DNA recombination |
| | 4.84e-10 | GO:0051052 | Regulation of DNA metabolic process |
| | 1.23e-24 | GO:0006281 | DNA repair |
| | 2.98e-27 | GO:0044427 | Chromosomal part |
| | 8.31e-23 | GO:0006974 | Response to DNA damage stimulus |
| | 1.57e-21 | GO:0044454 | Nuclear chromosome part |
| | 1.34e-26 | GO:0006259 | DNA metabolic process |
| | 2.12e-17 | GO:0007049 | Cell cycle |
| | 7.43e-18 | GO:0022402 | Cell cycle process |
| | 1.88e-17 | GO:0033554 | Cellular response to stress |
| | 2.96e-13 | GO:0051301 | Cell division |
| | 4.96e-16 | GO:0051716 | Cellular response to stimulus |
| | 3.64e-09 | GO:0022403 | Cell cycle phase |
| | 6.76e-11 | GO:0051276 | Chromosome organisation |
| | 3.66e-11 | GO:0048519 | Negative regulation of biological process |
| | 3.07e-09 | GO:0045814 | Negative regulation of gene expression, epigenetic |
| | 3.58e-10 | GO:0045934 | Negative regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic process |
| | 3.58e-10 | GO:0051172 | Negative regulation of nitrogen compound metabolic process |
| | 1.25e-10 | GO:0010605 | Negative regulation of macromolecule metabolic process |
| | 1.17e-11 | GO:0009892 | Negative regulation of metabolic process |
| | 3.37e-10 | GO:0048523 | Negative regulation of cellular process |
| | 1.36e-12 | GO:0006950 | Response to stress |
| | 5.63e-09 | GO:0006996 | Organelle organisation |
| | 9.86e-14 | GO:0005634 | Nucleus |
| | 2.14e-11 | GO:0050896 | Response to stimulus |

Table 8 p-values of CAST, QT and ECAST for cluster 1 of Dataset 1

Cluster p-value GO number GO category

C1 for CAST 4.74e-11 GO:0042555 MCM complex

6.09e-09 GO:0000084 S phase of mitotic cell cycle

6.09e-09 GO:0051320 S phase

7.13e-11 GO:0022403 Cell cycle phase

C1 for QT 5.05e-10 GO:0042555 MCM complex

3.67e-09 GO:0030427 Site of polarised growth

2.96e-10 GO:0051301 Cell division

9.24e-11 GO:0007049 Cell cycle

C1 for ECAST 8.70e-10 GO:0045934 Negative regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic process

5.04e-11 GO:0050896 Response to stimulus

1.91e-16 GO:0005657 Replication fork

2.28e-17 GO:0007064 Mitotic sister chromatid cohesion

6.53e-11 GO:0006273 Lagging strand elongation

3.58e-15 GO:0007062 Sister chromatid cohesion

1.33e-09 GO:0006271 DNA strand elongation involved in DNA replication

1.33e-09 GO:0022616 DNA strand elongation

3.43e-25 GO:0006260 DNA replication

5.41e-15 GO:0006302 Double-strand break repair

1.26e-11 GO:0051052 Regulation of DNA metabolic process

3.96e-25 GO:0006281 DNA repair

5.71e-26 GO:0044427 Chromosomal part

6.50e-20 GO:0044454 Nuclear chromosome part

1.24e-23 GO:0006974 Response to DNA damage stimulus

3.87e-26 GO:0006259 DNA metabolic process

1.75e-09 GO:0016458 Gene silencing

8.37e-09 GO:0006342 Chromatin silencing

8.37e-09 GO:0045814 Negative regulation of gene expression, epigenetic

3.13e-21 GO:0007049 Cell cycle

5.32e-09 GO:0040029 Regulation of gene expression, epigenetic

4.05e-11 GO:0051726 Regulation of cell cycle

4.55e-18 GO:0033554 Cellular response to stress

4.29e-16 GO:0022402 Cell cycle process

6.45e-17 GO:0051716 Cellular response to stimulus

4.42e-16 GO:0005634 Nucleus

8.70e-10 GO:0051172 Negative regulation of nitrogen compound metabolic process

1.33e-10 GO:0010605 Negative regulation of macromolecule metabolic process

9.38e-11 GO:0009892 Negative regulation of metabolic process

7.71e-09 GO:0010558 Negative regulation of macromolecule biosynthetic process

3.38e-09 GO:0009890 Negative regulation of biosynthetic process

3.69e-12 GO:0048519 Negative regulation of biological process

2.24e-09 GO:0031324 Negative regulation of cellular metabolic process

6.60e-12 GO:0051276 Chromosome organisation

7.34e-11 GO:0048523 Negative regulation of cellular process

3.62e-09 GO:0051301 Cell division

7.57e-14 GO:0006950 Response to stress
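The category-by-category comparison between Tables 7 and 8 can be carried out mechanically. The sketch below is an illustration only: it copies a handful of p-values for cluster C1 from Table 7 (GCEPD) and Table 8 (ECAST) and reports, for each GO number present in both, which method gives the smaller p-value. The subset of rows and the function name `lower_p_value_method` are choices made for this example.

```python
# p-values for cluster C1, copied from Table 7 (GCEPD) and Table 8 (ECAST).
GCEPD = {
    "GO:0044427": 2.98e-27,  # Chromosomal part
    "GO:0006259": 1.34e-26,  # DNA metabolic process
    "GO:0006260": 1.63e-25,  # DNA replication
    "GO:0006281": 1.23e-24,  # DNA repair
    "GO:0007049": 2.12e-17,  # Cell cycle
    "GO:0005657": 5.17e-15,  # Replication fork
}
ECAST = {
    "GO:0044427": 5.71e-26,
    "GO:0006259": 3.87e-26,
    "GO:0006260": 3.43e-25,
    "GO:0006281": 3.96e-25,
    "GO:0007049": 3.13e-21,
    "GO:0005657": 1.91e-16,
}


def lower_p_value_method(first, second, first_name, second_name):
    """Map each shared GO number to the method with the smaller p-value."""
    winners = {}
    for go_id in sorted(first.keys() & second.keys()):
        if first[go_id] < second[go_id]:
            winners[go_id] = first_name
        elif second[go_id] < first[go_id]:
            winners[go_id] = second_name
        else:
            winners[go_id] = "tie"
    return winners


winners = lower_p_value_method(GCEPD, ECAST, "GCEPD", "ECAST")
for go_id, method in winners.items():
    print(go_id, method)
```

Extending the two dictionaries to all rows of the tables gives the full per-category picture for cluster C1.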

5 Conclusions and future work

This paper presents a graph-based clustering algorithm that can detect embedded patterns in gene expression data without any input parameters. The clustering is based on a connectivity threshold that is calculated dynamically during the clustering process. The embedded pattern discovery process is likewise parameter-independent and is based on a simple matching technique. The embedded patterns are the highly coherent patterns in the dataset, which are also biologically similar. Experimental results in terms of homogeneity, separation, z-score and p-values on several real-life datasets are reported to establish the superiority of the technique over other similar algorithms. As a future direction, we plan to incorporate biological knowledge into the clustering step itself.

References

Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D. and Levine, A.J. (1999) ‘Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide array’, Proceedings of National Academy of Sciences, Vol. 96, No. 12, pp.6745–6750.

Baya, A.E. and Granitto, P.M. (2011) ‘Clustering gene expression data with a penalized graph-based metric’, BMC Bioinformatics, Vol. 12, No. 2, pp.1–18.

Bellaachia, A., Portnoy, D., Chen, Y. and Elkahloun, A.G. (2002) ‘E-CAST: a data mining algorithm for gene expression data’, BIOKDD02: Workshop on Data Mining in Bioinformatics (with SIGKDD02 Conference), Canada, pp.49–54.

Ben-Dor, A., Shamir, R. and Yakhini, Z. (1999) ‘Clustering gene expression patterns’, Journal of Computational Biology, pp.281–297.

Berriz, G.F., King, O.D., Bryant, B., Sander, C. and Roth, F.P. (2003) ‘Characterizing gene sets with FuncAssociate’, Bioinformatics, Vol. 19, pp.2502–2504.

Breitling, R., Amtmann, A. and Herzyk, P. (2004) ‘Graph-based iterative group analysis enhances microarray interpretation’, BMC Bioinformatics, Vol. 5, No. 100, pp.1–10.

Cho, R.J., Campbell, M., Winzeler, E., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T., Gabrielian, A., Landsman, D. and Lockhart, D. (1998) ‘A genome-wide transcriptional analysis of the mitotic cell cycle’, Molecular Cell, Vol. 2, No. 1, pp.65–73.

Das, R., Bhattacharyya, D.K. and Kalita, J.K. (2010) ‘Clustering gene expression data using an effective dissimilarity measure’, International Journal of Computational BioScience (Special Issue), Vol. 1, No. 1, pp.55–68.

Das, R., Bhattacharyya, D.K. and Kalita, J.K. (2008) ‘Frequent itemset-nearest neighbor based approach for clustering gene expression data’, Proceedings of Fifth Biotechnology and Bioinformatics Symposium (BIOT’08), Texas, pp.73–78.

Das, R., Bhattacharyya, D.K. and Kalita, J.K. (2009) ‘A new approach for clustering gene expression time series data’, International Journal of Bioinformatics Research and Applications, Vol. 5, No. 3, pp.310–328.

Foggia, P., Percannella, G., Sansone, C. and Vento, M. (2007) ‘Assessing the performance of a graph-based clustering algorithm’, Proceedings of International Conference on Graph-based Representations in Pattern Recognition (GbRPR’07), Spain, pp.215–227.

Foggia, P., Percannella, G., Sansone, C. and Vento, M. (2007) ‘A graph-based clustering method and its applications’, BVAI 2007, LNCS 4729, pp.277–287.

Gibbons, F. and Roth, F. (2002) ‘Judging the quality of gene expression based clustering methods using gene annotation’, Genome Research, Vol. 12, pp.1574–1581.

Hartuv, E., Schmitt, A.O., Lange, J., Meier-Ewert, S., Lehrach, H. and Shamir, R. (2000) ‘An algorithm for clustering cDNA fingerprints’, Genomics, Vol. 66, pp.249–256.

Herrero, J., Valencia, A. and Dopazo, J. (2005) ‘A hierarchical unsupervised growing neural network for clustering gene expression patterns’, Bioinformatics, Vol. 17, No. 2, pp.126–136.

Heyer, L.J., Kruglyak, S. and Yooseph, S. (1999) ‘Exploring expression data: identification and analysis of co-expressed genes’, Genome Research, Vol. 9, No. 11, pp.1106–1115.

Howe, E., Holton, K., Nair, S., Schlauch, D., Sinha, R. and Quackenbush, J. (2010) ‘MeV: MultiExperiment Viewer’, Biomedical Informatics for Cancer Research, pp.267–277.

Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T., Lee, J., Trent, J.M., Staudt, L.M., Hudson, J.J., Boguski, M.S., Lashkari, D., Shalon, D., Botstein, D. and Brown, P.O. (1999) ‘The transcriptional program in the response of the human fibroblasts to serum’, Science, Vol. 283, pp.83–87.

Jiang, D., Pei, J. and Zhang, A. (2003) ‘DHC: a density-based hierarchical clustering method for time series gene expression data’, Proceedings of BIBE2003: 3rd IEEE International Symposium on Bioinformatics and Bioengineering, Bethesda, Maryland, USA, pp.393–400.

Jiang, D., Tang, C. and Zhang, A. (2003) Cluster Analysis for Gene Expression Data: A Survey, Available: www.cse.buffalo.edu/DBGROUP/bioinformatics/papers/survey.pdf

Kim, J.K. and Choi, S. (2009) ‘Clustering with r-regular graphs’, Pattern Recognition, Vol. 42, pp.2020–2028.

Priyadarshini, G., Chakraborty, B., Das, R., Bhattacharyya, D.K. and Kalita, J.K. (2010) ‘Highly coherent pattern identification using graph-based clustering’, Proc. of the 7th Annual Biotechnology and Bioinformatics Symposium (BIOT-2010), Lafayette, LA, pp.29–38.

Sarmah, S. and Bhattacharyya, D.K. (2010) ‘An effective technique for clustering incremental gene expression data’, IJCSI International Journal of Computer Science Issues, Vol. 7, No. 3, pp.31–41.

Sarmah, S., Das Sarmah, R. and Bhattacharyya, D.K. (2011) ‘An effective density-based hierarchical clustering technique to identify coherent patterns from gene expression data’, Proceedings of the 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Shenzhen, China, pp.225–236.

Sharan, R., Maron-Katz, A. and Shamir, R. (2003) ‘Click and expander: a system for clustering and visualizing gene expression data’, Bioinformatics, Vol. 19, No. 14, pp.1787–1799.

Sharan, R. and Shamir, R. (2000) ‘CLICK: a clustering algorithm with applications to gene expression analysis’, Proceedings of 8th International Conference on Intelligent Systems for Molecular Biology, pp.307–316.

Stekel, D. (2003) Microarray Bioinformatics, Cambridge University Press, Cambridge, UK.

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S. and Golub, T.R. (1999) ‘Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation’, Proc. of Natl. Acad. Sci. USA, pp.2907–2912.

Wang, W., Zhang, J., Xu, J. and Wang, Y. (2009) ‘A graph-based approach for clustering analysis of gene expression data by using topological features’, Proceedings of World Congress on Computer Science and Information Engineering (CSIE (1)), Los Alamitos, CA, USA, pp.559–563.

Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Barker, J.L. and Somogyi, R. (1998) ‘Large-scale temporal gene expression mapping of central nervous system development’, PNAS, Vol. 95, No. 1, pp.334–339.