[IEEE 2011 World Congress on Information and Communication Technologies (WICT) - Mumbai, India...

6
FLBC: A Fuzzy Link based Clustering Approach for Gene Expression Data Giri Mallika Bora Dept. of CS & Engg. Tezpur University Tezpur, Assam, India [email protected] Rosy Sarmah Dept. of CS & Engg. Tezpur University Tezpur, Assam, India [email protected] Abstractโ€”In this paper, we propose an algorithm for ef๏ฌcient clustering of gene expression data. The algorithm uses the concept of common neighbors and uses a fuzzy approach for detecting intersecting and overlapping clusters. We have also compared the algorithm to the existing popular approaches and found our algorithm to give good results in terms of z-score measure of cluster validity and p-value measure. Keywords-Gene expression data; nearest neighbor; common nearest neighbor; Pearson correlation coef๏ฌcient; I. I NTRODUCTION Microarray technology has helped the monitoring of the expression levels of thousands of genes during avrious biological processes and functions. Finding the hidden in- formation in this huge volume of gene expression data is quite dif๏ฌcult, and, therefore the need for computationally ef๏ฌcient methods to mine this data is a thrust area for the research community. However, according to [1], โ€œthe large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurementsโ€. Clustering techniques have been widely used and found capable in addressing this challenge. Clustering methods identi๏ฌes the inherent natural structures and the interesting patterns in the dataset. Clustering identi๏ฌes subsets of genes that behave similarly. Genes in the same cluster have similar expression patterns (co-expressed genes). Clustering Techniques cluster genes with similar expres- sion patterns into the same cluster and helps in under- standing gene function, gene regulation, cellular processes, and sub types of cells. Over the last few decades, a very rich literature on Cluster Analysis of Gene Expression Data has evolved [1]. Three approaches can be found in gene data clustering: (i) Gene based clustering, (ii) Sample based clustering and (iii) Subspace clustering. Gene-based clustering considers the genes as data objects and the samples as features. Sample-based clustering con- siders the samples as data objects to be clustered, while the genes are considered as features. In subspace clustering ei- ther genes or samples can be regarded as objects or features. All the three categories namely: gene-based, sample-based, and subspace clustering have different challenges, and dif- ferent computational strategies are adopted for each of them. In this paper we investigate the problems of identi๏ฌcation of coherent patterns by clustering technique using gene based approach. The proposed method, is based on the concept of common nearest neighbor clustering approach as in [7] and uses a fuzzy membership function to identify the overlapping clusters as well. The remainder of the paper is organized as follows. In Section II, we analyze some of the important existing clus- tering methods in the context of clustering gene expression data. In Section III, we provide the background of our work and in section IV, we discuss the proposed algorithm. In Section V, we report the experimental results and ๏ฌnally the conclusions and future works are presented in Section VI. II. RELATED WORK Clustering methods for gene expression data should be capable of revealing the inherent structure of the data, extracting useful features from even noisy data, identifying the highly connected and embedded patterns in the data [2] and ๏ฌnding the relationships between the clusters and their sub-clusters. k-means [3] is a pioneering partition-based clustering algorithm that partitions the dataset into some pre-de๏ฌned number of clusters by optimizing a prede๏ฌned criterion. However, speci๏ฌcation of the number of clusters, noise sensitivity, unsuitability in detecting arbitrary shaped clusters and inconsistency in yielding the same result on different runs of the algorithm are considered as its major demerits. In [2], the authors propose the Density-based Hierarchical Clustering method (DHC) that uses a density-based approach to identify co-expressed gene groups from gene expression data. DHC is suitable for detecting highly connected clusters but is computationally expensive and is dependent on two global parameters. Jarvis and Patrick [4] ๏ฌrst introduced the idea of de๏ฌning the similarity of points in terms of their shared nearest neighbors. In [5], a -nearest neighbor based density estimation technique has been exploited. Another density based algorithm proposed by [5] works in three phases: density estimation for each gene, rough clustering 892 978-1-4673-0126-8/11/$26.00 c 2011 IEEE

Transcript of [IEEE 2011 World Congress on Information and Communication Technologies (WICT) - Mumbai, India...

FLBC: A Fuzzy Link based Clustering Approach for Gene Expression Data

Giri Mallika BoraDept. of CS & Engg.

Tezpur UniversityTezpur, Assam, India

[email protected]

Rosy SarmahDept. of CS & Engg.

Tezpur UniversityTezpur, Assam, [email protected]

Abstractโ€”In this paper, we propose an algorithm for efficientclustering of gene expression data. The algorithm uses theconcept of common neighbors and uses a fuzzy approach fordetecting intersecting and overlapping clusters. We have alsocompared the algorithm to the existing popular approaches andfound our algorithm to give good results in terms of z-scoremeasure of cluster validity and p-value measure.

Keywords-Gene expression data; nearest neighbor; commonnearest neighbor; Pearson correlation coefficient;

I. INTRODUCTION

Microarray technology has helped the monitoring of theexpression levels of thousands of genes during avriousbiological processes and functions. Finding the hidden in-formation in this huge volume of gene expression data isquite difficult, and, therefore the need for computationallyefficient methods to mine this data is a thrust area forthe research community. However, according to [1], โ€œthelarge number of genes and the complexity of biologicalnetworks greatly increase the challenges of comprehendingand interpreting the resulting mass of data, which oftenconsists of millions of measurementsโ€. Clustering techniqueshave been widely used and found capable in addressingthis challenge. Clustering methods identifies the inherentnatural structures and the interesting patterns in the dataset.Clustering identifies subsets of genes that behave similarly.Genes in the same cluster have similar expression patterns(co-expressed genes).

Clustering Techniques cluster genes with similar expres-sion patterns into the same cluster and helps in under-standing gene function, gene regulation, cellular processes,and sub types of cells. Over the last few decades, a veryrich literature on Cluster Analysis of Gene Expression Datahas evolved [1]. Three approaches can be found in genedata clustering: (i) Gene based clustering, (ii) Sample basedclustering and (iii) Subspace clustering.

Gene-based clustering considers the genes as data objectsand the samples as features. Sample-based clustering con-siders the samples as data objects to be clustered, while thegenes are considered as features. In subspace clustering ei-ther genes or samples can be regarded as objects or features.All the three categories namely: gene-based, sample-based,

and subspace clustering have different challenges, and dif-ferent computational strategies are adopted for each of them.In this paper we investigate the problems of identification ofcoherent patterns by clustering technique using gene basedapproach. The proposed method, is based on the concept ofcommon nearest neighbor clustering approach as in [7] anduses a fuzzy membership function to identify the overlappingclusters as well.

The remainder of the paper is organized as follows. InSection II, we analyze some of the important existing clus-tering methods in the context of clustering gene expressiondata. In Section III, we provide the background of our workand in section IV, we discuss the proposed algorithm. InSection V, we report the experimental results and finallythe conclusions and future works are presented in SectionVI.

II. RELATED WORK

Clustering methods for gene expression data should becapable of revealing the inherent structure of the data,extracting useful features from even noisy data, identifyingthe highly connected and embedded patterns in the data [2]and finding the relationships between the clusters and theirsub-clusters.

k-means [3] is a pioneering partition-based clusteringalgorithm that partitions the dataset into some pre-definednumber of clusters by optimizing a predefined criterion.However, specification of the number of clusters, noisesensitivity, unsuitability in detecting arbitrary shaped clustersand inconsistency in yielding the same result on differentruns of the algorithm are considered as its major demerits.In [2], the authors propose the Density-based HierarchicalClustering method (DHC) that uses a density-based approachto identify co-expressed gene groups from gene expressiondata. DHC is suitable for detecting highly connected clustersbut is computationally expensive and is dependent on twoglobal parameters. Jarvis and Patrick [4] first introduced theidea of defining the similarity of points in terms of theirshared nearest neighbors. In [5], a ๐‘˜-nearest neighbor baseddensity estimation technique has been exploited. Anotherdensity based algorithm proposed by [5] works in threephases: density estimation for each gene, rough clustering

892978-1-4673-0126-8/11/$26.00 cยฉ2011 IEEE

using core genes and cluster refinement using border genes.In [6], the authors present a density and shared nearestneighbor based clustering method. The similarity measureused is that of Pearsonโ€™s correlation and the density ofa gene is given by the sum of its similarities with itsneighbors. The use of shared nearest neighbor measure isjustified by the fact that the presence of shared neighborsbetween two dense genes means that the density around thedense genes is similar and hence should be included in thesame cluster along with their neighbors. In [7], a commonnearest neighbour-based clustering technique (CNNC) forfinding clusters over gene expression data is reported. CNNCattempts to find all the clusters over gene expression dataqualitatively using a nearest neighbour-based approach anduses a regulation-based module for finding the sub clusters.A subspace clustering algorithm, CLIC, is proposed in [8].CLIC first clusters the genes in individual dimensions andthe ordinal labels of clusters in each dimension are then usedfor further full dimension-wide clustering. CLIC also findsthe sub-clusters of the clusters detected in the first round ofclustering which helps in finding more homogeneous groupsin the data. Fuzzy clustering approaches have received con-siderable focus recently because of their capability to assignone gene to more than one cluster (fuzzy assignment), whichmay allow capturing genes involved in multiple transcrip-tional programs and biological processes. Fuzzy C-means(FCM), is an extension of K-means clustering and bases thefuzzy assignment of an object to a cluster on the relativedistance between the object and all cluster centroids [9].Many variants of FCM have been proposed in the past years,including a fuzzy clustering approach, FLAME [9], whichdetects dataset-specific structures by defining neighborhoodrelations and then neighborhood approximation of fuzzymemberships are used so that non-globular and non-linearclusters are also captured.

A. Discussion

From our selected survey, we observe that most of thealgorithms are dependent on several input parameters whichare difficult to estimate. Gene expression data are usuallyhigh dimensional whereas, most of the existing algorithmsare found to be costly with the increase in dimension.Moreover there is interconnection between genes therefore ashared nearest neighbor approach would help in identifyingsuch interconnected genes. Gene expression data also con-tains intersecting and overlapping clusters, the identificationof which is very important.

III. BACKGROUND OF THE WORK

Our Link based clustering deals with clustering amonggenes which are intensely associated to each other throughcommon neighbor genes. The clustering is done on the geneson the basis of links (defined later in this section) existing

Figure 1. (a) The Neighbor Graph (b) Link Graph (๐‘– and ๐‘— have a linkbetween them)

between them. Following are some of the key concepts basedon [7] which have been used in our approach.

Definition 3.1: Neighbor: A gene ๐‘”๐‘– is a neighbor ofanother gene ๐‘”๐‘— if the similarity between them i.e.,๐‘†๐‘–๐‘š(๐‘”๐‘–, ๐‘”๐‘—) โ‰ฅ ๐›ฟ where ๐‘†๐‘–๐‘š(๐‘”๐‘–, ๐‘”๐‘—) refers to the Pearsonโ€™scorrelation coefficient [1] and ๐›ฟ a user defined threshold.

Definition 3.2: Common Neighbor: A gene ๐‘”๐‘– is said tobe a common neighbor of another gene ๐‘”๐‘— if both ๐‘”๐‘– and ๐‘”๐‘—have atleast one shared or common neighbor between themi.e., ๐‘”๐‘– has atleast one neighbor which is also a neighbor of๐‘”๐‘— .

๐ถ๐‘(๐‘”๐‘–, ๐‘”๐‘—) = {๐‘”1, ๐‘”2, โ‹… โ‹… โ‹… , ๐‘”๐‘˜} ๐‘คโ„Ž๐‘’๐‘Ÿ๐‘’

๐‘†๐‘–๐‘š(๐‘”๐‘–, ๐‘”1) โ‰ฅ ๐›ฟ, ๐‘†๐‘–๐‘š(๐‘”๐‘–, ๐‘”2) โ‰ฅ ๐›ฟ, โ‹… โ‹… โ‹… , ๐‘†๐‘–๐‘š(๐‘”๐‘–, ๐‘”๐‘˜) โ‰ฅ ๐›ฟ ๐‘Ž๐‘›๐‘‘

๐‘†๐‘–๐‘š(๐‘”๐‘— , ๐‘”1) โ‰ฅ ๐›ฟ, ๐‘†๐‘–๐‘š(๐‘”๐‘— , ๐‘”2) โ‰ฅ ๐›ฟ, โ‹… โ‹… โ‹… , ๐‘†๐‘–๐‘š(๐‘”๐‘— , ๐‘”๐‘˜) โ‰ฅ ๐›ฟ

Here, โˆฃ ๐ถ๐‘(๐‘”๐‘–, ๐‘”๐‘—) โˆฃโ‰ฅ 1. In other words CN value is thenumber of shared or common nearest neighbors shared bytwo genes ๐‘”๐‘– , ๐‘”๐‘— .

Definition 3.3: Link: Two genes ๐‘”๐‘– , ๐‘”๐‘— have a linkbetween them if ๐‘™๐‘–๐‘›๐‘˜(๐‘”๐‘–, ๐‘”๐‘—) = {โˆฃ ๐ถ๐‘(๐‘”๐‘–, ๐‘”๐‘—) โˆฃโ‰ฅ ๐›ผ}, where๐›ผ is a user defined parameter.An example scenario is shown in Fig. 1. Here in the neighborgraph each of the nodes represent genes and the edges in Fig.1 (a) represent the neighbor connections i.e., two nodes willhave an edge between them if they are neighbors. The edgein Fig. 1 (b) represents the link connection i.e, two geneswill have an edge between them in the link graph if thenumber of common neighbors is atleast ๐›ผ. In this example,๐›ผ = 3.

Definition 3.4: Interconnected genes: Interconnectedgenes is a set consisting of ordered pairs of genes such thatthere is a link between every gene in the ordered pairs andthere is a chain of links between the ordered pairs.

For example, let us consider the genes ๐‘– and ๐‘—. Let therebe a link between ๐‘– and ๐‘—. Let the set of interconnected genesbe {(๐‘–, ๐‘—), (๐‘–1, ๐‘–2), (๐‘–3, ๐‘”4), โ‹… โ‹… โ‹… (๐‘—1, ๐‘—2), (๐‘—3, ๐‘—4), โ‹… โ‹… โ‹…}, wheregenes ๐‘–, ๐‘— has a link between them; ๐‘–1 has a link with ๐‘–,๐‘–2 has a link with ๐‘–1 and so on. Similarly, ๐‘—1 has a linkwith ๐‘—, ๐‘—2 has a link with ๐‘—1 and so on. The scenario is

2011 World Congress on Information and Communication Technologies 893

Figure 2. The interconnection network

depicted in Fig. 2. Te set of all interconnected genes forman interconnection network.

Definition 3.5: Initial Cluster: An initial cluster ๐ผ๐ถ๐‘– isdefined as a set of interconnected genes .

Definition 3.6: Core genes: The centroid of a set ofinterconnected genes is known as a core gene.

Definition 3.7: Reachability: A gene ๐‘”๐‘– is said to bereachable from a core gene ๐‘”๐‘— , if ๐‘”๐‘– belongs to the set ofinterconnected genes of ๐‘”๐‘— .

Definition 3.8: Fuzzy Reachability: A gene ๐‘”๐‘– is said tobe fuzzy-reachable from a core gene ๐‘”๐‘— , if ๐‘š๐‘Ž๐‘ฅ๐‘ข๐‘–,๐‘— โ‰ฅ ๐›ฝ,where, ๐‘— = 1, 2, โ‹… โ‹… โ‹… , ๐‘˜ and ๐‘˜ is the total number of initialclusters.

Definition 3.9: Cluster: A cluster ๐ถ๐‘– is the set of all thereachable and fuzzy reachable genes.

Definition 3.10: Noise: Those genes which do not belongto any cluster are called noise genes.Link based clustering adopts a recursive approach for effi-cient clustering. The aim of the clustering is to expand theneighbors of a gene to check if a link exists. As soon asa gene finds a link, its cluster id is assigned and termedas classified. Classified genes are not further taken intoconsideration for expansion.

IV. PROPOSED TECHNIQUE: FLBC

Our proposed Fuzzy Link Based Clustering (FLBC) iden-tifies the initial clusters by using a common nearest neighborbased approach (Step 1) and then finds the overlapped andintersecting clusters by a fuzzy based approach to obtainthe final clustering of the data (Step 2). Step 1 of FLBCproceeds by first finding the neighbors for each gene basedon definition 3.1. For finding the neighbors we use thePearsonโ€™s Correlation measure as the similarity metric. Next,the common neighbors are found by using Definition 3.2.Cluster expansion starts with a pair of genes ๐‘”๐‘– , ๐‘”๐‘— that havea link between them. Then all the common neighbors of thatpair of genes are also classified with the same ๐‘๐‘™๐‘ข๐‘ ๐‘ก๐‘’๐‘Ÿ ๐‘–๐‘‘

as ๐‘”๐‘– , ๐‘”๐‘— and cluster expansion proceeds by finding the linkof each of the common neighbors in a depth first manner.When no more genes can be inserted into this cluster, clusterexpansion restarts with the next pair of unclassified geneswith a link between them. This process is iterated till nomore genes can be classified or no more links are found.The step-wise represention of step 1 of FLBC is given next:

1) Initially all genes are unclassified.2) Find the neighbors for each pair of genes.

3) Compute the common neighbors for each pair ofgenes.

4) Find link between pairs of genes.5) Repeat through step 11 till all genes are classified.6) Arbitrarily choose a pair of unclassified genes with a

link between them and classify them with ๐‘๐‘™๐‘ข๐‘ ๐‘ก๐‘’๐‘Ÿ ๐‘–๐‘‘.7) Find all interconnected genes w.r.t. the genes obtained

in step 6.8) Classify all pairs of interconnected genes as well as

their common neighbors with the same ๐‘๐‘™๐‘ข๐‘ ๐‘ก๐‘’๐‘Ÿ ๐‘–๐‘‘.9) Repeat steps 7 and 8 till no more genes are inserted

into this cluster.10) Increment ๐‘๐‘™๐‘ข๐‘ ๐‘ก๐‘’๐‘Ÿ ๐‘–๐‘‘.

The Step 1 of FLBC algorithm uses the concept of linkto support crisp and clear decision on the membership ofa gene in a cluster. For two genes to be linked there hasto be ๐›ผ no of common neighbors, which depicts that thetwo genes must have an intense similarity to many commonneighbors. It is efficient even in presence of noise. ButStep 1 of FLBC suffers from the disadvantage that it is noteffective in detecting overlapping and intersecting clusters.With higher value of CN threshold, ๐›ผ, the algorithm givesclusters which are highly interconnected with a high degreeof similarity i.e. highly coherent gene clusters. The Step 2of FLBC is based on the cluster result of Step 1 and alsouses the concept of fuzzy membership to find the clusters.

Fuzzy link based clustering applies the concepts, fuzzyclustering and link based clustering for deciding the co-herency of genes. The algorithm for this approach makesuse of Link based clustering to detect genes with higherdegree of similarity.

The Step 1 of FLBC is executed with a high value of๐›ผ to obtain the highly coherent clusters. Here, a lot ofgenes which are in the bordeing area of the clusters are notclassified. These genes (also knwon as candidate genes) maybe assigned to the clusters using a fuzzy approach to avoidthis loss of useful information. From the clusters obtained byStep 1, the centroids of the clusters are computed. We useanother parameter ๐›ฝ, to incorporate these candidate genesinto a cluster. We use the fuzzy membership function givenin [10], to find the degree of membership of every candidategene to each of the ๐‘˜ clusters detected in Step 1. Themembership function is given next.

๐‘ข๐ถ๐‘–,๐‘”๐‘— =1

โˆ‘๐‘˜

๐‘™=1

[๐‘‘(๐ถ๐‘–,๐‘”๐‘—)๐‘‘(๐ถ๐‘™,๐‘”๐‘—)

] 2

๐‘š๐‘“โˆ’1

(1)

where ๐‘”๐‘— is a candidate gene, ๐‘˜ is the number of clustersdetected in Step 1, ๐‘ข is the fuzzy membership matrix suchthat ๐‘ข๐‘–๐‘— โˆˆ [0, 1] is the membership degree of ๐‘”๐‘— to cluster๐ถ๐‘–. ๐ถ = {๐ถ1, ๐ถ2, โ‹… โ‹… โ‹… , ๐ถ๐‘˜} is the set of initial clusters and๐ถ๐‘– is the current cluster for which the membership of ๐‘”๐‘— isto be determined and ๐ถ๐‘™ are the initial clusters present, ๐‘‘ isa distance measure between the centroid of an initial cluster

894 2011 World Congress on Information and Communication Technologies

Figure 3. Results of FLBC on Dataset 1

and a candidate gene. The factor ๐‘š๐‘“ is called fuzziness andis usually equal to 2 [10]. The membership for each of thecandidate genes is computed and assigned to that clusterfor which it has the highest value according to Definition3.8. The cluster expansion in Step 2 checks if a gene is acandidate gene and finds if its highest membership value to acore gene is also greater than ๐›ฝ,then it is assigned the samecluster id of the core gene to which it is fuzzy reachable.For those candidate genes which have similar membershipvalue to more than one core genes are assigned to boththe clusters and thus have more than one cluster id. Theseare the genes which give the overlapping and intersectingclusters. This step is repeated until all candidate genes areassigned cluster id or noise id.

V. PERFORMANCE EVALUATION

The performance of the proposed technique was evaluatedin light of three real-life datasets. We used a PentiumIV machine of 3.60GHz speed with 2.00 GB RAM. Theimplementation was done in C++ in windows platform. Thedatasets used and their characteristics are given in TableI. The results of the first step of FLBC on Dataset 1 andDataset 2 are given in Figures 3 and 4. It can be seen thatFLBC gives clusters of coherent genes. The clusters obtainedby the proposed FLBC on Dataset 3 is illustrated in Fig. 5.

A. Cluster Quality

For validating our clustering result, we employ the ho-mogeneity, separation measures as given in [1] and z-scoremeasure of [14]. The Homogeneity and separation values

Figure 4. Results of FLBC on Dataset 2

Figure 5. Results of FLBC on Dataset 3

for FLBC for different datasets as given in Table II showtheat the clusters obtained are quite homogeneous.

Z-score is calculated by taking into account the rela-tionship between the clustering result and the functionalannotation of the genes in that cluster. The Gibbons Clus-terJudge [14] tool has been used here to calculate the z-score. A higher z-score indicates the clustering result ismore biologically relevant. To test the performance of theclustering algorithm, we compare the clusters identified byFLBC with the results from k-means [3], FCM [10] and

Table IIHOMOGENEITY AND SEPARATION VALUES FOR FLBC

Datasets No. of Clusters Homogeneity SeparationDataset 1 18 0.94 0.09Dataset 2 06 0.90 0.15Dataset 3 14 0.95 -0.08

2011 World Congress on Information and Communication Technologies 895

Table IDATASETS USED FOR EVALUATING ๐ถ๐‘๐‘๐ถ

SerialNo.

Dataset No. of genes No of condi-tions

Source

1 Subset of Yeast Cell Cycle[11]

384 17 http://faculty.washington. edu/kayee/cluster

2 Rat CNS [12] 112 9 http://faculty. washington.edu/kayee/cluster

3 Yeast Diauxic Shift [13] 6089 7 http://www.ncbi.nlm.nih.gov/geo/query

Table IIIZ-SCORES FOR FLBC AND ITS COUNTERPARTS FOR DATASET 1

Method Applied No. of Clusters z-scorek-means 18 2.48

FCM 18 2.49CNNC 15 3.04FLBC 18 2.59

Table IVZ-SCORES FOR FLBC AND ITS COUNTERPARTS FOR DATASET 3

Method Applied No. of Clusters z-scorek-means 14 16.08

FCM 14 13.98CNNC 14 14.24FLBC 14 13.77

CNNC [7]. In this paper, the reported z-score is averagedover 50 repeated experiments. The result of applying thez-score on Dataset 1 is shown in Table III, which clearlyshows that FLBC outperforms k-means, and FCM w.r.t. thecluster quality. However, the performance of FLBC is lessthan CNNC for both Dataset 1 and Dataset 3 (Table IV).This is due to the fact that CNNC gives disjoint clusters butthere are also overlapping clusters in FLBC.

To evaluate the statistical significance of the genes in acluster, we compute the p-value for each GO category. P-value represents the probability of observing the number ofgenes from a specific GO functional category within eachcluster. A low p-value indicates the genes belonging to theenriched functional categories are biologically significant inthe corresponding clusters. We have obtained p-value usingthe software FuncAssociate [15], which is a web-based toolthat accepts as input a list of genes, and returns a list of GOattributes that are over- (or under-) represented among thegenes in the input list. The enriched functional categoriesfor some of the clusters obtained by FLBC on Dataset 3 arelisted in Table V. To restrict the size of the paper, we havereported only a part of the results. The functional enrichmentof each GO category in each of the clusters is calculated byits p-value.

Our Fuzzy link based clustering approach combines theclassical as well as the fuzzy set theory approach toclustering genes. It provides a better precise solution forhandling the imprecise data i.e. any gene that might belongto different clusters with different degree of membership. Agene is assigned to a cluster with which it has the highest

Table VP-VALUES OF DATASET 3

Cluster P-value GO number GO categoryC11 1.8e-23 GO:0007039 vacuolar protein catabolic process

2.9e-23 GO:0044257 cellular protein catabolic process3.48e-21 GO:0030163 protein catabolic process4.39e-13 GO:0009056 catabolic process1.68e-11 GO:0009266 response to temperature stimulus5.14e-11 GO:0044248 cellular catabolic process1.6e-10 GO:0044265 cellular macromolecule catabolic pro-

cess4.39e-08 GO:0005996 monosaccharide metabolic process

C12 2.37e-31 GO:0022613 ribonucleoprotein complex biogenesis2.4e-30 GO:0042254 ribosome biogenesis3.5e-30 GO:0044085 cellular component biogenesisg1.28e-30 GO:0043228 non-membrane-bounded organelle1.28e-30 GO:0043232 intracellular non-membrane-bounded

organelle8.8e-29 GO:0030684 Preribosome1.5e-28 GO:0030529 ribonucleoprotein complex2.9e-25 GO:0005730 Nucleolus1.7e-19 GO:0006364 rRNA processing4.69e-19 GO:0016072 rRNA metabolic process2.5e-17 GO:0006417 regulation of translation4.03e-17 GO:0032268 regulation of cellular protein

metabolic process2.76e-17 GO:0005840 Ribosome1.9e-16 GO:0010608 posttranscriptional regulation of gene

expression1.6e-15 GO:0030686 90S preribosome6.39e-13 GO:0030687 preribosome, large subunit precursor4.24e-11 GO:0042255 ribosome assembly6.78e-10 GO:0042257 ribosomal subunit assembly2.58e-10 GO:0022625 cytosolic large ribosomal subunit9.68e-09 GO:0000027 ribosomal large5.05e-09 GO:0032040 small-subunit processome1.3e-09 GO:0070925 organelle assembly3.7e-08 GO:0042273 ribosomal large subunit biogenesis

C13 9.4e-24 GO:0055114 oxidation reduction7.1e-23 GO:0044455 mitochondrial membrane part3.0e-21 GO:0044455 mitochondrial part6.57e-18 GO:0022904 respiratory electron transport chain1.36e-18 GO:0008219 cell death1.36e-18 GO:0016265 Death5.59e-12 GO:0006099 tricarboxylic acid cycle5.59e-12 GO:0046356 acetyl-CoA catabolic process8.25e-10 GO:0005751 mitochondrial respiratory chain com-

plex IV8.25e-10 GO:0045277 respiratory chain complex IV

membership value. Similar membership values for genesw.r.t. different clusters helps in the detection of overlappedclusters. Even after fuzzy clustering for genes, it is still goodenough for detecting noise genes which could not obtain amembership in any cluster because of the threshold ๐›ฝ usedfor deciding membership in a cluster.

The FLBC clustering provides good results for clusteringin two steps. In the first step it finds highly coherent genes.In the second step it applies a fuzzy clustering on the

896 2011 World Congress on Information and Communication Technologies

genes which have not been classified in the first step. Forassigning the membership of such genes we have usedanother parameter ๐›ฝ, which serves as the threshold for a geneto be qualified as a member of a cluster. A comparativelyhigher value of ๐›ฝ < ๐œƒ, ensures that the algorithm is stillsuccessful in detecting the noise genes.

Performance of fuzzy link based clustering is almostcomparable to CNNC [7]. Nevertheless, this approach ofclustering decreases the quality of the clusters because theincorporation of some of the candidate genes might decreasethe coherency of the cluster. The algorithm gives clusterswhich are dense with higher value of CN, with a gooddegree of similarity. These genes represent a very fine levelof clustering or it gives the set of highly coherent patterns.We note here that some of the algorithms require an inputparameter to be specified i.e. number of clusters as in k-means and FCM, and this input parameter severely affectsthe clustering as revealed by the variation in cluster validitymeasures with change of values for these parameters. Onthe other hand, the proposed method FLBC does not requirethe no. of clusters apriori and has been found to yield quitecomparable results.

VI. CONCLUSION

This paper discusses an effective technique for clusteringgene expression data using a common neighbor and fuzzybased approach. The value of the tuning parameters ๐›ฟ and๐›ผ play a key role in the improvement of the performance ofthe proposed technique. An appropriate value for the tuningparameter ๐›ฟ may be selected by cross validation. Based onour experiments, it has been observed that the value of ๐›ฟ

within 0.8โ€“0.95 gives excellent results for the tested datasets.It should be noted that higher the value of ๐›ฟ, better is thecoherence. However, work is going on to develop a heuristicmethod for adaptive selection of the tuning parameter ๐›ผ

and ๐›ฝ. Performance evaluation over real data show thatthe proposed method significantly improves the performanceover some of the traditional methods such as FCM and k-means even in presence of outliers. Also significantly goodz- and p- values establish the effectiveness of the proposedFLBC while comparing with its other counterparts.

FLBC may be modified to make it parameter independentand to detect intrinsic and embedded cluster.

REFERENCES

[1] Jiang, D., Tang, C., and Zhang, A. โ€˜Cluster anal-ysis for gene expression data: A surveyโ€™, Available:www.cse.buffalo.edu/DBGROUP/bioinformatics/papers/ sur-vey.pdf, 2003.

[2] Jiang, D., Pei, J. and Zhang A. โ€˜DHC: a density-basedhierarchical clustering method for time series gene expressiondataโ€™, Proceedings of BIBE2003: 3rd IEEE InternationalSymposium on Bioinformatics and Bioengineering, pp. 393,Bethesda, Maryland, USA, 2003.

[3] McQueen, J.B. โ€˜Some methods for classification and analysisof multivariate observationsโ€™. Proceedings of the Fifth Berke-ley Symposium on Mathematical Statistics and Probability,Vol. 1, pp. 281-297, University of California Press, 1967.

[4] Jarvis, R.A. and Patrick, E.A. โ€˜Clustering using a similaritymeasure based on Shared Nearest Neighborsโ€™, IEEE Trans-actions on Computers,Vol. 22(11), pp. 1025-1034, 1973.

[5] Chung, S., Jun, J. and McLeod, D. Mining gene expressiondatasets using density based clustering, Technical Report,USC/IMSC, University of Southern California, No. IMSC-04-002, 2004.

[6] Syamala, R., Abidin, T. and Perrizo, W. โ€˜Clustering Microar-ray Data based on Density and Shared Nearest Neighbor Mea-sureโ€™, Proceedings of the 21st ISCA International Conferenceon Computers and Their Applications (CATA-2006), pp. 23-25, 2006.

[7] Goswami, M., Sarmah, R. and Bhattacharyya, D. K. โ€˜CNNC:a common nearest neighbour clustering approach for gene ex-pression dataโ€™, International Journal of Computational Visionand Robotics 2011, Vol. 2, No.2 pp. 115 - 126, 2011.

[8] Yun, T., Hwang, T., Cha, K. and Yi, GS. โ€˜CLIC: clus-tering analysis of large microarray datasets with individualdimension-based clusteringโ€™, Nucleic Acids Research, Vol. 38,pp. W246-W253, 2010.

[9] Fu, L and Medico, E. โ€˜FLAME: a novel fuzzy clusteringmethod for the analysis of DNA microarray dataโ€™, BMCBioinformatics, Vol. 8(3), 2007.

[10] Bezdek, J. C. โ€˜Pattern Recognition with Fuzzy ObjectiveFunction Algorithmsโ€™, Plenum, New York, USA, 1981.

[11] Cho, R. J., Campbell, M., Winzeler, E., Steinmetz, L., Con-way, A., Wodicka, L., Wolfsberg, T., Gabrielian, A., Lands-man, D. and Lockart, D. โ€˜A genome-wide transcriptionalanalysis of the mitotic cell cycleโ€™, Molecular Cell, Vol. 2(1),pp. 65-73, 1998.

[12] Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith,S., Barker, J.L. and Somogyi, R. โ€˜Large-scale temporal geneexpression mapping of central nervous system developmentโ€™,PNAS, Vol. 95(1), pp. 334-339, 1998.

[13] DeRisi, J.L., Iyer, V.R.and Brown, P.O. โ€˜Exploring theMetabolic and Genetic Control of Gene Expression on aGenomic Scaleโ€™, Science, Vol. 278, pp. 680-686, 1997.

[14] Gibbons, F. and Roth, F. โ€˜Judging the quality of gene ex-pression based clustering methods using gene annotationโ€™,Genome Research, Vol. 12, pp. 1574-1581, 2002.

[15] Berriz, F. G. et. al. โ€˜Characterizing gene sets with funcasso-ciateโ€™, Bioinformatics, Vol. 19, pp. 2502-2504, 2003.

2011 World Congress on Information and Communication Technologies 897