Welcome to my presentation on
A Comparative Study of Clustering for Gene Expression Data in Bioinformatics
Roll: 08054746 Reg: 1484
Department of Statistics, Rajshahi University
Rajshahi-6205
Md. Bipul Hossen, Dept. of Statistics, University of Rajshahi
Outline
1. Why choose a clustering technique?
2. Some objectives
3. Methods and materials
4. Results and discussion
5. Conclusion
1. Why Choose a Clustering Technique?
Cluster analysis programs are routinely run as a first step of data summary and for grouping genes in microarray data analysis. Gene expression data are noisy, with mixed expression patterns of down-regulated and up-regulated genes. We therefore present a comparative study of four clustering algorithms and two proximity measures applied to the widely used iris data, simulated data, and six real cancer gene expression data sets.
Bioinformatics Lab, Dept. of Statistics, University of Rajshahi
2. Some Objectives
• Find significant clusters according to similarities, intensities and regulation among their objects.
• Compare several methods of HC with K-means based on two proximity measures.
• Assess the quality and reliability of clustering by the Calinski-Harabasz (CH) and Davies-Bouldin (DB) indices.
Methods
1. Single Linkage or Nearest Neighbor Method
2. Complete Linkage or Furthest Neighbor Method
3. Average Linkage Method
4. K-means clustering
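As a minimal sketch of the two proximity measures compared in this study, Euclidean distance and Pearson correlation distance, assuming expression profiles stored as NumPy arrays (the variable names are illustrative, not from the thesis):

```python
import numpy as np

def euclidean_dist(x, y):
    # Standard Euclidean distance between two expression profiles
    return np.linalg.norm(x - y)

def pearson_dist(x, y):
    # Pearson correlation distance 1 - r: profiles with the same
    # pattern (regardless of scale or shift) get distance ~0
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - r

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # same pattern, different scale
print(euclidean_dist(x, y))  # large: magnitudes differ
print(pearson_dist(x, y))    # near zero: perfectly correlated
```

This illustrates why the two measures can rank gene pairs very differently: Euclidean distance reacts to expression magnitude, Pearson distance only to the shape of the pattern.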
Davies-Bouldin (DB) Index
The Davies-Bouldin index is a metric for evaluating clustering algorithms (Davies and Bouldin, 1979). It is an internal evaluation scheme and a cluster separation measure.
\[ DB = \frac{1}{k}\sum_{i=1}^{k}\max_{j\neq i} R_{i,j}, \qquad R_{i,j} = \frac{D_i + D_j}{d_{i,j}}, \qquad D_i = \frac{1}{|C_i|}\sum_{x\in C_i} d(x, c_i) \]
where $c_i$ is the centroid of cluster $C_i$, $D_i$ is the average distance of the points in $C_i$ to $c_i$, and $d_{i,j} = d(c_i, c_j)$ is the distance between centroids. Lower DB values indicate better clustering.
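The DB definition above can be sketched directly in NumPy (a minimal illustration on toy data, with hypothetical variable names, not the thesis data sets):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: lower values indicate more compact,
    better-separated clusters."""
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # D_i: mean distance of each cluster's points to its centroid
    D = np.array([np.mean(np.linalg.norm(X[labels == c] - centroids[i], axis=1))
                  for i, c in enumerate(clusters)])
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # R_{i,j} = (D_i + D_j) / d(c_i, c_j); take the worst (largest) ratio
        ratios = [(D[i] + D[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(k) if j != i]
        total += max(ratios)
    return total / k

# Two tight, well-separated toy clusters should give a small DB value
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(davies_bouldin(X, labels))
```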
Calinski-Harabasz (CH) Index
• The Calinski-Harabasz index (Calinski and Harabasz, 1974; Olatz et al., 2012) obtained the best results in the work of Milligan and Cooper (1985). It is a ratio-type index: cohesion is estimated from the distances of the points in a cluster to its centroid, and separation from the distances of the centroids to the global centroid. Here the index is used to estimate the number of clusters from an observations-by-variables matrix.
• The Calinski-Harabasz criterion is sometimes called the variance ratio criterion (VRC). The Calinski-Harabasz index is defined as
\[ CH = \frac{SS_B/(k-1)}{SS_W/(N-k)} \]
where $SS_B$ is the overall between-cluster variance, $SS_W$ is the overall within-cluster variance, k is the number of clusters, and N is the number of observations.
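The CH formula can likewise be sketched in NumPy (toy data and names are illustrative, not from the thesis):

```python
import numpy as np

def calinski_harabasz(X, labels):
    """CH = (SSB / (k - 1)) / (SSW / (N - k)); higher is better."""
    clusters = np.unique(labels)
    N, k = X.shape[0], len(clusters)
    overall = X.mean(axis=0)
    # Between-cluster variance: size-weighted squared distances of
    # cluster centroids to the global centroid
    ssb = sum(np.sum(labels == c) *
              np.sum((X[labels == c].mean(axis=0) - overall) ** 2)
              for c in clusters)
    # Within-cluster variance: squared distances of points to their centroid
    ssw = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
              for c in clusters)
    return (ssb / (k - 1)) / (ssw / (N - k))

# Two compact, well-separated clusters yield a large CH value
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(calinski_harabasz(X, labels))
```

Note the opposite orientations of the two indices: a good clustering has a large CH value but a small DB value.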
Data sets

Dataset            Chip   Tissue   n     #C   Class sizes       m       d
Armstrong-V2 [2]   Affy   Blood    72    3    24,20,28          12582   2194
Bhattacharjee [3]  Affy   Lung     203   5    139,17,6,21,20    12600   1543
Nutt-V1 [6]        Affy   Brain    50    4    14,7,14,15        12625   1377
Alizadeh-V2 [1]    cDNA   Blood    62    3    42,9,11           4022    2093
Garber [4]         cDNA   Lung     66    4    17,40,4,5         24192   4553
Liang [5]          cDNA   Brain    37    3    28,6,3            24192   1411
In this example, the objects g1, g2, g3, g4, g5, g6, g7, g8, g9 and g10 have been clustered. The places at the bottom of the tree, where the object names are written, are called leaves; the junctions are called nodes. A hierarchical clustering algorithm can be used to find groups in the data by cutting the tree at a certain height. For instance, in this example one might identify two groups, (g2, g3, g1, g8) and (g6, g10, g5, g7, g4, g9); three groups, (g2, g3, g1, g8), (g6, g10) and (g5, g7, g4, g9); or ten groups, each containing a single leaf.
Fig: Dendrogram cut at heights yielding 2 and 3 clusters
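The tree-cutting procedure described above can be sketched with SciPy's hierarchical clustering routines (a minimal illustration on synthetic data, not the thesis data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two synthetic "expression" groups of 5 objects each, 4 variables
X = np.vstack([rng.normal(0.0, 0.3, (5, 4)),
               rng.normal(3.0, 0.3, (5, 4))])

# Build the tree with complete linkage and Euclidean distance
Z = linkage(X, method="complete", metric="euclidean")

# Cutting the tree at different heights yields different group counts
two_groups = fcluster(Z, t=2, criterion="maxclust")
three_groups = fcluster(Z, t=3, criterion="maxclust")
print(len(set(two_groups)), len(set(three_groups)))
```

Changing `method` to `"single"` or `"average"` reproduces the other linkage variants compared in this study.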
Hierarchical Clustering of Simulated Data
The green dendrogram shows the best result, and we construct a heat map from this method; i.e., complete-linkage HC with Euclidean distance gives better results than the other methods.
Fig: Heat map
K-means of Simulated Data
Table: Davies-Bouldin index

No. of clusters   K=2     K=3        K=4          K=5
Cluster sizes     20,40   20,20,20   12,20,8,20   4,4,12,20,20
DB index          0.897   0.321      0.797        0.825

From the table we see that the DB index attains its lowest value at k = 3. We may therefore conclude that three clusters are present in this data set.
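The model-selection procedure on this slide, running K-means for several values of k and keeping the k with the smallest DB index, can be sketched with scikit-learn (illustrative synthetic data, not the simulated data set above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(1)
# Three well-separated synthetic clusters of 20 points each
X = np.vstack([rng.normal(c, 0.2, (20, 2)) for c in (0.0, 3.0, 6.0)])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

# The DB index is minimized at the best-supported number of clusters
best_k = min(scores, key=scores.get)
print(best_k)
```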
HC of Armstrong-V2 Data
Fig: Dendrogram, panel (d); cluster sizes 24, 20, 28
Several HC of Nutt-V1 Data
Fig: Dendrograms, panel (c); cluster sizes 14, 7, 14, 15
The table shows that the DB index attains its lowest value at k = 3, with cluster sizes 41, 10 and 11, against actual sizes 42, 9 and 11. We may therefore conclude that three clusters are present in the Alizadeh-V2 data.
Table: Davies-Bouldin index
No. of clusters   K=2     K=3        K=4           K=5
Cluster sizes     44,18   41,10,11   11,20,17,14   22,9,3,10,18
DB index          2.708   1.774      2.477         2.328
K-means of Alizadeh-V2 Data
The table shows that the DB index attains its lowest value at k = 3, with cluster sizes 8, 26 and 3, against actual sizes 6, 28 and 3. We may therefore conclude that three clusters are present in the Liang data.
Table: Davies-Bouldin index
No. of clusters   K=2    K=3      K=4        K=5
Cluster sizes     29,8   8,26,3   6,9,3,19   1,2,19,14,1
DB index          1.231  1.124    2.091      1.215
K-means of Liang Data
Several HC of Liang Data
Fig: Dendrograms, panels (c, d, e, f); cluster sizes 28, 6, 3
Heat map of Liang Data
Fig: Heat map
Compare HC with K-means for Affymetrix data sets

Dataset         Distance    Cluster Method   Calinski-Harabasz (CH)
Armstrong-V2    Euclidean   Single           1.889
                Euclidean   Complete         11.803
                Euclidean   Average          6.674
                Pearson     Single           0.914
                Pearson     Complete         12.559
                Pearson     Average          10.393
                            K-means          11.943
Bhattacharjee   Euclidean   Single           1.786
                Euclidean   Complete         34.702
                Euclidean   Average          26.850
                Pearson     Single           1.700
                Pearson     Complete         26.512
                Pearson     Average          12.902
                            K-means          22.924
Nutt-V1         Euclidean   Single           3.167
                Euclidean   Complete         7.938
                Euclidean   Average          5.269
                Pearson     Single           0.941
                Pearson     Complete         4.273
                Pearson     Average          2.987
                            K-means          6.051
Compare HC with K-means for Affymetrix data sets by visualization technique
Fig: Mean of the CH index for the Affy chip, by cluster method (Single, Average, Complete, K-means) and proximity measure (Euclidean, Pearson)
From the graph we see that complete linkage with Euclidean distance achieves a mean CH index of 18.14, larger than Single, Average and K-means under either proximity measure. We may therefore conclude that the complete-linkage method gives the best result for the Affymetrix data sets.
Compare HC with K-means for cDNA data sets

Dataset        Distance    Cluster Method   Calinski-Harabasz (CH)
Alizadeh-V2    Euclidean   Single           2.047
               Euclidean   Complete         11.161
               Euclidean   Average          11.068
               Pearson     Single           0.980
               Pearson     Complete         11.229
               Pearson     Average          10.319
                           K-means          13.003
Garber         Euclidean   Single           2.772
               Euclidean   Complete         19.097
               Euclidean   Average          5.166
               Pearson     Single           0.855
               Pearson     Complete         7.693
               Pearson     Average          18.912
                           K-means          9.269
Liang          Euclidean   Single           9.057
               Euclidean   Complete         19.665
               Euclidean   Average          10.279
               Pearson     Single           19.665
               Pearson     Complete         19.665
               Pearson     Average          19.665
                           K-means          23.781
Compare HC with K-means for cDNA data sets by visualization technique
Fig: Mean of the CH index for the cDNA chip, by cluster method (Single, Complete, Average, K-means) and proximity measure (Euclidean, Pearson)
From the graph we see that K-means achieves a mean CH index of 17.01, larger than Single, Complete and Average under either proximity measure. We may therefore conclude that the K-means method gives the best result for the cDNA data sets.
Conclusions
Our results reveal that complete linkage with Euclidean distance exhibited the best performance for the Affymetrix data sets, while for the cDNA data sets K-means clustering exhibited the best performance in terms of recovering the true structure of the data sets. To the best of our knowledge, comparative studies of several HC methods and K-means using the CH and DB validity indices are poorly documented in the literature.
Future Research Interest
1. Comparison of hierarchical clustering methods with the Self-Organizing Maps method and other recent clustering methods.
2. Investigation of the performance of the different hierarchical clustering methods against other existing methods by false discovery rate (FDR), misclassification error rate (MER), receiver operating characteristic (ROC) and area under the ROC curve, using resampling techniques.
3. Comparison of both supervised and unsupervised methods for gene expression data.
Thank you
References
[1] Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM (2000); Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling; Nature. 403:503-511.
[2] Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ (2002); MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia; Nat Genet. 30:41-47.
[3] Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M,Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M (2001); Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses; Proc Natl Acad Sci USA. 98(24):13790-13795.
[4] Garber ME, Troyanskaya OG, Schluens K, Petersen S, Thaesler Z, Pacyna-Gengelbach M, Rijn M van de, Rosen GD, Perou CM, Whyte RI, Altman RB, Brown PO, Botstein D, Petersen I (2001); Diversity of gene expression in adenocarcinoma of the lung; Proc Natl Acad Sci USA. 98(24):13784-13789.
[5] Liang Y, Diehn M, Watson N, Bollen AW, Aldape KD, Nicholas MK, Lamborn KR, Berger MS, Botstein D, Brown PO, Israel MA (2005); Gene expression profiling reveals molecularly and clinically distinct subtypes of glioblastoma multiforme; Proc Natl Acad Sci USA. 102(16):5814-5819.
[6] Nutt CL, Mani DR, Betensky RA, Tamayo P, Cairncross JG, Ladd C, Pohl U, Hartmann C, McLaughlin ME, Batchelor TT, Black PM, von Deimling A, Pomeroy SL, Golub TR, Louis DN (2003); Gene expression-based classification of malignant gliomas correlates better with survival than histological classification; Cancer Res. 63(7):1602-1607.