Agenda
1. Introduction to clustering
   1.1 Dissimilarity measure
   1.2 Preprocessing
2. Clustering methods
   2.1 Hierarchical clustering
   2.2 K-means and K-medoids
   2.3 Self-organizing maps (SOM)
   2.4 Model-based clustering
3. Estimating the number of clusters
4. Two new methods allowing scattered genes
   4.1 Tight clustering
   4.2 Penalized and weighted K-means
5. Cluster validation and evaluation
6. Comparison and discussion
4.1 Tight clustering
K-means clustering looks informative.
[Panels: k = 10, k = 15, k = 30]
A closer look, however, finds a lot of noise in each cluster.
A common situation for gene clustering in microarray:
Challenge 1: lots of scattered genes, i.e. genes that do not belong to any tight cluster of biological function.
Main challenges for clustering in microarray
[Figure: two scatter plots (x vs y) of simulated data showing clusters plus scattered points.]
4.1 Tight clustering
Challenge 2: Microarray is an exploratory tool to guide further biological experiments.
Hypothesis-driven: hypothesis => experimental data.
Data-driven: high-throughput experiment => data mining => hypothesis => further validation experiment.
It is important to provide the most informative clusters instead of lots of loose clusters (to reduce false positives).
Main challenges for clustering in microarray
4.1 Tight clustering
Traditional: estimate the number of clusters, k (except for hierarchical clustering), then perform clustering by assigning all genes into clusters.
Tight clustering: directly identify informative, tight and stable clusters of reasonable size, say 20-60 genes. No need to estimate k, and no need to assign all genes into clusters.
4.1 Tight clustering
[Figure: toy illustration of sub-sampling. Panels: subsample 1 and subsample 2 (sampled points marked x), the cluster judgement of all points by subsample 1, the judgement by subsample 2, and the whole data with points 1-11 labeled.]
Basic idea:
Original data X -> random sub-sample X' -> K-means -> cluster centers C(X', k) = (C_1, ..., C_k) -> co-membership matrix D[C(X', k), X]
4.1 Tight clustering
• X = {x_ij}_{n×d}: data to be clustered.
• X' = {x'_ij}_{(n/2)×d}: random sub-sample.
• C(X', k) = (C_1, C_2, ..., C_k): the cluster centers obtained from clustering X' into k clusters.
• D[C(X', k), X]: an n×n matrix denoting the co-membership relations of X classified by C(X', k) (Tibshirani 2001):
  D[C(X', k), X]_ij = 1 if i and j are in the same cluster, and 0 otherwise.
• s(V_i, V_j) = |V_i ∩ V_j| / |V_i ∪ V_j|: a measure of similarity between two sets of genes.
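As a sketch in Python (illustrative; `labels[i]` is the cluster that point i falls into after classifying X by the centers C(X', k)), the co-membership matrix D is:

```python
import numpy as np

def co_membership(labels):
    """n x n matrix D: D[i, j] = 1 if points i and j fall in the same cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

# Toy example: points 0-2 in one cluster, points 3-4 in another.
D = co_membership([0, 0, 0, 1, 1])
print(D[0, 1], D[0, 3], D[3, 4])  # 1.0 0.0 1.0
```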
4.1 Tight clustering
Algorithm 1 (when fixing k):
1. Fix k. Take random sub-samples X^(1), ..., X^(B). Define the average co-membership matrix
   D̄ = mean( D[C(X^(1), k), X], ..., D[C(X^(B), k), X] ).
Note:
a. D̄_ij = 1: i and j are clustered together in every sub-sampling judgement.
b. D̄_ij = 0: i and j are never clustered together in any sub-sampling judgement.
c. D̄_ii = 1 for all i.
4.1 Tight clustering
Algorithm 1 (when fixing k, cont'd):
2. Search for a large set of points V = {v_1, ..., v_m} ⊆ {1, ..., n} such that D̄_{v_i v_j} ≥ 1 - α for all v_i, v_j in V, with α close to 0. Sets with this property are candidates for tight clusters. Order such sets by their size to obtain V_{k1}, V_{k2}, ...
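Algorithm 1 (average co-membership over B sub-sampling judgements, then search for a tight candidate set) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the greedy `tight_candidate` search is our simplification:

```python
import numpy as np

def average_comembership(judgements):
    """Mean of the co-membership matrices over B sub-sampling judgements."""
    mats = []
    for labels in judgements:
        labels = np.asarray(labels)
        mats.append((labels[:, None] == labels[None, :]).astype(float))
    return np.mean(mats, axis=0)

def tight_candidate(D_bar, alpha=0.1):
    """Greedy search (one seed per point) for a large set V whose pairwise
    average co-membership is always >= 1 - alpha; returns the largest V found."""
    n = D_bar.shape[0]
    best = []
    for seed in range(n):
        V = [seed]
        for i in range(n):
            if i != seed and all(D_bar[i, j] >= 1 - alpha for j in V):
                V.append(i)
        if len(V) > len(best):
            best = V
    return sorted(best)

# Points 0-4 always co-cluster, 5-9 always co-cluster, point 10 switches sides.
judgements = [[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
              [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]]
D_bar = average_comembership(judgements)
print(tight_candidate(D_bar, alpha=0.1))  # [0, 1, 2, 3, 4]
```

The ambivalent point 10 has average co-membership 0.5 with everyone and is excluded from both tight candidates.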
4.1 Tight clustering
[Figure: ten sub-sampling judgements of 11 points. Points 1-5 are always clustered together and points 6-10 are always clustered together, while point 11 joins each of the two groups in half of the judgements.]

The resulting average co-membership matrix D̄ (rows and columns 1-11):
• D̄_ij = 1.0 when i and j are both in {1,...,5} or both in {6,...,10};
• D̄_ij = 0 when i and j lie in different groups;
• D̄_{i,11} = D̄_{11,j} = 0.5 for i, j ≤ 10, and D̄_{11,11} = 1.0.
Points 1-5 and points 6-10 therefore form tight candidate clusters, while point 11 joins neither.

[Figure: top three candidate clusters V_{k,1}, V_{k,2}, V_{k,3} for consecutive k = k0, k0+1, k0+2, k0+3, with similarities s between candidates of adjacent k. Stable tight clusters appear as chains of candidates with similarity close to 1 (e.g. 0.95, 1), while unstable candidates show low similarities (e.g. 0, 0.01, 0.23).]
Tight Clustering Algorithm: (relax estimation of k)
4.1 Tight clustering
Tight Clustering Algorithm:
1. Start with a suitable k0. Search over consecutive k's and choose the top three candidate clusters for each k:
   {V_{k0,1}, V_{k0,2}, V_{k0,3}}, {V_{(k0+1),1}, V_{(k0+1),2}, V_{(k0+1),3}}, ...
2. Stop at k' when s(V_{k',l}, V_{(k'+1),m}) and s(V_{(k'+1),m}, V_{(k'+2),n}) are close to 1 (at least β) for some l, m, n in {1, 2, 3}. Select V_{(k'+1),m} to be the tightest cluster.
4.1 Tight clustering
Tight Clustering Algorithm (cont'd):
3. Identify the tightest cluster and remove it from the whole data.
4. Decrease k0 by 1. Repeat steps 1-3 to identify the next tight cluster.
Remark: α, β and k0 determine the tightness and size of the resulting clusters.
4.1 Tight clustering
A simple simulation in 2-D: 14 normally distributed clusters (50 points each) plus 175 scattered points. SD = 0.1, 0.2, ..., 1.4.
[Figure: scatter plot (x vs y) of the simulated data.]
4.1 Tight clustering: Example
Tight clustering on simulated data (α = 0, β = 0.7, B = 10, k0 = 10, 20, 25 and 40):

         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  remain
truth   50  50  50  50  50  50  50  50  50  50  50  50  50  50      175
k0=10   58  59  59  78  72  60                                      489
k0=20   59  56  55  53  57  53  53  52  52  52  56  51  51  51  12  112
k0=25   55  56  53  56  53  53  52  55  51  51  51  50  50  50   9  130
k0=40   52  51  51  52  51  51  51  50  26  25  22  50  18  17  30  278
4.1 Tight clustering: Example
[Figure: sequence of scatter plots showing the tight clusters extracted one at a time from the simulated data; k0 = 25, α = 0, β = 0.7, B = 10.]
4.1 Tight clustering: Example
Gene expression during the life cycle of Drosophila melanogaster. (2002) Science 297:2270-2275
• 4028 genes monitored. Reference sample is pooled from all samples.
• 66 sequential time points spanning embryonic (E), larval (L), pupal (P) and adult (A) periods.
• Filter genes without significant pattern (1100 genes) and standardize each gene to have mean 0 and stdev 1.
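The per-gene standardization step (each gene scaled to mean 0 and SD 1 across samples) looks like this on a toy matrix (our toy data, genes in rows):

```python
import numpy as np

expr = np.array([[2.0, 4.0, 6.0],
                 [1.0, 1.0, 4.0]])   # toy log-expression matrix: one row per gene

standardized = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
print(standardized.mean(axis=1))   # ~[0, 0]
print(standardized.std(axis=1))    # ~[1, 1]
```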
4.1 Tight clustering: Example
Comparison of various K-means runs and tight clustering: seven mini-chromosome maintenance (MCM) deficient genes.
[Panels: K-means with k = 30, k = 50, k = 70, k = 100; tight clustering.]
4.1 Tight clustering
TightClust software download: http://www.pitt.edu/~ctseng/research/tightClust_download.html
Scattered (noisy) genes

Formulation: K-means
K-means criterion: minimize the within-cluster sum of squared dispersion to obtain C:

  W_{K-means}(C; k) = Σ_{j=1}^k Σ_{x_i in C_j} ||x_i - x̄^(j)||²,

where x̄^(j) is the center of cluster j.

K-medoids criterion:

  W_{K-medoids}(C; k) = Σ_{j=1}^k Σ_{x_i in C_j} d(x_i, x^(j)),

where x^(j) in X is the median point (medoid) of cluster j.
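The two criteria can be compared on a toy data set (a minimal sketch, not the lecture's implementation; the medoid is chosen as the cluster point minimizing the total distance to the others):

```python
import numpy as np

def w_kmeans(X, labels):
    """Sum of squared distances to each cluster's mean (centroid)."""
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def w_kmedoids(X, labels):
    """Sum of distances to each cluster's best medoid (an actual data point)."""
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        total += dists.sum(axis=0).min()   # medoid minimizes total distance
    return total

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([0, 0, 1, 1])
print(w_kmeans(X, labels), w_kmedoids(X, labels))  # 1.0 2.0
```

The medoid criterion is larger here because each medoid must be a data point, not the midpoint between the two points of a cluster.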
4.2 Penalized and weighted K-means
Formulation: K-means

  W_{K-means}(C; k) = Σ_{j=1}^k Σ_{x_i in C_j} ||x_i - x̄^(j)||²,

where x̄^(j) is the center of cluster j.

Proposition: K-means is a special case of CML (classification maximum likelihood) under a Gaussian model of identical spherical clusters.
• K-means: minimize W_{K-means}(C; k).
• CML: maximize the classification likelihood Π_{j=1}^k Π_{x_i in C_j} φ(x_i; μ_j, σ²I); for identical spherical Gaussian clusters this is equivalent to minimizing W_{K-means}.
4.2 Penalized and weighted K-means
Goal 1: allow a set of scattered genes to remain unclustered.
Goal 2: incorporate prior information into cluster formation.
4.2 Penalized and weighted K-means
Formulation: PW-Kmeans

  W(C, S; k) = Σ_{j=1}^k Σ_{x_i in C_j} w(x_i; P) · d(x_i, C_j) + λ · |S|

• d(x_i, C_j): dispersion of point x_i in C_j.
• |S|: number of objects in the noise set S.
• w(x_i; P): weight function to incorporate prior information P.
• λ: a tuning parameter.
• Penalty term λ|S|: allows outlying objects of a cluster to be assigned to the noise set S.
• Weighting term w: utilizes prior knowledge of preferred or prohibited patterns P.
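A sketch of evaluating a loss of this shape (weighted within-cluster dispersion plus a λ·|S| penalty); the function name and the convention `labels[i] == -1` for noise-set membership are ours, not the paper's:

```python
import numpy as np

def pw_kmeans_loss(X, labels, centers, weights, lam):
    """Weighted within-cluster squared dispersion plus a lambda * |S| penalty.
    labels[i] == -1 marks object i as belonging to the noise set S."""
    in_cluster = labels >= 0
    disp = ((X[in_cluster] - centers[labels[in_cluster]]) ** 2).sum(axis=1)
    return float((weights[in_cluster] * disp).sum() + lam * (~in_cluster).sum())

X = np.array([[0.0], [0.2], [10.0]])
centers = np.array([[0.1]])
weights = np.ones(3)            # uniform weights = no prior information
labels = np.array([0, 0, -1])   # the outlying point is left in the noise set S
print(pw_kmeans_loss(X, labels, centers, weights, lam=1.0))  # ~1.02
```

Leaving the outlier in S costs λ = 1 here; forcing it into the cluster would instead cost its squared distance (10 - 0.1)² ≈ 98, so the penalized loss prefers the noise set.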
4.2 Penalized and weighted K-means
Properties of PW-Kmeans

Proposition: Denote by (C*(λ, k), S*(λ, k)) = (C_1*(λ, k), ..., C_k*(λ, k), S*(λ, k)) the minimizer for given λ and k.
1. If λ_1 ≤ λ_2, then W(C*(λ_1, k), S*(λ_1, k); λ_1, k) ≤ W(C*(λ_2, k), S*(λ_2, k); λ_2, k).
2. If λ_1 ≤ λ_2, then S*(λ_1, k) ⊇ S*(λ_2, k).
3. If k_1 ≤ k_2, then W(C*(λ, k_1), S*(λ, k_1); λ, k_1) ≥ W(C*(λ, k_2), S*(λ, k_2); λ, k_2).
Properties of PW-Kmeans: relation to classification likelihood

P-Kmeans loss function (w ≡ 1):

  W(C, S; k) = Σ_{j=1}^k Σ_{x_i in C_j} ||x_i - x̄^(j)||² + λ · |S|

Classification likelihood (Gaussian model for the clusters, uniform model for the noise):

  L_C = Π_{j=1}^k Π_{x_i in C_j} φ(x_i; μ_j, σ²I) × Π_{x_i in S} (1/|V|),

where V is the space over which the noise set is uniformly distributed. Minimizing the P-Kmeans loss corresponds to maximizing this classification likelihood, with λ determined by σ and |V|.
4.2 Penalized and weighted K-means
Prior information: six groups of validated cell cycle genes.
Formulation: PW-Kmeans
4.2 Penalized and weighted K-means
ICSA 06/15/2006
Formulation: PW-Kmeans. A special example of PW-Kmeans for microarray:
• Prior knowledge of p pathways.
• The weight is designed as a transformation of the logistic function.

Design of the weight function
4.2 Penalized and weighted K-means
Prior information: six groups of validated cell cycle genes, including 8 histone genes tightly coregulated in S phase.
Application: Yeast cell cycle expression
4.2 Penalized and weighted K-means

Penalized K-means with no prior information:
Cluster sizes: C1 = 58, C2 = 31, C3 = 39, C4 = 101, C5 = 71; noise set S = 1276.
The 8 histone genes are left in the noise set S without being clustered.
Application: Yeast cell cycle expression
PW-Kmeans: take three randomly selected histone genes as prior information, P.
Cluster sizes: C1 = 112, C2 = 158, C3 = 88, C4 = 139, C5 = 57; noise set S = 1109.
The 8 histone genes are now in cluster C3.
5. Cluster evaluation
• Evaluation and comparison of clustering methods is always difficult.
• In supervised learning (classification), the class labels (underlying truth) are known and performance can be evaluated through cross validation.
• In unsupervised learning (clustering), external validation is usually not available.
• Ideal data for cluster evaluation:
  - Data with class/tumor labels (for clustering samples)
  - Cell cycle data (for clustering genes)
  - Simulated data
5. Cluster evaluation
Rand index (Rand 1971):
Y = {(a,b,c), (d,e,f)}, Y' = {(a,b), (c,d,e), (f)}
Of the 15 pairs of objects:
• together in both (2): ab, de
• separate in both (7): ad, ae, af, bd, be, bf, cf
• discordant (6): ac, bc, cd, ce, df, ef
Rand index: c(Y, Y') = (2 + 7)/15 = 0.6 (proportion of concordant pairs).
Notes:
1. 0 ≤ c(Y, Y') ≤ 1.
2. Clustering methods can be evaluated by c(Y, Y_truth) if Y_truth is available.
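The slide's example can be checked directly (a minimal sketch; the two partitions are encoded as label dictionaries):

```python
from itertools import combinations

def rand_index(Y1, Y2):
    """Fraction of object pairs on which two clusterings agree
    (together in both, or separate in both)."""
    items = sorted(Y1)
    pairs = list(combinations(items, 2))
    agree = 0
    for a, b in pairs:
        same1 = Y1[a] == Y1[b]
        same2 = Y2[a] == Y2[b]
        agree += same1 == same2
    return agree / len(pairs)

# The slide's example: Y = {(a,b,c), (d,e,f)}, Y' = {(a,b), (c,d,e), (f)}.
Y  = dict(a=1, b=1, c=1, d=2, e=2, f=2)
Yp = dict(a=1, b=1, c=2, d=2, e=2, f=3)
print(rand_index(Y, Yp))  # 0.6
```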
5. Cluster evaluation
Adjusted Rand index (Hubert and Arabie 1985):

  Adjusted Rand index = (index - expected index) / (maximum index - expected index)

The adjusted Rand index takes maximum value 1 and has constant expected value 0 when the two clusterings are completely independent.
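The adjusted Rand index can be computed from the pair-count contingency table (a minimal sketch of the Hubert-Arabie formula; not library code):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels1, labels2):
    """(index - expected index) / (maximum index - expected index),
    computed from pair counts of the contingency table."""
    n = len(labels1)
    sum_ij = sum(comb(v, 2) for v in Counter(zip(labels1, labels2)).values())
    sum_a = sum(comb(v, 2) for v in Counter(labels1).values())
    sum_b = sum(comb(v, 2) for v in Counter(labels2).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

print(adjusted_rand_index([1, 1, 2, 2], [1, 1, 2, 2]))  # 1.0 for identical clusterings
print(adjusted_rand_index([1, 1, 2, 2], [1, 2, 1, 2]))  # negative: worse than chance
```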
Advantage / Disadvantage

Hierarchical clustering
• Advantages: intuitive algorithm; good interpretability; no need to estimate the number of clusters.
• Disadvantages: very vulnerable to outliers; tree not unique, and genes closer in the tree are not necessarily more similar; hard to read when the tree is big.

K-means
• Advantages: simplified Gaussian mixture model; normally produces nice clusters.
• Disadvantages: local minima; need to estimate the number of clusters.

SOM
• Advantages: clusters have an interpretation on a 2-D geometry (more interpretable).
• Disadvantages: very heuristic algorithm; solution sub-optimal due to the 2-D geometry restriction.

Model-based clustering
• Advantages: flexibility in cluster structure; rigorous statistical inference.
• Disadvantages: model selection usually difficult; local minimum problem.

Tight clustering
• Advantages: allows genes to remain unclustered and produces only tight clusters; eases the problem of accurately estimating the number of clusters; biologically more meaningful.
• Disadvantages: slower computation when the data are large.
6. Comparison
[Figure: heatmap of the simulated data, 15 clusters by 20 samples.]
Simulation:
• 20 time-course samples for each gene.
• In each cluster, four groups of samples with similar intensity.
• Individual sample and gene variation are added.
• The number of genes in each cluster ~ Poisson(10).
• Scattered (noise) genes are added.
• The simulated data closely resembles real data by visualization.
6. Comparison
Thalamuthu et al. 2006
Different types of perturbations:
• Type I: a number of randomly simulated scattered genes (0, 5, 10, 20, 60, 100 and 200% of the original total number of clustered genes) is added. E.g., for sample j of a scattered gene, the expression level is randomly sampled from the empirical distribution of the expressions of all clustered genes in sample j.
• Type II: to each element of the log-transformed expression matrix, a small random error from a normal distribution (SD = 0.05, 0.1, 0.2, 0.4, 0.8, 1.2) is added, to evaluate robustness of the clustering against potential random errors.
• Type III: combination of types I and II.
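The two perturbation types can be sketched as follows (toy data; the function names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(42)
expr = rng.normal(size=(100, 20))   # toy log-expression matrix: 100 genes x 20 samples

def perturb_type2(expr, sd):
    """Type II: add i.i.d. Gaussian errors to every entry of the matrix."""
    return expr + rng.normal(scale=sd, size=expr.shape)

def perturb_type1(expr, frac):
    """Type I: append scattered genes; the value in sample j is drawn from the
    empirical distribution of the clustered genes' expressions in sample j."""
    n_new = int(frac * expr.shape[0])
    cols = [rng.choice(expr[:, j], size=n_new) for j in range(expr.shape[1])]
    return np.vstack([expr, np.column_stack(cols)])

noisy = perturb_type2(expr, sd=0.2)
bigger = perturb_type1(expr, frac=0.1)
print(noisy.shape, bigger.shape)  # (100, 20) (110, 20)
```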
6. ComparisonDifferent degree of perturbation in the simulated microarray data
6. Comparison
Simulation schemes performed in the paper: in total, 25 (simulation settings) × 100 (data sets) = 2500 data sets are evaluated.
• Adjusted Rand index: a measure of similarity between two clusterings.
• Compare each clustering result to the underlying true clusters and obtain the adjusted Rand index (the higher the better).
6. Comparison
T: tight clustering; M: model-based; P: K-medoids; K: K-means; H: hierarchical; S: SOM
Consensus Clustering
Simpson et al. BMC Bioinformatics 2010 11:590 doi:10.1186/1471-2105-11-590
• Consensus clustering with PAM (blue)
• Consensus clustering with hierarchical clustering (red)
• HOPACH (black)
• Fuzzy c-means (green)
6. Comparison
Comparison in real data sets:(see paper for detailed comparison criteria)
6. Comparison
Tight clustering:
• George C. Tseng and Wing H. Wong. (2005) Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics 61:10-16.
Penalized and weighted K-means:
• George C. Tseng. (2007) Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics 23:2247-2255.
Comparative study:
• Anbupalam Thalamuthu*, Indranil Mukhopadhyay*, Xiaojing Zheng* and George C. Tseng. (2006) Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics 22:2405-2412.
References
• Despite many sophisticated methods for detecting regulatory interactions (e.g. Shortest-path and Liquid Association), cluster analysis remains a useful routine in microarray analysis.
• We should use these methods for visualization, investigation and hypothesis generation.
• We should not use these methods inferentially.
• In general, methods with resampling evaluation, allowing scattered genes and related to model-based approach are better.
• Hierarchical clustering in particular merely provides a picture, from which one can draw many (indeed almost any) conclusions.
6. Conclusion
Common mistakes or warnings:
1. Running K-means with a large k and getting excited at the patterns, without further investigation. K-means can show apparent patterns even in randomly generated data, and human eyes tend to see "patterns" anyway.
2. Identifying genes predictive of survival (e.g. applying t-statistics to long and short survivors), clustering samples based on the selected genes, and finding that the samples cluster according to survival status.
The gene selection procedure is already biased towards the desired result.
6. Conclusion
Common mistakes (cont'd):
3. Clustering samples into k groups, then performing an F-test to identify genes differentially expressed among the subgroups.
The data have been re-used both for clustering and for identifying differentially expressed genes. You will always obtain a set of "differentially expressed" genes, but cannot tell whether they are real or due to chance.
6. Conclusion
Post-cluster analysis
• Identify novel genes participating in known cellular process
• Enrichment of particular Gene Ontology (GO) terms in clusters
• Motif finding in clusters
6. Conclusion
• Rand WM: Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 1971, 66:846-850.
• Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2, 193–218, 1985.
• Dudoit S, Fridlyand J. (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology. 3(7):0036.1-0036.21.
• Dudoit S, Fridlyand J. (2003). Bagging to improve the accuracy of a clustering procedure. Bioinformatics. 19(9):1090-9.
• Tibshirani R, Walther G and Hastie T. (2001) Estimating the number of clusters in a data set via the gap statistic. JRSSB 63:411-423.
• Fraley,C. and Raftery,A.E. (2002) Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc., 97, 611–631.
• McLachlan,G.J. et al. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 413–422.
• Medvedovic,M. and Sivaganesan,S. (2002) Bayesian infinite mixture model-based clustering of gene expression profiles. Bioinformatics, 18, 1194–1206.
• Milligan,G.W. and Cooper,M.C. (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.
More references
Software and Packages
Methods in R:
• Hierarchical clustering: hclust
• K-means: kmeans
• Self-organizing maps: som (package som)
• Model-based clustering: EMclust (package mclust)

Individual packages:
• GeneCluster 2.0: http://www.broad.mit.edu/cancer/software/software.html
• Cluster (Mike Eisen): http://rana.lbl.gov/EisenSoftware.htm
• Tight Clustering and PW-Kmeans: http://www.biostat.pitt.edu/bioinfo/software.htm
• PAM: http://www-stat.stanford.edu/~tibs/PAM/index.html
• Expander and Click: http://www.cs.tau.ac.il/~rshamir/expander/expander.html