Agenda
1. Introduction to clustering
   1.1 Dissimilarity measure
   1.2 Preprocessing
2. Clustering methods
   2.1 Hierarchical clustering
   2.2 K-means and K-medoids
   2.3 Self-organizing maps (SOM)
   2.4 Model-based clustering
3. Estimating the number of clusters
4. Two new methods allowing scattered genes
   4.1 Tight clustering
   4.2 Penalized and weighted K-means
5. Cluster validation and evaluation
6. Comparison and discussion
4.1 Tight clustering
K-means clustering looks informative.
[Panels: k = 10, k = 15, k = 30]
A closer look, however, finds a lot of noise in each cluster.
A common situation for gene clustering in microarray:
Challenge 1: lots of scattered genes, i.e. genes that do not belong to any tight cluster of biological function.
Main challenges for clustering in microarray
[Figure: two scatter plots (x vs y) of simulated data showing clusters plus scattered points.]
4.1 Tight clustering
Challenge 2: Microarray is an exploratory tool to guide further biological experiments.
Hypothesis-driven: hypothesis => experimental data.
Data-driven: high-throughput experiment => data mining => hypothesis => further validation experiment.
It is important to provide the most informative clusters instead of lots of loose clusters (to reduce false positives).
Main challenges for clustering in microarray
4.1 Tight clustering
Traditional: estimate the number of clusters, k (except for hierarchical clustering), then perform clustering by assigning all genes into clusters.
Tight clustering: directly identify informative, tight and stable clusters of reasonable size, say 20-60 genes. No need to estimate k, and no need to assign all genes into clusters.
4.1 Tight clustering
[Figure: toy illustration of sub-sampling. Panels: subsample 1 and subsample 2 (sampled points marked x), the cluster judgement of all points by subsample 1, the judgement by subsample 2, and the whole data with points 1-11 labeled.]
Basic idea:
Original data X -> random sub-sample X' -> K-means -> cluster centers C(X', k) = (C_1, ..., C_k) -> co-membership matrix D[C(X', k), X]
4.1 Tight clustering
• X = {x_ij}_{n×d}: data to be clustered.
• X' = {x'_ij}_{(n/2)×d}: random sub-sample.
• C(X', k) = (C_1, C_2, ..., C_k): the cluster centers obtained from clustering X' into k clusters.
• D[C(X', k), X]: an n×n matrix denoting the co-membership relations of X classified by C(X', k) (Tibshirani 2001):
  D[C(X', k), X]_ij = 1 if i and j are in the same cluster, and 0 otherwise.
• s(V_i, V_j) = |V_i ∩ V_j| / |V_i ∪ V_j|: a measure of similarity between two sets of genes.
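As a sketch in Python (illustrative; `labels[i]` is the cluster that point i falls into after classifying X by the centers C(X', k)), the co-membership matrix D is:

```python
import numpy as np

def co_membership(labels):
    """n x n matrix D: D[i, j] = 1 if points i and j fall in the same cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

# Toy example: points 0-2 in one cluster, points 3-4 in another.
D = co_membership([0, 0, 0, 1, 1])
print(D[0, 1], D[0, 3], D[3, 4])  # 1.0 0.0 1.0
```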
4.1 Tight clustering
Algorithm 1 (when fixing k):
1. Fix k. Take random sub-samples X^(1), ..., X^(B). Define the average co-membership matrix
   D̄ = mean( D[C(X^(1), k), X], ..., D[C(X^(B), k), X] ).
Note:
a. D̄_ij = 1: i and j are clustered together in every sub-sampling judgement.
b. D̄_ij = 0: i and j are never clustered together in any sub-sampling judgement.
c. D̄_ii = 1 for all i.
4.1 Tight clustering
Algorithm 1 (when fixing k, cont'd):
2. Search for a large set of points V = {v_1, ..., v_m} ⊆ {1, ..., n} such that D̄_{v_i v_j} ≥ 1 - α for all v_i, v_j in V, with α close to 0. Sets with this property are candidates for tight clusters. Order such sets by their size to obtain V_{k1}, V_{k2}, ...
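Algorithm 1 (average co-membership over B sub-sampling judgements, then search for a tight candidate set) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the greedy `tight_candidate` search is our simplification:

```python
import numpy as np

def average_comembership(judgements):
    """Mean of the co-membership matrices over B sub-sampling judgements."""
    mats = []
    for labels in judgements:
        labels = np.asarray(labels)
        mats.append((labels[:, None] == labels[None, :]).astype(float))
    return np.mean(mats, axis=0)

def tight_candidate(D_bar, alpha=0.1):
    """Greedy search (one seed per point) for a large set V whose pairwise
    average co-membership is always >= 1 - alpha; returns the largest V found."""
    n = D_bar.shape[0]
    best = []
    for seed in range(n):
        V = [seed]
        for i in range(n):
            if i != seed and all(D_bar[i, j] >= 1 - alpha for j in V):
                V.append(i)
        if len(V) > len(best):
            best = V
    return sorted(best)

# Points 0-4 always co-cluster, 5-9 always co-cluster, point 10 switches sides.
judgements = [[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
              [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]]
D_bar = average_comembership(judgements)
print(tight_candidate(D_bar, alpha=0.1))  # [0, 1, 2, 3, 4]
```

The ambivalent point 10 has average co-membership 0.5 with everyone and is excluded from both tight candidates.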
4.1 Tight clustering
[Figure: ten sub-sampling judgements of 11 points. Points 1-5 are always clustered together and points 6-10 are always clustered together, while point 11 joins each of the two groups in half of the judgements.]

The resulting average co-membership matrix D̄ (rows and columns 1-11):
• D̄_ij = 1.0 when i and j are both in {1,...,5} or both in {6,...,10};
• D̄_ij = 0 when i and j lie in different groups;
• D̄_{i,11} = D̄_{11,j} = 0.5 for i, j ≤ 10, and D̄_{11,11} = 1.0.
Points 1-5 and points 6-10 therefore form tight candidate clusters, while point 11 joins neither.

[Figure: top three candidate clusters V_{k,1}, V_{k,2}, V_{k,3} for consecutive k = k0, k0+1, k0+2, k0+3, with similarities s between candidates of adjacent k. Stable tight clusters appear as chains of candidates with similarity close to 1 (e.g. 0.95, 1), while unstable candidates show low similarities (e.g. 0, 0.01, 0.23).]
Tight Clustering Algorithm: (relax estimation of k)
4.1 Tight clustering
Tight Clustering Algorithm:
1. Start with a suitable k0. Search over consecutive k's and choose the top three candidate clusters for each k:
   {V_{k0,1}, V_{k0,2}, V_{k0,3}}, {V_{(k0+1),1}, V_{(k0+1),2}, V_{(k0+1),3}}, ...
2. Stop at k' when s(V_{k',l}, V_{(k'+1),m}) and s(V_{(k'+1),m}, V_{(k'+2),n}) are close to 1 (at least β) for some l, m, n in {1, 2, 3}. Select V_{(k'+1),m} to be the tightest cluster.
4.1 Tight clustering
Tight Clustering Algorithm (cont'd):
3. Identify the tightest cluster and remove it from the whole data.
4. Decrease k0 by 1. Repeat steps 1-3 to identify the next tight cluster.
Remark: α, β and k0 determine the tightness and size of the resulting clusters.
4.1 Tight clustering
A simple simulation in 2-D: 14 normally distributed clusters (50 points each) plus 175 scattered points. SD = 0.1, 0.2, ..., 1.4.
[Figure: scatter plot (x vs y) of the simulated data.]
4.1 Tight clustering: Example
Tight clustering on simulated data (α = 0, β = 0.7, B = 10, k0 = 10, 20, 25 and 40):

         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  remain
truth   50  50  50  50  50  50  50  50  50  50  50  50  50  50      175
k0=10   58  59  59  78  72  60                                      489
k0=20   59  56  55  53  57  53  53  52  52  52  56  51  51  51  12  112
k0=25   55  56  53  56  53  53  52  55  51  51  51  50  50  50   9  130
k0=40   52  51  51  52  51  51  51  50  26  25  22  50  18  17  30  278
4.1 Tight clustering: Example
[Figure: sequence of scatter plots showing the tight clusters extracted one at a time from the simulated data; k0 = 25, α = 0, β = 0.7, B = 10.]
4.1 Tight clustering: Example
Gene expression during the life cycle of Drosophila melanogaster. (2002) Science 297:2270-2275
• 4028 genes monitored. Reference sample is pooled from all samples.
• 66 sequential time points spanning embryonic (E), larval (L), pupal (P) and adult (A) periods.
• Filter genes without significant pattern (1100 genes) and standardize each gene to have mean 0 and stdev 1.
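The per-gene standardization step (each gene scaled to mean 0 and SD 1 across samples) looks like this on a toy matrix (our toy data, genes in rows):

```python
import numpy as np

expr = np.array([[2.0, 4.0, 6.0],
                 [1.0, 1.0, 4.0]])   # toy log-expression matrix: one row per gene

standardized = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
print(standardized.mean(axis=1))   # ~[0, 0]
print(standardized.std(axis=1))    # ~[1, 1]
```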
4.1 Tight clustering: Example
Comparison of various K-means runs and tight clustering: seven mini-chromosome maintenance (MCM) deficient genes.
[Panels: K-means with k = 30, k = 50, k = 70, k = 100; tight clustering.]
4.1 Tight clustering
TightClust software download: http://www.pitt.edu/~ctseng/research/tightClust_download.html
Scattered (noisy) genes

Formulation: K-means
K-means criterion: minimize the within-cluster sum of squared dispersion to obtain C:

  W_{K-means}(C; k) = Σ_{j=1}^k Σ_{x_i in C_j} ||x_i - x̄^(j)||²,

where x̄^(j) is the center of cluster j.

K-medoids criterion:

  W_{K-medoids}(C; k) = Σ_{j=1}^k Σ_{x_i in C_j} d(x_i, x^(j)),

where x^(j) in X is the median point (medoid) of cluster j.
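The two criteria can be compared on a toy data set (a minimal sketch, not the lecture's implementation; the medoid is chosen as the cluster point minimizing the total distance to the others):

```python
import numpy as np

def w_kmeans(X, labels):
    """Sum of squared distances to each cluster's mean (centroid)."""
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def w_kmedoids(X, labels):
    """Sum of distances to each cluster's best medoid (an actual data point)."""
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        total += dists.sum(axis=0).min()   # medoid minimizes total distance
    return total

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([0, 0, 1, 1])
print(w_kmeans(X, labels), w_kmedoids(X, labels))  # 1.0 2.0
```

The medoid criterion is larger here because each medoid must be a data point, not the midpoint between the two points of a cluster.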
4.2 Penalized and weighted K-means
Formulation: K-means

  W_{K-means}(C; k) = Σ_{j=1}^k Σ_{x_i in C_j} ||x_i - x̄^(j)||²,

where x̄^(j) is the center of cluster j.

Proposition: K-means is a special case of CML (classification maximum likelihood) under a Gaussian model of identical spherical clusters.
• K-means: minimize W_{K-means}(C; k).
• CML: maximize the classification likelihood Π_{j=1}^k Π_{x_i in C_j} φ(x_i; μ_j, σ²I); for identical spherical Gaussian clusters this is equivalent to minimizing W_{K-means}.
4.2 Penalized and weighted K-means
Goal 1: allow a set of scattered genes to remain unclustered.
Goal 2: incorporate prior information into cluster formation.
4.2 Penalized and weighted K-means
Formulation: PW-Kmeans

  W(C, S; k) = Σ_{j=1}^k Σ_{x_i in C_j} w(x_i; P) · d(x_i, C_j) + λ · |S|

• d(x_i, C_j): dispersion of point x_i in C_j.
• |S|: number of objects in the noise set S.
• w(x_i; P): weight function to incorporate prior information P.
• λ: a tuning parameter.
• Penalty term λ|S|: allows outlying objects of a cluster to be assigned to the noise set S.
• Weighting term w: utilizes prior knowledge of preferred or prohibited patterns P.
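A sketch of evaluating a loss of this shape (weighted within-cluster dispersion plus a λ·|S| penalty); the function name and the convention `labels[i] == -1` for noise-set membership are ours, not the paper's:

```python
import numpy as np

def pw_kmeans_loss(X, labels, centers, weights, lam):
    """Weighted within-cluster squared dispersion plus a lambda * |S| penalty.
    labels[i] == -1 marks object i as belonging to the noise set S."""
    in_cluster = labels >= 0
    disp = ((X[in_cluster] - centers[labels[in_cluster]]) ** 2).sum(axis=1)
    return float((weights[in_cluster] * disp).sum() + lam * (~in_cluster).sum())

X = np.array([[0.0], [0.2], [10.0]])
centers = np.array([[0.1]])
weights = np.ones(3)            # uniform weights = no prior information
labels = np.array([0, 0, -1])   # the outlying point is left in the noise set S
print(pw_kmeans_loss(X, labels, centers, weights, lam=1.0))  # ~1.02
```

Leaving the outlier in S costs λ = 1 here; forcing it into the cluster would instead cost its squared distance (10 - 0.1)² ≈ 98, so the penalized loss prefers the noise set.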
4.2 Penalized and weighted K-means
Properties of PW-Kmeans

Proposition: Denote by (C*(λ, k), S*(λ, k)) = (C_1*(λ, k), ..., C_k*(λ, k), S*(λ, k)) the minimizer for given λ and k.
1. If λ_1 ≤ λ_2, then W(C*(λ_1, k), S*(λ_1, k); λ_1, k) ≤ W(C*(λ_2, k), S*(λ_2, k); λ_2, k).
2. If λ_1 ≤ λ_2, then S*(λ_1, k) ⊇ S*(λ_2, k).
3. If k_1 ≤ k_2, then W(C*(λ, k_1), S*(λ, k_1); λ, k_1) ≥ W(C*(λ, k_2), S*(λ, k_2); λ, k_2).
Properties of PW-Kmeans: relation to classification likelihood

P-Kmeans loss function (w ≡ 1):

  W(C, S; k) = Σ_{j=1}^k Σ_{x_i in C_j} ||x_i - x̄^(j)||² + λ · |S|

Classification likelihood (Gaussian model for the clusters, uniform model for the noise):

  L_C = Π_{j=1}^k Π_{x_i in C_j} φ(x_i; μ_j, σ²I) × Π_{x_i in S} (1/|V|),

where V is the space over which the noise set is uniformly distributed. Minimizing the P-Kmeans loss corresponds to maximizing this classification likelihood, with λ determined by σ and |V|.
4.2 Penalized and weighted K-means
Prior information: six groups of validated cell cycle genes.
Formulation: PW-Kmeans
4.2 Penalized and weighted K-means
ICSA 06/15/2006
Formulation: PW-Kmeans. A special example of PW-Kmeans for microarray:
• Prior knowledge of p pathways.
• The weight is designed as a transformation of the logistic function.

Design of the weight function
4.2 Penalized and weighted K-means
Prior information: six groups of validated cell cycle genes, including 8 histone genes tightly coregulated in S phase.
Application: Yeast cell cycle expression
4.2 Penalized and weighted K-means

Penalized K-means with no prior information:
Cluster sizes: C1 = 58, C2 = 31, C3 = 39, C4 = 101, C5 = 71; noise set S = 1276.
The 8 histone genes are left in the noise set S without being clustered.
Application: Yeast cell cycle expression
PW-Kmeans: take three randomly selected histone genes as prior information, P.
Cluster sizes: C1 = 112, C2 = 158, C3 = 88, C4 = 139, C5 = 57; noise set S = 1109.
The 8 histone genes are now in cluster C3.
5. Cluster evaluation
• Evaluation and comparison of clustering methods is always difficult.
• In supervised learning (classification), the class labels (underlying truth) are known and performance can be evaluated through cross validation.
• In unsupervised learning (clustering), external validation is usually not available.
• Ideal data for cluster evaluation:
  - Data with class/tumor labels (for clustering samples)
  - Cell cycle data (for clustering genes)
  - Simulated data
5. Cluster evaluation
Rand index (Rand 1971):
Y = {(a,b,c), (d,e,f)}, Y' = {(a,b), (c,d,e), (f)}
Of the 15 pairs of objects:
• together in both (2): ab, de
• separate in both (7): ad, ae, af, bd, be, bf, cf
• discordant (6): ac, bc, cd, ce, df, ef
Rand index: c(Y, Y') = (2 + 7)/15 = 0.6 (proportion of concordant pairs).
Notes:
1. 0 ≤ c(Y, Y') ≤ 1.
2. Clustering methods can be evaluated by c(Y, Y_truth) if Y_truth is available.
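The slide's example can be checked directly (a minimal sketch; the two partitions are encoded as label dictionaries):

```python
from itertools import combinations

def rand_index(Y1, Y2):
    """Fraction of object pairs on which two clusterings agree
    (together in both, or separate in both)."""
    items = sorted(Y1)
    pairs = list(combinations(items, 2))
    agree = 0
    for a, b in pairs:
        same1 = Y1[a] == Y1[b]
        same2 = Y2[a] == Y2[b]
        agree += same1 == same2
    return agree / len(pairs)

# The slide's example: Y = {(a,b,c), (d,e,f)}, Y' = {(a,b), (c,d,e), (f)}.
Y  = dict(a=1, b=1, c=1, d=2, e=2, f=2)
Yp = dict(a=1, b=1, c=2, d=2, e=2, f=3)
print(rand_index(Y, Yp))  # 0.6
```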
5. Cluster evaluation
Adjusted Rand index (Hubert and Arabie 1985):

  Adjusted Rand index = (index - expected index) / (maximum index - expected index)

The adjusted Rand index takes maximum value 1 and has constant expected value 0 when the two clusterings are completely independent.
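The adjusted Rand index can be computed from the pair-count contingency table (a minimal sketch of the Hubert-Arabie formula; not library code):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels1, labels2):
    """(index - expected index) / (maximum index - expected index),
    computed from pair counts of the contingency table."""
    n = len(labels1)
    sum_ij = sum(comb(v, 2) for v in Counter(zip(labels1, labels2)).values())
    sum_a = sum(comb(v, 2) for v in Counter(labels1).values())
    sum_b = sum(comb(v, 2) for v in Counter(labels2).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

print(adjusted_rand_index([1, 1, 2, 2], [1, 1, 2, 2]))  # 1.0 for identical clusterings
print(adjusted_rand_index([1, 1, 2, 2], [1, 2, 1, 2]))  # negative: worse than chance
```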
Advantage / Disadvantage

Hierarchical clustering
• Advantages: intuitive algorithm; good interpretability; no need to estimate the number of clusters.
• Disadvantages: very vulnerable to outliers; tree not unique, and genes closer in the tree are not necessarily more similar; hard to read when the tree is big.

K-means
• Advantages: simplified Gaussian mixture model; normally produces nice clusters.
• Disadvantages: local minima; need to estimate the number of clusters.

SOM
• Advantages: clusters have an interpretation on a 2-D geometry (more interpretable).
• Disadvantages: very heuristic algorithm; solution sub-optimal due to the 2-D geometry restriction.

Model-based clustering
• Advantages: flexibility in cluster structure; rigorous statistical inference.
• Disadvantages: model selection usually difficult; local minimum problem.

Tight clustering
• Advantages: allows genes to remain unclustered and produces only tight clusters; eases the problem of accurately estimating the number of clusters; biologically more meaningful.
• Disadvantages: slower computation when the data are large.
6. Comparison
[Figure: heatmap of the simulated data, 15 clusters by 20 samples.]
Simulation:
• 20 time-course samples for each gene.
• In each cluster, four groups of samples with similar intensity.
• Individual sample and gene variation are added.
• The number of genes in each cluster ~ Poisson(10).
• Scattered (noise) genes are added.
• The simulated data closely resembles real data by visualization.
6. Comparison
Thalamuthu et al. 2006
Different types of perturbations:
• Type I: a number of randomly simulated scattered genes (0, 5, 10, 20, 60, 100 and 200% of the original total number of clustered genes) is added. E.g., for sample j of a scattered gene, the expression level is randomly sampled from the empirical distribution of the expressions of all clustered genes in sample j.
• Type II: to each element of the log-transformed expression matrix, a small random error from a normal distribution (SD = 0.05, 0.1, 0.2, 0.4, 0.8, 1.2) is added, to evaluate robustness of the clustering against potential random errors.
• Type III: combination of types I and II.
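The two perturbation types can be sketched as follows (toy data; the function names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(42)
expr = rng.normal(size=(100, 20))   # toy log-expression matrix: 100 genes x 20 samples

def perturb_type2(expr, sd):
    """Type II: add i.i.d. Gaussian errors to every entry of the matrix."""
    return expr + rng.normal(scale=sd, size=expr.shape)

def perturb_type1(expr, frac):
    """Type I: append scattered genes; the value in sample j is drawn from the
    empirical distribution of the clustered genes' expressions in sample j."""
    n_new = int(frac * expr.shape[0])
    cols = [rng.choice(expr[:, j], size=n_new) for j in range(expr.shape[1])]
    return np.vstack([expr, np.column_stack(cols)])

noisy = perturb_type2(expr, sd=0.2)
bigger = perturb_type1(expr, frac=0.1)
print(noisy.shape, bigger.shape)  # (100, 20) (110, 20)
```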
6. ComparisonDifferent degree of perturbation in the simulated microarray data
6. Comparison
Simulation schemes performed in the paper: in total, 25 (simulation settings) × 100 (data sets) = 2500 data sets are evaluated.
• Adjusted Rand index: a measure of similarity between two clusterings.
• Compare each clustering result to the underlying true clusters and obtain the adjusted Rand index (the higher the better).
6. Comparison
T: tight clustering; M: model-based; P: K-medoids; K: K-means; H: hierarchical; S: SOM
Consensus Clustering
Simpson et al. BMC Bioinformatics 2010 11:590 doi:10.1186/1471-2105-11-590
• Consensus clustering with PAM (blue)
• Consensus clustering with hierarchical clustering (red)
• HOPACH (black)
• Fuzzy c-means (green)
6. Comparison
Comparison in real data sets:(see paper for detailed comparison criteria)
6. Comparison
Tight clustering:
• George C. Tseng and Wing H. Wong. (2005) Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics 61:10-16.
Penalized and weighted K-means:
• George C. Tseng. (2007) Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics 23:2247-2255.
Comparative study:
• Anbupalam Thalamuthu*, Indranil Mukhopadhyay*, Xiaojing Zheng* and George C. Tseng. (2006) Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics 22:2405-2412.
References
• Despite many sophisticated methods for detecting regulatory interactions (e.g. Shortest-path and Liquid Association), cluster analysis remains a useful routine in microarray analysis.
• We should use these methods for visualization, investigation and hypothesis generation.
• We should not use these methods inferentially.
• In general, methods with resampling evaluation, allowing scattered genes and related to model-based approach are better.
• Hierarchical clustering in particular merely provides a picture, from which one can draw many (indeed almost any) conclusions.
6. Conclusion
Common mistakes or warnings:
1. Running K-means with a large k and getting excited at the patterns, without further investigation. K-means can show apparent patterns even in randomly generated data, and human eyes tend to see "patterns" anyway.
2. Identifying genes predictive of survival (e.g. applying t-statistics to long and short survivors), clustering samples based on the selected genes, and finding that the samples cluster according to survival status.
The gene selection procedure is already biased towards the desired result.
6. Conclusion
Common mistakes (cont'd):
3. Clustering samples into k groups, then performing an F-test to identify genes differentially expressed among the subgroups.
The data have been re-used both for clustering and for identifying differentially expressed genes. You will always obtain a set of "differentially expressed" genes, but cannot tell whether they are real or due to chance.
6. Conclusion
Post-cluster analysis
• Identify novel genes participating in known cellular process
• Enrichment of particular Gene Ontology (GO) terms in clusters
• Motif finding in clusters
6. Conclusion
• Rand WM: Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 1971, 66:846-850.
• Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2, 193–218, 1985.
• Dudoit S, Fridlyand J. (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology. 3(7):0036.1-0036.21.
• Dudoit S, Fridlyand J. (2003). Bagging to improve the accuracy of a clustering procedure. Bioinformatics. 19(9):1090-9.
• Tibshirani R, Walther G and Hastie T. (2001) Estimating the number of clusters in a data set via the gap statistic. JRSSB 63:411-423.
• Fraley,C. and Raftery,A.E. (2002) Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc., 97, 611–631.
• McLachlan,G.J. et al. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 413–422.
• Medvedovic,M. and Sivaganesan,S. (2002) Bayesian infinite mixture model-based clustering of gene expression profiles. Bioinformatics, 18, 1194–1206.
• Milligan,G.W. and Cooper,M.C. (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.
More references
Software and Packages
Methods in R:
• Hierarchical clustering: hclust
• K-means: kmeans
• Self-organizing maps: som (package som)
• Model-based clustering: EMclust (package mclust)

Individual packages:
• GeneCluster 2.0: http://www.broad.mit.edu/cancer/software/software.html
• Cluster (Mike Eisen): http://rana.lbl.gov/EisenSoftware.htm
• Tight Clustering and PW-Kmeans: http://www.biostat.pitt.edu/bioinfo/software.htm
• PAM: http://www-stat.stanford.edu/~tibs/PAM/index.html
• Expander and Click: http://www.cs.tau.ac.il/~rshamir/expander/expander.html