
Transcript of Agenda

Page 1: Agenda

Agenda

1. Introduction to clustering
   1. Dissimilarity measure
   2. Preprocessing
2. Clustering methods
   1. Hierarchical clustering
   2. K-means and K-medoids
   3. Self-organizing maps (SOM)
   4. Model-based clustering
3. Estimate # of clusters
4. Two new methods allowing scattered genes
   4.1 Tight clustering
   4.2 Penalized and weighted K-means
5. Cluster validation and evaluation
6. Comparison and discussion

Page 2: Agenda

4.1 Tight clustering

K-means clustering looks informative.

[Figure: K-means results for k = 10, k = 15 and k = 30.]

A closer look, however, finds lots of noise in each cluster. This is a common situation for gene clustering in microarray data.

Page 3: Agenda

Main challenges for clustering in microarray

Challenge 1: Lots of scattered genes, i.e., genes that do not belong to any tight cluster of biological function.

[Figure: two x-y scatter plots of the simulated data, with and without scattered genes.]

4.1 Tight clustering

Page 4: Agenda

Main challenges for clustering in microarray

Challenge 2: Microarray is an exploratory tool to guide further biological experiments.

• Hypothesis-driven: hypothesis => experimental data.
• Data-driven: high-throughput experiment => data mining => hypothesis => further validation experiment.

It is important to provide the most informative clusters instead of lots of loose clusters (i.e., reduce false positives).

4.1 Tight clustering

Page 5: Agenda

[Figure: the same two x-y scatter plots of the data as on the previous page.]

Traditional clustering: estimate the number of clusters, k (except for hierarchical clustering), then perform clustering by assigning all genes into clusters.

Tight clustering: directly identify informative, tight and stable clusters of reasonable size, say, 20~60 genes. There is no need to estimate k, and no need to assign all genes into clusters.

4.1 Tight clustering

Page 6: Agenda

Basic idea:

[Figure: the whole data set and two random sub-samples ("subsample 1", "subsample 2"); the clustering of each sub-sample is used to judge the co-membership of all points ("judgement by subsample 1", "judgement by subsample 2").]

Page 7: Agenda

Original data X => sub-sample X' => K-means => cluster centers C(X', k) = (C1, …, Ck) => co-membership matrix D[C(X', k), X]

4.1 Tight clustering

Page 8: Agenda

• X = {x_ij}_{n×d}: the data to be clustered.
• X' = {x'_ij}_{(n/2)×d}: a random sub-sample of X.
• C(X', k) = (C1, C2, …, Ck): the cluster centers obtained from clustering X' into k clusters.
• D[C(X', k), X]: an n×n matrix denoting the co-membership relations of X classified by C(X', k) (Tibshirani 2001); D[C(X', k), X]_ij = 1 if i and j are in the same cluster, and 0 otherwise.
• s(V_i, V_j) = |V_i ∩ V_j| / |V_i ∪ V_j|: a measure of the similarity of two sets of genes.

4.1 Tight clustering
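A minimal R sketch of this construction (the function name comembership and the use of base-R kmeans are illustrative assumptions, not the TightClust implementation):

    # Sketch: cluster a random half-sample, then classify all n points by
    # the nearest of the k centers and record pairwise co-membership.
    comembership <- function(X, k) {
      n <- nrow(X)
      sub <- sample(n, floor(n / 2))                     # random sub-sample X'
      centers <- kmeans(X[sub, ], centers = k)$centers   # C(X', k)
      # distance of every point (not only the sub-sample) to each center
      d <- as.matrix(dist(rbind(centers, X)))[1:k, -(1:k)]
      labels <- apply(d, 2, which.min)                   # nearest-center label
      outer(labels, labels, "==") * 1                    # n x n matrix D
    }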

Page 9: Agenda

Algorithm 1 (when fixing k):

1. Fix k. Draw random sub-samples X^(1), …, X^(B). Define the average co-membership matrix

   D̄ = mean( D[C(X^(1), k), X], …, D[C(X^(B), k), X] ).

Note:
a. D̄_ij = 1: i and j are clustered together in every sub-sampling judgment.
b. D̄_ij = 0: i and j are never clustered together in any sub-sampling judgment.
c. D̄_ii = 1 for all i.

4.1 Tight clustering

Page 10: Agenda

Algorithm 1 (when fixing k, cont'd):

2. Search for large sets of points V = {v_1, …, v_m} ⊂ {1, …, n} such that D̄_{v_i v_j} ≥ 1 − α for all v_i, v_j ∈ V, with α close to 0. Sets with this property are candidates for tight clusters. Order the sets with this property by their size to obtain V_k1, V_k2, …

4.1 Tight clustering
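Continuing the sketch above (avg_comembership and the greedy set-growing below are illustrative; the original algorithm's exact search procedure may differ):

    # Step 1: average co-membership over B sub-samples.
    avg_comembership <- function(X, k, B = 10) {
      Dbar <- matrix(0, nrow(X), nrow(X))
      for (b in 1:B) Dbar <- Dbar + comembership(X, k)
      Dbar / B
    }

    # Step 2: greedily grow candidate sets V whose pairwise average
    # co-membership stays >= 1 - alpha, then order them by size.
    tight_candidates <- function(Dbar, alpha = 0.1) {
      n <- nrow(Dbar); left <- 1:n; sets <- list()
      while (length(left) > 0) {
        V <- left[1]
        repeat {
          cand <- setdiff(left, V)
          if (length(cand) == 0) break
          ok <- cand[vapply(cand,
                            function(i) all(Dbar[i, V] >= 1 - alpha),
                            logical(1))]
          if (length(ok) == 0) break
          V <- c(V, ok[1])
        }
        sets[[length(sets) + 1]] <- V
        left <- setdiff(left, V)
      }
      sets[order(sapply(sets, length), decreasing = TRUE)]  # V_k1, V_k2, ...
    }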

Page 11: Agenda

Example: 11 points judged over ten sub-samples. Points 1-5 always cluster together, points 6-10 always cluster together, and point 11 joins the first group in half of the judgments and the second group in the other half. The average co-membership matrix D̄ is then

          1-5   6-10   11
   1-5    1.0    0    0.5
   6-10    0    1.0   0.5
   11     0.5   0.5   1.0

Page 12: Agenda

Tight Clustering Algorithm: (relax estimation of k)

[Figure: for consecutive values k_0, k_0+1, k_0+2, k_0+3, the top three candidate tight clusters V_{k,1}, V_{k,2}, V_{k,3} at each k, with pairwise set similarities between candidates at consecutive k; similarities close to 1 (e.g., 0.95) indicate a stable tight cluster, while values near 0 indicate unstable candidates.]

4.1 Tight clustering

Page 13: Agenda

Tight Clustering Algorithm:

1. Start with a suitable k_0. Search over consecutive k's and keep the top three candidate clusters for each k:

   {V_{k_0,1}, V_{k_0,2}, V_{k_0,3}}, {V_{(k_0+1),1}, V_{(k_0+1),2}, V_{(k_0+1),3}}, …

2. Stop at k' ≥ k_0 when s(V_{k',l}, V_{(k'+1),m}) and s(V_{(k'+1),m}, V_{(k'+2),n}) are close to 1 for some l, m, n ∈ {1, 2, 3}.

Select V_{(k'+1),m} to be the tightest cluster.

4.1 Tight clustering
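The set similarity s from Page 8 and this stability check can be sketched in R as follows (is_stable is an illustrative name; the transcript's β = 0.7 is used as the similarity threshold):

    # s(Vi, Vj) = |intersection| / |union|, as defined earlier
    set_sim <- function(Vi, Vj)
      length(intersect(Vi, Vj)) / length(union(Vi, Vj))

    # A candidate at k+1 is declared stable (the tightest cluster) if it
    # closely matches one of the top-3 candidates found at k.
    is_stable <- function(cands_k, cands_k1, beta = 0.7) {
      for (m in seq_along(cands_k1))
        for (l in seq_along(cands_k))
          if (set_sim(cands_k1[[m]], cands_k[[l]]) >= beta) return(m)
      0   # no stable candidate yet; move on to the next k
    }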

Page 14: Agenda

Tight Clustering Algorithm: (cont'd)

3. Identify the tightest cluster and remove it from the whole data.

4. Decrease k_0 by 1. Repeat steps 1-3 to identify the next tight cluster.

Remark: α, β and k_0 determine the tightness and size of the resulting clusters.

4.1 Tight clustering

Page 15: Agenda

A simple simulation in 2-D: 14 normally distributed clusters (50 points each; stdev = 0.1, 0.2, …, 1.4) plus 175 scattered points.

[Figure: x-y scatter plot of the simulated data.]

4.1 Tight clustering. Example:

Page 16: Agenda

Tight clustering on simulated data (α = 0, β = 0.7, B = 10; k_0 = 10, 20, 25 and 40):

Cluster    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  remain
truth     50  50  50  50  50  50  50  50  50  50  50  50  50  50   -    175
k0=10     58  59  59  78  72  60   -   -   -   -   -   -   -   -   -    489
k0=20     59  56  55  53  57  53  53  52  52  52  56  51  51  51  12    112
k0=25     55  56  53  56  53  53  52  55  51  51  51  50  50  50   9    130
k0=40     52  51  51  52  51  51  51  50  26  25  22  50  18  17  30    278

4.1 Tight clustering. Example:

Page 17: Agenda

[Figure: sixteen x-y scatter plots showing the tight clusters identified one at a time on the simulated data (k_0 = 25, α = 0, β = 0.7, B = 10).]

4.1 Tight clustering. Example:

Page 18: Agenda

Gene expression during the life cycle of Drosophila melanogaster. (2002) Science 297:2270-2275.

• 4028 genes monitored. The reference sample is pooled from all samples.
• 66 sequential time points spanning the embryonic (E), larval (L), pupal (P) and adult (A) periods.
• Filter out genes without a significant pattern (1100 genes) and standardize each gene to mean 0 and stdev 1.

4.1 Tight clustering. Example:
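The standardization step is a one-liner in R (assuming the expression matrix expr has genes in rows and the 66 time points in columns):

    # scale() standardizes columns, so transpose to standardize each gene
    standardize <- function(expr) t(scale(t(expr)))  # mean 0, stdev 1 per gene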

Page 19: Agenda

Comparison of various K-means runs and tight clustering: seven mini-chromosome maintenance (MCM) deficient genes.

[Figure: K-means, k = 30; K-means, k = 50.]

4.1 Tight clustering. Example:

Page 20: Agenda

[Figure: K-means, k = 70; K-means, k = 100; tight clustering.]

4.1 Tight clustering. Example:

Page 21: Agenda

4.1 Tight clustering

TightClust software download: http://www.pitt.edu/~ctseng/research/tightClust_download.html

Scattered (noisy) genes

Page 22: Agenda

Formulation: K-means

K-means criterion: minimize the within-cluster sum of squared dispersion to obtain C:

   W_{K-means}(C; k) = Σ_{j=1..k} Σ_{x_i ∈ C_j} ||x_i − x̄^(j)||²,

where x̄^(j) is the center of cluster j.

K-medoids criterion:

   W_{K-medoids}(C; k) = Σ_{j=1..k} Σ_{x_i ∈ C_j} d(x_i, x~^(j)),

where x~^(j) is the median point (medoid) of cluster j.

4.2 Penalized and weighted K-means
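The two criteria can be evaluated in R for a given partition (a sketch; labels is a vector of cluster assignments in 1..k):

    # K-means criterion: squared Euclidean distance to the cluster mean.
    W_kmeans <- function(X, labels) {
      sum(sapply(unique(labels), function(j) {
        Xj <- X[labels == j, , drop = FALSE]
        sum(rowSums(sweep(Xj, 2, colMeans(Xj))^2))
      }))
    }

    # K-medoids criterion: distance to the medoid, i.e. the cluster point
    # with the smallest total distance to the rest of its cluster.
    W_kmedoids <- function(X, labels) {
      sum(sapply(unique(labels), function(j) {
        Dj <- as.matrix(dist(X[labels == j, , drop = FALSE]))
        min(rowSums(Dj))
      }))
    }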

Page 23: Agenda

Formulation: K-means

   W_{K-means}(C; k) = Σ_{j=1..k} Σ_{x_i ∈ C_j} ||x_i − x̄^(j)||², where x̄^(j) is the center of cluster j.

Proposition: K-means is a special case of classification maximum likelihood (CML) under a Gaussian model of identical spherical clusters.

K-means: minimize W_{K-means}(C; k).

CML: maximize L_C(C; μ) = Π_{j=1..k} Π_{x_i ∈ C_j} φ(x_i; μ_j, σ²I); with identical spherical covariances σ²I, maximizing L_C is equivalent to minimizing W_{K-means}.

4.2 Penalized and weighted K-means

Page 24: Agenda

Goal 1: allow a set of scattered genes to remain unclustered.

Goal 2: incorporate prior information in cluster formation.

Formulation: PW-Kmeans

4.2 Penalized and weighted K-means

Page 25: Agenda

Formulation: PW-Kmeans

   W(C; k, λ, w) = Σ_{j=1..k} Σ_{x_i ∈ C_j} w(x_i; P) · d(x_i, C_j) + λ · |S|, where

• d(x_i, C_j): the dispersion of point x_i in cluster C_j;
• |S|: the number of objects in the noise set S;
• w(x_i; P): a weight function to incorporate prior information P;
• λ: a tuning parameter.

• Penalty term λ|S|: assigns outlying objects of a cluster to the noise set S.
• Weighting term w: utilizes prior knowledge of preferred or prohibited patterns P.

4.2 Penalized and weighted K-means
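A sketch of the resulting assignment rule (pkmeans_assign is an illustrative name; here d is squared Euclidean distance, and label 0 marks the noise set S):

    # Assign each point to its nearest center unless even the best weighted
    # dispersion exceeds the penalty lambda; such points go to S.
    pkmeans_assign <- function(X, centers, lambda, w = rep(1, nrow(X))) {
      d2 <- sapply(1:nrow(centers), function(j)
        rowSums(sweep(X, 2, centers[j, ])^2))   # n x k dispersions
      best <- apply(d2, 1, which.min)
      cost <- w * d2[cbind(1:nrow(X), best)]
      best[cost > lambda] <- 0                  # 0 = noise set S
      best
    }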

Page 26: Agenda

Proposition: Denote by C*(k) = (C*_1(k), …, C*_k(k), S*(k)) the minimizer for given k and λ.

1. If λ_1 ≤ λ_2, then W(C*; k, λ_1) ≤ W(C*; k, λ_2).
2. If λ_1 ≤ λ_2, then S*(λ_1) ⊇ S*(λ_2).
3. If k_1 ≤ k_2, then W(C*(k_1); k_1, λ) ≥ W(C*(k_2); k_2, λ).

Properties of PW-Kmeans

4.2 Penalized and weighted K-means

Page 27: Agenda

P-Kmeans loss function:

   W(C; k, λ) = Σ_{j=1..k} Σ_{x_i ∈ C_j} ||x_i − x̄^(j)||² + λ|S|.

Classification likelihood (Gaussian model with uniform noise):

   L_C = Π_{j=1..k} Π_{x_i ∈ C_j} φ(x_i; μ_j, σ²I) · Π_{x_i ∈ S} 1/V,

where V is the volume of the space on which the noise set is uniformly distributed. Minimizing the P-Kmeans loss corresponds to maximizing this classification likelihood, with λ determined by σ² and V.

Properties of PW-Kmeans: relation to the classification likelihood

4.2 Penalized and weighted K-means

Page 28: Agenda

Prior information: six groups of validated cell cycle genes.

Formulation: PW-Kmeans

4.2 Penalized and weighted K-means


Page 30: Agenda

A special example of PW-Kmeans for microarray: prior knowledge of p pathways. The weight is designed as a transformation of the logistic function.

Formulation: PW-Kmeans

4.2 Penalized and weighted K-means
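The exact transformation is not shown in the transcript; one plausible form (logistic_weight, w_min, a and r0 are all hypothetical) down-weights the dispersion of genes matching a prior pattern so they join clusters more easily:

    # Hypothetical logistic-type weight: r is a gene's similarity (e.g.,
    # correlation) to a prior pathway pattern P; genes with r near 1 get a
    # weight near w_min, making their dispersion cheaper inside a cluster.
    logistic_weight <- function(r, w_min = 0.5, a = 10, r0 = 0.8) {
      w_min + (1 - w_min) / (1 + exp(a * (r - r0)))
    }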

Page 31: Agenda

Design of the weight function. Formulation: PW-Kmeans

4.2 Penalized and weighted K-means

Page 32: Agenda


8 histone genes are tightly coregulated in S phase.

Prior information: six groups of validated cell cycle genes.

Application: yeast cell cycle expression

4.2 Penalized and weighted K-means

Page 33: Agenda

[Figure: penalized K-means with no prior information on yeast cell cycle expression. Cluster sizes: C1 = 58, C2 = 31, C3 = 39, C4 = 101, C5 = 71, noise set S = 1276.]

The 8 histone genes are left in the noise set S without being clustered.

Penalized K-means with no prior information.

4.2 Penalized and weighted K-means

Application: yeast cell cycle expression

Page 34: Agenda

[Figure: PW-Kmeans on yeast cell cycle expression. Cluster sizes: C1 = 112, C2 = 158, C3 = 88, C4 = 139, C5 = 57, noise set S = 1109.]

The 8 histone genes are now in cluster 3.

PW-Kmeans: take three randomly selected histone genes as the prior information P.

4.2 Penalized and weighted K-means

Application: yeast cell cycle expression

Page 35: Agenda

5. Cluster evaluation

• Evaluation and comparison of clustering methods is always difficult.

• In supervised learning (classification), the class labels (underlying truth) are known and performance can be evaluated through cross validation.

• In unsupervised learning (clustering), external validation is usually not available.

• Ideal data for cluster evaluation:
  • data with class/tumor labels (for clustering samples)
  • cell cycle data (for clustering genes)
  • simulated data

Page 36: Agenda

5. Cluster evaluation

Rand index (Rand 1971):

Y = {(a,b,c), (d,e,f)};  Y' = {(a,b), (c,d,e), (f)}

Of the 15 pairs of objects:
• together in both clusterings (2): ab, de
• separate in both clusterings (7): ad, ae, af, bd, be, bf, cf
• discordant (6): ac, bc, cd, ce, df, ef

Rand index: c(Y, Y') = (2 + 7)/15 = 0.6 (the percentage of concordant pairs).

1. 0 ≤ c(Y, Y') ≤ 1.
2. Clustering methods can be evaluated by c(Y, Y_truth) if Y_truth is available.
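A direct R implementation reproduces the example above (a sketch over two label vectors):

    # Rand index: fraction of the n(n-1)/2 pairs on which two clusterings
    # agree (together in both, or separate in both).
    rand_index <- function(y1, y2) {
      same1 <- outer(y1, y1, "==")
      same2 <- outer(y2, y2, "==")
      n <- length(y1)
      (sum(same1 == same2) - n) / (n * (n - 1))  # drop diagonal; pairs doubled
    }

    y  <- c(1, 1, 1, 2, 2, 2)   # Y  = {(a,b,c), (d,e,f)}
    yp <- c(1, 1, 2, 2, 2, 3)   # Y' = {(a,b), (c,d,e), (f)}
    rand_index(y, yp)           # 0.6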

Page 38: Agenda

5. Cluster evaluation

Adjusted Rand index (Hubert and Arabie 1985):

   adjusted Rand index = (index − expected index) / (maximum index − expected index)

The adjusted Rand index takes maximum value 1 and has constant expected value 0 when the two clusterings are totally independent.
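In R, the adjusted index can be computed from the contingency table of the two partitions (a sketch of the Hubert-Arabie pair-counting form):

    # Adjusted Rand index via pair counts from the contingency table.
    adjusted_rand <- function(y1, y2) {
      tab <- table(y1, y2)
      sij <- sum(choose(tab, 2))            # pairs together in both
      sa  <- sum(choose(rowSums(tab), 2))   # pairs together in y1
      sb  <- sum(choose(colSums(tab), 2))   # pairs together in y2
      exp_idx <- sa * sb / choose(length(y1), 2)
      (sij - exp_idx) / ((sa + sb) / 2 - exp_idx)
    }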

Page 39: Agenda

Advantages and disadvantages:

Hierarchical clustering
  Advantages: intuitive algorithm; good interpretability; no need to estimate the # of clusters.
  Disadvantages: very vulnerable to outliers; the tree is not unique, and genes closer in the tree are not necessarily more similar; hard to read when the tree is big.

K-means
  Advantages: simplified Gaussian mixture model; normally gets nice clusters.
  Disadvantages: local minima; need to estimate the # of clusters.

SOM
  Advantages: clusters have an interpretation on the 2-D geometry (more interpretable).
  Disadvantages: the algorithm is very heuristic; solution sub-optimal due to the 2-D geometry restriction.

Model-based clustering
  Advantages: flexibility in cluster structure; rigorous statistical inference.
  Disadvantages: model selection usually difficult; local minimum problem.

Tight clustering
  Advantages: allows genes not to be clustered, producing only tight clusters; eases the problem of accurately estimating the # of clusters; biologically more meaningful.
  Disadvantages: slower computation when the data are large.

6. Comparison

Page 40: Agenda

[Figure: simulated expression matrix of 15 clusters across 20 samples.]

Simulation (Thalamuthu et al. 2006):
• 20 time-course samples for each gene.
• In each cluster, four groups of samples with similar intensity.
• Individual sample and gene variation are added.
• The # of genes in each cluster ~ Poisson(10).
• Scattered (noise) genes are added.
• The simulated data resemble real data well under visualization.

6. Comparison

Page 41: Agenda

Different types of perturbation:

• Type I: a number (0, 5, 10, 20, 60, 100 and 200% of the original total number of clustered genes) of randomly simulated scattered genes is added. E.g., for sample j of a scattered gene, the expression level is randomly sampled from the empirical distribution of the expressions of all clustered genes in sample j (see the sketch after this list).

• Type II: to each element of the log-transformed expression matrix, a small random error from a normal distribution (SD = 0.05, 0.1, 0.2, 0.4, 0.8, 1.2) is added, to evaluate the robustness of the clustering against potential random errors.

• Type III: a combination of Types I and II.
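A sketch of the Type I perturbation (add_scattered is an illustrative name; expr has genes in rows and samples in columns):

    # Append m scattered genes; each entry in sample j is resampled from
    # the empirical distribution of the clustered genes' values in sample j.
    add_scattered <- function(expr, m) {
      noise <- apply(expr, 2, function(col) sample(col, m, replace = TRUE))
      rbind(expr, noise)
    }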

Page 42: Agenda

6. Comparison

Different degrees of perturbation in the simulated microarray data

Page 43: Agenda

6. Comparison

Simulation schemes performed in the paper: in total, 25 (simulation settings) × 100 (data sets) = 2500 data sets are evaluated.

Page 44: Agenda

• Adjusted Rand index: a measure of the similarity of two clusterings.
• Compare each clustering result to the underlying true clusters and obtain the adjusted Rand index (the higher the better).

6. Comparison

T: tight clustering; M: model-based; P: K-medoids; K: K-means; H: hierarchical; S: SOM

Page 45: Agenda

Consensus Clustering

Simpson et al. (2010) BMC Bioinformatics 11:590. doi:10.1186/1471-2105-11-590

Page 46: Agenda

• Consensus clustering with PAM (blue)
• Consensus clustering with hierarchical clustering (red)
• HOPACH (black)
• Fuzzy c-means (green)

6. Comparison

Page 47: Agenda

Comparison in real data sets (see the paper for detailed comparison criteria).

6. Comparison

Page 48: Agenda

Tight clustering:
• George C. Tseng and Wing H. Wong. (2005) Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics 61:10-16.

Penalized and weighted K-means:
• George C. Tseng. (2007) Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics 23:2247-2255.

Comparative study:
• Anbupalam Thalamuthu*, Indranil Mukhopadhyay*, Xiaojing Zheng* and George C. Tseng. (2006) Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics 22:2405-2412.

References

Page 49: Agenda

• Despite many sophisticated methods for detecting regulatory interactions (e.g., shortest-path and liquid association), cluster analysis remains a useful routine in microarray analysis.

• We should use these methods for visualization, investigation and hypothesis generation.

• We should not use these methods inferentially.

• In general, methods that use resampling evaluation, allow scattered genes, and are related to a model-based approach perform better.

• Hierarchical clustering in particular merely provides a picture, from which we can draw many (indeed, any) conclusions; treat it with caution.

6. Conclusion

Page 50: Agenda

Common mistakes or warnings:

1. Running K-means with a large k and getting excited about the patterns without further investigation. K-means can make you see patterns even in randomly generated data, and besides, human eyes tend to see "patterns".

2. Identifying genes predictive of survival (e.g., applying t-statistics to long vs. short survivors), then clustering samples based on the selected genes and finding that the samples cluster according to survival status. The gene selection procedure is already biased towards the desired result.

6. Conclusion

Page 51: Agenda

Common mistakes (cont'd):

3. Clustering samples into k groups, then performing an F-test to identify genes differentially expressed among the subgroups. The data have been re-used both for clustering and for identifying differentially expressed genes: you will always obtain a set of differentially expressed genes, but cannot tell whether they are real or due to chance.

6. Conclusion

Page 52: Agenda

Post-cluster analysis

• Identify novel genes participating in known cellular process

• Enrichment of particular Gene Ontology (GO) terms in clusters

• Motif finding in clusters

6. Conclusion

Page 53: Agenda

• Rand WM. (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66:846-850.

• Hubert L and Arabie P. (1985) Comparing partitions. Journal of Classification 2:193-218.

• Dudoit S and Fridlyand J. (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3(7):0036.1-0036.21.

• Dudoit S and Fridlyand J. (2003) Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19(9):1090-1099.

• Tibshirani R, Walther G and Hastie T. (2001) Estimating the number of clusters in a data set via the gap statistic. JRSSB 63:411-423.

• Fraley C and Raftery AE. (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97:611-631.

• McLachlan GJ et al. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18:413-422.

• Medvedovic M and Sivaganesan S. (2002) Bayesian infinite mixture model-based clustering of gene expression profiles. Bioinformatics 18:1194-1206.

• Milligan GW and Cooper MC. (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50:159-179.

More references

Page 54: Agenda

Methods in R:
• Hierarchical clustering: hclust
• K-means: kmeans
• Self-organizing maps: som (package som)
• Model-based clustering: EMclust (package mclust)
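Minimal usage of the base-R entries above (package calls commented out; note that newer mclust versions use Mclust rather than EMclust):

    X <- matrix(rnorm(200), ncol = 2)   # toy data: 100 points in 2-D
    hc <- hclust(dist(X))               # hierarchical clustering
    memb <- cutree(hc, k = 4)           # cut the tree into 4 clusters
    km <- kmeans(X, centers = 4)        # K-means
    # library(som);    som(X, xdim = 2, ydim = 2)   # self-organizing map
    # library(mclust); Mclust(X)                    # model-based clustering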

Individual packages:

• GeneCluster 2.0: http://www.broad.mit.edu/cancer/software/software.html
• Cluster (Mike Eisen): http://rana.lbl.gov/EisenSoftware.htm
• Tight Clustering and PW-Kmeans: http://www.biostat.pitt.edu/bioinfo/software.htm
• PAM: http://www-stat.stanford.edu/~tibs/PAM/index.html
• Expander and Click: http://www.cs.tau.ac.il/~rshamir/expander/expander.html

Software and Packages