Essential Statistics in Biology: Getting the Numbers Right

Raphael Gottardo
Clinical Research Institute of Montreal (IRCM)
[email protected]
http://www.rglab.org

Day 1

Outline

• Exploratory Data Analysis

• 1-2 sample t-tests, multiple testing

• Clustering

• SVD/PCA

• Frequentists vs. Bayesians

Clustering (Multivariate analysis)
(Day 1, Section 3)

Outline

• Basics of clustering

• Hierarchical clustering

• K-means

• Model-based clustering

What is it?

Clustering is the grouping of similar objects into classes: a data set is partitioned into subsets (clusters) so that the data in each subset are “close” to one another, typically according to some defined distance measure.

Examples: web document clustering (www), gene clustering
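As a minimal sketch (toy data of my own, not from the slides), this is how such a distance measure is computed in R: dist() returns the lower triangle of the pairwise distance matrix.

## Three 2-dimensional points; Euclidean distances between all pairs
x <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))
dist(x, method = "euclidean")
##    a  b
## b  5
## c 10  5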


Hierarchical clustering

Given N items and a distance metric:

1. Assign each item to its own cluster, and initialize the distance matrix between clusters as the distances between items.

2. Find the closest pair of clusters and merge them into a single cluster.

3. Compute the new distances between clusters.

4. Repeat steps 2-3 until all items are merged into a single cluster (a small R sketch follows below).
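This loop is what hclust() runs internally. A minimal sketch on made-up 1-dimensional data (my own toy example, not from the slides) shows the merge order and the inter-cluster distance at each merge:

x <- c(1, 2, 6, 7, 20)                    # five 1-d "items"
hc <- hclust(dist(x), method = "single")  # single-linkage agglomeration
hc$merge   # row i: the two clusters merged at step i (negative = original item)
hc$height  # inter-cluster distance at each merge: 1, 1, 4, 13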


Single linkage

The distance between clusters is defined as the shortest distance from any member of one cluster to any member of the other cluster.

[Figure: two clusters; d = the minimum pairwise distance between Cluster 1 and Cluster 2]

Complete linkage

The distance between clusters is defined as the greatest distance from any member of one cluster to any member of the other cluster.

[Figure: two clusters; d = the maximum pairwise distance between Cluster 1 and Cluster 2]

Average linkage

The distance between clusters is defined as the average distance from any member of one cluster to any member of the other cluster.

[Figure: two clusters; d = the average of all pairwise distances between Cluster 1 and Cluster 2]
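To make the three definitions concrete, here is a small sketch (toy data of my own) computing the single, complete and average linkage distances between two clusters directly as the min, max and mean of the between-cluster pairwise distances:

c1 <- matrix(c(0, 0,  1, 0), ncol = 2, byrow = TRUE)  # cluster 1: 2 points
c2 <- matrix(c(4, 0,  6, 0), ncol = 2, byrow = TRUE)  # cluster 2: 2 points
d <- as.matrix(dist(rbind(c1, c2)))[1:2, 3:4]  # between-cluster distances
min(d)    # single linkage:   3
max(d)    # complete linkage: 6
mean(d)   # average linkage:  4.5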


Example

Cell cycle dataset (Cho et al. 1998)

Expression levels of ~6000 genes during the cell cycle

17 time points (2 cell cycles)


Example

## Load the Cho et al. data: 50 genes x 17 time points
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])
D.cho <- dist(cho.data, method = "euclidean")

## Single linkage
hc.single <- hclust(D.cho, method = "single", members = NULL)
plot(hc.single)
rect.hclust(hc.single, k = 4)
class.single <- cutree(hc.single, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.single == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
## as.matrix() rather than t(): this cluster apparently contains a single
## gene, so subsetting drops the result to a vector
matplot(as.matrix(cho.data[class.single == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

## Complete linkage
hc.complete <- hclust(D.cho, method = "complete", members = NULL)
plot(hc.complete)
rect.hclust(hc.complete, k = 4)
class.complete <- cutree(hc.complete, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.complete == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

## Average linkage
hc.average <- hclust(D.cho, method = "average", members = NULL)
plot(hc.average)
rect.hclust(hc.average, k = 4)
class.average <- cutree(hc.average, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.average == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(as.matrix(cho.data[class.average == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

## K-means with 4 clusters
set.seed(100)
km.cho <- kmeans(cho.data, 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

Example: single linkage

[Figure: single-linkage dendrogram of the 50 genes]

[Figures: the single-linkage dendrogram cut into k = 2, 3, 4, 5 and 25 clusters (rect.hclust)]

[Figure: log expression profiles over time for the four single-linkage clusters (k = 4), panels 1-4]

Example: complete linkage

[Figure: complete-linkage dendrogram cut into k = 4 clusters]

[Figure: log expression profiles over time for the four complete-linkage clusters (k = 4), panels 1-4]

K-means

N items, assume K clusters.

The goal is to minimize

$$W = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2$$

over the possible assignments $C_k$ and centroids $\mu_k$, where $\mu_k$ represents the location of cluster $k$.
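As a sketch of this objective (my own toy data, not from the slides), the quantity above can be recomputed by hand from a kmeans() fit and compared with the value the function reports:

set.seed(1)
x <- rbind(matrix(rnorm(20), ncol = 2),
           matrix(rnorm(20, mean = 3), ncol = 2))   # two toy clusters
km <- kmeans(x, centers = 2)
## within-cluster sum of squares, computed from the definition
wss <- sum(rowSums((x - km$centers[km$cluster, ])^2))
all.equal(wss, km$tot.withinss)   # TRUE: matches kmeans' reported objective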


K-means - algorithm

1. Divide the data into K initial clusters. Initialize the centroids with the means of these clusters.

2. Assign each item to the cluster with the closest centroid.

3. When all items have been assigned, recalculate the centroids (cluster means).

4. Repeat steps 2-3 until the centroids no longer move.

K-means - algorithm

set.seed(100)
## Two Gaussian clouds in 2-d
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
## Rerun K-means from the same 5 random starting centroids, allowing
## i = 1, ..., 4 iterations, to watch the centroids converge
## (non-convergence warnings for small i are expected)
set.seed(100)
for (i in 1:4) {
  set.seed(100)
  cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = i)
  plot(x, col = cl$cluster)
  points(cl$centers, col = 1:5, pch = 8, cex = 2)
  Sys.sleep(2)
}

Example

[Figure: K-means cluster profiles (panels 1-4). Why?]

Example

[Figure: K-means cluster profiles (panels 1-4)]

Summary

• K-means and hierarchical clustering methods are useful techniques

• Fast and easy to implement, but beware of the memory requirements of hierarchical clustering

• A bit “ad hoc”:

  • How many clusters?

  • Which distance metric?

  • What makes a good clustering?

Model-based clustering

• Based on probability models (e.g. normal mixture models)

• Lets us say precisely what a “good” clustering is

• Allows several models to be compared

• Can estimate the number of clusters!

Model-based clustering

Multivariate observations $x_1, \ldots, x_N$, K clusters.

Assume observation i belongs to cluster k; then

$$x_i \mid z_i = k \;\sim\; N(\mu_k, \Sigma_k),$$

that is, each cluster can be represented by a multivariate normal distribution with mean $\mu_k$ and covariance $\Sigma_k$.

Yeung et al. (2001)
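A minimal simulation sketch (my own example; the mixing weights, means and covariances below are arbitrary choices, not from the slides) draws data from such a two-component multivariate normal mixture using MASS::mvrnorm:

library(MASS)                 # for mvrnorm()
set.seed(1)
n <- 200
z <- sample(1:2, n, replace = TRUE, prob = c(0.6, 0.4))   # latent cluster labels
mu <- list(c(0, 0), c(3, 3))                              # cluster means mu_k
Sigma <- list(diag(2), matrix(c(1, 0.8, 0.8, 1), 2, 2))   # covariances Sigma_k
x <- t(sapply(z, function(k) mvrnorm(1, mu[[k]], Sigma[[k]])))
plot(x, col = z, pch = 19)    # points coloured by true cluster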


Model-based clustering

Banfield and Raftery (1993): eigenvalue decomposition of the cluster covariance matrices,

$$\Sigma_k = \lambda_k D_k A_k D_k^\top,$$

where $\lambda_k$ controls the volume, $A_k$ the shape, and $D_k$ the orientation of cluster $k$.
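A short sketch (my own example) recovers the three factors from a covariance matrix with eigen(): $\lambda_k = |\Sigma_k|^{1/d}$ is the volume, $A_k$ holds the scaled eigenvalues (shape, with $\det A_k = 1$), and $D_k$ the eigenvectors (orientation).

Sigma <- matrix(c(2, 1, 1, 2), 2, 2)   # an example covariance matrix
e <- eigen(Sigma)
D      <- e$vectors                 # orientation: eigenvectors
lambda <- prod(e$values)^(1/2)      # volume: det(Sigma)^(1/d), here d = 2
A      <- diag(e$values / lambda)   # shape: scaled eigenvalues, det(A) = 1
lambda * D %*% A %*% t(D)           # reconstructs Sigma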


Model-based clustering

Common covariance parametrizations:

• Equal volume, spherical (EII): $\Sigma_k = \lambda I$

• Unequal volume, spherical (VII): $\Sigma_k = \lambda_k I$

• Equal volume, shape, and orientation (EEE): $\Sigma_k = \lambda D A D^\top$

• Unconstrained (VVV): $\Sigma_k = \lambda_k D_k A_k D_k^\top$

Estimation

Given the number of clusters and the covariance structure, the EM algorithm can be used to maximize the likelihood of the mixture model,

$$L(\theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \,\phi(x_i \mid \mu_k, \Sigma_k),$$

where $\pi_k$ are the mixing proportions and $\phi$ is the multivariate normal density.

The Mclust R package is available from CRAN.
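Note that EMclust(), used in the examples below, comes from an old version of mclust; in current versions of the package the corresponding calls are mclustBIC() and Mclust(). A minimal sketch with the current API (using R's built-in iris data rather than the slides' dataset):

library(mclust)
## Fit mixtures with 1-5 clusters and three covariance structures;
## Mclust() keeps the combination with the best BIC
fit <- Mclust(iris[, 1:4], G = 1:5, modelNames = c("EII", "EEE", "VVV"))
summary(fit)             # chosen model, number of clusters, log-likelihood
plot(fit, what = "BIC")  # BIC as a function of the number of clusters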


Model selection

Which model is appropriate?
- Which covariance structure?
- How many clusters?

Compare the different models using BIC.

Model selection

We wish to compare two models $M_1$ and $M_2$ with parameters $\theta_1$ and $\theta_2$, respectively.

Given the observed data D, define the integrated likelihood

$$p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k,$$

the probability of observing the data given model $M_k$.

NB: $\theta_1$ and $\theta_2$ might have different dimensions.

Model selection

To compare two models $M_1$ and $M_2$, use the integrated likelihoods.

The integral is difficult to compute!

Bayesian information criterion:

$$2 \log p(D \mid M_k) \;\approx\; 2 \log p(D \mid \hat{\theta}_k, M_k) - \nu_k \log(N) \;\equiv\; \mathrm{BIC}_k,$$

where $\hat{\theta}_k$ is the maximum likelihood estimate of $\theta_k$ and $\nu_k$ is the number of parameters in model $M_k$.

Model selection

Bayesian information criterion:

$$\mathrm{BIC}_k = \underbrace{2 \log p(D \mid \hat{\theta}_k, M_k)}_{\text{measure of fit}} \;-\; \underbrace{\nu_k \log(N)}_{\text{penalty term}}$$

A large BIC score indicates strong evidence for the corresponding model. BIC can be used to choose both the number of clusters and the covariance parametrization (Mclust).
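As a worked sketch of the formula above (my own toy example), the BIC of two competing normal models, one with the mean fixed at zero and one with a free mean, can be computed by hand. (R's built-in BIC() uses the opposite sign convention, $-2$ log-likelihood $+ \nu \log N$, where smaller is better.)

set.seed(1)
y <- rnorm(100, mean = 2)   # toy data with true mean 2
n <- length(y)
## Model 1: N(0, sigma^2), one free parameter (sigma^2)
ll1 <- sum(dnorm(y, mean = 0, sd = sqrt(mean(y^2)), log = TRUE))
## Model 2: N(mu, sigma^2), two free parameters
ll2 <- sum(dnorm(y, mean = mean(y), sd = sqrt(mean((y - mean(y))^2)), log = TRUE))
c(BIC1 = 2 * ll1 - 1 * log(n),
  BIC2 = 2 * ll2 - 2 * log(n))   # larger is better; model 2 should win here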


Example revisited

[Figure: BIC as a function of the number of clusters for models 1 (EII) and 2 (EEI)]

library(mclust)
## NB: cho.data.std (the standardized data) is not defined on the slides;
## presumably something like cho.data.std <- t(scale(t(cho.data)))
cho.mclust.bic <- EMclust(cho.data.std, modelNames = c("EII", "EEI"))
plot(cho.mclust.bic)
cho.mclust <- EMclust(cho.data.std, 4, "EII")
sum.cho <- summary(cho.mclust, cho.data.std)

Example revisited

par(mfrow = c(2, 2))
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

Example revisited

[Figure: log expression profiles for the four clusters found with the EII model (panels 1-4)]

Example revisited

cho.mclust <- EMclust(cho.data.std, 3, "EEI")   # EEI model, 3 clusters
sum.cho <- summary(cho.mclust, cho.data.std)

par(mfrow = c(2, 2))
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")

Example revisited

[Figure: log expression profiles for the three clusters found with the EEI model (panels 1-3)]

Summary

• Model-based clustering is a principled alternative to heuristic clustering algorithms

• BIC can be used to choose the covariance structure and the number of clusters

Conclusion

• We have seen a few clustering algorithms

• There are many others:

  • Two-way clustering

  • Plaid models ...

• Clustering is a useful tool and ... a dangerous weapon

• To be consumed with moderation!