Essential Statistics in Biology: Getting the Numbers Right
Raphael Gottardo
Clinical Research Institute of Montreal (IRCM)
[email protected]
http://www.rglab.org
Day 1
Outline
•Exploratory Data Analysis
•1-2 sample t-tests, multiple testing
•Clustering
•SVD/PCA
•Frequentists vs. Bayesians
Day 1 - Section 3
Outline
•Basics of clustering
•Hierarchical clustering
•K-means
•Model based clustering
What is it?
Clustering is the classification of similar objects into groups: a data set is partitioned into subsets (clusters) so that the data in each subset are “close” to one another, where proximity is usually defined by some distance measure.
Examples: www, gene clustering
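As a concrete illustration of "proximity according to a distance measure", here is a minimal sketch using base R's dist() on three made-up 2-D points (the points are invented for illustration, not taken from the lecture data):

```r
# Three made-up points; dist() returns all pairwise Euclidean distances,
# the kind of proximity measure the clustering algorithms below rely on.
m <- matrix(c(0, 0,
              3, 4,
              0, 8), ncol = 2, byrow = TRUE)
D <- dist(m, method = "euclidean")
as.matrix(D)   # d(1,2) = sqrt(3^2 + 4^2) = 5
```

The same dist() object is exactly what hclust() consumes in the gene-expression example later in this section.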
Hierarchical clustering
Given N items and a distance metric
1. Assign each item to its own cluster; initialize the distance matrix between clusters as the distances between items
2. Find the closest pair of clusters and merge them into a single cluster
3. Compute the new distances between clusters
4. Repeat 2-3 until all items are grouped into a single cluster
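The loop above can be sketched from scratch in a few lines of base R. This is a toy illustration only (four made-up 1-D points, single linkage); hclust(), used later in this section, does the same job far more efficiently:

```r
# From-scratch agglomerative loop, single linkage, on four made-up points.
x <- c(1, 2, 6, 10)
clusters <- as.list(seq_along(x))        # step 1: each item is its own cluster
single.link <- function(a, b) min(abs(outer(x[a], x[b], "-")))
while (length(clusters) > 1) {
  pairs <- combn(length(clusters), 2)    # step 2: find the closest pair
  d <- apply(pairs, 2,
             function(p) single.link(clusters[[p[1]]], clusters[[p[2]]]))
  best <- pairs[, which.min(d)]
  merged <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters <- c(clusters[-best], list(merged))  # steps 3-4: merge and repeat
}
clusters[[1]]                            # all items end up in a single cluster
```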
Single linkage
The distance between clusters is defined as the shortest distance from any member of one cluster to any member of the other cluster.
[Figure: distance d between Cluster 1 and Cluster 2]
Complete linkage
The distance between clusters is defined as the greatest distance from any member of one cluster to any member of the other cluster.
[Figure: distance d between Cluster 1 and Cluster 2]
Average linkage
The distance between clusters is defined as the average distance from any member of one cluster to any member of the other cluster.
[Figure: d = average of all pairwise distances between Cluster 1 and Cluster 2]
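All three linkage rules are available directly through hclust()'s method argument. A toy run on five made-up 2-D points shows how the rule changes the merge heights (for these points, single-linkage heights are smallest and complete-linkage heights largest at each merge, as the definitions suggest):

```r
# Five made-up points: two tight pairs plus one outlier.
x <- matrix(c(0, 0,  0, 1,  4, 0,  4, 1,  10, 0), ncol = 2, byrow = TRUE)
D <- dist(x)
h.single   <- hclust(D, method = "single")
h.complete <- hclust(D, method = "complete")
h.average  <- hclust(D, method = "average")
# merge heights under each rule (one value per merge step)
rbind(single   = h.single$height,
      average  = h.average$height,
      complete = h.complete$height)
```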
Example
Cell cycle dataset (Cho et al. 1998)
Expression levels of ~6000 genes during the cell cycle
17 time points (2 cell cycles)
Example
# Read the first 50 genes; columns 3-19 hold the 17 time points
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])
D.cho <- dist(cho.data, method = "euclidean")

# Single linkage
hc.single <- hclust(D.cho, method = "single", members = NULL)
plot(hc.single)
rect.hclust(hc.single, k = 4)
class.single <- cutree(hc.single, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.single == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
# as.matrix: this cluster may contain a single gene, which would drop to a vector
matplot(as.matrix(cho.data[class.single == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

# Complete linkage
hc.complete <- hclust(D.cho, method = "complete", members = NULL)
plot(hc.complete)
rect.hclust(hc.complete, k = 4)
class.complete <- cutree(hc.complete, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.complete == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

# Average linkage
hc.average <- hclust(D.cho, method = "average", members = NULL)
plot(hc.average)
rect.hclust(hc.average, k = 4)
class.average <- cutree(hc.average, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.average == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(as.matrix(cho.data[class.average == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

# K-means with 4 clusters on the same data
set.seed(100)
km.cho <- kmeans(cho.data, 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l", xlab = "time", ylab = "log expression value")
K-means
N items, assume K clusters.

The goal is to minimize

$$\sum_{k=1}^{K} \sum_{i:\, c_i = k} \lVert x_i - \mu_k \rVert^2$$

over the possible assignments $c_1,\dots,c_N$ and centroids $\mu_1,\dots,\mu_K$, where $\mu_k$ represents the location (center) of cluster $k$.
K-means - algorithm
1. Divide the data into K initial clusters; initialize the centroids as the means of those clusters
2. Assign each item to the cluster with the closest centroid
3. When all objects have been assigned, recalculate the centroids (cluster means)
4. Repeat 2-3 until the centroids no longer move
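Steps 1-4 can be sketched in a few lines of base R. This is a bare-bones illustration on made-up data (in practice use kmeans(), as in the demo that follows); here the two starting centroids are simply the first and last data points:

```r
# Two made-up, well-separated 2-D groups of 10 points each.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.2), ncol = 2),
           matrix(rnorm(20, mean = 3, sd = 0.2), ncol = 2))
K <- 2
centroids <- x[c(1, nrow(x)), ]                           # step 1
repeat {
  # step 2: assign each item to the cluster with the closest centroid
  d2 <- sapply(1:K, function(k) colSums((t(x) - centroids[k, ])^2))
  cl <- max.col(-d2)
  # step 3: recompute the centroids as the cluster means
  new.centroids <- t(sapply(1:K, function(k) colMeans(x[cl == k, , drop = FALSE])))
  if (max(abs(new.centroids - centroids)) < 1e-12) break  # step 4: converged
  centroids <- new.centroids
}
table(cl)
```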
K-means - algorithm

set.seed(100)
# Two Gaussian groups of 50 2-D points
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
# Re-run kmeans with the same 5 random starting centers, allowing one more
# iteration each time, to visualize how the centroids move
for (i in 1:4) {
  set.seed(100)
  cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = i)
  plot(x, col = cl$cluster)
  points(cl$centers, col = 1:5, pch = 8, cex = 2)
  Sys.sleep(2)
}
Summary
•K-means and hierarchical clustering methods are useful techniques
•Fast and easy to implement
•Beware of the memory requirements of hierarchical clustering
•A bit “ad hoc”:
 •Number of clusters?
 •Distance metric?
 •Good clustering?
Model based clustering
•Based on probability models (e.g. normal mixture models)
•Provides a formal notion of what a “good” clustering is
•Allows different models to be compared
•Can estimate the number of clusters!
Model based clustering
Multivariate observations $y_1, \dots, y_n$; K clusters.

Assume observation i belongs to cluster k; then

$$y_i \mid z_i = k \sim N(\mu_k, \Sigma_k),$$

that is, each cluster can be represented by a multivariate normal distribution with mean $\mu_k$ and covariance $\Sigma_k$.
Yeung et al. (2001)
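The mixture idea can be made concrete in one dimension: each cluster is a normal component, and an observation's density is the weighted sum over components. All parameter values below are invented for illustration (the lecture's setting is multivariate):

```r
# A two-component 1-D normal mixture with made-up parameters.
pi.k <- c(0.4, 0.6)     # mixing proportions (cluster sizes)
mu.k <- c(0, 5)         # cluster means
sd.k <- c(1, 1)         # cluster standard deviations
mixture.density <- function(y)
  pi.k[1] * dnorm(y, mu.k[1], sd.k[1]) + pi.k[2] * dnorm(y, mu.k[2], sd.k[2])
# Simulate data from the mixture and evaluate its log-likelihood
set.seed(1)
y <- c(rnorm(40, 0, 1), rnorm(60, 5, 1))
loglik <- sum(log(mixture.density(y)))
```

This log-likelihood is exactly the quantity the EM algorithm (introduced below) maximizes.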
Model based clustering
Banfield and Raftery (1993): eigenvalue decomposition of each cluster covariance,

$$\Sigma_k = \lambda_k D_k A_k D_k^{\top},$$

where $\lambda_k$ controls the volume, $D_k$ the orientation, and $A_k$ the shape of cluster $k$.
Model based clustering
•EII: equal volume, spherical
•VII: unequal volume, spherical
•EEE: equal volume, shape, and orientation
•VVV: unconstrained
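The decomposition underlying these parametrizations can be checked numerically with base R's eigen() on a made-up 2x2 covariance matrix (the values are invented for illustration):

```r
# Sigma = D diag(eigenvalues) D', with D orthogonal (orientation) and the
# eigenvalues carrying the volume/shape information.
Sigma <- matrix(c(2.0, 0.5,
                  0.5, 1.0), nrow = 2)
e <- eigen(Sigma)
reconstructed <- e$vectors %*% diag(e$values) %*% t(e$vectors)
max(abs(reconstructed - Sigma))   # numerically zero
```

Constraining how the eigenvalues and eigenvectors are shared across clusters is exactly what distinguishes EII, VII, EEE, and VVV.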
Estimation
Given the number of clusters and the covariance structure, the parameters can be estimated with the EM algorithm, maximizing the mixture likelihood

$$L(\theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k\, \phi(y_i \mid \mu_k, \Sigma_k),$$

where $\pi_k$ are the mixing proportions and $\phi$ is the multivariate normal density.

Mclust R package available from CRAN
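To see what EM actually does, here is a minimal from-scratch version for a 1-D two-component normal mixture with a common variance, on simulated data (the lecture's setting is multivariate and handled by Mclust; this sketch only illustrates the E- and M-steps):

```r
# Simulated data: two 1-D groups at 0 and 4.
set.seed(1)
y <- c(rnorm(100, 0, 1), rnorm(100, 4, 1))
p1 <- 0.5; mu <- c(-1, 1); s <- 1                  # crude starting values
for (iter in 1:200) {
  # E-step: posterior probability that each point belongs to component 1
  num1 <- p1 * dnorm(y, mu[1], s)
  num2 <- (1 - p1) * dnorm(y, mu[2], s)
  r <- num1 / (num1 + num2)
  # M-step: update proportions, means, and the common standard deviation
  p1 <- mean(r)
  mu <- c(sum(r * y) / sum(r), sum((1 - r) * y) / sum(1 - r))
  s  <- sqrt(sum(r * (y - mu[1])^2 + (1 - r) * (y - mu[2])^2) / length(y))
}
round(c(p1 = p1, mu1 = mu[1], mu2 = mu[2], s = s), 2)
```

Each iteration is guaranteed not to decrease the mixture log-likelihood, which is why EM is the standard fitting tool here.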
Model selection
Which model is appropriate?
- Which covariance structure?
- How many clusters?

The different models can be compared using BIC.
Model selection
We wish to compare two models $M_1$ and $M_2$, with parameters $\theta_1$ and $\theta_2$ respectively.

Given the observed data D, define the integrated likelihood

$$p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k,$$

the probability of observing the data given model $M_k$.

NB: $\theta_1$ and $\theta_2$ might have different dimensions.
Model selection
To compare two models, use the integrated likelihoods. The integral is difficult to compute! Instead, use the Bayesian information criterion:

$$\mathrm{BIC}_k = 2 \log p(D \mid \hat{\theta}_k, M_k) - \nu_k \log n \approx 2 \log p(D \mid M_k),$$

where $\hat{\theta}_k$ is the maximum likelihood estimate and $\nu_k$ is the number of parameters in model $M_k$.
Model selection

Bayesian information criterion:

$$\mathrm{BIC}_k = \underbrace{2 \log p(D \mid \hat{\theta}_k, M_k)}_{\text{measure of fit}} - \underbrace{\nu_k \log n}_{\text{penalty term}}$$

A large BIC score indicates strong evidence for the corresponding model. BIC can be used to choose both the number of clusters and the covariance parametrization (Mclust).
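As a small worked example, the criterion above can be computed by hand for a single normal model fitted to made-up data (this follows mclust's sign convention, BIC = 2*loglik - npar*log(n), so larger is better):

```r
# Made-up sample; fit a single normal by maximum likelihood.
set.seed(1)
y <- rnorm(50, mean = 2, sd = 1)
mu.hat <- mean(y)
s.hat  <- sqrt(mean((y - mu.hat)^2))     # ML estimate of sigma (n denominator)
loglik <- sum(dnorm(y, mu.hat, s.hat, log = TRUE))
bic <- 2 * loglik - 2 * log(length(y))   # 2 free parameters: mu and sigma
bic
```

Comparing this score across candidate models (different K, different covariance structures) is exactly what the Mclust BIC plot in the next example does.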
Example revisited

[BIC plot comparing two covariance structures: curve 1 = EII, curve 2 = EEI]

library(mclust)
# cho.data.std: standardized version of cho.data (definition not shown on the slide)
cho.mclust.bic <- EMclust(cho.data.std, modelNames = c("EII", "EEI"))
plot(cho.mclust.bic)
cho.mclust <- EMclust(cho.data.std, 4, "EII")
sum.cho <- summary(cho.mclust, cho.data.std)
Example revisited
par(mfrow = c(2, 2))
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 4, ]), type = "l", xlab = "time", ylab = "log expression value")
Example revisited

cho.mclust <- EMclust(cho.data.std, 3, "EEI")
sum.cho <- summary(cho.mclust, cho.data.std)
par(mfrow = c(2, 2))
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
Summary
•Model based clustering is a nice alternative to heuristic clustering algorithms
•BIC can be used for choosing the covariance structure and the number of clusters