Transcript of Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
www.bioinformatics.ca
Lecture 7: ML & Data Visualization & Microarrays
MBP1010
Dr. Paul C. Boutros
Winter 2015
Department of Medical Biophysics
This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others
Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)
Lecture 6: Machine Learning & Data Visualization bioinformatics.ca
Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Machine-Learning
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Sequence Analysis
• Final Exam (written)
House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions
Topics For This Week
• Machine-learning 101 (Briefly)
• Data visualization 101
• Attendance
• Microarrays 101
# Read the log-expression cell-cycle data: first 50 genes, time-course columns 3:19
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])
# Pairwise Euclidean distances between genes
D.cho <- dist(cho.data, method = "euclidean")
# Single-linkage hierarchical clustering
hc.single <- hclust(D.cho, method = "single", members = NULL)
Example: cell cycle data
plot(hc.single)
Single linkage
Example: cell cycle data
Be careful when interpreting dendrograms: adjacency of leaves implies a proximity between elements that does not necessarily correlate with the actual distance between them (cf. genes #1 and #47).
Example: cell cycle data
Single linkage, k=2
rect.hclust(hc.single,k=2)
Example: cell cycle data
Single linkage, k=3
rect.hclust(hc.single,k=3)
Example: cell cycle data
Single linkage, k=4
rect.hclust(hc.single,k=4)
Example: cell cycle data
Single linkage, k=5
rect.hclust(hc.single,k=5)
Example: cell cycle data
Single linkage, k=25
rect.hclust(hc.single,k=25)
Example: cell cycle data
class.single <- cutree(hc.single, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.single == 1, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l",
        xlab = "time", ylab = "log expression value")
# as.matrix handles the cluster that contains only a single gene,
# which indexing would otherwise collapse to a plain vector
matplot(as.matrix(cho.data[class.single == 3, ]),
        type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l",
        xlab = "time", ylab = "log expression value")
Properties of cluster members, single linkage, k=4
Example: cell cycle data
Single linkage, k=4
Example: cell cycle data
Complete linkage, k=4
Single linkage, k=4
Example: cell cycle data
Hierarchical clustering analyzed

Advantages:
• There may be small clusters nested inside large ones
• No need to specify the number of groups ahead of time
• Flexible linkage methods

Disadvantages:
• Clusters might not be naturally represented by a hierarchical structure
• It's necessary to 'cut' the dendrogram in order to produce clusters
• Bottom-up clustering can result in poor structure at the top of the tree; early joins cannot be 'undone'
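The dist → hclust → cutree workflow from the R examples above has a direct analogue in SciPy. A minimal sketch, in Python rather than the slides' R, using made-up two-blob data purely for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two well-separated synthetic "expression profile" blobs, 17 time points each.
data = np.vstack([rng.normal(0.0, 0.3, size=(25, 17)),
                  rng.normal(3.0, 0.3, size=(25, 17))])

# Euclidean distances and single linkage -- the same choices as the slides.
d = pdist(data, metric="euclidean")
tree = linkage(d, method="single")

# Cut the tree into k=2 flat clusters (the analogue of cutree / rect.hclust).
labels = fcluster(tree, t=2, criterion="maxclust")
```

Cutting at k=3, 4, ... is just a different `t`, mirroring the rect.hclust sequence above.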
Partitioning methods
• Anatomy of a partitioning-based method:
  • data matrix
  • distance function
  • number of groups
• Output:
  • group assignment of every object
Partitioning based methods
• Choose K groups
• initialise group centers
• aka centroid, medoid
• assign each object to the nearest centroid according to the distance metric
• reassign (or recompute) centroids
• repeat last 2 steps until assignment stabilizes
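The loop just described (initialise, assign, recompute, repeat) is short enough to write out directly. A minimal NumPy sketch of the standard Lloyd's-algorithm variant, for illustration only (the slides themselves use R's kmeans):

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign to the nearest centroid, recompute, repeat."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k randomly chosen objects.
    centroids = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Assign each object to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        new = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new, labels):
            break  # assignments have stabilised
        labels = new
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

On well-separated data this converges in a handful of iterations; on harder data the result depends on the random start, which is why multiple restarts are recommended.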
K-means vs. K-medoids
K-means:
• Centroids are the 'mean' of the clusters
• Centroids need to be recomputed every iteration
• Initialisation is difficult, as the notion of a centroid may be unclear before beginning
• R function: kmeans

K-medoids:
• Centroids are actual objects that minimize the total within-cluster distance
• The centroid can be determined by a quick lookup into the distance matrix
• Initialisation is simply K randomly selected objects
• R function: pam
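The "quick lookup into the distance matrix" that makes each k-medoids update cheap can be seen in a toy sketch. This is an illustrative alternating scheme only, not R's pam (the real PAM algorithm also performs an explicit swap phase):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def k_medoids(dist, k, n_iter=50, seed=0):
    """Toy alternating k-medoids on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(dist), size=k, replace=False)  # k random objects
    labels = None
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)  # assign to nearest medoid
        new = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            # New medoid = the member minimising total within-cluster distance,
            # found by a lookup into the distance matrix -- no recomputation.
            new[j] = members[dist[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return labels, medoids
```

Because only the distance matrix is consulted, the same code works for any dissimilarity, not just Euclidean distance.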
Partitioning based methods
Advantages:
• The number of groups is well defined
• A clear, deterministic assignment of an object to a group
• Simple algorithms for inference

Disadvantages:
• Have to choose the number of groups
• Sometimes objects do not fit well into any cluster
• Can converge on locally optimal solutions, often requiring multiple restarts with random initialisations
Given N items, assume K clusters. The goal is to minimize the total within-cluster sum of squares

$\sum_{k=1}^{K} \sum_{i:\, c_i = k} \lVert x_i - \mu_k \rVert^2$

over the possible assignments $c_i$ and centroids $\mu_k$, where $\mu_k$ represents the location of the k-th cluster.

K-means
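Concretely, the objective is just the sum, over clusters, of squared distances from members to their centroid. A tiny worked Python example with made-up 1-D data:

```python
import numpy as np

def wcss(x, labels, centroids):
    """Within-cluster sum of squares: the k-means objective."""
    return sum(((x[labels == j] - c) ** 2).sum()
               for j, c in enumerate(centroids))

# Two 1-D clusters: {0, 2} with centroid 1, and {10} with centroid 10.
x = np.array([[0.0], [2.0], [10.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[1.0], [10.0]])
# wcss = (0-1)^2 + (2-1)^2 + (10-10)^2 = 2
```

Note that moving a centroid off its cluster mean (say, to 1.5) can only increase the objective, which is why step 3 of the algorithm recomputes the mean.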
1. Divide the data into K clusters; initialize the centroids with the mean of each cluster
2. Assign each item to the cluster with closest centroid
3. When all objects have been assigned, recalculate the centroids (mean)
4. Repeat 2-3 until the centroids no longer move
K-means
set.seed(100)
# Two Gaussian clouds: 50 points at the origin, 50 at (1, 1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Run k-means from the same random start, stopping after 1, 2 or 3
# iterations to watch the centroids converge
set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = 1)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = 2)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = 3)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)
K-means
K-means, k=4
set.seed(100)
km.cho <- kmeans(cho.data, 4)

par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l",
        xlab = "time", ylab = "log expression value")
K-means
K-means, k=4
Single linkage, k=4
K-means
K-means and hierarchical clustering are simple, fast and useful techniques.
Beware of the memory requirements of hierarchical clustering.
Both are a bit "ad hoc":
• Number of clusters?
• Distance metric?
• Good clustering?
Summary
Meta-Analysis
• Combining the results of multiple studies that address related hypotheses
• Often used to merge data from different microarray platforms
• Very challenging: it is unclear what the best approaches are, or how they should be adapted to the peculiarities of microarray data
Why Do Meta-Analysis?
• Can identify publication biases
• Appropriately weights diverse studies:
  • sample size
  • experimental reliability
  • similarity of the study-specific hypotheses to the overall one
• Increases statistical power
• Reduces information: a single meta-analysis vs. five large studies
• Provides clearer guidance
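One standard way a meta-analysis "appropriately weights diverse studies" is fixed-effect inverse-variance pooling: each study's effect estimate is weighted by the reciprocal of its variance, so larger, more precise studies count for more. A sketch with made-up numbers (the studies and effect sizes here are purely hypothetical):

```python
import numpy as np

# Hypothetical effect estimates (e.g. log fold-changes) from three studies,
# with their estimated variances (smaller variance ~ larger study).
effects   = np.array([0.8, 1.2, 1.0])
variances = np.array([0.10, 0.40, 0.20])

weights   = 1.0 / variances                      # inverse-variance weights
pooled    = (weights * effects).sum() / weights.sum()
pooled_se = np.sqrt(1.0 / weights.sum())         # SE of the pooled estimate
```

The pooled standard error is smaller than any single study's, which is the "increases statistical power" point above in miniature.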
Challenges of Meta-Analysis
• No control for bias: what happens if most studies are poorly designed?
• File-drawer problem: publication bias can be detected, but not explicitly controlled for
• How homogeneous is the data? Can it be fairly grouped?
• Simpson's Paradox
Simpson’s Paradox
Group-wise correlations are inverted when the groups are merged. Cautionary note for all meta-analyses!
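The paradox is easy to reproduce numerically. In this made-up example each group has a perfectly positive x-y trend, yet pooling the groups flips the sign of the correlation:

```python
import numpy as np

# Two groups (think: two microarray batches) sitting at offset baseline levels.
x1, y1 = np.array([1.0, 2.0, 3.0]), np.array([8.0, 9.0, 10.0])
x2, y2 = np.array([7.0, 8.0, 9.0]), np.array([1.0, 2.0, 3.0])

r1 = np.corrcoef(x1, y1)[0, 1]   # correlation within group 1 (positive)
r2 = np.corrcoef(x2, y2)[0, 1]   # correlation within group 2 (positive)
r_pooled = np.corrcoef(np.concatenate([x1, x2]),
                       np.concatenate([y1, y2]))[0, 1]   # negative
```

This is exactly why "can the data be fairly grouped?" matters before merging studies.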
Topics For This Week
• Machine-learning 101 (Focus: Unsupervised)
• Data visualization 101