
Source: bioinfo.vanderbilt.edu/zhanglab/lectures/AB2011Lecture13.pdf

Unsupervised analysis of gene expression data

Bing Zhang Department of Biomedical Informatics

Vanderbilt University

[email protected]


Overall workflow of a microarray study

  Biological question
  Experiment design
  Microarray experiment
  Image analysis
  Pre-processing
  Data analysis
  Hypothesis
  Experimental verification

Applied Bioinformatics, Spring 2011 2


Three major goals of gene expression studies

  Class comparison (supervised analysis)
    e.g. disease biomarker discovery
    Differential expression analysis
    Input: gene expression data, class label of the samples
    Output: differentially expressed genes

  Class detection (unsupervised analysis)
    e.g. patient subgroup detection
    Clustering analysis
    Input: gene expression data
    Output: groups of similar samples or genes

  Class prediction (supervised learning)
    e.g. disease diagnosis and prognosis
    Machine learning techniques
    Input: gene expression data, class label of the samples (training data)
    Output: prediction model


[Table: example gene expression data. Columns: Probe_set_id followed by six samples whose names end in 0_1, 0_2, 0_3, 60_1, 60_2, 60_3. Rows: probe sets 1007_s_at, 1053_at, 117_at, 121_at, 1255_g_at, 1294_at, 1316_at, 1320_at, 1405_i_at, 1431_at, 1438_at, 1487_at, 1494_f_at, 1552256_a_at, 1552257_a_at. The expression values are not recoverable from the extracted text.]


What is clustering

  Clustering algorithms are methods that divide a set of n objects (genes or samples) into g groups so that within-group similarities are larger than between-group similarities

  They are unsupervised techniques: no sample annotation is required during clustering


Rows: genes; columns: samples

         Sample_1  Sample_2  Sample_3  Sample_4  Sample_5  ……
TNNC1    14.82     14.46     14.76     11.22     11.55     ……
DKK4     10.71     10.37     11.23     19.74     19.73     ……
ZNF185   15.20     14.96     15.07     12.57     12.37     ……
CHST3    13.40     13.18     13.15     11.18     10.99     ……
FABP3    15.87     15.80     15.85     13.16     12.99     ……
MGST1    12.76     12.80     12.67     14.92     15.02     ……
DEFA5    10.63     10.47     10.54     15.52     15.52     ……
VIL1     11.47     11.69     11.87     13.94     14.01     ……
AKAP12   18.26     18.10     18.50     15.60     15.69     ……
HS3ST1   10.61     10.67     10.50     12.44     12.23     ……
……       ……        ……        ……        ……        ……        ……
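As a rough illustration of "within-group similarities larger than between-group similarities", the sketch below computes pairwise distances between the five samples in the table. The values are copied from the table; the use of Euclidean distance and all function and variable names are assumptions made for this example.

```python
import math

# Expression values copied from the table above (one row per gene).
samples = ["Sample_1", "Sample_2", "Sample_3", "Sample_4", "Sample_5"]
expression = {
    "TNNC1":  [14.82, 14.46, 14.76, 11.22, 11.55],
    "DKK4":   [10.71, 10.37, 11.23, 19.74, 19.73],
    "ZNF185": [15.20, 14.96, 15.07, 12.57, 12.37],
    "CHST3":  [13.40, 13.18, 13.15, 11.18, 10.99],
    "FABP3":  [15.87, 15.80, 15.85, 13.16, 12.99],
    "MGST1":  [12.76, 12.80, 12.67, 14.92, 15.02],
    "DEFA5":  [10.63, 10.47, 10.54, 15.52, 15.52],
    "VIL1":   [11.47, 11.69, 11.87, 13.94, 14.01],
    "AKAP12": [18.26, 18.10, 18.50, 15.60, 15.69],
    "HS3ST1": [10.61, 10.67, 10.50, 12.44, 12.23],
}


def sample_profile(index):
    """Return one sample's values across all genes (a column of the table)."""
    return [values[index] for values in expression.values()]


def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Pairwise distances between samples, keyed by (name_i, name_j) with i < j.
sample_distances = {
    (samples[i], samples[j]): euclidean(sample_profile(i), sample_profile(j))
    for i in range(len(samples))
    for j in range(i + 1, len(samples))
}

# Print the pairs from closest to farthest.
for pair, dist in sorted(sample_distances.items(), key=lambda item: item[1]):
    print(pair, round(dist, 2))
```

On these ten genes, Samples 1 to 3 sit close to one another and far from Samples 4 and 5, which is the kind of grouping a clustering algorithm is expected to find without being given any sample labels.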


Why clustering?

  Exploratory data analysis, providing rough maps and suggesting directions for further study

  Representing distances among high-dimensional expression profiles in a concise, visually effective way, such as a tree or dendrogram

  Identifying candidate subgroups in complex data, e.g. novel sub-types in cancer or groups of co-expressed genes

  Functional annotation based on guilt by association: genes that cluster with genes of known function may share that function


Clustering methods

  Hierarchical clustering: generate a hierarchy of clusters going from 1 cluster to n clusters

  Partitioning: divide the data into g groups using a reallocation algorithm, e.g. K-means


Hierarchical clustering

  Agglomerative clustering (bottom-up)
    Start out with all sample units in n clusters of size 1.
    At each step of the algorithm, the pair of clusters with the shortest distance is combined into a single cluster.
    The algorithm stops when all sample units are combined into a single cluster of size n.

  Divisive clustering (top-down)
    Start out with all sample units in a single cluster of size n.
    At each step of the algorithm, a cluster is partitioned into a pair of daughter clusters, selected to maximize the distance between the two daughters.
    The algorithm stops when the sample units are partitioned into n clusters of size 1.


Agglomerative clustering

  Requires a distance measure
    Between two objects
    Between clusters


Between objects distance measurement

  Euclidean distance
    Focuses on the absolute expression value

  Pearson correlation coefficient
    Focuses on the shape of the expression profile
    Parametric: assumes the data are normally distributed and follow a linear regression model

  Spearman correlation coefficient
    Focuses on the shape of the expression profile
    Non-parametric: makes no distributional assumption
    Less sensitive but more robust than Pearson


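Euclidean distance:

$$d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Pearson correlation coefficient:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Correlation-based distance:

$$d = 1 - r$$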


Different measurement, different distance

[Figure: line plot of four expression profiles (GeneA, GeneB, GeneC, GeneD). x-axis: Time (hr), 1 to 7. y-axis: Gene expression level (log2), 0 to 6.]

Most similar profile to GeneA (blue) based on different distance measurement:

  Euclidean: GeneB (pink)
  Pearson: GeneC (green)
  Spearman: GeneD (red)
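The effect can be reproduced with a small sketch of the three measures. The profiles below are made-up values chosen for illustration, not the data plotted on the slide, and the helper names are hypothetical. The correlation-based distances use d = 1 - r, and Spearman is computed as the Pearson correlation of the ranks.

```python
import math


def euclidean_distance(x, y):
    """d = sqrt(sum_i (x_i - y_i)^2): sensitive to absolute expression level."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = average_rank
        start = end + 1
    return result


def spearman_r(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))


DISTANCES = {
    "Euclidean": euclidean_distance,
    "Pearson": lambda x, y: 1 - pearson_r(x, y),
    "Spearman": lambda x, y: 1 - spearman_r(x, y),
}

# Made-up profiles over seven time points (NOT the values from the figure).
gene_a = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 5.0]
candidates = {
    "GeneB": [1.5, 1.0, 1.6, 0.9, 1.7, 1.1, 3.0],  # close in absolute level
    "GeneC": [3.5, 3.4, 3.3, 3.2, 3.1, 3.0, 6.0],  # same late spike, shifted up
    "GeneD": [3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0],  # same ordering, steady rise
}

# For each measure, find the candidate profile closest to gene_a.
nearest = {
    name: min(candidates, key=lambda gene: distance(gene_a, candidates[gene]))
    for name, distance in DISTANCES.items()
}
print(nearest)
# {'Euclidean': 'GeneB', 'Pearson': 'GeneC', 'Spearman': 'GeneD'}
```

With these values, the single late spike dominates the Pearson correlation, while Spearman only looks at the ordering of the time points, so the two correlation measures disagree with each other as well as with the Euclidean distance.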


Between cluster distance measurement

  Single linkage: the smallest of all pairwise distances between objects in the two clusters

  Complete linkage: the largest of all pairwise distances between objects in the two clusters

  Average linkage: the average of all pairwise distances between objects in the two clusters
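A minimal sketch of agglomerative clustering with the three linkage rules follows. It assumes Euclidean distance between objects and uses a simple, unoptimized search for the closest pair of clusters; the function names and the four one-dimensional example points are made up for illustration.

```python
import math
from itertools import combinations


def euclidean(x, y):
    """Distance between two objects."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Each linkage rule reduces the list of pairwise object distances to one number.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda distances: sum(distances) / len(distances),
}


def cluster_distance(cluster_a, cluster_b, points, linkage):
    """Distance between two clusters, each given as a tuple of point indices."""
    pairwise = [euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b]
    return LINKAGES[linkage](pairwise)


def agglomerate(points, linkage="average"):
    """Merge the closest pair of clusters until a single cluster remains.

    Returns the merge history as (members_a, members_b, distance) tuples.
    """
    clusters = [(i,) for i in range(len(points))]  # n clusters of size 1
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], points, linkage),
        )
        merges.append((a, b, cluster_distance(a, b, points, linkage)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Four made-up objects on a line, to keep the arithmetic easy to follow.
points = [[0.0], [1.0], [3.0], [7.0]]
for linkage in LINKAGES:
    print(linkage, [round(distance, 3) for _, _, distance in agglomerate(points, linkage)])
```

On this toy input the merge order is the same for all three rules, but the merge distances differ: the last merge happens at 4 for single linkage, 7 for complete linkage, and 17/3 for average linkage.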


Visualization and interpretation of hierarchical clustering results

  Dendrogram
    Output of a hierarchical clustering
    Tree structure with the genes or samples as the leaves
    The height of the join indicates the distance between the left branch and the right branch

  Heat map
    Graphical representation of data where the values are represented as colors


Partitioning

  General idea
    Select the number of groups, g
    Randomly divide the objects into g groups
    Iteratively rearrange the objects until a stop condition is met

  Representative methods
    K-means
    Self Organizing Map (SOM)


K-means

  1. Define k = number of clusters
  2. Randomly initialize a seed vector for each cluster
  3. Go through all objects, and assign each object to the cluster which it is most similar to
  4. Recalculate all seed vectors as the means of the patterns in each cluster
  5. Repeat 3 & 4 until a stop condition is met (e.g. until all objects are assigned to the same partition twice in a row)
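The procedure can be sketched in a few lines of Python, counting the bullets above as steps 1 to 5. This is an illustrative sketch under assumptions, not a reference implementation: similarity is taken to be Euclidean distance, the random seeds are drawn from the objects themselves, and the function names, the optional `seeds` argument, and the example data are all made up here.

```python
import math
import random


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(objects, k, seeds=None, rng=None, max_iterations=100):
    """Return (cluster label per object, final seed vectors)."""
    rng = rng or random.Random()
    # Step 2: initialize one seed vector per cluster.
    # Assumption: unless seeds are supplied, pick k distinct objects at random.
    chosen = seeds if seeds is not None else rng.sample(objects, k)
    seeds = [list(seed) for seed in chosen]
    assignment = None
    for _ in range(max_iterations):
        # Step 3: assign each object to the cluster with the closest seed.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(obj, seeds[c])) for obj in objects
        ]
        # Step 5: stop when the partition is the same twice in a row.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recalculate each seed as the mean of its cluster members.
        for c in range(k):
            members = [obj for obj, label in zip(objects, assignment) if label == c]
            if members:  # an empty cluster keeps its previous seed
                seeds[c] = mean_vector(members)
    return assignment, seeds


# Made-up two-dimensional objects forming two visually separate groups.
objects = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]
labels, centers = k_means(objects, k=2, rng=random.Random(0))
print(labels)
print(centers)
```

Because the seeds are random, the result can depend on the starting point, and an unlucky initialization can settle on a poor partition even for well-separated data. Running the procedure from several random starts and comparing the results is a common precaution.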


K-means

[Figure: k-means illustrated with two seed vectors (seed vector 1, seed vector 2). Panels in order:]

  Randomly initialize seeds
  Objects join with closest seed
  Recalculate seeds
  Reassign objects
  Recalculate seeds
  Reassign objects
  Seeds become stable: final clusters


Cool animations

  Hierarchical clustering
    http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html

  K-means
    http://animation.yihui.name/mvstat:k-means_cluster_algorithm


Resources

  Data source
    Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
    ArrayExpress: http://www.ebi.ac.uk/arrayexpress/

  Microarray data analysis tools
    Bioconductor: http://www.bioconductor.org/
    Expression profiler: http://www.ebi.ac.uk/expressionprofiler/


Summary

  Agglomerative clustering
    Bottom-up

  Between objects distance measurement
    Euclidean distance
    Pearson’s correlation coefficient
    Spearman’s correlation coefficient

  Between cluster distance measurement
    Single linkage
    Complete linkage
    Average linkage

  Visualization
    Dendrogram
    Heat map

  k-means clustering
    Partitioning


Exercise

  Data set: evan_deneris_2010_5ht_top500diff.txt

  500 selected probe sets

  Four groups (Rostral_5ht, Rostral_non5ht, Caudal_5ht, Caudal_non5ht)

  No missing values; already normalized; already log-transformed

  Use hierarchical clustering in Expression profiler (http://www.ebi.ac.uk/expressionprofiler) to generate a heat map
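For readers who want to inspect the data set locally before uploading it, a small loader might look like the sketch below. The file layout is an assumption that the source does not confirm: a tab-delimited text file with a header row of sample names and one row per probe set, with the probe set ID in the first column. The function name is hypothetical.

```python
import csv


def read_expression_table(path):
    """Read a tab-delimited expression matrix into (sample_names, values).

    Assumed layout (not confirmed by the source): the first row is a header
    whose first cell labels the ID column and whose remaining cells are sample
    names; every following row is a probe set ID followed by numeric values.
    """
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    sample_names = rows[0][1:]
    values = {row[0]: [float(cell) for cell in row[1:]] for row in rows[1:]}
    return sample_names, values
```

If the layout assumption holds, `read_expression_table("evan_deneris_2010_5ht_top500diff.txt")` would return the sample names and a dictionary with one list of values per probe set, which for this data set should contain 500 entries.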
