Introduction to Time-Course Gene Expression Data
STAT 675
R Guerra
April 21, 2008
Outline
• The Data
• Clustering – nonparametric, model based
• A case study
• A new model
The Data
• DNA Microarrays: collections of microscopic DNA spots, often representing single genes, attached to a solid surface
The Data
• Gene expression changes over time due to environmental stimuli or changing needs of the cell
• Measuring gene expression against time leads to time-course data sets
Time-Course Gene Expression
• Each row represents a single gene
• Each column represents a single time point
• These data sets can be massive, analyzing many genes simultaneously
Time-Course Gene Expression
• k-means to clustering
• “in the budding yeast Saccharomyces cerevisiae clustering gene expression data groups together efficiently genes of known similar function, and we find a similar tendency in human data…” Eisen et al. (1998)
Clustering Expression Data
• When these data sets first became available, it was common to cluster using non-parametric clustering techniques like K-Means and hierarchical clustering
Yeast Data Set
• Spellman et al. (1998) measured mRNA levels on yeast (Saccharomyces cerevisiae)
– 18 equally spaced time-points
– Of 6300 genes nearly 800 were categorized as cell-cycle regulated
– A subset of 433 genes with no missing values is a commonly used data set in papers detailing new time-course methods
– Original and follow-up papers clustered genes using K-means and hierarchical clustering
Spellman et al. (1998): Yeast cell cycle
[Figure: expression heat map. Rows = genes, row labels = cell cycle; columns = time points, column labels = experiments]
Yeast Data Set (Spellman et al.)
[Figure panels: K-means, Hierarchical]
Which method gives the “right” result???
Non-Parametric Clustering
1. Data curves
2. Apply distance metric to get distance matrix
3. Cluster
Issues with Non-Parametric Clustering
• Technical
– Require the number of clusters to be chosen a priori
– Do not take into account the time-ordering of the data
– Hard to incorporate covariate data, e.g., gene ontology
• Yeast analysis had number of clusters chosen based on number of cell cycle groups; no statistical validation showing that these were the best clustering assignments
Model-Based Clustering
• In response to limitations of nonparametric methods, model-based methods were proposed
– Time series
– Spline Methods
– Hidden Markov Model
– Bayesian Clustering Models
• Little consensus over which method is “best” to cluster time course data
K-Means Clustering
Relocation method: Number of clusters pre-determined and curves can change clusters at each iteration
– Initially, data assigned at random to k clusters
– Centroid is computed for each cluster
– Data reassigned to cluster whose centroid is closest to it
– Algorithm repeats until no further change in assignment of data to clusters
– Hartigan rule used to select “optimal” #clusters
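The relocation steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the lecture: the names `sq_dist`, `centroids`, and `kmeans`, the squared Euclidean distance, and the reseeding of an empty cluster with a random curve are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two curves."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroids(curves, labels, k, rng):
    """Mean curve of each cluster; an empty cluster is reseeded with a random curve."""
    centers = []
    for c in range(k):
        members = [x for x, lab in zip(curves, labels) if lab == c]
        if members:
            centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            centers.append(list(rng.choice(curves)))
    return centers

def kmeans(curves, k, seed=0, max_iter=100):
    """Return (labels, within-cluster sum of squares) for one random start."""
    rng = random.Random(seed)
    # Random initial assignment of curves to the k clusters
    labels = [i % k for i in range(len(curves))]
    rng.shuffle(labels)
    for _ in range(max_iter):
        centers = centroids(curves, labels, k, rng)
        # Reassign each curve to the cluster with the closest centroid
        new_labels = [min(range(k), key=lambda c: sq_dist(x, centers[c]))
                      for x in curves]
        if new_labels == labels:  # no further change in assignment
            break
        labels = new_labels
    centers = centroids(curves, labels, k, rng)
    wss = sum(sq_dist(x, centers[lab]) for x, lab in zip(curves, labels))
    return labels, wss
```

For example, `kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)` separates the first two curves from the last two.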
K-means: Hartigan Rule
• n curves, let k1 = k groups and k2 = k+1 groups.
• If E1 and E2 are the sums of the within cluster sums of squares for k1 and k2 respectively, then add the extra group if:
$$\left(\frac{E_1}{E_2} - 1\right)(n - k - 1) > 10$$
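As a small worked sketch, the Hartigan rule can be written as a function and applied repeatedly to pick a number of clusters. It assumes the rule in its usual form, (E1/E2 - 1)(n - k - 1) > 10; the function names and the dictionary `wss_by_k` (within-cluster sums of squares indexed by k) are hypothetical.

```python
def hartigan_add_cluster(e1, e2, n, k, threshold=10.0):
    """True if moving from k to k+1 clusters is justified by the rule."""
    return (e1 / e2 - 1.0) * (n - k - 1) > threshold

def choose_k(wss_by_k, n):
    """Increase k from the smallest available value while the rule says to."""
    k = min(wss_by_k)
    while k + 1 in wss_by_k and hartigan_add_cluster(
            wss_by_k[k], wss_by_k[k + 1], n, k):
        k += 1
    return k
```

With n = 50 curves and sums of squares 1000, 300, 100, 95 for k = 1, 2, 3, 4, the statistic is about 112, 94, and 2.4, so the rule stops at k = 3.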
K-means: Distance Metric
• Euclidean Distance
• Pearson Correlation
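A minimal sketch of the two metrics and of the distance matrix they feed into. Turning the Pearson correlation into a distance as 1 - r is an assumption made here (other conversions are in use); the function names are illustrative.

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_distance(x, y):
    # Assumed convention: 0 for perfectly correlated curves, 2 for anti-correlated
    return 1.0 - pearson_correlation(x, y)

def distance_matrix(curves, dist):
    """All pairwise distances between curves under the chosen metric."""
    return [[dist(a, b) for b in curves] for a in curves]
```

Euclidean distance compares curves by level and shape together, while the correlation-based distance treats two curves with the same shape but different scale as close.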
K-means: Starting Chains
• Initially, data are randomly assigned to k clusters but this choice of k cluster centers can have an effect on the final clustering
• The R implementation of K-means lets the user choose the “number of initial starting chains” (random starts); the run with the smallest sum of within cluster sums of squares is the one given as output
K-Means: Starting Chains
For j = 1 to B
    Random assignment j to k clusters
    wj = within cluster sum-of-squares
End j
Pick clustering with min(wj)
[Figure placeholder: initial starting chains]
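The loop above can be sketched as a small wrapper around any single-start clustering routine. The wrapper name `best_of_restarts` and the callable `cluster_once` (which takes a random generator and returns a pair of labels and within-cluster sum of squares) are hypothetical names for this example.

```python
import random

def best_of_restarts(cluster_once, n_starts, seed=0):
    """Run cluster_once n_starts times and keep the run with the smallest w_j."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        labels, wss = cluster_once(rng)  # one random start
        if best is None or wss < best[1]:
            best = (labels, wss)
    return best
```

Passing the generator into `cluster_once` keeps every start reproducible from one seed.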
Hierarchical Clustering
• Hierarchical clustering is an addition (agglomerative) or subtraction (divisive) method.
• Initially each curve is assigned its own cluster
– The two closest clusters are joined into one branch to create a clustering tree
– The clustering tree stops when the algorithm terminates via a stopping rule
Hierarchical Clustering
• Nearest neighbor: Distance between two clusters is the minimum of all distances between all pairs of curves, one from each cluster
• Furthest neighbor: Distance between two clusters is the maximum of all distances between all pairs of curves, one from each cluster
• Average linkage: Distance between two clusters is the average of all distances between all pairs of curves, one from each cluster
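The three linkage rules differ only in how they summarize the same set of between-cluster distances, as this short sketch shows. The function names are illustrative, and `dist` stands for whatever curve-to-curve metric is in use.

```python
def pairwise_distances(cluster_a, cluster_b, dist):
    """All distances between pairs of curves, one from each cluster."""
    return [dist(a, b) for a in cluster_a for b in cluster_b]

def nearest_neighbor(cluster_a, cluster_b, dist):
    return min(pairwise_distances(cluster_a, cluster_b, dist))

def furthest_neighbor(cluster_a, cluster_b, dist):
    return max(pairwise_distances(cluster_a, cluster_b, dist))

def average_linkage(cluster_a, cluster_b, dist):
    d = pairwise_distances(cluster_a, cluster_b, dist)
    return sum(d) / len(d)
```

For one-dimensional points {0, 1} and {4, 6} with absolute difference as the metric, the pairwise distances are 4, 6, 3, 5, so the three rules give 3, 6, and 4.5.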
Hierarchical Clustering
• Normally the algorithm stops at a pre-determined number of clusters or when the distance between two clusters reaches some pre-determined threshold
– No universal stopping rule of thumb to find an optimal number of clusters using this algorithm.
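Both stopping rules can be seen in a naive agglomerative loop. This is a sketch for illustration only (it recomputes every cluster-to-cluster distance at each step); the function name and its parameters `n_clusters` and `max_distance` are assumptions, and `linkage=min` corresponds to nearest neighbor.

```python
def agglomerate(curves, dist, linkage=min, n_clusters=1, max_distance=None):
    """Merge the two closest clusters until a stopping rule fires.

    Returns clusters as lists of indices into curves.
    """
    clusters = [[i] for i in range(len(curves))]  # each curve starts alone
    while len(clusters) > n_clusters:             # rule 1: target cluster count
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist(curves[i], curves[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if max_distance is not None and d > max_distance:
            break                                 # rule 2: distance threshold
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Passing `linkage=max` gives furthest neighbor; average linkage needs a small function that averages the generated distances.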
Model-Based Clustering
Many use mixture models, with splines or piecewise polynomial functions used to approximate curves
Can better incorporate covariate information
Models using Splines
• Time-course profiles are assumed to be observations from some underlying smooth expression curve
• Each data curve is represented as the sum of:
– Smooth population mean spline (dependent on time and cluster assignment)
– Spline function representing individual (gene) effects
– Gaussian measurement noise
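Written out, the three-part decomposition above might look as follows. The notation is an assumption made for this sketch and is not taken from the slides: gene i belongs to cluster k, y_ij is its expression at time t_j, and a constant noise variance is assumed.

$$y_{ij} = \mu_k(t_j) + b_i(t_j) + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma^2)$$

Here \mu_k is the smooth mean spline of cluster k, b_i is the spline for the individual (gene) effect, and \varepsilon_{ij} is the Gaussian measurement noise.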
SSCLUST software
Pan
Yeung et al. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics, 17:977-987.
MCLUST software
Validation Methods
$$\mathrm{BIC}(C) = 2L(C) - m\log(n)$$
• L(C) is the maximized log-likelihood for the model with C clusters, m is the number of independent parameters to be estimated and n is the number of genes
• Strikes a balance between goodness-of-fit and model complexity
• The non-model-based methods have no such validation method
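A short sketch of BIC-based selection of the number of clusters. It assumes the sign convention BIC(C) = 2 L(C) - m log(n), under which a larger BIC is better; some software uses the opposite sign and minimizes instead. The function names and the `fits` dictionary (mapping C to the pair of maximized log-likelihood and parameter count) are hypothetical.

```python
import math

def bic(log_likelihood, n_params, n_genes):
    """BIC under the assumed 'larger is better' convention."""
    return 2.0 * log_likelihood - n_params * math.log(n_genes)

def select_n_clusters(fits, n_genes):
    """Pick the C whose fitted model has the largest BIC."""
    return max(fits, key=lambda c: bic(fits[c][0], fits[c][1], n_genes))
```

With made-up fits `{2: (-500.0, 10), 3: (-450.0, 15), 4: (-448.0, 20)}` and n = 100 genes, the small likelihood gain from C = 3 to C = 4 does not pay for the five extra parameters, so C = 3 is selected.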
Clustering Yeast Data using SSClust
Clustering Yeast Data in MCLUST
Comparison of Methods
• Ma et al (2006)
• Smoothing Spline Clustering (SSClust)
• Simulation study
• SSClust better than MCLUST & nonparametric
• Comparison: misclassification rates
Functional Form of Ma et al (2006) Simulation Cluster Centers
MR and OSR
• Misclassification Rate
$$\mathrm{MR} = \frac{\#\text{ of Misclassified Curves}}{\text{Total } \#\text{ of Curves}}$$
• Overall Success Rate
$$\mathrm{OSR} = (\%\ \text{correct } \#\text{ clusters found}) \times (1 - \mathrm{MR})$$
– To calculate OSR the MR is only for the cases when the correct number of clusters is found
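The two rates are simple enough to state as code. This sketch assumes MR is passed as a fraction and the share of runs with the correct number of clusters as a percentage, so OSR comes out as a percentage; the function names are illustrative.

```python
def misclassification_rate(n_misclassified, n_total):
    """MR as a fraction of all curves."""
    return n_misclassified / n_total

def overall_success_rate(pct_correct_k, mr_when_correct):
    """OSR in percent; MR is taken only over runs with the correct cluster count."""
    return pct_correct_k * (1.0 - mr_when_correct)
```

For instance, 13 misclassified curves out of 1000 gives MR = 0.013, and if the correct number of clusters is found in every run, OSR = 100 x (1 - 0.013) = 98.7.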
Comparison of Methods
• From Ma et al. (2006) paper.
Clustering Method   Distance Metric   MR (%)   Correct # of Clusters (%)   OSR (%)
K-means             Euclidean         9.73     N/A                         N/A
K-means             Pearson           2.64     N/A                         N/A
MCLUST              N/A               0.38     77                          69.5
SSClust             N/A               0.13     100                         98.7
SSClust Methods Paper
• Concluded that SSClust was the superior clustering method
• Looking at the data, the differences in scale between the four true curves are large
– Typical time course clusters differ in location and spread but not in scale to this extreme
– Their conclusions are based on a data set which is not representative of the type of data this clustering method would be used for
Alternative Simulation
Functional Form for Five Cluster Centers
Example of SSClust Breaking Down
Linear curves joined while sine curves arbitrarily split into 2 clusters
Simulation Configuration
• Distance Metric
– Euclidean or Pearson
• # of Curves
– Small (100), Large (3000)
• Resolution of Time Points
– 13 or 25 time points
– evenly spaced or unevenly spaced
• Types of underlying Curves
– Small (4)
– Large (8)
Simulation Configuration
• Distribution of curves across clusters
– Equally distributed versus unequally distributed
• Noise Level
– Small (< 0.5*SD of the data set)
– Large (> 0.5*SD of the data set)
• For these cases, found the misclassification rates and the percent of times that the correct number of clusters was found
Function Forms of 7 Cluster Centers
Simulation Analysis
Conclusions from Simulations
• MCLUST performed better than SSClust and K-means in terms of misclassification rate and finding the correct number of clusters
• Clustering methods were affected by the level of noise but, in general, not by the number of curves, the number of time points or the distribution of curves across clusters
Effect of Number of Profiles on OSR
Comparison based on Real Data
• Applied these same clustering techniques to real data
• Different numbers of clusters found for different methods for each real data set
Yeast Data
Human Fibroblast Data
Simulations Based on Real Data
– Start with real data, like the yeast data set
– Cluster the data using a given clustering method
– Perturb the original data (add noise at each point)
– Evaluate how different the new clustering is in comparison to the original clustering
• Use MR and OSR
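A sketch of the perturb-and-compare step. Several choices here are assumptions, not details from the slides: the noise is taken to be independent Gaussian with a single standard deviation, and because cluster labels are arbitrary, the MR against the original clustering is taken as the minimum over all relabelings (practical only for a small number of clusters). The function names are illustrative.

```python
import itertools
import random

def perturb(curves, noise_sd, rng):
    """Add independent Gaussian noise to every point of every curve."""
    return [[v + rng.gauss(0.0, noise_sd) for v in curve] for curve in curves]

def matched_mr(reference, new):
    """MR of `new` against `reference`, minimized over relabelings of clusters."""
    ref_ids, new_ids = sorted(set(reference)), sorted(set(new))
    if len(ref_ids) != len(new_ids):
        raise ValueError("cluster counts differ; MR is only computed when they match")
    best = len(reference)
    for perm in itertools.permutations(ref_ids):
        mapping = dict(zip(new_ids, perm))  # one candidate relabeling
        errors = sum(mapping[b] != a for a, b in zip(reference, new))
        best = min(best, errors)
    return best / len(reference)
```

A full run would cluster the original data, cluster `perturb(data, sd, rng)` with the same method, and record `matched_mr` whenever the two clusterings have the same number of clusters.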
Simulations Based on Yeast Data
Simulations Based on HF Data
Conclusions from these Simulations
• SSClust better than MCLUST and K-means – This was in contrast to the prior simulations
where MCLUST was best
Gene Ontology
• So far I’ve described my work analyzing and comparing clustering results on gene expression data
• Some, like Pan (2006), have argued that clustering methods, even newer model-based clustering methods, are incomplete because they ignore gene function and other biological aspects in the clustering
Gene Ontology
• Expectation is that incorporating biological data with the expression data will lead to better clustering
Gene Ontology
• Gene Ontology project (Ashburner et al. 2000) provides a structured vocabulary to describe genes and gene products in organisms
• Three ontologies developed
– Biological Processes (e.g……)
– Molecular Function (e.g……)
– Cellular Component (e.g……)
Annotations
• Gene Ontology annotations are associations made between gene products and the GO terms describing them
• A directed acyclic graph for a gene from the HF data set using GO molecular function annotation is to the right
Clustering using GO Data
• First, need a distance metric
• Two metrics used are the Union-Intersection distance and the longest path distance, both developed in Gentleman (2005) and extended by Christian (2007)
• I used the Union-Intersection distance in my clustering
GO Distances
• The union-intersection distance is defined as
• Show example using two DAGs
– Min = 0 when two DAGs are identical
– Max = 1 when two DAGs have nothing in common
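The defining formula is not reproduced in this transcript. A form consistent with the two limits above is one minus the ratio of shared nodes to all nodes of the two induced GO graphs; the sketch below assumes that form, and the function name and set-of-term-IDs representation are illustrative rather than taken from Gentleman (2005).

```python
def ui_distance(nodes_a, nodes_b):
    """Assumed union-intersection distance between two DAGs given as node sets."""
    a, b = set(nodes_a), set(nodes_b)
    union = a | b
    if not union:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - len(a & b) / len(union)
```

With hypothetical term sets {t1, t2, t3} and {t2, t3, t4}, two of the four distinct terms are shared, giving a distance of 0.5.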
Showing UI Distance
Clustering Using All Data
• How to cluster genes using both time-course expression data and gene ontology data together is an open question
• Two of the methods I used are from Boratyn et al (2007) and from Fang et al (2006)
Boratyn et al (2007) Method
• Clusters are based on adding individually scaled distance matrices
– Take the distance matrix from expression clustering and the distance matrix from gene ontology clustering
– Put them on the same scale [0,1]
– Add the scaled distance matrices together
– Cluster using this new distance matrix which captures differences in expression profiles and gene function
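The scale-then-add steps can be sketched as below. Min-max scaling is an assumption (the slides only say the matrices are put on the same [0,1] scale), and the function names are illustrative.

```python
def scale_unit(matrix):
    """Min-max scale all entries of a distance matrix to [0, 1]."""
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]

def combined_distance(expr_dist, go_dist):
    """Sum of the two individually scaled distance matrices."""
    e, g = scale_unit(expr_dist), scale_unit(go_dist)
    return [[a + b for a, b in zip(row_e, row_g)] for row_e, row_g in zip(e, g)]
```

The combined matrix has entries in [0, 2] and can be handed to any distance-based method such as hierarchical clustering.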
Yeast: 12 Clusters on Combined Distance Metric
Fang et al (2006) Method
• In this method,
– Gene ontology is a guide for clustering the expression profiles
– Biological process is the GO annotation used
– Uses the mean squared residual score to assess the expression correlation of genes within a cluster from the clustering by GO data.
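For reference, a sketch of a mean squared residue score in the form commonly used for expression submatrices (row and column means removed, overall mean added back). Whether Fang et al. (2006) use exactly this form is an assumption; the function name is illustrative.

```python
def mean_squared_residue(rows):
    """Score for a genes-by-time-points block; 0 means perfectly coherent profiles."""
    n_rows, n_cols = len(rows), len(rows[0])
    row_means = [sum(r) / n_cols for r in rows]
    col_means = [sum(c) / n_rows for c in zip(*rows)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i, r in enumerate(rows):
        for j, v in enumerate(r):
            total += (v - row_means[i] - col_means[j] + overall) ** 2
    return total / (n_rows * n_cols)
```

Profiles that differ only by a constant shift, such as (1, 2, 3) and (2, 3, 4), score 0, while a low score within a GO-derived cluster indicates that its genes move together over time.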
Effect of the Choice of Ontologies
• Examined the effect of which ontology (BP, CC, or MF) is used in my clustering.
• Fang et al (2006) uses BP in their method as it has tended to be most closely correlated with gene function among the 3 ontologies
Effect of Choice of Ontology
Conclusions from GO Chapter
• Clustering using expression and ontology data together gave expression clusters as good as or better than those obtained when expression data are clustered alone, with the added bonus of a biological basis that filters out potentially nonsensical clusterings
Conclusions from Paper as a Whole
• Among model-based and non-model-based expression clustering methods, there is no uniform “best clustering method” in all cases
– But, methods are robust in terms of data apportionment per cluster and the number of curves per dataset (important for massive gene data banks).
• Clustering using expression and GO data together improves upon expression clustering and again methods vary in complexity, performance, and ease of use
Further Extensions
• GO analysis was all using K-means and hierarchical clustering
– Extend GO clustering to model-based clustering techniques like MCLUST and SSClust (currently, GO data can be used as initial conditions in these models but not as some notion of prior model parameters).
P. falciparum: Examination of Correlation Between Spatial Location and Temporal Expression of Genes
Motivations:
• Evidence for correlation in literature
– Printing artifact
– Biological
• Develop a visualization and statistical testing methodology
Biological Motivations
[Figure: three panels, “Operon control (bacteria)”, “Upstream Activating Sequences (yeast)”, and “Locus Control Region (mammalian globin cluster)”; diagram labels: promoter, UAS1, UAS2, LCR1, ORF1, ORF2, mRNA(s)]
Hypothesis and Statistic
• Statistical: Correlation between chromosomal location and gene expression?
• Biological: Gene order random?
• H0: no correlation between location on chromosome and expression
• Consider correlations in partitions
Approach
Covariogram: General Tool
Partition Chromosome, Develop Statistic
Permutation Testing Framework
Check for Confounding Factors
Biological Significance
Issues
• Confounding (printing) or other artifacts
• Account for inter-gene distances (as opposed to adjacent pairwise correlation)
• Significance of correlation
Methods: Data
• Need gene information (plasmodb.org has annotated FASTA files):
TCAAGCAATTGTTAGATGAGAACAATAGGAAGAATTTAAATTTTAATGATCTGGTTATACACCCTTGGTGGTCTTATAAGAATTAA
>Pfa3D7|pfal_chr1|PFA0135w|Annotation|Sanger(protein coding) hypothetical protein Location=join(124752..124823,124961..125719)
ATGATATTTCATAAATGCTTTAAAATTTGTTCGCTCTCTTGTACTGTTTTATGGGTTACCGCCATATCATCGATCATTCAACCAGACAAACAACAAGAAA
• Normalized gpr files (2-D loess, centered and scaled)
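As a sketch of how gene name and position might be pulled out of a header like the one shown, the function below splits on "|" and reads the coordinates after "Location=". The field positions are inferred from this single example header and are an assumption, as is the function name.

```python
import re

def parse_header(header):
    """Return (gene_id, start, end, midpoint) from an annotated FASTA header."""
    fields = header.lstrip(">").split("|")
    gene_id = fields[2]  # third field in the example header
    location = header.split("Location=")[1]
    coords = [int(x) for x in re.findall(r"\d+", location)]
    start, end = min(coords), max(coords)  # span across all exons
    return gene_id, start, end, (start + end) / 2
```

Applied to the PFA0135w header, this returns the span 124752 to 125719 and a midpoint of 125235.5.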
Methods: Data
• FASTA sequence: 5400 predicted genes (e.g. PFA0135w, 124752:125719 bp)
• QC Microarray: 3800 genes, 5100 probes (e.g. PFA0135w, probe a16122_1, t1, t2, …, t48)
• Intersection: 3500 genes with common gene name (e.g. PFA0135w, 124752:125719 bp, probe a16122_1, t1, t2, …, t48)
Methods: Covariograms
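$$\gamma(x, y; d_a, d_b) = \operatorname{Ave}\left[\rho(x, y) \mid d_a \le \operatorname{dist}(x, y) < d_b\right]$$
• Covariogram 1: distance is chromosomal location:
$$d(g_i, g_j) = \left| g_{i,\,\mathrm{midpt(chr\ loc)}} - g_{j,\,\mathrm{midpt(chr\ loc)}} \right|$$
• Covariogram 2: distance is printed microarray location:
$$d(g_i, g_j) = \sqrt{(g_{i,x} - g_{j,x})^2 + (g_{i,y} - g_{j,y})^2}$$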
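One covariogram value can be sketched as the average correlation over gene pairs whose distance falls in a given bin. The half-open bin [d_lo, d_hi), the function name, and the use of precomputed distance and correlation matrices are assumptions for this example; the same function serves both covariograms, depending on which distance matrix is supplied.

```python
def covariogram(dist, corr, d_lo, d_hi):
    """Average corr[i][j] over pairs i < j with d_lo <= dist[i][j] < d_hi."""
    n = len(dist)
    vals = [corr[i][j]
            for i in range(n) for j in range(i + 1, n)
            if d_lo <= dist[i][j] < d_hi]
    return sum(vals) / len(vals) if vals else None  # None: no pairs in this bin
```

Evaluating it over a sequence of adjacent bins and plotting the averages against distance gives the covariogram curve.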
Chr 10: Covariogram 2
Chr 10: Covariogram 1
Chr 6: Covariogram 1
Chr 6: Covariogram 2
Methods: Partitioning
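• Partition
• Avg of all pairwise Pearson correlations
[Figure: chromosome partitioned at 0 kb, 60 kb, 120 kb]
– 3 genes, $\binom{3}{2} = 3$ pairwise correlations: $\bar{r} = \frac{1}{3}\sum_{i=1}^{3} r_i$
– 7 genes, $\binom{7}{2} = 21$ pairwise correlations: $\bar{r} = \frac{1}{21}\sum_{i=1}^{21} r_i$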
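A sketch of the partition statistic: genes are binned by position into fixed-width windows, and the statistic for a window is the mean of all pairwise Pearson correlations among its genes. Binning by gene midpoint and the function names are assumptions made for this example.

```python
import math
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_pairwise_correlation(profiles):
    """Average of all pairwise Pearson correlations; needs at least 2 profiles."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

def partition_genes(midpoints, width):
    """Map window index -> indices of genes whose midpoint lies in that window."""
    windows = {}
    for i, m in enumerate(midpoints):
        windows.setdefault(int(m // width), []).append(i)
    return windows
```

A window with 3 genes averages 3 correlations and one with 7 genes averages 21, matching the counts above.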
Methods: Partitioning
• Chr 6, 40 kb partition
• Significant?
Methods: Permutation Test
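• $\bar{r} = .50$ in a 40kb interval on chr 6
• Permutation test
• Null distribution
• Estimated p-values
[Figure: observed data pair genes g1, g2, g3, g4 with expression profiles e1, e2, e3, e4; Perm(1), Perm(2), …, Perm(n) reassign e1, e2, e3, e4 among the genes]
[Figure annotation: $\bar{r}_{obs} = 0.57$, $n_{genes} = 2$, p-val = 0.22]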
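A sketch of the permutation framework: expression profiles are shuffled across genes, the statistic is recomputed for the same interval each time, and the p-value is the share of permutations at least as large as the observed value. The one-sided comparison, the add-one correction in the estimate, and all names (`statistic` would be the mean pairwise correlation in this setting) are assumptions for illustration.

```python
import random

def permutation_p_value(profiles, interval_idx, statistic, n_perm=1000, seed=0):
    """Return (observed statistic, estimated one-sided p-value).

    profiles: one expression profile per gene on the chromosome
    interval_idx: indices of the genes inside the interval under test
    """
    rng = random.Random(seed)
    observed = statistic([profiles[i] for i in interval_idx])
    hits = 0
    for _ in range(n_perm):
        shuffled = profiles[:]
        rng.shuffle(shuffled)  # break the link between location and expression
        if statistic([shuffled[i] for i in interval_idx]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero
    return observed, (hits + 1) / (n_perm + 1)
```

Shuffling whole profiles keeps each gene's time course intact, so only the assignment of profiles to chromosomal positions is randomized under H0.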
Methods: Permutation Test
• Distribution of $\bar{r}$ in 40 kb interval
[Figure annotation: $\bar{r}_{obs} = 0.72$, $n_{genes} = 6$, p-val = 0.001]
Methods: Permutation Test
• Distribution of $\bar{r}$ in 40 kb interval
[Figure annotation: $\bar{r}_{obs} = 0.49$, $n_{genes} = 9$, p-val = 0.002]
Methods: Permutation Test
• Distribution of $\bar{r}$ in 40 kb interval
[Figure annotation: $\bar{r}_{obs} = 0.018$, $n_{genes} = 12$, p-val = 0.475]
Methods: Permutation Test
• Distribution of $\bar{r}$ in 40 kb interval
Significant Intervals (Chr 7)
[Figure: panels for 10kb, 20kb, 40kb, 60kb, 80kb, and 100kb partitions]
[Figure: 10kb, 20kb, 40kb, 60kb, 80kb, and 100kb partitions, with the genes listed below]
MAL6P1.257: hypothetical protein
MAL6P1.258: malate:quinone oxidoreductase
MAL6P1.259: hypothetical protein
MAL6P1.260: hypothetical protein
MAL6P1.263: hypothetical protein
MAL6P1.265: pyridoxine kinase
MAL6P1.266: hypothetical protein
MAL6P1.267: hypothetical protein
MAL6P1.268: hypothetical protein
MAL6P1.271: cdc2-like protein kinase
MAL6P1.272: ribonuclease
MAL6P1.273: hypothetical protein
Results: Summary Table
Chromosome   10kb     60kb    100kb   10kb in 60kb
Chr 3        3/400    0/68    0/40    0
Chr 4        10/476   5/80    2/48    4
Chr 5        6/528    1/88    3/56    0
Chr 14       4/1304   2/220   1/132   0
Conclusions
• Statistical: Significance for both small regions of strong correlation and large regions of weak correlation
• Biological: Evidence for regulation at multiple levels