Uncorrelated and Autocorrelated relaxed phylogenetics
June 2008 bioinf.cs.auckland.ac.nz
Michaël Defoin-Platel and Alexei Drummond
Relaxed Phylogenetics 2
(Bayesian) RELAXED PHYLOGENETICS
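Relaxed Phylogenetics
• allows the co-estimation of divergence times together with a phylogenetic reconstruction
• should be compared with an unrooted tree (2n-3 parameters) and a rooted tree with a strict clock (n-1 divergence times)
[Figure: unrooted tree with branch lengths b1 to b5; rooted strict-clock tree on a time axis with divergence times t0, t1, t2]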
TIME, SUBSTITUTIONS, and RATES
Time, substitutions and rates
• Expected number of substitutions per site on a particular branch i:
$b_i(T) = \int_0^T R_i(t)\,dt$
• Substitution rate R(t) cannot be directly observed!
→ Only the product of rate and time is identifiable
→ Without information external to the data, rate and time cannot be separated…
[Figure: branch i on a time axis running from 0 to T]
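To illustrate why only the product of rate and time is identifiable, here is a minimal sketch. It assumes a rate that is constant along the branch, so that the integral reduces to b = rate × time. The function name and the numerical values are hypothetical.

```python
def branch_length(rate, duration):
    """Expected substitutions per site for a constant rate: b = rate * duration."""
    return rate * duration


# Two hypothetical (rate, time) pairs: doubling the rate while halving the
# time leaves the expected number of substitutions unchanged, so sequence
# data alone cannot tell the two scenarios apart.
slow_and_long = branch_length(rate=0.001, duration=20.0)
fast_and_short = branch_length(rate=0.002, duration=10.0)

print(slow_and_long, fast_and_short)
```

Any external information on time (a calibration) or on rate breaks this tie.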
MOLECULAR CLOCK HYPOTHESIS
Molecular Clock Hypothesis (MCH) (Zuckerkandl and Pauling 1965)
• DNA and protein sequences change at a rate that is constant over time
• First the substitution rate is estimated, then time corresponds to sequence divergence divided by the rate
→ Estimation of relative rate and relative divergence times
Calibration
• Time reference, scaling
• Bayesian Phylogenetics: priors on node height or on tips
→ Transform relative to absolute rate
$\frac{\text{calibration time}}{\text{calibration divergence}} = \frac{\text{evaluated time}}{\text{evaluated divergence}}$
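The calibration step can be sketched as simple arithmetic under a strict clock. This is only an illustration: it assumes divergence is measured along a single lineage (from a node to a tip), and the function names and numbers are hypothetical.

```python
def calibrate_rate(calibration_divergence, calibration_time):
    """Absolute rate implied by one calibrated split under a strict clock."""
    return calibration_divergence / calibration_time


def divergence_time(evaluated_divergence, rate):
    """Time of another split: sequence divergence divided by the rate."""
    return evaluated_divergence / rate


# Hypothetical calibration: 0.032 substitutions per site accumulated in 16 time units.
rate = calibrate_rate(calibration_divergence=0.032, calibration_time=16.0)

# Hypothetical split to be dated, with 0.012 substitutions per site.
time = divergence_time(evaluated_divergence=0.012, rate=rate)

print(rate, time)
```

The ratio time / divergence is the same for the calibrated split and the evaluated split, which is the proportion stated on the slide.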
MOLECULAR CLOCK HYPOTHESIS
Substitution rate depends on
• Natural selection, population size, body mass, generation time, mutation rate, mutation pattern, …
→ MCH is often violated!
How to deal with non-clock-like data
• Keep them!
• Remove them!
• Relax the MCH
→ Allow the rate of evolution to vary
→ Make assumptions about the variations
RELAXING THE MCH
Modeling the “Rate of evolution of the rate of evolution”
• Sanderson “nonparametric” model
• (Random) Local Clock model
• Uncorrelated relaxed clock model
• Autocorrelated relaxed clock model
• Compound Poisson process
Implementation of relaxed clock models in Beast allows the co-estimation of
• the substitution parameters
• the clock parameters
• the ancestral phylogenies
• the demography
• …
→ Relaxed phylogenetics
UNCORRELATED RELAXED CLOCK (UC) Drummond et al 2006
Hypothesis
• The rate of evolution is probably never exactly the same for all evolutionary lineages
• Rates follow a given distribution
Prior on rates
→ Distribution of the rates given by the hyperparameters μ and σ² (log-normal) or λ (exponential)
$r \sim \mathrm{LogNormal}(\mu, \sigma^2)$ or $r \sim \mathrm{Exp}(\lambda)$
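A minimal sketch of this prior, assuming each branch rate is drawn independently from the chosen distribution. The function name, the parameter names `mu`, `sigma` and `lam`, and their default values are illustrative assumptions, not the Beast implementation.

```python
import random


def sample_uncorrelated_rates(n_branches, prior="lognormal",
                              mu=0.0, sigma=0.5, lam=1.0, rng=None):
    """Draw one rate per branch, independently of all other branches."""
    rng = rng or random.Random()
    if prior == "lognormal":
        # mu and sigma are the mean and standard deviation of log(r).
        return [rng.lognormvariate(mu, sigma) for _ in range(n_branches)]
    if prior == "exponential":
        # lam is the rate parameter of the exponential, so the mean is 1 / lam.
        return [rng.expovariate(lam) for _ in range(n_branches)]
    raise ValueError("prior must be 'lognormal' or 'exponential'")


rng = random.Random(2008)  # fixed seed so the sketch is reproducible
lognormal_rates = sample_uncorrelated_rates(6, prior="lognormal", rng=rng)
exponential_rates = sample_uncorrelated_rates(6, prior="exponential", rng=rng)

print(lognormal_rates)
print(exponential_rates)
```

Because the draws are independent, the rate on a branch carries no information about the rate on its parent branch, which is what "uncorrelated" means here.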
UNCORRELATED RELAXED CLOCK (UC) Drummond et al 2006
Implementation
• Different rates in a tree
• But a constant rate per branch
• On a given rooted tree of n species: 2n-2 rates, n-1 divergence times
• The distribution is discretized
• Each branch of the tree is assigned a given rate category
• Category mixing: swapped, drawn (uniform), random walk
[Figure: rooted tree of 4 species on a time axis with divergence times t0, t1, t2 and branch rates r0 to r5; density of the relative rate r over the range 0 to 10, $r \sim LN(\mu, \sigma^2)$]
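The discretization and the "swapped" move can be sketched as follows. This is one plausible reading of the slide rather than the actual Beast code: it assumes equal-probability categories whose rates are the mid-category quantiles of the log-normal, and all names and values are hypothetical. The "drawn (uniform)" and "random walk" moves are not shown.

```python
import math
import random
from statistics import NormalDist


def discretize_lognormal(n_categories, mu=0.0, sigma=0.5):
    """Rates at the mid-quantile of each of n equal-probability categories."""
    log_rate = NormalDist(mu, sigma)  # distribution of log(r)
    return [math.exp(log_rate.inv_cdf((k + 0.5) / n_categories))
            for k in range(n_categories)]


def swap_categories(assignment, rng):
    """'Swapped' move: exchange the categories of two randomly chosen branches."""
    i, j = rng.sample(range(len(assignment)), 2)
    proposal = list(assignment)
    proposal[i], proposal[j] = proposal[j], proposal[i]
    return proposal


rng = random.Random(42)
category_rates = discretize_lognormal(n_categories=6)  # 2n-2 = 6 branches for n = 4
assignment = list(range(6))                            # one category per branch
rng.shuffle(assignment)
proposal = swap_categories(assignment, rng)
branch_rates = [category_rates[c] for c in proposal]

print(branch_rates)
```

Working with category indices instead of continuous rates means a move only has to permute or redraw integers, while the rate values themselves follow from the hyperparameters.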
AUTOCORRELATED RELAXED CLOCK (AC) Thorne and Kishino 1998,2001,2002
Hypothesis
• The rate is probably never exactly the same for all evolutionary lineages
• For closely related lineages the rates should be similar
Prior on rates
• The log of the rates follows a Normal distribution
• The expectation of a rate r is its ancestor rate rA
$\log(r) \mid \log(r_A) \sim N\left(\log(r_A) - \frac{t\sigma^2}{2},\; t\sigma^2\right)$
→ Rate at the root node is given by a hyperparameter
→ Amount of variation is given by the hyperparameter σ²
[Figure: ancestral rate rA and descendant rate r separated by time t]
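As a check that this prior makes the ancestral rate the expectation of the descendant rate, the log-normal mean can be worked out directly. The derivation assumes the time-dependent form, with variance tσ² over an elapsed time t, and uses the standard identity E[exp(X)] = exp(m + v/2) for X ~ N(m, v):

```latex
\begin{align*}
\log r \mid \log r_A &\sim N\!\left(\log r_A - \tfrac{t\sigma^2}{2},\; t\sigma^2\right) \\
E[\,r \mid r_A\,] &= \exp\!\left(\log r_A - \tfrac{t\sigma^2}{2} + \tfrac{t\sigma^2}{2}\right) = r_A
\end{align*}
```

The shift of tσ²/2 in the mean of the log rate is what keeps the expected rate from drifting upward along the tree.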
AUTOCORRELATED RELAXED CLOCK (AC) Thorne and Kishino 1998,2001,2002
Implementation
• Different rates in a tree
• But a constant rate per branch
• On a given rooted tree of n species: 2n-2 rates, n-1 divergence times
Episodic vs Time dependent
• Episodic variance = σ²
• Time dependent variance = tσ²
[Figure: rooted tree of 4 species on a time axis with divergence times t0, t1, t2 and branch rates r0 to r5]
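The two variants differ only in how the variance of the log rate is set. The sketch below is an illustration under stated assumptions: in both variants the mean of log(r) is shifted by half the variance so that the expected rate equals the parent rate (for the episodic case this shift is an assumption), and the function name, the parameter names and the values are hypothetical.

```python
import math
import random


def sample_child_rate(parent_rate, branch_duration, sigma2, episodic=False, rng=random):
    """Draw a descendant rate whose expectation is the parent rate."""
    # Episodic: the variance does not depend on elapsed time.
    # Time dependent: the variance grows linearly with the branch duration.
    variance = sigma2 if episodic else branch_duration * sigma2
    mean_log = math.log(parent_rate) - variance / 2.0
    return math.exp(rng.gauss(mean_log, math.sqrt(variance)))


rng = random.Random(0)
n_draws = 20000
time_dependent = [sample_child_rate(1.0, 5.0, 0.1, episodic=False, rng=rng)
                  for _ in range(n_draws)]
episodic = [sample_child_rate(1.0, 5.0, 0.1, episodic=True, rng=rng)
            for _ in range(n_draws)]

# Both sample means should be close to the parent rate of 1.0.
mean_time_dependent = sum(time_dependent) / n_draws
mean_episodic = sum(episodic) / n_draws
print(mean_time_dependent, mean_episodic)
```

With a branch duration above 1, the time-dependent draws are more dispersed than the episodic ones, even though both are centred on the parent rate.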
GOALS of this TALK
Validation of the model implementations
Comparison of models
• Fit the data
• Deal with calibrations
• Estimate of divergence times
• Estimate of rates
• Reconstruct the tree topology
PHYLOGENETIC ANALYSIS
Dataset 1: Lemurs (Yoder et al 2000)
• 36 species (lemurs + mammals outgroup)
• alignment of 1812 nucleotides (2 genes)
• 7 calibration points
Settings
• HKY substitution model + gamma rate heterogeneity
• Yule tree prior
• 4 independent runs of 20 M steps of MCMC for each setting
PHYLOGENETIC ANALYSIS
Dataset 2: Primates (Peter Waddell)
• 7 species of primates: human, chimp, gorilla, orangutan, gibbon, macaque and marmoset
• alignment of 1,362,261 nucleotides
• Non coding regions
• calibration: 16 MYA divergence time of human – orangutan
Settings
• GTR substitution model + gamma rate heterogeneity + Invariant
• Coalescent or Yule tree prior
• 4 independent runs of 50 M steps of MCMC for each setting
PHYLOGENETIC ANALYSIS
Dataset 3: Yeast (Rokas et al 2003)
• 8 species of yeast
• alignment of 127,026 nucleotides (106 genes)
• calibration: Normal prior on the root height N(1, 0.025)
Settings
• GTR substitution model + gamma rate heterogeneity + Invariant
• Yule tree prior
• 4 independent runs of 50 M steps of MCMC for each setting
PHYLOGENETIC ANALYSIS
Dataset 4: Dengue (Rambaut 2000)
• 17 serotype 4 sequences
• alignment of 1,485 nucleotides
• serial sampling (1956-1994)
Settings
• HKY substitution model
• Coalescent tree prior
• 4 independent runs of 10 M steps of MCMC for each setting
PHYLOGENETIC ANALYSIS
Dataset 5: Influenza A virus (Drummond et al 2006)
• 69 sequences
• each sequence represents a consensus of the viral population
• alignment of 98 nucleotides
• serial sampling (1981-1998)
Settings
• HKY substitution model + gamma rate heterogeneity
• Coalescent tree prior
• Constant population size
• 4 independent runs of 20 M steps of MCMC for each setting
MODEL COMPARISON
Bayes Factor (Kass and Raftery 1995, Marc Suchard 2005)
• Quantifies the relative support for two competing hypotheses given the observed data
→ Ratio of the marginal likelihoods of two models M1 and M2
→ Bayesian analogue of the likelihood ratio test (LRT)
$K = \frac{\Pr(D \mid M_1)}{\Pr(D \mid M_2)}$
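Since marginal likelihoods are reported on the log scale, the Bayes factor is most easily handled as a difference of logs. The sketch below uses the Lemurs values for the strict clock (SC) and the uncorrelated clock (UC) from the marginal log likelihood slide, and labels the result with the 2 ln K scale of Kass and Raftery (1995). The function names are hypothetical.

```python
def log_bayes_factor(log_marginal_m1, log_marginal_m2):
    """ln K = ln Pr(D | M1) - ln Pr(D | M2)."""
    return log_marginal_m1 - log_marginal_m2


def kass_raftery_label(two_ln_k):
    """Evidence category for a non-negative value of 2 ln K (Kass and Raftery 1995)."""
    if two_ln_k < 2:
        return "not worth more than a bare mention"
    if two_ln_k < 6:
        return "positive"
    if two_ln_k < 10:
        return "strong"
    return "very strong"


# Lemurs dataset: M1 = uncorrelated relaxed clock, M2 = strict clock.
ln_k = log_bayes_factor(-31349.3, -31524.7)
favoured = "M1" if ln_k > 0 else "M2"
label = kass_raftery_label(2 * abs(ln_k))

print(ln_k, favoured, label)
```

Working on the log scale avoids exponentiating values such as -31 349.3, which would underflow in floating point.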
MARGINAL LOG LIKELIHOOD
            SC              UC              AC              eAC
Lemurs      -31 524.7       -31 349.3       -31 355.4       -31 352.3
Primates    -3 090 089.90   -3 089 592.76   -3 089 591.72   -3 089 591.37
Yeast       -684 380.8      -683 754.6      -683 754.4      -683 754.6
Dengue      -3 861.7        -3 861.5        -3 861.9        -3 861.7
Influenza   -4 288.8        -4 263.9        -4 272.1        -4 275.7

A priori
            Clock-like   Correlated   Calibrations
Lemurs      No           ?            7 internal (hard)
Primates    Nearly       Yes          1 internal (soft)
Yeast       No           ?            root node (soft)
Dengue      Yes          Yes          Serial Sampling
Influenza   No           No           Serial Sampling
Influenza dataset: consensus trees
[Figure: two consensus trees; panels: Uncorrelated, Autocorrelated]
DIVERGENCE TIMES
[Figure: divergence times, one panel per dataset: Lemurs, Primates, Yeast, Dengue, Influenza]
DIVERGENCE TIMES
[Figure: Posterior distribution of the root height. x-axis: root height (20 to 93); y-axis: marginal density (0 to 0.07); series: UC+Coalescent, UC+Yule, AC+Coalescent, AC+Yule]
[Figure: Divergence times of Human from other Primates. x-axis: Chimp, Gorilla, Orangutan, Gibbon, OWM, NWM; y-axis: MYA (0 to 100); series: Table 5 of Glazko et al (2003), UC + Coalescent, UC + Yule, AC + Coalescent, AC + Yule]
Beast: mean of the posterior distributions, error bars are 95% lower and upper HPDs
Glazko et al: error bars are +/- standard error
DIVERGENCE TIMES
[Figure: branch rate (8.00E-04 to 1.20E-03) against time (Mya) on the primates tree (Human, Chimp, Gorilla, Orang, Gibbon, Macaque, Marmoset); panels: Uncorrelated Relaxed Clock, Autocorrelated Relaxed Clock]
RATE OF EVOLUTION
                 Mean Rate   External Rate   Coefficient of Variation   Coefficient of Correlation
Lemurs     SC    0.00297                     -                          -
           UC    0.00309     0.00357         0.39                       0.01
           AC    0.00325     0.00419         0.37                       0.88
           eAC   0.00325     0.00472         0.49                       0.88
Primates   SC    0.00095                     -                          -
           UC    0.00098     0.00099         0.12                       -0.14
           AC    0.00105     0.00100         0.11                       0.56
           eAC   0.00104     0.00099         0.11                       0.74
Yeast      SC    1.03                        -                          -
           UC    0.87        0.83            0.46                       -0.13
           AC    0.83        0.79            0.37                       0.19
           eAC   0.90        0.98            0.44                       0.33
RATE OF EVOLUTION
                 Mean Rate   External Rate   Coefficient of Variation   Coefficient of Correlation
Dengue     SC    0.00080                     -                          -
           UC    0.00081     0.00082         0.06                       -0.03
           AC    0.00079     0.00080         0.06                       0.69
           eAC   0.00079     0.00081         0.05                       0.69
Influenza  SC    0.0048                      -                          -
           UC    0.0050      0.0061          0.58                       -0.01
           AC    0.0050      0.0052          0.37                       0.87
           eAC   0.0045      0.0052          0.38                       0.89
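The slides do not define the two summary statistics, so the following sketch rests on assumptions: the coefficient of variation is taken as the standard deviation of the branch rates divided by their mean, and the coefficient of correlation as the Pearson correlation between each branch rate and the rate of its parent branch. Function names and rate values are hypothetical.

```python
import math
from statistics import mean, pstdev


def coefficient_of_variation(rates):
    """Standard deviation of the branch rates divided by their mean."""
    return pstdev(rates) / mean(rates)


def parent_child_correlation(pairs):
    """Pearson correlation over (parent rate, child rate) pairs."""
    parents = [p for p, _ in pairs]
    children = [c for _, c in pairs]
    mean_p, mean_c = mean(parents), mean(children)
    covariance = sum((p - mean_p) * (c - mean_c) for p, c in pairs)
    spread_p = sum((p - mean_p) ** 2 for p in parents)
    spread_c = sum((c - mean_c) ** 2 for c in children)
    return covariance / math.sqrt(spread_p * spread_c)


# Hypothetical branch rates and (parent, child) rate pairs from one sampled tree.
rates = [0.9, 1.0, 1.1, 1.2, 0.8, 1.0]
pairs = [(1.0, 1.1), (1.0, 0.9), (1.1, 1.2), (0.9, 0.8)]

cv = coefficient_of_variation(rates)
corr = parent_child_correlation(pairs)
print(cv, corr)
```

Under these definitions, a coefficient of variation near zero indicates clock-like data, and a correlation near zero indicates no detectable autocorrelation of rates between parent and child branches.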
GENES RATE VS SPECIES RATE
[Figure: mean rate per “locus”; panels: Primates, Yeast]
NAÏVE MULTIPLE LOCUS APPROACH
Super Matrix (SM)
→ Genes share the same divergence time
Multiple Locus (mL)
→ Perform a relaxed phylogenetic analysis for each “gene”
                SC              UC              AC              eAC
Yeast (SM)      -684 380.8      -683 754.6      -683 754.4      -683 754.6
Yeast (mL)      -672 854.3      -672 135.5      -672 115.8      -672 128.86
Primates (SM)   -3 090 089.90   -3 089 592.76   -3 089 591.72   -3 089 591.37
Primates (mL)   -3 078 315.48   -3 077 756.50   -3 077 784.95   -3 078 136.58
GENES DIVERGENCE TIMES VS SPECIES DIVERGENCE TIMES
Root Height in the primates dataset
        Genome            Multiple Locus
        Mean     Error    Mean     Error
SC      56.91    0.04     57.91    0.51
UC      55.7     0.61     55.47    0.60
AC      49.7     0.08     51.52    0.39
eAC     51.06    0.58     54.9     0.47
GENES RATE VS SPECIES RATE
                 Coefficient of Variation          Coefficient of Correlation
                 Super Matrix   Multiple Locus     Super Matrix   Multiple Locus
Yeast      UC    0.46           0.75               -0.13          -0.07
           AC    0.37           0.71               0.19           0.39
           eAC   0.44           0.77               0.33           0.34
Primates   UC    0.12           0.16               -0.14          -0.08
           AC    0.11           0.10               0.56           0.44
           eAC   0.11           0.03               0.74           0.49
GENES TREE VS SPECIES TREE
                 % True Tree in 95% Cred Set   Size of 95% Cred Set   True Tree Posterior
Yeast      SC    64.7                          2.9                    25.4
           UC    92.4                          24.7                   20.6
           AC    88.6                          17.8                   15.7
           eAC   88.6                          15.1                   19.1
Primates   SC    86.7                          1.1                    79.4
           UC    87.5                          1.3                    75.7
           AC    87.5                          1.2                    77.7
           eAC   87.5                          1.1                    79.1
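A 95% credible set of topologies is commonly built by ranking the sampled topologies by posterior frequency and accumulating them until 95% of the samples are covered. The sketch below shows that construction for a single analysis, with hypothetical topology labels standing in for tree strings. How the talk aggregated these quantities over loci is not stated, so this is only an illustration.

```python
from collections import Counter


def credible_set(sampled_topologies, level=0.95):
    """Smallest set of most frequent topologies covering `level` of the samples."""
    counts = Counter(sampled_topologies)
    needed = level * len(sampled_topologies)
    selected, covered = [], 0
    for topology, count in counts.most_common():
        selected.append(topology)
        covered += count
        if covered >= needed:
            break
    return selected


# Hypothetical posterior sample of 100 trees over four distinct topologies.
sample = ["T1"] * 80 + ["T2"] * 12 + ["T3"] * 5 + ["T4"] * 3
true_tree = "T1"  # hypothetical reference topology

cred = credible_set(sample)
true_tree_in_set = true_tree in cred
true_tree_posterior = sample.count(true_tree) / len(sample)

print(cred, len(cred), true_tree_in_set, true_tree_posterior)
```

The three printed summaries correspond to the three columns of the table: whether the true tree falls in the set, the size of the set, and the posterior frequency of the true tree.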
Conclusions
Validation of the implementation in Beast
Model comparison
• Fit the data
• Uncorrelated vs Autocorrelated: prior knowledge
• Calibrations
• Estimate of rates
• Disagree in the multiple locus approach
• Reconstruct the tree topology
THANKS