Estimating phylogenetic trees from discrete morphological data
Thesis DefenseApril M. WrightApril 14, 2015
Supervising Professor: David M. Hillis1
2
Bayesian Analysis Using a Simple Likelihood Model Outperforms Parsimony for Estimation of Phylogeny from Discrete Morphological Data
Modeling character change heterogeneity through the use of priors
Use of an Automated Method for Partitioning Morphological Data
"Bayes' Theorem MMB 01" by mattbuck, PartitionFinder logo by Ainsley Seago
Why Morphology?● Estimates put > 99% of biota that has ever existed as extinct● Probably, we won’t wring DNA from a stone
5
Why Morphology?● Estimates put > 99% of biota that has ever existed as extinct● Probably, we won’t wring DNA from a stone● Inclusion of fossils acknowledged to improve phylogenetic trees
(Huelsenbeck 1991, Wiens 2001 & 2004), divergence dating (Heath, Stadler and Huelsenbeck
2012) and comparative method estimates (Slater and Harmon 2012)
6
Image: W
ikimedia C
omm
ons
● Smaller● Selection bias
● Sequence data sets large● Take the whole sequence
Image: W
ikimedia C
omm
ons
● Smaller● Selection bias● Morphology has to be
interpreted
● Sequence data sets large● Take the whole sequence● Characters have more
clearly-defined properties
● Smaller● Selection bias● Morphology has to be
interpreted
Char. 1 Char. 2 Char. 3 Char. 4 Char. 5 Char. 6 Char. 7 Char. 8 Char. 9
Species 1 0 1 1 0 1 1 0 1 0
Species 2 1 0 1 0 0 0 1 1 0
Species 3 1 0 1 0 0 1 0 0 0
Species 4 0 0 0 1 1 0 0 0 1
Species 5 0 0 0 1 0 1 1 0 0
Image: W
ikimedia C
omm
ons
● Smaller● Morphology has to be
interpreted ● ...not so much
● DNA data sets large● Clearly-defined properties● Explicit, well-developed
models for sequence evolution
Chapter One: Bayesian Analysis Using a Simple Likelihood Model Outperforms Parsimony for Estimation of Phylogeny from Discrete Morphological Data
14
Likelihood models
● There is one practical published model for phylogenetic estimation from discrete morphological data, Mk
15
Likelihood models
● There is one practical published model for phylogenetic estimation from discrete morphological data, Mk○ Like all methods, Mk has assumptions
16
Likelihood models
● There is one practical published model for phylogenetic estimation from discrete morphological data, Mk○ Like all methods, Mk has assumptions
■ Change can occur at any instant along a branch■ Change is symmetrical between states
17
Likelihood models
● There is one practical published model for phylogenetic estimation from discrete morphological data, Mk○ Like all methods, Mk has assumptions
■ Change can occur at any instant along a branch■ Change is symmetrical between states
● This model is statistically consistent
18
Likelihood models
● There is one practical published model for phylogenetic estimation from discrete morphological data, Mk○ Like all methods, Mk has assumptions
■ Change can occur at any instant along a branch■ Change is symmetrical between states
● This model is statistically consistent○ Caveat: As long as the assumptions hold○ Also, we don't have infinite data 19
Does a parametric approach (Mk) outperform non-parametric approach
(parsimony) when model assumptions are violated?
20
● Full data set
Char. 1 Char. 2 Char. 3 Char. 4 Char. 5 Char. 6 Char. 7 Char. 8 Char. 9
Species 1(Fossil)
0 1 1 0 1 1 0 1 0
Species 2(Fossil)
1 1 1 0 0 0 1 1 0
Species 3 1 0 1 0 0 1 0 0 0
Species 4 0 0 0 1 1 0 0 0 1
Species 5 0 0 0 1 0 1 1 0 0
● Random missing data
○ Most studies have looked at random missing data
Char. 1 Char. 2 Char. 3 Char. 4 Char. 5 Char. 6 Char. 7 Char. 8 Char. 9
Species 1(Fossil)
0 1 1 ? 1 1 0 1 0
Species 2(Fossil)
1 ? 1 0 0 ? 1 ? 0
Species 3 1 0 1 ? 0 1 ? 0 0
Species 4 0 0 ? 1 1 ? 0 0 ?
Species 5 0 0 ? 1 0 1 1 0 0
● Some types of characters are more likely to be missing from fossil taxa
● Missing characters may be more likely to fall into certain rate classes
Char. 1 Char. 2 Char. 3 Char. 4 Char. 5 Char. 6 Char. 7 Char. 8 Char. 9
Species 1(Fossil)
0 1 1 0 ? ? ? ? ?
Species 2(Fossil)
1 0 1 0 ? ? ? ? ?
Species 3 1 0 1 0 0 0 1 1 0
Species 4 0 0 0 1 0 1 0 0 0
Species 5 0 0 0 1 0 1 1 0 0
Missing data
● In these conditions, some model assumptions have been violated
● Given these model violations, is it preferable to use parsimony?
28
A simulation framework
● Simulate characters
○ 350 & 1000 character data sets
● Estimate topology using the Mk model and parsimony
32
Rate heterogeneity
● Different rates of character evolution
○ Low rates of change mean each character is likely to have changed rarely, if at all
○ High rates mean there are likely reversals and parallel evolution in the data
33
Fast Medium Slow
Char. 1 Char. 2 Char. 3 Char. 4 Char. 5 Char. 6 Char. 7 Char. 8 Char. 9
Species 1(Fossil)
? ? ? 0 1 1 0 1 0
Species 2(Fossil)
? ? ? 0 0 0 1 1 0
Species 3 1 0 1 0 0 1 0 0 0
Species 4 0 0 0 1 1 0 0 0 1
Species 5 0 0 0 1 0 1 1 0 0
34
Rate heterogeneity
● Diversity of rate classes can be helpful for resolving different regions of phylogenetic trees
○ Likelihood models account for superimposed changes
36
Rate heterogeneity
● Diversity of rate classes can be helpful for resolving different regions of phylogenetic trees○ Likelihood models account for superimposed
changes○ In analysis, rate heterogeneity is often modeled as
gamma-distributed
37
Summary - Chapter One
Does a parametric approach (Mk) outperform non-parametric approach
(parsimony) when model assumptions are violated?
43
Summary - Chapter One
Does a parametric approach (Mk) outperform non-parametric approach (parsimony) when
model assumptions are violated?Yes
44
Summary - Chapter One
Caveats: We’ve really only looked at one type of model violation here
There are other reasons you might use parsimony, or might think the contrast of
likelihood and parsimony methods is telling you something interesting
45
Model Assumptions
● Change is symmetrical between states○ We know this is not always true
What if we could relax this assumption?
50
Relaxing this assumption
● In Bayesian estimation, we can put priors on the parameters in our analyses
51
Relaxing this assumption
● In Bayesian estimation, we can put priors on the parameters in our analyses
● Transition probabilities are the product of exchangeabilities and frequencies
52
Relaxing this assumption
Probability of G to T change (exchangeability)Equilibrium frequency of T
56
Relaxing this assumption
Probability of G to T change (exchangeability)Equilibrium frequency of T
.75 * 0 = 0
57
Relaxing this assumption
Probability of G to T change (exchangeability)Equilibrium frequency of T
.75 * .25 = 0.1875
58
Relaxing this assumption
Probability of G to T change (exchangeability)Equilibrium frequency of T
.75 * .25 = 0.1875
59
Empirical Datasets
● 206 datasets○ 5 to 279 taxa○ 11 to 364 characters○ Biased towards vertebrates
68
Empirical Datasets
● 206 datasets○ 5 to 279 taxa○ 11 to 364 characters○ Biased towards vertebrates
● Modeled character change asymmetry according to the 6 distributions
69
Empirical Datasets
Which priors best match empirical data?Does using the best-fit prior matter to
phylogenetic inference?
71
Simulations
● Simulated data according to 4 distributions○ Modeled the data according to each of the four
distributions
72
Simulations
● Simulated data according to 4 distributions○ Modeled the data according to each of the four
distributions○ One generating model, 3 misspecified models
74
Simulations
● Simulated data according to 4 distributions○ Modeled the data according to each of the four
distributions○ One generating model, 3 misspecified models
● Also simulated missing data
77
Simulations
● Estimated trees according to each of the 4 values of alpha
● Used Bayes Factor model selection to choose the best-fit value
● Used Robinson-Foulds distance to assess topological correctness
78
Simulations
Can we detect the generating value of alpha among misspecified values of alpha?
Does using the correct alpha result in a more correct tree?
79
Empirical Datasets
Which priors best match empirical data?About half: best fit is the α = ∞ prior
Strength of support for different values of α varies
84
Simulations
Does using the best-fit α result in a more correct tree?
Yes, and the importance of doing so is greater when the problem is harder
98
Chapter Two: Conclusions
● Appropriate fit of α parameter improves phylogenetic estimation
● Bayes Factor model selection performs well at choosing the best-fit value of α among a set of α values
99
Partitioning
● Refers to breaking a dataset into smaller subsets that can be analyzed under different phylogenetic models
101
Partitioning
● Refers to breaking a dataset into smaller subsets that can be analyzed under different phylogenetic models○ Well-explored in a molecular context
102
Partitioning
● Refers to breaking a dataset into smaller subsets that can be analyzed under different phylogenetic models○ Well-explored in a molecular context (Brown and Lemmon
2007)
○ Often, partition schemes are tested as a stage in model-fitting
103
Partitioning
● Less well-explored in morphology○ Clarke and Middleton (2008) is one of the few
explorations of partitioning in a likelihood context for morphology
105
Partitioning
● Less well-explored in morphology○ Clarke and Middleton (2008) is one of the few
explorations of partitioning in a likelihood context for morphology
○ Used anatomical subregion partitioning
106
Partitioning
● Less well-explored in morphology○ Clarke and Middleton (2008) is one of the few
explorations of partitioning in a likelihood context for morphology
○ Used anatomical subregion partitioning○ Found improved model fit and different topology with
partitioned data
107
Partitioning
● Less well-explored in morphology○ Clarke and Middleton (2008) is one of the few
explorations of partitioning in a likelihood context for morphology
○ Used anatomical subregion partitioning○ Found improved model fit and different topology with
partitioned data
108
PartitionFinder Morphology
● Adaptation of technology for partitioning of genome-scale information
109
PartitionFinder Morphology
● Adaptation of technology for partitioning of genome-scale information○ But we don’t have genome-scale information
110
PartitionFinder Morphology
● Estimate a phylogenetic tree from the unpartitioned data matrix
● Fit parameters of the evolutionary model to the whole dataset as a single set of sites
113
PartitionFinder Morphology
● Estimate a phylogenetic tree from the unpartitioned data matrix
● Fit parameters of the evolutionary model to the whole dataset as a single set of sites
● Calculate the score of the data given this model according to an information theoretic criterion (AIC, BIC or AICc)
114
PartitionFinder Morphology
● Generate rates of evolution for each site in the dataset
● Use k-means clustering to split the subset in two based on these rates
116
PartitionFinder Morphology
● Generate rates of evolution for each site in the dataset
● Use k-means clustering to split the subset in two based on these rates
● Fit parameters of the model for these new subsets
117
PartitionFinder Morphology
● Calculate the score of this new partitioned data matrix according to the same information theoretic criterion used in step 3.
118
PartitionFinder Morphology
● Calculate the score of this new partitioned data matrix according to the same information theoretic criterion used in step 3.
● If smaller subsets are supported by this criterion, continue to divide them, repeating steps 5-7. If not, terminate the search.
119
PartitionFinder Morphology
Are partitioned models often the best-fit model for empirical datasets?
120
PartitionFinder Morphology
Are partitioned models often the best-fit model for empirical datasets?
When they are, does this make a difference to the tree estimated?
121
Modeling
● We used PartitionFinder Morphology to partition 209 datasets○ 3 criteria: AIC, BIC and AICc
2k- 2lnL -2 (lnL + 2k * n-k-1)
-2lnL + k * lnn
n
122
Modeling
● We used PartitionFinder Morphology to partition 209 datasets○ 3 criteria: AIC, BIC and AICc
Least conservative
Most conservative
123
Estimation
● Estimate trees under likelihood and Bayesian implementations of the Mk model using the partitioned data and unpartitioned data
124
PartitionFinder Morphology
Are partitioned models often the best-fit model for empirical datasets?
125
PartitionFinder Morphology
Are partitioned models often the best-fit model for empirical datasets?
Yes.
129
PartitionFinder Morphology
When a partitioned model is the best fit, does this make a difference to the tree estimated?
130
When a partitioned model is the best fit, does this make a difference to the tree estimated?
Yes, in empirical datasets we often estimate different trees.
132
When a partitioned model is the best fit, does this make a difference to the tree estimated?
Yes, and the likelihood surface is more peaked.
136
Conclusions
● Chapter One: Likelihood-based methods are effective for estimating phylogeny for morphological data, even in the presence of biased missing data
138
Conclusions
● Chapter Two: Use of a prior on equilibrium state frequencies can improve the performance of Bayesian estimation using morphological data
139
Conclusions
● Chapter Three: PartitionFinder Morphology is a promising lead for evaluation of partitioning schemes
140
Thank you!CommitteeDavid HillisMartha SmithDavid CannatellaRandy LinderBob Jansen
LabmatesBen, Emily Jane, Thomas, Patricia, Becca, Mariana, Shannon, Patrick, Chris, J9, Matt, Sandi, Carlos, Taylor, Katie, Devon, Anne, Jim, Jeremy, Tracy
OtherUT vert paleo, especially Julia, Robert, Zach
And the people who make this worth doing:You know who you are.And Jason and unnamed baby girl Wright Stinnett.
Collaborators:Graeme Lloyd, Paul Fransden, David Bapst, Nick Matzke, Matt Brandley, Rob Lanfear
141