A probabilistic parsimonious model for species tree reconstruction
![Page 1: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/1.jpg)
A probabilistic parsimonious model for species tree reconstruction
Leonardo de Oliveira Martins and David Posada
with invaluable help from Klaus Schliep and Diego Mallo
![Page 2: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/2.jpg)
What do we want
● To account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative, or maybe we don't have signal at all
● To estimate species trees given arbitrary gene families ← can contain paralogs, missing data, etc.
● To allow for several sources of disagreement ← real data can seldom be explained by just one biological phenomenon
● Fast computation ← improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless
![Page 3: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/3.jpg)
Outline
Model of gene family evolution
Parsimonious estimation of disagreement
* reconciliation
* distance between trees
Hierarchical Bayesian model
Examples
* comparing many trees
* simulation
* TreeFam data set
![Page 4: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/4.jpg)
Model for the evolution of gene families
[Figure: graphical model with nodes S; G1, G2, ..., Gn; D1, D2, ..., Dn]
![Page 5: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/5.jpg)
Model for the evolution of gene families
[Figure: graphical model with nodes S, G1 and D1]
P(G | S) ← distance between G and S
Our assumption: we just need to consider the simplest explanation for the difference between the gene and species trees
● we may use several such simple explanations
![Page 6: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/6.jpg)
● work with unrooted gene trees
● penalize gene trees very different from species tree
![Page 7: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/7.jpg)
![Page 8: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/8.jpg)
Quantifying the disagreement
[Figure: a gene tree, a species tree, and their reconciliation]
● assuming deepcoal: 1 deepcoal
● assuming duplosses: 1 dup, 3 losses
● assuming HGT: 1 event
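As a rough illustration of how a duplication count can be read off a reconciliation, here is a minimal sketch that assumes the standard LCA mapping between a rooted gene tree and a rooted species tree. The nested-tuple tree encoding, the `species_gene` leaf naming and all function names are hypothetical and not taken from the authors' software; losses are left out for brevity, and the gene tree is assumed to be already rooted (the talk works with unrooted gene trees).

```python
def clade(tree):
    """Set of leaf labels below a nested-tuple tree (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(clade(child) for child in tree))


def species_clades(species_tree):
    """Every clade (leaf set) of the species tree, including leaves and the root."""
    out = [clade(species_tree)]
    if not isinstance(species_tree, str):
        for child in species_tree:
            out.extend(species_clades(child))
    return out


def lca(species_set, clades):
    """Smallest species-tree clade that contains all the given species."""
    return min((c for c in clades if species_set <= c), key=len)


def count_duplications(gene_tree, clades, leaf_to_species=lambda g: g.split("_")[0]):
    """Count gene-tree nodes mapped to the same species-tree node as one of their children."""
    def walk(node):
        # Returns (species found below this node, duplications found below this node).
        if isinstance(node, str):
            return frozenset([leaf_to_species(node)]), 0
        results = [walk(child) for child in node]
        species_here = frozenset().union(*(s for s, _ in results))
        dups = sum(d for _, d in results)
        here = lca(species_here, clades)
        if any(lca(s, clades) == here for s, _ in results):
            dups += 1  # this node and one of its children map to the same species node
        return species_here, dups
    return walk(gene_tree)[1]


clades = species_clades((("A", "B"), "C"))
print(count_duplications((("A_1", "C_1"), "B_1"), clades))  # 1
```

A gene tree that matches the species tree gives zero duplications under this mapping, so the count behaves as a distance-like penalty.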
![Page 9: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/9.jpg)
● Stochastic error/nonparametric
![Page 10: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/10.jpg)
![Page 11: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/11.jpg)
Quantifying the disagreement – other measures
mul-tree version: Chaudhary R, Burleigh JG, Fernández-Baca D (2013) Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance. arXiv:1210.2665
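For reference, a minimal sketch of the plain Robinson-Foulds (RF) distance between two unrooted trees on the same leaf set, computed as the number of nontrivial bipartitions found in one tree but not the other. This is the ordinary single-copy version, not the mul-tree extension cited above, and the nested-tuple encoding and function names are illustrative assumptions.

```python
def leaf_set(tree):
    """Set of leaf labels of a nested-tuple tree (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Nontrivial bipartitions of the leaves, each stored as the side without the anchor leaf."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed reference leaf, so each split has one canonical side
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        side = all_leaves - below if anchor in below else below
        if 1 < len(side) < len(all_leaves) - 1:  # skip trivial splits and the root
            found.add(side)
        return below

    walk(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(t1, t2))  # 2
```

Because splits are stored relative to a fixed anchor leaf, the result does not depend on where the nested tuples happen to be rooted.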
![Page 12: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/12.jpg)
Quantifying the disagreement – other measures
de Oliveira Martins et al. (2008) Phylogenetic Detection of Recombination with a Bayesian Prior on the Distance between Trees. PLoS ONE 3(7): e2651.
![Page 13: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/13.jpg)
Quantifying the disagreement – other measures
see also: Whidden et al. (2013) Supertrees based on the subtree prune-and-regraft distance. PeerJ PrePrints 1:e18v1
![Page 14: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/14.jpg)
Quantifying the disagreement – other measures
Hdist similar to: Nye TMW, Liò P, Gilks WR (2006) A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics 22: 117-119
![Page 15: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/15.jpg)
![Page 16: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/16.jpg)
![Page 17: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/17.jpg)
![Page 18: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/18.jpg)
![Page 19: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/19.jpg)
Now we have estimates for these
● assuming deepcoal (1 deepcoal): gene tree parsimony
● assuming duplosses (1 dup, 3 losses): gene tree parsimony
● assuming HGT (1 event): (approximate) dSPR
● stochastic error/nonparametric: RF, Hdist
![Page 20: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/20.jpg)
Considering several measures of disagreement:
Thus we can incorporate e.g. duplications and losses while accounting for HGT and random errors
Easy to include other distances in the future
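One way to picture how several distances can enter a single score is sketched here. It assumes an exponential penalty, so that the unnormalized log-probability of a gene tree is minus a weighted sum of its distances to the species tree. That functional form, the penalty values and the names are assumptions made for illustration; the formula on the original slide did not survive in this transcript.

```python
def log_weight(distances, penalties):
    """Unnormalized log-probability: minus the penalty-weighted sum of the distances."""
    return -sum(penalties[name] * value for name, value in distances.items())


# Hypothetical penalties (one per distance) and two hypothetical gene trees.
penalties = {"dup": 1.0, "loss": 0.5, "spr": 2.0}
close_tree = {"dup": 0, "loss": 0, "spr": 1}
far_tree = {"dup": 1, "loss": 3, "spr": 2}

print(log_weight(close_tree, penalties))  # -2.0
print(log_weight(far_tree, penalties))    # -6.5
```

Adding another distance only requires one more key in each dictionary, which is the sense in which further measures are easy to include.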
![Page 21: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/21.jpg)
Considering several measures of disagreement:
Problem: the normalization constant
E.g.: Rodrigue N, Kleinman CL, Philippe H, Lartillot N (2009) Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons. Mol Biol Evol 26: 1663-1676.
Solution: importance sampling estimate of Z(.)
Ref.: Bryant D, Steel M (2009) Computing the Distribution of a Tree Metric. TCBB: 420 – 426
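To make the idea of an importance sampling estimate of Z(.) concrete, here is a toy sketch. It assumes Z is a sum of exp(-λ d) over all trees and uses a uniform proposal over a small, fully enumerated, hypothetical tree space, so the estimate can be checked against the exact sum. The proposal actually used by the authors is not given in the slides, and the distance counts below are made up.

```python
import math
import random


def exact_z(distances, lam):
    """Exact normalization constant: sum of exp(-lam * d) over every tree."""
    return sum(math.exp(-lam * d) for d in distances)


def estimate_z(distances, lam, n_samples, rng):
    """Uniform-proposal importance sampling: Z is about N times the mean sampled weight."""
    sample = rng.choices(distances, k=n_samples)
    mean_weight = sum(math.exp(-lam * d) for d in sample) / n_samples
    return len(distances) * mean_weight


# Hypothetical tree space: the distance to S of each of 205 trees.
tree_distances = [0] * 1 + [1] * 6 + [2] * 20 + [3] * 40 + [4] * 60 + [5] * 50 + [6] * 28

rng = random.Random(1)
print(exact_z(tree_distances, 0.5))
print(estimate_z(tree_distances, 0.5, 20000, rng))
```

In a real tree space the exact sum is not available, which is why only the sampled estimate would be used there.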
![Page 22: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/22.jpg)
![Page 23: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/23.jpg)
Distribution of gene trees: probabilistic model
[Figure: graphical model with nodes S; G1 ... Gn; D1 ... Dn; Q1 ... Qn]
![Page 24: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/24.jpg)
Distribution of gene trees: probabilistic model
[Figure: graphical model with nodes S; G1 ... Gn; D1 ... Dn; Q1 ... Qn; λdup1 ... λdupn and λdupprior]
![Page 25: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/25.jpg)
Distribution of gene trees: probabilistic model
[Figure: graphical model with nodes S; G1 ... Gn; D1 ... Dn; Q1 ... Qn; λdup1 ... λdupn and λdupprior; λloss1 ... λlossn and λlossprior; λspr1 ... λsprn and λsprprior; ...]
![Page 26: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/26.jpg)
Distribution of gene trees: probabilistic model
[Figure: graphical model with nodes S; G1 ... Gn; λdup1 ... λdupn and λdupprior; λloss1 ... λlossn and λlossprior; λspr1 ... λsprn and λsprprior; ...; with a part labelled "Importance Sampling"]
So we can use complex, state-of-the-art software for phylogenetic inference
![Page 27: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/27.jpg)
Distribution of gene trees: probabilistic model
[Figure: the same graphical model, with parts labelled "Input" and "Importance Sampling"]
![Page 28: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/28.jpg)
Distribution of gene trees: probabilistic model
[Figure: the same graphical model, with parts labelled "Output" and "Importance Sampling"]
![Page 29: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/29.jpg)
Distribution of gene trees: probabilistic model
[Figure: the same graphical model, with parts labelled "Output" and "Importance Sampling"]
We should not rely on single estimates of gene phylogenies
E.g.: Boussau B, Szollosi GJ, Duret L, Gouy M, Tannier E, Daubin V (2012) Genome-scale coestimation of species and gene trees. Genome Research 23: 323-330.
![Page 30: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/30.jpg)
![Page 31: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/31.jpg)
Example: distances between gene families
● 567 single-copy gene trees for 23 species
Data from: Salichos L, Rokas A (2013) Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497: 327–331
● Analysis under a model where only RF, Hdist and dSPR are considered
● Not interested in the data set per se (unreliable)
● Used only as a didactic example of how the model works
![Page 32: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/32.jpg)
Example: distances between gene families
[Figure: three panels labelled RF, Hdist and SPR]
![Page 33: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/33.jpg)
Example: distances between gene families
[Figure: three panels labelled RF, Hdist and SPR]
![Page 34: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/34.jpg)
Example: distances between gene families
[Figure: three panels labelled RF, Hdist and SPR; legend: posterior samples]
![Page 35: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/35.jpg)
Example: distances between gene families
[Figure: three panels labelled RF, Hdist and SPR; legend: posterior samples, best estimate]
![Page 36: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/36.jpg)
![Page 37: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/37.jpg)
Analysis of simulated data sets
Idea from: Rasmussen MD, Kellis M (2012) Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Res. 22: 755-765
We use gene trees only, and simulate tree inference error
● Fully probabilistic simulation of gene trees by Diego Mallo and David Posada
● Birth and death of new loci, conditioned on a multispecies coalescent, followed by sequence evolution
![Page 38: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/38.jpg)
Analysis of simulated data sets – results
![Page 39: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/39.jpg)
Analysis of simulated data sets – results
![Page 40: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/40.jpg)
Analysis of simulated data sets – results
![Page 41: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/41.jpg)
![Page 42: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/42.jpg)
Single copy genes from Drosophila (TreeFam)
● 4591 informative, single-copy gene families
● (TreeFam database has 14250 informative gene families)
![Page 43: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/43.jpg)
![Page 44: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/44.jpg)
Single copy genes from Drosophila (TreeFam)
● 4591 informative, single-copy gene families
Estimated species tree:
● Root location uncertain
![Page 45: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/45.jpg)
● Only one unrooted topology
![Page 46: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/46.jpg)
Large gene families from Drosophila (TreeFam)
● 43 gene families with 102 to 295 tips
![Page 47: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/47.jpg)
Large gene families from Drosophila (TreeFam)
● 43 gene families with 102 to 295 tips
Best species tree:
[Figure: species tree, annotated "~100%"]
![Page 48: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/48.jpg)
To recap, our model can
● Account for uncertainty in gene tree and species tree estimation ← some gene families may be more informative, or maybe we don't have signal at all
● Estimate species trees given arbitrary gene families ← can contain paralogs, missing data, etc.
● Allow for several sources of disagreement ← real data can seldom be explained by just one biological phenomenon
● Be fast ← improvement provided by slower, fully probabilistic methods may be elusive, and they can benefit from our output nonetheless
The larger, the better – especially for rooting the species tree
Do not assume gene trees are known – embrace ignorance!
Different gene families may be the product of distinct processes
It's parallelized, and all distances can be calculated very fast.
![Page 49: A probabilistic parsimonious model for species tree reconstruction](https://reader034.fdocuments.us/reader034/viewer/2022052619/555088dab4c9051e5b8b4bf4/html5/thumbnails/49.jpg)
Thank you!
Check out http://darwin.uvigo.es for announcements, code, slides...