Post on 03-Mar-2020
Analytical and computational challenges in coalescent-based species tree estimation
Tandy Warnow, Departments of Computer Science and Bioengineering, The University of Illinois at Urbana-Champaign, http://tandy.cs.illinois.edu (joint work with Siavash Mirarab and M.S. Bayzid)
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website, University of Arizona
Species Tree
Phylogenomics = genome-scale phylogeny estimation
Note: Jonathan Eisen coined this term, but used it to mean something else.
Main contributions
• Multiple Sequence Alignment: Methods for large-scale MSA (up to 1,000,000 sequences, including fragments): SATe, PASTA, and UPP
• Phylogenomics: Methods for multi-locus species tree estimation that are robust to gene tree incongruence due to incomplete lineage sorting (ILS) and horizontal gene transfer (HGT)
• Metagenomics: Methods for taxon identification and abundance profiling of metagenomic datasets
Concatenation
[Figure: a supermatrix for species S1 to S8 built from the alignments of gene 1, gene 2, and gene 3 placed side by side; where a species has no sequence for a gene, the block is filled with missing data ("?").]
Aligned sequences shown in the figure, in order of appearance:
TCTAATGGAA, GCTAAGGGAA, TCTAAGGGAA, TCTAACGGAA, TCTAATGGAC, TATAACGGAA
GGTAACCCTC, GCTAAACCTC, GGTGACCATC, GCTAAACCTC
TATTGATACA, TCTTGATACC, TAGTGATGCA, CATTCATACC, TAGTGATGCA
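The concatenation step pictured above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `concatenate` and the assignment of sequences to species in the toy input are assumptions made for the example.

```python
def concatenate(alignments, taxa, missing="?"):
    """Build a supermatrix: one row per taxon, gene blocks side by side."""
    rows = {taxon: [] for taxon in taxa}
    for gene in alignments:
        # Every sequence inside one gene alignment must have the same length.
        lengths = {len(seq) for seq in gene.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must be aligned")
        width = lengths.pop()
        for taxon in taxa:
            # A taxon absent from this gene is padded with missing data.
            rows[taxon].append(gene.get(taxon, missing * width))
    return {taxon: "".join(blocks) for taxon, blocks in rows.items()}


# Hypothetical toy input: which species carries which gene is made up here.
gene1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA"}
gene2 = {"S2": "GGTAACCCTC", "S3": "GCTAAACCTC"}
supermatrix = concatenate([gene1, gene2], taxa=["S1", "S2", "S3"])
```

A single tree is then estimated from the rows of `supermatrix`, which is what separates concatenation from approaches that estimate one tree per gene.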
Red gene tree ≠ species tree (green gene tree okay)
Gene Tree Incongruence
Gene trees can differ from the species tree due to:
• Duplication and loss
• Horizontal gene transfer
• Incomplete lineage sorting (ILS)
Incomplete Lineage Sorting (ILS)
1000+ papers in 2013 alone
Confounds phylogenetic analysis for many groups:
Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi
There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS.
The Multi-species Coalescent Model
Present
Past
Courtesy James Degnan
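For intuition about how the multi-species coalescent produces discordant gene trees, the standard three-species result can be computed directly. The slides do not state this formula; it is the textbook case for a rooted species tree ((A,B),C) whose internal branch has length t in coalescent units, and the function name below is an assumption for illustration.

```python
import math


def gene_tree_probs(t):
    """Gene tree topology probabilities for species tree ((A,B),C).

    t is the internal branch length in coalescent units. With probability
    exp(-t) the A and B lineages fail to coalesce on that branch, and the
    three lineages then pair up uniformly at random.
    """
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


short_branch = gene_tree_probs(0.1)
long_branch = gene_tree_probs(3.0)
```

Short internal branches push the matching topology toward probability 1/3, which is the regime where ILS makes gene trees disagree with the species tree most often.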
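Two competing approaches
[Figure: starting from the alignments for gene 1, gene 2, . . ., gene k, one path concatenates the genes (Concatenation) and the other analyzes each gene separately and combines the results with a summary method; both paths lead to a species tree.]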
Statistically consistent under MSC?
NO:
• MDC
• Greedy consensus
• Unpartitioned concatenation under maximum likelihood or maximum parsimony
• MRP (supertree method)
Unknown:
• Fully partitioned concatenation under maximum likelihood
YES:
• MP-EST (Liu et al. 2010): maximum likelihood estimation of rooted species tree – YES, but…
• BUCKy-pop (Larget et al. 2010): quartet-based Bayesian species tree estimation – YES, but…
• *BEAST and BEST (co-estimation of gene trees and species trees) – YES, but…
• SNAPP, SVDquartets (site-based analyses) – YES, but…
• STEM, STELLS, GLASS, METAL, etc. – YES, but…
Avian Phylogenomics Project
G Zhang, BGI
• Approx. 50 species, whole genomes
• 14,000 loci
MTP Gilbert, Copenhagen
S. Mirarab, UT-Austin
Md. S. Bayzid, UT-Austin
T. Warnow, UT-Austin
Plus many, many other people…
Erich Jarvis, HHMI
Challenge:
• Species tree estimation under the multi-species coalescent model from 14,000 poorly estimated gene trees, all with different topologies (we used “statistical binning”)
Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)
1kp: Thousand Transcriptome Project
Plant Tree of Life based on transcriptomes of ~800 loci and ~100 species
G. Ka-Shu Wong, U Alberta
N. Wickett, Northwestern
J. Leebens-Mack, U Georgia
N. Matasci, iPlant
T. Warnow, S. Mirarab, N. Nguyen, UT-Austin
Plus many, many other people…
Challenge:
• Gene tree incongruence suggestive of ILS, but we were unable to use MP-EST due to dataset size and many incomplete gene trees (we used ASTRAL, Mirarab et al. 2014)
Wickett, Mirarab, et al., PNAS 2014
Avian whole genomes phylogenies [Jarvis*, Mirarab*, et al., Science, 2014]
• International team of more than 100 researchers
• Whole genomes for 48 bird species (~100 million years of evolution)
• Goal: a phylogeny of major bird lineages
• Extremely challenging due to rampant gene tree incongruence
• Implications for traits such as vocal learning
• 14,000 “genes” (typically short and relatively conserved)
RESEARCH ARTICLE
Whole-genome analyses resolve early branches in the tree of life of modern birds
Erich D. Jarvis,1*† Siavash Mirarab,2* Andre J. Aberer,3 Bo Li,4,5,6 Peter Houde,7
Cai Li,4,6 Simon Y. W. Ho,8 Brant C. Faircloth,9,10 Benoit Nabholz,11
Jason T. Howard,1 Alexander Suh,12 Claudia C. Weber,12 Rute R. da Fonseca,6
Jianwen Li,4 Fang Zhang,4 Hui Li,4 Long Zhou,4 Nitish Narula,7,13 Liang Liu,14
Ganesh Ganapathy,1 Bastien Boussau,15 Md. Shamsuzzoha Bayzid,2
Volodymyr Zavidovych,1 Sankar Subramanian,16 Toni Gabaldón,17,18,19
Salvador Capella-Gutiérrez,17,18 Jaime Huerta-Cepas,17,18 Bhanu Rekepalli,20
Kasper Munch,21 Mikkel Schierup,21 Bent Lindow,6 Wesley C. Warren,22
David Ray,23,24,25 Richard E. Green,26 Michael W. Bruford,27 Xiangjiang Zhan,27,28
Andrew Dixon,29 Shengbin Li,30 Ning Li,31 Yinhua Huang,31
Elizabeth P. Derryberry,32,33 Mads Frost Bertelsen,34 Frederick H. Sheldon,33
Robb T. Brumfield,33 Claudio V. Mello,35,36 Peter V. Lovell,35 Morgan Wirthlin,35
Maria Paula Cruz Schneider,36,37 Francisco Prosdocimi,36,38 José Alfredo Samaniego,6
Amhed Missael Vargas Velazquez,6 Alonzo Alfaro-Núñez,6 Paula F. Campos,6
Bent Petersen,39 Thomas Sicheritz-Ponten,39 An Pas,40 Tom Bailey,41 Paul Scofield,42
Michael Bunce,43 David M. Lambert,16 Qi Zhou,44 Polina Perelman,45,46
Amy C. Driskell,47 Beth Shapiro,26 Zijun Xiong,4 Yongli Zeng,4 Shiping Liu,4
Zhenyu Li,4 Binghang Liu,4 Kui Wu,4 Jin Xiao,4 Xiong Yinqi,4 Qiuemei Zheng,4
Yong Zhang,4 Huanming Yang,48 Jian Wang,48 Linnea Smeds,12 Frank E. Rheindt,49
Michael Braun,50 Jon Fjeldsa,51 Ludovic Orlando,6 F. Keith Barker,52
Knud Andreas Jønsson,51,53,54 Warren Johnson,55 Klaus-Peter Koepfli,56
Stephen O’Brien,57,58 David Haussler,59 Oliver A. Ryder,60 Carsten Rahbek,51,54
Eske Willerslev,6 Gary R. Graves,51,61 Travis C. Glenn,62 John McCormack,63
Dave Burt,64 Hans Ellegren,12 Per Alström,65,66 Scott V. Edwards,67
Alexandros Stamatakis,3,68 David P. Mindell,69 Joel Cracraft,70 Edward L. Braun,71
Tandy Warnow,2,72† Wang Jun,48,73,74,75,76† M. Thomas P. Gilbert,6,43† Guojie Zhang4,77†
To better determine the history of modern birds, we performed a genome-scale phylogenetic analysis of 48 species representing all orders of Neoaves using phylogenomic methods created to handle genome-scale data. We recovered a highly resolved tree that confirms previously controversial sister or close relationships. We identified the first divergence in Neoaves, two groups we named Passerea and Columbea, representing independent lineages of diverse and convergently evolved land and water bird species. Among Passerea, we infer the common ancestor of core landbirds to have been an apex predator and confirm independent gains of vocal learning. Among Columbea, we identify pigeons and flamingoes as belonging to sister clades. Even with whole genomes, some of the earliest branches in Neoaves proved challenging to resolve, which was best explained by massive protein-coding sequence convergence and high levels of incomplete lineage sorting that occurred during a rapid radiation after the Cretaceous-Paleogene mass extinction event about 66 million years ago.
The diversification of species is not always gradual but can occur in rapid radiations, especially after major environmental changes (1, 2). Paleobiological (3–7) and molecular (8) evidence suggests that such “big bang” radiations occurred for neoavian birds (e.g., songbirds, parrots, pigeons, and others) and placental mammals, representing 95% of extant avian and mammalian species, after the Cretaceous to Paleogene (K-Pg) mass extinction event about 66 million years ago (Ma). However, other nuclear (9–12) and mitochondrial (13, 14) DNA studies propose an earlier, more gradual diversification, beginning within the Cretaceous 80 to 125 Ma. This debate is confounded by findings that different data sets (15–19) and analytical methods (20, 21) often yield contrasting species trees. Resolving such timing and phylogenetic relationships is important for comparative genomics, which can inform about human traits and diseases (22).
Recent avian studies based on fragments of 5 [~5000 base pairs (bp) (8)] and 19 [31,000 bp (17)] genes recovered some relationships inferred from morphological data (15, 23) and DNA-DNA hybridization (24), postulated new relationships, and contradicted many others. Consistent with most previous molecular and contemporary morphological studies (15), they divided modern birds (Neornithes) into Palaeognathae (tinamous and flightless ratites), Galloanseres [Galliformes (landfowl) and Anseriformes (waterfowl)], and Neoaves (all other extant birds). Within Neoaves,
1320 12 DECEMBER 2014 • VOL 346 ISSUE 6215 sciencemag.org SCIENCE
A FLOCK OF GENOMES
Gene trees on the avian dataset
[Figure: histogram of branch bootstrap support (x-axis, 0% to 100%) against percentage of branches (y-axis, 0 to 20%), with the median and mean marked.]
14,000 genes from avian genome-scale data [Jarvis*, Mirarab*, et al., Science, 2014]
A measure of confidence in estimated gene tree branches
[Figure: species tree topological error (FN, y-axis, 5% to 25%) for MP-EST against gene sequence length (x-axis: infinity (true gene trees), 1,500, 1,000, 500, 250); shorter sequences mean more gene tree error.]
Avian-like simulations (1000 genes) [Mirarab, et al., Science, 2014]
A statistically consistent summary method
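The error measure on the y-axis above, FN, is commonly defined as the fraction of internal branches (bipartitions) of the true tree that are missing from the estimated tree. The sketch below assumes that definition and a nested-tuple tree encoding chosen only for illustration; neither the encoding nor the function names come from the talk.

```python
def splits(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of leaf names."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_taxa = leaves(tree)

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_taxa - below
        if len(below) > 1 and len(other) > 1:
            # Store one canonical side so that A|B and B|A compare equal.
            found.add(min(below, other, key=sorted))
        return below

    visit(tree)
    return found


def fn_rate(true_tree, estimated_tree):
    """Fraction of true-tree bipartitions absent from the estimated tree."""
    true_splits = splits(true_tree)
    if not true_splits:
        return 0.0
    return len(true_splits - splits(estimated_tree)) / len(true_splits)


# Toy example: the estimate keeps the split ABC|DE but loses AB|CDE.
true_tree = ((("A", "B"), "C"), ("D", "E"))
estimated = ((("A", "C"), "B"), ("D", "E"))
error = fn_rate(true_tree, estimated)
```

Here one of the two internal branches of the true tree is missing from the estimate, so `error` is 0.5.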
Gene tree error matters
[Ané, et al, MBE, 2007] [Patel, et al, MBE, 2013] [Gatesy, Springer, MPE, 2014] [Mirarab, et al., Systematic Biology, 2014]
The individual gene sequence alignments in the avian datasets have poor phylogenetic signal, and result in poorly estimated gene trees.
Species trees obtained by combining poorly estimated gene trees have poor accuracy.
There are no theoretical guarantees for standard summary methods except for perfectly correct gene trees.
COMMON PHYLOGENOMICS PROBLEM: many poor gene trees
See: S. Roch and T. Warnow. “On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods”, Systematic Biology, 64(4):663-676, 2015
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website, University of Arizona
Species tree estimation: difficult, even for small datasets!
Idea: combine best aspects of concatenation and summary methods
• Concatenation (fully partitioned) works fine when the concatenated data evolve under identical (or very similar) trees
• Some pairs of genes are not discordant (at least in topology)
• Concatenate “combinable” sets of genes into “supergenes” to increase the phylogenetic signal
• But how do we know which genes are combinable if we cannot estimate them correctly?
IMPORTANT: Supergene trees are computed using fully partitioned maximum likelihood.
Theorem: Weighted statistical binning is statistically consistent under the MSC.
Theorem: Unweighted statistical binning is not statistically consistent under the MSC.
Proofs in Bayzid, Mirarab, and Warnow, PLOS One, 2015. See also discussion in Warnow, PLOS Currents: Tree of Life 2015.
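The binning idea can be sketched as a graph problem: treat two genes as not combinable when their well-supported branches conflict, then group genes so that no bin contains a conflicting pair. The code below is a simplified illustration under stated assumptions, not the authors' implementation: the input format, the function names, and the plain greedy grouping heuristic are all choices made for this example.

```python
from itertools import combinations


def compatible(split_a, split_b, taxa):
    """Two bipartitions are compatible if some pair of their sides is disjoint."""
    a_sides = (split_a, taxa - split_a)
    b_sides = (split_b, taxa - split_b)
    return any(not (x & y) for x in a_sides for y in b_sides)


def bin_genes(supported_splits, taxa):
    """Greedily group genes whose well-supported splits never conflict.

    supported_splits maps a gene name to the set of bipartitions (each a
    frozenset holding one side) whose bootstrap support met a chosen threshold.
    """
    genes = sorted(supported_splits)
    conflicts = {g: set() for g in genes}
    for g, h in combinations(genes, 2):
        if any(not compatible(a, b, taxa)
               for a in supported_splits[g] for b in supported_splits[h]):
            conflicts[g].add(h)
            conflicts[h].add(g)
    bins = []
    # Place the most-conflicted genes first, each into the smallest bin that
    # holds none of its conflicting genes, which keeps bin sizes balanced.
    for g in sorted(genes, key=lambda name: -len(conflicts[name])):
        candidates = [b for b in bins if not (conflicts[g] & set(b))]
        if candidates:
            min(candidates, key=len).append(g)
        else:
            bins.append([g])
    return bins


# Hypothetical input: supported splits for four genes on taxa A to E.
taxa = frozenset("ABCDE")
example = {
    "g1": {frozenset("AB")},
    "g2": {frozenset("AB"), frozenset("DE")},
    "g3": {frozenset("AC")},  # conflicts with the AB split of g1 and g2
    "g4": set(),              # no supported branches, so no conflicts
}
bins = bin_genes(example, taxa)
```

Each bin would then be concatenated into a supergene and analyzed with fully partitioned maximum likelihood, as the slide states. In the weighted variant, as described in the cited PLOS One paper, each supergene tree counts once per gene in its bin when passed to the summary method.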
Statistical binning vs. unbinned
Binning produces bins with approximately 5 to 7 genes each
Datasets: 11-taxon strongILS datasets with 50 genes, Chung and Ané, Systematic Biology
[Figure: bar chart of average FN rate (y-axis, 0 to 0.25) for MP-EST, MDC*(75), MRP, MRL, and GC, comparing Unbinned with Statistical-75.]
[Figure: two avian species trees side by side, binned MP-EST (branch support given as unweighted/weighted) and unbinned MP-EST. Labeled clades include Cursores, Columbea, Otidimorphae, and Australaves; an annotation marks conflict with other lines of strong evidence.]
Comparing Binned and Unbinned MP-EST on the Avian Dataset
Unbinned MP-EST strongly rejects Columbea, a major finding by Jarvis, Mirarab, et al., Science 2014.
Summary so far
Standard coalescent-based methods (such as MP-EST) have poor accuracy in the presence of gene tree error.
Statistical binning improves the estimation of gene tree distributions, and so:
• Improves species tree estimation
• Improves species tree branch lengths
• Reduces incidence of strongly supported false positive branches
See Mirarab et al. Science 2014 and Bayzid et al. PLOS One 2015
1KP: Plant whole transcriptomes [Wickett*, Mirarab*, et al., PNAS, 2014]
Phylotranscriptomic analysis of the origin and early diversification of land plants
Norman J. Wickett [a,b,1,2], Siavash Mirarab [c,1], Nam Nguyen [c], Tandy Warnow [c], Eric Carpenter [d], Naim Matasci [e,f], Saravanaraj Ayyampalayam [g], Michael S. Barker [f], J. Gordon Burleigh [h], Matthew A. Gitzendanner [h,i], Brad R. Ruhfel [h,j,k], Eric Wafula [l], Joshua P. Der [l], Sean W. Graham [m], Sarah Mathews [n], Michael Melkonian [o], Douglas E. Soltis [h,i,k], Pamela S. Soltis [h,i,k], Nicholas W. Miles [k], Carl J. Rothfels [p,q], Lisa Pokorny [p,r], A. Jonathan Shaw [p], Lisa DeGironimo [s], Dennis W. Stevenson [s], Barbara Surek [o], Juan Carlos Villarreal [t], Béatrice Roure [u], Hervé Philippe [u,v], Claude W. dePamphilis [l], Tao Chen [w], Michael K. Deyholos [d], Regina S. Baucom [x], Toni M. Kutchan [y], Megan M. Augustin [y], Jun Wang [z], Yong Zhang [v], Zhijian Tian [z], Zhixiang Yan [z], Xiaolei Wu [z], Xiao Sun [z], Gane Ka-Shu Wong [d,z,aa,2], and James Leebens-Mack [g,2]
[a] Chicago Botanic Garden, Glencoe, IL 60022; [b] Program in Biological Sciences, Northwestern University, Evanston, IL 60208; [c] Department of Computer Science, University of Texas, Austin, TX 78712; [d] Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada T6G 2E9; [e] iPlant Collaborative, Tucson, AZ 85721; [f] Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721; [g] Department of Plant Biology, University of Georgia, Athens, GA 30602; [h] Department of Biology and [i] Genetics Institute, University of Florida, Gainesville, FL 32611; [j] Department of Biological Sciences, Eastern Kentucky University, Richmond, KY 40475; [k] Florida Museum of Natural History, Gainesville, FL 32611; [l] Department of Biology, Pennsylvania State University, University Park, PA 16803; [m] Department of Botany and [q] Department of Zoology, University of British Columbia, Vancouver, BC, Canada V6T 1Z4; [n] Arnold Arboretum of Harvard University, Cambridge, MA 02138; [o] Botanical Institute, Universität zu Köln, Cologne D-50674, Germany; [p] Department of Biology, Duke University, Durham, NC 27708; [r] Department of Biodiversity and Conservation, Real Jardín Botanico-Consejo Superior de Investigaciones Cientificas, 28014 Madrid, Spain; [s] New York Botanical Garden, Bronx, NY 10458; [t] Department fur Biologie, Systematische Botanik und Mykologie, Ludwig-Maximilians-Universitat, 80638 Munich, Germany; [u] Département de Biochimie, Centre Robert-Cedergren, Université de Montréal, Succursale Centre-Ville, Montreal, QC, Canada H3C 3J7; [v] CNRS, Station d’Ecologie Experimentale du CNRS, Moulis, 09200, France; [w] Shenzhen Fairy Lake Botanical Garden, The Chinese Academy of Sciences, Shenzhen, Guangdong 518004, China; [x] Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109; [y] Donald Danforth Plant Science Center, St. Louis, MO 63132; [z] BGI-Shenzhen, Bei shan Industrial Zone, Yantian District, Shenzhen 518083, China; and [aa] Department of Medicine, University of Alberta, Edmonton, AB, Canada T6G 2E1
Edited by Paul O. Lewis, University of Connecticut, Storrs, CT, and accepted by the Editorial Board September 29, 2014 (received for review December 23, 2013)
Reconstructing the origin and evolution of land plants and their algal relatives is a fundamental problem in plant phylogenetics, and is essential for understanding how critical adaptations arose, including the embryo, vascular tissue, seeds, and flowers. Despite advances in molecular systematics, some hypotheses of relationships remain weakly resolved. Inferring deep phylogenies with bouts of rapid diversification can be problematic; however, genome-scale data should significantly increase the number of informative characters for analyses. Recent phylogenomic reconstructions focused on the major divergences of plants have resulted in promising but inconsistent results. One limitation is sparse taxon sampling, likely resulting from the difficulty and cost of data generation. To address this limitation, transcriptome data for 92 streptophyte taxa were generated and analyzed along with 11 published plant genome sequences. Phylogenetic reconstructions were conducted using up to 852 nuclear genes and 1,701,170 aligned sites. Sixty-nine analyses were performed to test the robustness of phylogenetic inferences to permutations of the data matrix or to phylogenetic method, including supermatrix, supertree, and coalescent-based approaches, maximum-likelihood and Bayesian methods, partitioned and unpartitioned analyses, and amino acid versus DNA alignments. Among other results, we find robust support for a sister-group relationship between land plants and one group of streptophyte green algae, the Zygnematophyceae. Strong and robust support for a clade comprising liverworts and mosses is inconsistent with a widely accepted view of early land plant evolution, and suggests that phylogenetic hypotheses used to understand the evolution of fundamental plant traits should be reevaluated.
land plants | Streptophyta | phylogeny | phylogenomics | transcriptome
The origin of embryophytes (land plants) in the Ordovicianperiod roughly 480 Mya (1–4) marks one of the most im-
portant events in the evolution of life on Earth. The early evo-lution of embryophytes in terrestrial environments was facilitatedby numerous innovations, including parental protection for thedeveloping embryo, sperm and egg production in multicellularprotective structures, and an alternation of phases (often referred toas generations) in which a diploid sporophytic life history stagegives rise to a multicellular haploid gametophytic phase. With
Significance
Early branching events in the diversification of land plants andclosely related algal lineages remain fundamental and un-resolved questions in plant evolutionary biology. Accuratereconstructions of these relationships are critical for testing hy-potheses of character evolution: for example, the origins of theembryo, vascular tissue, seeds, and flowers. We investigatedrelationships among streptophyte algae and land plants usingthe largest set of nuclear genes that has been applied to thisproblem to date. Hypothesized relationships were rigorouslytested through a series of analyses to assess systematic errors inphylogenetic inference caused by sampling artifacts and modelmisspecification. Results support some generally accepted phy-logenetic hypotheses, while rejecting others. This work providesa new framework for studies of land plant evolution.
Author contributions: N.J.W., S. Mirarab, T.W., S.W.G., M.M., D.E.S., P.S.S., D.W.S., M.K.D., J.W., G.K.-S.W., and J.L.-M. designed research; N.J.W., S. Mirarab, N.N., T.W., E.C., N.M., S.A., M.S.B., J.G.B., M.A.G., B.R.R., E.W., J.P.D., S.W.G., S. Mathews, M.M., D.E.S., P.S.S., N.W.M., C.J.R., L.P., A.J.S., L.D., D.W.S., B.S., J.C.V., B.R., H.P., C.W.d., T.C., M.K.D., M.M.A., J.W., Y.Z., Z.T., Z.Y., X.W., X.S., G.K.-S.W., and J.L.-M. performed research; S. Mirarab, N.N., T.W., N.M., S.A., M.S.B., J.G.B., M.A.G., E.W., J.P.D., S.W.G., S. Mathews, M.M., D.E.S., P.S.S., N.W.M., C.J.R., L.P., A.J.S., L.D., D.W.S., B.S., J.C.V., H.P., C.W.d., T.C., M.K.D., R.S.B., T.M.K., M.M.A., J.W., Y.Z., G.K.-S.W., and J.L.-M. contributed new reagents/analytic tools; N.J.W., S. Mirarab, N.N., E.C., N.M., S.A., M.S.B., J.G.B., M.A.G., B.R.R., E.W., B.R., H.P., and J.L.-M. analyzed data; N.J.W., S. Mirarab, T.W., S.W.G., M.M., D.E.S., D.W.S., H.P., G.K.-S.W., and J.L.-M. wrote the paper; and N.M. archived data.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission. P.O.L. is a guest editor invited by the Editorial Board.
Freely available online through the PNAS open access option.
Data deposition: The sequences reported in this paper have been deposited in the iplant datastore database, mirrors.iplantcollaborative.org/onekp_pilot, and the National Center for Biotechnology Information Sequence Read Archive, www.ncbi.nlm.nih.gov/sra [accession no. PRJEB4921 (ERP004258)].
¹ N.J.W. and S. Mirarab contributed equally to this work.
² To whom correspondence may be addressed. Email: nwickett@chicagobotanic.org, gane@ualberta.ca, or jleebensmack@plantbio.uga.edu.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1323926111/-/DCSupplemental.
www.pnas.org/cgi/doi/10.1073/pnas.1323926111 PNAS Early Edition | 1 of 10
• Whole transcriptomes for 103 plant species
• 1,200 in the next phase
• 400-800 single copy “genes”
• Spans ~1 billion years of evolution
• Many unanswered questions about plant evolution
Summary methods on the 1KP data (103 plants)
[Figure: distribution of branch bootstrap support (x-axis, 0% to 100%) against percentage of branches (y-axis, 0% to 8%); legend: median, mean]
• Existing summary methods produced species trees with low support and unbelievable relationships
• ... despite having gene trees with relatively high bootstrap support
• Our simulation studies showed that the reason had to do with the number of taxa
400 genes from 1KP data [Wickett*, Mirarab*, et al., PNAS, 2014]
The problem size (# species) matters too!
1000 simulated genes, “medium” levels of ILS [Mirarab and Warnow, ISMB, 2015]
1KP: Thousand Transcriptome Project
• 1200 plant transcriptomes
• More than 13,000 gene families (most not single copy)
• Gene sequence alignments and trees computed using SATe (Liu et al., Science 2009 and Systematic Biology 2012)
G. Ka-Shu Wong (U Alberta)
N. Wickett (Northwestern)
J. Leebens-Mack (U Georgia)
N. Matasci (iPlant)
T. Warnow, S. Mirarab, N. Nguyen, Md. S. Bayzid (all UT-Austin)
MP-EST could not be used: the dataset was too large, and the requirement that all gene trees be rooted correctly was also a problem. We used ASTRAL to estimate a coalescent-based species tree (1KP paper by Wickett, Mirarab et al., PNAS 2014).
Plus many other people…
ASTRAL’s approach
• Input: set of unrooted gene trees T1, T2, …, Tk and set X of bipartitions on species set S
• Output: Tree T* maximizing the total quartet-similarity score to the unrooted gene trees, subject to Bipartitions(T*) drawn from X
Theorem: ASTRAL is statistically consistent under the multi-species coalescent model, and runs in polynomial time.
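The objective named on this slide can be illustrated with a small brute-force sketch. It only evaluates the quartet-similarity score of one candidate tree against a set of gene trees; it is not ASTRAL's algorithm and does not model the constraint set X. Trees are written as nested Python tuples, and the function names are hypothetical choices for this illustration.

```python
from itertools import combinations

def leaf_sets(tree):
    """Return (all taxa, leaf set below every internal node) for a nested-tuple tree."""
    found = []

    def walk(node):
        if isinstance(node, str):          # a leaf is a taxon name
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    return walk(tree), found

def quartet_topology(clades, quartet):
    """Return the 2|2 split the tree induces on four taxa, or None if unresolved."""
    four = frozenset(quartet)
    for clade in clades:
        side = clade & four
        if len(side) == 2:                 # this edge separates two taxa from the other two
            return frozenset([side, four - side])
    return None

def quartet_score(candidate, gene_trees):
    """Count, over all gene trees, the quartets resolved identically in the candidate."""
    cand_taxa, cand_clades = leaf_sets(candidate)
    total = 0
    for gene_tree in gene_trees:
        gene_taxa, gene_clades = leaf_sets(gene_tree)
        for quartet in combinations(sorted(cand_taxa & gene_taxa), 4):
            induced = quartet_topology(gene_clades, quartet)
            if induced is not None and induced == quartet_topology(cand_clades, quartet):
                total += 1
    return total

candidate = ((("A", "B"), "C"), ("D", "E"))
genes = [((("A", "B"), "C"), ("D", "E")), ((("A", "C"), "B"), ("D", "E"))]
print(quartet_score(candidate, genes))     # 5 quartets from the first gene tree + 3 from the second
```

Enumerating every four-taxon subset costs on the order of n^4 per gene tree, so this is useful only for checking small examples by hand.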
ASTRAL vs. MP-EST and Concatenation, 200 genes, 500bp
Mammalian Simulation Study, Varying ILS level
[Figure: missing branch rate (y-axis, 0.00 to 0.15) for MP-EST, ASTRAL, and Concatenation-ML at 0.2X, 0.5X, 1X, 2X, and 5X (x-axis, annotated "Less ILS")]
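The y-axis metric can be made concrete with a short sketch. It assumes the usual reading of "missing branch rate" as the fraction of non-trivial bipartitions of the true tree that are absent from the estimated tree, and it assumes both trees are nested tuples over the same taxon set; the function names are illustrative.

```python
def leaf_set(node):
    """All taxon names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))

def bipartitions(tree):
    """Non-trivial bipartitions, each stored as the side containing a fixed anchor taxon."""
    taxa = leaf_set(tree)
    anchor = min(taxa)
    splits = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        if 2 <= len(clade) <= len(taxa) - 2:       # skip leaf edges and the root
            splits.add(clade if anchor in clade else taxa - clade)
        return clade

    walk(tree)
    return splits

def missing_branch_rate(true_tree, estimated_tree):
    """Fraction of true-tree bipartitions that the estimated tree fails to recover."""
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    missing = true_splits - bipartitions(estimated_tree)
    return len(missing) / len(true_splits)

true_tree = ((("A", "B"), "C"), ("D", "E"))
estimate = ((("A", "C"), "B"), ("D", "E"))
print(missing_branch_rate(true_tree, estimate))    # one of the two internal branches is missing
```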
Two coalescent-based analyses of the Song et al. mammals dataset
ASTRAL on plants dataset
• The ASTRAL tree:
• High support
• Similar to concatenation with some interesting differences (e.g., recovered bryophytes)
• ASTRAL took only about 10 minutes (serial running time) on 103 taxa and 400 genes
estimated using CAT+GTR+Gamma models on nucleotide (first and second codon positions) and amino acid alignments suggests that this model may still be too simple for concatenated alignments relative to the true gene coalescence and substitution processes (see also ref. 72). The placement of hornworts and a moss+liverwort clade as successively sister to vascular plants is consistent with analyses based on morphological and developmental characters (78, 79), including dextral sperm in hornworts rather than sinistral sperm, as in all other land plants, and the retention of the pyrenoid, a plastid structure that is the site of RUBISCO localization, shared by hornworts and streptophytic algae (reviewed in ref. 80). The possibility that some of these trait mappings are the product of evolutionary convergence should also be considered, and seems likely in the case of the pyrenoid (81). The significance of other morphological similarities is also not yet clear. For example, the development of gametangia in hornworts resembles antheridial (44, 82) and archegonial (82) development in monilophytes, whereas those of the liverworts and mosses are autapomorphic, suggesting a closer relationship between hornworts and vascular plants. A comparison can also be made with respect to the development of the embryo and the young sporophyte. The hornwort embryo and sporophyte have no apical growth at any stage, but rather exhibit an intercalary meristem. In contrast, mosses and monilophytes have apical growth on both ends of the sporophyte, although basal apical growth is ephemeral in the former. The possibility of multiple origins of the multicellular sporophyte in land plants can therefore be considered (83): once with intercalary growth, as in the hornworts, and once with apical growth, as in mosses and tracheophytes (liverworts have neither intercalary nor apical growth). Ultimately, this finding underscores the difficulty in placing hornworts, or bryophytes in general, within the phylogeny of land plants based on current evidence from morphology alone.
In summary, three primary hypotheses emerge from our analyses with respect to the resolution of the earliest branching events in land plant phylogeny: (i) (hornworts, ((liverworts, mosses), vascular plants)) supported in most ML analyses of nucleotide and amino acid supermatrices; (ii) [(liverworts, mosses), (hornworts, vascular plants)], supported by the PhyloBayes analysis of amino acids; and (iii) [(hornworts, [mosses, liverworts]), vascular plants], supported by supertree and ASTRAL analyses of amino acids and first and second codon positions and some amino acid supermatrix analyses. However, we cannot dismiss alternative hypotheses recovered by some of our analyses, including [mosses (liverworts [hornworts, vascular plants])], which is supported by the PhyloBayes analysis of first and second codon positions (Fig. 4). Caution should be taken in rejecting any of these hypotheses given the sparse sampling, especially for the hornworts.
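The four topologies listed in this paragraph can be compared mechanically. The sketch below transcribes them as nested tuples, reading each bracketed expression as a rooted grouping exactly as printed, and asks which groupings appear as clades. The dictionary keys and helper names are illustrative only.

```python
HYPOTHESES = {
    "i": ("hornworts", (("liverworts", "mosses"), "vascular plants")),
    "ii": (("liverworts", "mosses"), ("hornworts", "vascular plants")),
    "iii": (("hornworts", ("mosses", "liverworts")), "vascular plants"),
    "alternative": ("mosses", ("liverworts", ("hornworts", "vascular plants"))),
}

def clades(tree):
    """Set of leaf sets found below every internal node of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        found.add(clade)
        return clade

    walk(tree)
    return found

def supports(tree, group):
    """True if the given taxa form a clade in the tree as written."""
    return frozenset(group) in clades(tree)

for name, tree in HYPOTHESES.items():
    print(
        name,
        "mosses+liverworts:", supports(tree, ["mosses", "liverworts"]),
        "bryophytes monophyletic:", supports(tree, ["hornworts", "mosses", "liverworts"]),
    )
```

Under this reading, the moss+liverwort grouping appears in hypotheses (i) to (iii) but not in the alternative, and only (iii) groups all three bryophyte lineages together.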
Monilophyte and Lycophyte Relationships. Phylogenetic analyses of multigene (generally plastid) datasets (84–87) have consistently resolved the lycophytes and monilophytes as successive sister lineages to the seed plants, with the euphyllophytes comprising the seed-free monilophytes (ferns) and seed-bearing spermatophytes. Aside from the clearly artifactual placement of Selaginella as sister to all other land plants in analyses including third codon positions (mirrors.iplantcollaborative.org/onekp_pilot), our results support this branching order (Figs. 2 and 3; other species trees at mirrors.iplantcollaborative.org/onekp_pilot). The placement of Selaginella has been problematic in previous analyses (49) and we interpret its misplacement in several analyses here as a consequence of GC content at the third codon position, which is more similar to streptophyte algae than to embryophytes (Fig. S2).
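The quantity invoked here, GC content at the third codon position, is simple to compute. The sketch below assumes an in-frame coding sequence that starts at the first base of a codon and ignores gaps and ambiguity codes; the function name and the toy sequence are illustrative and are not taken from the paper's data.

```python
def gc3_fraction(coding_sequence):
    """Fraction of G or C among unambiguous bases at third codon positions."""
    third_positions = coding_sequence.upper()[2::3]      # bases 3, 6, 9, ...
    called = [base for base in third_positions if base in "ACGT"]
    if not called:
        return None                                      # nothing to measure
    return sum(base in "GC" for base in called) / len(called)

# Codons ATG GCC AAA TTT have third-position bases G, C, A, T.
print(gc3_fraction("ATGGCCAAATTT"))
```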
[Fig. 4 graphic: a grid of hypotheses (rows) by analyses (columns); the cell shading is not recoverable from this transcript.
Column headers: Matrix type (ASTRAL, Supermatrix); Alignment (AA, DNA to AA, DNA); Codon positions (1 and 2, all, NA). Individual column labels combine the filtering and analysis codes defined in the caption (e.g., untrim.unpart, 50genes50sites.part, 604genes.trimExt.Bayes.CATGTR).
Rows:
Sister to land plants: Zygnematophyceae-sister; Charales-sister; Coleochaetales-sister
Bryophytes: Mosses + liverworts; Bryophytes monophyletic; Hornworts-sister; Hornworts-basal; Liverworts-basal
Gymnosperms: Gnepine; Conifers monophyletic; Gnetifer; Gnetales-sister
Angiosperms: Eudicots + magnoliids; Eudicots + mag/Chlor; Magnoliids + Chloranthales; Mag + Chlor, monocots; Monocots + eudicots
ANA-grade angiosperms: Amborella + Nuphar; Amborella-sister
Legend: Strong Support; Weak Support; Compatible (Weak Rejection); Strong Rejection]
Fig. 4. Summary of support for hypotheses of land plant relationships across 52 supermatrix and coalescent-based analyses including permutations of the full data matrix (Table S2). Occupancy-based gene trimming was carried out by removing genes for which >50% of the full taxon set were not included in the alignment. Site trimming removed columns in the alignment for which >50% of the full taxon set were represented by gap characters. Long-branch trimming was performed on gene trees when a terminal branch was 25 times longer than the median branch length. More stringent, BLAST-based removal of sequences identified as possible contaminants resulted in a set of 604 gene families (see SI Materials and Methods for stringent filtering strategy). Supermatrix analyses were done with and without partitioning of genes into model parameter classes. Filtering and analysis strategies are indicated below each column and include combinations of: (i) untrim: untrimmed/unfiltered data; (ii) unpart: no data partitions; (iii) 50genes: occupancy-based gene trimming at 50%; (iv) 50sites: occupancy-based site trimming at 50%; (v) gamma: full Gamma (vs. PSR approximation of Gamma); (vi) Chara: gene trimming to exclude genes not present in Chara vulgaris; (vii) 25X: long branch trimming; (viii) 33taxa: sequences with more than 66% gaps in gene alignments removed; (ix) 604genes.trimEXT: aggressive BLAST-based and long-branch filtering of sequence assemblies for each taxon followed by GBLOCKS filtering of sites (SI Materials and Methods). Strong support refers to bootstrap values above 75% for a clade containing the specified taxa. All trees and alignments are available in iPlant’s data store (mirrors.iplantcollaborative.org/onekp_pilot).
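The three filtering rules described in the caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline. It assumes an alignment is a dict from taxon name to aligned sequence, that taxa absent from a gene count as gaps in every column, and that the median in the long-branch rule is taken over all branch lengths supplied by the caller. All function and parameter names are hypothetical.

```python
from statistics import median

def keep_gene(alignment, full_taxon_count, max_missing=0.5):
    """Occupancy-based gene trimming: drop a gene missing more than half the taxa."""
    missing = full_taxon_count - len(alignment)
    return missing / full_taxon_count <= max_missing

def trim_sites(alignment, full_taxon_count, max_gap=0.5, gap_chars="-?"):
    """Occupancy-based site trimming: drop columns that are gaps in more than half the taxa."""
    taxa = list(alignment)
    length = len(alignment[taxa[0]])
    kept = []
    for col in range(length):
        present = sum(alignment[taxon][col] not in gap_chars for taxon in taxa)
        gapped = full_taxon_count - present        # assumption: absent taxa count as gaps
        if gapped / full_taxon_count <= max_gap:
            kept.append(col)
    return {taxon: "".join(alignment[taxon][col] for col in kept) for taxon in taxa}

def long_terminal_branches(terminal_lengths, all_lengths, factor=25):
    """Long-branch trimming: flag taxa whose terminal branch exceeds factor x the median."""
    cutoff = factor * median(all_lengths)
    return [taxon for taxon, length in terminal_lengths.items() if length > cutoff]

gene = {"A": "AC-G", "B": "A--G", "C": "-C-G"}      # three of four taxa present
print(keep_gene(gene, full_taxon_count=4))
print(trim_sites(gene, full_taxon_count=4))
print(long_terminal_branches({"x": 0.1, "y": 5.0}, [0.1, 0.1, 0.1, 0.2, 5.0]))
```

In the toy gene, only the third column is dropped, because every sampled taxon has a gap there.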
estimated using CAT+GTR+Gamma models on nucleotide (firstand second codon positions) and amino acid alignments suggeststhat this model may still be too simple for concatenated alignmentsrelative to the true gene coalescence and substitution processes (seealso ref. 72). The placement of hornworts and a moss+liverwortclade as successively sister to vascular plants is consistent withanalyses based on morphological and developmental characters (78,79), including dextral sperm in hornworts rather than sinistralsperm, as in all other land plants, and the retention of the pyrenoid,a plastid structure that is the site of RUBISCO localization, sharedby hornworts and streptophytic algae (reviewed in ref. 80). Thepossibility that some of these trait mappings are the product ofevolutionary convergence should also be considered, and seemslikely in the case of the pyrenoid (81). The significance of othermorphological similarities is also not yet clear. For example, thedevelopment of gametangia in hornworts resembles antheridial (44,82) and archegonial (82) development in monilophytes, whereasthose of the liverworts and mosses are autapomorphic, suggestinga closer relationship between hornworts and vascular plants. Acomparison can also be made with respect to the development ofthe embryo and the young sporophyte. The hornwort embryo andsporophyte have no apical growth at any stage, but rather exhibit anintercalary meristem. In contrast, mosses and monilophytes haveapical growth on both ends of the sporophyte, although basal apicalgrowth is ephemeral in the former. The possibility of multiple ori-gins of the multicellular sporophyte in land plants can therefore beconsidered (83): once with intercalary growth, as in the hornworts,and once with apical growth, as in mosses and tracheophytes (liv-erworts have neither intercalary nor apical growth). 
Ultimately, thisfinding underscores the difficulty in placing hornworts—or bryo-phytes in general—within the phylogeny of land plants based oncurrent evidence from morphology alone.
In summary, three primary hypotheses emerge from ouranalyses with respect to the resolution of the earliest branchingevents in land plant phylogeny: (i) (hornworts, ((liverworts,mosses), vascular plants)) supported in most ML analyses of nu-cleotide and amino acid supermatrices; (ii) [(liverworts, mosses),(hornworts, vascular plants)], supported by the PhyloBayes anal-ysis of amino acids; and (iii) [(hornworts, [mosses, liverworts]),vascular plants], supported by supertree and ASTRAL analyses ofamino acids and first and second codon positions and some aminoacid supermatrix analyses. However, we cannot dismiss alternativehypotheses recovered by some of our analyses, including [mosses(liverworts [hornworts, vascular plants])], which is supported bythe PhyloBayes analysis of first and second codon positions (Fig.4). Caution should be taken in rejecting any of these hypothesesgiven the sparse sampling, especially for the hornworts.
Monilophyte and Lycophyte Relationships. Phylogenetic analyses of multigene (generally plastid) datasets (84–87) have consistently resolved the lycophytes and monilophytes as successive sister lineages to the seed plants, with the euphyllophytes comprising the seed-free monilophytes (ferns) and seed-bearing spermatophytes. Aside from the clearly artifactual placement of Selaginella as sister to all other land plants in analyses including third codon positions (mirrors.iplantcollaborative.org/onekp_pilot), our results support this branching order (Figs. 2 and 3; other species trees at mirrors.iplantcollaborative.org/onekp_pilot). The placement of Selaginella has been problematic in previous analyses (49) and we interpret its misplacement in several analyses here as a consequence of GC content at the third codon position, which is more similar to streptophyte algae than to embryophytes (Fig. S2).
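GC content at the third codon position is a simple per-sequence statistic. As a rough illustration only (the function name `gc3` and the toy sequence are assumptions, and the paper's own pipeline is not shown here), it can be computed from an in-frame coding sequence as follows.

```python
def gc3(cds):
    """Fraction of G or C among third codon positions of an in-frame coding sequence.

    Gaps and ambiguous characters at third positions are ignored.
    """
    thirds = [base for base in cds.upper()[2::3] if base in "ACGT"]
    if not thirds:
        raise ValueError("no unambiguous third-position bases")
    return sum(base in "GC" for base in thirds) / len(thirds)

# Toy sequence with codons ATG GCA AAG TTT: third positions are G, A, G, T.
example = gc3("ATGGCAAAGTTT")
```

Comparing this value across taxa is one way to see the kind of compositional similarity between Selaginella and streptophyte algae that the paragraph above describes.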
[Figure 4 graphic: grid showing support for each hypothesis (rows) in each analysis (columns). Column annotations: matrix type (ASTRAL; Supermatrix), alignment (AA; DNA to AA; DNA), codon positions (1 and 2; all; NA). Row groups: Sister to land plants (Zygnematophyceae-sister; Charales-sister; Coleochaetales-sister); Bryophytes (Mosses + liverworts; Bryophytes monophyletic; Hornworts-sister; Hornworts-basal; Liverworts-basal); Gymnosperms (Gnepine; Conifers monophyletic; Gnetifer; Gnetales-sister); Angiosperms (Eudicots + magnoliids; Eudicots + mag/Chlor; Magnoliids + Chloranthales; Mag + Chlor, monocots; Monocots + eudicots); ANA-grade angiosperms (Amborella + Nuphar; Amborella-sister). Legend: Strong Support; Weak Support; Compatible (Weak Rejection); Strong Rejection.]
Fig. 4. Summary of support for hypotheses of land plant relationships across 52 supermatrix and coalescent-based analyses including permutations of the full data matrix (Table S2). Occupancy-based gene trimming was carried out by removing genes for which >50% of the full taxon set were not included in the alignment. Site trimming removed columns in the alignment for which >50% of the full taxon set were represented by gap characters. Long-branch trimming was performed on gene trees when a terminal branch was 25 times longer than the median branch length. More stringent, BLAST-based removal of sequences identified as possible contaminants resulted in a set of 604 gene families (see SI Materials and Methods for stringent filtering strategy). Supermatrix analyses were done with and without partitioning of genes into model parameter classes. Filtering and analysis strategies are indicated below each column and include combinations of: (i) untrim: untrimmed/unfiltered data; (ii) unpart: no data partitions; (iii) 50genes: occupancy-based gene trimming at 50%; (iv) 50sites: occupancy-based site trimming at 50%; (v) gamma: full Gamma (vs. PSR approximation of Gamma); (vi) Chara: gene trimming to exclude genes not present in Chara vulgaris; (vii) 25X: long branch trimming; (viii) 33taxa: sequences with more than 66% gaps in gene alignments removed; (ix) 604genes.trimEXT: aggressive BLAST-based and long-branch filtering of sequence assemblies for each taxon followed by GBLOCKS filtering of sites (SI Materials and Methods). Strong support refers to bootstrap values above 75% for a clade containing the specified taxa. All trees and alignments are available in iPlant’s data store (mirrors.iplantcollaborative.org/onekp_pilot).
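The three trimming rules in the caption (50genes, 50sites, 25X) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: an alignment is assumed to be a dict mapping taxon name to an aligned string, taxa absent from an alignment are counted as gaps for site trimming, and all function and parameter names are hypothetical.

```python
from statistics import median

def keep_gene(alignment, n_taxa_total, max_missing=0.5):
    """Occupancy-based gene trimming: keep a gene unless more than
    max_missing of the full taxon set is absent from its alignment."""
    missing = n_taxa_total - len(alignment)
    return missing / n_taxa_total <= max_missing

def trim_sites(alignment, n_taxa_total, max_gap=0.5, gap_chars="-?"):
    """Occupancy-based site trimming: drop columns in which more than
    max_gap of the full taxon set is a gap (absent taxa count as gaps)."""
    seqs = list(alignment.values())
    absent = n_taxa_total - len(seqs)
    keep = []
    for col in range(len(seqs[0])):
        gaps = absent + sum(1 for s in seqs if s[col] in gap_chars)
        if gaps / n_taxa_total <= max_gap:
            keep.append(col)
    return {taxon: "".join(seq[c] for c in keep) for taxon, seq in alignment.items()}

def long_terminal_branches(terminal_lengths, all_lengths, factor=25):
    """Long-branch trimming: return taxa whose terminal branch is more than
    factor times the median branch length of the gene tree."""
    cutoff = factor * median(all_lengths)
    return [taxon for taxon, length in terminal_lengths.items() if length > cutoff]

# Toy alignment: the third column is a gap in 3 of 4 taxa and is removed.
toy = {"A": "AC-T", "B": "A--T", "C": "G--T", "D": "GCAT"}
trimmed = trim_sites(toy, n_taxa_total=4)
```

Whether a column at exactly 50% gaps is kept or dropped is a boundary choice made here for the sketch; the caption only states that columns with >50% gaps were removed.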
6 of 10 | www.pnas.org/cgi/doi/10.1073/pnas.1323926111 Wickett et al.
ASTRAL vs. Concatenation-ML
[Wickett*, Mirarab*, et al., PNAS, 2014]
Running time when varying the number of species
1000 genes, “medium” levels of recent ILS
[Plot: running time (hours) versus number of species (10, 50, 100, 200, 500, 1000) for ASTRAL-II, NJst, and MP-EST]
1kp: Thousand Transcriptome Project
G. Ka-Shu Wong U Alberta
N. Wickett Northwestern
J. Leebens-Mack U Georgia
N. Matasci iPlant
T. Warnow UIUC
S. Mirarab UT-Austin
N. Nguyen UT-Austin
Plus many many other people…
Upcoming Challenges (~1200 species, ~400 loci):
• Species tree estimation under the multi-species coalescent from hundreds of conflicting gene trees on >1000 species; we will use ASTRAL-II (Mirarab and Warnow, 2015)
• Multiple sequence alignment of >100,000 sequences (with lots of fragments!) – we will use UPP (Nguyen et al., Genome Biology, 2015)
ASTRAL-I on biological datasets
• 1KP: 103 plant species, 400-800 genes
• Yang, et al. 96 Caryophyllales species, 1122 genes
• Dentinger, et al. 39 mushroom species, 208 genes
• Giarla and Esselstyn. 19 Philippine shrew species, 1112 genes
• Laumer, et al. 40 flatworm species, 516 genes
• Grover, et al. 8 cotton species, 52 genes
• Hosner, Braun, and Kimball. 28 quail species, 11 genes
• Simmons and Gatesy. 47 angiosperm species, 310 genes
Phylotranscriptomic analysis of the origin and early diversification of land plants
Norman J. Wickett(a,b,1,2), Siavash Mirarab(c,1), Nam Nguyen(c), Tandy Warnow(c), Eric Carpenter(d), Naim Matasci(e,f), Saravanaraj Ayyampalayam(g), Michael S. Barker(f), J. Gordon Burleigh(h), Matthew A. Gitzendanner(h,i), Brad R. Ruhfel(h,j,k), Eric Wafula(l), Joshua P. Der(l), Sean W. Graham(m), Sarah Mathews(n), Michael Melkonian(o), Douglas E. Soltis(h,i,k), Pamela S. Soltis(h,i,k), Nicholas W. Miles(k), Carl J. Rothfels(p,q), Lisa Pokorny(p,r), A. Jonathan Shaw(p), Lisa DeGironimo(s), Dennis W. Stevenson(s), Barbara Surek(o), Juan Carlos Villarreal(t), Béatrice Roure(u), Hervé Philippe(u,v), Claude W. dePamphilis(l), Tao Chen(w), Michael K. Deyholos(d), Regina S. Baucom(x), Toni M. Kutchan(y), Megan M. Augustin(y), Jun Wang(z), Yong Zhang(v), Zhijian Tian(z), Zhixiang Yan(z), Xiaolei Wu(z), Xiao Sun(z), Gane Ka-Shu Wong(d,z,aa,2), and James Leebens-Mack(g,2)
(a) Chicago Botanic Garden, Glencoe, IL 60022; (b) Program in Biological Sciences, Northwestern University, Evanston, IL 60208; (c) Department of Computer Science, University of Texas, Austin, TX 78712; (d) Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada T6G 2E9; (e) iPlant Collaborative, Tucson, AZ 85721; (f) Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721; (g) Department of Plant Biology, University of Georgia, Athens, GA 30602; (h) Department of Biology and (i) Genetics Institute, University of Florida, Gainesville, FL 32611; (j) Department of Biological Sciences, Eastern Kentucky University, Richmond, KY 40475; (k) Florida Museum of Natural History, Gainesville, FL 32611; (l) Department of Biology, Pennsylvania State University, University Park, PA 16803; (m) Department of Botany and (q) Department of Zoology, University of British Columbia, Vancouver, BC, Canada V6T 1Z4; (n) Arnold Arboretum of Harvard University, Cambridge, MA 02138; (o) Botanical Institute, Universität zu Köln, Cologne D-50674, Germany; (p) Department of Biology, Duke University, Durham, NC 27708; (r) Department of Biodiversity and Conservation, Real Jardín Botanico-Consejo Superior de Investigaciones Cientificas, 28014 Madrid, Spain; (s) New York Botanical Garden, Bronx, NY 10458; (t) Department fur Biologie, Systematische Botanik und Mykologie, Ludwig-Maximilians-Universitat, 80638 Munich, Germany; (u) Département de Biochimie, Centre Robert-Cedergren, Université de Montréal, Succursale Centre-Ville, Montreal, QC, Canada H3C 3J7; (v) CNRS, Station d’ Ecologie Experimentale du CNRS, Moulis, 09200, France; (w) Shenzhen Fairy Lake Botanical Garden, The Chinese Academy of Sciences, Shenzhen, Guangdong 518004, China; (x) Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109; (y) Donald Danforth Plant Science Center, St. Louis, MO 63132; (z) BGI-Shenzhen, Bei shan Industrial Zone, Yantian District, Shenzhen 518083, China; and (aa) Department of Medicine, University of Alberta, Edmonton, AB, Canada T6G 2E1
Edited by Paul O. Lewis, University of Connecticut, Storrs, CT, and accepted by the Editorial Board September 29, 2014 (received for review December 23, 2013)
Reconstructing the origin and evolution of land plants and their algal relatives is a fundamental problem in plant phylogenetics, and is essential for understanding how critical adaptations arose, including the embryo, vascular tissue, seeds, and flowers. Despite advances in molecular systematics, some hypotheses of relationships remain weakly resolved. Inferring deep phylogenies with bouts of rapid diversification can be problematic; however, genome-scale data should significantly increase the number of informative characters for analyses. Recent phylogenomic reconstructions focused on the major divergences of plants have resulted in promising but inconsistent results. One limitation is sparse taxon sampling, likely resulting from the difficulty and cost of data generation. To address this limitation, transcriptome data for 92 streptophyte taxa were generated and analyzed along with 11 published plant genome sequences. Phylogenetic reconstructions were conducted using up to 852 nuclear genes and 1,701,170 aligned sites. Sixty-nine analyses were performed to test the robustness of phylogenetic inferences to permutations of the data matrix or to phylogenetic method, including supermatrix, supertree, and coalescent-based approaches, maximum-likelihood and Bayesian methods, partitioned and unpartitioned analyses, and amino acid versus DNA alignments. Among other results, we find robust support for a sister-group relationship between land plants and one group of streptophyte green algae, the Zygnematophyceae. Strong and robust support for a clade comprising liverworts and mosses is inconsistent with a widely accepted view of early land plant evolution, and suggests that phylogenetic hypotheses used to understand the evolution of fundamental plant traits should be reevaluated.
land plants | Streptophyta | phylogeny | phylogenomics | transcriptome
The origin of embryophytes (land plants) in the Ordovician period roughly 480 Mya (1–4) marks one of the most important events in the evolution of life on Earth. The early evolution of embryophytes in terrestrial environments was facilitated by numerous innovations, including parental protection for the developing embryo, sperm and egg production in multicellular protective structures, and an alternation of phases (often referred to as generations) in which a diploid sporophytic life history stage gives rise to a multicellular haploid gametophytic phase. With
Significance
Early branching events in the diversification of land plants and closely related algal lineages remain fundamental and unresolved questions in plant evolutionary biology. Accurate reconstructions of these relationships are critical for testing hypotheses of character evolution: for example, the origins of the embryo, vascular tissue, seeds, and flowers. We investigated relationships among streptophyte algae and land plants using the largest set of nuclear genes that has been applied to this problem to date. Hypothesized relationships were rigorously tested through a series of analyses to assess systematic errors in phylogenetic inference caused by sampling artifacts and model misspecification. Results support some generally accepted phylogenetic hypotheses, while rejecting others. This work provides a new framework for studies of land plant evolution.
Author contributions: N.J.W., S. Mirarab, T.W., S.W.G., M.M., D.E.S., P.S.S., D.W.S., M.K.D., J.W., G.K.-S.W., and J.L.-M. designed research; N.J.W., S. Mirarab, N.N., T.W., E.C., N.M., S.A., M.S.B., J.G.B., M.A.G., B.R.R., E.W., J.P.D., S.W.G., S. Mathews, M.M., D.E.S., P.S.S., N.W.M., C.J.R., L.P., A.J.S., L.D., D.W.S., B.S., J.C.V., B.R., H.P., C.W.d., T.C., M.K.D., M.M.A., J.W., Y.Z., Z.T., Z.Y., X.W., X.S., G.K.-S.W., and J.L.-M. performed research; S. Mirarab, N.N., T.W., N.M., S.A., M.S.B., J.G.B., M.A.G., E.W., J.P.D., S.W.G., S. Mathews, M.M., D.E.S., P.S.S., N.W.M., C.J.R., L.P., A.J.S., L.D., D.W.S., B.S., J.C.V., H.P., C.W.d., T.C., M.K.D., R.S.B., T.M.K., M.M.A., J.W., Y.Z., G.K.-S.W., and J.L.-M. contributed new reagents/analytic tools; N.J.W., S. Mirarab, N.N., E.C., N.M., S.A., M.S.B., J.G.B., M.A.G., B.R.R., E.W., B.R., H.P., and J.L.-M. analyzed data; N.J.W., S. Mirarab, T.W., S.W.G., M.M., D.E.S., D.W.S., H.P., G.K.-S.W., and J.L.-M. wrote the paper; and N.M. archived data.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission. P.O.L. is a guest editor invited by the Editorial Board.
Freely available online through the PNAS open access option.
Data deposition: The sequences reported in this paper have been deposited in the iplant datastore database, mirrors.iplantcollaborative.org/onekp_pilot, and the National Center for Biotechnology Information Sequence Read Archive, www.ncbi.nlm.nih.gov/sra [accession no. PRJEB4921 (ERP004258)].
1 N.J.W. and S. Mirarab contributed equally to this work.
2 To whom correspondence may be addressed. Email: nwickett@chicagobotanic.org, gane@ualberta.ca, or jleebensmack@plantbio.uga.edu.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1323926111/-/DCSupplemental.
www.pnas.org/cgi/doi/10.1073/pnas.1323926111 PNAS Early Edition | 1 of 10
ASTRAL-II on biological datasets (ongoing collaborations)
• 1200 plants with ~ 400 genes (1KP consortium)
• 250 avian species with 2000 genes (with LSU, UF, and Smithsonian)
• 200 avian species with whole genomes (with Genome 10K, international)
• 250 suboscine species (birds) with ~2000 genes (with LSU and Tulane)
• 140 insect species with 1400 genes (with U. Illinois at Urbana-Champaign)
• 50 hummingbird species with 2000 genes (with U. Copenhagen and Smithsonian)
• 40 raptor species (birds) with 10,000 genes (with U. Copenhagen and Berkeley)
• 38 mammalian species with 10,000 genes (with U. of Bristol, Cambridge, and Nat. Univ. of Ireland)
Summary
• Gene tree discord due to ILS is a common challenge in species tree estimation.
• Most of the first generation of coalescent-based methods are statistically consistent (i.e., accurate in the presence of large amounts of perfect data), but are insufficiently accurate under some biologically realistic conditions (especially with large numbers of species).
• New methods have been developed that can analyze very large datasets (thousands of loci and taxa) with improved accuracy compared to previous methods.
• Yet, all methods have theoretical and/or practical limitations. New methods are needed, and this is an active research area.
• Concatenation is often a reasonable approach, despite not being statistically consistent.
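As a small illustration of the first point, the sketch below computes how often a gene tree is expected to disagree with a rooted three-taxon species tree ((A,B),C) under the multispecies coalescent. It uses the standard result that each of the two discordant topologies has probability (1/3)e^(-t), where t is the internal branch length in coalescent units. The function name and the example branch lengths are illustrative assumptions and do not come from the talk.

```python
import math


def three_taxon_gene_tree_probs(t):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    t is the internal branch length in coalescent units (t >= 0).
    Hypothetical helper, for illustration only.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # Probability that the A and B lineages fail to coalesce on the internal branch.
    no_coalescence = math.exp(-t)
    # If they fail, the three topologies are equally likely in the ancestral population.
    discordant_each = no_coalescence / 3.0
    concordant = 1.0 - 2.0 * discordant_each
    return {
        "((A,B),C)": concordant,
        "((A,C),B)": discordant_each,
        "((B,C),A)": discordant_each,
    }


if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0, 2.0):
        probs = three_taxon_gene_tree_probs(t)
        discord = 1.0 - probs["((A,B),C)"]
        print(f"t = {t}: P(gene tree differs from species tree) = {discord:.3f}")
```

As t approaches 0 the discordance probability approaches 2/3, which is the short-branch regime where ILS makes species tree estimation hardest.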
Papers and Software
• M.S. Bayzid and T. Warnow. "Naive binning improves phylogenomic analyses". Bioinformatics 2013 29 (18): 2277-2284
• S. Mirarab, R. Reaz, Md. S. Bayzid, T. Zimmermann, M.S. Swenson, and T. Warnow. "ASTRAL: Genome-Scale Coalescent-Based Species Tree Estimation". Bioinformatics 2014 30 (17): i541-i548
• Md S. Bayzid, T. Hunt, and T. Warnow. "Disk Covering Methods Improve Phylogenomic Analyses". BMC Genomics 2014, 15(Suppl 6): S7
• T. Zimmermann, S. Mirarab and T. Warnow. "BBCA: Improving the scalability of *BEAST using random binning". BMC Genomics 2014, 15(Suppl 6): S11
• S. Mirarab, Md S. Bayzid, and T. Warnow. "Evaluating summary methods for multi-locus species tree estimation in the presence of incomplete lineage sorting". Systematic Biology, doi: 10.1093/sysbio/syu063
• S. Mirarab, Md. S. Bayzid, B. Boussau, and T. Warnow. "Statistical binning enables an accurate coalescent-based estimation of the avian tree". Science, 12 December 2014: 1250463
• M. S. Bayzid, S. Mirarab, B. Boussau, and T. Warnow. "Weighted Statistical Binning: enabling statistically consistent genome-scale phylogenetic analyses". PLOS One, 2015, DOI: 10.1371/journal.pone.0129183
• S. Mirarab and T. Warnow. "ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes". Proceedings ISMB 2015, and Bioinformatics 2015 31 (12): i44-i52
• S. Roch and T. Warnow. "On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods". Systematic Biology, 64(4): 663-676, 2015
• T. Warnow. "Concatenation analyses in the presence of incomplete lineage sorting". PLOS Currents: Tree of Life 2015
• R. Davidson, P. Vachaspati, S. Mirarab, and T. Warnow. "Phylogenomic species tree estimation in the presence of incomplete lineage sorting and horizontal gene transfer". In press, BMC Genomics, 2015
• J. Chou, A. Gupta, S. Yaduvanshi, R. Davidson, M. Nute, S. Mirarab and T. Warnow. "A comparative study of SVDquartets and other coalescent-based species tree estimation methods". In press, BMC Genomics, 2015
• P. Vachaspati and T. Warnow. "ASTRID: Accurate Species TRees from Internode Distances". In press, BMC Genomics, 2015

Open source software available at GitHub
Papers available at http://tandy.cs.illinois.edu/papers.html
Acknowledgments
PhD students: Siavash Mirarab* and Md. S. Bayzid**
Funding: Guggenheim Foundation, NSF, David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), and Grainger Foundation (professorship).
TACC and UTCS computational resources
* Supported by HHMI Predoctoral Fellowship
** Supported by Fulbright Foundation