Molecular Phylogenetics EEOB 563
Phylogeny (from Greek phylon, tribe, and genesis, origin)
• The term was introduced by E. Haeckel in the second half of the 19th century and now has two somewhat different meanings.
• (1) Phylogeny in the wide sense is the historical development of organisms.
• (2) Phylogeny in the narrow sense covers not all aspects of historical development, but only the succession of branching events in a genealogical (i.e. phylogenetic) tree.
• It is usually represented by a phylogenetic tree.
What is a phylogenetic tree?
• A tree is a mathematical structure used to model the actual evolutionary history of a group of sequences or organisms.
• The actual pattern of historical relationships is the evolutionary tree, which we try to estimate.
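To make the idea of a tree as a mathematical structure concrete, here is a minimal sketch in Python. The `Node` class, the helper functions and the four-taxon example are illustrative assumptions, not part of the lecture; the output uses the widely used Newick parenthesis notation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a rooted tree; leaves carry a taxon name (hypothetical class)."""
    name: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaves(node: Node) -> List[str]:
    """Return the leaf names below `node`, left to right."""
    if node.is_leaf():
        return [node.name]
    names: List[str] = []
    for child in node.children:
        names.extend(leaves(child))
    return names


def to_newick(node: Node) -> str:
    """Serialise the subtree rooted at `node` in Newick notation (topology only)."""
    if node.is_leaf():
        return node.name
    return "(" + ",".join(to_newick(c) for c in node.children) + ")"


# Hypothetical four-taxon tree: A and B are sisters, as are C and D.
tree = Node(children=[
    Node(children=[Node("A"), Node("B")]),
    Node(children=[Node("C"), Node("D")]),
])

print(to_newick(tree) + ";")  # ((A,B),(C,D));
print(leaves(tree))           # ['A', 'B', 'C', 'D']
```

Only the branching order is stored here; branch lengths and support values would be extra attributes on each node.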
Darwin’s letter to Thomas Huxley (1857)
“The time will come I believe, though I shall not live to see it, when we shall have fairly true genealogical (phylogenetic) trees of each great kingdom of nature”
Dawkins (2003), A Devil’s Chaplain
“… there is, after all, one true tree of life […]. It exists. It is in principle knowable. We don’t know it all yet. By 2050 we should –or if we do not, we shall have been defeated only at the terminal twigs, by the sheer number of species.”
The AToL initiative (Assembling the Tree of Life) is a large research effort sponsored by the National Science Foundation. Its goal is to reconstruct the evolutionary origins of all living things.
which offered tantalising glimpses of their potential for enabling interactions between researchers (Figure 7b) [44,45]. However, the release by Apple of first the iPhone and subsequently the iPad have made touch interfaces mainstream. Not only does this mean that touch-screen devices are now widely available, but there also is a consistent vocabulary for how users can interact with these devices, using gestures such as ‘tap’, ‘swipe’ and ‘pinch and zoom’. Phylogenetics software developers have yet to exploit fully the possibilities of these devices. Interacting
Figure 3. Folding a tree. To save space, the subtree shaded in (a) is collapsed and drawn as a smaller triangle (b). The choice of which nodes to collapse can be automated, or left as a task for the user.
Figure 4. Phylogenies in three dimensions. (a) Hyperbolic view of the National Center for Biotechnology Information taxonomy [58]. (b) Google Earth visualisation of the Hawaiian endemic katydid genus Banza based on the phylogeny from [58]. (c) Stacked representation of a gene tree with multiple gene duplications [32]. Labels in the figure: Species (X), Evolutionary distance (Y), Paralogs (Z); root, edge, node, gene; Archaea, Eukaryota.
Review Trends in Ecology and Evolution February 2012, Vol. 27, No. 2
with focus+context tools, such as Dendroscope [46] and Tree Juxtaposer [42], using a desktop computer with a mouse is rather clumsy, whereas a touch screen would provide a more natural way to apply the spatial distortions these tools use to visualise very big trees.
Phyloinformatics
There is a long tradition of annotating phylogenies by colouring in branches, as popularised by the program MacClade [47]. However, much of this annotation has been local; that is, only data contained within a single file are mapped on the tree (typically the data used to create the tree). A bigger challenge is annotating phylogenies with data on genomics, geographic distribution, ecology and phenotype. Pioneering efforts in this direction include TaxonTree [9], a stand-alone tool that runs on desktop computers, and the web-based iToL [48]. As an increasing amount of biodiversity data acquires digital identifiers that can be resolved [49], one can look forward to phylogeny viewers that automatically aggregate annotations from multiple data sources and display these to the user, as well as enabling the user to query that information [50].
Perhaps one can draw a lesson here from the success of Google Earth, which has become a ubiquitous tool for visualising geographic data, in large part because of the ease of creating the Keyhole Markup Language (KML) files used by that program. This has enabled third parties, including evolutionary biologists, to create innovative visualisations rich in biological data. This suggests an obvious way forward for the phylogenetic community,
Figure 7. Interacting with phylogenies. (a) Displaying a phylogeny using multiple monitors. (b) Interacting with a visualisation using a touch screen.
Figure 6. Alternative visualisations of uncertainty in trees. (a) Two and a half dimensional visualisation of a series of trees where neighbouring trees show minor topological changes. (b) DensiTree visualisation of variation in estimates of branch length among a set of trees. (c) Phylogenetic network showing two conflicting signals for a set of four taxa (A, B, C, D).
Figure 5. Reconciled trees and tanglegrams. (a) In a reconciled tree, one tree (such as a gene tree) is embedded inside another tree; for example, the phylogeny of the species from which the genes were obtained. (b) Trees for different, associated entities, such as genes and species, or hosts and parasites, can also be depicted using a tanglegram.
Special Issue: Ecological and evolutionary informatics
Space, time, form: viewing the Tree of Life
Roderic D.M. Page
Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, G12 8QQ, UK
There are numerous ways to display a phylogenetic tree, which is reflected in the diversity of software tools available to phylogeneticists. Displaying very large trees continues to be a challenge, made ever harder as increasing computing power enables researchers to construct ever-larger trees. At the same time, computing technology is enabling novel visualisations, ranging from geophylogenies embedded on digital globes to touch-screen interfaces that enable greater interaction with evolutionary trees. In this review, I survey recent developments in phylogenetic visualisation, highlighting successful (and less successful) approaches and sketching some future directions.
Visualising trees
Visualising phylogenies is one of the fundamental tasks of evolutionary analysis. Reviews of the field [1,2] list a growing number of tree viewers, some of which, such as NJPlot [3] and TreeView [4], have been in use for over a decade. A quick glance at Felsenstein’s list of phylogeny programs (http://evolution.genetics.washington.edu/phylip/software.html#Plotting) reveals viewers for just about every conceivable operating system, written in a wide range of computer programming languages. Given this diversity of tools that all provide essentially the same functionality, it would be tempting to conclude that the basic problem of displaying an evolutionary tree has been solved. Yet, it was striking that all the entries in the iEvoBio 2010 visualisation challenge were tree viewers (Figure 1). This suggests that although the niche of tree viewer is crowded, biologists working with trees are still searching for tools to help them visualise phylogenies. The goal of this review is to survey some recent developments in phylogeny visualisation, with an eye to future directions.
Trees are relatively simple structures that place few restrictions on how they can be depicted, apart from preserving the connections between the nodes in the tree. This lack of constraints has led to a proliferation of ways to visualise trees, many of which are striking (for a visual survey, see http://treevis.net). Conversely, this freedom means that the interpretation of a tree diagram might not always be obvious to the person viewing it [5] [Green, D. and Shapley, R. (2005) Teaching with a visual tree of life; http://groups.ischool.berkeley.edu/TOL/], especially distinguishing which aspects of the diagram are providing information, and which largely reflect artistic license (Box 1).
Although the most common representation of a phylogeny is a two-dimensional (2D) Euclidean drawing [1], an increasingly diverse range of visualisations are emerging (Figure 2). Typically phylogenies are drawn as trees; however, authors have experimented with treemaps [6], which lay out a tree as a set of nested rectangles (Figure 2). Treemaps are perhaps best suited for classifications rather than phylogenies, although Arvelakis et al. [7] recently used treemaps to display phylogenies with over 2000 species.
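The "nested rectangles" idea can be sketched with the simple slice-and-dice treemap layout: each clade gets a rectangle, which is divided among its children in proportion to their leaf counts, alternating the split direction at each level. This is an illustrative sketch only, not the layout used by the tools cited above; the nested-tuple tree encoding and the function names are assumptions.

```python
def count_leaves(tree):
    """A tree is either a leaf name (str) or a tuple of subtrees."""
    if isinstance(tree, str):
        return 1
    return sum(count_leaves(child) for child in tree)


def treemap(tree, x, y, w, h, split_x=True, out=None):
    """Slice-and-dice layout: return {leaf name: (x, y, width, height)}."""
    if out is None:
        out = {}
    if isinstance(tree, str):
        out[tree] = (x, y, w, h)
        return out
    total = count_leaves(tree)
    used = 0  # leaves already placed within this rectangle
    for child in tree:
        n = count_leaves(child)
        if split_x:   # cut the rectangle into vertical strips
            treemap(child, x + w * used / total, y, w * n / total, h, False, out)
        else:         # cut the rectangle into horizontal strips
            treemap(child, x, y + h * used / total, w, h * n / total, True, out)
        used += n
    return out


# Hypothetical three-taxon tree laid out in a 6 x 4 canvas.
rects = treemap((("A", "B"), "C"), 0.0, 0.0, 6.0, 4.0)
for name, rect in rects.items():
    print(name, rect)
```

Every leaf ends up with the same area, and the rectangle of a clade is the union of the rectangles of its descendants, which is why the containment relationships carry the tree structure.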
Euclidean geometry is reassuringly familiar, but it becomes difficult to accommodate very large trees within the confines of the printed page or a computer screen. One approach is to ‘fold’ or collapse nodes to save space (Figure 3). Several methods, such as degree of interest (DOI) trees [8], space trees [9] and expand-ahead browsers [10], exploit the natural hierarchy of rooted trees to compress the tree into a smaller display area. The choice of which nodes in a tree to collapse can be made by the user, or the process can be automated [11,12].
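As a rough illustration of automated folding, the sketch below collapses every clade that lies at or below a chosen depth and replaces it with a placeholder recording how many leaves it hides. The depth rule is an assumption picked for simplicity; it is not the criterion used by the methods cited in [8-12], and the nested-tuple encoding and names are hypothetical.

```python
def count_leaves(tree):
    """A tree is either a leaf name (str) or a tuple of subtrees."""
    if isinstance(tree, str):
        return 1
    return sum(count_leaves(child) for child in tree)


def collapse_below(tree, max_depth, depth=0):
    """Replace each internal node at depth >= max_depth with a summary label."""
    if isinstance(tree, str):
        return tree  # leaves are always kept as they are
    if depth >= max_depth:
        return f"<{count_leaves(tree)} leaves>"
    return tuple(collapse_below(child, max_depth, depth + 1) for child in tree)


# Hypothetical seven-taxon tree.
tree = ((("A", "B"), "C"), (("D", "E"), ("F", "G")))

print(collapse_below(tree, 1))  # ('<3 leaves>', '<4 leaves>')
print(collapse_below(tree, 2))  # (('<2 leaves>', 'C'), ('<2 leaves>', '<2 leaves>'))
```

A user-driven version would take the node to collapse as an argument instead of a depth; the placeholder plays the role of the triangle in Figure 3.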
An alternative approach to saving space is to keep the tree unchanged, but instead distort the space in which the tree is being displayed, the best-known examples being hyperbolic viewers (Figure 2) [13,14]. Although capable of producing some stunning images (e.g. Figure 4a), these tools have gained little traction among users. In practice, users find them hard to navigate, and hyperbolic viewers in particular are best suited to classifications, which tend to be shallow (few nodes along the path from any tip to the base or root of the tree) and frequently have internal nodes of high degree (many immediate descendants). By contrast, a fully resolved phylogeny may be deep (in a tree with n leaves, there may be a path from leaf to root with n–1 nodes) and binary (each node having only two immediate descendants); consequently, phylogenies rarely look good in hyperbolic viewers.
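The contrast between shallow, high-degree classifications and deep, binary phylogenies can be checked on two extreme shapes. The sketch below builds a fully unbalanced ("caterpillar") binary tree and a flat star tree on the same number of leaves and measures both properties; the tree encoding and function names are assumptions made for illustration.

```python
def caterpillar(n):
    """Fully unbalanced binary tree on n >= 2 leaves, as nested tuples."""
    tree = ("t1", "t2")
    for i in range(3, n + 1):
        tree = (tree, f"t{i}")
    return tree


def height(tree):
    """Number of internal nodes on the longest leaf-to-root path."""
    if isinstance(tree, str):
        return 0
    return 1 + max(height(child) for child in tree)


def max_degree(tree):
    """Largest number of immediate descendants of any node."""
    if isinstance(tree, str):
        return 0
    return max(len(tree), max(max_degree(child) for child in tree))


n = 8
deep = caterpillar(n)                           # phylogeny-like extreme
flat = tuple(f"t{i}" for i in range(1, n + 1))  # classification-like extreme

print(height(deep), max_degree(deep))  # 7 2
print(height(flat), max_degree(flat))  # 1 8
```

The caterpillar tree reaches the n–1 bound quoted in the text, while the star tree has a single internal node of degree n.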
Some three-dimensional (3D) phylogeny viewers have forgone trying to truly display a phylogeny in three dimensions, and instead use the third dimension to provide a ‘fly through’ experience over a 2D tree, such as Paloverde [15] and the Wellcome Trust Tree of Life (http://www.wellcometreeoflife.org/). Although perhaps less disorientating than hyperbolic viewers, it is not clear that this provides a better way to navigate through a tree compared with a simple 2D visualisation. Although the case for 3D
Corresponding author: Page, R.D.M. ([email protected])
0169-5347/$ – see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.tree.2011.12.002 Trends in Ecology and Evolution, February 2012, Vol. 27, No. 2, p. 113
a | The phylogeny shows the distributions of new Drosophila spp. genes involved in development [46] (above) and in the brain [76] (below) in various evolutionary stages within the past 36 million years [81].
Why molecular phylogenetics?
• “The stream of heredity makes phylogeny: in a sense, it is phylogeny. Complete genetic analysis would provide the most priceless data for the mapping of this stream.” George G. Simpson, 1945
• “I do not fully understand why we are not proclaiming the message from the housetops ... We finally have a method that can sort homology from analogy.” Stephen J. Gould, 1985
Linus Pauling
• “We may ask the question where in the now living systems the greatest amount of information of their past history has survived and how it can be extracted”
• “Best fit are the different types of macromolecules (sequences) which carry the genetic information”
Molecules as documents of evolutionary history
Applications of Phylogenetic Analysis
• Systematics and classification
• Discovering new life forms
• Phylogeography and speciation
• Molecular evolution
• Genomics
• Epidemiology and forensics
• Biotechnology
• Agriculture
• Conservation
The Tree of Life: Benefits to Society through Phylogenetic Research
Phylogenetic analysis is playing a major role in discovering and identifying new life forms that could yield many new benefits for human health and biotechnology. Many microorganisms, including bacteria and fungi, cannot be cultivated and studied directly in the laboratory, thus the principal road to discovery is to isolate their DNA from samples collected from marine or freshwater environments or from soils. The DNA samples are then sequenced and compared in phylogenetic analyses with the sequences of previously described organisms. This has led to major new discoveries.
For several decades microbiologists have been searching for new bacteria in extreme environments such as hot springs or marine hydrothermal vents. The thermal springs of Yellowstone National Park have yielded a host of new and important bacterial species, many of which were identified using phylogenetic analysis of DNA sequences.
The most famous bacterium from Yellowstone is Thermus aquaticus. An enzyme derived from this species, Taq DNA polymerase, powers a process called the polymerase chain reaction (PCR), which is used in thousands of laboratories to make large amounts of DNA for sequencing. This discovery led to the creation of a major new biotechnological industry and has revolutionized medical diagnostics, forensics, and other biological sciences. Many microorganisms in extreme environments may yield innovative products for biotechnology.
Fungi are among the most ecologically important organisms. By feeding on dead or decaying organic material, fungi help recycle nutrients through ecosystems. Additionally, fungi are important economically as foods and as biotechnological sources for medicines, insecticides, herbicides, and many other products.
About 200,000 species of fungi are known, but there may be millions more to be discovered because most are extremely small and found in poorly studied habitats such as soils. Increasingly, phylogenetic analysis is being used to discover new microfungi through isolation and sequencing of DNA. Biological studies on these new species hold great promise for developing novel natural products.
Common fungi often have mycorrhizal associations in early stages of development, and thus are important parts of Earth’s ecosystems.
Fungi: an unknown world revealed by phylogenetic analysis
Using phylogenetic analysis to discover new life forms for biotechnology
Thermophilic bacteria found in Yellowstone hot springs
A phylogeny of some archaebacteria. Newly discovered life forms are in red.
Tip labels in the figure: pJP 74, pJP 7, pJP 8, pJP 6, pJP 81, pJP 33, pJP 9; Desulfurococcus mobilis, Sulfolobus acidocaldarius, Pyrodictium occultum, Pyrobaculum islandicum, Pyrobaculum aerophilum, Thermoproteus tenax, Thermofilum pendens, Methanopyrus kandleri, Thermococcus celer, Archaeoglobus fulgidus.
“Simple identification via phylogenetic classification of organisms has, to date, yielded more patent filings than any other use of phylogeny in industry.” Bader et al. (2001)
Proc. Natl. Acad. Sci. USA
Vol. 74, No. 11, pp. 5088-5090, November 1977
Evolution
Phylogenetic structure of the prokaryotic domain: The primary kingdoms
(archaebacteria/eubacteria/urkaryote/16S ribosomal RNA/molecular phylogeny)
CARL R. WOESE AND GEORGE E. FOX*
Department of Genetics and Development, University of Illinois, Urbana, Illinois 61801
Communicated by T. M. Sonneborn, August 18, 1977
ABSTRACT A phylogenetic analysis based upon ribosomal RNA sequence characterization reveals that living systems represent one of three aboriginal lines of descent: (i) the eubacteria, comprising all typical bacteria; (ii) the archaebacteria, containing methanogenic bacteria; and (iii) the urkaryotes, now represented in the cytoplasmic component of eukaryotic cells.
The biologist has customarily structured his world in terms of certain basic dichotomies. Classically, what was not plant was animal. The discovery that bacteria, which initially had been considered plants, resembled both plants and animals less than plants and animals resembled one another led to a reformulation of the issue in terms of a yet more basic dichotomy, that of eukaryote versus prokaryote. The striking differences between eukaryotic and prokaryotic cells have now been documented in endless molecular detail. As a result, it is generally taken for granted that all extant life must be of these two basic types.
Thus, it appears that the biologist has solved the problem of the primary phylogenetic groupings. However, this is not the case. Dividing the living world into Prokaryotae and Eukaryotae has served, if anything, to obscure the problem of what extant groupings represent the various primeval branches from the common line of descent. The reason is that eukaryote/prokaryote is not primarily a phylogenetic distinction, although it is generally treated so. The eukaryotic cell is organized in a different and more complex way than is the prokaryote; this probably reflects the former's composite origin as a symbiotic collection of various simpler organisms (1-5). However striking, these organizational dissimilarities do not guarantee that eukaryote and prokaryote represent phylogenetic extremes.
The eukaryotic cell per se cannot be directly compared to the prokaryote. The composite nature of the eukaryotic cell makes it necessary that it first be conceptually reduced to its phylogenetically separate components, which arose from ancestors that were noncomposite and so individually are comparable to prokaryotes. In other words, the question of the primary phylogenetic groupings must be formulated solely in terms of relationships among "prokaryotes" – i.e., noncomposite entities. (Note that in this context there is no suggestion a priori that the living world is structured in a dichotomous way.)
The organizational differences between prokaryote and eukaryote and the composite nature of the latter indicate an important property of the evolutionary process: Evolution seems to progress in a "quantized" fashion. One level or domain of organization gives rise ultimately to a higher (more complex) one. What "prokaryote" and "eukaryote" actually represent are two such domains. Thus, although it is useful to define phylogenetic patterns within each domain, it is not meaningful to construct phylogenetic classifications between domains: Prokaryotic kingdoms are not comparable to eukaryotic ones. This should be recognized by an appropriate terminology. The highest phylogenetic unit in the prokaryotic domain we think should be called an "urkingdom" – or perhaps "primary kingdom." This would recognize the qualitative distinction between prokaryotic and eukaryotic kingdoms and emphasize that the former have primary evolutionary status.
The passage from one domain to a higher one then becomes a central problem. Initially one would like to know whether this is a frequent or a rare (unique) evolutionary event. It is traditionally assumed – without evidence – that the eukaryotic domain has arisen but once; all extant eukaryotes stem from a common ancestor, itself eukaryotic (2). A similar prejudice holds for the prokaryotic domain (2). [We elsewhere argue (6) that a hypothetical domain of lower complexity, that of "progenotes," may have preceded and given rise to the prokaryotes.] The present communication is a discussion of recent findings that relate to the urkingdom structure of the prokaryotic domain and the question of its unique as opposed to multiple origin.
Phylogenetic relationships cannot be reliably established in terms of noncomparable properties (7). A comparative approach that can measure degree of difference in comparable structures is required. An organism's genome seems to be the ultimate record of its evolutionary history (8). Thus, comparative analysis of molecular sequences has become a powerful approach to determining evolutionary relationships (9, 10).
To determine relationships covering the entire spectrum of extant living systems, one optimally needs a molecule of appropriately broad distribution. None of the readily characterized proteins fits this requirement. However, ribosomal RNA does. It is a component of all self-replicating systems; it is readily isolated; and its sequence changes but slowly with time – permitting the detection of relatedness among very distant species (11-13). To date, the primary structure of the 16S (18S) ribosomal RNA has been characterized in a moderately large and varied collection of organisms and organelles, and the general phylogenetic structure of the prokaryotic domain is beginning to emerge.
A comparative analysis of these data, summarized in Table 1, shows that the organisms clearly cluster into several primary kingdoms. The first of these contains all of the typical bacteria so far characterized, including the genera Acetobacterium, Acinetobacter, Acholeplasma, Aeromonas, Alcaligenes, Anacystis, Aphanocapsa, Bacillus, Bdellovibrio, Chlorobium, Chromatium, Clostridium, Corynebacterium, Escherichia, Eubacterium, Lactobacillus, Leptospira, Micrococcus, Mycoplasma, Paracoccus, Photobacterium, Propionibacterium,
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U. S. C. §1734 solely to indicate this fact.
* Present address: Department of Biophysical Sciences, University of Houston, Houston, TX 77004.
Discovering new life forms
Environmental Genome Shotgun Sequencing of the Sargasso Sea
J. Craig Venter,1* Karin Remington,1 John F. Heidelberg,3 Aaron L. Halpern,2 Doug Rusch,2 Jonathan A. Eisen,3 Dongying Wu,3 Ian Paulsen,3 Karen E. Nelson,3 William Nelson,3 Derrick E. Fouts,3 Samuel Levy,2 Anthony H. Knap,6 Michael W. Lomas,6 Ken Nealson,5 Owen White,3 Jeremy Peterson,3 Jeff Hoffman,1 Rachel Parsons,6 Holly Baden-Tillson,1 Cynthia Pfannkoch,1 Yu-Hui Rogers,4 Hamilton O. Smith1
We have applied “whole-genome shotgun sequencing” to microbial populations collected en masse on tangential flow and impact filters from seawater samples collected from the Sargasso Sea near Bermuda. A total of 1.045 billion base pairs of nonredundant sequence was generated, annotated, and analyzed to elucidate the gene content, diversity, and relative abundance of the organisms within these environmental samples. These data are estimated to derive from at least 1800 genomic species based on sequence relatedness, including 148 previously unknown bacterial phylotypes. We have identified over 1.2 million previously unknown genes represented in these samples, including more than 782 new rhodopsin-like photoreceptors. Variation in species present and stoichiometry suggests substantial oceanic microbial diversity.
Microorganisms are responsible for most of the biogeochemical cycles that shape the environment of Earth and its oceans. Yet, these organisms are the least well understood on Earth, as the ability to study and understand the metabolic potential of microorganisms has been hampered by the inability to generate pure cultures. Recent studies have begun to explore environmental bacteria in a culture-independent manner by isolating DNA from environmental samples and transforming it into large insert clones. For example, a previously unknown light-driven proton pump, proteorhodopsin, was discovered within a bacterial artificial chromosome (BAC) from the genome of a SAR86 ribotype (1), and soil microbial DNA libraries have been constructed and screened for specific activities (2).
Here we have applied whole-genome shotgun sequencing to environmental-pooled DNA samples to test whether new genomic approaches can be effectively applied to gene and species discovery and to overall environmental characterization. To help ensure a tractable pilot study, we sampled in the Sargasso Sea, a nutrient-limited, open ocean environment. Further, we concentrated on the genetic material captured on filters sized to isolate primarily microbial inhabitants of the environment, leaving detailed analysis of dissolved DNA and viral particles on one end of the size spectrum and eukaryotic inhabitants on the other, for subsequent studies.
The Sargasso Sea. The northwest Sargasso Sea, at the Bermuda Atlantic Time-series Study site (BATS), is one of the best-studied and arguably most well-characterized regions of the global ocean. The Gulf Stream represents the western and northern boundaries of this region and provides a strong physical boundary, separating the low nutrient, oligotrophic open ocean from the more nutrient-rich waters of the U.S. continental shelf. The Sargasso Sea has been intensively studied as part of the 50-year time series of ocean physics and biogeochemistry (3, 4) and provides an opportunity for interpretation of environmental genomic data in an oceanographic context. In this region, formation of subtropical mode water occurs each winter as the passage of cold fronts across the region erodes the seasonal thermocline and causes convective mixing, resulting in mixed layers of 150 to 300 m depth. The introduction of nutrient-rich deep water, following the breakdown of seasonal thermoclines into the brightly lit surface waters, leads to the blooming of single cell phytoplankton, including two cyanobacteria species, Synechococcus and Prochlorococcus, that numerically dominate the photosynthetic biomass in the Sargasso Sea.
Surface water samples (170 to 200 liters) were collected aboard the RV Weatherbird II from three sites off the coast of Bermuda in February 2003. Additional samples were collected aboard the SV Sorcerer II from “Hydrostation S” in May 2003. Sample site locations are indicated on Fig. 1 and described in table S1; sampling protocols were fine-tuned from one expedition to the next (5). Genomic DNA was extracted from filters of 0.1 to 3.0 μm, and genomic libraries with insert sizes ranging from 2 to 6 kb were made as described (5). The prepared plasmid clones were sequenced from both ends to provide paired-end reads at the J. Craig Venter Science Foundation Joint Technology Center on ABI 3730XL DNA sequencers (Applied Biosystems, Foster City, CA). Whole-genome random shotgun sequencing of the Weatherbird II samples (table S1, samples 1 to 4) produced 1.66 million reads averaging 818 bp in length, for a total of approximately 1.36 Gbp of microbial DNA sequence. An additional 325,561 sequences were generated from the Sorcerer II samples (table S1, samples 5 to 7), yielding approximately 265 Mbp of DNA sequence.
Environmental genome shotgun assembly. Whole-genome shotgun sequencing projects have traditionally been applied to identify the genome sequence(s) from one particular organism, whereas the approach taken here is intended to capture representative sequence from many diverse organisms simultaneously. Variation in genome size and relative abundance determines the depth of coverage of any particular organism in the sample at a given level of sequencing and has strong implications for both the application of assembly algorithms and for the metrics used in evaluating the resulting assembly. Although we would expect abundant species to be deeply covered and well assembled, species of lower abundance may be represented by only a few sequences. For a single genome analysis, assembly coverage depth in unique regions should approximate a Poisson distribution. The mean of this distribution can be estimated from the observed data, looking at the depth of coverage of contigs generated before any scaffolding. The assembler used in this study, the Celera Assembler (6), uses this value to heuristically identify clearly unique regions to form the backbone of the final assembly within the scaffolding phase. However, when the starting material consists of a mixture of genomes of varying abundance, a threshold estimated in this way would classify samples from the most abundant organism(s) as repetitive, due to their greater-than-average depth of coverage, paradoxically leaving the most abundant organisms poorly assembled. We therefore used manual curation of an initial
1The Institute for Biological Energy Alternatives, 2The Center for the Advancement of Genomics, 1901 Research Boulevard, Rockville, MD 20850, USA. 3The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA. 4The J. Craig Venter Science Foundation Joint Technology Center, 5 Research Place, Rockville, MD 20850, USA. 5University of Southern California, 223 Science Hall, Los Angeles, CA 90089–0740, USA. 6Bermuda Biological Station for Research, Inc., 17 Biological Lane, St George GE 01, Bermuda.
*To whom correspondence should be addressed. E-mail: [email protected]
RESEARCH ARTICLE
Science, Vol. 304, 2 April 2004, p. 66 (www.sciencemag.org)
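The coverage argument in the Sargasso Sea excerpt can be followed with a little arithmetic. The sketch below first checks the quoted read totals, then shows why a depth threshold derived from a single Poisson mean flags an abundant genome as "repetitive". The mean depth of 3, the 0.001 tail cut-off and the 20-fold depth of the abundant organism are invented numbers for illustration, and the threshold rule is a simplified stand-in, not the heuristic actually used by the Celera Assembler.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))


def repeat_threshold(lam, alpha=1e-3):
    """Smallest depth k that a unique region reaches with probability < alpha."""
    k = 0
    while poisson_sf(k, lam) >= alpha:
        k += 1
    return k


# Figures quoted in the excerpt: 1.66 million reads averaging 818 bp.
total_bp = 1_660_000 * 818
print(f"{total_bp / 1e9:.2f} Gbp")  # 1.36 Gbp

# Hypothetical mixture: the pooled sample has mean depth 3, but one abundant
# organism is covered 20-fold.
mean_depth = 3.0
abundant_depth = 20
threshold = repeat_threshold(mean_depth)
print("depth flagged as repetitive from:", threshold)
print("abundant genome flagged:", abundant_depth >= threshold)
```

Under these assumed numbers the abundant genome sits far in the tail of the pooled Poisson model, which is the paradox the authors describe: the best-covered organism is the one treated as repeat sequence.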
Disease Transmission and Medical Forensics
Brief Communications
Nature 444, 836-837 (14 December 2006) | doi:10.1038/444836a; Received 4 November 2006; Accepted 24 November 2006; Published online 6 December 2006
Molecular Epidemiology: HIV-1 and HCV sequences from Libyan outbreak
Tulio de Oliveira1, Oliver G. Pybus1, Andrew Rambaut2, Marco Salemi3, Sharon Cassol4, Massimo Ciccozzi5, Giovanni Rezza5, Guido Castelli Gattinara6, Roberta D'Arrigo7, Massimo Amicosante8, Luc Perrin9, Vittorio Colizzi10, Carlo Federico Perno11 and Benghazi Study Group12
In 1998, outbreaks of human immunodeficiency virus type 1 (HIV-1) and hepatitis C virus (HCV) infection were reported in children attending Al-Fateh Hospital in Benghazi, Libya. Here we use molecular phylogenetic techniques to analyse new virus sequences from these outbreaks. We find that the HIV-1 and HCV strains were already circulating and prevalent in this hospital and its environs before the arrival in March 1998 of the foreign medical staff (five Bulgarian nurses and a Palestinian doctor) who stand accused of transmitting the HIV strain to the children.
Almost half of the 111 children studied in the early months after the discovery of the outbreak showed evidence of both HIV-1 and HCV infection [1]. Of 418 children eventually affected by these viruses, 248 were referred to European hospitals [1, 2]. Sequence analysis of 51 children classified the HIV-1 infection as the strain CRF02_AG; HCV infection was classified as genotype 4 or subtype 1a in 15 children [1, 2].
We studied HIV-1 gag gene sequences from 44 affected children, plus 61 HCV E1E2 gene sequences that span the HCV hypervariable region (for methods, see supplementary information). By using these data in an evolutionary analysis, we could place a real timescale on the transmission history of the outbreaks.
We collated all available reference strains that were closely related to the sequences from the Al-Fateh Hospital, then estimated and assessed phylogenies using algorithmic, bayesian and maximum-likelihood methods (for details, see supplementary information). The HIV-1 sequences from the hospital form a well supported monophyletic cluster within the CRF02_AG clade, indicating that the outbreak arose from one CRF02_AG lineage. The cluster is closest to three west African reference sequences (Fig. 1a), the basal location of which suggests that the Al-Fateh Hospital lineage arrived in Libya from there. The branch length leading to the Al-Fateh Hospital cluster is perfectly typical; hence the Al-Fateh Hospital strain is not unusually divergent [2].
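What "monophyletic cluster" means operationally can be shown with a small check: a set of sequences is monophyletic in a rooted tree if some node has exactly that set as its descendants. The sketch below is illustrative only; the tree, the sequence labels (AFH1, WAfr1, Ref1 and so on) and the function names are invented and do not come from the study.

```python
def leaf_set(tree):
    """A tree is either a leaf name (str) or a tuple of subtrees."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def is_monophyletic(tree, taxa):
    """True if some node of the rooted tree has exactly `taxa` as its leaves."""
    taxa = frozenset(taxa)
    if leaf_set(tree) == taxa:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, taxa) for child in tree)


# Hypothetical rooted tree: three outbreak sequences nested next to a
# west African reference, with two more distant reference sequences.
tree = (((("AFH1", "AFH2"), "AFH3"), "WAfr1"), ("Ref1", "Ref2"))

print(is_monophyletic(tree, {"AFH1", "AFH2", "AFH3"}))  # True
print(is_monophyletic(tree, {"AFH1", "WAfr1"}))         # False
```

In a real analysis the topology is estimated with uncertainty, so the check would be repeated over bootstrap replicates or posterior trees to obtain the clade support values reported for the outbreak clusters.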
Figure 1: HIV-1 and HCV sequences from 1998 Al-Fateh Hospital (AFH) outbreak.
a–c, Estimated maximum-likelihood phylogenies for HIV-1 CRF02_AG (a), HCV genotype 4 (b) and HCV genotype 1 (c). Source of sequences used for analysis: AFH, red; Egypt, green; Cameroon, blue. Black circles mark the common ancestor of HCV subtype 4a and 1a; numbers above AFH lineages give clade support values using bootstrap and bayesian methods, respectively. Scale bar units are nucleotide substitutions per site. For visual clarity, AFH clusters are represented by triangles and some non-informative reference strains are excluded.
In an equivalent HCV phylogenetic analysis, the HCV sequences from the hospital formed three monophyletic clusters containing 11 subtype-4a sequences, phylogenetically placed among Egyptian subtype 4a lineages; 22 sequences most closely related to a Cameroonian genotype-4 strain; and 24 sequences belonging to the worldwide and prevalent subtype 1a; four remaining sequences belong to genotype 4 (Fig. 1b, c; see supplementary information).
Epidemiological linkage of the HIV-1 and HCV clusters from Al-Fateh Hospital with sequences from sub-Saharan Africa is to be expected, given the large number of migrants within or passing through Libya [3]; indeed, the Libyan authorities have expressed concern about the risk of introduction of HIV/AIDS and hepatitis as a result of this migration [4]. In addition, HCV genotype 4 is endemic to central Africa and the Middle East [5, 6, 7], and subtype 4a is exceptionally prevalent in neighbouring Egypt [8, 9].
Virus sequences also contain temporal information about the date of origin and age of epidemics [10]. We therefore comprehensively analysed the evolution of the Al-Fateh Hospital clusters using an established bayesian Markov chain Monte Carlo (MCMC) approach [9, 10] that appropriately accounts for estimation uncertainty. We estimated three parameter values for each cluster: the date of its most recent common ancestor; the probability that its most recent common ancestor was more recent than 1 March 1998; and the percentage of its lineages that already existed before 1 March 1998. (These values are conservative, because cluster origins could be older than the most recent common ancestor, but not younger.) To avoid model selection bias, we used a range of applicable models.
Ou et al. 1992
Applications of Phylogenetic Analysis
• Systematics and classification
• Discovering new life forms
• Phylogeography and speciation
• Molecular evolution
• Genomics
• Epidemiology and forensics
• Biotechnology
• Agriculture
• Conservation
The Tree of Life: Benefits to Society through Phylogenetic Research
Phylogenetic analysis is playing a major role in discovering and identifying new life forms that could yield many new benefits for human health and biotechnology. Many microorganisms, including bacteria and fungi, cannot be cultivated and studied directly in the laboratory; thus the principal road to discovery is to isolate their DNA from samples collected from marine or freshwater environments or from soils. The DNA samples are then sequenced and compared in phylogenetic analyses with the sequences of previously described organisms. This has led to major new discoveries.
For several decades microbiologists have been searching for new bacteria in extreme environments such as hot springs or marine hydrothermal vents. The thermal springs of Yellowstone National Park have yielded a host of new and important bacterial species, many of which were identified using phylogenetic analysis of DNA sequences.
The most famous bacterium from Yellowstone is Thermus aquaticus. An enzyme derived from this species, DNA Taq polymerase, powers a process called the polymerase chain reaction (PCR), which is used in thousands of laboratories to make large amounts of DNA for sequencing. This discovery led to the creation of a major new biotechnological industry and has revolutionized medical diagnostics, forensics, and other biological sciences. Many microorganisms in extreme environments may yield innovative products for biotechnology.
Fungi are among the most ecologically important organisms. By feeding on dead or decaying organic material, fungi help recycle nutrients through ecosystems. Additionally, fungi are important economically as foods and as biotechnological sources for medicines, insecticides, herbicides, and many other products.
About 200,000 species of fungi are known, but there may be millions more to be discovered because most are extremely small and found in poorly studied habitats such as soils. Increasingly, phylogenetic analysis is being used to discover new microfungi through isolation and sequencing of DNA. Biological studies on these new species hold great promise for developing novel natural products.
Common fungi often have mycorrhizal associations in early stages of development, and thus are important parts of Earth’s ecosystems.
Fungi: an unknown world revealed by phylogenetic analysis
Using phylogenetic analysis to discover new life forms for biotechnology
Thermophilic bacteria found in Yellowstone hot springs
A phylogeny of some archaeobacteria. Newly discovered life forms are in red.
[Figure: phylogenetic tree. Tip labels: pJP 74, pJP 7, pJP 8, pJP 6, pJP 81, pJP 33, pJP 9, Desulfurococcus mobilis, Sulfolobus acidocaldarius, Pyrodictium occultum, Pyrobaculum islandicum, Pyrobaculum aerophilum, Thermoproteus tenax, Thermofilum pendens, Methanopyrus kandleri, Thermococcus celer, Archaeoglobus fulgidus]
“Simple identification via phylogenetic classification of organisms has, to date, yielded more patent filings than any other use of phylogeny in industry.” Bader et al. (2001)
How do we know that phylogenetics works?
Application and Accuracy of Molecular Phylogenies
David M. Hillis, John P. Huelsenbeck, Clifford W. Cunningham
Molecular investigations of evolutionary history are being used to study subjects as diverse as the epidemiology of acquired immune deficiency syndrome and the origin of life. These studies depend on accurate estimates of phylogeny. The performance of methods of phylogenetic analysis can be assessed by numerical simulation studies and by the experimental evolution of organisms in controlled laboratory situations. Both kinds of assessment indicate that existing methods are effective at estimating phylogenies over a wide range of evolutionary conditions, especially if information about substitution bias is used to provide differential weightings for character transformations.
Over the past few decades, biologists from many disciplines have turned to phylogenetic analyses to interpret variation in biological systems (1). This increased interest in evolutionary history has developed partly in response to a new appreciation of the importance of understanding evolutionary constraints when interpreting biological variation and partly in response to developments in phylogenetic methodology. Three developments in particular have been critical to the success of the field: (i) the development of objective criteria and algorithms for discriminating among potential phylogenies, (ii) increased computational power to implement phylogenetic algorithms, and (iii) a rapid increase in the data available for inferring phylogenies, especially from molecular investigations (2). As a result of these developments, applications of phylogenetic analysis span the range of biological diversity from questions about the history of life (3) to studies of the epidemiology of acquired immune deficiency syndrome (AIDS) (4). However, the success of these applications depends on the accuracy of the inferred phylogenies, so it is necessary to ask how well the methods work and to identify the conditions under which they may fail.
The accuracy of methods of phylogenetic analysis can be assessed by the examination of either numerical simulations of phylogenies or phylogenies of organisms whose evolutionary history has been observed directly. Numerical simulations assume a particular model of evolution and then generate characters (typically, nucleotide sequences) according to the model and to a given phylogeny. Thus, an investigator can generate many replicate data sets under specified conditions in order to compare the performance of competing methods. The analysis of known phylogenies adds a reality check to the simulation studies: The history of the lineages is known (or, ideally, controlled by the investigator), but the organisms evolve under real biological constraints rather than idealized model conditions. Known phylogenies may involve laboratory or cultivated strains whose history has been recorded (5) or lineages that have been manipulated under controlled experimental conditions for the purpose of generating testable phylogenies (6, 7).
The authors are in the Department of Zoology, University of Texas, Austin, TX 78712, USA.
The numerical simulation and experimental phylogeny approaches are largely complementary, and both kinds of studies are necessary to evaluate methods of phylogenetic analysis effectively. Simulations can be used to explore virtually any conceivable phylogeny, and phylogenies can be replicated with speed and ease. The primary limitation of numerical simulations is that they always include gross simplifications of biological processes. For instance, most simulations assume that nucleotide positions evolve independently of one another, even though several causes of non-independence have been identified (8). Many simulations also assume simple one- or two-parameter substitution models; for instance, all possible substitutions may be assumed to be equally probable (a one-parameter model), or separate probabilities of substitution may be assigned to transitions and transversions (a two-parameter model). However, real substitution biases are known to be much more complex (9). Although these complexities can be added to simulation studies, there is rarely sufficient knowledge to estimate the extent of the influence of factors such as non-independence among nucleotide positions or variance of rates of evolution across nucleotide positions. Therefore, results from simulation studies need to be compared to results from studies of real biological organisms to determine the effects of the simplifying assumptions. If results from simulations can be replicated with experimental systems, then greater faith can be placed in the simulation results. However, if departures from the simulation results are discovered, then the processes that are responsible for the differences can be identified and the simulations can be improved. The simulations are likely to suggest conditions that are of interest in the experimental phylogenies, and the experimental phylogenies can provide a test of the simulation results. Thus, a combination of the two approaches is the most effective way to evaluate the performance of methods of phylogenetic analysis (10).
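As a rough illustration of the one- and two-parameter substitution models mentioned above, the following sketch evolves a nucleotide sequence along a single branch under the Kimura two-parameter model, using its standard closed-form transition probabilities. The function names (`k2p_probs`, `evolve`) and the rate parameters `alpha` (transitions) and `beta` (each transversion) are illustrative assumptions, not taken from the article; setting `alpha == beta` gives the one-parameter (Jukes-Cantor) case.

```python
import math
import random

# Transition partner (purine<->purine, pyrimidine<->pyrimidine) and the
# two possible transversion targets for each base.
PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def k2p_probs(alpha, beta, t):
    """Return (p_same, p_transition, p_each_transversion) after time t.

    alpha is the transition rate, beta the rate of each of the two
    possible transversions (hypothetical parameter names).
    """
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    p_same = 0.25 + 0.25 * e1 + 0.5 * e2
    p_ts = 0.25 + 0.25 * e1 - 0.5 * e2
    p_tv = 0.25 - 0.25 * e1
    return p_same, p_ts, p_tv


def evolve(seq, alpha, beta, t, rng):
    """Evolve each site of seq independently along one branch."""
    p_same, p_ts, _ = k2p_probs(alpha, beta, t)
    out = []
    for base in seq:
        u = rng.random()
        if u < p_same:
            out.append(base)                          # no net change
        elif u < p_same + p_ts:
            out.append(PARTNER[base])                 # transition
        else:
            out.append(rng.choice(TRANSVERSIONS[base]))  # transversion
    return "".join(out)


rng = random.Random(0)
ancestor = "".join(rng.choice("ACGT") for _ in range(60))
descendant = evolve(ancestor, alpha=1.0, beta=0.1, t=0.2, rng=rng)
```

Note that this sketch makes exactly the simplification the text warns about: every site evolves independently and at the same rate.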
Simple Evolutionary Models
Most simulated phylogenies assume a simple one- or two-parameter model of evolution and then test the ability of various methods to reconstruct the evolutionary history of lineages generated under the assumed model (11, 12). Several methods are known to be consistent (at least for simple tree topologies) for data generated under such models, which means that they converge on the correct answer, given infinite data. In general, most of the commonly used methods are consistent if corrections are made for superimposed changes (such as multiple substitutions at a single nucleotide site) in accord with the model of evolution used (13). For instance, most pairwise distance methods (except the UPGMA method) are consistent under the Jukes-Cantor one-parameter model of evolution if Jukes-Cantor distances are used to infer the phylogeny (12, 14). Character-based methods such as parsimony can also be made consistent by using a Hadamard transformation to correct the data (13). However, the fact that a method is consistent indicates only that it will converge on the correct answer when given unlimited data, so it is necessary to do power analyses in order to compare the performance of competing methods, given finite data sets.
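The correction for superimposed changes can be made concrete with the Jukes-Cantor distance, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. The sketch below is a minimal illustration; the function names are hypothetical, and it assumes aligned, gap-free sequences of equal length.

```python
import math


def p_distance(seq1, seq2):
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)


def jukes_cantor_distance(p):
    """Expected substitutions per site under the Jukes-Cantor model.

    The correction is undefined once p reaches 0.75, the divergence
    expected between unrelated sequences with equal base frequencies.
    """
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGTAC", "ACGTTCGTAA")   # 2 of 10 sites differ
d = jukes_cantor_distance(p)                 # slightly larger than p
```

The corrected distance always exceeds the observed proportion for p > 0, because some substitutions are hidden by later changes at the same site.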
A common objection made to simulation studies is that it is easy to bias the results in favor of almost any method by choosing conditions to simulate that are most favorable to that method (15). Such biases can be avoided only by exhaustively exploring the potential parameters of any given problem. As an example, consider one of the most commonly simulated cases: a simple four-taxon unrooted tree, in which the five lineages (four peripheral branches and a central branch) are evolving at two different rates (Fig. 1). Felsenstein (16) used a tree of this type to demonstrate that some methods of phylogenetic reconstruction are inconsistent when two of the opposing peripheral branches are evolving much more rapidly than are the remaining three branches. Given a model of evolution (for example, the Kimura two-parameter model, which allows for independent substitution rates for transitions and transversions) (17), and given two rates of evolution (one rate for two of the opposing branches and a second rate for the remaining three branches), the universe of possible trees can be examined in a two-dimensional graph (Fig. 1). Instantaneous substitution rates can be varied from zero to infinity along each of the axes, and sequences can be generated in accord with the model of evolution. A power analysis is conducted by generating sequences of given finite length and then inferring the trees from the sequences by the use of competing methods.
SCIENCE * VOL. 264 * 29 APRIL 1994 671
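One point of such a power analysis can be sketched in a few lines. The example below simulates the four-taxon tree ((A,B),(C,D)) with long branches leading to A and C, then picks the topology supported by the most parsimony-informative sites. It is only an illustration under stated assumptions: branch lengths are given as the probability that a site differs at the two ends of a branch, replacements are drawn uniformly from the other three bases (a Jukes-Cantor-like simplification rather than the Kimura model used in the article), and all function names are hypothetical.

```python
import random

BASES = "ACGT"


def evolve(seq, p, rng):
    """Copy seq along one branch; each site differs at the end with probability p."""
    out = []
    for base in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return out


def simulate_four_taxa(n_sites, p_long, p_short, rng):
    """True tree is ((A,B),(C,D)); A and C sit on the two long branches."""
    left = [rng.choice(BASES) for _ in range(n_sites)]  # node joining A and B
    right = evolve(left, p_short, rng)                  # central branch
    return {
        "A": evolve(left, p_long, rng),
        "B": evolve(left, p_short, rng),
        "C": evolve(right, p_long, rng),
        "D": evolve(right, p_short, rng),
    }


def parsimony_choice(seqs):
    """Count informative site patterns for each of the three unrooted trees."""
    counts = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for a, b, c, d in zip(seqs["A"], seqs["B"], seqs["C"], seqs["D"]):
        if a == b and c == d and a != c:
            counts["AB|CD"] += 1
        elif a == c and b == d and a != b:
            counts["AC|BD"] += 1
        elif a == d and b == c and a != b:
            counts["AD|BC"] += 1
    return max(counts, key=counts.get), counts


rng = random.Random(1)
equal_rates, _ = parsimony_choice(simulate_four_taxa(2000, 0.1, 0.1, rng))
unequal_rates, _ = parsimony_choice(simulate_four_taxa(2000, 0.5, 0.05, rng))
```

With equal, moderate rates the true split AB|CD dominates; with two long opposing branches, parallel changes on A and C typically outnumber the true signal and unweighted parsimony groups the long branches together. Repeating this over a grid of `p_long` and `p_short` values, and over many replicates, gives the kind of two-dimensional graph described in the text.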
Figure 1 shows a power analysis for three common methods of phylogenetic inference and the effects of two common methods of data transformation under the model of evolution outlined above (18). For nontransformed data, all three methods are inconsistent in parts of the graph space; use of Kimura-corrected distances (which exactly match the model of evolution) makes the neighbor-joining method consistent across the graph (12). Another common type of data transformation involves character weighting (19, 20). In character methods such as parsimony, differential weights are often assigned to the different character-state changes, depending on their observed frequency of occurrence. Thus, in the Kimura model simulated in Fig. 1, transitions are 10 times more likely to occur than are transversions, so the weighted-parsimony analysis weights the transversions 10 times more heavily than transitions (in practice, a wide range of weights of transversions over transitions produces identical results) (Fig. 2). Such weighting is not equivalent to transforming the data to account for superimposed changes, so weighted parsimony is not consistent across the entire graph space (12). However, the power analysis shown in Fig. 1 indicates that weighting of characters has a much greater effect on performance than does correction for superimposed changes, especially at high rates of change. Although the weighted-parsimony method is more likely to be misleading at extreme differences in the two rates (that is, in the upper left corner of the graph space), it is more likely to find the correct tree at high rates of change (Fig. 1). The Kimura corrections do improve the performance of the neighbor-joining method in regions that are inconsistent for the uncorrected data but do not improve performance when rates are uniformly higher (as does character weighting). The Kimura corrections actually reduce the performance of distance methods under conditions of equal rates of change (Fig. 1).
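Character-state weighting of this kind can be illustrated with a small dynamic-programming sketch in the style of Sankoff's algorithm, which scores one site on a fixed tree when transversions cost more than transitions. This is an assumption-laden illustration, not the implementation used in the article: the tree is written as nested tuples of observed bases, and `tv_weight=10` simply mirrors the 10:1 weighting described above.

```python
PURINES = {"A", "G"}
BASES = "ACGT"


def step_cost(x, y, tv_weight):
    """Cost of changing base x into base y along one branch."""
    if x == y:
        return 0
    is_transition = (x in PURINES) == (y in PURINES)
    return 1 if is_transition else tv_weight


def sankoff(node, tv_weight):
    """Return {state: minimum cost of the subtree if node has that state}."""
    if isinstance(node, str):  # leaf: only the observed base is allowed
        return {s: (0 if s == node else float("inf")) for s in BASES}
    left, right = (sankoff(child, tv_weight) for child in node)
    return {
        s: min(step_cost(s, t, tv_weight) + left[t] for t in BASES)
        + min(step_cost(s, t, tv_weight) + right[t] for t in BASES)
        for s in BASES
    }


def site_cost(tree, tv_weight=10):
    """Weighted parsimony score of one site on a rooted binary tree."""
    return min(sankoff(tree, tv_weight).values())


# Site pattern A, A, C, C scored on two competing topologies.
grouped = site_cost((("A", "A"), ("C", "C")))   # one transversion: cost 10
split = site_cost((("A", "C"), ("A", "C")))     # two transversions: cost 20
```

Summing `site_cost` over all sites and choosing the tree with the lowest total gives a weighted-parsimony estimate; with `tv_weight=1` the same code reduces to uniformly weighted parsimony.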
Some authors have argued that methods such as parsimony should be avoided because they are inconsistent for some trees (for example, those in the upper left corner of the graphs in Fig. 1) when they evolve under simple models of evolution (21). However, all methods become inconsistent for some trees when their assumptions are violated (12), and the cost of complete consistency under simple models of evolu-
Fig. 1. Performance of three methods of phylogenetic analysis on the basis of simulation of four-taxon trees under the Kimura model of evolution (18). Two rates of evolution were simulated: one rate for branches a, b, and c (horizontal axis of each graph) and a second rate for branches d and e (vertical axis). The diagonal (dashed line, top left) represents equal rates of evolution along all lineages. Branch lengths are shown in expected frequency of divergent nucleotides at the two ends of the respective branches. At infinite rates of change, DNA sequences with equal base compositions are expected to differ at 75% of their positions. Blue indicates that the method estimates the correct tree a high percentage of the time under the simulated conditions; red indicates poor performance of the method (see color bar, top right). The solid white lines circumscribe the regions in which each method estimates the correct tree over 95% of the time. In the regions above the dashed white lines, the methods estimate the correct tree less than one-third of the time (a rate worse than that obtained by choosing a tree at random). The three colored graphs on the left were based on nontransformed data; the three graphs on the right show the effects of character-state weighting (for parsimony, top) and distance correction (for neighbor joining and UPGMA, middle and bottom).
Fig. 2. Efficiency of five methods of phylogenetic analysis for a four-taxon tree with equal rates of evolution, evolving under a Kimura model of evolution and a 10:1 transition:transversion ratio. The branch lengths shown on the tree indicate that 50% of the nucleotide sites are expected to change along each branch. Although all five methods are consistent under these conditions (they all eventually converge on the correct solution), the methods differ markedly in the number of nucleotides needed to find the correct solution. All points are based on 1000 simulated trees. WPars is weighted parsimony (45) (any weighting of transversions over transitions from 5:1 to infinity produces results indistinguishable from those shown); Pars is uniformly weighted parsimony (45); NJ is neighbor joining with Kimura distances (38); UPGMA is the unweighted pair-group method of averages with Kimura distances (40); Lake's invariants is the method also known as evolutionary parsimony (22).
[Plot: horizontal axis, number of nucleotides (10^1 to 10^8); vertical axis ticks from 30 to 100; one curve per method.]
from parsimony. In the original study, the phylogeny of these lineages was inferred from restriction site maps of the entire viral genome, and all methods tested were successful at recovering the known phylogeny (6). The methods differed significantly in their ability to recover the branch lengths of the phylogeny (7), and the study also indicated a high degree of success in the reconstruction of ancestral restriction maps (>98% accuracy). However, the study did not discriminate among methods on the basis of their ability to find the correct order of branching events, because all methods found the correct tree.
We have now investigated this phylogeny, using two additional data sets: restriction fragments and DNA sequences (33). Some authors recommend using the presence or absence of restriction fragments (rather than the presence or absence of restriction sites) to infer phylogenies, because it is much easier to collect restriction fragment data than restriction site data (34). However, restriction fragments do not evolve independently (a single site gain results in the loss of one fragment and the gain of two others), and deletions can affect the fragments produced by many restriction enzymes simultaneously. Because of these problems, many authors argue that restriction site data should be preferred to restriction fragment data (35). This position is supported by the experimental T7 phylogeny, because all methods estimated an incorrect phylogeny when using high-resolution restriction fragments, but they estimated the correct phylogeny when using restriction sites. This difference in the performance of analyses based on the two types of data has not been apparent in simulation studies, possibly because simulation studies rarely include deletions in their models of evolutionary change.
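The non-independence of restriction fragments is easy to see in a toy digest. The sketch below assumes a linear genome (as for T7) with a made-up length of 1000 units and made-up cut positions; `fragment_lengths` is a hypothetical helper, not part of any library.

```python
def fragment_lengths(genome_length, cut_sites):
    """Fragment sizes produced by cutting a linear genome at the given positions."""
    cuts = [0] + sorted(cut_sites) + [genome_length]
    return [end - start for start, end in zip(cuts, cuts[1:])]


before = fragment_lengths(1000, [100, 700])        # [100, 600, 300]
after = fragment_lengths(1000, [100, 300, 700])    # [100, 200, 400, 300]

lost = sorted(set(before) - set(after))      # [600]
gained = sorted(set(after) - set(before))    # [200, 400]
```

A single evolutionary event (gain of the site at position 300) changes three fragment characters at once, whereas in a site-based data matrix it changes exactly one character.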
The sequence data consist of 1091 base pairs across four genes of T7 (36). There are only 63 variable sites across the sequences, or about one-third as many variable characters as are present in the restriction site data (6). Competing methods do not perform as well with the sequence data as they do with the restriction site data. With the sequence data, only parsimony and weighted parsimony estimate the correct tree, although a second tree (that differs by one branch) is equally parsimonious. Maximum likelihood (37), neighbor joining (38), the Fitch-Margoliash method (39), and UPGMA (40) each estimate a single, incorrect tree that differs from the correct tree by one branch rearrangement. The less accurate overall performance of all methods with the sequence data does not necessarily imply that sequences are less reliable than restriction sites for inferring phylogeny, because there are fewer variable sites in the sequence data set. However, if bootstrap samples equal in size to the sequence data set are selected from the complete restriction site data and compared to bootstrap samples of the sequence data, then the restriction site data do appear to be somewhat more reliable for inferring phylogeny for most methods (maximum likelihood is the exception) (Fig. 7). A possible explanation lies in the non-independent evolution of some nucleotides within genes (7, 8); the variable restriction sites are distributed across the entire T7 genome and therefore are more likely to vary independently of one another. For these data, differential weighting of character states does not improve phylogenetic resolution, because rare substitutions are restricted to single terminal lineages and therefore are uninformative under the parsimony criterion. On the basis of the simulated HIV phylogenies discussed earlier, the beneficial effects of weighting are expected only at higher rates of evolution than were observed. The relatively poor performance of maximum-likelihood estimation on the restriction site data may be because the strongly biased substitution matrix violates the assumptions of the method (7).
Fig. 6. Comparison of an observed phylogeny of viruses derived from bacteriophage T7 with an estimated phylogeny from the parsimony method, on the basis of analysis of the terminal sequences (J through R). The numbers above the branches indicate the actual or estimated number of substitutions that occurred along the respective lineages. The actual numbers of substitutions were determined by sequencing the ancestral viruses. Ranges of values on the estimated tree indicate that multiple, equally parsimonious reconstructions of character states are possible.
Fig. 7. Comparison of phylogenetic analyses of the viral lineages derived from bacteriophage T7, on the basis of 1000 bootstrap samples of DNA sequences and 1000 bootstrap subsamples of the restriction site data that have the same number of variable sites as are in the sequence data. All methods found the correct tree with the complete restriction site data set; only parsimony and weighted parsimony found the correct tree with the complete sequence data set.
[Plot: methods compared are weighted parsimony, parsimony, neighbor joining, UPGMA, and maximum likelihood.]
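The bootstrap comparison described above rests on resampling characters (alignment columns or restriction sites) with replacement. A minimal sketch follows; the alignment is a made-up dictionary of equal-length strings, and `bootstrap_sites` with its optional `n_sites` argument is a hypothetical helper for drawing either a full-size replicate or a smaller subsample.

```python
import random


def bootstrap_sites(alignment, n_sites=None, rng=None):
    """Resample columns with replacement; n_sites defaults to the full length."""
    rng = rng or random.Random()
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences must have the same length")
    if n_sites is None:
        n_sites = length
    # The same column indices are used for every taxon, so each resampled
    # character is an intact column of the original data matrix.
    columns = [rng.randrange(length) for _ in range(n_sites)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


alignment = {            # toy data; restriction sites could be coded as "0"/"1"
    "J": "ACGTACGTAC",
    "K": "ACGTACGTAA",
    "L": "ACGAACGTAA",
    "M": "TCGAACGTAA",
}
replicate = bootstrap_sites(alignment, rng=random.Random(42))
subsample = bootstrap_sites(alignment, n_sites=5, rng=random.Random(42))
```

Each replicate would then be analyzed with the method under study, and the fraction of replicates that recover the known tree compared between data types.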
Clearly, it will be necessary to construct additional experimental phylogenies that are based on other tree topologies and experimental conditions so that the generality of the results can be checked. In particular, predicted conditions of inconsistency need to be examined experimentally. Nonetheless, there is a high degree of correspondence between the results from simulations and the experimental phylogenies, although the experiments suggest additional complexities that need to be added to simulations. For instance, the comparison of restriction site data with restriction fragment data indicates the need to incorporate insertion-deletion events into simulations as well as methods of analysis, and the sequence analyses confirm the importance of accounting for non-independence among nucleotide sites. In general, however, the experimental phylogenies confirm the relatively high levels of performance of the various methods of phylogenetic analysis under realistic conditions.
Conclusions
Both simulation studies and experimental phylogenies indicate that many methods of phylogenetic analysis are powerful enough to reconstruct evolutionary histories with a high degree of accuracy, as long as the rates of change of the observed characters are appropriate for analysis. This emphasizes the importance of methods that evaluate whether rates of evolutionary change in target sequences are appropriate for phylogenetic analysis (41). Experimental phylogenies also indicate that many methods may be fairly robust to violations of the underlying assumptions, such as non-independence among nucleotide sites or deviations from simple models of evolution. It also is clear that differential weighting of character-state changes to reflect the observed frequency of the different types of transformations may substantially improve the performance of phylogenetic methods (especially
a) simulations
b) experimental phylogenies
Springer et al. 2004
mammalian taxa, new questions arise, such as whether the underlying genetic architecture responsible for these changes involves the same or different genes.
The root of the placental tree and other remaining problems
With the proposal of and strong support for the four major clades of placental mammals, as well as Boreoeutheria (Euarchontoglires + Laurasiatheria), there are only three viable locations for the root of the placental tree [19,21–23]. These are between (i) Afrotheria and other placental orders, (ii) Xenarthra and other placental orders (as favored by morphology), and (iii) Atlantogenata (Xenarthra + Afrotheria) and Boreoeutheria. Numerical simulations [21] reject the latter two hypotheses, but these tests might be too liberal in rejecting alternate hypotheses if real data are not simulated accurately according to current models of sequence evolution [40]. Resolving the placental root remains the most fundamental problem for future studies of placental phylogeny and has implications for understanding early placental biogeography. For all three competing hypotheses, molecular data give the separation of South American xenarthrans and African-origin afrotheres as being ~100 million years ago, which coincides with the vicariant separation of South America and Africa. Whereas some workers have suggested a causal connection between these plate-tectonic dates and molecular dates separating Xenarthra and Afrotheria [18,21], others dismiss this as coincidence [41].
Similar to the placement of the placental root, remaining problems associated with resolving relationships within the major clades involve minor perturbations of the tree shown in Figure 1b. The discovery of further RGCs will be crucial in testing alternate hypotheses that involve short time intervals [22]. Within Laurasiatheria, it is unclear if perissodactyls are more closely related to pangolins + carnivores or to Cetartiodactyla. Within Afrotheria, it has proved difficult to resolve the relationship among the three paenungulate orders (elephants, hyraxes, dugongs–manatees). By contrast, morphology strongly supports a sister-group relationship between Proboscidea and Sirenia (Tethytheria) [3,4,42], which is also supported by complete mitochondrial genomes [43].
Minority views
The emerging consensus for placental ordinal relationships (Figure 1b), with its four major clades that are supported by overwhelming sequence evidence and RGCs, is not without critics [4,14,44]. Arnason et al.’s [14] mtDNA analysis suggests that hedgehogs are dissociated from other core insectivores, such as shrews and moles, and were the earliest offshoot of the placental tree. Arnason et al. [14] also find that rodents, Glires, Euarchontoglires, and Boreoeutheria are all paraphyletic taxa. However, Lin et al. [27] found that mtDNA trees recover the same four clades as nuclear genes when outgroup taxa are removed. Peculiar features of rooted mtDNA trees can result from inadequate models of sequence evolution [27,28] and/or unbalanced taxon sampling [28,29]. In particular, some marsupials have unusual nucleotide compositions and there have been changes in the mutational process in both hedgehogs and murid rodents relative to most other placental mammal mitochondrial genomes [27]. These changes violate the assumptions of most methods of phylogeny reconstruction. For example, general time reversible models of nucleotide substitution assume that base composition remains the same in different lineages. Other analyses suggest that protein-coding regions of the
Figure 2. Parallel morphological radiations in Afrotheria and Laurasiatheria illustrate homoplasy in external morphology. (a) African golden mole (Chrysochlorinae) and (b) Old World mole (Talpinae); (c) Malagasy hedgehog (Tenrecinae) and (d) common hedgehog (Erinaceinae); (e) shrew tenrec (Oryzorictinae; Microgale thomasi; Copyright Link Olson) and (f) common shrew (Soricinae); (g) manatee (Trichechidae) and (h) dolphin (Delphininae); (i) aardvark (Orycteropodidae) and (j) pangolin (Maninae).
Review TRENDS in Ecology and Evolution Vol.19 No.8 August 2004 435
www.sciencedirect.com
Convergence is widespread!
convergent evolution of features related to volancy in bats and flying lemurs, but eliminates the need to postulate the loss of archontan ankle specializations in bats [32]. Complete mtDNA analyses recently placed flying lemurs within primates and rendered the latter PARAPHYLETIC [14]. However, SINE and LINE insertions [33] and analyses of nuclear genes [21,24] recover traditional primate MONOPHYLY. Within Laurasiatheria, Eulipotyphla (e.g. moles, shrews, hedgehogs) is the probable sister-taxon to the remaining orders. The emerging molecular support for a sister-group relationship between carnivores and pangolins includes concatenated nuclear sequences [21], mitochondrial protein sequences [14] and an RGC (Box 1). Morphologically, carnivores and pangolins are unique among living placental mammals in possessing an osseous tentorium that separates the cerebral and cerebellar compartments of the cranium [3].
Molecular data are also resolving relationships within orders, sometimes with unexpected results. In addition to nesting whales within Artiodactyla, molecular data separate hippos from other Suiformes (e.g. pigs) [10]. In Eulipotyphla, shrews and hedgehogs group to the exclusion of moles [25,34]. This result contrasts with morphological hypotheses that favor either moles + shrews to the exclusion of hedgehogs or moles + hedgehogs to the exclusion of shrews. In Rodentia, molecular data suggest a novel mouse-related clade that includes murids (mice and rats), dipodids (jerboas), castorids (beavers), geomyids (pocket gophers), heteromyids (pocket mice), anomalurids (scaly-tailed flying squirrels), and pedetids (springhares) [35]. This group had never been proposed based on morphological and paleontological data. Within Chiroptera (bats), both nuclear and mitochondrial sequences favor microbat paraphyly, which has profound implications for understanding the origins of laryngeal echolocation (Box 2).
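Statements such as "shrews and hedgehogs group to the exclusion of moles" amount to asking whether a set of taxa forms a clade on a given tree. A minimal Python sketch of that check follows, assuming a rooted tree written as nested tuples. The function names and the toy topology are illustrative assumptions, not data or code from the cited analyses.

```python
def leaves(tree):
    """Return the set of tip names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    tips = set()
    for child in tree:
        tips |= leaves(child)
    return tips


def is_monophyletic(tree, taxa):
    """True if some subtree contains exactly the taxa in `taxa` and nothing else."""
    taxa = set(taxa)
    if leaves(tree) == taxa:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, taxa) for child in tree)


# Hypothetical toy topology: ((shrew, hedgehog), mole) with a carnivore outgroup.
toy_tree = ((("shrew", "hedgehog"), "mole"), "carnivore")

shrew_hedgehog = is_monophyletic(toy_tree, {"shrew", "hedgehog"})  # True on this toy tree
mole_shrew = is_monophyletic(toy_tree, {"mole", "shrew"})          # False on this toy tree
```

On this toy tree the set {mole, shrew} is not a clade, because the smallest subtree containing both also contains the hedgehog. The recursive search is quadratic in the worst case, which is acceptable for a sketch on small trees.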
The deployment of morphological character evolution
Darwin [36] recognized that ANALOGICAL or adaptive characters would be almost valueless to the systematist
Figure 1. The prevailing morphological tree (a) and the emerging molecular tree (b) of the placental orders. (a) Morphology generally places Xenarthra (sloths, anteaters and armadillos) as basal, and most of the remaining orders into three well-established clades: Ungulata (thought to be derived from CONDYLARTH ancestors), Archonta and Anagalida. The depicted tree is from Shoshani and McKenna [3]. The tree obtained by Liu et al. [4] is identical, apart from placing cetaceans as sister group to the perissodactyl-paenungulate clade. The tree of Novacek ([6]; http://tolweb.org/tree?group=Eutheria&contgroup=Mammalia) places Pholidota (pangolins) as basal sister to Xenarthra, makes Primates and Scandentia (tree shrews) sister groups, and collapses several clades (black dotted lines). Novacek [5] subsequently collapses some further clades (gray dotted lines), which increases reconciliation with the molecular tree. (b) The molecular tree recognizes four major clades: Afrotheria, Xenarthra, Laurasiatheria and Euarchontoglires, of which the latter two are joined into Boreoeutheria. The presented placental ordinal topology is according to Murphy et al. [21]. Placing Marsupialia as sister to Placentalia is based on Phillips and Penny [54] and references therein. Clades indicated by solid lines are, with rare exceptions, supported independently by all other molecular data and analyses [24–29]. A notable exception is the strong tendency of mitochondrial protein sequences to place hedgehogs and rodents as basal in the tree [14]. Colors distinguish the four basal placental clades in the molecular tree.
[Figure 1 tree panels; tip labels as extracted, not in tree order. (a) Morphological tree: Marsupialia, Xenarthra, Pholidota, Rodentia, Lagomorpha, Macroscelidea, Primates, Scandentia, Dermoptera, Chiroptera, Insectivora, Carnivora, Cetacea, Artiodactyla, Perissodactyla, Hyracoidea, Proboscidea, Sirenia, Tubulidentata, Monotremata; clade labels: Ungulata, Archonta, Anagalida. (b) Molecular tree: Marsupialia, Xenarthra, Pholidota, Rodentia, Lagomorpha, Macroscelidea, Primates, Scandentia, Dermoptera, Chiroptera, Eulipotyphla, Carnivora, Cetartiodactyla, Perissodactyla, Hyracoidea, Proboscidea, Tubulidentata, Afrosoricida, Monotremata, Sirenia; clade labels: Laurasiatheria, Euarchontoglires, Afrotheria, Xenarthra.]
Springer et al. 2004
[Figure 2 panels. a, Phylogenetic tree with a scale bar of 0.0002 substitutions per site; time axis from March 2014 to January 2015; legend: Sierra Leone, Guinea, Liberia, Mali; labelled lineages GN1, GN2, GN3, GN4, SL3, Lineage A and Lineage B; posterior support values at selected nodes range from 0.71 to 1.0. b, Root-to-tip divergence (axis 0.0000 to 0.0010) plotted against collection date (December 2013 to January 2015); legend: Sierra Leone, Guinea, Liberia, Mali.]
Figure 2 | Phylogenetic relatedness and nucleotide sequence divergence of EBOV isolates from the 2013–2015 outbreak. a, Phylogenetic relatedness of EBOV isolates. Phylogenetic tree inferred using MrBayes (ref. 11) for full-length EBOV genomes sequenced from 179 patient samples obtained between March 2014 and January 2015. Displayed is the majority consensus of 10,000 trees sampled from the posterior distribution with mean branch lengths. Posterior support is shown for selected key nodes. Twenty-two samples originated in Liberia and were collected between March and August 2014 and six samples from Sierra Leone were obtained in June and July 2014. In our analysis we also included published sequences, including the three early Guinean sequences (ref. 2) and 78 sequences described by Gire et al. (ref. 6). A number of lineages predominantly circulating in Guinea are denoted as GN1–4 along with a uniquely Sierra Leone lineage (SL3) recognised in Gire et al. (ref. 6). b, EBOV nucleotide sequence divergence from root of the phylogeny in Fig. 2a plotted against time of collection of each virus. The date of the first documented case near Meliandou in eastern Guinea is indicated by the red triangle.
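Panel b plots root-to-tip divergence against collection date. A common way to summarise such a plot is an ordinary least-squares line, whose slope gives a rough substitution rate and whose x-intercept gives a rough date for the root. The Python sketch below illustrates this under stated assumptions: the function name is hypothetical, the four data points are invented for illustration and are not values from the paper, and this is not presented as the authors' method.

```python
def root_to_tip_regression(dates, divergences):
    """Least-squares fit of divergence = slope * date + intercept.

    dates: sampling dates in decimal years.
    divergences: root-to-tip distances in substitutions per site.
    Returns (slope, intercept, root_date), where root_date is the
    x-intercept, i.e. the date at which fitted divergence is zero.
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(divergences) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, divergences))
    slope = sxy / sxx                      # substitutions per site per year
    intercept = mean_d - slope * mean_t
    root_date = -intercept / slope
    return slope, intercept, root_date


# Invented toy data lying exactly on a line (not taken from the paper).
toy_dates = [2014.25, 2014.50, 2014.75, 2015.00]
toy_divergences = [0.001 * (t - 2014.0) for t in toy_dates]

rate, _, root_date = root_to_tip_regression(toy_dates, toy_divergences)
```

Because tips on a tree share branches, the points are not independent, so a fit like this is usually treated as an exploratory check of clock-like signal rather than a formal rate estimate.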
[Figure 3 tree: time axis from December 2013 to January 2015; legend: Guinea, Liberia, Mali, Sierra Leone; labelled lineages GN1, GN2, GN3, GN4, SL1, SL2, SL3, Lineage A and Lineage B; posterior support values shown: 1.0, 1.0, 1.0, 0.99, 0.96.]
Figure 3 | A time-scaled phylogenetic tree of 262 EBOV genomes from Guinea, Sierra Leone, Liberia and Mali. Shown is a maximum clade credibility tree constructed from 10,000 trees sampled from the posterior distribution with mean node ages. Clades described in Gire et al. (ref. 6) are identified here (SL1, SL2 and SL3) as well as a number of lineages predominantly circulating in Guinea and posterior probability support is given for these. For certain key node ages, 95% credible intervals are shown by horizontal bars.
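The posterior support values quoted in these captions can be read as the fraction of sampled trees in which a given clade appears. The Python sketch below illustrates that idea, assuming rooted trees written as nested tuples. The function names and the four toy trees are hypothetical and stand in for the 10,000 posterior samples; this is not a description of how MrBayes or any other package computes its summaries.

```python
from collections import Counter


def collect_clades(tree, found):
    """Add the tip set of every internal node of `tree` to `found`; return the tip set."""
    if isinstance(tree, str):
        return frozenset([tree])
    members = frozenset().union(*(collect_clades(child, found) for child in tree))
    found.add(members)
    return members


def clade_support(sampled_trees):
    """Map each clade (a frozenset of tips) to the fraction of trees containing it."""
    counts = Counter()
    for tree in sampled_trees:
        found = set()
        collect_clades(tree, found)
        counts.update(found)          # each clade counted at most once per tree
    total = len(sampled_trees)
    return {clade: count / total for clade, count in counts.items()}


# Four invented "posterior samples" over tips A, B and C.
toy_samples = [
    (("A", "B"), "C"),
    (("A", "B"), "C"),
    (("A", "C"), "B"),
    (("A", "B"), "C"),
]

support = clade_support(toy_samples)
ab_support = support[frozenset({"A", "B"})]   # 0.75 for these toy samples
```

A maximum clade credibility tree is then the sampled tree whose clades have the highest combined support under some scoring rule; choosing that rule is outside the scope of this sketch.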
6 August 2015 | Vol 524 | Nature | 99
LETTER RESEARCH
©2015 Macmillan Publishers Limited. All rights reserved
LETTER OPEN doi:10.1038/nature14594
Temporal and spatial analysis of the 2014–2015 Ebola virus outbreak in West Africa
Miles W. Carroll1,2,3, David A. Matthews4*, Julian A. Hiscox5*, Michael J. Elmore1*, Georgios Pollakis5*, Andrew Rambaut6,7,8*, Roger Hewson1,2,9, Isabel García-Dorival5, Joseph Akoi Bore2,10,11, Raymond Koundouno2,10,11, Saïd Abdellati2,12, Babak Afrough1,2, John Aiyepada2,13, Patience Akhilomen2,13, Danny Asogun2,13, Barry Atkinson1,2, Marlis Badusche2,14,15, Amadou Bah2,16, Simon Bate1,2, Jan Baumann2,14, Dirk Becker2,15,17, Beate Becker-Ziaja2,14,15, Anne Bocquin2,18,19, Benny Borremans2,20, Andrew Bosworth1,2,5, Jan Peter Boettcher2,21, Angela Cannas2,22, Fabrizio Carletti2,22, Concetta Castilletti2,22, Simon Clark1,2, Francesca Colavita2,22, Sandra Diederich2,15,23, Adomeh Donatus2,13, Sophie Duraffour2,14,24, Deborah Ehichioya2,14,25, Heinz Ellerbrok2,21, Maria Dolores Fernandez-Garcia2,26, Alexandra Fizet2,18,27, Erna Fleischmann2,15,28, Sophie Gryseels2,20, Antje Hermelink2,21, Julia Hinzmann2,21, Ute Hopf-Guevara2,21, Yemisi Ighodalo2,13, Lisa Jameson1,2, Anne Kelterbaum2,15,17, Zoltan Kis2,29, Stefan Kloth2,21, Claudia Kohl2,21, Misa Korva2,30, Annette Kraus2,31, Eeva Kuisma1,2, Andreas Kurth2,21, Britta Liedigk2,14,15, Christopher H. Logue1,2, Anja Ludtke2,15,32, Piet Maes2,24, James McCowen1,2, Stephane Mely2,18,19, Marc Mertens2,15,23, Silvia Meschi2,22, Benjamin Meyer2,15,33, Janine Michel2,21, Peter Molkenthin2,15,28, Cesar Munoz-Fontela2,15,32, Doreen Muth2,15,33, Edmund N. C. Newman1,2, Didier Ngabo1,2, Lisa Oestereich2,14,15, Jennifer Okosun2,13, Thomas Olokor2,13, Racheal Omiunu2,13, Emmanuel Omomoh2,13, Elisa Pallasch2,14,15, Bernadett Palyi2,29, Jasmine Portmann2,34, Thomas Pottage1,2, Catherine Pratt1,2, Simone Priesnitz2,35, Serena Quartu2,22, Julie Rappe2,36, Johanna Repits2,37, Martin Richter2,21, Martin Rudolf2,14,15, Andreas Sachse2,21, Kristina Maria Schmidt2,21, Gordian Schudt2,15,17, Thomas Strecker2,15,17, Ruth Thom1,2, Stephen Thomas1,2, Ekaete Tobin2,13, Howard Tolley1,2, Jochen Trautner2,38, Tine Vermoesen2,12, Ines Vitoriano1,2, Matthias Wagner2,15,28, Svenja Wolff2,15,17, Constanze Yue2,21, Maria Rosaria Capobianchi2,22, Birte Kretschmer39, Yper Hall1, John G. Kenny40, Natasha Y. Rickett5, Gytis Dudas6, Cordelia E. M. Coltart41, Romy Kerber2,14,15, Damien Steer42, Callum Wright43, Francis Senyah1, Sakoba Keita44, Patrick Drury45, Boubacar Diallo46, Hilde de Clerck47, Michel Van Herp47, Armand Sprecher47, Alexis Traore48, Mandiou Diakite49, Mandy Kader Konde50, Lamine Koivogui11, N’Faly Magassouba10, Tatjana Avsic-Zupanc2,30, Andreas Nitsche2,21, Marc Strasser2,34, Giuseppe Ippolito2,22, Stephan Becker2,15,17, Kilian Stoecker2,15,28, Martin Gabriel2,14,15, Herve Raoul2,19, Antonino Di Caro2,22, Roman Wolfel2,15,28, Pierre Formenty45 & Stephan Gunther2,14,15*
West Africa is currently witnessing the most extensive Ebola virus (EBOV) outbreak so far recorded (refs 1–3). Until now, there have been 27,013 reported cases and 11,134 deaths. The origin of the virus is thought to have been a zoonotic transmission from a bat to a two-year-old boy in December 2013 (ref. 2). From this index case the virus was spread by human-to-human contact throughout Guinea, Sierra Leone and Liberia. However, the origin of the particular virus in each country and time of transmission is not known and currently relies on epidemiological analysis, which may be unreliable owing to the difficulties of obtaining patient information. Here we trace the genetic evolution of EBOV in the current outbreak that has resulted in multiple lineages. Deep sequencing of 179 patient samples processed by the European Mobile Laboratory, the first diagnostics unit to be deployed to the epicentre of the outbreak in Guinea, reveals an epidemiological and evolutionary history of the epidemic from March 2014 to January 2015. Analysis of EBOV genome evolution has also benefited from a similar sequencing effort of patient samples from Sierra Leone. Our results confirm that the EBOV from Guinea moved into Sierra Leone, most likely in April or early May. The viruses of the Guinea/Sierra Leone lineage mixed around June/July 2014. Viral sequences covering August, September and October 2014 indicate that this lineage evolved independently within Guinea. These data can be used in conjunction with epidemiological information to test retrospectively the effectiveness of control measures, and provide an unprecedented window into the evolution of an ongoing viral haemorrhagic fever outbreak.
We used a deep sequencing approach to gain insight into the evolution of Ebola virus (EBOV) in Guinea from the ongoing West African outbreak. This was an approach based on analysis pipelines developed
*These authors contributed equally to this work.
1 Public Health England, Porton Down, Wiltshire SP4 0JG, UK. 2 The European Mobile Laboratory Consortium, Bernhard-Nocht-Institute for Tropical Medicine, D-20359 Hamburg, Germany. 3 University of Southampton, South General Hospital, Southampton SO16 6YD, UK. 4 Department of Cellular and Molecular Medicine, School of Medical Sciences, University of Bristol, Bristol BS8 1TD, UK. 5 Institute of Infection and Global Health, University of Liverpool, Liverpool L69 2BE, UK. 6 Institute of Evolutionary Biology, University of Edinburgh, Edinburgh EH9 2FL, UK. 7 Fogarty International Center, National Institutes of Health, Bethesda, Maryland 20892, USA. 8 Centre for Immunology, Infection and Evolution, University of Edinburgh, Edinburgh EH9 2FL, UK. 9 London School of Hygiene and Tropical Medicine, Keppel Street, London WC1E 7HT, UK. 10 Universite Gamal Abdel Nasser de Conakry, Laboratoire des Fievres Hemorragiques en Guinee, Conakry, Guinea. 11 Institut National de Sante Publique, Conakry, Guinea. 12 Institute of Tropical Medicine, B-2000 Antwerp, Belgium. 13 Institute of Lassa Fever Research and Control, Irrua Specialist Teaching Hospital, Irrua, Edo State, Nigeria. 14 Bernhard Nocht Institute for Tropical Medicine, D-20359 Hamburg, Germany. 15 German Centre for Infection Research (DZIF), 38124 Braunschweig, Germany. 16 Swiss Tropical and Public Health Institute, University of Basel, CH-4002 Basel, Switzerland. 17 Institute of Virology, Philipps University Marburg, 35043 Marburg, Germany. 18 National Reference Center for Viral Hemorrhagic Fevers, 69365 Lyon, France. 19 Laboratoire P4 Inserm-Jean Merieux, US003 Inserm, 69365 Lyon, France. 20 Department of Biology, University of Antwerp, B-2020 Antwerp, Belgium. 21 Robert Koch Institute, 13353 Berlin, Germany. 22 National Institute for Infectious Diseases (INMI) Lazzaro Spallanzani, 00149 Rome, Italy. 23 Friedrich Loeffler Institute, Federal Research Institute for Animal Health, 17493 Greifswald, Insel Riems, Germany. 24 KU Leuven Rega institute, B-3000 Leuven, Belgium. 25 Redeemer’s University, Osun State, Nigeria. 26 Centro Nacional de Microbiologia, Instituto de Salud Carlos III, 28029 Madrid, Spain. 27 Unite de Biologie des Infections Virales Emergentes, Institut Pasteur, 69365 Lyon, France. 28 Bundeswehr Institute of Microbiology, 80937 Munich, Germany. 29 National Center for Epidemiology, National Biosafety Laboratory, H-1097 Budapest, Hungary. 30 Institute of Microbiology and Immunology, Faculty of Medicine, University of Ljubljana, SI-1000 Ljubljana, Slovenia. 31 Public Health Agency of Sweden, 171 82 Solna, Sweden. 32 Heinrich Pette Institute – Leibniz Institute for Experimental Virology, 20251 Hamburg, Germany. 33 Institute of Virology, University of Bonn, 53127 Bonn, Germany. 34 Federal Office for Civil Protection, Spiez Laboratory, CH-3700 Spiez, Switzerland. 35 Bundeswehr Hospital, 22049 Hamburg, Germany. 36 Institute of Virology and Immunology, CH-3147 Mittelhausern, Switzerland. 37 Janssen-Cilag, SE-19207 Sollentuna, Sweden. 38 Thunen Institute, D-22767 Hamburg, Germany. 39 Eurice - European Research and Project Office GmbH, 10115 Berlin, Germany. 40 Centre for Genomic Research, Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK. 41 Department of Infection and Population Health, University College London, London WC1E 6JB, UK. 42 Research IT, University of Bristol, Bristol BS8 1HH, UK. 43 Advanced Computing Research Centre, University of Bristol, Bristol BS8 1HH, UK. 44 Ministry of Health Guinea, Conakry, Guinea. 45 World Health Organization, 1211 Geneva 27, Switzerland. 46 World Health Organization, Conakry, Guinea. 47 Medecins Sans Frontieres, B-1050 Brussels, Belgium. 48 Section Prevention et Lutte contre la Maladie a la Direction Prefectorale de la Sante de Gueckedou, Gueckedou, Guinea. 49 Universite Gamal Abdel Nasser de Conakry, CHU Donka, Conakry, Guinea. 50 Health and Sustainable Development Foundation, Conakry, Guinea.
ARTICLE doi:10.1038/nature14447
Complex archaea that bridge the gap between prokaryotes and eukaryotes
Anja Spang1*, Jimmy H. Saw1*, Steffen L. Jørgensen2*, Katarzyna Zaremba-Niedzwiedzka1*, Joran Martijn1, Anders E. Lind1, Roel van Eijk1†, Christa Schleper2,3, Lionel Guy1,4 & Thijs J. G. Ettema1
The origin of the eukaryotic cell remains one of the most contentious puzzles in modern biology. Recent studies have provided support for the emergence of the eukaryotic host cell from within the archaeal domain of life, but the identity and nature of the putative archaeal ancestor remain a subject of debate. Here we describe the discovery of ‘Lokiarchaeota’, a novel candidate archaeal phylum, which forms a monophyletic group with eukaryotes in phylogenomic analyses, and whose genomes encode an expanded repertoire of eukaryotic signature proteins that are suggestive of sophisticated membrane remodelling capabilities. Our results provide strong support for hypotheses in which the eukaryotic host evolved from a bona fide archaeon, and demonstrate that many components that underpin eukaryote-specific features were already present in that ancestor. This provided the host with a rich genomic ‘starter-kit’ to support the increase in the cellular and genomic complexity that is characteristic of eukaryotes.
Cellular life is currently classified into three domains: Bacteria, Archaea and Eukarya. Whereas the cytological properties of Bacteria and Archaea are relatively simple, eukaryotes are characterized by a high degree of cellular complexity, which is hard to reconcile given that most hypotheses assume a prokaryote-to-eukaryote transition (refs 1, 2). In this context, it seems particularly difficult to account for the suggested presence of the endomembrane system, the nuclear pores, the spliceosome, the ubiquitin protein degradation system, the RNAi machinery, the cytoskeletal motors and the phagocytotic machinery in the last eukaryotic common ancestor (ref. 3 and references therein). Ever since the recognition of the archaeal domain of life by Carl Woese and co-workers (refs 4, 5), Archaea have featured prominently in hypotheses for the origin of eukaryotes, as eukaryotes and Archaea represented sister lineages in Woese’s ‘universal tree’ (ref. 5). The evolutionary link between Archaea and eukaryotes was further reinforced through studies of the transcription machinery (ref. 6) and the first archaeal genomes (ref. 7), revealing that many genes, including the core of the genetic information-processing machineries of Archaea, were more similar to those of eukaryotes (ref. 8) rather than to Bacteria. During the early stages of the genomic era, it also became apparent that eukaryotic genomes were chimaeric by nature (refs 8, 9), comprising genes of both archaeal and bacterial origin, in addition to genes specific to eukaryotes. Yet, whereas many of the bacterial genes could be traced back to the alphaproteobacterial progenitor of mitochondria, the nature of the lineage from which the eukaryotic host evolved remained obscure (refs 1, 10–13). This lineage might either descend from a common ancestor shared with Archaea (following Woese’s classical three-domains-of-life tree; ref. 5), or have emerged from within the archaeal domain (so-called archaeal host or eocyte-like scenarios; refs 1, 14–17). Recent phylogenetic analyses of universal protein datasets have provided increasing support for models in which eukaryotes emerge as sister to or from within the archaeal ‘TACK’ superphylum (refs 18–22), a clade originally comprising the archaeal phyla Thaumarchaeota, Aigarchaeota, Crenarchaeota and Korarchaeota (ref. 23). In support of this relationship, comparative genomics analyses have revealed several eukaryotic signature proteins (ESPs; ref. 24) in TACK lineages, including distant archaeal homologues of actin (ref. 25) and tubulin (ref. 26), archaeal cell division proteins related to the eukaryotic endosomal sorting complexes required for transport (ESCRT)-III complex (ref. 27), and several information-processing proteins involved in transcription and translation (refs 2, 17, 23). These findings suggest an archaeal ancestor of eukaryotes that might have been more complex than the archaeal lineages identified thus far (refs 2, 23, 28). Yet, the absence of missing links in the prokaryote-to-eukaryote transition currently precludes detailed predictions about the nature and timing of events that have driven the process of eukaryogenesis (refs 1, 2, 17, 28). Here we describe the discovery of a new archaeal lineage related to the TACK superphylum that represents the nearest relative of eukaryotes in phylogenomic analyses, and intriguingly, its genome encodes many eukaryote-specific features, providing a unique insight in the emergence of cellular complexity in eukaryotes.
Genomic exploration of new TACK archaea
While surveying microbial diversity in deep marine sediments influenced by hydrothermal activity from the Arctic Mid-Ocean Ridge, 16S rRNA gene sequences belonging to uncultivated archaeal candidate lineages were identified in a gravity core (GC14) sampled approximately 15 km north-northwest of the active venting site Loki’s Castle (ref. 29) at 3283 m below sea level (73.763167° N, 8.464000° E) (Fig. 1a) (refs 30, 31). Subsequent phylogenetic analyses of these sequences, which comprised ,10% of the obtained 16S reads, revealed that they belonged to the gamma clade of the Deep-Sea Archaeal Group/Marine Benthic Group B (hereafter referred to as DSAG) (refs 31–33) (Fig. 1b–d and Supplementary Figs 1 and 2), a clade proposed to be deeply-branching in the TACK superphylum (ref. 23). DSAG constitutes one of the most abundant and widely distributed archaeal groups in the deep marine biosphere, but so far none of its representatives have been cultured or sequenced (ref. 31).
To obtain genomic information for this archaeal lineage, we applied deep metagenomic sequencing to the GC14 sediment sample, resulting in a smaller (LCGC14, 8.6 Gbp) and a larger, multiple-strand displacement amplified (MDA) metagenome data set (LCGC14AMP, 56.6 Gbp; Fig. 2a; Supplementary Fig. 3 and Supplementary Table 1). Given the
*These authors contributed equally to this work.
1 Department of Cell and Molecular Biology, Science for Life Laboratory, Uppsala University, SE-75123 Uppsala, Sweden. 2 Department of Biology, Centre for Geobiology, University of Bergen, N-5020 Bergen, Norway. 3 Division of Archaea Biology and Ecogenomics, Department of Ecogenomics and Systems Biology, University of Vienna, A-1090 Vienna, Austria. 4 Department of Medical Biochemistry and Microbiology, Uppsala University, SE-75123 Uppsala, Sweden. †Present address: Groningen Institute for Evolutionary Life Sciences, University of Groningen, NL-9747AG Groningen, The Netherlands.
14 May 2015 | Vol 521 | Nature | 173
key regulators of actin cytoskeleton dynamics, these small GTPases represent essential components for the process of phagocytosis in eukaryotes. Intriguingly, the analysis of Lokiarchaeal ESPs revealed a multitude of Ras-superfamily GTPases, comprising nearly 2% of the Lokiarchaeal proteome (Fig. 3b). The relative amount of small GTPases in the Lokiarchaeum genome is comparable to that observed in several unicellular eukaryotes, only being surpassed by the protist Naegleria gruberi. In contrast, bacterial and archaeal genomes encode only few, if any, small GTPase homologues of the Ras superfamily (Fig. 3b).
Phylogenetic analyses of the Lokiarchaeal small GTPases revealed that these represent several distinct clusters, each of which comprises several GTPase sequences (Fig. 3c and Supplementary Fig. 18). Although phylogenetic analyses failed to resolve most of the deeper nodes, several of the eukaryotic small GTPase families appear to share a common ancestry with Lokiarchaeal GTPases (Fig. 3c), suggesting an archaeal origin of specific subgroups of the eukaryotic small GTPases, followed by independent expansions in eukaryotes and Lokiarchaeota. This scenario contrasts with previous studies that have suggested that eukaryotic small GTPases were acquired from the alphaproteobacterial progenitor of mitochondria (ref. 37).
Although genes encoding canonical eukaryotic GTPase-activating proteins (GAPs) were absent in Lokiarchaeota, twelve roadblock/LC7-domain-containing proteins were identified (Supplementary Tables 6 and 10). While such proteins have been implicated in dynein organization in eukaryotes, roadblock/LC7 protein MglB of the bacterium Myxococcus xanthus was shown to act as a GAP of the small GTPase MglA (ref. 43). Hence, the Lokiarchaeal roadblock/LC7 proteins represent possible candidates for alternative GAPs in this archaeon.
Presence of a primordial ESCRT complex
In eukaryotes, the ESCRT machinery represents an essential component of the multivesicular endosome pathway for lysosomal degradation of damaged or superfluous proteins, and it plays a role in several budding processes including cytokinesis, autophagy and viral budding (ref. 44). The ESCRT machinery generally consists of the ESCRT-I–III subcomplexes, as well as associated subunits (ref. 45). The analysis of the Lokiarchaeum genome revealed the presence of an ESCRT gene cluster (Fig. 4a), as well as of several additional proteins homologous to components of the eukaryotic multivesicular endosome pathway. For instance, Lokiarchaeum encodes divergent SNF7 domain proteins of the eukaryotic ESCRT-III complex, which appear to represent members of the Vps2/Vps24/Vps46 and Vps20/Vps32/Vps60 families, respectively. A phylogenetic analysis of the Lokiarchaeal SNF7 domain proteins revealed that these branch at the base of these two eukaryotic ESCRT-III families with low bootstrap support (Fig. 4b and Supplementary Fig. 19), not only indicating that they might represent ancestral SNF7 copies, but also suggesting that the last eukaryotic common ancestor already inherited two divergent SNF7-domain-encoding genes from its putative archaeal ancestor rather than a single gene (ref. 46). Furthermore, the gene cluster encodes an ATPase that displays closest resemblance to eukaryotic VPS4-type ATPases, including katanin, membrane scaffold protein (MSP) and spastin (Fig. 4c and Supplementary Fig. 20) as well as hypothetical proteins that show significant similarity
[Figure 3 panels. a, Phylogeny of actin and related sequences (crenactin, Arp1, Arp2, Arp3 and collapsed LCGC14AMP/Lokiarchaeum clusters); scale bar 0.4. b, Bar chart of small GTPases as per cent of the predicted proteome (axis 0 to 2) for Naegleria gruberi, Lokiarchaeum, Dictyostelium discoideum, Homo sapiens, Tetrahymena thermophila SB210, Giardia intestinalis ATCC 50581, Saccharomyces cerevisiae S288c, Reticulomyxa filosa, Trypanosoma brucei brucei 927/4, Arabidopsis thaliana, Thalassiosira pseudonana, Methanopyrus kandleri AV19, Pyrobaculum aerophilum IM2, Aciduliprofundum boonei T469, Korarchaeum cryptofilum OPF8, Caldiarchaeum subterraneum, Myxococcus xanthus DK1622 and Nitrosopumilus maritimus SCM1. c, Phylogeny of small GTPases (Ras, Rho, Ran, Rab, Arf, Sar1 and SRbeta families with collapsed Lokiarchaeum, bacterial and archaeal clusters); scale bar 0.4.]
Figure 3 | Identification and phylogeny of small GTPases and actin orthologues. a, Maximum-likelihood phylogeny of 378 aligned amino acid residues of actin homologues identified in Lokiarchaeum and in the LCGC14AMP metagenome, including eukaryotic actins, ARP1–3 homologues and crenactins (ref. 25). Consecutive numbers in brackets refer to the number of sequences in a respective clade from LCGC14AMP and Lokiarchaeum, respectively. b, Relative amount of small GTPases (assigned to IPR006689 and IPR001806) in the Lokiarchaeum genome in comparison with other eukaryotic, archaeal and bacterial species. Numbers refer to total amount of small GTPases per predicted proteome. c, Maximum-likelihood phylogeny of 150 aligned amino acid residues of small Ras- and Arf-type GTPases (IPR006689 and IPR001806) in all domains of life. Numbers in brackets refer to the number of sequences in the respective clades. a, c, Sequence clusters comprising Lokiarchaeum and/or LCGC14AMP sequences (red), eukaryotes (blue) and Bacteria/Archaea (grey) have been collapsed. Bootstrap values above 50 are shown. Scale indicates the number of substitutions per site.
LETTER doi:10.1038/nature14249
Ancient proteins resolve the evolutionary history of Darwin’s South American ungulates
Frido Welker1,2, Matthew J. Collins1, Jessica A. Thomas1, Marc Wadsley1, Selina Brace3, Enrico Cappellini4, Samuel T. Turvey5, Marcelo Reguero6, Javier N. Gelfo6, Alejandro Kramarz7, Joachim Burger8, Jane Thomas-Oates9, David A. Ashford10, Peter D. Ashton10, Keri Rowsell1, Duncan M. Porter11, Benedikt Kessler12, Roman Fischer12, Carsten Baessmann13, Stephanie Kaspar13, Jesper V. Olsen14, Patrick Kiley15, James A. Elliott15, Christian D. Kelstrup14, Victoria Mullin16, Michael Hofreiter1,17, Eske Willerslev4, Jean-Jacques Hublin2, Ludovic Orlando4, Ian Barnes3 & Ross D. E. MacPhee18
No large group of recently extinct placental mammals remains as evolutionarily cryptic as the approximately 280 genera grouped as ‘South American native ungulates’. To Charles Darwin (refs 1, 2), who first collected their remains, they included perhaps the ‘strangest animal[s] ever discovered’. Today, much like 180 years ago, it is no clearer whether they had one origin or several, arose before or after the Cretaceous/Palaeogene transition 66.2 million years ago (ref. 3), or are more likely to belong with the elephants and sirenians of superorder Afrotheria than with the euungulates (cattle, horses, and allies) of superorder Laurasiatheria (refs 4–6). Morphology-based analyses have proved unconvincing because convergences are pervasive among unrelated ungulate-like placentals. Approaches using ancient DNA have also been unsuccessful, probably because of rapid DNA degradation in semitropical and temperate deposits. Here we apply proteomic analysis to screen bone samples of the Late Quaternary South American native ungulate taxa Toxodon (Notoungulata) and Macrauchenia (Litopterna) for phylogenetically informative protein sequences. For each ungulate, we obtain approximately 90% direct sequence coverage of type I collagen α1- and α2-chains, representing approximately 900 of 1,140 amino-acid residues for each subunit. A phylogeny is estimated from an alignment of these fossil sequences with collagen (I) gene transcripts from available mammalian genomes or mass spectrometrically derived sequence data obtained for this study. The resulting consensus tree agrees well with recent higher-level mammalian phylogenies (refs 7–9). Toxodon and Macrauchenia form a monophyletic group whose sister taxon is not Afrotheria or any of its constituent clades as recently claimed (refs 5, 6), but instead crown Perissodactyla (horses, tapirs, and rhinoceroses). These results are consistent with the origin of at least some South American native ungulates (refs 4, 6) from ‘condylarths’, a paraphyletic assembly of archaic placentals. With ongoing improvements in instrumentation and analytical procedures, proteomics may produce a revolution in systematics such as that achieved by genomics, but with the possibility of reaching much further back in time.
1 BioArCh, University of York, York YO10 5DD, UK. 2 Department of Human Evolution, Max Planck Institute for Evolutionary Anthropology, 04103 Leipzig, Germany. 3 Department of Earth Sciences, Natural History Museum, London SW7 5BD, UK. 4 Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Øster Voldgade 5–7, 1350 Copenhagen K, Denmark. 5 Institute of Zoology, Zoological Society of London, London NW1 4RY, UK. 6 CONICET - Division Paleontología de Vertebrados, Museo de La Plata. Facultad de Ciencias Naturales y Museo de La Plata, Universidad Nacional de La Plata. Paseo del Bosque s/n, B1900FWA, La Plata, Argentina. 7 Seccion Paleontología de Vertebrados. Museo Argentino de Ciencias Naturales “Bernardino Rivadavia”, 470 Angel Gallardo Av., C1405DJR, Buenos Aires, Argentina. 8 Institute of Anthropology, Johannes Gutenberg-University, Anselm-Franz-von-Bentzel-Weg 7, D-55128 Mainz, Germany. 9 Department of Chemistry, University of York, York YO10 5DD, UK. 10 Bioscience Technology Facility, Department of Biology, University of York, York YO10 5DD, UK. 11 Department of Biological Sciences, Virginia Polytechnic Institute and State University, Blacksburg, Virginia 24061, USA. 12 Target Discovery Institute, Nuffield Department of Medicine, University of Oxford, Roosevelt Drive, Oxford OX3 7FZ, UK. 13 Applications Development, Bruker Daltonik GmbH, 28359 Bremen, Germany. 14 Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Blegdamsvej 3b, 2200 Copenhagen, Denmark. 15 Department of Materials Science and Metallurgy, University of Cambridge, Cambridge CB3 0FS, UK. 16 Smurfit Institute of Genetics, Trinity College Dublin, Dublin 2, Ireland. 17 Institute for Biochemistry and Biology, Karl-Liebknecht-Strasse 24–25, 14476 Potsdam OT Golm, Germany. 18 Department of Mammalogy, American Museum of Natural History, New York, New York 10024, USA.
[Figure 1 image omitted. Panel a: survival of 80-bp DNA fragments at 10 ka (%), legend binned from 0–0.1 to 0.9–1. Panel b: map with labels "Darwin’s samples", "Samples used in this study", "MS/MS" and "HMS Beagle". Panel c: deamidated glutamines (%) against number of observed glutamines for M. patachonica, T. platensis, Equus sp. and Mylodon sp.]
Figure 1 | Samples used in this investigation. a, Predicted survival of an 80-base-pair (bp) DNA fragment after 10,000 years (10 ka) modelled using the rate given in ref. 29. b, Location of finds by Darwin1,2 and of samples used in this study (basemap30). c, Glutamine deamidation ratios for bone samples from the sequenced Pleistocene SANUs are high compared with coeval horse (MACN Pv 5719) as well as modern hippopotamus and tapir, providing support for the authenticity of the ancient sequences (see Supplementary Information).
©2015 Macmillan Publishers Limited. All rights reserved
4 JUNE 2015 | VOL 522 | NATURE | 81
South American native ungulates (SANUs) are conventionally organized into five orders (Litopterna, Notoungulata, Astrapotheria, Xenungulata, and Pyrotheria) that are sometimes grouped together as a separate placental superorder (Meridiungulata)10. They appear very early in the Palaeogene record and evolved thereafter along many divergent lines, as their abundant fossil record attests. Most lineages had become extinct by the end of the Miocene epoch, although a few species of litopterns and notoungulates persisted into the Late Pleistocene epoch. Despite continuing interest in their evolutionary history (for example refs 5, 11–14), phylogenetic relationships of the major SANU clades to one another and to other placentals remain poorly understood (see Supplementary Information). Although some recent investigations (for example refs 4–6) have suggested that basal South American members of Litopterna group with certain Holarctic condylarths, and are thus best placed in Euungulata (Laurasiatheria), several other studies claim to have identified potential synapomorphies linking various SANU taxa with Afrotheria5,6,15,16. This latter view is broadly consistent with such indicators as prolonged late Mesozoic faunal exchange
[Figure 2 image omitted. Tree of collagen sequences with node support values and a 0.02 scale bar. Labelled clades: Aves, Monotremata, Marsupialia, Xenarthra, Afrotheria, Primates, Scandentia, Rodentia, Lagomorpha, Lipotyphla, Chiroptera, Carnivora, Artiodactyla, Perissodactyla. Inset, labelled Panperissodactyla: Toxodon sp., Macrauchenia sp., Equus sp. (Tapalqué), Equus caballus, Equus asinus, Tapirus terrestris, Ceratotherium simum.]
Figure 2 | Relationship of Toxodon (Notoungulata) and Macrauchenia (Litopterna) to other placental mammals. Fifty per cent majority rule Bayesian consensus tree of COL1 protein sequence data, with chicken (Gallus) as outgroup. Scale bar indicates branch length, expressed as the expected number of substitutions per site. Major clades (orders and superorders) are colour coded; species names in bold indicate collagen sequences derived from MS/MS rather than genomic data, fossil taxa depicted in silhouette. Inset: in all tree-reconstructions conducted (see Supplementary Information), Toxodon and Macrauchenia (dark grey) group monophyletically at the base of crown Perissodactyla (light green) with 100% posterior probability, forming Panperissodactyla.
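The 50% majority-rule consensus named in the caption keeps each clade that occurs in more than half of the sampled trees, and in a Bayesian analysis that frequency is read as the clade's posterior probability. A minimal Python sketch of the idea, assuming every sampled tree has already been reduced to a set of clades (each clade a frozenset of taxon names); the function name `majority_rule_clades` and the four toy trees are hypothetical and are not taken from the study:

```python
from collections import Counter

def majority_rule_clades(trees, threshold=0.5):
    """Return {clade: frequency} for clades seen in more than `threshold` of the trees.

    Each tree is an iterable of clades; each clade is a frozenset of taxon names.
    """
    counts = Counter()
    for tree in trees:
        counts.update(set(tree))  # count each clade at most once per tree
    n = len(trees)
    return {clade: c / n for clade, c in counts.items() if c / n > threshold}

# Toy sample of four trees (hypothetical, for illustration only).
sanu = frozenset({"Toxodon", "Macrauchenia"})
perisso = frozenset({"Equus", "Tapirus", "Ceratotherium"})
alternative = frozenset({"Toxodon", "Equus"})
sample = [
    [sanu, perisso],
    [sanu, perisso],
    [sanu],
    [alternative, perisso],
]
consensus = majority_rule_clades(sample)
```

Real consensus methods also have to assemble the retained clades back into a tree; this sketch stops at the clade frequencies.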
Horizontal Transfer of Entire Genomes via Mitochondrial Fusion in the Angiosperm Amborella
Danny W. Rice,1 Andrew J. Alverson,1* Aaron O. Richardson,1 Gregory J. Young,1† M. Virginia Sanchez-Puerta,1‡ Jérôme Munzinger,2§ Kerrie Barry,3 Jeffrey L. Boore,3|| Yan Zhang,4 Claude W. dePamphilis,4 Eric B. Knox,1 Jeffrey D. Palmer1¶
We report the complete mitochondrial genome sequence of the flowering plant Amborella trichopoda. This enormous, 3.9-megabase genome contains six genome equivalents of foreign mitochondrial DNA, acquired from green algae, mosses, and other angiosperms. Many of these horizontal transfers were large, including acquisition of entire mitochondrial genomes from three green algae and one moss. We propose a fusion-compatibility model to explain these findings, with Amborella capturing whole mitochondria from diverse eukaryotes, followed by mitochondrial fusion (limited mechanistically to green plant mitochondria) and then genome recombination. Amborella’s epiphyte load, propensity to produce suckers from wounds, and low rate of mitochondrial DNA loss probably all contribute to the high level of foreign DNA in its mitochondrial genome.
Many of the fundamental properties of eukaryotes arose from horizontal evolution on a grand scale, that is, the endosymbiotic origin of the mitochondrion and plastid from bacterial progenitors (1). Since their birth, however, mitochondrial and plastid genomes seem to have been little affected by horizontal gene transfer (HGT). The most notable exception involves land plants, especially flowering plants (angiosperms), in which HGT is common in the mitochondrial genome but unknown in plastids (2–10).
To gain insight into the causes and consequences of HGT in mitochondrial DNA (mtDNA), we sequenced the mitochondrial genome of Amborella trichopoda because polymerase chain reaction–based sampling had shown it to be rich in foreign genes (4). This large shrub is endemic to rain forests of New Caledonia and is probably sister to all other angiosperms, a divergence dating back about 200 million years (11, 12).
Overall Genome Properties
The Amborella mitochondrial genome assembled as five autonomous, circular-mapping chromosomes of lengths 3179, 244, 187, 137, and 119 kb, giving a total genome size of 3,866,039 base pairs (bp) (Fig. 1A and figs. S1 to S4) (13). The five chromosomes are distinct in sequence but similar in base composition (45 to 47% G + C), stoichiometry, and HGT properties (Fig. 1A and figs. S2 and S4). Stoichiometry was assessed by sequencing coverage and Southern blot analysis of 32 individuals from three populations (fig. S5) (13).
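As a quick consistency check on the figures above, the five rounded chromosome lengths can be summed and compared with the reported total genome size. A small Python sketch that uses only numbers stated in the text; the variable names are illustrative:

```python
# Chromosome lengths in kb, as rounded in the text.
chromosome_kb = [3179, 244, 187, 137, 119]
reported_total_bp = 3_866_039

total_kb = sum(chromosome_kb)

# Gap between the summed rounded lengths and the reported total, in kb.
rounding_gap_kb = abs(total_kb - reported_total_bp / 1000)

# Share of the genome carried by the largest chromosome.
largest_share = chromosome_kb[0] / total_kb

print(f"sum = {total_kb} kb, gap = {rounding_gap_kb:.3f} kb, "
      f"largest chromosome = {largest_share:.1%}")
```

The sum matches the reported size to within rounding, and the largest chromosome accounts for roughly four fifths of the genome.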
As described in the next three sections, Amborella mtDNA possesses an extensive and diverse collection of foreign sequences, corresponding to about six genome equivalents of mtDNA acquired from mosses, angiosperms, and green algae. Multigene HGT has been described in two other lineages of plant mtDNA (8, 10), but not on a scale approaching Amborella. The Amborella mitochondrial genome also contains a large amount (138 kb) of plastid DNA (ptDNA) (Fig. 1A, fig. S2, and table S1).
Multichromosomal mitochondrial genomes in plants were only recently discovered (14, 15) and mostly involve large (>1 Mb) genomes, with Silene genomes of 6.7 and 11.3 Mb dwarfing Amborella in size and chromosome number (15). These three mitochondrial genomes are the largest completely assembled organelle genomes, larger than many bacterial genomes and even some nuclear genomes. However, the processes responsible for their expansion differ in that Silene genomes possess no readily discernible foreign mtDNA and relatively little ptDNA (15).
HGT from Mosses
Amborella mtDNA contains four regions, of lengths 48, 40, 9, and 4 kb, acquired from moss mtDNA (Fig. 1A and fig. S2). With one exception, the 41 protein and ribosomal RNA (rRNA) genes from these four regions were placed phylogenetically, almost always strongly, as sister to the moss Physcomitrella (Fig. 2, A to D, and figs. S8 and S9). Gene order in the four regions (Fig. 3 and fig. S6) is highly similar to both Physcomitrella and Anomodon (mosses that are themselves identical in gene order and content) (16) and extremely different from angiosperms. The mosslike regions in Amborella also harbor the same 27 introns and largely the same set of intergenic sequences as moss mtDNAs (fig. S6) (13).
The four moss regions contain one, and only one, copy of 61 of the 65 genes present in sequenced moss mtDNAs (Fig. 3 and fig. S6) (13). Taking into account six inferred deletions and duplications larger than 100 bp, the 101.8 kb of moss DNA in Amborella reconstructs to a hypothetical donor genome of 106.0 kb, compared with the 104.2- and 105.3-kb genomes in Physcomitrella and Anomodon, respectively. We infer, therefore, that Amborella captured an entire mitochondrial genome (13) from a moss with nearly identical mtDNA architecture to those of Physcomitrella and Anomodon. This foreign genome subsequently rearranged into four pieces, with a few gene-order changes and 11 gene losses, truncations, and/or partial duplications, all of which are associated with rearrangement breakpoints (Figs. 1A and 3, figs. S2 and S6, and table S2).
HGT from Green Algae
The Amborella mitochondrial genome contains an average of three green algal–derived copies of each protein and rRNA gene commonly found in green algal mtDNAs (Figs. 1A and 2, A to D; figs. S2, S4, S8, S10, and S11; and table S3). Many of these genes are clustered in two large tracts of lengths 83 and 61 kb. The 83-kb tract (B1 + A2 in Fig. 1A) contains two copies of a 10-gene cluster (each marked by 10 red arrows in the top comparison of Fig. 4), with all 10 “duplicates” highly divergent from each other. The 61-kb tract (B2 + A1 in Fig. 1A) lacks these 10 genes and instead contains highly divergent duplicates of two genes that are absent from the 83-kb tract. A single hypothesized recombination event between these two tracts (Figs. 1A and 4) accounts for the above duplications, with the initial, 92- and 52-kb regions each containing a nearly complete set of green algal mitochondrial genes and no extra copies (fig. S11). We conclude that the 83-kb and 61-kb tracts arose by acquisition of whole mitochondrial genomes (designated the A and B genomes) from two green algae, followed by a single recombination between them and a few gene losses (13). Additionally, the two inferred donor genomes are phylogenetically distinct: Whenever Amborella has three or more green algal copies of a given gene, the A-genome copy is separated by a relatively long branch from a well-supported clade containing the other green algal copies (Fig. 2, A and D, and fig. S8). Furthermore, the two regions assigned to the A genome have a lower noncoding G + C composition (39%) than the two B-genome regions (47%) (table S4).
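The noncoding G + C comparison above (39% versus 47%) rests on a simple base-composition count. A minimal Python sketch, assuming sequences are plain strings and that ambiguous bases such as N are ignored; the function name `gc_fraction` and the toy sequences are hypothetical and not taken from the study:

```python
def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        raise ValueError("sequence has no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / unambiguous

# Toy sequences for illustration; real tracts are tens of kilobases long.
at_rich = "ATATGCATTAATCGATATTA"
gc_rich = "GCGCATGCCGATGCGCTAGC"

print(f"AT-rich toy tract: {gc_fraction(at_rich):.0%} G + C")
print(f"GC-rich toy tract: {gc_fraction(gc_rich):.0%} G + C")
```

Comparing such fractions between tracts is one way a difference in donor origin can show up in base composition, as the text reports for the A- and B-genome regions.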
Most of the remaining green algal mtDNA in Amborella, comprising tracts of lengths 49, 18, 16, and 2 kb (Fig. 1A and fig. S2), also appears,
RESEARCH ARTICLES
1Department of Biology, Indiana University, Bloomington, IN 47405, USA. 2Institut de Recherche pour le Développement (IRD), UMR Botanique et Bioinformatique de l’Architecture des Plantes (AMAP), Laboratoire de Botanique et d’Ecologie Végétale Appliquées, Nouméa, New Caledonia. 3Department of Energy Joint Genome Institute, Walnut Creek, CA 94598, USA. 4Department of Biology, Penn State University, University Park, PA 16802, USA.
*Present address: Department of Biological Sciences, University of Arkansas, Fayetteville, AR 72701, USA.
†Present address: DuPont Pioneer, Wilmington, DE 19880, USA.
‡Present address: Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) and Universidad Nacional de Cuyo, Mendoza, Argentina.
§Present address: IRD UMR AMAP, TA A51/PS2, 34398 Montpellier cedex 5, France.
||Present address: Genome Project Solutions, Hercules, CA 94547, USA.
¶Corresponding author. E-mail: [email protected]
20 DECEMBER 2013 VOL 342 SCIENCE www.sciencemag.org 1468
rates than Amborella (fig. S17) (13, 18). Third, levels of sequence identity to other angiosperm mtDNAs were measured on a genome-wide basis to define native as well as angiosperm-HGT regions (13). Finally, native (or angiosperm-HGT) sequences defined by the above four criteria and located within 5 kb of each other were combined into continuous native (or angiosperm-HGT) tracts (13).
These analyses identified 753 kb of DNA as having been acquired from other angiosperms (Fig. 1A and figs. S2 and S4). This DNA contains an average of 2.0 copies of the 32 protein and rRNA genes that are virtually always present in angiosperm mtDNA (table S3) (17) and thus corresponds to roughly two genome equivalents of foreign angiosperm mtDNA. Most (86%) of the 753 kb is intergenic, consistent with the high proportion of intergenic mtDNA in angiosperms (11, 13). About half of the 753 kb shares ≥90% sequence identity with one or more sequenced angiosperm mitochondrial genomes (fig. S4). This far surpasses the level of highly conserved mtDNA in other angiosperms (fig. S18) (13). The 753-kb estimate is probably conservative owing to the limited number of angiosperm mtDNAs available for comparison (13).
Angiosperm Donors
One class of plastid-derived DNA played a key role in donor identification. Phylogenetic analysis shows that most of the 138 kb of ptDNA present in Amborella mtDNA was acquired through intracellular gene transfer (IGT), that is, from the Amborella plastid genome (Fig. 2, E to H, and fig. S19). Analysis of the remaining 10 kb of ptDNA, which probably entered Amborella from foreign mitochondria, identified donors with much greater specificity than did the mitochondrial gene analyses (13). Four of the HGT plastid regions identified Fagales, Oxalidales, or the predominantly parasitic Santalales as the donor, while a fifth pointed to Magnoliidae (Fig. 2, E to H, and fig. S18). A santalalean origin is also supported by four of the five mitochondrial genes for which multiple Santalales have been sampled (fig. S14, nad1b, and fig. S20). The exceptionally high and specific similarity of two featureless regions to Ricinus communis or Bambusa oldhamii (Fig. 1B and fig. S21) identified transfers from these lineages. Finally, the exceptionally high divergence that diagnosed six angiosperm-like genes as foreign also suggests that they came from additional donors, with high mitochondrial substitution rates.
Because some angiosperm-HGT tracts in Amborella mtDNA are of mixed phylogenetic origin (Fig. 1) (13), some of its foreign DNA may be the product of serial, angiosperm-to-angiosperm-to-angiosperm HGT (13). In particular, the rbcL gene of santalalean origin (Fig. 2E) resides only 3 kb from the Bambusa-derived sequence on the same 27-kb foreign tract (Fig. 1B). Because all four genes of meaningful length on this tract evidently came from core eudicots (fig. S14), and
Fig. 2. Maximum likelihood evidence for HGT in Amborella mtDNA. (A to D) Mitochondrial gene trees of land plants and green algae reveal diverse donors in Amborella mtDNA. Colors are as in Fig. 1. See fig. S8 for outgroups. Bootstrap values ≥50% are shown. The number after each Amb (Amborella) sequence corresponds to its left-most coordinate in kb (figs. S2 and S4). Scale bars correspond to 0.1 [(A) to (D)] or 0.01 [(E) to (H)] substitutions per site. Bold branches are reduced in length by 50%. (E to H) Plastid gene trees of angiosperms showing strong support for HGT to the level of taxonomic order: light blue, Santalales [(E) and (F)]; brown, Oxalidales (G); violet, Fagales (H). Amborella labels: Amb plastid, gene in Amborella plastid; Amb IGT, gene in mitochondrion via IGT; red Amb, gene in mitochondrion via HGT. Outgroups are not shown, but see fig. S19 for more taxon-rich analyses, including outgroups. rps7 denotes the rps7-rps12-trnV-rrnS cluster.
[Fig. 2 tree images omitted. Panels: A, atp1; B, atp4; C, atp8; D, cob; E, rbcL; F, psbCD; G, psaA; H, rps7.]
Sequencing Individuals from Additional Turquoise Killifish Strains Reveals Variants in Aging-Related Genes
Within the turquoise killifish species, there exist several strains with reported differences in lifespan in specific laboratory environments (Kirschner et al., 2012; Terzibasi et al., 2008) (Figures 5A, S5A, and 6B), and these differences could be leveraged to understand the genetic architecture of lifespan. To assess the genetic differences among turquoise killifish strains, we sequenced at lower coverage individuals from two additional strains that were captured in Mozambique in 2004 and 2007 (MZM-0403 and MZM-0703, respectively) and from a control GRZ individual (Figure 5A). This analysis uncovered over three million single nucleotide polymorphisms (SNPs) between
[Figure 3 image omitted. Panel A: phylogenetic tree and lifespan (teleost fish, lobefin fish, tetrapods, invertebrates; lifespan in years, log scale). Panel B: genes under positive selection in the turquoise killifish, starting from 28,494 protein-coding genes and 13,637 one-to-one orthologs with other fish genomes, with follow-up analyses (GO enrichment, overlap with known aging genes, expression with age, functional effect prediction). Panel C: selected GO term enrichment for the genes under positive selection, grouped under signaling, metabolism, development, proteasome and immunity. Panel D: functional effect prediction for the sites under positive selection (SIFT score: deleterious or tolerated; PROVEAN score: deleterious or neutral).]
Figure 3. Evolutionary Analysis of the Turquoise Killifish Genome
(A) Phylogenetic tree of 20 animal species, including the turquoise killifish, based on 619 one-to-one orthologs (Table S2C). Number on nodes: level of confidence (% bootstrap support). Scale bar: evolutionary distance (substitution per site). Maximum lifespan data are from our experimental data (turquoise killifish) or from the AnAge database (other fish species), and represented as a heat map.
(B) Proportion and analysis of the genes under positive selection in the turquoise killifish compared to 7 other fish species after multiple hypothesis correction (FDR < 5%). See also Figure S3A.
(C) Selected GO term enrichment for the genes under positive selection in the turquoise killifish. The number of genes associated with each category is indicated in brackets after the term description, and enrichment values are indicated in colored scale. See also Table S3C.
(D) Predicted functional effect on the protein of residues under positive selection in the turquoise killifish, based on SIFT (top row) and PROVEAN (bottom row). Residues are ordered from left to right based on the rank-product of the SIFT and PROVEAN scores. Only sites scored by both methods are displayed. See also Figure S3B and Tables S3D and S4G.
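The rank-product ordering mentioned in panel (D) can be illustrated with a short sketch: each site is ranked separately by its SIFT score and by its PROVEAN score, the two ranks are multiplied, and sites are sorted by the product. The Python below is a minimal version under stated assumptions: lower scores are treated as more damaging for both tools, ties are broken by input order, and the site labels and scores are invented for illustration. The caption does not specify these details, so this is not necessarily how the authors computed it.

```python
def rank_product_order(sites):
    """Order sites by the product of their SIFT rank and PROVEAN rank.

    `sites` maps a site label to (sift_score, provean_score).
    Assumption: for both tools a lower score means a more damaging
    substitution, so rank 1 goes to the lowest score.
    """
    labels = list(sites)
    by_sift = sorted(labels, key=lambda s: sites[s][0])
    by_provean = sorted(labels, key=lambda s: sites[s][1])
    sift_rank = {s: i + 1 for i, s in enumerate(by_sift)}
    provean_rank = {s: i + 1 for i, s in enumerate(by_provean)}
    product = {s: sift_rank[s] * provean_rank[s] for s in labels}
    return sorted(labels, key=lambda s: product[s])

# Hypothetical scores, for illustration only.
example = {
    "siteA": (0.01, -6.2),
    "siteB": (0.40, -0.5),
    "siteC": (0.03, -3.1),
    "siteD": (0.90, 1.2),
}
ordered = rank_product_order(example)
print(ordered)
```

Sites that both predictors place near the top end up first, which is what makes the rank product a convenient way to line up two scores with different scales.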
1544 Cell 163, 1539–1554, December 3, 2015 ©2015 Elsevier Inc.
Resource
The African Turquoise Killifish Genome Provides Insights into Evolution and Genetic Architecture of Lifespan
Highlights
• De novo genome assembly and annotation of the African turquoise killifish
• Key aging genes are under positive selection in the turquoise killifish
• Differences in lifespan between killifish strains are genetically linked to sex
• A resource for comparative genomics and experimental aging studies
Authors
Dario Riccardo Valenzano, Berenice A. Benayoun, Param Priya Singh, ..., Andreas Beyer, Eric A. Johnson, Anne Brunet
[email protected] (D.R.V.), [email protected] (A.B.)
In Brief
The genome of the African turquoise killifish, an exceptionally short-lived fish, is a useful resource to explore the genetic principles and the evolution of unique traits in lifespan and embryonic diapause. Linkage analysis suggests that short lifespan could have co-evolved with sex determination.
Valenzano et al., 2015, Cell 163, 1539–1554
December 3, 2015 ©2015 Elsevier Inc.
http://dx.doi.org/10.1016/j.cell.2015.11.008
LETTER
doi:10.1038/nature15697
A comprehensive phylogeny of birds (Aves) using targeted next-generation DNA sequencing
Richard O. Prum1,2*, Jacob S. Berv3*, Alex Dornburg1,2,4, Daniel J. Field2,5, Jeffrey P. Townsend1,6, Emily Moriarty Lemmon7 & Alan R. Lemmon8
Although reconstruction of the phylogeny of living birds has progressed tremendously in the last decade, the evolutionary history of Neoaves (a clade that encompasses nearly all living bird species) remains the greatest unresolved challenge in dinosaur systematics. Here we investigate avian phylogeny with an unprecedented scale of data: >390,000 bases of genomic sequence data from each of 198 species of living birds, representing all major avian lineages, and two crocodilian outgroups. Sequence data were collected using anchored hybrid enrichment, yielding 259 nuclear loci with an average length of 1,523 bases for a total data set of over 7.8 × 10^7 bases. Bayesian and maximum likelihood analyses yielded highly supported and nearly identical phylogenetic trees for all major avian lineages. Five major clades form successive sister groups to the rest of Neoaves: (1) a clade including nightjars, other caprimulgiforms, swifts, and hummingbirds; (2) a clade uniting cuckoos, bustards, and turacos with pigeons, mesites, and sandgrouse; (3) cranes and their relatives; (4) a comprehensive waterbird clade, including all diving, wading, and shorebirds; and (5) a comprehensive landbird clade with the enigmatic hoatzin (Opisthocomus hoazin) as the sister group to the rest. Neither of the two main, recently proposed Neoavian clades (Columbea and Passerea1) was supported as monophyletic. The results of our divergence time analyses are congruent with the palaeontological record, supporting a major radiation of crown birds in the wake of the Cretaceous–Palaeogene (K–Pg) mass extinction.
Birds (Aves) are the most diverse lineage of extant tetrapod vertebrates. They comprise over 10,000 living species2, and exhibit an extraordinary diversity in morphology, ecology, and behaviour3. Substantial progress has been made in resolving the phylogenetic history of birds. Phylogenetic analyses of both molecular and morphological data support the monophyletic Palaeognathae (the tinamous and flightless ratites) and Galloanserae (gamebirds and waterfowl) as successive, monophyletic sister groups to the Neoaves, a diverse clade including all other living birds4. Resolving neoavian phylogeny has proven to be a difficult challenge because this radiation was very rapid and deep in time, resulting in very short internodes4.
In the last decade, phylogenetic analyses of large, multilocus data sets have resulted in the proposal of numerous, novel neoavian relationships. For example, a clade consisting of diving and wading birds has been consistently recovered, as well as a large landbird clade in which falcons and parrots are successive sister groups to the perching birds4–8. Recently, phylogenetic analyses of 48 whole avian genomes resulted in the proposal of a novel phylogenetic resolution of the initial branching sequence within Neoaves1. Although this genomic study provided much needed corroboration of many neoavian clades, the limited taxon sampling precluded further insights into the evolutionary history of birds.
It has long been recognized that phylogenetic confidence depends not only on the number of characters analysed and their rate of evolution, but also on the number and relationships of the taxa sampled relative to the nodes of interest9–11. Theory predicts that sampling a single taxon that diverges close to a node of interest will have a far greater effect on phylogenetic resolution than will adding more characters11. Despite using an alignment of >40 million base pairs, sparse sampling of 48 species in the recent avian genomic analysis may not have been sufficient to confidently resolve the deep divergences among major lineages of Neoaves. Thus, expanded taxon sampling is required to test the monophyly of neoavian clades, and to further resolve the phylogenetic relationships within Neoaves.
Here, we present a phylogenetic analysis of 198 bird species and 2 crocodilians (Supplementary Table 1) based on loci captured using anchored enrichment12. Our sample includes species of 122 avian families in all 40 extant avian orders2, with denser representation of non-oscine birds (108 families) than of oscine songbirds (14 families). Effort was made to include taxa that would break up long phylogenetic branches, and provide the highest likelihood of resolving short internodes at the base of Neoaves11. We also sampled multiple species within groups whose monophyly or phylogenetic interrelationships have been controversial (that is, tinamous, nightjars, hummingbirds, turacos, cuckoos, pigeons, sandgrouse, mesites, rails, storm petrels, petrels, storks, herons, hawks, hornbills, mousebirds, trogons, kingfishers, barbets, seriemas, falcons, parrots, and suboscine passerines).
We targeted 394 loci centred on conserved anchor regions of the genome that are flanked by more variable regions12. We performed all phylogenetic analyses on a data set of 259 genes with the highest quality assemblies. The average locus was 1,524 bases in length (361–2,316 base pairs (bp)), and the total percentage of missing data was 1.84%. The concatenated alignment contained 394,684 sites. To minimize overall model complexity while accurately accounting for substitution processes, we performed a partition model sensitivity analysis with PartitionFinder13,14, and compared a complex partition model (one partition per locus) to a heuristically optimized (rclust) partition model. Phylogenetic informativeness (PI) approaches15,16 provided strong evidence that the phylogenetic utility of our data set was high, with low declines in PI profiles for individual loci, data set partitions, and the concatenated matrix (Supplementary Fig. 4). We estimated concatenated trees in ExaBayes17 and RAxML18 using a 75-partition model. Coalescent species trees were estimated with the gene tree summation methods in STAR19, NJst20, and ASTRAL21 from gene trees estimated with RAxML (see Methods).
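The data-set dimensions quoted in the text can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (259 loci, mean locus length 1,524 bases, 394,684 alignment sites, 198 birds plus 2 crocodilians); the variable names are illustrative, and the total counts every cell of the alignment matrix, including the small fraction of missing data:

```python
n_loci = 259
mean_locus_length = 1524      # bases, as given in the text
alignment_sites = 394_684     # concatenated alignment length from the text
n_taxa = 198 + 2              # bird species plus crocodilian outgroups

# Loci times mean length should land close to the concatenated length;
# the small difference comes from rounding the mean.
approx_sites = n_loci * mean_locus_length
site_gap = abs(approx_sites - alignment_sites)

# Size of the full taxon-by-site matrix.
total_cells = alignment_sites * n_taxa

print(f"approx. sites = {approx_sites:,} (gap {site_gap})")
print(f"matrix cells  = {total_cells:,} (~{total_cells / 1e7:.2f} x 10^7)")
```

The matrix size of roughly 7.9 × 10^7 cells agrees with the "over 7.8 × 10^7 bases" quoted for the total data set.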
Our concatenated Bayesian analyses resulted in a completely resolved, well supported phylogeny. All clades had a posterior probability (PP) of 1, except for a single clade including shoebill (Balaeniceps) and pelican (PP = 0.54) (Fig. 1). The concatenated
*These authors contributed equally to this work.
1Department of Ecology & Evolutionary Biology, Yale University, New Haven, Connecticut 06520, USA. 2Peabody Museum of Natural History, Yale University, New Haven, Connecticut 06520, USA. 3Department of Ecology and Evolutionary Biology, Fuller Evolutionary Biology Program, Cornell University, and Cornell Laboratory of Ornithology, Ithaca, New York 14853, USA. 4North Carolina Museum of Natural Sciences, Raleigh, North Carolina 27601, USA. 5Department of Geology & Geophysics, Yale University, New Haven, Connecticut 06520, USA. 6Department of Biostatistics, and Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA. 7Department of Biological Science, Florida State University, Tallahassee, Florida 32306, USA. 8Department of Scientific Computing, Florida State University, Tallahassee, Florida 32306, USA.
22 OCTOBER 2015 | VOL 526 | NATURE | 569
©2015 Macmillan Publishers Limited. All rights reserved
maximum likelihood analysis recovered a single topology that was identical to the Bayesian tree except for three clades, all of which are far from the base of Neoaves: the relationships among pigeons; among skimmers, gulls, and terns; and among pelicans, shoebill, and waders (Supplementary Fig. 1). Almost all clades in the maximum likelihood tree were maximally supported with bootstrap scores (BS) of 1.00, but nine clades within Neoaves (including four of the most inclusive neoavian clades) received support <0.70 (Supplementary Fig. 1). Coalescent species tree analyses produced substantially different hypotheses for neoavian relationships (Supplementary Fig. 3), but
[Figure 1, continued panel: tree labelled "Neoaves continued" and "Inopinaves", with higher taxon labels Accipitriformes, Coraciimorphae, Australaves, and Passeriformes; tip labels are genus names; clade numbers 7–97; time axis 70 to 0 Ma, spanning the Upper Cretaceous, Palaeogene (Palaeocene, Eocene, Oligocene), Neogene (Miocene, Pliocene), and Quaternary (Pleistocene).]
Figure 1 | Continued.
[Figure 1, first panel: tree of Aves covering Palaeognathae, Galloanserae, and Neoaves (Strisores, Columbaves, Gruiformes, Aequorlitornithes); abbreviated higher taxon labels include Tinam., Galliformes, Anseriform., Apodiform., Otidimorph., and Columbimorph.; tip labels are genus names; clade numbers 1–6 and 98–197; time axis 70 to 0 Ma, spanning the Upper Cretaceous, Palaeogene (Palaeocene, Eocene, Oligocene), Neogene (Miocene, Pliocene), and Quaternary (Pleistocene).]
Figure 1 | Phylogeny of birds. Time-calibrated phylogeny of 198 species of birds inferred from a concatenated, Bayesian analysis of 259 anchored phylogenomic loci using ExaBayes17. Figure continues on the opposite page from green arrow at the bottom of this panel. Complete taxon data in Supplementary Table 1. Higher taxon names appear at right. All clades are supported with posterior probability (PP) of 1.0, except for the Balaeniceps–Pelecanus clade (PP = 0.54; clade 109). The five major, successive, neoavian sister clades are: Strisores (brown), Columbaves (purple), Gruiformes (yellow), Aequorlitornithes (blue), and Inopinaves (green). Background colours mark geological periods. Ma, million years ago; Ple, Pleistocene; Pli, Pliocene; Q., Quaternary. Clade numbers refer to the plot of estimated divergence dates (Supplementary Fig. 7). Fossil age-calibrated nodes are shown in grey. Illustrations of representative bird species30 are depicted by their lineages. See Supplementary Information for details and further discussion.
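Clade support of the kind reported in this caption is commonly stored as internal node labels in a Newick tree string. The sketch below shows one way to pull those labels out and list the ones below a threshold. It is an illustration under stated assumptions only: the function name `low_support`, the 0.95 threshold, and the toy tree are hypothetical, and the toy topology and branch lengths are made up (only the 0.54 value echoes the caption). The regular expression assumes support values are written directly after a closing parenthesis.

```python
import re


def low_support(newick, threshold):
    """Return the internal-node support values that fall below threshold.

    Assumes support is written immediately after a closing parenthesis,
    for example ')0.54:1.0'. Node labels in other positions are not handled.
    """
    labels = re.findall(r"\)(\d+(?:\.\d+)?)", newick)
    return [float(label) for label in labels if float(label) < threshold]


# Toy tree: hypothetical topology and branch lengths, for illustration only.
toy_tree = "((Balaeniceps:1,Pelecanus:1)0.54:1,(Ardea:1,Tigrisoma:1)1.0:1)1.0;"
weak_nodes = low_support(toy_tree, 0.95)
print(weak_nodes)
```

For real trees a dedicated phylogenetics library is a safer choice than a regular expression, since Newick files can carry quoted names, comments, and other annotations.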