Molecular Phylogenetics (1) Phylogeny in wide sense is a ... · Molecular Phylogenetics (1)...



Page 1

Molecular Phylogenetics EEOB 563

Phylogeny (from Greek phylon – tribe, and genesis – origin)

• The term was introduced by E. Haeckel in the second half of the 19th century and now has two somewhat different meanings.

• (1) Phylogeny in the wide sense is the historical development of organisms.

• (2) Phylogeny in the narrow sense covers not all aspects of historical development, but only the succession of branching events in a genealogical (i.e. phylogenetic) tree.

• It is usually represented by a phylogenetic tree.

What is a phylogenetic tree?

• A tree is a mathematical structure used to model the actual evolutionary history of a group of sequences or organisms.

• The actual pattern of historical relationships is an evolutionary tree, which we try to estimate.
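As a minimal sketch of what "a tree is a mathematical structure" can mean in practice, the Python below models a rooted tree as nodes with child lists and writes it out in Newick notation. The class name `Node`, the helper `newick`, and the four taxa A to D are illustrative assumptions, not part of the course material.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node in a rooted tree. Leaves carry a taxon name; internal nodes may not."""
    name: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> List[str]:
        """Return leaf names in left-to-right order."""
        if self.is_leaf():
            return [self.name]
        names: List[str] = []
        for child in self.children:
            names.extend(child.leaves())
        return names

    def to_newick(self) -> str:
        """Serialise the subtree rooted here in Newick notation (no branch lengths)."""
        if self.is_leaf():
            return self.name
        inner = ",".join(child.to_newick() for child in self.children)
        return f"({inner})" + (self.name or "")


def newick(root: Node) -> str:
    """Newick strings end with a semicolon."""
    return root.to_newick() + ";"


# Hypothetical four-taxon tree: A and B are sisters, as are C and D.
tree = Node(children=[
    Node(children=[Node("A"), Node("B")]),
    Node(children=[Node("C"), Node("D")]),
])
print(newick(tree))    # ((A,B),(C,D));
print(tree.leaves())   # ['A', 'B', 'C', 'D']
```

Only the branching order is stored here. An estimated tree would normally also carry branch lengths and support values, which could be added as extra fields on each node.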

Page 2

Darwin’s letter to Thomas Huxley (1857)

“The time will come I believe, though I shall not live to see it, when we shall have fairly true genealogical (phylogenetic) trees of each great kingdom of nature”

Dawkins (2003), A Devil’s Chaplain

“… there is, after all, one true tree of life […]. It exists. It is in principle knowable. We don’t know it all yet. By 2050 we should – or if we do not, we shall have been defeated only at the terminal twigs, by the sheer number of species.”

The AToL initiative (Assembling the Tree of Life) is a large research effort sponsored by the National Science Foundation. Its goal is to reconstruct the evolutionary origins of all living things.

which offered tantalising glimpses of their potential for enabling interactions between researchers (Figure 7b) [44,45]. However, the release by Apple of first the iPhone and subsequently the iPad have made touch interfaces mainstream. Not only does this mean that touch-screen devices are now widely available, but there also is a consistent vocabulary for how users can interact with these devices, using gestures such as ‘tap’, ‘swipe’ and ‘pinch and zoom’. Phylogenetics software developers have yet to exploit fully the possibilities of these devices. Interacting with focus+context tools, such as Dendroscope [46] and Tree Juxtaposer [42], using a desktop computer with a mouse is rather clumsy, whereas a touch screen would provide a more natural way to apply the spatial distortions these tools use to visualise very big trees.

Figure 3. Folding a tree. To save space, the subtree shaded in (a) is collapsed and drawn as a smaller triangle (b). The choice of which nodes to collapse can be automated, or left as a task for the user.

Figure 4. Phylogenies in three dimensions. (a) Hyperbolic view of the National Center for Biotechnology Information taxonomy [58]. (b) Google Earth visualisation of the Hawaiian endemic katydid genus Banza based on the phylogeny from [58]. (c) Stacked representation of a gene tree with multiple gene duplications [32]. [Panel labels: Evolutionary distance (Y), Paralogs (Z), Species (X); Root, Edge, Gene, Node; Archaea, Eukaryota.]

Review Trends in Ecology and Evolution February 2012, Vol. 27, No. 2

Phyloinformatics

There is a long tradition of annotating phylogenies by colouring in branches, as popularised by the program MacClade [47]. However, much of this annotation has been local; that is, only data contained within a single file are mapped on the tree (typically the data used to create the tree). A bigger challenge is annotating phylogenies with data on genomics, geographic distribution, ecology and phenotype. Pioneering efforts in this direction include TaxonTree [9], a stand-alone tool that runs on desktop computers, and the web-based iToL [48]. As an increasing amount of biodiversity data acquires digital identifiers that can be resolved [49], one can look forward to phylogeny viewers that automatically aggregate annotations from multiple data sources and display these to the user, as well as enabling the user to query that information [50].

Perhaps one can draw a lesson here from the success of Google Earth, which has become a ubiquitous tool for visualising geographic data, in large part because of the ease of creating the Keyhole Markup Language (KML) files used by that program. This has enabled third parties, including evolutionary biologists, to create innovative visualisations rich in biological data. This suggests an obvious way forward for the phylogenetic community,

Figure 7. Interacting with phylogenies. (a) Displaying a phylogeny using multiple monitors. (b) Interacting with a visualisation using a touch screen.

Figure 6. Alternative visualisations of uncertainty in trees. (a) Two and a half dimensional visualisation of a series of trees where neighbouring trees show minor topological changes. (b) DensiTree visualisation of variation in estimates of branch length among a set of trees. (c) Phylogenetic network showing two conflicting signals for a set of four taxa. [Tip labels in the panels: A, B, C, D.]

Figure 5. Reconciled trees and tanglegrams. (a) In a reconciled tree, one tree (such as a gene tree) is embedded inside another tree; for example, the phylogeny of the species from which the genes were obtained. (b) Trees for different, associated entities, such as genes and species, or hosts and parasites, can also be depicted using a tanglegram. [Panel labels: Reconciled tree (Gene, Species); Tanglegram (Host, Parasite).]

Special Issue: Ecological and evolutionary informatics

Space, time, form: viewing the Tree of Life

Roderic D.M. Page

Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, G12 8QQ, UK

There are numerous ways to display a phylogenetic tree, which is reflected in the diversity of software tools available to phylogeneticists. Displaying very large trees continues to be a challenge, made ever harder as increasing computing power enables researchers to construct ever-larger trees. At the same time, computing technology is enabling novel visualisations, ranging from geophylogenies embedded on digital globes to touch-screen interfaces that enable greater interaction with evolutionary trees. In this review, I survey recent developments in phylogenetic visualisation, highlighting successful (and less successful) approaches and sketching some future directions.

Visualising trees

Visualising phylogenies is one of the fundamental tasks of evolutionary analysis. Reviews of the field [1,2] list a growing number of tree viewers, some of which, such as NJPlot [3] and TreeView [4], have been in use for over a decade. A quick glance at Felsenstein’s list of phylogeny programs (http://evolution.genetics.washington.edu/phylip/software.html#Plotting) reveals viewers for just about every conceivable operating system, written in a wide range of computer programming languages. Given this diversity of tools that all provide essentially the same functionality, it would be tempting to conclude that the basic problem of displaying an evolutionary tree has been solved. Yet, it was striking that all the entries in the iEvoBio 2010 visualisation challenge were tree viewers (Figure 1). This suggests that although the niche of tree viewer is crowded, biologists working with trees are still searching for tools to help them visualise phylogenies. The goal of this review is to survey some recent developments in phylogeny visualisation, with an eye to future directions.

Trees are relatively simple structures that place few restrictions on how they can be depicted, apart from preserving the connections between the nodes in the tree. This lack of constraints has led to a proliferation of ways to visualise trees, many of which are striking (for a visual survey, see http://treevis.net). Conversely, this freedom means that the interpretation of a tree diagram might not always be obvious to the person viewing it [5] [Green, D. and Shapley, R. (2005) Teaching with a visual tree of life; http://groups.ischool.berkeley.edu/TOL/], especially distinguishing which aspects of the diagram are providing information, and which largely reflect artistic license (Box 1).

Although the most common representation of a phylogeny is a two-dimensional (2D) Euclidean drawing [1], an increasingly diverse range of visualisations are emerging (Figure 2). Typically phylogenies are drawn as trees; however, authors have experimented with treemaps [6], which lay out a tree as a set of nested rectangles (Figure 2). Treemaps are perhaps best suited for classifications rather than phylogenies, although Arvelakis et al. [7] recently used treemaps to display phylogenies with over 2000 species.

Euclidean geometry is reassuringly familiar, but it becomes difficult to accommodate very large trees within the confines of the printed page or a computer screen. One approach is to ‘fold’ or collapse nodes to save space (Figure 3). Several methods, such as degree of interest (DOI) trees [8], space trees [9] and expand-ahead browsers [10], exploit the natural hierarchy of rooted trees to compress the tree into a smaller display area. The choice of which nodes in a tree to collapse can be made by the user, or the process can be automated [11,12].
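The idea of folding can be sketched in a few lines of Python. The example below represents a rooted tree as nested tuples of leaf names and collapses every clade found at a chosen depth into a summary label. The depth threshold is an assumed, deliberately simple rule for illustration; it is not the algorithm used by DOI trees, space trees or the automated methods cited above, and the names `fold` and `count_leaves` are hypothetical.

```python
def count_leaves(tree):
    """Number of leaves under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return 1
    return sum(count_leaves(child) for child in tree)


def fold(tree, max_depth, depth=0):
    """Collapse each internal node reached at max_depth into a summary label.

    Leaves are kept as they are; a collapsed clade is replaced by a string
    recording how many leaves it hides (the 'triangle' in a folded drawing).
    """
    if isinstance(tree, str):
        return tree
    if depth >= max_depth:
        return f"<{count_leaves(tree)} leaves>"
    return tuple(fold(child, max_depth, depth + 1) for child in tree)


# Hypothetical five-taxon tree.
tree = (("A", "B"), (("C", "D"), "E"))

print(fold(tree, max_depth=1))  # ('<2 leaves>', '<3 leaves>')
print(fold(tree, max_depth=2))  # (('A', 'B'), ('<2 leaves>', 'E'))
```

A user-driven viewer would replace the depth test with a per-node flag set by clicking, while an automated one might collapse clades according to screen space or degree of interest.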

An alternative approach to saving space is to keep the tree unchanged, but instead distort the space in which the tree is being displayed, the best-known examples being hyperbolic viewers (Figure 2) [13,14]. Although capable of producing some stunning images (e.g. Figure 4a), these tools have gained little traction among users. In practice, users find them hard to navigate, and hyperbolic viewers in particular are best suited to classifications, which tend to be shallow (few nodes along the path from any tip to the base or root of the tree) and frequently have internal nodes of high degree (many immediate descendants). By contrast, a fully resolved phylogeny may be deep (in a tree with n leaves, there may be a path from leaf to root with n–1 nodes) and binary (each node having only two immediate descendants); consequently, phylogenies rarely look good in hyperbolic viewers.

Some three-dimensional (3D) phylogeny viewers have forgone trying to truly display a phylogeny in three dimensions, and instead use the third dimension to provide a ‘fly through’ experience over a 2D tree, such as Paloverde [15] and the Wellcome Trust Tree of Life (http://www.wellcometreeoflife.org/). Although perhaps less disorientating than hyperbolic viewers, it is not clear that this provides a better way to navigate through a tree compared with a simple 2D visualisation. Although the case for 3D


Corresponding author: Page, R.D.M. ([email protected])

0169-5347/$ – see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.tree.2011.12.002 Trends in Ecology and Evolution, February 2012, Vol. 27, No. 2

Page 3

a | The phylogeny shows the distributions of new Drosophila spp. genes involved in development [46] (above) and in the brain [76] (below) in various evolutionary stages within the past 36 million years [81].

Why molecular phylogenetics?

• “The stream of heredity makes phylogeny: in a sense, it is phylogeny. Complete genetic analysis would provide the most priceless data for the mapping of this stream.” George G. Simpson, 1945

• “I do not fully understand why we are not proclaiming the message from the housetops ... We finally have a method that can sort homology from analogy.” Stephen J. Gould, 1985

Linus Pauling

• “We may ask the question where in the now living systems the greatest amount of information of their past history has survived and how it can be extracted”

• “Best fit are the different types of macromolecules (sequences) which carry the genetic information”

Molecules as documents of evolutionary history

Page 4

Applications of Phylogenetic Analysis

• Systematics and classification

• Discovering new life forms

• Phylogeography and speciation

• Molecular evolution

• Genomics

• Epidemiology and forensics

• Biotechnology

• Agriculture

• Conservation

The Tree of Life: Benefits to Society through Phylogenetic Research

Using phylogenetic analysis to discover new life forms for biotechnology

Phylogenetic analysis is playing a major role in discovering and identifying new life forms that could yield many new benefits for human health and biotechnology. Many microorganisms, including bacteria and fungi, cannot be cultivated and studied directly in the laboratory, thus the principal road to discovery is to isolate their DNA from samples collected from marine or freshwater environments or from soils. The DNA samples are then sequenced and compared in phylogenetic analyses with the sequences of previously described organisms. This has led to major new discoveries.

For several decades microbiologists have been searching for new bacteria in extreme environments such as hot springs or marine hydrothermal vents. The thermal springs of Yellowstone National Park have yielded a host of new and important bacterial species, many of which were identified using phylogenetic analysis of DNA sequences.

The most famous bacterium from Yellowstone is Thermus aquaticus. An enzyme derived from this species, Taq DNA polymerase, powers a process called the polymerase chain reaction (PCR), which is used in thousands of laboratories to make large amounts of DNA for sequencing. This discovery led to the creation of a major new biotechnological industry and has revolutionized medical diagnostics, forensics, and other biological sciences. Many microorganisms in extreme environments may yield innovative products for biotechnology.

Fungi: an unknown world revealed by phylogenetic analysis

Fungi are among the most ecologically important organisms. By feeding on dead or decaying organic material, fungi help recycle nutrients through ecosystems. Additionally, fungi are important economically as foods and as biotechnological sources for medicines, insecticides, herbicides, and many other products.

About 200,000 species of fungi are known, but there may be millions more to be discovered because most are extremely small and found in poorly studied habitats such as soils. Increasingly, phylogenetic analysis is being used to discover new microfungi through isolation and sequencing of DNA. Biological studies on these new species hold great promise for developing novel natural products.

[Photo caption] Common fungi often have mycorrhizal associations in early stages of development, and thus are important parts of Earth’s ecosystems.

[Photo caption] Thermophilic bacteria found in Yellowstone hot springs

[Figure caption] A phylogeny of some archaeobacteria. Newly discovered life forms are in red.

[Tip labels: pJP 74, pJP 7, pJP 8, pJP 6, pJP 81, pJP 33, pJP 9, Desulfurococcus mobilis, Sulfolobus acidocaldarius, Pyrodictium occultum, Pyrobaculum islandicum, Pyrobaculum aerophilum, Thermoproteus tenax, Thermofilum pendens, Methanopyrus kandleri, Thermococcus celer, Archaeoglobus fulgidus]

“Simple identification via phylogenetic classification of organisms has, to date, yielded more patent filings than any other use of phylogeny in industry.” Bader et al. (2001)

Proc. Natl. Acad. Sci. USA, Vol. 74, No. 11, pp. 5088-5090, November 1977, Evolution

Phylogenetic structure of the prokaryotic domain: The primary kingdoms

(archaebacteria/eubacteria/urkaryote/16S ribosomal RNA/molecular phylogeny)

CARL R. WOESE AND GEORGE E. FOX*

Department of Genetics and Development, University of Illinois, Urbana, Illinois 61801

Communicated by T. M. Sonneborn, August 18, 1977

ABSTRACT A phylogenetic analysis based upon ribosomal RNA sequence characterization reveals that living systems represent one of three aboriginal lines of descent: (i) the eubacteria, comprising all typical bacteria; (ii) the archaebacteria, containing methanogenic bacteria; and (iii) the urkaryotes, now represented in the cytoplasmic component of eukaryotic cells.

The biologist has customarily structured his world in terms of certain basic dichotomies. Classically, what was not plant was animal. The discovery that bacteria, which initially had been considered plants, resembled both plants and animals less than plants and animals resembled one another led to a reformulation of the issue in terms of a yet more basic dichotomy, that of eukaryote versus prokaryote. The striking differences between eukaryotic and prokaryotic cells have now been documented in endless molecular detail. As a result, it is generally taken for granted that all extant life must be of these two basic types.

Thus, it appears that the biologist has solved the problem of the primary phylogenetic groupings. However, this is not the case. Dividing the living world into Prokaryotae and Eukaryotae has served, if anything, to obscure the problem of what extant groupings represent the various primeval branches from the common line of descent. The reason is that eukaryote/prokaryote is not primarily a phylogenetic distinction, although it is generally treated so. The eukaryotic cell is organized in a different and more complex way than is the prokaryote; this probably reflects the former's composite origin as a symbiotic collection of various simpler organisms (1-5). However striking, these organizational dissimilarities do not guarantee that eukaryote and prokaryote represent phylogenetic extremes.

The eukaryotic cell per se cannot be directly compared to the prokaryote. The composite nature of the eukaryotic cell makes it necessary that it first be conceptually reduced to its phylogenetically separate components, which arose from ancestors that were noncomposite and so individually are comparable to prokaryotes. In other words, the question of the primary phylogenetic groupings must be formulated solely in terms of relationships among "prokaryotes" – i.e., noncomposite entities. (Note that in this context there is no suggestion a priori that the living world is structured in a dichotomous way.)

The organizational differences between prokaryote and eukaryote and the composite nature of the latter indicate an important property of the evolutionary process: Evolution seems to progress in a "quantized" fashion. One level or domain of organization gives rise ultimately to a higher (more complex) one. What "prokaryote" and "eukaryote" actually represent are two such domains. Thus, although it is useful to define phylogenetic patterns within each domain, it is not meaningful to construct phylogenetic classifications between domains: Prokaryotic kingdoms are not comparable to eukaryotic ones. This should be recognized by an appropriate terminology. The highest phylogenetic unit in the prokaryotic domain we think should be called an "urkingdom" – or perhaps "primary kingdom." This would recognize the qualitative distinction between prokaryotic and eukaryotic kingdoms and emphasize that the former have primary evolutionary status.

The passage from one domain to a higher one then becomes a central problem. Initially one would like to know whether this is a frequent or a rare (unique) evolutionary event. It is traditionally assumed – without evidence – that the eukaryotic domain has arisen but once; all extant eukaryotes stem from a common ancestor, itself eukaryotic (2). A similar prejudice holds for the prokaryotic domain (2). [We elsewhere argue (6) that a hypothetical domain of lower complexity, that of "progenotes," may have preceded and given rise to the prokaryotes.] The present communication is a discussion of recent findings that relate to the urkingdom structure of the prokaryotic domain and the question of its unique as opposed to multiple origin.

Phylogenetic relationships cannot be reliably established in terms of noncomparable properties (7). A comparative approach that can measure degree of difference in comparable structures is required. An organism's genome seems to be the ultimate record of its evolutionary history (8). Thus, comparative analysis of molecular sequences has become a powerful approach to determining evolutionary relationships (9, 10).

To determine relationships covering the entire spectrum of extant living systems, one optimally needs a molecule of appropriately broad distribution. None of the readily characterized proteins fits this requirement. However, ribosomal RNA does. It is a component of all self-replicating systems; it is readily isolated; and its sequence changes but slowly with time – permitting the detection of relatedness among very distant species (11-13). To date, the primary structure of the 16S (18S) ribosomal RNA has been characterized in a moderately large and varied collection of organisms and organelles, and the general phylogenetic structure of the prokaryotic domain is beginning to emerge.

A comparative analysis of these data, summarized in Table 1, shows that the organisms clearly cluster into several primary kingdoms. The first of these contains all of the typical bacteria so far characterized, including the genera Acetobacterium, Acinetobacter, Acholeplasma, Aeromonas, Alcaligenes, Anacystis, Aphanocapsa, Bacillus, Bdellovibrio, Chlorobium, Chromatium, Clostridium, Corynebacterium, Escherichia, Eubacterium, Lactobacillus, Leptospira, Micrococcus, Mycoplasma, Paracoccus, Photobacterium, Propionibacterium,

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U. S. C. §1734 solely to indicate this fact.

* Present address: Department of Biophysical Sciences, University of Houston, Houston, TX 77004.

Discovering new life forms

Environmental Genome Shotgun Sequencing of the Sargasso Sea

J. Craig Venter,1* Karin Remington,1 John F. Heidelberg,3 Aaron L. Halpern,2 Doug Rusch,2 Jonathan A. Eisen,3 Dongying Wu,3 Ian Paulsen,3 Karen E. Nelson,3 William Nelson,3 Derrick E. Fouts,3 Samuel Levy,2 Anthony H. Knap,6 Michael W. Lomas,6 Ken Nealson,5 Owen White,3 Jeremy Peterson,3 Jeff Hoffman,1 Rachel Parsons,6 Holly Baden-Tillson,1 Cynthia Pfannkoch,1 Yu-Hui Rogers,4 Hamilton O. Smith1

We have applied “whole-genome shotgun sequencing” to microbial populations collected en masse on tangential flow and impact filters from seawater samples collected from the Sargasso Sea near Bermuda. A total of 1.045 billion base pairs of nonredundant sequence was generated, annotated, and analyzed to elucidate the gene content, diversity, and relative abundance of the organisms within these environmental samples. These data are estimated to derive from at least 1800 genomic species based on sequence relatedness, including 148 previously unknown bacterial phylotypes. We have identified over 1.2 million previously unknown genes represented in these samples, including more than 782 new rhodopsin-like photoreceptors. Variation in species present and stoichiometry suggests substantial oceanic microbial diversity.

Microorganisms are responsible for most of the biogeochemical cycles that shape the environment of Earth and its oceans. Yet, these organisms are the least well understood on Earth, as the ability to study and understand the metabolic potential of microorganisms has been hampered by the inability to generate pure cultures. Recent studies have begun to explore environmental bacteria in a culture-independent manner by isolating DNA from environmental samples and transforming it into large insert clones. For example, a previously unknown light-driven proton pump, proteorhodopsin, was discovered within a bacterial artificial chromosome (BAC) from the genome of a SAR86 ribotype (1), and soil microbial DNA libraries have been constructed and screened for specific activities (2).

Here we have applied whole-genome shotgun sequencing to environmental-pooled DNA samples to test whether new genomic approaches can be effectively applied to gene and species discovery and to overall environmental characterization. To help ensure a tractable pilot study, we sampled in the Sargasso Sea, a nutrient-limited, open ocean environment. Further, we concentrated on the genetic material captured on filters sized to isolate primarily microbial inhabitants of the environment, leaving detailed analysis of dissolved DNA and viral particles on one end of the size spectrum and eukaryotic inhabitants on the other, for subsequent studies.

The Sargasso Sea. The northwest Sargasso Sea, at the Bermuda Atlantic Time-series Study site (BATS), is one of the best-studied and arguably most well-characterized regions of the global ocean. The Gulf Stream represents the western and northern boundaries of this region and provides a strong physical boundary, separating the low nutrient, oligotrophic open ocean from the more nutrient-rich waters of the U.S. continental shelf. The Sargasso Sea has been intensively studied as part of the 50-year time series of ocean physics and biogeochemistry (3, 4) and provides an opportunity for interpretation of environmental genomic data in an oceanographic context. In this region, formation of subtropical mode water occurs each winter as the passage of cold fronts across the region erodes the seasonal thermocline and causes convective mixing, resulting in mixed layers of 150 to 300 m depth. The introduction of nutrient-rich deep water, following the breakdown of seasonal thermoclines into the brightly lit surface waters, leads to the blooming of single cell phytoplankton, including two cyanobacteria species, Synechococcus and Prochlorococcus, that numerically dominate the photosynthetic biomass in the Sargasso Sea.

Surface water samples (170 to 200 liters) were collected aboard the RV Weatherbird II from three sites off the coast of Bermuda in February 2003. Additional samples were collected aboard the SV Sorcerer II from “Hydrostation S” in May 2003. Sample site locations are indicated on Fig. 1 and described in table S1; sampling protocols were fine-tuned from one expedition to the next (5). Genomic DNA was extracted from filters of 0.1 to 3.0 µm, and genomic libraries with insert sizes ranging from 2 to 6 kb were made as described (5). The prepared plasmid clones were sequenced from both ends to provide paired-end reads at the J. Craig Venter Science Foundation Joint Technology Center on ABI 3730XL DNA sequencers (Applied Biosystems, Foster City, CA). Whole-genome random shotgun sequencing of the Weatherbird II samples (table S1, samples 1 to 4) produced 1.66 million reads averaging 818 bp in length, for a total of approximately 1.36 Gbp of microbial DNA sequence. An additional 325,561 sequences were generated from the Sorcerer II samples (table S1, samples 5 to 7), yielding approximately 265 Mbp of DNA sequence.

Environmental genome shotgun assembly. Whole-genome shotgun sequencing projects have traditionally been applied to identify the genome sequence(s) from one particular organism, whereas the approach taken here is intended to capture representative sequence from many diverse organisms simultaneously. Variation in genome size and relative abundance determines the depth of coverage of any particular organism in the sample at a given level of sequencing and has strong implications for both the application of assembly algorithms and for the metrics used in evaluating the resulting assembly. Although we would expect abundant species to be deeply covered and well assembled, species of lower abundance may be represented by only a few sequences. For a single genome analysis, assembly coverage depth in unique regions should approximate a Poisson distribution. The mean of this distribution can be estimated from the observed data, looking at the depth of coverage of contigs generated before any scaffolding. The assembler used in this study, the Celera Assembler (6), uses this value to heuristically identify clearly unique regions to form the backbone of the final assembly within the scaffolding phase. However, when the starting material consists of a mixture of genomes of varying abundance, a threshold estimated in this way would classify samples from the most abundant organism(s) as repetitive, due to their greater-than-average depth of coverage, paradoxically leaving the most abundant organisms poorly assembled. We therefore used manual curation of an initial

1 The Institute for Biological Energy Alternatives, 2 The Center for the Advancement of Genomics, 1901 Research Boulevard, Rockville, MD 20850, USA. 3 The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA. 4 The J. Craig Venter Science Foundation Joint Technology Center, 5 Research Place, Rockville, MD 20850, USA. 5 University of Southern California, 223 Science Hall, Los Angeles, CA 90089–0740, USA. 6 Bermuda Biological Station for Research, Inc., 17 Biological Lane, St George GE 01, Bermuda.

*To whom correspondence should be addressed. E-mail: [email protected]

RESEARCH ARTICLE

2 APRIL 2004 VOL 304 SCIENCE www.sciencemag.org
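The coverage argument in the Sargasso Sea excerpt above can be made concrete with a small Python sketch. It computes mean coverage as reads × read length / genome size and uses the Poisson distribution to ask how surprising a given depth would be. The function names and the 3× and 12× depths are hypothetical values chosen for illustration; this is not the Celera Assembler's actual heuristic. Only the read count (1.66 million) and mean read length (818 bp) come from the excerpt.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Expected depth of coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_tail(k, lam):
    """P(X >= k): chance of seeing depth k or more if the true mean is lam."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))


# Total sequence from the Weatherbird II samples, as reported in the excerpt.
total_bp = 1.66e6 * 818
print(f"{total_bp / 1e9:.2f} Gbp")  # 1.36 Gbp

# Hypothetical mixture: the pooled average depth is 3x, but one abundant
# organism is really covered at 12x.
pooled_mean = 3.0
abundant_depth = 12
p = poisson_tail(abundant_depth, pooled_mean)
print(f"P(depth >= {abundant_depth} | mean {pooled_mean}) = {p:.1e}")
```

Under the assumed numbers, a depth of 12 is very unlikely for a unique region with mean 3, so a threshold derived from the pooled mean would label the abundant organism's contigs as repeats. That is the paradox the authors describe.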

Disease Transmission and Medical Forensics

Brief Communications

Nature 444, 836-837 (14 December 2006) | doi:10.1038/444836a; Received 4 November 2006; Accepted 24 November 2006; Published online 6 December 2006

Molecular Epidemiology: HIV-1 and HCV sequences from Libyan outbreak

Tulio de Oliveira1, Oliver G. Pybus1, Andrew Rambaut2, Marco Salemi3, Sharon Cassol4, Massimo Ciccozzi5, Giovanni Rezza5, Guido Castelli Gattinara6, Roberta D'Arrigo7, Massimo Amicosante8, Luc Perrin9, Vittorio Colizzi10, Carlo Federico Perno11 and Benghazi Study Group12

In 1998, outbreaks of human immunodeficiency virus type 1 (HIV-1) and hepatitis C virus (HCV) infection were reportedin children attending Al-Fateh Hospital in Benghazi, Libya. Here we use molecular phylogenetic techniques to analysenew virus sequences from these outbreaks. We find that the HIV-1 and HCV strains were already circulating andprevalent in this hospital and its environs before the arrival in March 1998 of the foreign medical staff (five Bulgariannurses and a Palestinian doctor) who stand accused of transmitting the HIV strain to the children.

Almost half of the 111 children studied in the early months after the discovery of the outbreak showed evidence of both HIV-1 and HCV infection1. Of 418 children eventually affected by these viruses, 248 were referred to European hospitals1, 2. Sequence analysis of 51 children classified the HIV-1 infection as the strain CRF02_AG; HCV infection was classified as genotype 4 or subtype 1a in 15 children1, 2.

We studied HIV-1 gag gene sequences from 44 affected children, plus 61 HCV E1E2 gene sequences that span the HCV hypervariable region (for methods, see supplementary information). By using these data in an evolutionary analysis, we could place a real timescale on the transmission history of the outbreaks.

We collated all available reference strains that were closely related to the sequences from the Al-Fateh Hospital, then estimated and assessed phylogenies using algorithmic, bayesian and maximum-likelihood methods (for details, see supplementary information). The HIV-1 sequences from the hospital form a well supported monophyletic cluster within the CRF02_AG clade, indicating that the outbreak arose from one CRF02_AG lineage. The cluster is closest to three west African reference sequences (Fig. 1a), the basal location of which suggests that the Al-Fateh Hospital lineage arrived in Libya from there. The branch length leading to the Al-Fateh Hospital cluster is perfectly typical; hence the Al-Fateh Hospital strain is not unusually divergent2.

Figure 1: HIV-1 and HCV sequences from 1998 Al-Fateh Hospital (AFH) outbreak.

a–c, Estimated maximum-likelihood phylogenies for HIV-1 CRF02_AG (a), HCV genotype 4 (b) and HCV genotype 1 (c). Source of sequences used for analysis: AFH, red; Egypt, green; Cameroon, blue. Black circles mark the common ancestor of HCV subtype 4a and 1a; numbers above AFH lineages give clade support values using bootstrap and bayesian methods, respectively. Scale bar units are nucleotide substitutions per site. For visual clarity, AFH clusters are represented by triangles and some non-informative reference strains are excluded.


In an equivalent HCV phylogenetic analysis, the HCV sequences from the hospital formed three monophyletic clusters containing 11 subtype-4a sequences, phylogenetically placed among Egyptian subtype 4a lineages; 22 sequences most closely related to a Cameroonian genotype-4 strain; and 24 sequences belonging to the worldwide and prevalent subtype 1a; four remaining sequences belong to genotype 4 (Fig. 1b, c; see supplementary information).

Epidemiological linkage of the HIV-1 and HCV clusters from Al-Fateh Hospital with sequences from sub-Saharan Africa is to be expected, given the large number of migrants within or passing through Libya3; indeed, the Libyan authorities have expressed concern about the risk of introduction of HIV/AIDS and hepatitis as a result of this migration4. In addition, HCV genotype 4 is endemic to central Africa and the Middle East5, 6, 7, and subtype 4a is exceptionally prevalent in neighbouring Egypt8, 9.

Virus sequences also contain temporal information about the date of origin and age of epidemics10. We therefore comprehensively analysed the evolution of the Al-Fateh Hospital clusters using an established bayesian Markov chain Monte Carlo (MCMC) approach9, 10 that appropriately accounts for estimation uncertainty. We estimated three parameter values for each cluster: the date of its most recent common ancestor; the probability that its most recent common ancestor was more recent than 1 March 1998; and the percentage of its lineages that already existed before 1 March 1998. (These values are conservative, because cluster origins could be older than the most recent common ancestor, but not younger.) To avoid model selection bias, we used a range of applicable models.
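The first two summaries described above can be read directly off a set of posterior samples. The following is a minimal sketch, not the authors' pipeline: it assumes the MCMC run has already produced a list of sampled dates for the most recent common ancestor, expressed as decimal years. The function name, the sample values in the usage note, and the use of a simple percentile interval are all illustrative assumptions.

```python
from statistics import median

# 1 March 1998 as a decimal year: 59 days (January + February) have elapsed.
CUTOFF = 1998 + 59 / 365


def summarize_tmrca(samples, cutoff=CUTOFF):
    """Summarize posterior samples of a cluster's common-ancestor date.

    `samples` is a hypothetical list of decimal-year dates, one per
    retained MCMC state.
    """
    ordered = sorted(samples)
    n = len(ordered)
    # Simple percentile interval (an assumption; other interval types exist).
    lower = ordered[int(0.025 * (n - 1))]
    upper = ordered[int(0.975 * (n - 1))]
    # Posterior probability that the ancestor is more recent than the cutoff:
    # the fraction of sampled dates that fall after it.
    p_after = sum(1 for s in ordered if s > cutoff) / n
    return {
        "median": median(ordered),
        "interval95": (lower, upper),
        "p_after_cutoff": p_after,
    }
```

With invented samples such as `[1995.0, 1996.5, 1997.0, 1997.5, 1998.5]`, one of the five dates falls after the cutoff, so the estimated probability is 0.2. A real analysis would use thousands of samples.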

Ou et al. 1992

Applications of Phylogenetic Analysis

• Systematics and classification

• Discovering new life forms

• Phylogeography and speciation

• Molecular evolution

• Genomics

• Epidemiology and forensics

• Biotechnology

• Agriculture

• Conservation

The Tree of Life: Benefits to Society through Phylogenetic Research

Phylogenetic analysis is playing a major role in discovering and identifying new life forms that could yield many new benefits for human health and biotechnology. Many microorganisms, including bacteria and fungi, cannot be cultivated and studied directly in the laboratory, thus the principal road to discovery is to isolate their DNA from samples collected from marine or freshwater environments or from soils. The DNA samples are then sequenced and compared in phylogenetic analyses with the sequences of previously described organisms. This has led to major new discoveries.

For several decades microbiologists have been searching for new bacteria in extreme environments such as hot springs or marine hydrothermal vents. The thermal springs of Yellowstone National Park have yielded a host of new and important bacterial species, many of which were identified using phylogenetic analysis of DNA sequences.

The most famous bacterium from Yellowstone is Thermus aquaticus. An enzyme derived from this species, DNA Taq polymerase, powers a process called the polymerase chain reaction (PCR), which is used in thousands of laboratories to make large amounts of DNA for sequencing. This discovery led to the creation of a major new biotechnological industry and has revolutionized medical diagnostics, forensics, and other biological sciences. Many microorganisms in extreme environments may yield innovative products for biotechnology.

Fungi are among the most ecologically important organisms. By feeding on dead or decaying organic material, fungi help recycle nutrients through ecosystems. Additionally, fungi are important economically as foods and as biotechnological sources for medicines, insecticides, herbicides, and many other products.

About 200,000 species of fungi are known, but there may be millions more to be discovered because most are extremely small and found in poorly studied habitats such as soils. Increasingly, phylogenetic analysis is being used to discover new microfungi through isolation and sequencing of DNA. Biological studies on these new species hold great promise for developing novel natural products.


Common fungi often have mycorrhizal associations in early stages of development, and thus are important parts of Earth’s ecosystems.

Fungi: an unknown world revealed by phylogenetic analysis

Using phylogenetic analysis to discover new life forms for biotechnology

Thermophilic bacteria found in Yellowstone hot springs

Figure: A phylogeny of some archaeobacteria. Newly discovered life forms are in red.

Tip labels: pJP 74, pJP 7, pJP 8, pJP 6, pJP 81, pJP 33, pJP 9, Desulfurococcus mobilis, Sulfolobus acidocaldarius, Pyrodictium occultum, Pyrobaculum islandicum, Pyrobaculum aerophilum, Thermoproteus tenax, Thermofilum pendens, Methanopyrus kandleri, Thermococcus celer, Archaeoglobus fulgidus

“Simple identification via phylogenetic classification of organisms has, to date, yielded more patent filings than any other use of phylogeny in industry.” Bader et al. (2001)



How do we know that phylogenetics works?

Application and Accuracy of Molecular Phylogenies

David M. Hillis, John P. Huelsenbeck, Clifford W. Cunningham

Molecular investigations of evolutionary history are being used to study subjects as diverse as the epidemiology of acquired immune deficiency syndrome and the origin of life. These studies depend on accurate estimates of phylogeny. The performance of methods of phylogenetic analysis can be assessed by numerical simulation studies and by the experimental evolution of organisms in controlled laboratory situations. Both kinds of assessment indicate that existing methods are effective at estimating phylogenies over a wide range of evolutionary conditions, especially if information about substitution bias is used to provide differential weightings for character transformations.

Over the past few decades, biologists from many disciplines have turned to phylogenetic analyses to interpret variation in biological systems (1). This increased interest in evolutionary history has developed partly in response to a new appreciation of the importance of understanding evolutionary constraints when interpreting biological variation and partly in response to developments in phylogenetic methodology. Three developments in particular have been critical to the success of the field: (i) the development of objective criteria and algorithms for discriminating among potential phylogenies, (ii) increased computational power to implement phylogenetic algorithms, and (iii) a rapid increase in the data available for inferring phylogenies, especially from molecular investigations (2). As a result of these developments, applications of phylogenetic analysis span the range of biological diversity from questions about the history of life (3) to studies of the epidemiology of acquired immune deficiency syndrome (AIDS) (4). However, the success of these applications depends on the accuracy of the inferred phylogenies, so it is necessary to ask how well the methods work and to identify the conditions under which they may fail.

The accuracy of methods of phylogenetic analysis can be assessed by the examination of either numerical simulations of phylogenies or phylogenies of organisms whose evolutionary history has been observed directly. Numerical simulations assume a particular model of evolution and then generate characters (typically, nucleotide sequences) according to the model and to a given phylogeny. Thus, an investigator can generate many replicate data sets under specified conditions in order to compare the performance of competing methods. The analysis of known phylogenies adds a reality check to the simulation studies: The history of the lineages is known (or, ideally, controlled by the investigator), but the organisms evolve under real biological constraints rather than idealized model conditions. Known phylogenies may involve laboratory or cultivated strains whose history has been recorded (5) or lineages that have been manipulated under controlled experimental conditions for the purpose of generating testable phylogenies (6, 7).

The authors are in the Department of Zoology, University of Texas, Austin, TX 78712, USA.

The numerical simulation and experimental phylogeny approaches are largely complementary, and both kinds of studies are necessary to evaluate methods of phylogenetic analysis effectively. Simulations can be used to explore virtually any conceivable phylogeny, and phylogenies can be replicated with speed and ease. The primary limitation of numerical simulations is that they always include gross simplifications of biological processes. For instance, most simulations assume that nucleotide positions evolve independently of one another, even though several causes of non-independence have been identified (8). Many simulations also assume simple one- or two-parameter substitution models; for instance, all possible substitutions may be assumed to be equally probable (a one-parameter model), or separate probabilities of substitution may be assigned to transitions and transversions (a two-parameter model). However, real substitution biases are known to be much more complex (9). Although these complexities can be added to simulation studies, there is rarely sufficient knowledge to estimate the extent of the influence of factors such as non-independence among nucleotide positions or variance of rates of evolution across nucleotide positions. Therefore, results from simulation studies need to be compared to results from studies of real biological organisms to determine the effects of the simplifying assumptions. If results from simulations can be replicated with experimental systems, then greater faith can be placed in the simulation results. However, if departures from the simulation results are discovered, then the processes that are responsible for the differences can be identified and the simulations can be improved. The simulations are likely to suggest conditions that are of interest in the experimental phylogenies, and the experimental phylogenies can provide a test of the simulation results. Thus, a combination of the two approaches is the most effective way to evaluate the performance of methods of phylogenetic analysis (10).

Simple Evolutionary Models

Most simulated phylogenies assume a simple one- or two-parameter model of evolution and then test the ability of various methods to reconstruct the evolutionary history of lineages generated under the assumed model (11, 12). Several methods are known to be consistent (at least for simple tree topologies) for data generated under such models, which means that they converge on the correct answer, given infinite data. In general, most of the commonly used methods are consistent if corrections are made for superimposed changes (such as multiple substitutions at a single nucleotide site) in accord with the model of evolution used (13). For instance, most pairwise distance methods (except the UPGMA method) are consistent under the Jukes-Cantor one-parameter model of evolution if Jukes-Cantor distances are used to infer the phylogeny (12, 14). Character-based methods such as parsimony can also be made consistent by using a Hadamard transformation to correct the data (13). However, the fact that a method is consistent indicates only that it will converge on the correct answer when given unlimited data, so it is necessary to do power analyses in order to compare the performance of competing methods, given finite data sets.
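The correction for superimposed changes mentioned above can be illustrated with the standard Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. The sketch below is illustrative only; the function names are assumptions, and it ignores gaps and ambiguous bases.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def jukes_cantor_distance(p):
    """Expected substitutions per site under the Jukes-Cantor model.

    The correction is undefined at p >= 0.75, the level of difference
    expected between unrelated sequences with equal base frequencies.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)
```

For small p the corrected distance is close to p itself; as p approaches 0.75 the correction grows without bound, which is why uncorrected distances increasingly underestimate divergence at high rates of change.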

A common objection made to simulation studies is that it is easy to bias the results in favor of almost any method by choosing conditions to simulate that are most favorable to that method (15). Such biases can be avoided only by exhaustively exploring the potential parameters of any given problem. As an example, consider one of the most commonly simulated cases: a simple four-taxon unrooted tree, in which the five lineages (four peripheral branches and a central branch) are evolving at two different rates (Fig. 1). Felsenstein (16) used a tree of this type to demonstrate that some methods of phylogenetic reconstruction are inconsistent when two of the opposing peripheral branches are evolving much more rapidly than are the remaining three branches. Given a model of evolution (for example, the Kimura two-parameter model, which allows for independent substitution rates for transitions and transversions) (17), and given two rates of evolution (one rate for two of the opposing branches and a second rate for the remaining three branches), the universe of possible trees can be examined in a two-dimensional graph (Fig. 1). Instantaneous substitution rates can be varied from zero to infinity along each of the axes, and sequences can be generated in accord with the model of evolution. A power analysis is conducted by generating sequences of given finite length and then inferring the trees from the sequences by the use of competing methods.

SCIENCE * VOL. 264 * 29 APRIL 1994 671
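The sequence-generation step of such a power analysis can be sketched as follows. This is a minimal illustration under the Kimura two-parameter model, not the simulation code used in the study. Each branch is described by a hypothetical pair `(at, bt)`: the transition rate times branch duration and the per-type transversion rate times branch duration. The tree layout, the choice of which two opposing branches are fast, and all names are assumptions.

```python
import math
import random

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def k2p_probs(at, bt):
    """Probability of a net transition, and of any net transversion, on a branch."""
    e1 = math.exp(-4 * bt)
    e2 = math.exp(-2 * (at + bt))
    p_transition = 0.25 + 0.25 * e1 - 0.5 * e2
    p_transversion = 0.5 - 0.5 * e1  # both transversion types combined
    return p_transition, p_transversion


def evolve(seq, at, bt, rng):
    """Return a descendant of `seq` after one branch, site by site."""
    p_ts, p_tv = k2p_probs(at, bt)
    out = []
    for base in seq:
        u = rng.random()
        if u < p_ts:
            out.append(TRANSITION[base])
        elif u < p_ts + p_tv:
            out.append(rng.choice(TRANSVERSIONS[base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_four_taxa(length, slow, fast, seed=0):
    """Simulate tips of the unrooted tree ((A,B),(C,D)).

    Tips A and C sit on the two fast, opposing peripheral branches;
    B, D and the central branch use the slow rate.
    """
    rng = random.Random(seed)
    left = "".join(rng.choice("ACGT") for _ in range(length))  # node joining A and B
    right = evolve(left, *slow, rng)  # central branch
    return {
        "A": evolve(left, *fast, rng),
        "B": evolve(left, *slow, rng),
        "C": evolve(right, *fast, rng),
        "D": evolve(right, *slow, rng),
    }
```

A power analysis would call `simulate_four_taxa` many times for each combination of the two rates, run each competing method on the resulting tips, and record how often the true grouping of A with B is recovered.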

Figure 1 shows a power analysis for three common methods of phylogenetic inference and the effects of two common methods of data transformation under the model of evolution outlined above (18). For nontransformed data, all three methods are inconsistent in parts of the graph space; use of Kimura-corrected distances (which exactly match the model of evolution) makes the neighbor-joining method consistent across the graph (12). Another common type of data transformation involves character weighting (19, 20). In character methods such as parsimony, differential weights are often assigned to the different character-state changes, depending on their observed frequency of occurrence. Thus, in the Kimura model simulated in Fig. 1, transitions are 10 times more likely to occur than are transversions, so the weighted-parsimony analysis weights the transversions 10 times more heavily than transitions (in practice, a wide range of weights of transversions over transitions produces identical results) (Fig. 2). Such weighting is not equivalent to transforming the data to account for superimposed changes, so weighted parsimony is not consistent across the entire graph space (12). However, the power analysis shown in Fig. 1 indicates that weighting of characters has a much greater effect on performance than does correction for superimposed changes, especially at high rates of change. Although the weighted-parsimony method is more likely to be misleading at extreme differences in the two rates (that is, in the upper left corner of the graph space), it is more likely to find the correct tree at high rates of change (Fig. 1). The Kimura corrections do improve the performance of the neighbor-joining method in regions that are inconsistent for the uncorrected data but do not improve performance when rates are uniformly higher (as does character weighting). The Kimura corrections actually reduce the performance of distance methods under conditions of equal rates of change (Fig. 1).
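Weighted parsimony as described above can be sketched for the four-taxon case by brute force: for each of the three possible unrooted trees, find the cheapest assignment of bases to the two internal nodes at every site, charging transversions 10 times as much as transitions. The code is an illustrative sketch with assumed names, not the software used in the study.

```python
from itertools import product

PURINES = {"A", "G"}


def step_cost(x, y, tv_weight=10):
    """Cost of changing base x into base y along one branch."""
    if x == y:
        return 0
    if (x in PURINES) == (y in PURINES):
        return 1  # transition: purine<->purine or pyrimidine<->pyrimidine
    return tv_weight  # transversion


def quartet_site_cost(a, b, c, d, tv_weight=10):
    """Minimum weighted cost of one site on the tree ((a,b),(c,d))."""
    return min(
        step_cost(a, x, tv_weight) + step_cost(b, x, tv_weight)
        + step_cost(x, y, tv_weight)
        + step_cost(c, y, tv_weight) + step_cost(d, y, tv_weight)
        for x, y in product("ACGT", repeat=2)  # states of the two internal nodes
    )


def quartet_scores(seqs, tv_weight=10):
    """Total weighted parsimony score of each unrooted tree for tips A, B, C, D."""
    topologies = {
        "AB|CD": ("A", "B", "C", "D"),
        "AC|BD": ("A", "C", "B", "D"),
        "AD|BC": ("A", "D", "B", "C"),
    }
    n_sites = len(seqs["A"])
    return {
        name: sum(
            quartet_site_cost(*(seqs[tip][i] for tip in order), tv_weight=tv_weight)
            for i in range(n_sites)
        )
        for name, order in topologies.items()
    }
```

The tree with the lowest score is preferred. Setting `tv_weight=1` gives uniformly weighted parsimony, so the two analyses compared in the text differ only in that one argument.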

Some authors have argued that methods such as parsimony should be avoided because they are inconsistent for some trees (for example, those in the upper left corner of the graphs in Fig. 1) when they evolve under simple models of evolution (21). However, all methods become inconsistent for some trees when their assumptions are violated (12), and the cost of complete consistency under simple models of evolu-

Fig. 1. Performance of three methods of phylogenetic analysis on the basis of simulation of four-taxon trees under the Kimura model of evolution (18). Two rates of evolution were simulated: one rate for branches a, b, and c (horizontal axis of each graph) and a second rate for branches d and e (vertical axis). The diagonal (dashed line, top left) represents equal rates of evolution along all lineages. Branch lengths are shown in expected frequency of divergent nucleotides at the two ends of the respective branches. At infinite rates of change, DNA sequences with equal base compositions are expected to differ at 75% of their positions. Blue indicates that the method estimates the correct tree a high percentage of the time under the simulated conditions; red indicates poor performance of the method (see color bar, top right). The solid white lines circumscribe the regions in which each method estimates the correct tree over 95% of the time. In the regions above the dashed white lines, the methods estimate the correct tree less than one-third of the time (a rate worse than that obtained by choosing a tree at random). The three colored graphs on the left were based on nontransformed data; the three graphs on the right show the effects of character-state weighting (for parsimony, top) and distance correction (for neighbor joining and UPGMA, middle and bottom).

Fig. 2. Efficiency of five methods of phylogenetic analysis for a four-taxon tree with equal rates of evolution, evolving under a Kimura model of evolution and a 10:1 transition:transversion ratio. The branch lengths shown on the tree indicate that 50% of the nucleotide sites are expected to change along each branch. Although all five methods are consistent under these conditions (they all eventually converge on the correct solution), the methods differ markedly in the number of nucleotides needed to find the correct solution. All points are based on 1000 simulated trees. WPars is weighted parsimony (45) (any weighting of transversions over transitions from 5:1 to infinity produces results indistinguishable from those shown); Pars is uniformly weighted parsimony (45); NJ is neighbor joining with Kimura distances (38); UPGMA is the unweighted pair-group method of averages with Kimura distances (40); Lake's invariants is the method also known as evolutionary parsimony (22). Horizontal axis: number of nucleotides (10^1 to 10^8).


from parsimony. In the original study, the phylogeny of these lineages was inferred from restriction site maps of the entire viral genome, and all methods tested were successful at recovering the known phylogeny (6). The methods differed significantly in their ability to recover the branch lengths of the phylogeny (7), and the study also indicated a high degree of success in the reconstruction of ancestral restriction maps (>98% accuracy). However, the study did not discriminate among methods on the basis of their ability to find the correct order of branching events, because all methods found the correct tree.

We have now investigated this phylogeny, using two additional data sets: restriction fragments and DNA sequences (33). Some authors recommend using the presence or absence of restriction fragments (rather than the presence or absence of restriction sites) to infer phylogenies, because it is much easier to collect restriction fragment data than restriction site data (34). However, restriction fragments do not evolve independently (a single site gain results in the loss of one fragment and the gain of two others), and deletions can affect the fragments produced by many restriction enzymes simultaneously. Because of these problems, many authors argue that restriction site data should be preferred to restriction fragment data (35). This position is supported by the experimental T7 phylogeny, because all methods estimated an incorrect phylogeny when using high-resolution restriction fragments, but they estimated the correct phylogeny when using restriction sites. This difference in the performance of analyses based on the two types of data has not been apparent in simulation studies, possibly because simulation studies rarely include deletions in their models of evolutionary change.

The sequence data consist of 1091 base pairs across four genes of T7 (36). There are only 63 variable sites across the sequences, or about one-third as many variable characters as are present in the restriction site data (6). Competing methods do not perform as well with the sequence data as they do with the restriction site data. With the sequence data, only parsimony and weighted parsimony estimate the correct tree, although a second tree (that differs by one branch) is equally parsimonious. Maximum likelihood (37), neighbor joining (38), the Fitch-Margoliash method (39), and UPGMA (40) each estimate a single, incorrect tree that differs from the correct tree by one branch rearrangement. The less accurate overall performance of all methods with the sequence data does not necessarily imply that sequences are less reliable than restriction sites for inferring phylogeny, because there are fewer variable sites in the sequence data set. However, if bootstrap samples equal in size to the sequence data set are selected from the complete restriction site data and compared to bootstrap samples of the sequence data, then the restriction site data do appear to be somewhat more reliable for inferring phylogeny for most methods (maximum likelihood is the exception) (Fig. 7). A possible explanation lies in the non-independent evolution of some nucleotides within genes (7, 8); the variable restriction sites are distributed across the entire T7 genome and therefore are more likely to vary independently of one another. For these data, differential weighting of character states does not improve phylogenetic resolution, because rare substitutions are restricted to single terminal lineages and therefore are uninformative under the parsimony criterion. On the basis of the simulated HIV phylogenies discussed earlier, the beneficial effects of weighting are expected only at higher rates of evolution than were observed. The relatively poor performance of maximum-likelihood estimation on the restriction site data may be because the strongly biased substitution matrix violates the assumptions of the method (7).

Fig. 6. Comparison of an observed phylogeny of viruses derived from bacteriophage T7 with an estimated phylogeny from the parsimony method, on the basis of analysis of the terminal sequences (J through R). The numbers above the branches indicate the actual or estimated number of substitutions that occurred along the respective lineages. The actual numbers of substitutions were determined by sequencing the ancestral viruses. Ranges of values on the estimated tree indicate that multiple, equally parsimonious reconstructions of character states are possible.

Fig. 7. Comparison of phylogenetic analyses of the viral lineages derived from bacteriophage T7, on the basis of 1000 bootstrap samples of DNA sequences and 1000 bootstrap subsamples of the restriction site data that have the same number of variable sites as are in the sequence data. All methods found the correct tree with the complete restriction site data set; only parsimony and weighted parsimony found the correct tree with the complete sequence data set. Methods shown: weighted parsimony, parsimony, neighbor joining, UPGMA, maximum likelihood.
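The bootstrap comparison discussed above rests on resampling characters with replacement and re-estimating the tree from each pseudoreplicate. A minimal sketch of that resampling step follows. The `estimator` argument is a hypothetical stand-in for any tree-building method that returns a hashable label for the tree it finds; it is not a specific program from the study, and the subsampling of restriction sites to match the number of variable sequence sites is not shown.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


def bootstrap_support(alignment, estimator, n_replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which each estimated tree appears."""
    rng = random.Random(seed)
    counts = Counter(
        estimator(bootstrap_alignment(alignment, rng))
        for _ in range(n_replicates)
    )
    return {tree: count / n_replicates for tree, count in counts.items()}
```

The proportion of replicates that return the known tree gives one measure of how reliably a data set of that size supports it, which is the quantity compared between data types in Fig. 7.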

Clearly, it will be necessary to construct additional experimental phylogenies that are based on other tree topologies and experimental conditions so that the generality of the results can be checked. In particular, predicted conditions of inconsistency need to be examined experimentally. Nonetheless, there is a high degree of correspondence between the results from simulations and the experimental phylogenies, although the experiments suggest additional complexities that need to be added to simulations. For instance, the comparison of restriction site data with restriction fragment data indicates the need to incorporate insertion-deletion events into simulations as well as methods of analysis, and the sequence analyses confirm the importance of accounting for non-independence among nucleotide sites. In general, however, the experimental phylogenies confirm the relatively high levels of performance of the various methods of phylogenetic analysis under realistic conditions.

Conclusions

Both simulation studies and experimental phylogenies indicate that many methods of phylogenetic analysis are powerful enough to reconstruct evolutionary histories with a high degree of accuracy, as long as the rates of change of the observed characters are appropriate for analysis. This emphasizes the importance of methods that evaluate whether rates of evolutionary change in target sequences are appropriate for phylogenetic analysis (41). Experimental phylogenies also indicate that many methods may be fairly robust to violations of the underlying assumptions, such as non-independence among nucleotide sites or deviations from simple models of evolution. It also is clear that differential weighting of character-state changes to reflect the observed frequency of the different types of transformations may substantially improve the performance of phylogenetic methods (especially


a) simulations

b) experimental phylogenies

Springer et al. 2004

mammalian taxa, new questions arise, such as whether the underlying genetic architecture responsible for these changes involves the same or different genes.

The root of the placental tree and other remainingproblemsWith the proposal of and strong support for the four majorclades of placental mammals, as well as Boreoeutheria

(Euarchontoglires þ Laurasiatheria), there are only threeviable locations for the root of the placental tree[19,21–23]. These are between (i) Afrotheria and otherplacental orders, (ii) Xenarthra and other placental orders(as favored by morphology), and (iii) ATLANTOGENATA

(Xenarthra þ Afrotheria) and Boreoeutheria. Numericalsimulations [21] reject the latter two hypotheses, but thesetests might be too liberal in rejecting alternate hypothesesif real data are not simulated accurately according tocurrent models of sequence evolution [40]. Resolving theplacental root remains the most fundamental problem forfuture studies of placental phylogeny and has implicationsfor understanding early placental biogeography. For allthree competing hypotheses, molecular data give theseparation of South American xenarthrans and African-origin afrotheres as being ,100 million years ago, whichcoincides with the vicariant separation of South Americaand Africa. Whereas some workers have suggested acausal connection between these plate-tectonic dates andmolecular dates separating Xenarthra and Afrotheria[18,21], others dismiss this as coincidence [41].

Similar to the placement of the placental root, remaining problems associated with resolving relationships within the major clades involve minor perturbations of the tree shown in Figure 1b. The discovery of further RGCs will be crucial in testing alternate hypotheses that involve short time intervals [22]. Within Laurasiatheria, it is unclear if perissodactyls are more closely related to pangolins + carnivores or to Cetartiodactyla. Within Afrotheria, it has proved difficult to resolve the relationship among the three paenungulate orders (elephants, hyraxes, dugongs–manatees). By contrast, morphology strongly supports a sister-group relationship between Proboscidea and Sirenia (Tethytheria) [3,4,42], which is also supported by complete mitochondrial genomes [43].

Minority views

The emerging consensus for placental ordinal relationships (Figure 1b), with its four major clades that are supported by overwhelming sequence evidence and RGCs, is not without critics [4,14,44]. Arnason et al.’s [14] mtDNA analysis suggests that hedgehogs are dissociated from other core insectivores, such as shrews and moles, and were the earliest offshoot of the placental tree. Arnason et al. [14] also find that rodents, Glires, Euarchontoglires, and Boreoeutheria are all paraphyletic taxa. However, Lin et al. [27] found that mtDNA trees recover the same four clades as nuclear genes when outgroup taxa are removed. Peculiar features of rooted mtDNA trees can result from inadequate models of sequence evolution [27,28] and/or unbalanced taxon sampling [28,29]. In particular, some marsupials have unusual nucleotide compositions and there have been changes in the mutational process in both hedgehogs and murid rodents relative to most other placental mammal mitochondrial genomes [27]. These changes violate the assumptions of most methods of phylogeny reconstruction. For example, general time reversible models of nucleotide substitution assume that base composition remains the same in different lineages. Other analyses suggest that protein-coding regions of the

Figure 2. Parallel morphological radiations in Afrotheria and Laurasiatheria illustrate homoplasy in external morphology. (a) African golden mole (Chrysochlorinae) and (b) Old World mole (Talpinae); (c) Malagasy hedgehog (Tenrecinae) and (d) common hedgehog (Erinaceinae); (e) shrew tenrec (Oryzorictinae; Microgale thomasi; Copyright Link Olson) and (f) common shrew (Soricinae); (g) manatee (Trichechidae) and (h) dolphin (Delphininae); (i) aardvark (Orycteropodidae) and (j) pangolin (Maninae).

Review TRENDS in Ecology and Evolution Vol.19 No.8 August 2004 435

www.sciencedirect.com

Convergence is widespread!

convergent evolution of features related to volancy in bats and flying lemurs, but eliminates the need to postulate the loss of archontan ankle specializations in bats [32]. Complete mtDNA analyses recently placed flying lemurs within primates and render the latter PARAPHYLETIC [14]. However, SINE and LINE insertions [33] and analyses of nuclear genes [21,24] recover traditional primate MONOPHYLY. Within Laurasiatheria, Eulipotyphla (e.g. moles, shrews, hedgehogs) is the probable sister-taxon to the remaining orders. The emerging molecular support for a sister-group relationship between carnivores and pangolins includes concatenated nuclear sequences [21], mitochondrial protein sequences [14] and an RGC (Box 1). Morphologically, carnivores and pangolins are unique among living placental mammals in possessing an osseous tentorium that separates the cerebral and cerebellar compartments of the cranium [3].

Molecular data are also resolving relationships within orders, sometimes with unexpected results. In addition to nesting whales within Artiodactyla, molecular data separate hippos from other Suiformes (e.g. pigs) [10]. In Eulipotyphla, shrews and hedgehogs group to the exclusion of moles [25,34]. This result contrasts with morphological hypotheses that favor either moles + shrews to the exclusion of hedgehogs or moles + hedgehogs to the exclusion of shrews. In Rodentia, molecular data suggest a novel mouse-related clade that includes murids (mice and rats), dipodids (jerboas), castorids (beavers), geomyids (pocket gophers), heteromyids (pocket mice), anomalurids (scaly-tailed flying squirrels), and pedetids (springhares) [35]. This group had never been proposed based on morphological and paleontological data. Within Chiroptera (bats), both nuclear and mitochondrial sequences favor microbat paraphyly, which has profound implications for understanding the origins of laryngeal echolocation (Box 2).
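The distinction between monophyly and paraphyly used in the passage above can be checked mechanically on a rooted tree. The sketch below is only an illustration: the nested-tuple tree encoding, the function names and the four-taxon toy topology (with placeholder labels such as `microbat_A`) are assumptions made for this example, not data from Springer et al. A group is treated as monophyletic when some node of the tree has exactly that group as its set of tips.

```python
def leaves(tree):
    """Return the set of tip labels below a node (a string tip or a tuple of subtrees)."""
    if isinstance(tree, str):
        return {tree}
    tips = set()
    for child in tree:
        tips |= leaves(child)
    return tips


def is_monophyletic(tree, group):
    """True if some node of the rooted tree has exactly `group` as its tips."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical toy topology: one microbat lineage is closer to the megabat
# than to the other microbat lineage, so "microbats" do not form a clade.
toy_tree = ((("microbat_A", "megabat"), "microbat_B"), "outgroup")

all_bats = {"microbat_A", "microbat_B", "megabat"}
microbats = {"microbat_A", "microbat_B"}

print("bats monophyletic:", is_monophyletic(toy_tree, all_bats))
print("microbats monophyletic:", is_monophyletic(toy_tree, microbats))
```

On this toy tree the bats as a whole form a clade while the two microbat tips do not, which is the pattern the text calls microbat paraphyly.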

The deployment of morphological character evolution

Darwin [36] recognized that ANALOGICAL or adaptive characters would be almost valueless to the systematist

Figure 1. The prevailing morphological tree (a) and the emerging molecular tree (b) of the placental orders. (a) Morphology generally places Xenarthra (sloths, anteaters and armadillos) as basal, and most of the remaining orders into three well-established clades: Ungulata (thought to be derived from CONDYLARTH ancestors), Archonta and Anagalida. The depicted tree is from Shoshani and McKenna [3]. The tree obtained by Liu et al. [4] is identical, apart from placing cetaceans as sister group to the perissodactyl-paenungulate clade. The tree of Novacek ([6]; http://tolweb.org/tree?group=Eutheria&contgroup=Mammalia) places Pholidota (pangolins) as basal sister to Xenarthra, makes Primates and Scandentia (tree shrews) sister groups, and collapses several clades (black dotted lines). Novacek [5] subsequently collapses some further clades (gray dotted lines), which increases reconciliation with the molecular tree. (b) The molecular tree recognizes four major clades: Afrotheria, Xenarthra, Laurasiatheria and Euarchontoglires, of which the latter two are joined into Boreoeutheria. The presented placental ordinal topology is according to Murphy et al. [21]. Placing Marsupialia as sister to Placentalia is based on Phillips and Penny [54] and references therein. Clades indicated by solid lines are, with rare exceptions, supported independently by all other molecular data and analyses [24–29]. Notable exceptions are the strong tendency of mitochondrial protein sequences to place hedgehogs and rodents as basal in the tree [14]. Colors distinguish the four basal placental clades in the molecular tree.

[Figure 1 tree labels. Tips in (a): Marsupialia, Xenarthra, Pholidota, Rodentia, Lagomorpha, Macroscelidea, Primates, Scandentia, Dermoptera, Chiroptera, Insectivora, Carnivora, Cetacea, Artiodactyla, Perissodactyla, Hyracoidea, Proboscidea, Sirenia, Tubulidentata, Monotremata; clade labels: Ungulata, Archonta, Anagalida. Tips in (b): Marsupialia, Xenarthra, Pholidota, Rodentia, Lagomorpha, Macroscelidea, Primates, Scandentia, Dermoptera, Chiroptera, Eulipotyphla, Carnivora, Cetartiodactyla, Perissodactyla, Hyracoidea, Proboscidea, Tubulidentata, Afrosoricida, Monotremata, Sirenia; clade labels: Laurasiatheria, Euarchontoglires, Afrotheria, Xenarthra.]


Springer et al. 2004

[Figure 2 panels. a: phylogenetic tree with a scale bar of 0.0002 substitutions per site, a time axis running from March 2014 to January 2015, a country key (Sierra Leone, Guinea, Liberia, Mali), lineage labels GN1, GN2, GN3, GN4, SL3, Lineage A and Lineage B, and posterior support values between 0.71 and 1.0 at key nodes. b: root-to-tip divergence (0.0000 to 0.0010) plotted against collection date, December to January 2015 in monthly steps, with the same country key.]

Figure 2 | Phylogenetic relatedness and nucleotide sequence divergence of EBOV isolates from the 2013–2015 outbreak. a, Phylogenetic relatedness of EBOV isolates. Phylogenetic tree inferred using MrBayes [11] for full-length EBOV genomes sequenced from 179 patient samples obtained between March 2014 and January 2015. Displayed is the majority consensus of 10,000 trees sampled from the posterior distribution with mean branch lengths. Posterior support is shown for selected key nodes. Twenty-two samples originated in Liberia and were collected between March and August 2014 and six samples from Sierra Leone were obtained in June and July 2014. In our analysis we also included published sequences, including the three early Guinean sequences [2] and 78 sequences described by Gire et al. [6]. A number of lineages predominantly circulating in Guinea are denoted as GN1–4 along with a uniquely Sierra Leone lineage (SL3) recognised in Gire et al. [6]. b, EBOV nucleotide sequence divergence from root of the phylogeny in Fig. 2a plotted against time of collection of each virus. The date of the first documented case near Meliandou in eastern Guinea is indicated by the red triangle.
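A plot of root-to-tip divergence against collection date, as in panel b, is commonly summarised by a straight-line fit: the slope estimates the substitution rate and the point where the line reaches zero divergence gives a rough date for the root. The sketch below shows that calculation with ordinary least squares. The function name and the four data points are made up for illustration and are not values from Carroll et al.; the paper's own dating comes from its Bayesian analysis, not from this simple regression.

```python
def root_to_tip_regression(dates, divergences):
    """Least-squares fit of divergence (substitutions/site) on sampling date (years).

    Returns (rate, root_date): the slope in substitutions per site per year
    and the date at which the fitted line crosses zero divergence.
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(divergences) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, divergences))
    rate = sxy / sxx
    root_date = mean_t - mean_d / rate
    return rate, root_date


# Made-up example values (decimal years, substitutions per site).
toy_dates = [2014.2, 2014.5, 2014.8, 2015.0]
toy_divergences = [0.0003, 0.0006, 0.0009, 0.0011]

rate, root_date = root_to_tip_regression(toy_dates, toy_divergences)
print(f"rate = {rate:.2e} substitutions/site/year, root date = {root_date:.2f}")
```

A regression of this kind treats the tips as independent points, which they are not, so it is best read as a quick check for clock-like signal.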

[Figure 3 tree labels: lineages GN1, GN2, GN3, GN4, SL1, SL2 and SL3; Lineage A and Lineage B; a time axis from December to January 2015 in monthly steps; country key: Guinea, Liberia, Mali, Sierra Leone; posterior support values between 0.96 and 1.0 at the labelled nodes.]

Figure 3 | A time-scaled phylogenetic tree of 262 EBOV genomes from Guinea, Sierra Leone, Liberia and Mali. Shown is a maximum clade credibility tree constructed from 10,000 trees sampled from the posterior distribution with mean node ages. Clades described in Gire et al. [6] are identified here (SL1, SL2 and SL3) as well as a number of lineages predominantly circulating in Guinea and posterior probability support is given for these. For certain key node ages, 95% credible intervals are shown by horizontal bars.
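The posterior support values and the maximum clade credibility (MCC) tree mentioned in the caption both come from counting clades across the sampled trees. The sketch below assumes rooted trees written as nested tuples and uses a toy sample of four trees; the encoding, function names and sample are illustrative assumptions, not the software or data used in the paper. Support for a clade is the fraction of sampled trees that contain it, and the MCC tree is taken here as the sampled tree with the highest product of clade supports.

```python
import math
from collections import Counter


def clades(tree):
    """Return (tips, clades) for a rooted nested-tuple tree; clades are frozensets of tips."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    tips = frozenset()
    found = set()
    for child in tree:
        child_tips, child_clades = clades(child)
        tips |= child_tips
        found |= child_clades
    found.add(tips)
    return tips, found


def clade_support(trees):
    """Fraction of sampled trees that contain each clade."""
    counts = Counter()
    for tree in trees:
        counts.update(clades(tree)[1])
    return {clade: k / len(trees) for clade, k in counts.items()}


def max_clade_credibility(trees):
    """Sampled tree with the highest product of clade supports (compared on a log scale)."""
    support = clade_support(trees)
    return max(trees, key=lambda t: sum(math.log(support[c]) for c in clades(t)[1]))


# Toy posterior sample: three trees agree, one differs.
toy_sample = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]
support = clade_support(toy_sample)
print(support[frozenset({"A", "B"})], support[frozenset({"A", "C"})])
print(max_clade_credibility(toy_sample))
```

A majority consensus tree, as in Figure 2a, would instead keep every clade whose support exceeds 0.5, so it need not coincide with any single sampled tree.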

6 August 2015 | Vol 524 | Nature | 99

LETTER RESEARCH

© 2015 Macmillan Publishers Limited. All rights reserved

LETTER OPEN
doi:10.1038/nature14594

Temporal and spatial analysis of the 2014–2015 Ebola virus outbreak in West Africa

Miles W. Carroll1,2,3, David A. Matthews4*, Julian A. Hiscox5*, Michael J. Elmore1*, Georgios Pollakis5*, Andrew Rambaut6,7,8*, Roger Hewson1,2,9, Isabel García-Dorival5, Joseph Akoi Bore2,10,11, Raymond Koundouno2,10,11, Saïd Abdellati2,12, Babak Afrough1,2, John Aiyepada2,13, Patience Akhilomen2,13, Danny Asogun2,13, Barry Atkinson1,2, Marlis Badusche2,14,15, Amadou Bah2,16, Simon Bate1,2, Jan Baumann2,14, Dirk Becker2,15,17, Beate Becker-Ziaja2,14,15, Anne Bocquin2,18,19, Benny Borremans2,20, Andrew Bosworth1,2,5, Jan Peter Boettcher2,21, Angela Cannas2,22, Fabrizio Carletti2,22, Concetta Castilletti2,22, Simon Clark1,2, Francesca Colavita2,22, Sandra Diederich2,15,23, Adomeh Donatus2,13, Sophie Duraffour2,14,24, Deborah Ehichioya2,14,25, Heinz Ellerbrok2,21, Maria Dolores Fernandez-Garcia2,26, Alexandra Fizet2,18,27, Erna Fleischmann2,15,28, Sophie Gryseels2,20, Antje Hermelink2,21, Julia Hinzmann2,21, Ute Hopf-Guevara2,21, Yemisi Ighodalo2,13, Lisa Jameson1,2, Anne Kelterbaum2,15,17, Zoltan Kis2,29, Stefan Kloth2,21, Claudia Kohl2,21, Misa Korva2,30, Annette Kraus2,31, Eeva Kuisma1,2, Andreas Kurth2,21, Britta Liedigk2,14,15, Christopher H. Logue1,2, Anja Ludtke2,15,32, Piet Maes2,24, James McCowen1,2, Stephane Mely2,18,19, Marc Mertens2,15,23, Silvia Meschi2,22, Benjamin Meyer2,15,33, Janine Michel2,21, Peter Molkenthin2,15,28, Cesar Munoz-Fontela2,15,32, Doreen Muth2,15,33, Edmund N. C. Newman1,2, Didier Ngabo1,2, Lisa Oestereich2,14,15, Jennifer Okosun2,13, Thomas Olokor2,13, Racheal Omiunu2,13, Emmanuel Omomoh2,13, Elisa Pallasch2,14,15, Bernadett Palyi2,29, Jasmine Portmann2,34, Thomas Pottage1,2, Catherine Pratt1,2, Simone Priesnitz2,35, Serena Quartu2,22, Julie Rappe2,36, Johanna Repits2,37, Martin Richter2,21, Martin Rudolf2,14,15, Andreas Sachse2,21, Kristina Maria Schmidt2,21, Gordian Schudt2,15,17, Thomas Strecker2,15,17, Ruth Thom1,2, Stephen Thomas1,2, Ekaete Tobin2,13, Howard Tolley1,2, Jochen Trautner2,38, Tine Vermoesen2,12, Ines Vitoriano1,2, Matthias Wagner2,15,28, Svenja Wolff2,15,17, Constanze Yue2,21, Maria Rosaria Capobianchi2,22, Birte Kretschmer39, Yper Hall1, John G. Kenny40, Natasha Y. Rickett5, Gytis Dudas6, Cordelia E. M. Coltart41, Romy Kerber2,14,15, Damien Steer42, Callum Wright43, Francis Senyah1, Sakoba Keita44, Patrick Drury45, Boubacar Diallo46, Hilde de Clerck47, Michel Van Herp47, Armand Sprecher47, Alexis Traore48, Mandiou Diakite49, Mandy Kader Konde50, Lamine Koivogui11, N’Faly Magassouba10, Tatjana Avsic-Zupanc2,30, Andreas Nitsche2,21, Marc Strasser2,34, Giuseppe Ippolito2,22, Stephan Becker2,15,17, Kilian Stoecker2,15,28, Martin Gabriel2,14,15, Herve Raoul2,19, Antonino Di Caro2,22, Roman Wolfel2,15,28, Pierre Formenty45 & Stephan Gunther2,14,15*

West Africa is currently witnessing the most extensive Ebola virus (EBOV) outbreak so far recorded [1–3]. Until now, there have been 27,013 reported cases and 11,134 deaths. The origin of the virus is thought to have been a zoonotic transmission from a bat to a two-year-old boy in December 2013 (ref. 2). From this index case the virus was spread by human-to-human contact throughout Guinea, Sierra Leone and Liberia. However, the origin of the particular virus in each country and time of transmission is not known and currently relies on epidemiological analysis, which may be unreliable owing to the difficulties of obtaining patient information. Here we trace the genetic evolution of EBOV in the current outbreak that has resulted in multiple lineages. Deep sequencing of 179 patient samples processed by the European Mobile Laboratory, the first diagnostics unit to be deployed to the epicentre of the outbreak in Guinea, reveals an epidemiological and evolutionary history of the epidemic from March 2014 to January 2015. Analysis of EBOV genome evolution has also benefited from a similar sequencing effort of patient samples from Sierra Leone. Our results confirm that the EBOV from Guinea moved into Sierra Leone, most likely in April or early May. The viruses of the Guinea/Sierra Leone lineage mixed around June/July 2014. Viral sequences covering August, September and October 2014 indicate that this lineage evolved independently within Guinea. These data can be used in conjunction with epidemiological information to test retrospectively the effectiveness of control measures, and provides an unprecedented window into the evolution of an ongoing viral haemorrhagic fever outbreak.

We used a deep sequencing approach to gain insight into the evolution of Ebola virus (EBOV) in Guinea from the ongoing West African outbreak. This was an approach based on analysis pipelines developed

*These authors contributed equally to this work.

1Public Health England, Porton Down, Wiltshire SP4 0JG, UK. 2The European Mobile Laboratory Consortium, Bernhard-Nocht-Institute for Tropical Medicine, D-20359 Hamburg, Germany. 3University of Southampton, South General Hospital, Southampton SO16 6YD, UK. 4Department of Cellular and Molecular Medicine, School of Medical Sciences, University of Bristol, Bristol BS8 1TD, UK. 5Institute of Infection and Global Health, University of Liverpool, Liverpool L69 2BE, UK. 6Institute of Evolutionary Biology, University of Edinburgh, Edinburgh EH9 2FL, UK. 7Fogarty International Center, National Institutes of Health, Bethesda, Maryland 20892, USA. 8Centre for Immunology, Infection and Evolution, University of Edinburgh, Edinburgh EH9 2FL, UK. 9London School of Hygiene and Tropical Medicine, Keppel Street, London WC1E 7HT, UK. 10Universite Gamal Abdel Nasser de Conakry, Laboratoire des Fievres Hemorragiques en Guinee, Conakry, Guinea. 11Institut National de Sante Publique, Conakry, Guinea. 12Institute of Tropical Medicine, B-2000 Antwerp, Belgium. 13Institute of Lassa Fever Research and Control, Irrua Specialist Teaching Hospital, Irrua, Edo State, Nigeria. 14Bernhard Nocht Institute for Tropical Medicine, D-20359 Hamburg, Germany. 15German Centre for Infection Research (DZIF), 38124 Braunschweig, Germany. 16Swiss Tropical and Public Health Institute, University of Basel, CH-4002 Basel, Switzerland. 17Institute of Virology, Philipps University Marburg, 35043 Marburg, Germany. 18National Reference Center for Viral Hemorrhagic Fevers, 69365 Lyon, France. 19Laboratoire P4 Inserm-Jean Merieux, US003 Inserm, 69365 Lyon, France. 20Department of Biology, University of Antwerp, B-2020 Antwerp, Belgium. 21Robert Koch Institute, 13353 Berlin, Germany. 22National Institute for Infectious Diseases (INMI) Lazzaro Spallanzani, 00149 Rome, Italy. 23Friedrich Loeffler Institute, Federal Research Institute for Animal Health, 17493 Greifswald, Insel Riems, Germany. 24KU Leuven Rega institute, B-3000 Leuven, Belgium. 25Redeemer’s University, Osun State, Nigeria. 26Centro Nacional de Microbiologia, Instituto de Salud Carlos III, 28029 Madrid, Spain. 27Unite de Biologie des Infections Virales Emergentes, Institut Pasteur, 69365 Lyon, France. 28Bundeswehr Institute of Microbiology, 80937 Munich, Germany. 29National Center for Epidemiology, National Biosafety Laboratory, H-1097 Budapest, Hungary. 30Institute of Microbiology and Immunology, Faculty of Medicine, University of Ljubljana, SI-1000 Ljubljana, Slovenia. 31Public Health Agency of Sweden, 171 82 Solna, Sweden. 32Heinrich Pette Institute – Leibniz Institute for Experimental Virology, 20251 Hamburg, Germany. 33Institute of Virology, University of Bonn, 53127 Bonn, Germany. 34Federal Office for Civil Protection, Spiez Laboratory, CH-3700 Spiez, Switzerland. 35Bundeswehr Hospital, 22049 Hamburg, Germany. 36Institute of Virology and Immunology, CH-3147 Mittelhausern, Switzerland. 37Janssen-Cilag, SE-19207 Sollentuna, Sweden. 38Thunen Institute, D-22767 Hamburg, Germany. 39Eurice - European Research and Project Office GmbH, 10115 Berlin, Germany. 40Centre for Genomic Research, Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK. 41Department of Infection and Population Health, University College London, London WC1E 6JB, UK. 42Research IT, University of Bristol, Bristol BS8 1HH, UK. 43Advanced Computing Research Centre, University of Bristol, Bristol BS8 1HH, UK. 44Ministry of Health Guinea, Conakry, Guinea. 45World Health Organization, 1211 Geneva 27, Switzerland. 46World Health Organization, Conakry, Guinea. 47Medecins Sans Frontieres, B-1050 Brussels, Belgium. 48Section Prevention et Lutte contre la Maladie a la Direction Prefectorale de la Sante de Gueckedou, Gueckedou, Guinea. 49Universite Gamal Abdel Nasser de Conakry, CHU Donka, Conakry, Guinea. 50Health and Sustainable Development Foundation, Conakry, Guinea.


Page 6: Molecular Phylogenetics (1) Phylogeny in wide sense is a ... · Molecular Phylogenetics (1) Phylogeny in wide sense is a historical development of EEOB 563 Phylogeny (from phylum

ARTICLE
doi:10.1038/nature14447

Complex archaea that bridge the gap between prokaryotes and eukaryotes

Anja Spang1*, Jimmy H. Saw1*, Steffen L. Jørgensen2*, Katarzyna Zaremba-Niedzwiedzka1*, Joran Martijn1, Anders E. Lind1, Roel van Eijk1†, Christa Schleper2,3, Lionel Guy1,4 & Thijs J. G. Ettema1

The origin of the eukaryotic cell remains one of the most contentious puzzles in modern biology. Recent studies have provided support for the emergence of the eukaryotic host cell from within the archaeal domain of life, but the identity and nature of the putative archaeal ancestor remain a subject of debate. Here we describe the discovery of ‘Lokiarchaeota’, a novel candidate archaeal phylum, which forms a monophyletic group with eukaryotes in phylogenomic analyses, and whose genomes encode an expanded repertoire of eukaryotic signature proteins that are suggestive of sophisticated membrane remodelling capabilities. Our results provide strong support for hypotheses in which the eukaryotic host evolved from a bona fide archaeon, and demonstrate that many components that underpin eukaryote-specific features were already present in that ancestor. This provided the host with a rich genomic ‘starter-kit’ to support the increase in the cellular and genomic complexity that is characteristic of eukaryotes.

Cellular life is currently classified into three domains: Bacteria, Archaea and Eukarya. Whereas the cytological properties of Bacteria and Archaea are relatively simple, eukaryotes are characterized by a high degree of cellular complexity, which is hard to reconcile given that most hypotheses assume a prokaryote-to-eukaryote transition [1,2]. In this context, it seems particularly difficult to account for the suggested presence of the endomembrane system, the nuclear pores, the spliceosome, the ubiquitin protein degradation system, the RNAi machinery, the cytoskeletal motors and the phagocytotic machinery in the last eukaryotic common ancestor (ref. 3 and references therein). Ever since the recognition of the archaeal domain of life by Carl Woese and co-workers [4,5], Archaea have featured prominently in hypotheses for the origin of eukaryotes, as eukaryotes and Archaea represented sister lineages in Woese’s ‘universal tree’ [5]. The evolutionary link between Archaea and eukaryotes was further reinforced through studies of the transcription machinery [6] and the first archaeal genomes [7], revealing that many genes, including the core of the genetic information-processing machineries of Archaea, were more similar to those of eukaryotes [8] rather than to Bacteria. During the early stages of the genomic era, it also became apparent that eukaryotic genomes were chimaeric by nature [8,9], comprising genes of both archaeal and bacterial origin, in addition to genes specific to eukaryotes. Yet, whereas many of the bacterial genes could be traced back to the alphaproteobacterial progenitor of mitochondria, the nature of the lineage from which the eukaryotic host evolved remained obscure [1,10–13]. This lineage might either descend from a common ancestor shared with Archaea (following Woese’s classical three-domains-of-life tree [5]), or have emerged from within the archaeal domain (so-called archaeal host or eocyte-like scenarios [1,14–17]). Recent phylogenetic analyses of universal protein data sets have provided increasing support for models in which eukaryotes emerge as sister to or from within the archaeal ‘TACK’ superphylum [18–22], a clade originally comprising the archaeal phyla Thaumarchaeota, Aigarchaeota, Crenarchaeota and Korarchaeota [23]. In support of this relationship, comparative genomics analyses have revealed several eukaryotic signature proteins (ESPs) [24] in TACK lineages, including distant archaeal homologues of actin [25] and tubulin [26], archaeal cell division proteins related to the eukaryotic endosomal sorting complexes required for transport (ESCRT)-III complex [27], and several information-processing proteins involved in transcription and translation [2,17,23]. These findings suggest an archaeal ancestor of eukaryotes that might have been more complex than the archaeal lineages identified thus far [2,23,28]. Yet, the absence of missing links in the prokaryote-to-eukaryote transition currently precludes detailed predictions about the nature and timing of events that have driven the process of eukaryogenesis [1,2,17,28]. Here we describe the discovery of a new archaeal lineage related to the TACK superphylum that represents the nearest relative of eukaryotes in phylogenomic analyses, and intriguingly, its genome encodes many eukaryote-specific features, providing a unique insight in the emergence of cellular complexity in eukaryotes.

Genomic exploration of new TACK archaea

While surveying microbial diversity in deep marine sediments influenced by hydrothermal activity from the Arctic Mid-Ocean Ridge, 16S rRNA gene sequences belonging to uncultivated archaeal candidate lineages were identified in a gravity core (GC14) sampled approximately 15 km north-northwest of the active venting site Loki’s Castle [29] at 3283 m below sea level (73.763167° N, 8.464000° E) (Fig. 1a) [30,31]. Subsequent phylogenetic analyses of these sequences, which comprised ~10% of the obtained 16S reads, revealed that they belonged to the gamma clade of the Deep-Sea Archaeal Group/Marine Benthic Group B (hereafter referred to as DSAG) [31–33] (Fig. 1b–d and Supplementary Figs 1 and 2), a clade proposed to be deeply-branching in the TACK superphylum [23]. DSAG constitutes one of the most abundant and widely distributed archaeal groups in the deep marine biosphere, but so far none of its representatives have been cultured or sequenced [31].

To obtain genomic information for this archaeal lineage, we applied deep metagenomic sequencing to the GC14 sediment sample, resulting in a smaller (LCGC14, 8.6 Gbp) and a larger, multiple-strand displacement amplified (MDA) metagenome data set (LCGC14AMP, 56.6 Gbp; Fig. 2a; Supplementary Fig. 3 and Supplementary Table 1). Given the

*These authors contributed equally to this work.

1Department of Cell and Molecular Biology, Science for Life Laboratory, Uppsala University, SE-75123 Uppsala, Sweden. 2Department of Biology, Centre for Geobiology, University of Bergen, N-5020 Bergen, Norway. 3Division of Archaea Biology and Ecogenomics, Department of Ecogenomics and Systems Biology, University of Vienna, A-1090 Vienna, Austria. 4Department of Medical Biochemistry and Microbiology, Uppsala University, SE-75123 Uppsala, Sweden. †Present address: Groningen Institute for Evolutionary Life Sciences, University of Groningen, NL-9747AG Groningen, The Netherlands.

14 May 2015 | Vol 521 | Nature | 173

© 2015 Macmillan Publishers Limited. All rights reserved

key regulators of actin cytoskeleton dynamics, these small GTPases represent essential components for the process of phagocytosis in eukaryotes. Intriguingly, the analysis of Lokiarchaeal ESPs revealed a multitude of Ras-superfamily GTPases, comprising nearly 2% of the Lokiarchaeal proteome (Fig. 3b). The relative amount of small GTPases in the Lokiarchaeum genome is comparable to that observed in several unicellular eukaryotes, only being surpassed by the protist Naegleria gruberi. In contrast, bacterial and archaeal genomes encode only few, if any, small GTPase homologues of the Ras superfamily (Fig. 3b).

Phylogenetic analyses of the Lokiarchaeal small GTPases revealed that these represent several distinct clusters, each of which comprises several GTPase sequences (Fig. 3c and Supplementary Fig. 18). Although phylogenetic analyses failed to resolve most of the deeper nodes, several of the eukaryotic small GTPase families appear to share a common ancestry with Lokiarchaeal GTPases (Fig. 3c), suggesting an archaeal origin of specific subgroups of the eukaryotic small GTPases, followed by independent expansions in eukaryotes and Lokiarchaeota. This scenario contrasts with previous studies that have suggested that eukaryotic small GTPases were acquired from the alphaproteobacterial progenitor of mitochondria [37].

Although genes encoding canonical eukaryotic GTPase-activating proteins (GAPs) were absent in Lokiarchaeota, twelve roadblock/LC7-domain-containing proteins were identified (Supplementary Tables 6 and 10). While such proteins have been implicated in dynein organization in eukaryotes, roadblock/LC7 protein MglB of the bacterium Myxococcus xanthus was shown to act as a GAP of the small GTPase MglA [43]. Hence, the Lokiarchaeal roadblock/LC7 proteins represent possible candidates for alternative GAPs in this archaeon.

Presence of a primordial ESCRT complex

In eukaryotes, the ESCRT machinery represents an essential component of the multivesicular endosome pathway for lysosomal degradation of damaged or superfluous proteins, and it plays a role in several budding processes including cytokinesis, autophagy and viral budding [44]. The ESCRT machinery generally consists of the ESCRT-I–III subcomplexes, as well as associated subunits [45]. The analysis of the Lokiarchaeum genome revealed the presence of an ESCRT gene cluster (Fig. 4a), as well as of several additional proteins homologous to components of the eukaryotic multivesicular endosome pathway. For instance, Lokiarchaeum encodes divergent SNF7 domain proteins of the eukaryotic ESCRT-III complex, which appear to represent members of the Vps2/Vps24/Vps46 and Vps20/Vps32/Vps60 families, respectively. A phylogenetic analysis of the Lokiarchaeal SNF7 domain proteins revealed that these branch at the base of these two eukaryotic ESCRT-III families with low bootstrap support (Fig. 4b and Supplementary Fig. 19), not only indicating that they might represent ancestral SNF7 copies, but also suggesting that the last eukaryotic common ancestor already inherited two divergent SNF7-domain-encoding genes from its putative archaeal ancestor rather than a single gene [46]. Furthermore, the gene cluster encodes an ATPase that displays closest resemblance to eukaryotic VPS4-type ATPases, including katanin, membrane scaffold protein (MSP) and spastin (Fig. 4c and Supplementary Fig. 20) as well as hypothetical proteins that show significant similarity

[Figure 3 panels. a: tree of actin homologues with collapsed clades labelled Crenactin, Actin and related sequences, Arp1, Arp2 and Arp3, plus LCGC14AMP and Lokiarchaeum clusters; scale bar 0.4. b: bar chart of small GTPases, horizontal axis in per cent (0 to 2), for Naegleria gruberi, Lokiarchaeum, Dictyostelium discoideum, Homo sapiens, Tetrahymena thermophila SB210, Giardia intestinalis ATCC 50581, Saccharomyces cerevisiae S288c, Reticulomyxa filosa, Trypanosoma brucei brucei 927/4, Arabidopsis thaliana, Thalassiosira pseudonana, Methanopyrus kandleri AV19, Pyrobaculum aerophilum IM2, Aciduliprofundum boonei T469, Korarchaeum cryptofilum OPF8, Caldiarchaeum subterraneum, Myxococcus xanthus DK1622 and Nitrosopumilus maritimus SCM1. c: tree of small GTPases with collapsed clades including the Ran, Rho, Arf, Ras, Sar1 and Rab families and SRbeta, alongside Lokiarchaeum, Bacteria, Euryarchaeota and Crenarchaeota clusters; scale bar 0.4. Bootstrap values are printed at the nodes of both trees.]

Figure 3 | Identification and phylogeny of small GTPases and actin orthologues. a, Maximum-likelihood phylogeny of 378 aligned amino acid residues of actin homologues identified in Lokiarchaeum and in the LCGC14AMP metagenome, including eukaryotic actins, ARP1–3 homologues and crenactins [25]. Consecutive numbers in brackets refer to the number of sequences in a respective clade from LCGC14AMP and Lokiarchaeum, respectively. b, Relative amount of small GTPases (assigned to IPR006689 and IPR001806) in the Lokiarchaeum genome in comparison with other eukaryotic, archaeal and bacterial species. Numbers refer to total amount of small GTPases per predicted proteome. c, Maximum-likelihood phylogeny of 150 aligned amino acid residues of small Ras- and Arf-type GTPases (IPR006689 and IPR001806) in all domains of life. Numbers in brackets refer to the number of sequences in the respective clades. a, c, Sequence clusters comprising Lokiarchaeum and/or LCGC14AMP sequences (red), eukaryotes (blue) and Bacteria/Archaea (grey) have been collapsed. Bootstrap values above 50 are shown. Scale indicates the number of substitutions per site.
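The bootstrap values reported on these trees come from resampling the alignment. In the standard nonparametric bootstrap, columns of the alignment are drawn with replacement to build pseudo-replicate alignments of the same length, a tree is inferred from each replicate, and the support for a clade is the percentage of replicate trees that contain it. The sketch below shows only the resampling step, on a made-up three-sequence alignment; the function name and the toy sequences are assumptions for illustration, and tree inference itself is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping each column intact.

    `alignment` maps sequence names to equal-length aligned strings.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    # The same column indices are applied to every sequence.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][i] for i in columns) for name in names}


# Made-up toy alignment of three short amino-acid sequences.
toy_alignment = {"seq1": "MKVLA", "seq2": "MKILA", "seq3": "MRVLG"}

replicate = bootstrap_alignment(toy_alignment, random.Random(0))
for name, seq in replicate.items():
    print(name, seq)
```

Passing an explicit `random.Random` instance keeps the replicates reproducible; a real analysis would draw hundreds of replicates and infer a tree from each.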


LETTERdoi:10.1038/nature14249

Ancient proteins resolve the evolutionary history of Darwin’s South American ungulates

Frido Welker1,2, Matthew J. Collins1, Jessica A. Thomas1, Marc Wadsley1, Selina Brace3, Enrico Cappellini4, Samuel T. Turvey5, Marcelo Reguero6, Javier N. Gelfo6, Alejandro Kramarz7, Joachim Burger8, Jane Thomas-Oates9, David A. Ashford10, Peter D. Ashton10, Keri Rowsell1, Duncan M. Porter11, Benedikt Kessler12, Roman Fischer12, Carsten Baessmann13, Stephanie Kaspar13, Jesper V. Olsen14, Patrick Kiley15, James A. Elliott15, Christian D. Kelstrup14, Victoria Mullin16, Michael Hofreiter1,17, Eske Willerslev4, Jean-Jacques Hublin2, Ludovic Orlando4, Ian Barnes3 & Ross D. E. MacPhee18

No large group of recently extinct placental mammals remains as evolutionarily cryptic as the approximately 280 genera grouped as ‘South American native ungulates’. To Charles Darwin1,2, who first collected their remains, they included perhaps the ‘strangest animal[s] ever discovered’. Today, much like 180 years ago, it is no clearer whether they had one origin or several, arose before or after the Cretaceous/Palaeogene transition 66.2 million years ago3, or are more likely to belong with the elephants and sirenians of superorder Afrotheria than with the euungulates (cattle, horses, and allies) of superorder Laurasiatheria4–6. Morphology-based analyses have proved unconvincing because convergences are pervasive among unrelated ungulate-like placentals. Approaches using ancient DNA have also been unsuccessful, probably because of rapid DNA degradation in semitropical and temperate deposits. Here we apply proteomic analysis to screen bone samples of the Late Quaternary South American native ungulate taxa Toxodon (Notoungulata) and Macrauchenia (Litopterna) for phylogenetically informative protein sequences. For each ungulate, we obtain approximately 90% direct sequence coverage of type I collagen α1- and α2-chains, representing approximately 900 of 1,140 amino-acid residues for each subunit. A phylogeny is estimated from an alignment of these fossil sequences with collagen (I) gene transcripts from available mammalian genomes or mass spectrometrically derived sequence data obtained for this study. The resulting consensus tree agrees well with recent higher-level mammalian phylogenies7–9. Toxodon and Macrauchenia form a monophyletic group whose sister taxon is not Afrotheria or any of its constituent clades as recently claimed5,6, but instead crown Perissodactyla (horses, tapirs, and rhinoceroses). These results are consistent with the origin of at least some South American native ungulates4,6 from ‘condylarths’, a paraphyletic assembly of archaic placentals. With ongoing improvements in instrumentation and analytical procedures, proteomics may produce a revolution in systematics such as that achieved by genomics, but with the possibility of reaching much further back in time.

1BioArCh, University of York, York YO10 5DD, UK. 2Department of Human Evolution, Max Planck Institute for Evolutionary Anthropology, 04103 Leipzig, Germany. 3Department of Earth Sciences, Natural History Museum, London SW7 5BD, UK. 4Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Øster Voldgade 5–7, 1350 Copenhagen K, Denmark. 5Institute of Zoology, Zoological Society of London, London NW1 4RY, UK. 6CONICET - División Paleontología de Vertebrados, Museo de La Plata. Facultad de Ciencias Naturales y Museo de La Plata, Universidad Nacional de La Plata. Paseo del Bosque s/n, B1900FWA, La Plata, Argentina. 7Sección Paleontología de Vertebrados. Museo Argentino de Ciencias Naturales “Bernardino Rivadavia”, 470 Angel Gallardo Av., C1405DJR, Buenos Aires, Argentina. 8Institute of Anthropology, Johannes Gutenberg-University, Anselm-Franz-von-Bentzel-Weg 7, D-55128 Mainz, Germany. 9Department of Chemistry, University of York, York YO10 5DD, UK. 10Bioscience Technology Facility, Department of Biology, University of York, York YO10 5DD, UK. 11Department of Biological Sciences, Virginia Polytechnic Institute and State University, Blacksburg, Virginia 24061, USA. 12Target Discovery Institute, Nuffield Department of Medicine, University of Oxford, Roosevelt Drive, Oxford OX3 7FZ, UK. 13Applications Development, Bruker Daltonik GmbH, 28359 Bremen, Germany. 14Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Blegdamsvej 3b, 2200 Copenhagen, Denmark. 15Department of Materials Science and Metallurgy, University of Cambridge, Cambridge CB3 0FS, UK. 16Smurfit Institute of Genetics, Trinity College Dublin, Dublin 2, Ireland. 17Institute for Biochemistry and Biology, Karl-Liebknecht-Strasse 24–25, 14476 Potsdam OT Golm, Germany. 18Department of Mammalogy, American Museum of Natural History, New York, New York 10024, USA.

[Figure 1 graphics not reproduced. Recoverable labels: survival of 80-bp DNA fragments at 10 ka (%), binned from 0–0.1 to 0.9–1; map legend with Darwin’s samples, samples used in this study, MS/MS, and the HMS Beagle; plot of deamidated glutamines (%) against number of observed glutamines for M. patachonica, T. platensis, Equus sp. and Mylodon sp.]

Figure 1 | Samples used in this investigation. a, Predicted survival of an 80-base-pair (bp) DNA fragment after 10,000 years (10 ka) modelled using the rate given in ref. 29. b, Location of finds by Darwin1,2 and of samples used in this study (basemap30). c, Glutamine deamidation ratios for bone samples from the sequenced Pleistocene SANUs are high compared with coeval horse (MACN Pv 5719) as well as modern hippopotamus and tapir, providing support for the authenticity of the ancient sequences (see Supplementary Information).

©2015 Macmillan Publishers Limited. All rights reserved

4 JUNE 2015 | VOL 522 | NATURE | 81

South American native ungulates (SANUs) are conventionally organized into five orders (Litopterna, Notoungulata, Astrapotheria, Xenungulata, and Pyrotheria) that are sometimes grouped together as a separate placental superorder (Meridiungulata)10. They appear very early in the Palaeogene record and evolved thereafter along many divergent lines, as their abundant fossil record attests. Most lineages had become extinct by the end of the Miocene epoch, although a few species of litopterns and notoungulates persisted into the Late Pleistocene epoch. Despite continuing interest in their evolutionary history (for example refs 5, 11–14), phylogenetic relationships of the major SANU clades to one another and to other placentals remain poorly understood (see Supplementary Information). Although some recent investigations (for example refs 4–6) have suggested that basal South American members of Litopterna conclusively group with certain Holarctic condylarths, and are thus best placed in Euungulata (Laurasiatheria), several other studies claim to have identified potential synapomorphies linking various SANU taxa with Afrotheria5,6,15,16. This latter view is broadly consistent with such indicators as prolonged late Mesozoic faunal exchange

[Figure 2 graphics not reproduced. Recoverable labels: scale bar 0.02; posterior probabilities on nodes (0.53 to 1); clade labels Aves, Monotremata, Marsupialia, Afrotheria, Xenarthra, Primates, Scandentia, Rodentia, Lagomorpha, Lipotyphla, Chiroptera, Carnivora, Artiodactyla and Perissodactyla; inset labelled Panperissodactyla with Toxodon sp., Macrauchenia sp., Equus sp. (Tapalqué), Equus caballus, Equus asinus, Tapirus terrestris and Ceratotherium simum, all inset nodes labelled 1.]

Figure 2 | Relationship of Toxodon (Notoungulata) and Macrauchenia (Litopterna) to other placental mammals. Fifty per cent majority rule Bayesian consensus tree of COL1 protein sequence data, with chicken (Gallus) as outgroup. Scale bar indicates branch length, expressed as the expected number of substitutions per site. Major clades (orders and superorders) are colour coded; species names in bold indicate collagen sequences derived from MS/MS rather than genomic data, fossil taxa depicted in silhouette. Inset: in all tree-reconstructions conducted (see Supplementary Information), Toxodon and Macrauchenia (dark grey) group monophyletically at the base of crown Perissodactyla (light green) with 100% posterior probability, forming Panperissodactyla.
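The "fifty per cent majority rule" consensus named in the caption can be illustrated with a minimal sketch. It assumes each sampled tree is reduced to the set of clades it contains; the four toy trees below are invented for illustration and are not the paper's posterior sample, and the function name `majority_rule_clades` is hypothetical.

```python
from collections import Counter


def majority_rule_clades(trees, threshold=0.5):
    """Return clades found in more than `threshold` of the sampled trees.

    Each tree is a set of clades; each clade is a frozenset of taxon names.
    The value attached to a clade is its frequency in the sample, which in
    a Bayesian analysis is read as the clade's posterior probability.
    """
    counts = Counter(clade for tree in trees for clade in tree)
    n = len(trees)
    return {clade: c / n for clade, c in counts.items() if c / n > threshold}


# Toy sample of four trees (invented data, for illustration only).
sample = [
    {frozenset({"Toxodon", "Macrauchenia"}), frozenset({"Equus", "Tapirus"})},
    {frozenset({"Toxodon", "Macrauchenia"}), frozenset({"Equus", "Ceratotherium"})},
    {frozenset({"Toxodon", "Macrauchenia"}), frozenset({"Equus", "Tapirus"})},
    {frozenset({"Toxodon", "Macrauchenia"}), frozenset({"Tapirus", "Ceratotherium"})},
]

consensus = majority_rule_clades(sample)
for clade, support in consensus.items():
    print(sorted(clade), support)
```

In this toy sample only the Toxodon + Macrauchenia clade passes the threshold; the Equus + Tapirus clade appears in exactly half of the trees and is therefore collapsed, which is why poorly supported nodes show up as polytomies in a majority rule consensus.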


Horizontal Transfer of Entire Genomes via Mitochondrial Fusion in the Angiosperm Amborella

Danny W. Rice,1 Andrew J. Alverson,1* Aaron O. Richardson,1 Gregory J. Young,1† M. Virginia Sanchez-Puerta,1‡ Jérôme Munzinger,2§ Kerrie Barry,3 Jeffrey L. Boore,3|| Yan Zhang,4 Claude W. dePamphilis,4 Eric B. Knox,1 Jeffrey D. Palmer1¶

We report the complete mitochondrial genome sequence of the flowering plant Amborella trichopoda. This enormous, 3.9-megabase genome contains six genome equivalents of foreign mitochondrial DNA, acquired from green algae, mosses, and other angiosperms. Many of these horizontal transfers were large, including acquisition of entire mitochondrial genomes from three green algae and one moss. We propose a fusion-compatibility model to explain these findings, with Amborella capturing whole mitochondria from diverse eukaryotes, followed by mitochondrial fusion (limited mechanistically to green plant mitochondria) and then genome recombination. Amborella’s epiphyte load, propensity to produce suckers from wounds, and low rate of mitochondrial DNA loss probably all contribute to the high level of foreign DNA in its mitochondrial genome.

Many of the fundamental properties of eukaryotes arose from horizontal evolution on a grand scale, that is, the endosymbiotic origin of the mitochondrion and plastid from bacterial progenitors (1). Since their birth, however, mitochondrial and plastid genomes seem to have been little affected by horizontal gene transfer (HGT). The most notable exception involves land plants, especially flowering plants (angiosperms), in which HGT is common in the mitochondrial genome but unknown in plastids (2–10).

To gain insight into the causes and consequences of HGT in mitochondrial DNA (mtDNA), we sequenced the mitochondrial genome of Amborella trichopoda because polymerase chain reaction–based sampling had shown it to be rich in foreign genes (4). This large shrub is endemic to rain forests of New Caledonia and is probably sister to all other angiosperms, a divergence dating back about 200 million years (11, 12).

Overall Genome Properties

The Amborella mitochondrial genome assembled as five autonomous, circular-mapping chromosomes of lengths 3179, 244, 187, 137, and 119 kb, giving a total genome size of 3,866,039 base pairs (bp) (Fig. 1A and figs. S1 to S4) (13). The five chromosomes are distinct in sequence but similar in base composition (45 to 47% G + C), stoichiometry, and HGT properties (Fig. 1A and figs. S2 and S4). Stoichiometry was assessed by sequencing coverage and Southern blot analysis of 32 individuals from three populations (fig. S5) (13).
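As a quick consistency check on the figures quoted above, the five rounded chromosome lengths can be summed and compared with the reported total. This is only a sketch: the per-chromosome values are the kb-rounded numbers from the text, so agreement is expected only to within rounding error.

```python
# Chromosome lengths in kb, as quoted in the text (rounded values).
chromosome_kb = [3179, 244, 187, 137, 119]
reported_total_bp = 3_866_039

total_kb = sum(chromosome_kb)
difference_bp = abs(total_kb * 1000 - reported_total_bp)

# Each length is rounded to the nearest kb, so the sum may be off by
# up to 500 bp per chromosome.
rounding_tolerance_bp = 500 * len(chromosome_kb)

print(f"sum of rounded lengths: {total_kb} kb")
print(f"difference from reported total: {difference_bp} bp")
print("consistent within rounding:", difference_bp <= rounding_tolerance_bp)
```

The rounded lengths account for the reported genome size, which confirms that the five chromosomes together make up the roughly 3.9-Mb genome described in the abstract.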

As described in the next three sections, Amborella mtDNA possesses an extensive and diverse collection of foreign sequences, corresponding to about six genome equivalents of mtDNA acquired from mosses, angiosperms, and green algae. Multigene HGT has been described in two other lineages of plant mtDNA (8, 10), but not on a scale approaching Amborella. The Amborella mitochondrial genome also contains a large amount (138 kb) of plastid DNA (ptDNA) (Fig. 1A, fig. S2, and table S1).

Multichromosomal mitochondrial genomes in plants were only recently discovered (14, 15) and mostly involve large (>1 Mb) genomes, with Silene genomes of 6.7 and 11.3 Mb dwarfing Amborella in size and chromosome number (15). These three mitochondrial genomes are the largest completely assembled organelle genomes, larger than many bacterial genomes and even some nuclear genomes. However, the processes responsible for their expansion differ in that Silene genomes possess no readily discernible foreign mtDNA and relatively little ptDNA (15).

HGT from Mosses

Amborella mtDNA contains four regions, of lengths 48, 40, 9, and 4 kb, acquired from moss mtDNA (Fig. 1A and fig. S2). With one exception, the 41 protein and ribosomal RNA (rRNA) genes from these four regions were placed phylogenetically, almost always strongly, as sister to the moss Physcomitrella (Fig. 2, A to D, and figs. S8 and S9). Gene order in the four regions (Fig. 3 and fig. S6) is highly similar to both Physcomitrella and Anomodon (mosses that are themselves identical in gene order and content) (16) and extremely different from angiosperms. The mosslike regions in Amborella also harbor the same 27 introns and largely the same set of intergenic sequences as moss mtDNAs (fig. S6) (13).

The four moss regions contain one, and only one, copy of 61 of the 65 genes present in sequenced moss mtDNAs (Fig. 3 and fig. S6) (13). Taking into account six inferred deletions and duplications larger than 100 bp, the 101.8 kb of moss DNA in Amborella reconstructs to a hypothetical donor genome of 106.0 kb, compared with the 104.2- and 105.3-kb genomes in Physcomitrella and Anomodon, respectively. We infer, therefore, that Amborella captured an entire mitochondrial genome (13) from a moss with nearly identical mtDNA architecture to those of Physcomitrella and Anomodon. This foreign genome subsequently rearranged into four pieces, with a few gene-order changes and 11 gene losses, truncations, and/or partial duplications, all of which are associated with rearrangement breakpoints (Figs. 1A and 3, figs. S2 and S6, and table S2).

HGT from Green Algae

The Amborella mitochondrial genome contains an average of three green algal–derived copies of each protein and rRNA gene commonly found in green algal mtDNAs (Figs. 1A and 2, A to D; figs. S2, S4, S8, S10, and S11; and table S3). Many of these genes are clustered in two large tracts of lengths 83 and 61 kb. The 83-kb tract (B1 + A2 in Fig. 1A) contains two copies of a 10-gene cluster (each marked by 10 red arrows in the top comparison of Fig. 4), with all 10 “duplicates” highly divergent from each other. The 61-kb tract (B2 + A1 in Fig. 1A) lacks these 10 genes and instead contains highly divergent duplicates of two genes that are absent from the 83-kb tract. A single hypothesized recombination event between these two tracts (Figs. 1A and 4) accounts for the above duplications, with the initial, 92- and 52-kb regions each containing a nearly complete set of green algal mitochondrial genes and no extra copies (fig. S11). We conclude that the 83-kb and 61-kb tracts arose by acquisition of whole mitochondrial genomes (designated the A and B genomes) from two green algae, followed by a single recombination between them and a few gene losses (13). Additionally, the two inferred donor genomes are phylogenetically distinct: Whenever Amborella has three or more green algal copies of a given gene, the A-genome copy is separated by a relatively long branch from a well-supported clade containing the other green algal copies (Fig. 2, A and D, and fig. S8). Furthermore, the two regions assigned to the A genome have a lower noncoding G + C composition (39%) than the two B-genome regions (47%) (table S4).
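The single-recombination explanation can be checked against the tract sizes with a small sketch. It assumes a simple reciprocal exchange between two linear regions at one breakpoint each and ignores the "few gene losses" mentioned in the text; the function name `reciprocal_exchange` and the integer-kb breakpoint grid are illustrative assumptions, not the authors' method.

```python
def reciprocal_exchange(len_x, len_y, break_x, break_y):
    """Swap the segments downstream of one breakpoint in each region.

    Returns the lengths of the two recombinant tracts:
    (head of x + tail of y, head of y + tail of x).
    """
    tail_x = len_x - break_x
    tail_y = len_y - break_y
    return break_x + tail_y, break_y + tail_x


initial_kb = (92, 52)    # inferred pre-recombination regions
observed_kb = (83, 61)   # tracts observed in Amborella mtDNA

# A reciprocal exchange conserves total length.
lengths_conserved = sum(initial_kb) == sum(observed_kb)

# Enumerate integer-kb breakpoint pairs that reproduce the observed tracts.
consistent = [
    (bx, by)
    for bx in range(initial_kb[0] + 1)
    for by in range(initial_kb[1] + 1)
    if reciprocal_exchange(*initial_kb, bx, by) == observed_kb
]

print("total length conserved:", lengths_conserved)
print("breakpoint pairs consistent with the tract sizes:", len(consistent))
```

Both pairs of lengths sum to 144 kb, as a single reciprocal exchange requires. Many breakpoint pairs reproduce the observed sizes, so the tract lengths alone do not locate the breakpoints; in the text that information comes from the gene content of each tract.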

Most of the remaining green algal mtDNA in Amborella, comprising tracts of lengths 49, 18, 16, and 2 kb (Fig. 1A and fig. S2), also appears,

RESEARCH ARTICLES

1Department of Biology, Indiana University, Bloomington, IN 47405, USA. 2Institut de Recherche pour le Développement (IRD), UMR Botanique et Bioinformatique de l’Architecture des Plantes (AMAP), Laboratoire de Botanique et d’Ecologie Végétale Appliquées, Nouméa, New Caledonia. 3Department of Energy Joint Genome Institute, Walnut Creek, CA 94598, USA. 4Department of Biology, Penn State University, University Park, PA 16802, USA.

*Present address: Department of Biological Sciences, University of Arkansas, Fayetteville, AR 72701, USA.
†Present address: DuPont Pioneer, Wilmington, DE 19880, USA.
‡Present address: Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) and Universidad Nacional de Cuyo, Mendoza, Argentina.
§Present address: IRD UMR AMAP, TA A51/PS2, 34398 Montpellier cedex 5, France.
||Present address: Genome Project Solutions, Hercules, CA 94547, USA.
¶Corresponding author. E-mail: [email protected]

20 DECEMBER 2013 VOL 342 SCIENCE www.sciencemag.org 1468

rates than Amborella (fig. S17) (13, 18). Third, levels of sequence identity to other angiosperm mtDNAs were measured on a genome-wide basis to define native as well as angiosperm-HGT regions (13). Finally, native (or angiosperm-HGT) sequences defined by the above four criteria and located within 5 kb of each other were combined into continuous native (or angiosperm-HGT) tracts (13).
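The final step, combining classified sequences that lie within 5 kb of each other into continuous tracts, is a gap-tolerant interval merge. The sketch below shows one way to do it; the function name `merge_within`, the inclusive treatment of the 5-kb limit, and the toy coordinates are assumptions for illustration, and the authors' actual procedure (ref. 13) may differ in detail.

```python
def merge_within(intervals, max_gap):
    """Merge (start, end) intervals separated by at most `max_gap`.

    Intervals are sorted by start coordinate; an interval is folded into
    the previous tract when the gap between them does not exceed max_gap.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough: extend the current tract.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Too far away: start a new tract.
            merged.append([start, end])
    return [tuple(tract) for tract in merged]


# Toy coordinates in kb (invented for illustration).
hgt_regions = [(0, 10), (13, 20), (30, 42), (46, 50)]
tracts = merge_within(hgt_regions, max_gap=5)
print(tracts)
```

With these toy regions the 3-kb and 4-kb gaps are bridged while the 10-kb gap is not, leaving two tracts. In the analysis described above, native and angiosperm-HGT sequences would be merged separately, each class with its own list of intervals.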

These analyses identified 753 kb of DNA as having been acquired from other angiosperms (Fig. 1A and figs. S2 and S4). This DNA contains an average of 2.0 copies of the 32 protein and rRNA genes that are virtually always present in angiosperm mtDNA (table S3) (17) and thus corresponds to roughly two genome equivalents of foreign angiosperm mtDNA. Most (86%) of the 753 kb is intergenic, consistent with the high proportion of intergenic mtDNA in angiosperms (11, 13). About half of the 753 kb shares ≥90% sequence identity with one or more sequenced angiosperm mitochondrial genomes (fig. S4). This far surpasses the level of highly conserved mtDNA in other angiosperms (fig. S18) (13). The 753-kb estimate is probably conservative owing to the limited number of angiosperm mtDNAs available for comparison (13).

Angiosperm Donors

One class of plastid-derived DNA played a key role in donor identification. Phylogenetic analysis shows that most of the 138 kb of ptDNA present in Amborella mtDNA was acquired through intracellular gene transfer (IGT), that is, from the Amborella plastid genome (Fig. 2, E to H, and fig. S19). Analysis of the remaining 10 kb of ptDNA, which probably entered Amborella from foreign mitochondria, identified donors with much greater specificity than did the mitochondrial gene analyses (13). Four of the HGT plastid regions identified Fagales, Oxalidales, or the predominantly parasitic Santalales as the donor, while a fifth pointed to Magnoliidae (Fig. 2, E to H, and fig. S18). A santalalean origin is also supported by four of the five mitochondrial genes for which multiple Santalales have been sampled (fig. S14, nad1b, and fig. S20). The exceptionally high and specific similarity of two featureless regions to Ricinus communis or Bambusa oldhamii (Fig. 1B and fig. S21) identified transfers from these lineages. Finally, the exceptionally high divergence that diagnosed six angiosperm-like genes as foreign also suggests that they came from additional donors, with high mitochondrial substitution rates.

Because some angiosperm-HGT tracts in Amborella mtDNA are of mixed phylogenetic origin (Fig. 1) (13), some of its foreign DNA may be the product of serial, angiosperm-to-angiosperm-to-angiosperm HGT (13). In particular, the rbcL gene of santalalean origin (Fig. 2E) resides only 3 kb from the Bambusa-derived sequence on the same 27-kb foreign tract (Fig. 1B). Because all four genes of meaningful length on this tract evidently came from core eudicots (fig. S14), and

Fig. 2. Maximum likelihood evidence for HGT in Amborella mtDNA. (A to D) Mitochondrial gene trees of land plants and green algae reveal diverse donors in Amborella mtDNA. Colors are as in Fig. 1. See fig. S8 for outgroups. Bootstrap values ≥50% are shown. The number after each Amb (Amborella) sequence corresponds to its left-most coordinate in kb (figs. S2 and S4). Scale bars correspond to 0.1 [(A) to (D)] or 0.01 [(E) to (H)] substitutions per site. Bold branches are reduced in length by 50%. (E to H) Plastid gene trees of angiosperms showing strong support for HGT to the level of taxonomic order: light blue, Santalales [(E) and (F)]; brown, Oxalidales (G); violet, Fagales (H). Amborella labels: Amb plastid, gene in Amborella plastid; Amb IGT; gene in mitochondrion via IGT; red Amb, gene in mitochondrion via HGT. Outgroups are not shown, but see fig. S19 for more taxon-rich analyses, including outgroups. rps7 denotes the rps7-rps12-trnV-rrnS cluster.

[Fig. 2 graphics not reproduced. Panels: (A) atp1, (B) atp4, (C) atp8, (D) cob, (E) rbcL, (F) psbCD, (G) psaA, (H) rps7. Each panel is a gene tree with bootstrap values on nodes; Amborella sequences are labelled Amb (with a coordinate in kb), Amb plastid, or Amb IGT.]


Sequencing Individuals from Additional Turquoise Killifish Strains Reveals Variants in Aging-Related Genes

Within the turquoise killifish species, there exist several strains with reported differences in lifespan in specific laboratory environments (Kirschner et al., 2012; Terzibasi et al., 2008) (Figures 5A, S5A, and 6B), and these differences could be leveraged to understand the genetic architecture of lifespan. To assess the genetic differences among turquoise killifish strains, we sequenced at lower coverage individuals from two additional strains that were captured in Mozambique in 2004 and 2007 (MZM-0403 and MZM-0703, respectively) and from a control GRZ individual (Figure 5A). This analysis uncovered over three million single nucleotide polymorphisms (SNPs) between

[Figure 3 graphics not reproduced. Panel titles: A, Phylogenetic tree and lifespan; B, Genes under positive selection in the turquoise killifish; C, Selected GO term enrichment for the genes under positive selection; D, Functional effect prediction for the sites under positive selection. Recoverable labels: protein-coding genes (28,494); 1-to-1 orthologs with other fish genomes (13,637); tree groups Teleost fish, Lobefin fish, Tetrapods and Invertebrates; lifespan in years (log scale); GO categories Signaling, Metabolism, Development, Proteasome and Immunity; enrichment (log scale); p-value; PROVEAN score (Deleterious/Neutral); SIFT score (Deleterious/Tolerated).]

Figure 3. Evolutionary Analysis of the Turquoise Killifish Genome

(A) Phylogenetic tree of 20 animal species, including the turquoise killifish, based on 619 one-to-one orthologs (Table S2C). Number on nodes: level of confidence (% bootstrap support). Scale bar: evolutionary distance (substitution per site). Maximum lifespan data are from our experimental data (turquoise killifish) or from the AnAge database (other fish species), and represented as a heat map.

(B) Proportion and analysis of the genes under positive selection in the turquoise killifish compared to 7 other fish species after multiple hypothesis correction (FDR < 5%). See also Figure S3A.

(C) Selected GO term enrichment for the genes under positive selection in the turquoise killifish. The number of genes associated with each category is indicated in brackets after the term description, and enrichment values are indicated in colored scale. See also Table S3C.

(D) Predicted functional effect on the protein of residues under positive selection in the turquoise killifish, based on SIFT (top row) and PROVEAN (bottom row). Residues are ordered from left to right based on the rank-product of the SIFT and PROVEAN scores. Only sites scored by both methods are displayed. See also Figure S3B and Tables S3D and S4G.

1544 Cell 163, 1539–1554, December 3, 2015 ©2015 Elsevier Inc.

Resource

The African Turquoise Killifish Genome Provides Insights into Evolution and Genetic Architecture of Lifespan

Graphical Abstract

Highlights

• De novo genome assembly and annotation of the African turquoise killifish

• Key aging genes are under positive selection in the turquoise killifish

• Differences in lifespan between killifish strains are genetically linked to sex

• A resource for comparative genomics and experimental aging studies

Authors

Dario Riccardo Valenzano, Berenice A. Benayoun, Param Priya Singh, ..., Andreas Beyer, Eric A. Johnson, Anne Brunet

[email protected] (D.R.V.), [email protected] (A.B.)

In Brief

The genome of the African turquoise killifish, an exceptionally short-lived fish, is a useful resource to explore the genetic principles and the evolution of unique traits in lifespan and embryonic diapause. Linkage analysis suggests that short lifespan could have co-evolved with sex determination.

Valenzano et al., 2015, Cell 163, 1539–1554
December 3, 2015 ©2015 Elsevier Inc.
http://dx.doi.org/10.1016/j.cell.2015.11.008

Page 7: Molecular Phylogenetics (1) Phylogeny in wide sense is a ... · Molecular Phylogenetics (1) Phylogeny in wide sense is a historical development of EEOB 563 Phylogeny (from phylum

LETTER doi:10.1038/nature15697

A comprehensive phylogeny of birds (Aves) using targeted next-generation DNA sequencing

Richard O. Prum1,2*, Jacob S. Berv3*, Alex Dornburg1,2,4, Daniel J. Field2,5, Jeffrey P. Townsend1,6, Emily Moriarty Lemmon7 & Alan R. Lemmon8

Although reconstruction of the phylogeny of living birds has progressed tremendously in the last decade, the evolutionary history of Neoaves (a clade that encompasses nearly all living bird species) remains the greatest unresolved challenge in dinosaur systematics. Here we investigate avian phylogeny with an unprecedented scale of data: >390,000 bases of genomic sequence data from each of 198 species of living birds, representing all major avian lineages, and two crocodilian outgroups. Sequence data were collected using anchored hybrid enrichment, yielding 259 nuclear loci with an average length of 1,523 bases for a total data set of over 7.8 × 10^7 bases. Bayesian and maximum likelihood analyses yielded highly supported and nearly identical phylogenetic trees for all major avian lineages. Five major clades form successive sister groups to the rest of Neoaves: (1) a clade including nightjars, other caprimulgiforms, swifts, and hummingbirds; (2) a clade uniting cuckoos, bustards, and turacos with pigeons, mesites, and sandgrouse; (3) cranes and their relatives; (4) a comprehensive waterbird clade, including all diving, wading, and shorebirds; and (5) a comprehensive landbird clade with the enigmatic hoatzin (Opisthocomus hoazin) as the sister group to the rest. Neither of the two main, recently proposed Neoavian clades, Columbea and Passerea1, were supported as monophyletic. The results of our divergence time analyses are congruent with the palaeontological record, supporting a major radiation of crown birds in the wake of the Cretaceous–Palaeogene (K–Pg) mass extinction.
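The data-set sizes quoted in the abstract follow from the locus count, mean locus length, and number of taxa. The sketch below reproduces the arithmetic; it assumes every taxon has data for every locus at the mean length, which real alignments with missing data will not satisfy exactly, and it is unclear from the abstract whether the stated total includes the two outgroups, so both cases are shown.

```python
loci = 259
mean_locus_length = 1523        # bases, average per locus
bird_species = 198
crocodilian_outgroups = 2

# Approximate sequence length per taxon across all loci.
bases_per_taxon = loci * mean_locus_length

# Total data-set size with and without the two outgroups.
total_with_outgroups = bases_per_taxon * (bird_species + crocodilian_outgroups)
total_birds_only = bases_per_taxon * bird_species

print(f"bases per taxon: {bases_per_taxon:,}")
print(f"total, birds plus outgroups: {total_with_outgroups:.2e}")
print(f"total, birds only: {total_birds_only:.2e}")
```

Under either assumption the per-taxon figure exceeds 390,000 bases and the total exceeds 7.8 × 10^7 bases, matching the two values given in the abstract.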

Birds (Aves) are the most diverse lineage of extant tetrapod verte-brates. They comprise over 10,000 living species2, and exhibit an extra-ordinary diversity in morphology, ecology, and behaviour3. Substantialprogress has been made in resolving the phylogenetic history of birds.Phylogenetic analyses of both molecular and morphological data sup-port the monophyletic Palaeognathae (the tinamous and flightlessratites) and Galloanserae (gamebirds and waterfowl) as successive,monophyletic sister groups to the Neoaves—a diverse clade includingall other living birds4. Resolving neoavian phylogeny has proven to be adifficult challenge because this radiation was very rapid and deep intime, resulting in very short internodes4.

In the last decade, phylogenetic analyses of large, multilocus data sets have resulted in the proposal of numerous, novel neoavian relationships. For example, a clade consisting of diving and wading birds has been consistently recovered, as well as a large landbird clade in which falcons and parrots are successive sister groups to the perching birds4–8. Recently, phylogenetic analyses of 48 whole avian genomes resulted in the proposal of a novel phylogenetic resolution of the initial branching sequence within Neoaves1. Although this genomic study provided much needed corroboration of many neoavian clades, the limited taxon sampling precluded further insights into the evolutionary history of birds.

It has long been recognized that phylogenetic confidence depends not only on the number of characters analysed and their rate of evolution, but also on the number and relationships of the taxa sampled relative to the nodes of interest9–11. Theory predicts that sampling a single taxon that diverges close to a node of interest will have a far greater effect on phylogenetic resolution than will adding more characters11. Despite using an alignment of >40 million base pairs, sparse sampling of 48 species in the recent avian genomic analysis may not have been sufficient to confidently resolve the deep divergences among major lineages of Neoaves. Thus, expanded taxon sampling is required to test the monophyly of neoavian clades, and to further resolve the phylogenetic relationships within Neoaves.

Here, we present a phylogenetic analysis of 198 bird species and 2 crocodilians (Supplementary Table 1) based on loci captured using anchored enrichment12. Our sample includes species of 122 avian families in all 40 extant avian orders2, with denser representation of non-oscine birds (108 families) than of oscine songbirds (14 families). Effort was made to include taxa that would break up long phylogenetic branches, and provide the highest likelihood of resolving short internodes at the base of Neoaves11. We also sampled multiple species within groups whose monophyly or phylogenetic interrelationships have been controversial, that is, tinamous, nightjars, hummingbirds, turacos, cuckoos, pigeons, sandgrouse, mesites, rails, storm petrels, petrels, storks, herons, hawks, hornbills, mousebirds, trogons, kingfishers, barbets, seriemas, falcons, parrots, and suboscine passerines.

We targeted 394 loci centred on conserved anchor regions of the genome that are flanked by more variable regions12. We performed all phylogenetic analyses on a data set of 259 genes with the highest quality assemblies. The average locus was 1,524 bases in length (361–2,316 base pairs (bp)), and the total percentage of missing data was 1.84%. The concatenated alignment contained 394,684 sites. To minimize overall model complexity while accurately accounting for substitution processes, we performed a partition model sensitivity analysis with PartitionFinder13,14, and compared a complex partition model (one partition per locus) to a heuristically optimized (rclust) partition model. Phylogenetic informativeness (PI) approaches15,16 provided strong evidence that the phylogenetic utility of our data set was high, with low declines in PI profiles for individual loci, data set partitions, and the concatenated matrix (Supplementary Fig. 4). We estimated concatenated trees in ExaBayes17 and RAxML18 using a 75-partition model. Coalescent species trees were estimated with the gene tree summation methods in STAR19, NJst20, and ASTRAL21 from gene trees estimated with RAxML (see Methods).
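Concatenation joins the per-locus alignments end to end and records where each locus starts and stops, so that partitions can be assigned afterwards. The following is a minimal sketch on toy data, not the authors' pipeline: the function names `concatenate` and `missing_fraction`, the taxon label `outgroup`, the toy sequences, and the choice to count `?`, `-`, and `N` as missing are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

Alignment = Dict[str, str]          # taxon name -> aligned sequence for one locus
Partition = Tuple[str, int, int]    # (locus name, first site, last site), 1-based inclusive


def concatenate(loci: Dict[str, Alignment],
                missing: str = "?") -> Tuple[Dict[str, str], List[Partition]]:
    """Join per-locus alignments into one supermatrix with partition boundaries."""
    taxa = sorted({taxon for aln in loci.values() for taxon in aln})
    pieces: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    partitions: List[Partition] = []
    start = 1
    for name, aln in loci.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"locus {name!r} is not aligned (unequal lengths)")
        n_sites = lengths.pop()
        for taxon in taxa:
            # A taxon absent from this locus is padded with the missing symbol.
            pieces[taxon].append(aln.get(taxon, missing * n_sites))
        partitions.append((name, start, start + n_sites - 1))
        start += n_sites
    matrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return matrix, partitions


def missing_fraction(matrix: Dict[str, str], missing_chars: str = "?-N") -> float:
    """Fraction of cells in the supermatrix that hold a missing-data symbol."""
    total = sum(len(seq) for seq in matrix.values())
    absent = sum(seq.count(ch) for seq in matrix.values() for ch in missing_chars)
    return absent / total


# Toy input: two loci, three taxa, one taxon missing from the second locus.
loci = {
    "locus1": {"Gallus": "ACGT", "Anas": "ACGA", "outgroup": "ACTT"},
    "locus2": {"Gallus": "GGC", "Anas": "GGT"},
}
matrix, partitions = concatenate(loci)
print(partitions)
print(f"missing data: {missing_fraction(matrix):.2%}")
```

The partition list produced here has one entry per locus, which corresponds to the "one partition per locus" scheme mentioned above; merging loci into fewer partitions, as a heuristic search would, is outside the scope of this sketch.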

Our concatenated Bayesian analyses resulted in a completely resolved, well supported phylogeny. All clades had a posterior probability (PP) of 1, except for a single clade including shoebill (Balaeniceps) and pelican (PP = 0.54) (Fig. 1). The concatenated maximum likelihood analysis recovered a single topology that was identical to the Bayesian tree except for three clades, all of which are far from the base of Neoaves: the relationships among pigeons; among skimmers, gulls, and terns; and among pelicans, shoebill, and waders (Supplementary Fig. 1). Almost all clades in the maximum likelihood tree were maximally supported with bootstrap scores (BS) of 1.00, but nine clades within Neoaves (including four of the most inclusive neoavian clades) received support <0.70 (Supplementary Fig. 1). Coalescent species tree analyses produced substantially different hypotheses for neoavian relationships (Supplementary Fig. 3), but

*These authors contributed equally to this work.

1 Department of Ecology & Evolutionary Biology, Yale University, New Haven, Connecticut 06520, USA. 2 Peabody Museum of Natural History, Yale University, New Haven, Connecticut 06520, USA. 3 Department of Ecology and Evolutionary Biology, Fuller Evolutionary Biology Program, Cornell University, and Cornell Laboratory of Ornithology, Ithaca, New York 14853, USA. 4 North Carolina Museum of Natural Sciences, Raleigh, North Carolina 27601, USA. 5 Department of Geology & Geophysics, Yale University, New Haven, Connecticut 06520, USA. 6 Department of Biostatistics, and Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA. 7 Department of Biological Science, Florida State University, Tallahassee, Florida 32306, USA. 8 Department of Scientific Computing, Florida State University, Tallahassee, Florida 32306, USA.

22 OCTOBER 2015 | VOL 526 | NATURE | 569
© 2015 Macmillan Publishers Limited. All rights reserved

Figure 1 | Continued. Tree panel not reproduced here. It shows the continuation of Neoaves (Inopinaves), with higher-taxon labels Coraciimorphae, Australaves, Passeriformes, and Accipitriformes, clade numbers 7–97, and a time axis from 70 to 0 Ma spanning the Upper Cretaceous, Palaeogene (Palaeocene, Eocene, Oligocene), Neogene (Miocene, Pliocene), and Quaternary (Pleistocene).


Figure 1 tree panel not reproduced here. Higher-taxon labels: Aves, Palaeognathae, Galloanserae, Neoaves, Strisores, Columbaves, Gruiformes, and Aequorlitornithes; subgroup labels: Tinam., Galliformes, Anseriform., Apodiform., Otidimorph., and Columbimorph. Clade numbers 1–6 and 98–197. Time axis from 70 to 0 Ma spanning the Upper Cretaceous, Palaeogene (Palaeocene, Eocene, Oligocene), Neogene (Miocene, Pliocene), and Quaternary (Pleistocene).

Figure 1 | Phylogeny of birds. Time-calibrated phylogeny of 198 species of birds inferred from a concatenated, Bayesian analysis of 259 anchored phylogenomic loci using ExaBayes17. Figure continues on the opposite page from green arrow at the bottom of this panel. Complete taxon data in Supplementary Table 1. Higher taxon names appear at right. All clades are supported with posterior probability (PP) of 1.0, except for the Balaeniceps–Pelecanus clade (PP = 0.54; clade 109). The five major, successive, neoavian sister clades are: Strisores (brown), Columbaves (purple), Gruiformes (yellow), Aequorlitornithes (blue), and Inopinaves (green). Background colours mark geological periods. Ma, million years ago; Ple, Pleistocene; Pli, Pliocene; Q., Quaternary. Clade numbers refer to the plot of estimated divergence dates (Supplementary Fig. 7). Fossil age-calibrated nodes are shown in grey. Illustrations of representative bird species30 are depicted by their lineages. See Supplementary Information for details and further discussion.
