Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data...

14
Syst. Biol. 66(3):399–412, 2017 © The Author(s) 2016. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: [email protected] DOI:10.1093/sysbio/syw092 Advance Access publication October 14, 2016 Misconceptions on Missing Data in RAD-seq Phylogenetics with a Deep-scale Example from Flowering Plants DEREN A. R. EATON ,ELIZABETH L. SPRIGGS,BRIAN PARK, AND MICHAEL J. DONOGHUE Department of Ecology and Evolutionary Biology, Yale University, PO Box 208106, New Haven, CT, 06520, USA; Correspondence to be sent to: Department of Ecology and Evolutionary Biology, Yale University, PO Box 208106, New Haven, CT, 06520, USA; E-mail: [email protected] Received 29 March 2016; reviews returned 12 June 2016; accepted 10 October 2016 Associate Editor: Laura Kubtoko Abstract.—Restriction-site associated DNA (RAD) sequencing and related methods rely on the conservation of enzyme recognition sites to isolate homologous DNA fragments for sequencing, with the consequence that mutations disrupting these sites lead to missing information. There is thus a clear expectation for how missing data should be distributed, with fewer loci recovered between more distantly related samples. This observation has led to a related expectation: that RAD-seq data are insufficiently informative for resolving deeper scale phylogenetic relationships. Here we investigate the relationship between missing information among samples at the tips of a tree and information at edges within it. We re-analyze and review the distribution of missing data across ten RAD-seq data sets and carry out simulations to determine expected patterns of missing information. We also present new empirical results for the angiosperm clade Viburnum (Adoxaceae, with a crown age >50 Ma) for which we examine phylogenetic information at different depths in the tree and with varied sequencing effort. The total number of loci, the proportion that are shared, and phylogenetic informativeness varied dramatically across the examined RAD-seq data sets. Insufficient or uneven sequencing coverage accounted for similar proportions of missing data as dropout from mutation-disruption. Simulations reveal that mutation- disruption, which results in phylogenetically distributed missing data, can be distinguished from the more stochastic patterns of missing data caused by low sequencing coverage. In Viburnum, doubling sequencing coverage nearly doubled the number of parsimony informative sites, and increased by >10X the number of loci with data shared across >40 taxa. Our analysis leads to a set of practical recommendations for maximizing phylogenetic information in RAD-seq studies. [hierarchical redundancy; phylogenetic informativeness; quartet informativeness; Restriction-site associated DNA (RAD) sequencing; sequencing coverage; Viburnum.] Restriction-site associated DNA sequencing (RAD- seq) is a method for subsampling the genome to concentrate sequencing efforts as a way to efficiently attain high coverage data across many individuals for comparative genomics (Miller et al. 2007; Baird et al. 2008). Because it relies on the use of restriction enzymes to digest genomic DNA and isolate homologous fragments, the conservation of enzyme recognition sites is critical for recovering shared data among sampled individuals. For this reason, RAD-seq methods are expected to yield fewer shared data between more highly divergent taxa where the opportunity for mutations to disrupt shared restriction sites has been greater. This has led to a common conception that RAD-seq data are not applicable to deep-scale phylogenetic analyses. A general motivation for combining multiple sequence markers (loci) into a single joint phylogenetic analysis is the expectation that a combination of markers contains more overall information than could be attained by examining each individually. Concatenation (de Queiroz and Gatesy 2007), binning (Bayzid et al. 2015), and supertree methods (Sanderson et al. 1998) represent just several of the many ways in which phylogenetic information can be combined. This logic extends similarly to the case of RAD-seq, in which there are often few loci that contain information for all taxa in a data set. When many loci with variable but overlapping taxon sets are combined there can be thousands of loci that inform any given split in a tree. A great concern, however, for applying RAD-seq data in this framework has to do with the distribution of missing data. If it is highly nonrandom, there may be too few markers with sufficiently overlapping taxon sets to infer certain splits in a tree — particularly those deeper in time. The total proportion of missing data in supermatrices composed of RAD sequences has received considerable attention (Rubin et al. 2012; Cariou et al. 2013; Hipp et al. 2014; Huang and Knowles 2016; Leaché et al. 2015; Ree and Hipp 2015). However, there has been much less investigation of the underlying cause of missing data, and how different sources may contribute to its distribution. In many cases, it has been assumed that most missing information comes from a single source: allelic drop-out caused by mutations. However, there are multiple sources that can give rise to missing RAD-seq data: (i) variable recovery during library preparation due to technical issues related to digestion, size selection, and DNA quality (Escudero et al. 2014); (ii) bioinformatic errors in identifying homology (Eaton 2014); (iii) mutation-generation of new fragments that are not shared by common descent among all sampled lineages; (iv) mutation-disruption of ancestral fragments such that they are no longer shared by all descendant lineages (Rubin et al. 2012; Cariou et al. 2013); and (5) insufficient or uneven amplification and sequencing coverage (Huang and Knowles 2016). The relationship between missing data at the tips of a tree and the loss of phylogenetic information across its edges (splits) depends on a number of factors, including topology and branch lengths, as well as the source of missing data, which determines its distribution. For example, data loss from mutation-disruption will be hierarchical, with a phylogenetic pattern matching that of mutations; whereas low (or uneven) sequencing 399

Transcript of Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data...

Page 1: Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data in RAD-seq Phylogenetics with a ... led to a common conception that RAD-seq data

Syst. Biol. 66(3):399–412, 2017© The Author(s) 2016. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved.For Permissions, please email: [email protected]:10.1093/sysbio/syw092Advance Access publication October 14, 2016

Misconceptions on Missing Data in RAD-seq Phylogenetics with aDeep-scale Example from Flowering Plants

DEREN A. R. EATON∗, ELIZABETH L. SPRIGGS, BRIAN PARK, AND MICHAEL J. DONOGHUE

Department of Ecology and Evolutionary Biology, Yale University, PO Box 208106, New Haven, CT, 06520, USA;∗Correspondence to be sent to: Department of Ecology and Evolutionary Biology, Yale University, PO Box 208106, New Haven,

CT, 06520, USA; E-mail: [email protected]

Received 29 March 2016; reviews returned 12 June 2016; accepted 10 October 2016Associate Editor: Laura Kubtoko

Abstract.—Restriction-site associated DNA (RAD) sequencing and related methods rely on the conservation of enzymerecognition sites to isolate homologous DNA fragments for sequencing, with the consequence that mutations disruptingthese sites lead to missing information. There is thus a clear expectation for how missing data should be distributed,with fewer loci recovered between more distantly related samples. This observation has led to a related expectation: thatRAD-seq data are insufficiently informative for resolving deeper scale phylogenetic relationships. Here we investigatethe relationship between missing information among samples at the tips of a tree and information at edges within it.We re-analyze and review the distribution of missing data across ten RAD-seq data sets and carry out simulations todetermine expected patterns of missing information. We also present new empirical results for the angiosperm cladeViburnum (Adoxaceae, with a crown age >50 Ma) for which we examine phylogenetic information at different depths inthe tree and with varied sequencing effort. The total number of loci, the proportion that are shared, and phylogeneticinformativeness varied dramatically across the examined RAD-seq data sets. Insufficient or uneven sequencing coverageaccounted for similar proportions of missing data as dropout from mutation-disruption. Simulations reveal that mutation-disruption, which results in phylogenetically distributed missing data, can be distinguished from the more stochastic patternsof missing data caused by low sequencing coverage. In Viburnum, doubling sequencing coverage nearly doubled the numberof parsimony informative sites, and increased by >10X the number of loci with data shared across >40 taxa. Our analysisleads to a set of practical recommendations for maximizing phylogenetic information in RAD-seq studies. [hierarchicalredundancy; phylogenetic informativeness; quartet informativeness; Restriction-site associated DNA (RAD) sequencing;sequencing coverage; Viburnum.]

Restriction-site associated DNA sequencing (RAD-seq) is a method for subsampling the genome toconcentrate sequencing efforts as a way to efficientlyattain high coverage data across many individuals forcomparative genomics (Miller et al. 2007; Baird et al.2008). Because it relies on the use of restriction enzymesto digest genomic DNA and isolate homologousfragments, the conservation of enzyme recognition sitesis critical for recovering shared data among sampledindividuals. For this reason, RAD-seq methods areexpected to yield fewer shared data between more highlydivergent taxa where the opportunity for mutations todisrupt shared restriction sites has been greater. This hasled to a common conception that RAD-seq data are notapplicable to deep-scale phylogenetic analyses.

A general motivation for combining multiple sequencemarkers (loci) into a single joint phylogenetic analysis isthe expectation that a combination of markers containsmore overall information than could be attained byexamining each individually. Concatenation (de Queirozand Gatesy 2007), binning (Bayzid et al. 2015), andsupertree methods (Sanderson et al. 1998) representjust several of the many ways in which phylogeneticinformation can be combined. This logic extendssimilarly to the case of RAD-seq, in which there areoften few loci that contain information for all taxa in adata set. When many loci with variable but overlappingtaxon sets are combined there can be thousands of locithat inform any given split in a tree. A great concern,however, for applying RAD-seq data in this frameworkhas to do with the distribution of missing data. If it ishighly nonrandom, there may be too few markers with

sufficiently overlapping taxon sets to infer certain splitsin a tree — particularly those deeper in time.

The total proportion of missing data in supermatricescomposed of RAD sequences has received considerableattention (Rubin et al. 2012; Cariou et al. 2013; Hippet al. 2014; Huang and Knowles 2016; Leaché et al. 2015;Ree and Hipp 2015). However, there has been muchless investigation of the underlying cause of missingdata, and how different sources may contribute toits distribution. In many cases, it has been assumedthat most missing information comes from a singlesource: allelic drop-out caused by mutations. However,there are multiple sources that can give rise to missingRAD-seq data: (i) variable recovery during librarypreparation due to technical issues related to digestion,size selection, and DNA quality (Escudero et al. 2014);(ii) bioinformatic errors in identifying homology (Eaton2014); (iii) mutation-generation of new fragments thatare not shared by common descent among all sampledlineages; (iv) mutation-disruption of ancestral fragmentssuch that they are no longer shared by all descendantlineages (Rubin et al. 2012; Cariou et al. 2013); and(5) insufficient or uneven amplification and sequencingcoverage (Huang and Knowles 2016).

The relationship between missing data at the tips of atree and the loss of phylogenetic information across itsedges (splits) depends on a number of factors, includingtopology and branch lengths, as well as the source ofmissing data, which determines its distribution.

For example, data loss from mutation-disruption willbe hierarchical, with a phylogenetic pattern matchingthat of mutations; whereas low (or uneven) sequencing

399

Page 2: Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data in RAD-seq Phylogenetics with a ... led to a common conception that RAD-seq data

400 SYSTEMATIC BIOLOGY VOL. 66

(a) (b) (c) (d)

FIGURE 1. Expected distribution of missing information (grey) from a) mutation-disruption and b) low sequencing coverage. c) Quartetinformativeness is a multi-locus measure of phylogenetic information content that measures the number of markers with sufficient taxon (tip)sampling to be potentially informative about a specific split/bipartition (black) in a tree. For a marker to be informative about a given split atleast one taxon from each of the four connected edges (colored clades or vertical bars) must have data for the marker. d) Splits that have a greaternumber of descendant taxa that can be sampled from each connected edge have greater hierarchical redundancy.

coverage will generally produce a more stochasticdistribution of missing sequences that is little if atall influenced by the relationships among samples(Fig. 1a-b). Examined phylogenetically, the first type ofmissing data perpetuates information loss deeper intothe tree, because entire clades are likely to be missinginformation, whereas missing data resulting from lowsequencing coverage will tend to be stochasticallydistributed toward the tips.

Here, we examine patterns of shared RAD-seq dataobserved in 10 empirical data sets to identify the mostlikely sources of missing data, and to investigate itseffect on phylogenetic information content. Nine of thesedata sets are from published studies of a variety oforganisms, and one is from a new study, presentedhere, of the angiosperm clade Viburnum (Adoxaceae).For the Viburnum data set, we examine the distributionof shared data across both shallow and deep splitswithin the tree, and investigate how sequencing coverageinfluences phylogenetic informativeness. These resultsare compared with simulations to explore expectedpatterns of data sharing when missing data arise fromdifferent sources. Together, our analyses yield a set ofpractical guidelines for improving the effectiveness ofphylogenetic studies using RAD-seq data.

MATERIALS AND METHODS

SimulationsWe developed a Python program, simrrls

(http://github.com/dereneaton/simrrls), to simulateRADseq-like sequence data based on coalescentsimulations performed with the Python package

egglib (Mita and Siol 2012). To model realistic levelsof mutation-disruption, haplotypes are dropped if(i) a mutation arises in the restriction recognitionsite (or sites if two cutters are used) relative to theancestral sequence, or (ii) a mutation gives rise to a newrestriction recognition site within an existing fragmentand the resulting shortened fragment falls outside of adefined size selection window (Supplementary FigureS1 in Supplementary Material, available on Dryad athttp://dx.doi.org/10.5061/dryad.g549v). To modelvariable sequencing coverage the number of readssampled from each haplotype is drawn from a normaldistribution. Mutations occur under the Jukes–Cantormodel of sequence evolution with equal starting basefrequencies.

We simulated 1000 diploid loci on three differenttopologies of equal tree lengths (6 coalescent units);all lineages had a constant population size of 1e6, andper-site mutation rate 1e−9 (�=0.04). This yieldedhighly divergent loci (mean=20 single nucleotidepolymorphisms, SNPs), which were chosen in order tomodel the extreme at which mutation-disruption couldgive rise to missing data, but still fall within the rangewhere homology could be identified in bioinformaticanalyses. Data were simulated with a single cutter(e.g., RAD) or with two cutters (e.g., ddRAD; (Petersonet al. 2012)). The first used an 8-bp cutter, whereas thesecond used an 8-bp cutter in addition to a 4-bp cutter.When allowing for mutation-disruption, haplotypeswere discarded if a RAD fragment was digested to lessthan 300 bp in length, or a mutation occurred within thecut site. When allowing for low sequencing depth, readswere sampled from each haplotype with mean=2 andsd=5, or mean=5 and sd=5.

Page 3: Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data in RAD-seq Phylogenetics with a ... led to a common conception that RAD-seq data

2017 EATON ET AL.—MISCONCEPTIONS ABOUT MISSING RAD-SEQ DATA 401

To compare results with no missing sequencesagainst those with only mutation-disruption, or onlylow sequencing coverage, data were simulated underthree combinations of parameter settings, and on threedifferent topologies (Fig. S2 available on Dryad). Eachtopology had an identical tree length but differed inshape. This included a completely balanced topologywith 64 tips, a completely imbalanced topology with 64tips, and an empirical topology inferred for the plantclade Viburnum (65 tips; empirical data set 1, describedbelow).

We define a metric, quartet informativeness, as thenumber of loci in a multi-locus data set that havesufficient taxon sampling to be potentially informativeabout a given split (bipartition) in an unrooted tree(Fig. 1c and d). We used this metric to compareinformation loss at different edge depths for bothsimulated and empirical data sets that vary in theirprimary source of missing data, and thus the distributionof missing data across different edge depths in theirtrees.

Empirical Data SetsWe examined 10 RAD-seq data sets for which the

raw data were accessible, spanning a range of crownages, sizes, and library preparation methods (Table S1available on Dryad). These included the following:65 samples of Viburnum (generated for this study,described below); 56 samples of Heliconius (Nadeauet al. 2013); 36 samples of Quercus (Eaton et al. 2015);32 samples of Ohomopterus (Takahashi et al. 2014); 31samples of Danio (McCluskey and Postlethwait 2015); 13samples of Pedicularis (Eaton and Ree 2013); 64 samplesof Orestias (Takahashi and Moreno 2015); 74 samplesof Phrynosomatidae (Leaché et al. 2015); 13 samplesof Barnacles (Herrera et al. 2015); and 25 samples ofFinches (DaCosta and Sorenson 2016). To minimize theinfluence of bioinformatic variables on our results, alldata sets were assembled de novo under the same generalparameter settings in the program pyrad v.3.0.66 (Eaton2014); see supplemental notebooks (Table S1 availableon Dryad). This software uses an alignment clusteringalgorithm (https://github.com/torognes/vsearch)that allows for insertion–deletion polymorphismsto minimize clustering bias between close versusdistant relatives. Reads were clustered at 85% sequencesimilarity and we required a minimum depth of coverageof six for clusters to be retained in the assembly. Eachdata set was assembled into two final alignments: a“min2” data set that included all loci shared by at leasttwo samples (non-singletons), and a “min4” data setthat included all loci shared by at least four samples.Because data must be shared across four tips to attainphylogenetic information for an unrooted tree, thelatter data set represents the maximum phylogeneticinformation in the data. The min4 assembly wasused to infer a tree for each data set using RAxMLv.8.1.16 (Stamatakis 2014) under the GTR+� nucleotidesubstitution model.

To investigate sources of missing data, we usedphylogenetic generalized least squares (PGLS; Symondsand Blomberg 2014) to fit a regression between theobserved number of loci shared among sets of fourindividuals (quartets), their amount of raw sequencedata, and their phylogenetic distance. Phylogeneticdistance was measured as the sum of GTR+� branchlengths joining quartet samples, and raw sequence datawas measured as the log median number of reads amongquartet samples. All values were mean standardizedprior to analysis.

Because our data represent quartets of species, ratherthan individual species, we implemented a modifiedPGLS method. Many sets of quartets share taxa incommon, or span the same internal edges of thetree, and thus are expected to have experienced thesame mutation-disruption of RAD-seq loci. To modeltheir nonindependence, we constructed a variance–covariance matrix measuring shared branch lengthsamong all sets of quartet samples. Models were fit usingthe gls function in the R package nlme (Pinheiro et al.2016) with an imposed correlation structure derivedfrom the quartet covariance matrix. Following (Symondsand Blomberg 2014) we estimated phylogenetic signal(�) for each data set and used this value to transformoff-diagonal elements of the covariance matrix such thatif no phylogenetic signal is present the phylogeneticcorrection has no effect. For large data sets with manytaxa the covariance matrices were too large to analyze in asingle analysis, and so we used a subsampling approachto randomly sample 200 data points, and repeated this100 times to attain a distribution of model estimates foreach data set.

Viburnum PhylogenyThe flowering plant clade Viburnum (Adoxaceae)

includes approximately 165 species of shrubs and smalltrees and has an estimated crown age of 50–60 Ma(Spriggs et al. 2015). Currently, the best estimate ofViburnum phylogeny is based on whole chloroplastgenomes for 22 species representing all major clades,plus a 10-gene data set for 138 species including 9cpDNA markers and nuclear ribosomal ITS sequences(Clement et al. 2014; Spriggs et al. 2015). Although manyrelationships are now well supported, several of thedeepest splits in the tree are subtended by very shortbranches and receive low support, and several recent andrapid radiations remain unresolved due to insufficientvariation.

A RAD-seq library was prepared for 95 individualsrepresenting 65 species from all major clades withinViburnum. For the present study, we included onlyone representative per species by selecting the samplewith the most data when replicates were present(Table S2). Genomic DNA was extracted from leaftissues preserved in silica or from herbarium specimens,using Qiagen DNEasy kits (Valencia, CA). Multipleextractions from the same individual were sometimesrequired to obtain 1.0 �g of high molecular weight

Page 4: Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data in RAD-seq Phylogenetics with a ... led to a common conception that RAD-seq data

402 SYSTEMATIC BIOLOGY VOL. 66

DNA. RAD libraries were prepared by Floragenex inc.(http://floragenex.com) using a 6-bp restriction enzyme(PstI) for digestion followed by sonication. To minimizethe influence of technical effects (e.g., stochasticity inamplification, sequencing, and PCR duplicates) on ourresults a second library was also prepared following thesame protocol, and with a separate amplification. Thetwo libraries were sequenced on separate lanes of anIllumina HiSeq 2000 at the University of Oregon GC3Ffacility (http://gc3f.uoregon.edu).

To evaluate the influence of sequencing effort onour results we first analyzed the data using only onelane of sequence data, and then pooled replicatesfrom both lanes. Both data sets were assembled inpyrad with the same parameters described above. Werefer to these as the “half” and “full” data sets.For each, the “min4” assembly was used to infera maximum likelihood phylogeny in RAxML with100 nonparametric bootstrap replicates. In addition,we inferred an unrooted species tree (i.e., consistentunder the multi-species coalescent) with the programtetrad v.0.4.0 (http://github.com/dereneaton/ipyrad).This software implements the SVDQuartets algorithm(Chifman and Kubatko 2014) to infer quartet trees usingthe full SNP alignment for each sampled quartet oftaxa. We inferred all 677,040 possible quartets for 65taxa. The quartet trees are then joined into a supertreeusing the quartet-maxcut algorithm implemented bywQMC (Avni et al. 2015). This approach is well suitedfor RAD-seq data because it maximizes phylogeneticinformation available for each quartet of sampled taxaregardless of missing data among other taxa. We ran 100nonparametric bootstrap replicates where each replicateresamples RAD-seq loci with replacement. For eachsampled quartet, in each replicate, a single SNP israndomly sampled from the four-taxon alignment ateach locus for which they share data. Heterozygoussites are randomly resolved in each replicate, sinceindividually sampled SNPs are putatively unlinked. Wereport the tree as an extended majority-rule consensuswith bootstrap supports. For visualization, all trees wererooted along the V. clemensiae branch based on previousanalyses using outgroups (Clement et al. 2014).

A detailed description of an expanded Viburnum RADphylogeny, its correspondence with cpDNA trees, and itsimplications for character evolution and biogeographywill be presented elsewhere. Here we focus ourdiscussion on how sequencing coverage impacted thedistribution of phylogenetic information at differentdepths in the tree. Specifically, we focused on tworegions of the phylogeny that have been especiallydifficult to resolve — one near the base of the treeand one near the tips. Near the base of the tree weused one representative of each of seven well-supportedmajor clades to specifically test the relationship of V.taiwanianum (of the Urceolata clade) to V. amplificatumplus V. lutescens (as supported in cpDNA trees) versusto V. lantanoides (of the Pseudotinus clade, as suggestedby morphological data). Near the tips we analyzedrelationships among members of the Oreinodentinus

clade, which here includes four species representingthe rapid and previously weakly resolved radiation ofViburnum in neotropical cloud forests.

For both focused phylogenetic analyses we inferreda primary concordance tree with the program BUCKy(Larget et al. 2010) to examine the proportion of locithat recover each split while accounting for gene-treeestimation error and heterogeneity. Loci were onlyincluded in the analysis if they were sampled forall taxa chosen for the focused analysis. From thesewe sub-selected only loci that contained at least 1parsimony informative site. A distribution of gene treeswas estimated for each locus in MrBayes v.3.2.1 (Ronquistand Huelsenbeck 2003) under a GTR+� model from4 MCMCMC chains run for 1M generations samplingevery 1000 steps. These distributions were analyzed inBUCKy, combining 4 independent runs each with 4MCMC chains run 1M generations and repeated at threedifferent values for � (0.1, 1, 10), the prior expectation onthe number of distinct trees.

RESULTS

SimulationsLibrary types.—In simulations, as expected, allelicdropout caused by mutation-disruption of enzymerecognition sites occurred more frequently when theenzyme recognition site was longer. In contrast, allelicdropout caused by mutations giving rise to new cutsites within existing fragments occurred more frequentlywhen enzyme recognition sites were shorter (Fig. S3aavailable on Dryad). Variation in the range of the size-selected window had little effect on these results. Whenboth forms of mutation-disruption occur together theeffect of cutter length is effectively nullified (Fig. S3bavailable on Dryad). A comparison of single-digestversus double-digest methods, however, shows thatadding a second independent cutter approximatelydoubles the rate of data loss (Fig. S3b available onDryad), consistent with predictions from in silico studies(Collins and Hrbek 2015) that have shown more rapidmutation-disruption in double digest data.

Quartet informativeness.—To investigate the impact ofmissing RAD-seq data on phylogenetic information losswe examined the distribution of RAD loci that wereoriginally present in the common ancestor of a 64-tip tree and measured how many loci remain quartetinformative about each edge after some proportion ofdata is removed. If a locus lacks data for any oneof the four terminal edges of a quartet then it isnot informative about that quartet. The two sourcesof data loss examined, mutation-disruption and lowsequencing coverage, lead to different distributions ofphylogenetic information, particularly with respect toquartet information.

Hierarchical redundancy.—On the balanced topology alledges are equal in length and thus mutation-disruption

Page 5: Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data in RAD-seq Phylogenetics with a ... led to a common conception that RAD-seq data

2017 EATON ET AL.—MISCONCEPTIONS ABOUT MISSING RAD-SEQ DATA 403

is equally likely to occur along any edge. If mutation-disruption occurs for a given locus along a “tip” edge ofthe tree, then the locus no longer contains information toform any quartet that includes that tip taxon. However,the locus remains informative about all other quartetsformed by all other sets of four taxa in the data set.The same is not true when mutation-disruption occursalong deeper edges. In that case, data are lost for alltaxa subtending the edge (e.g., Fig. 1a), which disruptsall quartets that include a tip from the descendantclade. This means that mutation-disruption decreasesquartet information the most for splits near the tipsof the tree, since these edges can be affected by anymutations that occur between the tips and the root ofthe tree. The counter-intuitive result is that mutation-disruption actually has the least effect on informationloss at deeper edges, since these are least affected bymutations occurring elsewhere in the tree. We refer tothis property of the data as hierarchical redundancy (e.g.,Fig. 1c and d), wherein the amount of information ata given split in the tree is affected by its hierarchicalplacement in the tree. A split with high hierarchicalredundancy would have multiple individuals sampledfrom each of its edges, such that even if data were lostfor some individuals from each edge, there is still likelyto be at least one individual from each edge that retainsdata.

Mutation-disruption.—Of the original 1000 simulatedloci, approximately 900 remained quartet informativeacross the deepest splits of the balanced treefollowing mutation-disruption, and this decreasedto approximately 800 for edges near the tips (Fig. 2e).The rate of loss was approximately doubled for double-digest data (Fig. S2a available on Dryad). The effectof tree shape, and correspondingly edge length, ismade clear when these results are compared with thosefrom the imbalanced topology (Fig. 2f). On this tree,the shallowest edges have similar numbers of quartetinformative loci as in the balanced tree, but the deepestedges have the least phylogenetic information. This isbecause the hierarchical redundancy at the shallowestedges is similar for both trees, but highly dissimilarat their deepest edges (Fig. 2a and d). For a locusto be quartet informative at the deepest split in theimbalanced tree would require no disrupting mutationsto occur along any of the three longest edges of the tree,while only one edge of the quartet has high hierarchicalredundancy (Fig. 2d). The structure of the balanced tree,in contrast, has redundancy of sampling, on average,across all edges (Fig. 2c).

For practical purposes, these results suggest thatsampling a more balanced topology can increasephylogenetic information at deeper scales, and, inparticular, that outgroup sampling would be moreeffective if multiple representatives from a sisterclade were sampled as opposed to a grade ofindividual taxa representing multiple lineages. Thiswould have the effect of breaking up longer branches onwhich mutation-disruption could occur, and increasing

hierarchical redundancy, such that data lost for somemembers of a clade would more often still be representedby at least one other representative of that clade.

Sequencing coverage.—When data loss occurs morerandomly, as in the case of low sequencing coverage,its effect on phylogenetic information loss is quitedifferent. In Figure 2g-h, we show an example whereapproximately 50% of the data was randomly removedfrom each taxon. On the balanced topology, the quartetinformativeness of these data is nearly completelyrecovered only 2–3 edges deeper into the tree (Fig. 2g).This is once again because of hierarchical redundancy;data lost for any one tip at a given locus can often besampled from multiple other relatives as a substitute.Again, this effect is clear when compared against theimbalanced topology that lacks hierarchical redundancy(every split is connected to at least two tips). In thatcase, deeper nodes in the tree do not have morephylogenetic information because all internal edgesdepend on sampling a specific taxon — they do notincrease in hierarchical redundancy (Fig. 2h). Thus, theexpectation from low (or uneven) sequencing coverage isthat large amounts of missing data will decrease quartetinformation near the tips in a balanced tree, or equallyacross all edges for an imbalanced tree, but will neverlead to comparatively more information loss at edgesthat are deeper in time.

These results can be leveraged to maximizephylogenetic information at deeper depths of atree. When taxa are sampled in a way that maximizestree balance, deeper edges will have greater hierarchicalredundancy, and low sequencing coverage will have lesseffect on phylogenetic information further back in time.As more taxa are added to a tree, sampling them in abalanced way can rapidly recover quartet informationacross deeper edges that was otherwise lost due torandomly missing data among tips of the tree (Fig. 2g).

Empirical Data SetsThe number of recovered RAD loci shared among two

or four individuals varied by up to 15-30X across the10 empirical data sets examined, with proportions ofmissing data ranging between 34% and 92% (Table 1).Size and completeness are affected by a range of factorsthat differ among the data sets, including genomesize, sequencing effort, the restriction enzyme(s) used,the number of samples in the data set, and theamount of sequence divergence between them. Fromour small sample of studies, the influence of differentrestriction enzymes is perhaps most apparent. Despitemany differences among the data sets, those that usedsimilar restriction enzymes generally recovered a similarnumber of total shared loci (Table 1). Focusing onquartets as the minimum phylogenetic unit, we cancrudely estimate the amount of potential phylogeneticinformation as the number of RAD loci with sequencedata shared across at least four samples (Fig. 3). Data

Page 6: Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data in RAD-seq Phylogenetics with a ... led to a common conception that RAD-seq data

404 SYSTEMATIC BIOLOGY VOL. 66

(a)

(b) (d)

(c) (e)

(f)

(g)

(h)

FIGURE 2. The impact of missing data on quartet informativeness when missing data arise from either mutation-disruption or low sequencingcoverage. Data were simulated on two trees with contrasting balance and thus edge lengths. The shallowest splits on both topologies (markedthroughout by a downward facing triangle) have similar topologies a) and b), and also similar numbers of loci that are quartet-informative(compare the shallowest edge depth in e) to f). In both topologies, the shallowest split contains three short terminal edges along which disruptingmutations are unlikely to occur, and a fourth edge at which many possible taxa could contain data for a locus to be informative a) and b). Thetwo trees differ, however, in the information at their deepest splits c) and d) marked by a upward facing triangle). There is greatest redundancyin sampling across tips of the deepest split in the balanced topology c), but redundancy on only one edge of the deepest split of the imbalancedtopology d), which also contains three long terminal edges on which mutation-disruption could occur. As a result, quartet information followingmutation-disruption increases across deeper edges of the balanced topology e), but decreases across deeper deeper edges of the imbalancedtopology f). When missing data arises from low sequencing coverage, the balanced topology quickly recovers quartet informativeness acrossdeeper edges because of its hierarchical redundancy g). In contrast, the imbalanced topology does not increase in redundancy at deeper edgesd), and therefore does not recover quartet informativeness h). Black circles in e–h show the number of quartet informative loci for each splitwithout missing data, grey circles are with missing data, and triangles highlight the deepest and shallowest splits.

generated with a 6-bp restriction enzyme had an averageof 9,837±2,496 loci shared among four samples; similarto, but slightly more than in data sets using an 8-bpcutter (mean=7,026±6,983). Both were substantiallylarger than double-digest data sets that used an 8-bp+4-bp combination (mean=1,292±1,011) (Fig. 3). The largevariance among the 8-bp data sets likely reflects the factthat it includes the Orestias data set, which is by far theyoungest clade examined, and was also sequenced torelatively high coverage, making it an outlier.

The 10 empirical data sets examined vary with respectto size and tree balance, and thus also in the degreeof hierarchical redundancy of their internal edges. Inagreement with our simulation results, the number ofquartet informative loci is not equally distributed acrossall edges in these trees, but instead is most concentratedon edges with greatest redundancy — those that havea greater number of descendants from each subtendingedge. Figure 4 shows an example for Viburnum, Quercus,and Orestias, each of which accumulates more quartet

informative loci on internal edges that have greaterhierarchical redundancy (Spearman’s rank correlations:rs >0.6, P<1e−7). This pattern is most prominent indata sets that have many taxa since many tips arerequired for some edges to have multiple descendantleaves from each subtending edge. It also is mostconspicuous in data sets that have low or unevensequencing coverage since hierarchical redundancy hasa greater effect with randomly missing data. Because ofthe more balanced tree shape of the Viburnum data sethierarchical redundancy is greatest at many of its deepestedges, whereas, in contrast, the Orestias tree is lessbalanced across deeper edges and thus its informationis more concentrated at shallow depths.

Sources of missing data.—Across data sets, phylogeneticdistance was generally a poorer predictor of the numberof loci shared among quartets of taxa than sequencingcoverage (the number of input reads; Table 1; Fig. S4available on Dryad). Phylogenetic distance was a better

Page 7: Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data in RAD-seq Phylogenetics with a ... led to a common conception that RAD-seq data

2017 EATON ET AL.—MISCONCEPTIONS ABOUT MISSING RAD-SEQ DATA 405

(Viburnum)CTGCAG

(Heliconius)CTGCAG

(Phrynosomatidae)CCTGCAGG+CCGG

(Quercus)CTGCAG

(Orestias)CCTGCAGG

(Danio)CCTGCAGG

(Finches)CCTGCAGG+CCGG

(Barnacles)CCTGCAGG

(Ohomopterus)CTGCAG

(Pedicularis)CTGCAG

N lo

ci s

hare

d0

1000

020

000

3000

0

N-size subtree2 4 6 8 10

N-size subtree2 4 6 8 10

010

000

2000

030

000

N-size subtree2 4 6 8 10

010

000

2000

030

000

N-size subtree2 4 6 8 10

010

000

2000

030

000

N-size subtree2 4 6 8 10

010

000

2000

030

000

N-size subtree2 4 6 8 10

N lo

ci s

hare

d0

1000

020

000

3000

0

N-size subtree2 4 6 8 10

010

000

2000

030

000

N-size subtree2 4 6 8 10

010

000

2000

030

000

N-size subtree2 4 6 8 10

010

000

2000

030

000

N-size subtree2 4 6 8 10

010

000

2000

030

000

FIGURE 3. Number of loci (mean±SD) shared across subsampled trees of different size in 10 empirical RAD-seq data sets. Data sets areordered by the total number of loci recovered in their min4 assemblies (Table 1). The restriction recognition sequence (cutter) used to prepareeach library is shown. The mean number of loci shared across sets of four taxa (quartets) is marked with a dashed line. All axes are plotted onthe same scale.

predictor when sequencing coverage was closer tosaturation, either because the number of reads wasvery high, or the number of sequenced fragments waslow (e.g., Heliconius, and Finches, respectively; Table 1).Sequencing coverage was uneven in most data sets,with some individuals receiving much higher inputthan others, and some loci receiving high coveragedespite many other loci appearing as singletons. Thiswas observed even in double digest data sets that selectmany fewer fragments, and thus are predicted to recoverhigher coverage data given equal sequencing effort(Fig. S5 available on Dryad). Differences across datasets in the proportion of low coverage clusters does notseem to be associated with single versus double digestpreparations (Table 1), and may result from other factorssuch as differences in genome size, biases in fragmentamplification, or sequencing off-target DNA fragments.

The 6-bp cutter data sets are predicted to containa larger number of fragments than 8-bp cutter datasets, and correspondingly, the number of loci sharedamong sets of taxa in the 6-bp cutter data sets wasgenerally better predicted by the number of input readsthan by phylogenetic distance, suggesting these datasets are consistently under-sequenced. An exception wasthe Heliconius data set, which had a substantially largernumber of input reads per sample. The 6-bp cutter datasets also generally contained the highest numbers ofquartet informative loci (Table 1). Heliconius and Orestias,the two data sets with the greatest number of input reads,also had the most data shared across the largest numberof individuals (Fig. 3). Interestingly, these two data set

also show similar rates of data loss across increasingnumbers of sampled taxa (Fig. 3b and g). These rates aresimilar despite the fact that Orestias is by far the youngestclade examined, representing essentially a population-level data set, whereas the Heliconius data includesmany divergent species, with approximately 5X greatersequence divergence on average. This shows that in thenear absence of missing data from under sequencing,phylogenetic scale RAD-seq data sets can recover similaramounts of shared data among individuals as has beenobserved in population-level studies.

Viburnum PhylogenySequencing coverage.—The addition of a second lane ofsequence data led to a substantial increase in the numberof loci shared among multiple taxa. The increase wasparticularly prevalent for larger sets of taxa. For example,although the total number of loci shared across atleast four samples approximately doubled (from 24,191to 40,036) upon doubling our sequencing effort, thenumber of loci shared among 40 samples increasednearly 10X (from 184 to 1724). Because low sequencingcoverage tends to cause random missing data, it makessense that under-sequencing would rapidly reduce thenumber of loci shared among larger sets of taxa.

The total number of loci in the min4 data setincreased from 142K to 199K with the additionallane of sequencing, and the average number of lociper sample from 28,605±8,731 to 44,869±11,037. The

Page 8: Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data in RAD-seq Phylogenetics with a ... led to a common conception that RAD-seq data

406 SYSTEMATIC BIOLOGY VOL. 66

FIGURE 4. The number of loci that are quartet informative at each edge of three empirical RAD-seq data sets. The relative number of quartetinformative loci is indicated by the size of the node below (to the right of) each edge. Node sizes are not on the same scale between trees. Boxplotsshow the actual number of quartet informative loci with respect to the level of hierarchical redundancy at each edge, measured as the minimumnumber of descendant leaves from any of the four tips of an edge. The Viburnum topology is the most balanced across deeper splits and alsoshows an increase in quartet informativeness across deeper (more internally nested) edges.

TABLE 1. Ten empirical RAD-seq data sets reassembled and reanalyzed for this study

Study Cut Ntips phydist input lowcov min2 (%miss) min4 (%miss) �phy �input

1. Viburnum 6 65 0.061 2.7 0.25 496,509 (88.7) 199,094 (77.6) – 0.10 0.312. Heliconius 6 56 0.050 5.3 0.36 212,818 (81.0) 119,819 (70.0) – 0.13 0.043. Quercus 6 36 0.018 2.0 0.43 145,210 (69.2) 87,969 (53.7) – 0.12 0.154. Ohomopterus 6 32 0.040 3.9 0.18 119,441 (68.4) 76,778 (54.6) – 0.10 0.085. Danio 8 31 0.112 2.6 0.52 259,982 (88.9) 72,237 (74.3) – 0.21 – 0.016. Pedicularis 6 13 0.027 1.3 0.73 80,639 (56.9) 44,875 (36.3) – 0.08 0.247. Orestias 8 64 0.010 4.0 0.05 45,146 (37.0) 43,000 (33.8) – 0.03 0.268. Phrynosomatidae 8, 4 74 0.063 1.5 0.72 87,325 (91.8) 41,011 (86.2) – 0.43 0.179. Barnacles 8 13 0.052 0.7 0.24 26,325 (64.4) 15,502 (51.9) – 0.28 0.2110. Finches 8, 4 25 0.041 0.6 0.45 17,477 (60.9) 12,864 (50.5) – 0.07 0.07

Notes: Mean pairwise phylogenetic distances (GTR+�) between pairs of taxa (phydist), mean number of input sequence reads (×106; input) pertaxon, and mean proportion of excluded low coverage clusters across samples (lowcov) is shown, in addition to the total number of RAD-seq locirecovered for pairs of taxa (“min2” data sets), and quartets (“min4” data sets) of taxa, along with the proportions of missing data in the assembledsupermatrices (%miss). The number of sampled taxa (Ntips) and the length of restriction recognition sites for enzymes used to prepare libraries(Cut; detailed in Fig. 3) vary across data sets. We fit a phylogenetic least squares model to predict the number of loci shared among quartets oftaxa in each data set. Regression coefficients for phylogenetic distance (�phy) and log median number of input reads (�input) are reported the meanvalue across 100 replicate subsamples of 200 quartets. Regression coefficients are bolded if the mean P-value across replicates was significant at�=0.01.

Page 9: Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data in RAD-seq Phylogenetics with a ... led to a common conception that RAD-seq data

2017 EATON ET AL.—MISCONCEPTIONS ABOUT MISSING RAD-SEQ DATA 407

average number of samples with data for each locusincreased from 12.9±9.1 to 14.6±12.7, and the numberof parsimony informative SNPs from 793,827 to 1,227,424,corresponding to an increase from 5.5±4.4 parsimonyinformative SNPs per locus to 6.2±5.3. It is likely thatsequencing coverage was still short of saturation, sincethe proportion of missing data in the half versus fullmin4 data sets remained similar (80.4% and 77.6%),meaning that many new loci were added that continuedto be shared among a small proportion of samples.

Consistent with our simulation results, the number ofquartet informative loci across splits of the Viburnum treewas greatest at edges with more hierarchical redundancy(Fig. 4). This relationship is very similar to theone predicted when data were simulated on theViburnum topology with low sequencing coverage(Fig. S2f-g available on Dryad), and consistent with ourPGLS estimates predicting that most missing data in thisdata set is due to low sequencing coverage rather thanmutation-disruption.

Despite differences in size and completeness of thehalf and full data sets, both recovered highly similarML topologies (Fig. 5a; Fig. S6 available on Dryad),with only a small difference being the placement of V.erosum. Nearly all splits in the tree have 100% bootstrapsupport (Fig. 5a). The quartet-based species tree inferredfrom SNPs recovered a similar topology as the MLanalyses (Fig. 5b; Fig. S6 available on Dryad), withonly slight differences that were also associated withlower bootstrap support. An important difference is insupport for the placement of Pseudotinus, a clade that hashistorically proven difficult to place along the backboneof Viburnum (Clement et al. 2014; Spriggs et al. 2015).The placement of Pseudotinus in relation to the Urceolataclade is of special interest because, although thesegroups have not been united in cpDNA analyses, theyshare several distinctive morphological characterstics,including naked buds (lacking modified scales) anda unique sympodial branching architecture. Previousresults from whole chloroplast genome data (Clementet al. 2014) placed Urceolata sister to V. amplificatum +Crenotinus (Fig. 5d). However, this clade was puzzling tothe extent that it was named Perplexitinus to highlight thelack of morphological synapomorphies (Clement et al.2014).

Our concatenated ML and quartet-based species treeanalyses support a direct link between Pseudotinusand Urceolata (Fig. 5a and b), which makes the mostsense from a morphological standpoint. However, whilethe concatenated analysis found perfect support forthis relationship, the quartet-based analysis showedsignificant uncertainty (67% bootstrap support). The bestsupported alternative relationship in the quartet analysisincludes the two clades in sequence (paraphyletically)along the backbone (Fig. 5c; Fig. S6 available on Dryad).As the placement of Pseudotinus and Urceolata in ouranalyses has mixed support, and conflicts with previouscpDNA analyses, we focused our Bayesian concordancefactor analysis on exploring the distribution of RAD locisupporting each alternative hypothesis.

Bayesian concordance analysis.—Concordance factors arenot estimates of support, but rather represent theproportion of sampled loci for which a clade isrecovered (Baum 2007). We tested several values for theprior on the number of true underlying genealogies(a parameter which affects the correction of genetree estimation errors) but present only �=0.1, sinceall results were qualitatively similar. The primaryconcordance tree for our deep-scale analysis, whichrepresents the most frequently occurring nonconflictingclades across sampled loci, matches the topology fromour concatenation and quartet-based analyses. This wastrue whether the concordance tree was inferred fromthe full data set, which contained 1,203 loci with atleast one parsimony informative SNP and data for all8 selected taxa in this test, or if we used the half data set,in which only 120 loci met the same criteria. The largerdata set consistently yielded higher CFs and narrowerconfidence intervals.

On the primary concordance tree two clades(amplificatum + Crenotinus, and Tinus + Imbricotinus) wererecovered with a 95% credibility interval on their CFthat does not conflict with any other clade, while theremaining three clades show greater conflict (Fig. 5b).The placement of Pseudotinus in the alternative positionthat received moderate support in the quartet-basedanalysis (Fig. 5c) has a CF that is approximately half (95%CI = 0.08–0.17) that of the conflicting clade found in theprimary concordance topology (95% CI = 0.15–0.23). Theplacement of Urceolata in the cpDNA topology, whereit forms a clade with amplificatum+Crenotinus, receivedsimilarly low support (95% CI = 0.08–0.19); while theplacement of Urceolata within a clade that includesTinus + Imbricotinus to the exclusion of Pseudotinus (assupported by cpDNA) was very poorly supported (95%CI = 0.03–0.09) (Fig. 5d).

The shallow-scale relationships shown in Figure5a and 5e for the Oreinodentinus clade are stronglysupported here, in contrast to cpDNA results wherethe data were insufficient for confident resolution(Clement et al. 2014; Spriggs et al. 2015). Importantly,our RAD analyses are congruent with geographyover morphological characters (leaf size, shape, andpubescence) in strongly supporting sister relationshipsbetween V. sulcatum and V. acutifolium from Mexico, andbetween V. jamesonii and V. triphyllum from Ecuador. Itis noteworthy that we recovered 3,300 loci in the fulldata set that met our criteria for inclusion in the analysis,and only 364 loci in the half data set. Again, both datasets yielded the same topology, and the larger dataset recovered higher concordance factors and narrowerconfidence intervals (Fig. 5e).

DISCUSSION

Sources of Missing DataA common intuition about RAD-seq is that it

contains little phylogenetic information for resolvingdeep-scale relationships. Here, we examined both

Page 10: Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data in RAD-seq Phylogenetics with a ... led to a common conception that RAD-seq data

408 SYSTEMATIC BIOLOGY VOL. 66

(b)(a)

(c)

(d)

(e)

FIGURE 5. RAD-seq phylogeny for 65 species of Viburnum. a) Rooted maximum likelihood tree inferred from the largest phylogeneticallyinformative (min4) concatenated data set. Bootstrap support is 100 except where indicated. Open and closed grey circles indicate deep andshallow splits, respectively, that were further analyzed with Bayesian concordance analysis b)–e). Closed black circles at the tips highlight taxaused as representatives of major clades in the deep-scale analysis (b-d). b) The primary concordance tree for deep splits among major clades ofViburnum matches the ML concatenated topology. The 95% CI on concordance factors (CFs) inferred from the “half” and “full” data sets areshown above and below each split, respectively. The two splits with CFs that do not overlap the 95% CI of any alternative split are shown inbold inside brackets, while the CFs for three splits that do conflict with at least one other split are shown in plain text inside parentheses. c) Analternative topology in which Urceolata and Pseudotinus do not form a clade, as supported by quartet species tree analysis, has lower support. d)The Clement et al. (2014) cpDNA topology has low CFs on splits that differ from the primary concordance topology shown in b. e) Splits withinthe recent and rapidly radiated Oreinodentinus clade are all strongly supported.

simulated and empirical data to show that althoughmutation-disruption causes more distant relatives torecover fewer shared RAD loci, this pattern doesnot necessarily lead to the corollary expectation oflittle (or even less) phylogenetic information acrossdeeper splits in a tree. We identify several factorsthat influence how missing data among sampled tipsof a tree translates into information that can beused to test support for splits within the tree using

concatenation, quartet-based, and concordance phylo-genetic approaches.

Our simulation study shows clearly an effect of treeshape and size (number of taxa) that together influencethe information at edges of varying depths. Edgesthat are multiple nodes removed from a tip sufferless information loss from missing data among tips ofthe tree whether the data are missing from mutation-disruption or low sequencing coverage. In either case,

Page 11: Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data in RAD-seq Phylogenetics with a ... led to a common conception that RAD-seq data

2017 EATON ET AL.—MISCONCEPTIONS ABOUT MISSING RAD-SEQ DATA 409

the hierarchical redundancy associated with these morenested edges makes it possible for more loci to havedata for at least two individuals on each side of a split,such that the minimal quartet phylogenetic informationis available. Sampling more individuals will necessarilyshorten some branches of a tree and in doing so willincrease the number of independent lineages alongwhich data may be recovered. An interesting aspect ofthis property of RAD-seq data is that as larger datasets are assembled that include more sampled taxa,many loci that are initially found as singletons, or thatare shared among few taxa, can become increasinglyphylogenetically informative.

Although we did not include mutation-generationof new fragments in our simulations (as this wouldrequire assumptions about the size and compositionof the genome), we expect that this type of missingdata also varies greatly across data sets of differentsize and age, or prepared using different restrictiondigest methods. Loci that arise within a subclade frommutation-generation are expected not to be sharedamong all individuals in a larger clade. These locican provide phylogenetic information for taxa withinthe subclade in which the locus is present, but arenot informative about the placement of that subcladewithin the larger tree. Interestingly, of the 10 datasets examined, the double digest libraries, which tendto select fewer total fragments and should experienceless mutation-generation of new fragments since theyrequire two restriction enzymes to be present, did nottend to have less missing data than single digest libraries(Table 1). This suggests that mutation-generation ofnew fragments, at least over the time scales examinedand considering the restriction enzymes used, is not aprimary source of missing data in RAD-seq data sets.

How much Phylogenetic Information is Necessary?In sequencing projects there is a persistent trade-

off between cost, quality, and quantity, with decisionsto be made about which protocols to use and howmuch sequencing effort to expend. One of the majorbenefits of “reduced complexity” methods like RAD-seq is that they simplify many cost-benefit decisionsthat must be made when preparing genomic libraries,because the restriction digestion more or less equalizesthe number of expected markers that will be presentacross a group of closely related organisms. However,there remains the question of how many markers tosample (how common a cutter to use), and how manyindividuals to multiplex (how much sequencing effortper sample). For evolutionary questions at shallowscales, coverage as low as 1X is commonly used, andthe benefit gained by multiplexing many individuals hasbeen argued to outweigh the potential errors introducedfrom low coverage (Buerkle and Gompert 2013); but see(Harvey et al. 2013). A concern for using low coveragedata is that sequencing errors, if not corrected, cansignificantly influence branch length estimates (Kuhner

and McGill 2014), and perhaps even the topology.For phylogenetic-scale questions, individuals typicallyrepresent divergent populations or species, and thus itmay be less appropriate to account for sequencing errors,or impute missing data (Li et al. 2009), using populationgenetic assumptions about the expected distribution ofvariation.

Across the 10 empirical data sets examined, those withgreater sequencing effort per sample generally containedmore shared loci among samples (Table 1; Fig. 3). In thecase of Viburnum, which is a relatively old clade with afairly large genome (∼4 Gb; (Bennett and Leitch 2012)),a minimal increase in sequencing effort (2X), increasedthe number of markers with shared data across differentsubsets of the tree by 2–10X, and nearly doubled thenumber of parsimony informative sites. Consequently,the amount of information available for quartet-basedinference using SNPs was approximately doubled, andthe amount of information for our concordance analyseswas increased by 10X, since this method requires fullysampled gene trees. Even in the case of species treemethods based on the multi-species coalescent that donot require fully sampled gene trees, such as MP-EST(Liu et al. 2010) and ASTRAL (Mirarab et al. 2014b),RAD loci that contain more SNPs shared across agreater number of taxa will provide more informativeinputs, which has been shown to consistently improveperformance of these methods (Mirarab et al. 2014a; Liuet al. 2015; Xi et al. 2015).

One can imagine two competing strategies forattaining large phylogenetic RAD-seq data sets givenlimited resources. The first is to sequence many locito lower coverage since, as we have shown, randomlymissing data have little effect on the amount ofinformation at deeper edges of a tree (at least if itcan be sampled in a balanced way). Examples of thisare data sets 1–7 (Fig. 3), which recover relativelysparse matrices, but with enormous numbers of locithat are quartet informative, making them highlyuseful for concatenation and quartet-based phylogeneticapproaches. A second strategy is to sequence fewer locito higher coverage, which would be more useful whenseeking to attain informative gene trees. Data sets 2,4, 5, and 10 (Table 1) represent highly divergent datasets in which missing data are poorly explained bythe amount of input data, suggesting that sequencingcoverage was nearly saturated. However, it should benoted that these data sets are still composed of largeamounts of missing data in their min4 assemblies (70, 54,74, and 50%, respectively). Although sampling fewer locito higher coverage can reduce the amount of data that israndomly missing, the use of additional cutters to reducethe number of loci comes with a trade-off of highermutation disruption. Our simulation and empiricalresults both show that single digest preparations tendto recover much more phylogenetic information thandouble digest methods, and that when coupled withhigh sequencing input can yield orders of magnitudemore data.

Page 12: Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data in RAD-seq Phylogenetics with a ... led to a common conception that RAD-seq data

410 SYSTEMATIC BIOLOGY VOL. 66

RAD-seq and related methodsA number of studies have used in silico approaches

to investigate the expected distribution of missing datain RAD-seq assemblies, and these provided the earliestevidence that RAD could be informative for phylogeneticanalyses (Rubin et al. 2012). However, many empiricalRAD-seq data sets have proven less informative thanthese studies would predict, likely due to missing datafrom sources besides mutation-disruption, which insilico studies generally do not consider. As we haveshown (Table 1), details of the RAD-seq preparationprotocol substantially impact the amount of data thatwill be recovered.

An important point of comparison between RAD-seqand related reduced-representation genomic sequencingmethods, such as anchored hybrid enrichment (Lemmonet al. 2012) and ultra conserved elements (UCEs;McCormack et al. 2012), is in the total amount ofphylogenetic data that each can produce. AlthoughRAD-seq can sample many more genomic regionsin total than the other two methods, it is oftenthought that over deep time-scales the introductionof missing data will reduce its information contentto a level below that of the other methods whichcontain very little missing data. (Collins and Hrbek2015) recently overturned this expectation by comparingthe in silico predicted phylogenetic informativeness ofRAD, UCE, and anchored hybrid enrichment data thatcould be bioinformatically extracted from 33 primategenomes. In their comparison, even at the deepest time-scales examined (60–80 Ma), RAD provides far morephylogenetic information than alternative methods,despite the presence of substantial missing data in theRAD-seq data set.

How large are these differences? Consider the difficultphylogenetic problem of resolving the branching orderof early diverging lineages of neoavian birds, whichwas recently investigated in two large-scale analysesusing reduced-representation genomic methods. (Jarviset al. 2014) combined full genome re-sequencing datawith UCE loci, of which the latter provided a matrixof ∼370 Kb for 48 taxa, whereas (Prum et al. 2015)sampled ∼390 Kb from 198 taxa using anchored hybridenrichment loci. Both data sets had negligible amountsof missing data. Because our Viburnum data set is of asimilar, albeit slightly younger crown age (50–60 Myr) itprovides an interesting comparison. The concatenatedsupermatrix of RAD loci in our min4 data set for65 species of Viburnum was substantially larger thaneither reduced-representation bird data set, with 17.1 Mb.Despite its 78% missing data, this included 1.2 millionparsimony informative SNPs (3.2M SNPs total), meaningthat the Viburnum data set contains three times moreinformative sites than the two reduced-representationbird data sets contain characters.

As we have shown, the Viburnum RAD-seq data arenot concentrated only among close relatives (Fig. 4),but rather, there is substantial information across alledges of the tree, and it actually increases across deeperedges (Fig. S2g available on Dryad). In fact, although

there were only 1,203 loci (∼100Kb) shared acrossthe 8 taxa representing the most divergent splits inViburnum (those selected in our BUCKy analyses), thetotal number of loci spanning these splits is far greaterwhen considering the many additional combinations of8 taxa that could be sampled from these 8 clades, sincemany contain multiple taxa. The counter-intuitive resultthat the sparcity of RAD-seq supermatrices does nottranslate into a sparcity of phylogenetic information isan important consideration when comparing reduced-representation genomic methods.

Missing Data and Phylogenetic InferenceAlthough it has been a topic of interest for many

years, it remains unclear how missing data affectsphylogenetic inference under a range of scenarios,including different inference methods, data types, andproportions of missing data. Many of the studies of theseproblems predate the availability of genomic sequencedata. (Wiens 2003) showed that reduced accuracy frommissing data is typically a function of having too fewsampled characters shared between taxa rather thantoo many missing data cells. Similarly, the problem ofterraces in phylogenetic tree space (Sanderson et al.2011; 2015) was initially described for cases where manysamples share little or no informative characters whenpartitioning data. This is not the problem we face inRAD-seq phylogenetics. Typically, any sample will sharehundreds or thousands of loci with every other sample.As the problem for RAD-seq data is one of thousands ofpartially overlapping taxon sets, we focused our analyseson the distribution of the smallest informative sets,namely quartets (Berry and Gascuel 2000).

An important distinction of the quartetinformativeness metric is that it is measured conditionalon a topology. Thus, despite the fact that there maybe a substantial amount of information to resolve asplit on the correct topology, one could imagine that anuneven distribution of phylogenetic information couldbias the ability to traverse tree space, and thus to findthis topology. (Whidden and Matsen 2015) recentlydescribed a method for visualizing and measuringthe efficiency with which MCMC chains traverse treespace which will likely be a useful direction for furtherresearch on how missing data affects the ability to reachoptimal regions of tree space. Similarly, (Huang andKnowles 2016) showed that mutation-disruption canlead to a biased mutational spectrum, such that sometaxa will share loci that evolve faster or slower thanothers. These topics deserve further consideration forRAD-seq as well as other data types for understandinghow different distributions of missing data affectphylogenetic inference.

CONCLUSIONS

Our counter-intuitive results show that althoughRAD-seq data sets have high proportions of missingdata, including more missing data between more

Page 13: Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data in RAD-seq Phylogenetics with a ... led to a common conception that RAD-seq data

2017 EATON ET AL.—MISCONCEPTIONS ABOUT MISSING RAD-SEQ DATA 411

divergent taxa, there is potential to gather morephylogenetic information (measured in terms of quartetinformativeness) about splits deeper in a tree thantoward the tips. RAD-seq data can therefore be usefulin resolving shallow (population-level) relationships, asgenerally assumed, but also in resolving much deeperphylogenetic relationships on the scale of the 50-60million-year-old Viburnum phylogeny presented here.However, as we have also shown, high bootstrap valuesin concatenated analyses need to be thoroughly explored(e.g., using multi-locus coalescent and concordanceapproaches), and may reveal wide variation in the natureand strength of support (Salichos and Rokas 2013). Byassembling a large RAD-seq data set containing millionsof SNPs, as well as hundreds of well-sampled gene trees,we were able to compare multiple phylogenetic methodsto test alternative phylogenetic hypotheses.

One of our primary findings is that increasedsequencing coverage can greatly increase thephylogenetic utility of RAD-seq data sets. Thisis encouraging considering that sequencing costsare likely to continue to decrease for some time.Furthermore, we show that hierarchical redundancyreduces the impact that missing data among tips of atree has on phylogenetic information at edges withinit. Larger trees necessarily have more edges on whichhierarchical redundancy can be increased. Samplingmore taxa, and targeting more balanced tree shapeswill further minimize the effects of missing data. Bothof these findings suggest that early RAD-seq studies,which often sampled few taxa and were generated onsequencing platforms that yielded fewer reads, maynot be representative of the potential that RAD-seqoffers for phylogenetic resolution of clades from thescale of thousands of years to tens of millions of years.Our analysis of the angiosperm clade Viburnum makesclear that the amount of information that can beattained rapidly and efficiently using RAD-seq makesit a powerful approach for performing large-scalecomparative genomic analyses.

REPRODUCIBILITY

Code to download the 10 empirical data sets,assemble them, and reproduce our results areorganized into Jupyter/IPython notebooks availableat https://github.com/dereneaton/RADmissing. Thedemultiplexed Viburnum fastq data are archived in theNCBI SRA: SRP065788.

SUPPLEMENTARY DATA

Supplementary tables and figures can be found inDryad Repository: (http://dx.doi.org/10.5061/dryad.g549v).

FUNDING

This work was supported by National ScienceFoundation (DEB-1145606, IOS-1256706)

ACKNOWLEDGEMENTS

We thank Cecile Ané and Michael Landis for helpfuldiscussions on implementing the regression analyses,and we also thank Laura Kubatko and two anonymousreviewers for suggestions that improved the article. Wealso thank the many researchers who made their datasets publicly available.

REFERENCES

Avni E., Cohen R., Snir S. 2015. Weighted quartets phylogenetics. Syst.Biol. 64:233–242.

Baird N.A., Etter P.D., Atwood T.S., Currey M.C., Shiver A.L., LewisZ.A., Selker E.U., Cresko W.A., Johnson E.A. 2008. Rapid SNPdiscovery and genetic mapping using sequenced RAD markers.PLoS ONE 3:e3376.

Baum D.A. 2007. Concordance trees, concordance factors, and theexploration of reticulate genealogy. Taxon 56:417–426.

Bayzid M.S., Mirarab S., Boussau B., Warnow T. 2015. Weightedstatistical binning: enabling statistically consistent genome-scalephylogenetic analyses. PLoS ONE 10:e0129183.

Bennett M., Leitch I. 2012. Plant DNA C-values database (release 6.0,Dec. 2012). http://www.kew.org/cvalues/. (Accessed: 2015-09-30).

Berry V., Gascuel O. 2000. Inferring evolutionary trees with strongcombinatorial evidence. Theoret. Computer Sci. 240:271–298.

Buerkle A.C., Gompert Z. 2013. Population genomics based on lowcoverage sequencing: how low should we go? Mol. Ecol. 22:3028–3035.

Cariou M., Duret L., Charlat S. 2013. Is RAD-seq suitable forphylogenetic inference? An in silico assessment and optimization.Ecol. Evol. 3:846–852.

Chifman J., Kubatko L. 2014. Quartet inference from SNP data underthe coalescent model. Bioinformatics 30:3317–3324.

Clement W.L., Arakaki M., Sweeney P.W., Edwards E.J., DnoghueM.J. 2014. A chloroplast tree for Viburnum (Adoxaceae) and itsimplications for phylogenetic classification and character evolution.Am. J. Botany 101:1029–1049.

Collins R.A., Hrbek T. 2015. An in silico comparison of reduced-representation and sequence-capture protocols for phylogenomics.bioRxiv, p. 032565.

DaCosta J.M., Sorenson M.D. 2016. ddRAD-seq phylogenetics based onnucleotide, indel, and presence–absence polymorphisms: Analysesof two avian genera with contrasting histories. Mol. Phylogenet.Evol. 94:122–135.

Eaton D.A.R., Hipp A.L., González-Rodríguez A., Cavender-Bares J.2015. Historical introgression among the American live oaks andthe comparative nature of tests for introgression. Evolution 69:2587–2601.

Eaton D.A.R., Ree R.H. 2013. Inferring phylogeny and introgressionusing RADseq data: An example from flowering plants (Pedicularis:Orobanchaceae). Syst. Biol. 62:689–706.

Eaton D.A.R. 2014. PyRAD: assembly of de novo RADseq loci forphylogenetic analyses. Bioinformatics 30:1844–1849.

Escudero M., Eaton D.A., Hahn M., Hipp A.L. 2014. Genotyping-by-sequencing as a tool to infer phylogeny and ancestral hybridization:A case study in carex (cyperaceae). Mol. Phylogenet. Evol. 79:359–367.

Harvey M.G., Smith B.T., Glenn T.C., Faircloth B.C., Brumfield R.T.2013. Sequence capture versus restriction site associated DNAsequencing for phylogeography. arXiv:1312.6439 [q-bio]. ArXiv:1312.6439.

Herrera S., Watanabe H., Shank T.M. 2015. Evolutionary andbiogeographical patterns of barnacles from deep-sea hydrothermalvents. Mol. Ecol. 24:673–689.

Hipp A.L., Eaton D.A.R., Cavender-Bares J., Fitzek E., Nipper R., ManosP.S. 2014. A framework phylogeny of the American oak clade basedon sequenced RAD data. PLoS ONE, 9:e93975.

Huang H., Knowles L.L. 2016. Unforeseen Consequences of ExcludingMissing Data from Next-Generation Sequences: Simulation studyof RAD Sequences. Syst. Biol. 65:357–365.

Page 14: Misconceptions on Missing Data in RAD-seq Phylogenetics ...€¦ · Misconceptions on Missing Data in RAD-seq Phylogenetics with a ... led to a common conception that RAD-seq data

412 SYSTEMATIC BIOLOGY VOL. 66

Jarvis E.D., Mirarab S., Aberer A.J., Li B., Houde P., Li C., HoS.Y.W., Faircloth B.C., Nabholz B., Howard J.T., Suh A., WeberC.C., Fonseca R.R.d., Li J., Zhang F., Li H., Zhou L., Narula N.,Liu L., Ganapathy G., Boussau B., Bayzid M.S., Zavidovych V.,Subramanian S., Gabaldón T., Capella-Gutiérrez S., Huerta-Cepas J.,Rekepalli B., Munch K., Schierup M., Lindow B., Warren W.C., RayD., Green R.E., Bruford M.W., Zhan X., Dixon A., Li S., Li N., HuangY., Derryberry E.P., Bertelsen M.F., Sheldon F.H., Brumfield R.T.,Mello C.V., Lovell P.V., Wirthlin M., Schneider M.P.C., Prosdocimi F.,Samaniego J.A., Velazquez A.M.V., Alfaro-Núñez A., Campos P.F.,Petersen B., Sicheritz-Ponten T., Pas A., Bailey T., Scofield P., BunceM., Lambert D.M., Zhou Q., Perelman P., Driskell A.C., Shapiro B.,Xiong Z., Zeng Y., Liu S., Li Z., Liu B., Wu K., Xiao J., Yinqi X., ZhengQ., Zhang Y., Yang H., Wang J., Smeds L., Rheindt F.E., Braun M.,Fjeldsa J., Orlando L., Barker F.K., Jønsson K.A., Johnson W., KoepfliK.P., O’Brien S., Haussler D., Ryder O.A., Rahbek C., WillerslevE., Graves G.R., Glenn T.C., McCormack J., Burt D., Ellegren H.,Alström P., Edwards S.V., Stamatakis A., Mindell D.P., Cracraft J.,Braun E.L., Warnow T., Jun W., Gilbert M.T.P., Zhang G. 2014. Whole-genome analyses resolve early branches in the tree of life of modernbirds. Science 346:1320–1331.

Kuhner M.K., McGill J. 2014. Correcting for sequencingerror in maximum likelihood phylogeny inference. G3:Genes|Genomes|Genetics 4:2545–2552.

Larget B.R., Kotha S.K., Dewey C.N., Ané C. 2010. BUCKy: Genetree/species tree reconciliation with Bayesian concordance analysis.Bioinformatics 26:2910–2911.

Leaché A.D., Chavez A.S., Jones L.N., Grummer J.A., GottschoA.D., Linkem C.W. 2015. Phylogenomics of phrynosomatidlizards: conflicting signals from sequence capture versusrestriction site associated DNA sequencing. Genome Biol. Evol. 7:706–719.

Leaché A.D., Banbury B.L., Felsenstein J., nieto-Montes de Oca A.,Stamatakis, A. 2015. Syst. Biol. 64:1032–1047.

Lemmon A.R., Emme S.A., Lemmon E.M. (2012). Anchored hybridenrichment for massively high-throughput phylogenomics. Syst.Biol. 61:727–744.

Li Y., Willer C., Sanna S., Abecasis G. (2009). Genotype Imputation.Ann. Rev. Genomics Human Genet 10:387–406.

Liu L., Xi Z., Wu S., Davis C.C., Edwards S.V. 2015. Estimatingphylogenetic trees from genome-scale data. Ann. N.Y. Acad. Sci.1360:36–53.

Liu L., Yu L., Edwards S.V. 2010. A maximum pseudo-likelihoodapproach for estimating species trees under the coalescent model.BMC Evol. Biol. 10:1–18.

McCluskey B.M., Postlethwait J.H. (2015). Phylogeny of Zebrafish, a“Model Species,” within Danio, a “Model Genus”. Mol. Biol. Evol.32:635–652.

McCormack J.E., Faircloth B.C., Crawford N.G., Gowaty P.A.,Brumfield R.T., Glenn T.C. 2012. Ultraconserved elements are novelphylogenomic markers that resolve placental mammal phylogenywhen combined with species-tree analysis. Genome Res. 22:746–754.

Miller M.R., Dunham J.P., Amores A., Cresko W.A., Johnson E.A.2007. Rapid and cost-effective polymorphism identification andgenotyping using restriction site associated DNA (RAD) markers.Genome Res. 17:240–248.

Mirarab S., Bayzid M.S., Boussau B., Warnow T. 2014a. Statisticalbinning enables an accurate coalescent-based estimation of the aviantree. Science 346(6215).

Mirarab S., Reaz R., Bayzid M.S., Zimmermann T., Swenson M.S.,Warnow T. 2014b. Astral: genome-scale coalescent-based speciestree estimation. Bioinformatics 30:i541–i548.

Mita S.D., Siol M. 2012. EggLib: processing, analysis and simulationtools for population genetics and genomics. BMC Genetics 13:27.

Nadeau N.J., Martin S.H., Kozak K.M., Salazar C., DasmahapatraK.K., Davey J.W., Baxter S.W., Blaxter M.L., Mallet J., Jiggins C.D.2013. Genome-wide patterns of divergence and gene flow across abutterfly radiation. Mol. Ecol. 22:814–826.

Peterson B.K., Weber J.N., Kay E.H., Fisher H.S., Hoekstra H.E. 2012.Double digest RADseq: an inexpensive method for de novo SNPdiscovery and genotyping in model and non-model species. PloSOne 7:e37135.

Pinheiro J., Bates D., DebRoy S., Sarkar D., R Core Team 2016. nlme:Linear and Nonlinear Mixed Effects Models. R package version 3.1-128.

Prum R.O., Berv J.S., Dornburg A., Field D.J., Townsend J.P., LemmonE.M., Lemmon A.R. 2015. A comprehensive phylogeny of birds(Aves) using targeted next-generation DNA sequencing. Nature526:569–573.

de Queiroz A., Gatesy J. 2007. The supermatrix approach to systematics.Trend Ecol. Evol. 22:34–41.

Ree R.H., Hipp A.L. 2015. Inferring phylogenetic history fromrestriction site associated DNA (RADseq). In: (Hörandl E.,Appelhans M., editors. Next-generation sequencing in plant systematics,International Association for Plant Taxonomy (IAPT), chap. 6.

Ronquist F., Huelsenbeck J.P. 2003. MrBayes 3: Bayesian phylogeneticinference under mixed models. Bioinformatics 19:1572–1574.

Rubin B.E.R., Ree R.H., Moreau C.S. 2012. Inferring phylogenies fromRAD sequence data. PLoS ONE 7:e33394.

Salichos L., Rokas A. 2013. Inferring ancient divergences requires geneswith strong phylogenetic signals. Nature 497:327–331.

Sanderson M.J., McMahon M.M., Stamatakis A., Zwickl D.J., SteelM. 2015. Impacts of terraces on phylogenetic inference. Syst. Biol.64:709–726.

Sanderson M.J., McMahon M.M., Steel M. 2011. Terraces inphylogenetic tree space. Science 333:448–450.

Sanderson M.J., Purvis A., Henze C. 1998. Phylogenetic supertrees:Assembling the trees of life. Trend Ecol. Evol. 13:105–109.

Spriggs E.L., Clement W.L., Sweeney P.W., Madriñán S., Edwards E.J.,Donoghue M.J. 2015. Temperate radiations and dying embers ofa tropical past: the diversification of Viburnum. New Phytologist.207:340–354.

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysisand post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

Symonds M.R.E., Blomberg S.P. 2014. A primer on phylogeneticgeneralised least squares. In: Garamszegi L.Z., editor. ModernPhylogenetic Comparative Methods and Their Application in EvolutionaryBiology. Berlin Heidelberg: Springer, chap. 5, pp. 105–130.

Takahashi T., Moreno E. 2015. A RAD-based phylogenetics for Orestiasfishes from Lake Titicaca. Mol. Phylogenet. Evol. 93:307–317.

Takahashi T., Nagata N., Sota T. 2014. Application of RAD-basedphylogenetics to complex relationships among variously relatedtaxa in a species flock. Mol. Phylogenet. Evol. 80:137–144.

Whidden C., Matsen F.A. 2015. Quantifying MCMC exploration ofphylogenetic tree space. Syst. Biol. 64:472–491.

Wiens J.J. 2003. Missing data, incomplete taxa, and phylogeneticaccuracy. Syst. Biol. 52:528–538.

Xi Z., Liu L., Davis C.C. 2015. Genes with minimal phylogeneticinformation are problematic for coalescent analyses when gene treeestimation is biased. Mol. Phylogenet. Evol. 92:63–71.