Sequencing pools of individuals — mining genome-wide...


  • About a decade ago, a fully sequenced genome was big news. But now, owing to rapid advances in next-generation sequencing (NGS) technology and computer algorithms for assembling short reads, we are enjoying the availability of an ever increasing number of genomes from a broad spectrum of non-model organisms [1]. In parallel with the growing catalogue of reference genomes, a variety of approaches have emerged that seek to characterize the genome-wide polymorphism patterns. Arguably, the most comprehensive polymorphism data so far have been generated by single-nucleotide polymorphism (SNP) microarrays in humans [2,3]. More recently, the field has begun moving towards the characterization of full genome sequences, with 1000 genome projects completed for humans [4], Arabidopsis thaliana [5] and cattle [6]. In Drosophila melanogaster too, hundreds of genomes have already been sequenced, and other species, such as pigs and dogs, are catching up. Does this imply that we have now captured all of the relevant variation and that, despite some minor bits and pieces of data remaining to be filled, in essence we are close to what we need to understand variation in these species?

    Probably the best demonstration that this is not the case comes again from human genetics. Studies of human diseases and other complex traits have shown that even the analysis of several thousand individuals is frequently insufficient to determine the underlying genetic architecture [7,8]. Given this scale, it is clear that many research questions cannot be addressed by whole-genome sequencing of individuals, even though the sequencing costs of a human genome have now decreased below the ‘magic line’ of US$1,000 (REF. 9).

    In this Review, we discuss whole-genome sequencing of pools of individuals (Pool-seq), an approach that provides genome-wide polymorphism data at considerably lower costs than sequencing of individuals. We explain why Pool-seq is more cost-effective, compare it to other approaches, review dedicated software tools, and discuss limitations and further directions. On the basis of various intraspecific whole-genome Pool-seq studies, we demonstrate its versatility and efficacy in facilitating a broad range of genome-wide analyses. However, we do not cover the metagenomic analysis of pools consisting of multiple species, as this has been reviewed elsewhere [10].

    The cost-effectiveness of Pool-seq

    Key to population genetic surveys is information about polymorphic positions in the genome and the frequencies of variant alleles in various populations. The power of many genetic analyses increases with the accuracy to which allele frequencies can be determined from population samples. Pool-seq provides more accurate allele frequency estimation at a lower cost than sequencing of individuals [11,12]. To understand the basis of this difference, it is important to remember that allele frequencies are typically estimated from samples drawn from a larger population. Smaller sample sizes

    (1) Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, 1210 Vienna, Austria.
    (2) Vienna Graduate School of Population Genetics
    Correspondence to C.S. e-mail: [email protected]
    doi:10.1038/nrg3803
    Published online 23 September 2014

    Next-generation sequencing (NGS; also known as second-generation sequencing). An umbrella term for different sequencing platforms delivering millions of short DNA sequence reads.

    Reads. DNA sequences that are generated by next-generation sequencing.

    Sequencing pools of individuals — mining genome-wide polymorphism data without big funding

    Christian Schlötterer (1), Raymond Tobler (1,2), Robert Kofler (1) and Viola Nolte (1)

    Abstract | The analysis of polymorphism data is becoming increasingly important as a complementary tool to classical genetic analyses. Nevertheless, despite plunging sequencing costs, genomic sequencing of individuals at the population scale is still restricted to a few model species. Whole-genome sequencing of pools of individuals (Pool-seq) provides a cost-effective alternative to sequencing individuals separately. With the availability of custom-tailored software tools, Pool-seq is being increasingly used for population genomic research on both model and non-model organisms. In this Review, we not only demonstrate the breadth of questions that are being addressed by Pool-seq but also discuss its limitations and provide guidelines for users.

    APPLICATIONS OF NEXT-GENERATION SEQUENCING

    REVIEWS

    NATURE REVIEWS | GENETICS VOLUME 15 | NOVEMBER 2014 | 749

    © 2014 Macmillan Publishers Limited. All rights reserved


  • Pool-seq. A sequencing technique in which sequencing libraries are not prepared from DNA of a single individual or cell but from a mixture of DNA fragments originating from different individuals or cells. In the context of this Review, Pool-seq is used to describe the unbiased sequencing of the entire genome.

    Coverage. The number of reads that span a given genomic position.

    Sequencing libraries. Sets of fragmented DNA extracted from one or more individuals that serve as the template for subsequent sequencing.

    Exome sequencing. A sequencing approach in which the complexity of the genome is reduced through hybridization to exonic sequences, which results in a higher sequence coverage of protein-coding regions.

    Restriction-site-associated DNA markers. Sequence polymorphisms in close proximity to a restriction enzyme recognition site.

    are subject to larger sampling variance and can therefore produce considerable errors even when the allele frequencies in the sample have been determined with high accuracy. In other words, accurately sequencing a small population sample will still result in ‘noisy’ allele frequency estimates. By contrast, Pool-seq makes use of large population samples, but not all chromosomes in the samples are analysed. The higher accuracy-to-cost ratio of Pool-seq arises from the fact that very few chromosomes are sequenced more than once, whereas for sequencing of individuals each chromosome is typically sequenced multiple times (5–40 times). This advantage is clearly demonstrated in FIG. 1, in which the accuracy of Pool-seq is compared with sequencing of individuals at a fixed sequencing cost (that is, assuming that the same number of sequence reads is used in each case). Although Pool-seq mostly performs better when 50 individuals are pooled, its performance is clearly superior when pooling 100 or more individuals (FIG. 1a). Additionally, the accuracy of Pool-seq relative to sequencing of individuals increases with the coverage of individual genomes (FIG. 1b).

    The cost-effectiveness of Pool-seq becomes even more evident when the costs for the preparation of the sequencing libraries are considered: Pool-seq uses a single library for the entire sample, whereas sequencing of individuals requires a separate library to be prepared for each genome. As library construction constitutes ~20% of the total sequencing costs for species with moderate genome sizes, this is an important cost factor.
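    The sampling argument in this section can be made concrete with a small numerical sketch. It assumes a simple two-stage binomial model that is not spelled out in the text: a pool of n diploid individuals contributes 2n chromosomes drawn from the population, reads are independent draws with replacement from the pool, individually sequenced genomes are genotyped without error, and sequencing errors are ignored. The function names and example numbers are illustrative only, and this is not the PIFs software used for FIG. 1.

```python
import math


def var_individuals(p, n_individuals):
    """Sampling variance of the allele frequency estimate when n diploid
    individuals are sequenced separately and genotyped without error
    (an assumption of this sketch)."""
    return p * (1 - p) / (2 * n_individuals)


def var_pool(p, pool_size, coverage):
    """Sampling variance for Pool-seq under two-stage sampling (assumption):
    2 * pool_size chromosomes are drawn from the population, then
    `coverage` reads are drawn with replacement from those chromosomes."""
    n_chrom = 2 * pool_size
    return p * (1 - p) * (1 / n_chrom + (1 - 1 / n_chrom) / coverage)


def sd_ratio(p, pool_size, n_individuals, coverage_per_individual):
    """SD(pool) / SD(individuals) when both use the same number of reads."""
    pool_coverage = n_individuals * coverage_per_individual
    return math.sqrt(var_pool(p, pool_size, pool_coverage)
                     / var_individuals(p, n_individuals))


if __name__ == "__main__":
    # Hypothetical design: pool of 100 individuals versus 10-50 individuals
    # sequenced separately at 20x each, with identical total read numbers.
    for n_ind in (10, 20, 30, 40, 50):
        ratio = sd_ratio(p=0.5, pool_size=100, n_individuals=n_ind,
                         coverage_per_individual=20)
        print(n_ind, round(ratio, 3))
```

    Under these assumptions the ratio stays below one as long as the number of separately sequenced individuals is well below the pool size, and it exceeds one when the two numbers are equal, which mirrors the qualitative pattern described for FIG. 1a.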

    Comparison to reduced-representation sequencing

    Sequencing individuals at a high coverage is undoubtedly the ‘gold standard’ for obtaining high-quality data, but budget constraints frequently require alternatives for studying large populations. In addition to Pool-seq, other strategies have been developed for sequencing large samples (FIG. 2). Below, we compare different sequencing approaches (TABLE 1) and weigh their particular strengths and weaknesses against those of Pool-seq (TABLE 2).

    In contrast to the whole-genome approach of Pool-seq, the cost savings of these alternative approaches are achieved by reducing the representation of the genome in the sequence data. Different strategies for targeting the sequencing to specific regions of the genome can be categorized into exome sequencing [13,14], high-throughput RNA sequencing (RNA-seq) [15] and methods using restriction-site-associated DNA markers [16]. Each of these methods has been combined with pooling to further reduce costs, but each approach has its particular strengths and weaknesses (see below).

    Figure 1 | Cost-effectiveness of Pool-seq. The accuracy of allele frequency estimates is compared for whole-genome sequencing of pools of individuals (Pool-seq) and whole-genome sequencing of individuals using the ratio of the standard deviation (SD) of the estimated allele frequency with both methods. The same number of reads is used for both sequencing strategies. A value smaller than one indicates that Pool-seq is more accurate than sequencing of individuals. a | The influence of the pool size is shown. A larger pool size results in higher accuracy of Pool-seq, but Pool-seq still produces more accurate allele frequency estimates even for pool sizes of 50 individuals in most comparisons. Only when the number of sequenced individuals approaches the pool size does sequencing of individuals become the superior strategy. b | The influence of coverage and of variation in the representation of individuals in a pool is shown. With a lower coverage per individual, the advantage of Pool-seq decreases. It should be noted that with a decreasing coverage per individual, the two approaches produce very similar types of data; that is, sequencing of individuals tends to show the same limitations as Pool-seq, such as for estimating linkage disequilibrium and for distinguishing sequencing errors from low-frequency polymorphisms. Variation in the representation of individuals in the DNA pool reduces the accuracy of Pool-seq only slightly (0% (that is, all individuals are uniformly represented; orange line) and 30% (light blue line)). The graphs were generated with the PIFs software [12], ignoring sequencing errors.

    [Figure 1, panels a and b. x-axis: number of individuals sequenced separately (10–50); y-axis: SD pool/SD individuals (0.4–1.1). Legend columns: pool size, coverage per sequenced individual, and deviation in DNA content from each individual in the pool.]


  • Linkage disequilibrium (LD). Nonrandom association between alleles at two loci. In outcrossing diploid individuals, the genotypes need to be sorted into haplotypes in a statistical procedure called phasing.

    Exome sequencing is carried out by sequencing only the fraction of the genome that hybridizes to probes covering exons of genes (that is, exome capture). The rationale for this approach is to focus sequencing and analysis efforts on genomic regions that are most likely to be functionally relevant. The biggest drawback of exome capture is the high cost of exon enrichment, which will make this approach less appealing as DNA sequencing costs continue to decrease. Additionally, hybridization probes must be custom-designed for each species. Hence, the development of exome capture probes for a new species is typically too costly unless directed to a small fraction of the genome, such as genes comprising a given pathway or from a small genomic region [17].

    RNA-seq typically involves sequencing the mRNA fraction of the transcriptome, which is isolated based on the presence of poly(A) tails on the transcripts. Although the genomic locations interrogated overlap to a large extent with those generated by exome sequencing, the key differences are that only expressed transcripts are detected and that no species-specific reagents need to be developed (which makes RNA-seq more cost-effective for many species). Crucially, as the relative frequencies of alleles in the transcriptome pool are determined both by the underlying allele frequencies in the DNA and by the expression levels of these alleles, RNA-seq-based analysis of DNA polymorphisms from pooled samples is confounded by unequal levels of allele-specific gene expression. Indeed, between 18% and 66% of genes have alleles differing in their relative expression level owing to cis-regulatory effects [18,19]. Nevertheless, the impact of this effect on allele frequency estimates depends on the extent of linkage disequilibrium (LD) between the regulatory variants and the called SNPs, which may explain

    [Figure 2, panels: a, whole-genome sequencing (unpooled versus pooled); b, exome sequencing (magnetic-based exome capture); c, RAD-seq (digestion using a restriction enzyme).]

    Figure 2 | Comparison of sequencing strategies. Three different sequencing approaches are compared: whole-genome sequencing (part a), exome sequencing (part b) and restriction-site-associated DNA sequencing (RAD-seq; part c). Sequencing of individuals (left panel) is contrasted with sequencing of pools of individuals (right panel). Reads are coloured to reflect the individual from which they originate. In exome sequencing, sequencing libraries are enriched for exonic sequences (part b). RAD-seq only determines the sequence next to restriction sites, which results in stacked sequence reads (part c). Both exome sequencing and RAD-seq direct the sequencing efforts to targeted regions. This reduction in genome coverage allows a higher read count at a given genomic position and thus a more accurate allele frequency estimate at the covered genomic regions than whole-genome sequencing.


  • Genetic markers. Polymorphic loci that could be scored with a genotyping technique.

    F2 analysis. Analysis of mapping populations generated by the F2 design. The F1 progeny from crossing two phenotypically different parental strains are themselves crossed to produce an F2 population that is segregating for the phenotype of interest. The F2 mapping population may carry up to three genotypes at every marker and therefore allows the detection of additive and dominance effects, as well as interactions between loci.

    Phased genomic sequences. Genome sequences for which the haplotype phase (that is, the combination of alleles or genetic markers that coexist on a single chromosome) has been determined.

    Imputation. In statistics, it refers to the replacement of missing data with values. In genomics, it describes the use of haplotype sequences to fill in missing sequence information.

    Haplotypes. The combination of alleles or genetic markers that coexist on a single chromosome. Chromosomal regions carrying a haplotype are inherited as intact physical units until they are broken up by recombination.

    why several studies have found that pooled RNA-seq yields reliable allele frequency estimates [20]. Pooled RNA-seq is a particularly appealing approach for species with large genomes and those that lack reference genomes [21–23].
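    How allele-specific expression can distort pooled RNA-seq allele frequencies may be illustrated with a short calculation. The sketch assumes a deliberately simple model that is not given in the text: a biallelic SNP in complete LD with a cis-regulatory variant, so that one allele is transcribed at `expression_ratio` times the level of the other in every individual, and read counts that are proportional to transcript abundance. The function name and numbers are illustrative.

```python
def expected_rnaseq_frequency(dna_frequency, expression_ratio):
    """Expected frequency of an allele among pooled RNA-seq reads.

    dna_frequency    -- true frequency of the allele in the DNA pool
    expression_ratio -- expression of this allele relative to the other
                        allele (1.0 means no allele-specific expression)
    """
    weighted = dna_frequency * expression_ratio
    return weighted / (weighted + (1 - dna_frequency))


if __name__ == "__main__":
    # Hypothetical allele at a DNA frequency of 0.2 with increasing
    # over-expression relative to the alternative allele.
    for ratio in (1.0, 1.5, 2.0):
        print(ratio, round(expected_rnaseq_frequency(0.2, ratio), 3))
```

    With weaker LD between the regulatory variant and the called SNP the distortion would be smaller, which is consistent with the dependence on LD noted above.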

    Finally, a reduction in genomic complexity can be achieved by a class of sequencing protocols that in essence rely on sequencing regions flanking a restriction site [16] (for simplicity, we refer to them as restriction-site-associated DNA sequencing (RAD-seq)). RAD-seq-based allele frequency estimates may be biased if the restriction site contains polymorphisms in LD with nearby SNPs [24,25] because a polymorphic restriction site will only be cut in a fraction of the individuals. This problem becomes particularly pronounced when RAD-seq analyses are carried out on pooled DNA samples because missing reads are much harder to spot and inevitably lead to biased allele frequency estimates.
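    The bias caused by polymorphic restriction sites can be sketched with haplotype frequencies. The example assumes a hypothetical two-locus model: each chromosome carries allele A or a at the SNP and either an intact or a disrupted restriction site, and only chromosomes with an intact site yield reads. The haplotype frequencies are invented for illustration.

```python
def radseq_observed_frequency(hap_freqs):
    """Expected frequency of SNP allele 'A' among RAD-seq reads.

    hap_freqs maps (snp_allele, site_intact) -> haplotype frequency.
    Chromosomes with a disrupted restriction site are not cut and
    therefore contribute no reads (allele dropout).
    """
    intact_total = sum(f for (_, intact), f in hap_freqs.items() if intact)
    intact_a = sum(f for (allele, intact), f in hap_freqs.items()
                   if intact and allele == "A")
    return intact_a / intact_total


if __name__ == "__main__":
    # Hypothetical case: the disrupted site occurs only on 'A' chromosomes,
    # that is, the restriction-site polymorphism is in LD with the SNP.
    haplotypes = {
        ("A", True): 0.30,
        ("A", False): 0.20,
        ("a", True): 0.50,
        ("a", False): 0.00,
    }
    true_freq = sum(f for (allele, _), f in haplotypes.items()
                    if allele == "A")
    print("true:", true_freq)
    print("observed:", radseq_observed_frequency(haplotypes))
```

    In this made-up case the true frequency of A is 0.5, whereas only 0.30 of the 0.80 cut chromosomes carry A, so the read-based estimate is too low. In a pool, the missing reads leave no per-individual signal that would reveal the dropout.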

    The extent to which the reduction in complexity affects the downstream interpretation of the data strongly depends on the specific research question that is posed. For mapping the genetic basis of phenotypic traits, for example, each of these techniques has its advantages and limitations. Many studies in humans have successfully used exome sequencing to identify disease-causing protein variants [13], but exome sequencing is only informative if the focal variant is located within protein-coding sequences. Sequencing the entire genome using Pool-seq, rather than pre-defined regions, is an unbiased approach that allows the discovery of functional differences in the absence of a well-defined genomic target. The importance of such unbiased approaches has been amply demonstrated by mounting evidence that causal genetic variation underlying important phenotypes often lies within regulatory and uncharacterized genomic regions [26–28].

    Whereas Pool-seq, exome sequencing and RNA-seq can directly identify functionally relevant polymorphisms across the regions of the genome they interrogate, RAD-seq is a modern method to identify genetic markers that are usually not functionally relevant themselves. Therefore, RAD-seq relies on a high level of association (that is, high LD) between the causal polymorphism and the RAD-seq genetic marker. In experiments for which high levels of LD can be expected (for example, F2 analysis, in which limited recombination will have occurred), it may be sufficient to analyse RAD-seq markers, but it is not clear whether RAD-seq or Pool-seq is the preferred technique.

    A completely different approach to reduce sequencing costs builds on the availability of high-quality phased genomic sequences. Combining these high-quality genomes with cost-effective genotyping data [29] of additional individuals, using either SNP microarrays [30] or ultra-shallow sequencing (0.1–0.5× coverage) [31], permits the imputation of haplotypes even in the absence of full genome sequences. The effectiveness of this approach is limited by the accuracy of the imputation algorithms and how well the population is represented by the genomes that are used for imputation. Given the need for a representative reference panel of sequenced individuals, we anticipate that this method will be restricted to a few well-studied species with large genome sizes, such as humans, cattle and some crops. By contrast, Pool-seq avoids such limitations and is consequently applicable to a much broader range of species.

    Limitations and problems of Pool-seq

    Despite the appealing conceptual properties of Pool-seq, it is important to consider the weaknesses and limitations of this approach.

    Pool size. Pool-seq is designed to obtain allele frequency estimates from a large number of individuals. Importantly, applying this method to small pool sizes (

  • (for example, cheetahs or coelacanths), a much better long-term investment is to determine the genomic sequence of these individuals rather than to attempt an under-powered Pool-seq study.

    Linkage disequilibrium. The current use of Pool-seq with short sequence reads means that it is not well suited for inferring haplotypes and LD. As reads are independent draws, with replacement, from a large number of chromosomes, it is very challenging to identify reads that originate from the same haplotype. Although longer sequencing reads and the incorporation of haplotype information will improve this, at present we do not recommend Pool-seq for research questions that depend on the availability of linkage information.

    Sequencing errors. The cost advantage of Pool-seq results from the fact that every read represents an independent draw from a large pool of chromosomes. Owing to the high error rates of NGS (for example, 0.1–1% for Illumina sequencing [32]), distinguishing between sequencing errors and low-frequency alleles is a challenge. Unlike sequencing of individuals, this cannot be solved by analysing multiple reads from the same region of a single chromosome. However, various features have been recently recognized that distinguish sequencing errors from genuine sequence polymorphisms [33–35], and strategies have been developed to incorporate these features into improved SNP-calling software (TABLE 3). We anticipate that replication of pools will be instrumental to further reduce the error rate in SNP calling [36]. Moreover, for many applications, low-frequency sequencing errors have little impact on the analysis and interpretation of the data. Such applications include the identification of functionally diverged alleles in mapping experiments, characterization of alleles involved in the adaptation of populations to their local environment and the analysis of temporal allele frequency trajectories (see below).
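    Why low-frequency alleles are hard to separate from sequencing errors in a pool can be seen from a simple binomial calculation. The sketch assumes independent errors at a fixed per-base rate, one third of which produce any particular wrong base. Both this error model and the minimum-count threshold are illustrative assumptions and do not describe any of the SNP callers in TABLE 3.

```python
from math import comb


def prob_at_least(k, n, rate):
    """P(X >= k) for X ~ Binomial(n, rate)."""
    return 1.0 - sum(comb(n, i) * rate**i * (1 - rate)**(n - i)
                     for i in range(k))


def false_allele_probability(coverage, error_rate, min_count):
    """Chance that errors alone yield at least min_count reads showing one
    specific wrong base at a site (assumes errors are spread evenly over
    the three alternative bases)."""
    return prob_at_least(min_count, coverage, error_rate / 3)


if __name__ == "__main__":
    # Hypothetical site at 100x coverage with a 1% per-base error rate.
    for min_count in (1, 2, 3):
        p = false_allele_probability(coverage=100, error_rate=0.01,
                                     min_count=min_count)
        print(min_count, f"{p:.4f}")
```

    Requiring more supporting reads lowers the chance of calling an error as an allele, but it also removes genuine alleles that are present at a comparably low frequency in the pool, which is the trade-off described above.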

    Differential representation of individuals in the pool. An optimal pool contains equal amounts of DNA from each individual. However, technical errors arising from pipetting or DNA quantification, for example, will result in an imbalance in the pool composition. This problem had been recognized as an issue for genotyping using pooled DNA samples [37] and has now been revisited for Pool-seq analysis. The consensus is that the impact of differential representation of individuals on the accuracy of allele frequency estimates is not large unless sample sizes are very small [11,12,38] (FIG. 1b).
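    A rough way to see why unequal DNA contributions matter little in large pools is to convert them into an effective pool size. The sketch below assumes that chromosomes are independent draws from the population and that the pool allele frequency is the contribution-weighted mean of the individual genotypes; the effective-size formula follows from that assumption and is not taken from the cited studies. The two-level contribution pattern is a made-up example with a coefficient of variation of 30%.

```python
def effective_pool_size(contributions):
    """Number of equally represented individuals that would give the same
    sampling variance of the pool allele frequency as the given unequal
    DNA contributions (positive amounts in arbitrary units)."""
    total = sum(contributions)
    weights = [c / total for c in contributions]
    # Variance of a weighted mean scales with the sum of squared weights.
    return 1.0 / sum(w * w for w in weights)


if __name__ == "__main__":
    # Hypothetical pool of 100 individuals: half contribute 30% less DNA
    # than intended and half contribute 30% more.
    contributions = [0.7] * 50 + [1.3] * 50
    print(round(effective_pool_size(contributions), 1))
```

    In this example the pool of 100 behaves like roughly 92 equally represented individuals, a modest loss. The same relative imbalance in a very small pool leaves few effective individuals to begin with, in line with the consensus summarized above.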

    Alignment problems. A generally underappreciated problem with NGS is the misalignment of short reads to a reference genome. Biased allele frequency estimates occur when reads are either mapped to the wrong genomic position or not mapped at all if they are too divergent. Such problems are easily recognized when sequencing individuals, as these regions can be identified by variation in coverage. In Pool-seq data they are more difficult to identify, particularly for low-frequency alleles, because the resulting coverage variation is too small to be recognized. Hence, special care must be taken when selecting the appropriate mapping parameters in Pool-seq experiments. A careful evaluation of mapping parameters [39] has resulted in guidelines that will considerably reduce errors due to alignment problems (BOX 1). Structural changes in sequenced individuals relative to the reference genome may also cause problems. Copy number variation (CNV) describes the phenomenon by which individuals in a population differ in the number of copies of a given genomic region. If these copies are divergent in their sequence, then CNVs could inflate polymorphism estimates in Pool-seq analyses. Similarly, chromosomal inversions and segregating transposable elements (TEs) will affect allele frequency estimates if they are not accounted for. Importantly, specific bioinformatic tools have been developed for CNVs [40], inversions [41] and TEs [42] that not only identify but also quantify such structural changes. For standard polymorphism analyses, such problematic regions are best eliminated.

    Pool-seq applications

    Pool-seq is particularly well suited for applications that rely on large sample sizes and the analysis of multiple samples, but it can also be applied on a broader

    Table 2 | To pool or not to pool?

    Scenario Pool-seq recommended?

    Small sample size (

  • scale. We introduce below case studies that would not have been feasible using other methods (that is, when independently sequencing a large number of individuals is not practical or even achievable), in addition to studies that used Pool-seq and obtained particularly interesting results.

    Table 3 | Software overview

    Method | Comments | Ref

    SNP and/or indel calling (applicable to Pool-seq data)
    GATK Unified Genotyper | Calls indels and SNPs; owing to a generalized polyploid model it may also be used with pooled data | 118
    MAQ | Calls SNPs; may also be used to align reads | 35
    VarScan | Identifies SNPs and indels; can be used with Roche/454 and Illumina reads | 119
    snape | Bayesian SNP calling algorithm; requires a prior probability on the nucleotide diversity | 120
    CRISP | Identifies SNPs; requires multiple pools | 121
    vipR | Identifies SNPs and indels; requires multiple pools | 122
    EBM | Identifies SNPs using an empirical Bayes mixture model; implemented as R function | 123
    EM-SNP | Uses an expectation maximization algorithm for SNP discovery; slow and therefore cannot be applied to whole genomes | 124
    SNPSeeker | Identifies SNPs; requires a control sample to be inserted in each run | 125
    SPLINTER | Successor of SNPSeeker; identifies SNPs and indels; requires a synthetic library consisting of a negative control and a positive control to be inserted in each run | 126
    SNVer | Identifies SNPs; may be sensitive to high error rates | 127
    Dindel | Realigns reads and calls indels with a Bayesian method; slow (~1 variant per second) | 117
    FreeBayes | Identifies SNPs and indels; haplotype-based detection of variants using a Bayesian framework | 128
    Syzygy | Detects SNPs and indels | 129

    Identification of TEs
    PoPoolation TE | Identifies TE insertions and estimates their population frequencies | 42
    T-lex2 | Identifies TE insertions and estimates their population frequencies | 130
    TEMP | Detects the presence and absence of TE insertions; also estimates population frequencies of TE insertions | 131

    Population genetics
    PoPoolation | Estimates variation within populations | 39
    PoPoolation2 | Estimates differentiation between multiple populations | 132
    Pool-HMM | Detects selective sweeps from the allele frequency spectrum using a hidden Markov model | 133
    npstat | Computes a wide range of population genetic estimators; may be used in conjunction with an external SNP caller; every contig needs to be analysed separately | 134
    Stacks | Developed for population genomics with RAD-seq; may also be used with pooled RAD-seq data | 135
    Bayenv2 | Estimates differentiation between populations | 79
    SelEstim | Detects and measures selection | 136
    KimTree | Infers population histories | 137

    Haplotype information
    harp | Estimates frequencies of known haplotypes using read counts; supports a sliding window approach | 107
    PoolHap | Estimates frequencies of known haplotypes using a regression on allele frequencies | 106
    eALPS | Estimates the abundance of individuals in pools given the genotypes of at least some individuals | 109
    LDx | Estimates linkage disequilibrium between pairs of SNPs spanned by single-end or paired-end reads | 138

    Forward genetic screens
    SHOREmap | Identifies causative recessive variants from a large pool of recombinants that have the recessive genotype | 44
    CloudMap | A cloud-based pipeline for localizing mutations | 139
    MULTIPOOL | Identifies candidate loci from bulk segregant analysis in which progeny are grouped by phenotype and sequenced as pools | 140
    Fishyskeleton | Detects mutations in zebrafish from a pool of mutant F2 fish | 141
    NGM | A web-based tool for localizing mutations from a small pool of F2 population | 142
    SNPtrack | A web-based tool for localizing mutations using a hidden Markov model | 143

    R E V I E W S

    754 | NOVEMBER 2014 | VOLUME 15 www.nature.com/reviews/genetics

    © 2014 Macmillan Publishers Limited. All rights reserved

    http://www.broadinstitute.org/gatk/
    http://maq.sourceforge.net/
    http://varscan.sourceforge.net/
    http://code.google.com/p/snape-pooled/
    http://sites.google.com/site/vibansal/software/crisp
    http://sourceforge.net/projects/htsvipr/
    http://sites.google.com/site/zhouby98/ebm
    http://www-rcf.usc.edu/~fsun/Programs/EM-SNP/EM-SNP.html
    http://genetics.wustl.edu/rmlab/software/
    http://www.ibridgenetwork.org/wustl/splinter
    http://snver.sourceforge.net/
    http://www.sanger.ac.uk/resources/software/dindel/
    http://github.com/ekg/freebayes
    http://www.broadinstitute.org/software/syzygy/
    http://code.google.com/p/popoolationte/
    http://petrov.stanford.edu/cgi-bin/Tlex.html
    http://github.com/JialiUMassWengLab/TEMP.git
    http://code.google.com/p/popoolation/
    http://code.google.com/p/popoolation2/
    http://qgsp.jouy.inra.fr/
    http://code.google.com/p/npstat/
    http://creskolab.uoregon.edu/stacks/
    http://gcbias.org/bayenv/
    http://www1.montpellier.inra.fr/CBGP/software/selestim/index.html
    http://www1.montpellier.inra.fr/CBGP/software/kimtree/index.html
    http://bitbucket.org/dkessner/harp
    http://sites.google.com/site/quanlongresearch/tools
    http://sourceforge.net/projects/ealps/
    http://sourceforge.net/projects/ldx/
    http://1001genomes.org/software/shoremap.html
    http://hobertlab.org/cloudmap/
    http://github.com/matted/multipool
    http://fishbonelab.org/
    http://bar.utoronto.ca/NGM/
    http://genetics.bwh.harvard.edu/snptrack/

  • Pool genome-wide association studies (Pool-GWASs). Genotype–phenotype mapping studies in which phenotypically extreme individuals are grouped and sequenced as pools. Causative variants are identified by contrasting the allele frequencies between the pools.

    Evolve and resequence studies. Studies that combine experimental evolution with next-generation sequencing. They make use of controlled environmental, demographic and selective variables to facilitate genotype–phenotype mapping.

    Forward genetics. An approach in which mutations induced by random mutagenesis that lead to the disruption of gene function are identified based on their phenotypes. The causative mutation is traditionally identified by positional cloning or by a candidate-gene approach.

    Genotype–phenotype mapping

    The fundamental goal of genetics is to elucidate how the heritable component of any given phenotype is encoded by the DNA. Although the theoretical and methodological basis of genotype–phenotype mapping was already well developed nearly 100 years ago43, recent technological innovations have substantially advanced the field. In particular, NGS provides high-throughput genotyping of whole genomes at a low cost. Here, we review different applications of Pool-seq that aim to uncover the genotype–phenotype link. All of these share a common principle that had already been introduced several years prior to the development of NGS37: namely, dividing sets of individuals into groups with different phenotypes and contrasting the allele frequencies between them. As long as the pools of individuals are large enough and the population under study is randomly mating, genetic variants that do not contribute to the trait are expected to have the same frequency in both pools. However, causal variants or linked polymorphisms will differ in frequency between pools. Pool-seq-based mapping studies fall broadly into four classes: identification of induced mutations, identification of naturally occurring variants, pool genome-wide association studies (Pool-GWASs), and evolve and resequence studies.
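The shared principle of contrasting allele frequencies between phenotypically distinct pools can be illustrated with a minimal Python sketch. The function name, the example read counts and the choice of a two-proportion z statistic are illustrative assumptions, not a method prescribed in this Review. The statistic treats reads as independent draws and ignores the additional sampling of individuals into each pool, so it is likely to overstate significance for small pools.

```python
import math

def allele_freq_contrast(alt1, depth1, alt2, depth2):
    """Contrast one SNP between two pools.

    alt1, alt2     -- reads carrying the alternative allele in pool 1 and pool 2
    depth1, depth2 -- total reads covering the SNP in each pool
    Returns (freq1, freq2, z), where z is a two-proportion z statistic.
    """
    if depth1 <= 0 or depth2 <= 0:
        raise ValueError("read depth must be positive in both pools")
    freq1 = alt1 / depth1
    freq2 = alt2 / depth2
    # Frequency under the null hypothesis that both pools share one frequency.
    pooled = (alt1 + alt2) / (depth1 + depth2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / depth1 + 1 / depth2))
    z = (freq1 - freq2) / se if se > 0 else 0.0
    return freq1, freq2, z

# Hypothetical counts: 30 of 100 reads in one pool, 60 of 100 in the other.
f_low, f_high, z = allele_freq_contrast(30, 100, 60, 100)
print(f"pool 1: {f_low:.2f}, pool 2: {f_high:.2f}, z = {z:.2f}")
```

Variants unrelated to the trait should give z values near zero, whereas causal or linked variants should stand out; in practice, a genome-wide analysis would also need to correct for multiple testing.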

    Mapping-by-sequencing of induced mutations using bulk segregant analysis. Mutagenesis screens are the major approach of forward genetics. Although this approach has excelled in uncovering interesting phenotypes, the subsequent genome mapping and identification of the causal mutations were slow and labour-intensive prior to the adoption of NGS. By sequencing a pool of mutant F2 plants (in an approach termed bulk segregant analysis (BSA)) followed by a clever bioinformatic analysis, one study44 was able to greatly expedite this process and identified the causal mutation in less than eight working days of hands-on time44,45. This method has subsequently been modified to work for pooled RNA-seq46,47 and species without high-quality reference genomes48.

    Mapping of naturally occurring functionally diverged alleles using bulk segregant analysis. This approach is conceptually similar to that above, except that the mapped variants are naturally occurring alleles that affect the phenotype of interest rather than induced mutations. Furthermore, as the two wild-type parental strains often differ at several causal loci, mapping the natural causal variation is a considerably more challenging task than mapping one or two induced causal mutations.

    A recent study used computer simulations of bulk segregant experiments to show that by using very large F2 populations (>10^5 individuals) this approach provides sufficient power to detect even several small-effect loci49. Taking advantage of the accurate allele frequency estimates from large pools of Saccharomyces cerevisiae, the authors identified 14 loci associated with resistance against the DNA damaging agent 4-nitroquinoline 1-oxide (4-NQO), which accounted for 59% of the phenotypic variance49. Additional BSAs in yeast have identified and functionally validated genes that are responsible for xylose metabolism50 and ethanol tolerance51, even with moderate F2 sample sizes.

    It is well understood that many traits are modulated by epistatic interactions with the genetic background. As this complicates the identification of causal loci in mapping populations with randomized genetic backgrounds52, an alternative BSA approach is to introgress the loci of interest into a homogeneous genetic background before trait mapping. This strategy was pursued to investigate the genetic basis of Morinda citrifolia

    Table 3 (cont.) | Software overview

    Method Comments Ref

    Microbiology

    QuRe Reconstructs viral quasispecies 144

    ShoRAH Estimates genetic heterogeneity of samples; suitable for analyses of viruses, bacteria and tumours 145

    Mixed infection estimator Identifies mixed infections of bacteria using whole-genome data 146

    ViSpA Infers virus quasispecies spectra from 454 sequencing data 147

    V-Phaser-2 Infers intrahost diversity in viral populations 148

    HaploClique Reconstructs viral quasispecies 149

    QuasiRecomb Reconstructs viral quasispecies using a jumping hidden Markov model 150

    PredictHaplo Reconstructs virus haplotypes 151

    Other applications

    PoPoolation DB A user-friendly database for investigating natural polymorphism and associated functional annotations 152

    PIFs A tool set for evaluating different pooling designs 12

    Psafe A workflow for obtaining unbiased allele frequency estimates 153

    PyClone Identifies and quantifies the prevalence of clonal populations in cancer 154

    Indel, insertion and deletion; Pool-seq, whole-genome sequencing of pools of individuals; RAD-seq, restriction-site-associated DNA sequencing; RNA-seq, high-throughput RNA sequencing; SNP, single-nucleotide polymorphism; TE, transposable element.


    http://sourceforge.net/projects/qure/
    http://www.bsse.ethz.ch/cbg/software/shorah
    http://dx.doi.org/10.1371/journal.pcbi.1003059.s007
    http://alan.cs.gsu.edu/NGS/?q=content/vispa
    http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/v-phaser-2
    http://github.com/armintoepfer/haploclique
    http://github.com/armintoepfer/QuasiRecomb/releases
    http://bmda.cs.unibas.ch/HivHaploTyper/
    http://www.popoolation.at/pgt/
    http://www1.montpellier.inra.fr/CBGP/software/PoolSeqUtils/
    http://bioinformatics.med.yale.edu/group/
    http://compbio.bccrc.ca/software/pyclone/

  • Bulk segregant analysis (BSA). Analysis in which offspring from diverged parents are phenotyped and the DNA of individuals from opposing tails of the phenotypic distribution is combined (pooled). Causative variants are identified by contrasting allele frequency differences among the pools.

    Epistatic interactions. Non-additive interactions between genes in which the effect of an allele at one locus is modified by the genotypes at other loci in the genome. The resulting phenotype is different from that expected by summing the independent effects of the individual loci.

    Introgress. Introducing a genomic region from one strain or species into that of another by repeated backcrossing. By selecting for the phenotype of interest, the genomes become isogenic except for the chromosomal regions causing the selected phenotype.

    Paired-end reads. DNA fragments that were sequenced from both ends, yielding pairs of reads that are separated by a defined distance that is dependent on the library preparation protocol.

    Soft clipping. Substrings at either end of a read that a local alignment algorithm leaves unaligned and that are thereby excluded from the subsequent analysis.

    Proper pairs. Paired-end reads in which both reads of a pair can be mapped to the same chromosome within a distance pre-specified by the insert size chosen during library preparation.

    Broken pairs. Paired-end reads that do not map as proper pairs.

    Mapping quality. The negative log (base 10) of the probability that a read is incorrectly mapped, multiplied by 10.
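The logarithmic scale used for mapping quality (and, likewise, for base quality) can be made concrete with a short Python sketch. The function names are illustrative assumptions; the conversion itself is the standard Phred-style relation Q = −10 × log10(P).

```python
import math

def quality_to_error_prob(quality):
    """Convert a Phred-scaled quality score into an error probability."""
    return 10 ** (-quality / 10)

def error_prob_to_quality(prob):
    """Convert an error probability (0 < prob <= 1) into a Phred-scaled score."""
    if not 0 < prob <= 1:
        raise ValueError("probability must lie in (0, 1]")
    return -10 * math.log10(prob)

# A few commonly used thresholds and the error probabilities they imply.
for q in (10, 20, 30):
    print(f"Q{q}: error probability {quality_to_error_prob(q):.4f}")
```

On this scale, a mapping quality of 20 corresponds to a 1-in-100 chance that the read is placed incorrectly.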

    Box 1 | Pool-seq: best practice

    The analysis of data obtained by whole-genome sequencing of pools of individuals (Pool-seq) is a rapidly growing field, and new tools are continuously being developed. Therefore, we caution that recommendations listed here are also a moving target that needs to be continuously challenged, preferentially by validation studies. Furthermore, the optimal experimental design will depend on the biological systems being investigated and the purpose of the study.

    Number of individuals included in a pool: >40
    The accuracy of Pool-seq increases with the number of individuals included in the pool because the sampling error and the influence of unequal representation of individuals in the pool are reduced. At least 40 diploid individuals should be used11,12,38.

    Depth of coverage: >50×
    Reliable allele frequency estimates require a sufficiently high sequencing coverage to reduce the sampling error, which in turn depends on the allele frequency. Furthermore, a higher coverage not only facilitates the identification of sequencing errors but also provides more power to detect allele frequency differences. Therefore, we recommend a coverage of at least 50-fold for single-nucleotide polymorphism (SNP)-based tests and caution that some applications may require a 200-fold coverage110. A lower coverage is sufficient if windows containing multiple SNPs39 or large inversions111 are analysed.
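How pool size and coverage jointly limit the precision of an allele frequency estimate can be sketched with a simple two-stage sampling model. This is an illustration under stated assumptions, not the statistical framework of the cited studies: individuals contribute equal amounts of DNA, reads are independent draws from the pool, and sequencing errors are ignored. Under these assumptions, for a true frequency p, n diploid individuals and coverage c, the variance of the estimate is p(1 − p)[1/(2n) + (1 − 1/(2n))/c]. The function name is hypothetical.

```python
import math

def poolseq_freq_variance(p, n_diploid, coverage):
    """Variance of a Pool-seq allele frequency estimate (idealized model).

    Stage 1: 2 * n_diploid chromosomes are sampled from the population.
    Stage 2: `coverage` reads are drawn independently from the pooled DNA.
    """
    chromosomes = 2 * n_diploid
    pool_term = 1 / chromosomes                    # sampling individuals
    read_term = (1 - 1 / chromosomes) / coverage   # sampling reads
    return p * (1 - p) * (pool_term + read_term)

# Compare a few hypothetical designs at an intermediate allele frequency.
for n, c in [(40, 50), (40, 200), (100, 50), (100, 200)]:
    sd = math.sqrt(poolseq_freq_variance(0.5, n, c))
    print(f"n = {n:>3} individuals, coverage = {c:>3}x: SD = {sd:.4f}")
```

The two terms show why neither a large pool nor a high coverage helps much on its own: whichever sampling stage is smaller dominates the error.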

    Sequencing technology: using a read length of >75 nucleotides and paired-end reads
    As mapping accuracy is improved by longer paired-end reads, we recommend using paired-end reads of at least 75 nucleotides. Furthermore, PCR duplicates are more reliably identified if paired-end reads are used.

    Preprocessing of reads: trimming
    The increased error rate towards the 3ʹ end of Illumina reads could impair downstream analyses such as variant calling112. Therefore, we suggest trimming reads with one of the available software tools39,113.

    Mapping: using a conspecific reference genome and global alignment; allowing for gaps and disabling seeding
    Whenever possible, heterologous reference genomes should not be used, as even closely related species often harbour diverged genomic regions that may cause alignment artefacts83,114. For non-model organisms with large genome sizes, RNA-sequencing-based de novo assemblies may be a viable strategy72. Soft clipping (the exclusion of terminal bases with mismatches) should be avoided, as this leads to biased allele frequency estimates39,115. Thus, semi-global alignment algorithms should be used (as implemented in BWA ALN35 and Bowtie2 (REF. 116)). In addition, allowing for gaps increases the mapping accuracy39. Realignment of unmapped reads could improve the coverage of diverged regions, but soft clipping will be introduced for these reads (an example of a realignment tool that uses soft clipping is BWA SAMPE35). The ‘seeding’ step, which was introduced as a heuristic to accelerate mapping, should be avoided because it discriminates against diverged reads and could introduce bias into allele frequency estimates.

    Filtering: using proper pairs and a mapping quality of >20
    The mapping precision is higher when both reads of a read pair can be mapped (that is, when they are proper pairs); therefore, broken pairs should be filtered. Rather than relying on uniquely mapped reads, it is preferable to filter reads by mapping quality, as this takes the base quality of mismatches into account35. We recommend a minimum mapping quality of 20.
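The filtering recommendation can be sketched for alignments stored in the SAM text format, in which the second column is a bitwise FLAG (bit 0x2 marks a proper pair, bit 0x4 an unmapped read) and the fifth column is the mapping quality. The function name and the example records are hypothetical, and the threshold is treated as a minimum (20 or higher); dedicated tools would normally be used for real data.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: both reads of the pair aligned as expected
UNMAPPED = 0x4     # SAM FLAG bit: the read itself is unmapped

def keep_alignment(sam_line, min_mapq=20):
    """Return True if a SAM record is a proper pair with sufficient MAPQ.

    Header lines (starting with '@') are always kept.
    """
    if sam_line.startswith("@"):
        return True
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    mapq = int(fields[4])
    if flag & UNMAPPED:
        return False
    if not (flag & PROPER_PAIR):
        return False  # broken pair
    return mapq >= min_mapq

# Hypothetical records: a good proper pair, a low-quality read, a broken pair.
records = [
    "r1\t99\t2L\t100\t60\t75M\t=\t300\t275\t*\t*",
    "r2\t99\t2L\t150\t10\t75M\t=\t350\t275\t*\t*",
    "r3\t65\t2L\t200\t60\t75M\t3R\t900\t0\t*\t*",
]
kept = [line.split("\t")[0] for line in records if keep_alignment(line)]
print(kept)
```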

    Indels: realigning reads spanning indels or ignoring regions around indels
    Reads mapped to insertions and deletions (indels) are frequently misaligned, especially if the ends of reads span an indel33. To avoid false SNPs, we recommend either realigning reads covering an indel117,118 or excluding bases flanking the indel39.

    Duplicates: removing duplicates
    It is frequently recommended to remove PCR duplicates12, but only preferential amplification of one allele will result in a biased allele frequency estimate.

    CNV: filtering for CNVs or using a maximum coverage
    Copy number variations (CNVs) may lead to false-positive SNPs when multiple slightly diverged copies of a genomic region are collapsed during mapping. CNVs may be detected either with specialized software40 or by excess sequence coverage and should be removed from the analyses.
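A maximum-coverage filter of the kind mentioned above can be sketched as follows. The cutoff rule (a multiple of the median coverage) and the default factor of 2 are illustrative assumptions; the text does not specify how the maximum should be chosen, and a quantile-based cutoff would be an equally plausible choice.

```python
import statistics

def excess_coverage_sites(coverage, factor=2.0):
    """Return indices of sites whose coverage exceeds factor * median.

    coverage -- per-site read depths along a contig
    factor   -- hypothetical multiplier defining 'excess' coverage
    """
    cutoff = factor * statistics.median(coverage)
    return [i for i, depth in enumerate(coverage) if depth > cutoff]

# Hypothetical depths: two sites with roughly threefold coverage,
# as might arise when diverged copies collapse onto one reference region.
depths = [50, 52, 48, 51, 140, 150, 49, 50]
flagged = excess_coverage_sites(depths)
print(flagged)
```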

    Coverage heterogeneity
    Heterogeneous sequence coverage results in unequal power to detect allele frequency differences if it is not accounted for. Thus, it is recommended either to use more complex models that account for this79 or to subsample to a homogeneous coverage over the entire genome.
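Subsampling to a homogeneous coverage can be sketched per site as drawing a fixed number of reads without replacement from the observed allele counts. The function name, the handling of under-covered sites (returned as None so that they can be skipped) and the example counts are assumptions made for illustration.

```python
import random

def subsample_counts(ref_count, alt_count, target, rng):
    """Subsample one site to exactly `target` reads without replacement.

    Returns (ref, alt) counts after subsampling, or None if the site
    has fewer than `target` reads and should be excluded.
    """
    depth = ref_count + alt_count
    if depth < target:
        return None
    reads = ["ref"] * ref_count + ["alt"] * alt_count
    picked = rng.sample(reads, target)
    alt = picked.count("alt")
    return target - alt, alt

rng = random.Random(2014)  # fixed seed so the sketch is reproducible
sites = [(60, 40), (120, 30), (20, 10)]  # hypothetical (ref, alt) counts
for ref, alt in sites:
    print((ref, alt), "->", subsample_counts(ref, alt, 50, rng))
```

Sampling without replacement keeps the subsampled counts consistent with the reads that were actually observed, at the price of discarding information at deeply covered sites.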

    Variant detection: using a variant-calling algorithm that accounts for strand bias
    In addition to ad hoc strategies (that is, strategies in which a minimum sequence quality is combined with a minimum fraction of reads supporting a SNP), it is also possible to use one of the several available tools for variant detection (TABLE 3). We note that it is also important to take other features that are frequently associated with false SNPs into account: only SNPs that are occurring at similar frequencies on both strands33 (that is, those not displaying strand bias) and that are also located in the central region of a read should be considered reliable. Examples of suitable variant callers include the GATK Unified Genotyper118 and VarScan119.
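One way to check whether a candidate SNP occurs at similar frequencies on both strands is a two-sided Fisher's exact test on the 2 × 2 table of allele (reference or alternative) by strand (forward or reverse). The sketch below is an illustration using only the Python standard library; it is not the strand-bias filter of any particular variant caller, and the function name and example counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(ref_fwd, alt_fwd, ref_rev, alt_rev):
    """Two-sided Fisher's exact test for a 2x2 strand-by-allele table."""
    row_fwd = ref_fwd + alt_fwd
    row_rev = ref_rev + alt_rev
    col_ref = ref_fwd + ref_rev
    total = row_fwd + row_rev
    denom = comb(total, col_ref)

    def prob(x):
        # Hypergeometric probability of x reference reads on the forward strand.
        return comb(row_fwd, x) * comb(row_rev, col_ref - x) / denom

    lowest = max(0, col_ref - row_rev)
    highest = min(row_fwd, col_ref)
    p_obs = prob(ref_fwd)
    # Sum all tables that are at most as probable as the observed one.
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p_value)

# Hypothetical counts: a balanced SNP and one seen on the forward strand only.
print(fisher_exact_two_sided(20, 10, 22, 9))
print(fisher_exact_two_sided(25, 12, 30, 0))
```

A small P value indicates that the alternative allele is unevenly distributed between strands, which is the signature of strand bias described above.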


  • Base quality. The negative log (base 10) of the probability that a given base call is incorrect, multiplied by 10.

    Insertions and deletions (Indels). DNA sequences that have been inserted into or deleted from a genomic region. As only phylogenetic analysis allows the distinction between insertions and deletions, indel is used as a neutral term.

    Strand bias. The tendency of a variant to occur significantly more often within reads that originate from one of the two strands of DNA.

    GWASs. Trait mapping studies that rely on a statistical test to determine associations between sequence variants and a given phenotype in natural populations.

    Cline. The gradual change in phenotypes or allele frequencies along a geographical or environmental gradient.

    aversion behaviour in Drosophila simulans53. After 15 generations of backcrossing DNA from this species into Drosophila sechellia, which does not avoid the Morinda fruit, and selecting for the aversion phenotype, the introgressed individuals were subjected to Pool-seq. Six candidate loci, which accounted for 75% of the total genetic variation, were subsequently detected by contrasting the introgressed D. simulans alleles against the isogenic D. sechellia background.

    Pool genome-wide association studies. The BSA-based mapping strategies discussed above rely on genetic variation of two divergent parental genotypes, whereby the mapping precision is determined by recombination events that occur during the experiment. By contrast, GWASs offer the advantage of surveying all the genetic variation present in a population that has been randomized by past mating events over multiple generations. These advantages were combined with the cost-effectiveness of Pool-seq in a recent study on female abdominal pigmentation in D. melanogaster54. By sequencing pools of females with very light and very dark abdominal pigmentation, two small genomic regions near the pigmentation genes tan and bric-à-brac 1 were identified as being associated with the pigmentation phenotype (FIG. 3a). The remarkable mapping accuracy of this approach is best illustrated by the tan locus. The three SNPs that are most significantly associated with the trait are located within a short haplotype (

  • [Figure 3 graphic (panels a–f); the image is not recoverable from the extracted text. Surviving labels, with panel assignments inferred from the figure citations in the text: a, significance (−log10 P) along chromosomes X and 3L, with tan and bric-à-brac 1 marked and FDR = 0.05; b, Atlantic herring and Baltic herring populations; c, significance (ZHp), with TSHR and other candidate genes marked; d, nucleotide diversity (π) against position on chromosome 3R (Mb) for D. melanogaster and D. mauritiana, with the centromere indicated; e, mutation frequency against time (generations) from the ancestor, with ecotype niche breadth; f, time course including chemotherapy.]


  • Domestication

    In the process of domestication, humans have selected animals and plants for improved trait performance (for example, milk yield). The genomic analysis of domesticated populations therefore provides a unique opportunity to identify the genomic basis of selected traits and to determine whether similar genes are affected in different species. So far, only a moderate number of studies have used Pool-seq to identify genes selected during the domestication process by contrasting allele frequencies in domestic breeds and wild populations59,80–82. In this way, strong selection signatures were detected for loci associated with morphological changes in the domestic pig80 and for genes associated with starch digestion, fat metabolism and brain function in dogs81. One particularly striking example of a domestication-specific adaptation was found for the thyroid-stimulating hormone receptor (TSHR) gene in domestic chickens (FIG. 3c), which has a central role in regulating metabolism and photoperiod control59. The selective advantage of this variant in domestic chickens is probably connected to seasonal reproduction, which is absent in domestic chicken populations but present in wild populations.

    Genome evolution

    Recombination landscape and polymorphism. Recombination is a central factor in determining the distribution of genomic variation in natural populations. Regions of low recombination tend to harbour few polymorphisms owing to LD between selected and neutral variants. A comparison of genome-wide polymorphism between Pool-seq samples of D. melanogaster and Drosophila mauritiana revealed a striking difference between these two species. Whereas polymorphism in D. melanogaster steadily decreases towards the centromere, variability in D. mauritiana remains high along the entire chromosome and only drops sharply very close to the centromere83 (FIG. 3d). This pattern is consistent with differences in the recombination landscape between the two species and provides a more detailed confirmation of the scenario first suggested by a previous study that used fewer markers84.

    Transposable elements. TEs are a major evolutionary agent. They contribute to the origin of new genes, facilitate chromosomal inversions and contribute to adaptation in natural populations85,86. By combining Pool-seq with a specialized bioinformatic analysis, one study42 characterized the genomic distribution of TE insertions and their population frequencies in a natural D. melanogaster population. On the basis of this pattern, the authors concluded that TE activity varies over evolutionary timescales in D. melanogaster, and new TEs tend to be more active shortly after their acquisition by horizontal gene transfer.

    Selective sweeps. New beneficial mutations are expected to increase in frequency in a population before ultimately becoming fixed. This event, called a selective sweep, leaves a characteristic signature in the surrounding polymorphism landscape: a ‘valley’ of reduced variability. The width of the valley depends on the recombination rate and selective advantage of the new allele. As Pool-seq provides an excellent tool to measure variability along the chromosome, it is well suited for detecting selective sweeps87. Indeed, several Pool-seq studies59,80,83,88 have detected such valleys of reduced variability around putative targets of selection.
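Measuring variability along a chromosome can be sketched as a windowed estimate of nucleotide diversity (π) computed from allele counts at biallelic sites. This is a simplified illustration: the per-site term 2ab/(c(c − 1)), for a and b reads of the two alleles and coverage c = a + b, is the probability that two distinct reads differ, and it omits the corrections for pool size and sequencing errors that dedicated tools may apply. The function names and the toy data are hypothetical.

```python
def site_diversity(ref_count, alt_count):
    """Probability that two distinct reads at a site carry different alleles."""
    depth = ref_count + alt_count
    if depth < 2:
        return 0.0
    return 2 * ref_count * alt_count / (depth * (depth - 1))

def windowed_diversity(sites, chrom_length, window):
    """Average per-base diversity in consecutive, non-overlapping windows.

    sites -- dict mapping a 0-based position to (ref_count, alt_count);
             positions absent from the dict are treated as monomorphic.
    """
    n_windows = (chrom_length + window - 1) // window
    totals = [0.0] * n_windows
    for pos, (ref, alt) in sites.items():
        totals[pos // window] += site_diversity(ref, alt)
    result = []
    for i, total in enumerate(totals):
        length = min(window, chrom_length - i * window)  # last window may be short
        result.append(total / length)
    return result

# Toy example: the third window contains no variation (a candidate 'valley').
toy_sites = {5: (5, 5), 15: (9, 1), 25: (10, 0)}
pi = windowed_diversity(toy_sites, chrom_length=30, window=10)
print(pi)
```

Windows whose diversity falls well below the chromosomal background would then be candidates for further inspection, bearing in mind that low recombination also reduces variability.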

    Trajectories of selected alleles

    Time-series analyses are a powerful tool for determining the forces operating on alleles in evolving populations. Rather than evaluating only the end point of an experiment, information about allele frequency dynamics also becomes available. This information can be used to obtain improved selection coefficient estimates89 and to test whether the trajectories of selected alleles follow theoretical expectations. Given that such studies necessarily comprise genomic data from multiple time points and often include several independent replicates, these analyses are currently only possible with Pool-seq.

    Clonal interference. Competition of beneficial alleles in asexual organisms has been termed clonal interference. Recent time-series analyses of large experimental populations of Escherichia coli and S. cerevisiae have provided an unprecedented high-resolution depiction of clonal interference90–92. Importantly, these studies demonstrated that, along with the selected ‘driver’

    Figure 3 | Pool-seq applications. Whole-genome sequencing of pools of individuals (Pool-seq) is a versatile technology that may be used for a wide range of applications. a | Pool genome-wide association study (Pool-GWAS) was used to examine female abdominal pigmentation in Drosophila melanogaster. Contrasting the allele frequencies in pools of light and dark females identified candidate single-nucleotide polymorphisms (SNPs) with an exceptionally high mapping resolution (

  • Hitchhiking. The population genetic mechanism by which a neutral, or in some cases slightly deleterious, mutation increases in population frequency solely as a result of physical linkage with a positively selected mutation.

    mutations, a substantial number of neutral hitchhiking ‘passenger’ mutations also arose in these clonal cohorts90. In addition to clonal interference, one study of yeast also found an evolutionary trade-off for variants that were selected in a constant nutrient environment92. About half of the beneficial alleles were caused by loss-of-function mutations in pathways controlling organismal growth. However, when subjected to a variable nutrient supply, these mutations were no longer beneficial but became deleterious instead92.

    Plateauing of selected alleles. Although in most experiments on E. coli and yeast novel beneficial mutations are either lost or fixed, more complex dynamics have been discovered in studies that rely on standing genetic variation that is already present at the beginning of the experiment. Several selected genomic regions were identified93 by following allele frequency trajectories in a polymorphic yeast population. Interestingly, for the majority of the selected genomic regions, the selection coefficient changed during the experiment, which resulted in the selected alleles plateauing at intermediate frequencies93,94. Although it is not clear which evolutionary forces prevented selected alleles from reaching fixation, it is noteworthy that the same phenomenon has also been observed in two different D. melanogaster laboratory experiments60,66. Some potential insights into this phenomenon are provided by a study that compared allele frequencies in a natural D. melanogaster population that was sampled in the spring and autumn over three consecutive years. The authors identified several SNPs whose frequencies cycled seasonally while remaining at intermediate values. As these SNPs are associated with chill-coma recovery and starvation resistance, both of which are seasonally variable traits95, temporally variable selection may explain why some favoured alleles do not become fixed in natural populations.

    Dynamics of ecologically diverged clones. Several recent studies have used Pool-seq to elucidate the dynamics of diverged clones in evolving experimental populations by tracking allele frequency changes throughout the experiment. In this way, the dynamics of three ecologically diverged morphs of Burkholderia cenocepacia (a bacterial pathogen of the cystic fibrosis lung) were studied in a biofilm community96 (FIG. 3e). Although the relative frequency of the three morphs remained fairly stable throughout the experiment, the authors found several incidences of new mutations in the common morph, which resulted in a phenotype switch to one of the rarer morphs. Interestingly, these new variants often had higher fitness than the resident morphs, indicating a complex pattern of intramorph and intermorph clonal interference.

    A striking example of how an environmental change can affect the dynamics of diverged clonal lineages comes from Wolbachia pipientis, which is a bacterial endosymbiont of D. melanogaster97. Three Wolbachia clades, all infecting the same Portuguese D. melanogaster population, showed habitat-specific dynamics in the laboratory. In a hot environment the frequency of the three clades did not change, whereas one haplotype increased markedly in frequency from ~25% to ~80% in less than 15 generations in a cold environment97.
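A frequency change of this kind can be translated into a rough selection coefficient. The sketch below assumes a simple deterministic model of a clonally inherited variant under constant selection, in which the log-odds of the variant's frequency increase by approximately s per generation; drift, changes in selection over time and competition among more than two clades are ignored. The calculation is an illustration only and is not taken from the cited study.

```python
import math

def logit(p):
    """Log-odds of a frequency strictly between 0 and 1."""
    return math.log(p / (1 - p))

def selection_coefficient(p_start, p_end, generations):
    """Per-generation selection coefficient under a constant-s clonal model.

    Uses the approximation logit(p_t) = logit(p_0) + s * t.
    """
    if generations <= 0:
        raise ValueError("generations must be positive")
    return (logit(p_end) - logit(p_start)) / generations

# Approximate values quoted in the text: ~25% to ~80% within 15 generations.
s = selection_coefficient(0.25, 0.80, 15)
print(f"estimated s per generation: {s:.3f}")
```

Because the change occurred in fewer than 15 generations, a value obtained in this way would be a lower bound under the stated assumptions.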

    Cancer genomics

    An emerging trend in cancer research is to place cancer within an evolutionary framework98,99. In this context, cancer is recast as the product of natural selection on new somatic mutations arising from a common clonal ancestor, which takes place within the broader cellular ecosystem of the organism. To this end, determining the frequency spectra of somatic mutations in cancer tissue, preferably across multiple time points, is key to obtaining a deep understanding of cancer aetiology. As extraction of individual cells from tumours is difficult, pooling of cancer tissue is typically the default experimental design. Pool-seq data have been able to capture the rich and complex evolutionary patterns in the different stages of cancer development100 (FIG. 3f), including that tumours often comprise heterogeneous mixtures of subclones that carry different causal (that is, driver) mutations, which must compete for a shared pool of resources and show differential abilities to survive treatment100–102 (see REFS 103,104 for recent reviews). Moreover, by contrasting cancer tissues at different developmental stages, the driver mutations responsible for crucial events such as tumorigenesis, metastasis and relapse may be identified. Such information will be essential in the design of new therapeutic methods and the prescription of personalized treatment plans.

    Future directions

    From the broad range of published Pool-seq applications that rely on sequencing large pools of individuals from multiple populations or generations, it is apparent that it will not be feasible to shift all experiments to the analysis of sequences from separate individuals even with further reductions in sequencing costs. Thus, cost-effective sequencing methods, such as Pool-seq, will remain an important research tool for species with sufficiently large population sizes that permit the acquisition of adequate sample sizes. We anticipate that the following four applications in particular will continue to benefit from Pool-seq. The first two are time-series analyses on replicated evolving laboratory populations and on natural populations in a changing environment (for example, global warming or seasonal variation). Next are applications that link environmental variation to the genetic composition of locally adapted populations, which will require more and larger population samples to better infer how the genome is being shaped by ecological factors. The final application concerns studies for which the separation of individual samples is technically challenging. This includes cancer genomics and the analysis of intraspecific microorganismal diversity in the natural environment.

    Given the anticipated continued interest in Pool-seq, we expect that several ongoing developments will make Pool-seq an even more attractive tool in the future. First, the availability of new dedicated software tools will


  • 1. Ellegren, H. Genome sequencing and population genomics in non-model organisms. Trends Ecol. Evol. 29, 51–63 (2014).

    2. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).

    3. International HapMap 3 Consortium et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).

    4. 1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).

    5. Weigel, D. & Mott, R. The 1001 genomes project for Arabidopsis thaliana. Genome Biol. 10, 107 (2009).

    6. Daetwyler, H. D. et al. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nature Genet. 46, 858–865 (2014).

    7. Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).

    8. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).

    9. Sheridan, C. Illumina claims $1,000 genome win. Nature Biotech. 32, 115 (2014).

    10. Weinstock, G. M. Genomic approaches to studying the human microbiota. Nature 489, 250–256 (2012).

    11. Futschik, A. & Schlötterer, C. The next generation of molecular markers from massively parallel sequencing of pooled DNA samples. Genetics 186, 207–218 (2010). This study is the first to provide a statistical framework for the analysis of Pool-seq data in population genetics.

    12. Gautier, M. et al. Estimation of population allele frequencies from next-generation sequencing data: pool-versus individual-based genotyping. Mol. Ecol. 22, 3766–3779 (2013).

    13. Bamshad, M. J. et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nature Rev. Genet. 12, 745–755 (2011).

    14. Gilissen, C., Hoischen, A., Brunner, H. G. & Veltman, J. A. Disease gene identification strategies for exome sequencing. Eur. J. Hum. Genet. 20, 490–497 (2012).

    15. Wang, Z., Gerstein, M. & Snyder, M. RNA-seq: a revolutionary tool for transcriptomics. Nature Rev. Genet. 10, 57–63 (2009).

    16. Davey, J. W. et al. Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nature Rev. Genet. 12, 499–510 (2011).

    17. Pihlstrom, L., Rengmark, A., Bjornara, K. A. & Toft, M. Effective variant detection by targeted deep sequencing of DNA pools: an example from Parkinson’s disease. Ann. Hum. Genet. 78, 243–252 (2014).

    18. Suvorov, A. et al. Intra-specific regulatory variation in Drosophila pseudoobscura. PLoS ONE 8, e83547 (2013).

    19. Wittkopp, P. J., Haerum, B. K. & Clark, A. G. Regulatory changes underlying expression differences within and between Drosophila species. Nature Genet. 40, 346–350 (2008).

    20. Konczal, M., Koteja, P., Stuglik, M. T., Radwan, J. & Babik, W. Accuracy of allele frequency estimation using pooled RNA-seq. Mol. Ecol. Resour. 14, 381–392 (2014).

    21. Gross, J. B., Furterer, A., Carlson, B. M. & Stahl, B. A. An integrated transcriptome-wide analysis of cave and surface dwelling Astyanax mexicanus. PLoS ONE 8, e55659 (2013).

    22. Kozak, G. M., Brennan, R. S., Berdan, E. L., Fuller, R. C. & Whitehead, A. Functional and population genomic divergence within and between two species of killifish adapted to different osmotic niches. Evolution 68, 63–80 (2014).

    23. Sloan, D. B. et al. De novo transcriptome assembly and polymorphism detection in the flowering plant Silene vulgaris (Caryophyllaceae). Mol. Ecol. Resour. 12, 333–343 (2012).

    24. Gautier, M. et al. The effect of RAD allele dropout on the estimation of genetic variation within and between populations. Mol. Ecol. 22, 3165–3178 (2013).

    25. Arnold, B., Corbett-Detig, R. B., Hartl, D. & Bomblies, K. RADseq underestimates diversity and introduces genealogical biases due to nonrandom haplotype sampling. Mol. Ecol. 22, 3179–3190 (2013).

    26. Karczewski, K. J. et al. Systematic functional regulatory assessment of disease-associated variants. Proc. Natl Acad. Sci. USA 110, 9607–9612 (2013).

    27. Khurana, E. et al. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 342, 1235587 (2013).

    28. Schaub, M. A., Boyle, A. P., Kundaje, A., Batzoglou, S. & Snyder, M. Linking disease associations with regulatory information in the human genome. Genome Res. 22, 1748–1759 (2012).

    29. Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nature Rev. Genet. 11, 499–511 (2010).

    30. Qanbari, S. et al. Classic selective sweeps revealed by massive sequencing in cattle. PLoS Genet. 10, e1004148 (2014).

    31. Pasaniuc, B. et al. Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nature Genet. 44, 631–635 (2012).

    32. Lou, D. I. et al. High-throughput DNA sequencing errors are reduced by orders of magnitude using circle sequencing. Proc. Natl Acad. Sci. USA 110, 19872–19877 (2013).

    33. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genet. 43, 491–498 (2011).

    34. Minoche, A. E., Dohm, J. C. & Himmelbauer, H. Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems. Genome Biol. 12, R112 (2011).

    35. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858 (2008).

    36. Robasky, K., Lewis, N. E. & Church, G. M. The role of replicates for error mitigation in next-generation sequencing. Nature Rev. Genet. 15, 56–62 (2014).

    37. Sham, P., Bader, J. S., Craig, I., O’Donovan, M. & Owen, M. DNA pooling: a tool for large-scale association studies. Nature Rev. Genet. 3, 862–871 (2002). This is a comprehensive review of pooling strategies.

    38. Zhu, Y., Bergland, A. O., Gonzalez, J. & Petrov, D. A. Empirical validation of pooled whole genome population re-sequencing in Drosophila melanogaster. PLoS ONE 7, e41901 (2012).

    39. Kofler, R. et al. PoPoolation: a toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PLoS ONE 6, e15925 (2011).

    40. Schrider, D. R., Begun, D. J. & Hahn, M. W. Detecting highly differentiated copy-number variants from pooled population sequencing. Pac. Symp. Biocomput 1, 344–344 (2013).

    41. Kapun, M., van Schalkwyk, H., McAllister, B., Flatt, T. & Schlötterer, C. Inference of chromosomal inversion dynamics from Pool-seq data in natural and laboratory populations of Drosophila melanogaster. Mol. Ecol. 23, 1813–1827 (2014).

    42. Kofler, R., Betancourt, A. J. & Schlötterer, C. Sequencing of pooled DNA samples (Pool-seq) uncovers complex dynamics of transposable element insertions in Drosophila melanogaster. PLoS Genet. 8, e1002487 (2012). This study is the first to infer TE insertion sites and the population frequency of TE insertions from Pool-seq data.

    43. Sax, K. The association of size differences with seed-coat pattern and pigmentation in Phaseolus vulgaris. Genetics 8, 552–560 (1923).

    44. Schneeberger, K. et al. SHOREmap: simultaneous mapping and mutation identification by deep sequencing. Nature Methods 6, 550–551 (2009). This paper is the first to show that Pool-seq can be used to map induced mutations.

    45. Schneeberger, K. Using next-generation sequencing to isolate mutant genes from forward genetic screens. Nature Rev. Genet. 15, 662–676 (2014).

    46. Hill, J. T. et al. MMAPPR: mutation mapping analysis pipeline for pooled RNA-seq. Genome Res. 23, 687–697 (2013).

    47. Miller, A. C., Obholzer, N. D., Shah, A. N., Megason, S. G. & Moens, C. B. RNA-seq-based mapping and candidate identification of mutations from forward genetic screens. Genome Res. 23, 679–686 (2013).

    48. Galvao, V. C. et al. Synteny-based mapping-by-sequencing enabled by targeted enrichment. Plant J. 71, 517–526 (2012).

    49. Ehrenreich, I. M. et al. Dissection of genetically complex traits with extremely large pools of yeast segregants. Nature 464, 1039–1042 (2010). This study provides proof that Pool-seq provides enough power to map complex traits.

    50. Wenger, J. W., Schwartz, K. & Sherlock, G. Bulk segregant analysis by high-throughput sequencing reveals a novel xylose utilization gene from Saccharomyces cerevisiae. PLoS Genet. 6, e1000942 (2010).

    51. Swinnen, S. et al. Identification of novel causative genes determining the complex trait of high ethanol tolerance in yeast using pooled-segregant whole-genome sequence analysis. Genome Res. 22, 975–984 (2012).

    52. Wade, M. J. Epistasis, complex traits, and mapping genes. Genetica 112–113, 59–69 (2001).

    53. Earley, E. J. & Jones, C. D. Next-generation mapping of complex traits with phenotype-based selection and introgression. Genetics 189, 1203–1209 (2011).

    54. Bastide, H. et al. A genome-wide, fine-scale map of natural pigmentation variation in Drosophila melanogaster. PLoS Genet. 9, e1003534 (2013). This paper shows that Pool-seq allows highly accurate fine mapping using natural population samples.

    55. Jeong, S. et al. The evolution of gene regulation underlies a morphological difference between two Drosophila sister species. Cell 132, 783–793 (2008).

    56. Kelly, J. K., Koseva, B. & Mojica, J. P. The genomic signal of partial sweeps in Mimulus guttatus. Genome Biol. Evol. 5, 1457–1469 (2013).

    57. Beissinger, T. M. et al. A genome-wide scan for evidence of selection in a maize population under long-term artificial selection for ear number. Genetics 196, 829–840 (2014).

    facilitate the analysis of Pool-seq data. Second, analyses of low-frequency variants will become routine through the use of novel techniques 32,105. The third development concerns the haplotype phasing of Pool-seq data: although current approaches rely on sequence information of founder haplotypes 106–109, an extension that relaxes this requirement to only a subset of the haplotypes in the pool will make this approach more general and enable more accurate estimates. Finally, the availability of longer sequencing reads will further facilitate the reconstruction of haplotype information from

    Pool-seq data. This could be driven either by technological advances, such as Nanopore and PacBio sequencing, or by new library preparation protocols (for example, Illumina’s Synthetic Long-Read technology), which allow haplotype sequencing for DNA fragments of up to 10 kb with the current sequencing technology.
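The idea of phasing Pool-seq data against known founder haplotypes can be illustrated with a small sketch. The Python code below does not reproduce the method of any of the cited tools; it is a minimal expectation-maximization estimate of founder haplotype frequencies from per-SNP read counts, assuming biallelic SNPs, independent reads and a fixed, non-zero sequencing error rate. The function name `estimate_haplotype_freqs`, the `err` parameter and the toy data are hypothetical and chosen only for illustration.

```python
def estimate_haplotype_freqs(haplotypes, alt_counts, ref_counts,
                             err=0.01, n_iter=500):
    """EM estimate of the frequencies of known haplotypes in a pool.

    haplotypes: haplotypes[h][s] is 1 if haplotype h carries the
                alternative allele at SNP s, else 0.
    alt_counts, ref_counts: reads per SNP supporting each allele.
    err: assumed probability that a read shows the wrong allele
         (must be > 0 to avoid division by zero).
    """
    n_hap = len(haplotypes)
    n_snp = len(alt_counts)
    total_reads = sum(alt_counts) + sum(ref_counts)
    freqs = [1.0 / n_hap] * n_hap  # start from equal frequencies

    for _ in range(n_iter):
        expected = [0.0] * n_hap
        for s in range(n_snp):
            # P(read shows alt allele | read was drawn from haplotype h)
            p_alt = [1.0 - err if hap[s] == 1 else err for hap in haplotypes]
            pool_alt = sum(f * p for f, p in zip(freqs, p_alt))
            pool_ref = 1.0 - pool_alt
            for h in range(n_hap):
                # E-step: expected number of reads at this SNP drawn from h
                expected[h] += alt_counts[s] * freqs[h] * p_alt[h] / pool_alt
                expected[h] += (ref_counts[s] * freqs[h]
                                * (1.0 - p_alt[h]) / pool_ref)
        # M-step: new frequencies are the normalized expected read counts
        freqs = [e / total_reads for e in expected]
    return freqs


# Toy pool: three hypothetical founders typed at four SNPs, with read
# counts constructed to correspond to frequencies near 0.5, 0.3 and 0.2.
founders = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 0]]
alt = [50, 30, 20, 80]
ref = [50, 70, 80, 20]
freqs = estimate_haplotype_freqs(founders, alt, ref)
print([round(f, 2) for f in freqs])
```

The sketch requires every haplotype in the pool to be listed among the founders. Relaxing that requirement to a subset of the haplotypes, as anticipated above, would need a model that also accounts for reads from unknown haplotypes.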

    These technological improvements, together with the broad range of biological research questions that require large sample sizes, mean that Pool-seq will continue to complement the sequencing of individual genomes for the foreseeable future.

    58. Johansson, A. M., Pettersson, M. E., Siegel, P. B. & Carlborg, O. Genome-wide effects of long-term divergent selection. PLoS Genet. 6, e1001188 (2010).

    59. Rubin, C. J. et al. Whole-genome resequencing reveals loci under selection during chicken domestication. Nature 464, 587–591 (2010). This is a particularly nice demonstration of the power of Pool-seq to detect selected loci in population samples.

    60. Burke, M. K. et al. Genome-wide analysis of a long-term evolution experiment with Drosophila. Nature 467, 587–590 (2010). This is the first experimental evolution study measuring allele frequency changes using Pool-seq.

    61. Remolina, S. C., Chang, P. L., Leips, J., Nuzhdin, S. V. & Hughes, K. A. Genomic basis of aging and life-history evolution in Drosophila melanogaster. Evolution 66, 3390–3403 (2012).

    62. Turner, T. L., Stewart, A. D., Fields, A. T., Rice, W. R. & Tarone, A. M. Population-based resequencing of experimentally evolved populations reveals the genetic basis of body size variation in Drosophila melanogaster. PLoS Genet. 7, e1001336 (2011).

    63. Zhou, D. et al. Experimental selection of hypoxia-tolerant Drosophila melanogaster. Proc. Natl Acad. Sci. USA 108, 2349–2354 (2011).

    64. Turner, T. L. & Miller, P. M. Investigating natural variation in Drosophila courtship song by the evolve and resequence approach. Genetics 191, 633–642 (2012).

    65. Tobler, R. et al. Massive habitat-specific genomic response in D. melanogaster populations during experimental evolution in hot and cold environments. Mol. Biol. Evol. 31, 364–375 (2013).

    66. Orozco-terWengel, P. et al. Adaptation of Drosophila to a novel laboratory environment reveals temporally heterogeneous trajectories of selected alleles. Mol. Ecol. 21, 4931–4941 (2012).

    67. Reed, L. K. et al. Systems genomics of metabolic phenotypes in wild-type Drosophila melanogaster. Genetics 197, 781–793 (2014).

    68. Martins, N. et al. Host adaptation to viruses relies on few genes with different cross-resistance properties. Proc. Natl Acad. Sci. USA 111, 5938–5943 (2014).

    69. Jalvingh, K. M., Chang, P. L., Nuzhdin, S. V. & Wertheim, B. Genomic changes under rapid evolution: selection for parasitoid resistance. Proc. Biol. Sci. 281, 20132303 (2014).

    70. Magwire, M. M. et al. Genome-wide association studies reveal a simple genetic basis of resistance to naturally coevolving viruses in Drosophila melanogaster. PLoS Genet. 8, e1003057 (2012).

    71. Turner, T. L., Bourne, E. C., Von Wettberg, E. J., Hu, T. T. & Nuzhdin, S. V. Population resequencing reveals local adaptation of Arabidopsis lyrata to serpentine soils. Nature Genet. 42, 260–263 (2010). The study is the first to show that ecologically important traits can be mapped with Pool-seq by comparing two functionally diverged populations.

    72. Lamichhaney, S. et al. Population-scale sequencing reveals genetic differentiation due to local adaptation in Atlantic herring. Proc. Natl Acad. Sci. USA 109, 19345–19350 (2012).

    73. Fabian, D. K. et al. Genome-wide patterns of latitudinal differentiation among populations of Drosophila melanogaster from North America. Mol. Ecol. 21, 4748–4769 (2012).

    74. Kolaczkowski, B., Kern, A. D., Holloway, A. K. & Begun, D. J. Genomic differentiation between temperate and tropical Australian populations of Drosophila melanogaster. Genetics 187, 245–260 (2011).

    75. Cheng, C. et al. Ecological genomics of Anopheles gambiae along a latitudinal cline: a population-resequencing approach. Genetics 190, 1417–1432 (2012).

    76. Hancock, A. M. et al. Adaptations to climate in candidate genes for common metabolic disorders. PLoS Genet. 4, e32 (2008).

    77. Hancock, A. M. et al. Adaptation to climate across the Arabidopsis thaliana genome. Science 334, 83–86 (2011).

    78. Fischer, M. C. et al. Population genomic footprints of selection and associations with climate in natural populations of Arabidopsis halleri from the Alps. Mol. Ecol. 22, 5594–5607 (2013). This is a nice application of Pool-seq to find selected loci in a non-model organism.

    79. Günther, T. & Coop, G. Robust identification of local adaptation from allele frequencies. Genetics 195, 205–220 (2013). This paper presents the first statistical framework to identify significant associations of a given locus with one or more environmental variables using Pool-seq data.

    80. Rubin, C. J. et al. Strong signatures of selection in the domestic pig genome. Proc. Natl Acad. Sci. USA 109, 19529–19536 (2012).

    81. Axelsson, E. et al. The genomic signature of dog domestication reveals adaptation to a starch-rich diet. Nature 495, 360–364 (2013).

    82. He, Z. et al. Two evolutionary histories in the genome of rice: the roles of domestication genes. PLoS Genet. 7, e1002100 (2011).

    83. Nolte, V., Pandey, R. V., Kofler, R. & Schlötterer, C. Genome-wide patterns of natural variation reveal strong selective sweeps and ongoing genomic conflict in Drosophila mauritiana. Genome Res. 23, 99–110 (2013).

    84. True, J. R., Mercer, J. M. & Laurie, C. C. Differences in crossover frequency and distribution among three sibling species of Drosophila. Genetics 142, 507–523 (1996).

    85. Casacuberta, E. & Gonzalez, J. The impact of transposable elements in environmental adaptation. Mol. Ecol. 22, 1503–1517 (2013).

    86. Kazazian, H. H. Jr Mobile elements: drivers of genome evolution. Science 303, 1626–1632 (2004).

    87. Boitard, S., Schlötterer, C., Nolte, V., Pandey, R. V. & Futschik, A. Detecting selective sweeps from pooled next-generation sequencing samples. Mol. Biol. Evol. 29, 2177–2186 (2012).

    88. Clément, J. A. et al. Private selective sweeps identified from next-generation pool-sequencing reveal convergent pathways under selection in two inbred Schistosoma mansoni strains. PLoS Negl Trop. Dis. 7, e2591 (2013).

    89. Foll, M. et al. Influenza virus drug resistance: a time-sampled population genetics perspective. PLoS Genet. 10, e1004185 (2014).

    90. Lang, G. I. et al. Pervasive genetic hitchhiking and clonal interference in forty evolving yeast populations. Nature 500, 571–574 (2013).

    91. Barrick, J. E. & Lenski, R. E. Genome-wide mutational diversity in an evolving population of Escherichia coli. Cold Spring Harb. Symp. Quant. Biol. 74, 119–129 (2009).

    92. Kvitek, D. J. & Sherlock, G. Whole genome, whole population sequencing reveals that loss of signaling networks is the major adaptive strategy in a constant environment. PLoS Genet. 9, e1003972 (2013).

    93. Parts, L. et al. Revealing the genetic structure of a trait by sequencing a population under selection. Genome Res. 21, 1131–1138 (2011).

    94. Illingworth, C. J., Parts, L., Schiffels, S., Liti, G. & Mustonen, V. Quantifying selection acting on a complex trait using allele frequency time series data. Mol. Biol. Evol. 29, 1187–1197 (2012).

    95. Bergland, A. O., Behrman, E. L., O’Brien, K. R., Schmidt, P. S. & Petrov, D. A. Genomic evidence of rapid and stable adaptive oscillations over seasonal time scales in Drosophila. arXiv 1303.5044 (2014).

    96. Traverse, C. C., Mayo-Smith, L. M., Poltak, S. R. & Cooper, V. S. Tangled bank of experimentally evolved Burkholderia biofilms reflects selection during chronic infections. Proc. Natl Acad. Sci. USA 110, E250–E259 (2013).

    97. Versace, E., Nolte, V., Pandey, R. V., Tobler, R. & Schlötterer, C. Experimental evolution reveals habitat-specific fitness dynamics among Wolbachia clades in Drosophila melanogaster. Mol. Ecol. 23, 802–814 (2014).

    98. Barcellos-Hoff, M. H., Lyden, D. & Wang, T. C. The evolution of the cancer niche during multistage carcinogenesis. Nature Rev. Cancer 13, 511–518 (2013).

    99. Merlo, L. M. F., Pepper, J. W., Reid, B. J. & Maley, C. C. Cancer as an evolutionary and ecological process. Nature Rev. Cancer 6, 924–935 (2006).

    100. Ding, L. et al. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 481, 506–510 (2012).

    101. Newburger, D. E. et al. Genome evolution during progression to breast cancer. Genome Res. 23, 1097–1108 (2013).

    102. Nik-Zainal, S. et al. The life history of 21 breast cancers. Cell 149, 994–1007 (2012).

    103. Aparicio, S. & Caldas, C. The implications of clonal genome evolution for cancer medicine. New Engl. J. Med. 368, 842–851 (2013).

    104. Greaves, M. & Maley, C. C. Clonal evolution in cancer. Nature 481, 306–313 (2012).

    105. Kinde, I., Wu, J., Papadopoulos, N., Kinzler, K. W. & Vogelstein, B. Detection and quantification of rare mutations with massively parallel sequencing. Proc. Natl Acad. Sci. USA 108, 9530–9535 (2011).

    106. Long, Q. et al. PoolHap: inferring haplotype frequencies from pooled samples by next generation sequencing. PLoS ONE 6, e15292 (2011).

    107. Kessner, D., Turner, T. L. & Novembre, J. Maximum likelihood estimation of frequencies of known haplotypes from pooled sequence data. Mol. Biol. Evol. 30, 1145–1158 (2013).

    108. Burke, M. K., King, E. G., Shahrestani, P., Rose, M. R. & Long, A. D. Genome-wide association study of extreme longevity in Drosophila melanogaster. Genome Biol. Evol. 6, 1–11 (2014).

    109. Eskin, I. et al. eALPS: estimating abundance levels in pooled sequencing using available genotyping data. J. Computat. Biol. 20, 861–877 (2013).

    110. Kofler, R. & Schlötterer, C. A guide for the design of evolve and resequencing studies. Mol. Biol. Evol. 31, 474–483 (2014).

    111. Imsland, F. et al. The Rose-comb mutation in chickens constitutes a structural rearrangement causing both altered comb morphology and defective sperm motility. PLoS Genet. 8, e1002775 (2012).

    112. Del Fabbro, C., Scalabrin, S., Morgante, M. & Giorgi, F. M. An extensive evaluation of read trimming effects on Illumina NGS data analysis. PLoS ONE 8, e85024 (2013).

    113. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. J. 17, 10–12 (2011).

    114. Nevado, B., Ramos-Onsins, S. E. & Perez-Enciso, M. Resequencing studies of nonmodel organisms using closely related reference genomes: optimal experimental designs and bioinformatics approaches for population genomics. Mol. Ecol. 23, 1764–1779 (2014).

    115. Degner, J. F. et al. Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics 25, 3207–3212 (2009).

    116. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature Methods 9, 357–359 (2012).

    117. Albers, C. A. et al. Dindel: accurate indel calls from short-read data. Genome Res. 21, 961–973 (2011).

    118. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

    119. Koboldt, D. C. et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25, 2283–2285 (2009).

    120. Raineri, E. et al. SNP calling by sequencing pooled samples. BMC Bioinformatics 13, 239 (2012).

    121. Bansal, V. A statistical method for the detection of variants from next-generation resequencing of DNA pools. Bioinformatics 26, i318–i324 (2010).

    122. Altmann, A. et al. vipR: variant identification in pooled DNA using R. Bioinformatics 27, I77–I84 (2011).

    123. Zhou, B. Y. An empirical Bayes mixture model for SNP detection in pooled sequencing data. Bioinformatics 28, 2569–2575 (2012).

    124. Chen, Q. & Sun, F. A unified approach for allele frequency estimation, SNP detection and association studies based on pooled sequencing data using EM algorithms. BMC Genomics 14 (Suppl. 1), S1 (2013).

    125. Druley, T. E. et al. Quantification of rare allelic variants from pooled genomic DNA. Nature Methods 6, 263–265 (2009).

    126. Vallania, F. L. et al. High-throughput discovery of rare insertions and deletions in large cohorts. Genome Res. 20, 1711–1718 (2010).

    127. Wei, Z., Wang, W., Hu, P., Lyon, G. J. & Hakonarson, H. SNVer: a statistical tool for variant calling in analysis of pooled or individual next-generation sequencing data. Nucleic Acids Res. 39, e132 (2011).

    128. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. arXiv 1207.3907 (2012).

    129. Calvo, S. E. et al. High-throughput, pooled sequencing identifies mutations in NUBPL and FOXRED1 in human complex I deficiency. Nature Genet. 42, 851–858 (2010).

    130. Fiston-Lavier, A.-S., Barron, M. G., Petrov, D. A. & González, J. T-lex2: genotyping, frequency estimation and re-annotation of transposable elements using single or pooled next-generation sequencing data. bioRxiv http://dx.doi.org/10.1101/002964 (2014).

    131. Zhuang, J., Wang, J., Theurkauf, W. & Weng, Z. TEMP: a computational method for analyzing transposable element polymorphism in populations. Nucleic Acids Res. 42, 6826–6838 (2014).

    132. Kofler, R., Pandey, R. V. & Schlötterer, C. PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-seq). Bioinformatics 27, 3435–3436 (2011).

    133. Boitard, S. et al. Pool-HMM: a Python program for estimating the allele frequency spectrum and detecting selective sweeps from next generation sequencing of pooled samples. Mol. Ecol. Resour. 13, 337–340 (2013).

    134. Ferretti, L., Ramos-Onsins, S. E. & Perez-Enciso, M. Population genomics from pool sequencing. Mol. Ecol. 22, 5561–5576 (2013).

    135. Catchen, J., Hohenlohe, P. A., Bassham, S., Amores, A. & Cresko, W. A. Stacks: an analysis tool set for population genomics. Mol. Ecol. 22, 3124–3140 (2013).

    136. Vitalis, R., Gautier, M., Dawson, K. J. & Beaumont, M. A. Detecting and measuring selection from gene frequency data. Genetics 196, 799–817 (2014).

    137. Gautier, M. & Vitalis, R. Inferring population histories using genome-wide allele frequency data. Mol. Biol. Evol. 30, 654–668 (2013).

    138. Feder, A. F., Petrov, D. A. & Bergland, A. O. LDx: estimation of linkage disequilibrium from high-throughput pooled resequencing data. PLoS ONE 7, e48588 (2012).

    139. Minevich, G., Park, D. S., Blankenberg, D., Poole, R. J. & Hobert, O. CloudMap: a cloud-based pipeline for analysis of mutant genome sequences. Genetics 192, 1249–1269 (2012).

    140. Edwards, M. D. & Gifford, D. K. High-resolution genetic mapping with pooled sequencing. BMC Bioinformatics 13 (Suppl. 6), S8 (2012).

    141. Bowen, M. E., Henke, K., Siegfried, K. R., Warman, M. L. & Harris, M. P. Efficient mapping and cloning of mutations in zebrafish by low-coverage whole-genome sequencing. Genetics 190, 1017–1024 (2012).

    142. Austin, R. S. et al. Next-generation mapping of Arabidopsis genes. Plant J. 67, 715–725 (2011).

    143. Leshchiner, I. et al. Mutation mapping and identification by whole-genome sequencing. Genome Res. 22, 1541–1548 (2012).

    144. Prosperi, M. C. & Salemi, M. QuRe: software for viral quasispecies reconstruction from next-generation sequencing data. Bioinformatics 28, 132–133 (2012).

    145. Zagordi, O., Bhattacharya, A., Eriksson, N. & Beerenwinkel, N. ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data. BMC Bioinformatics 12, 119 (2011).

    146. Eyre, D. W. et al. Detection of mixed infection from bacterial whole genome sequence data allows assessment of its role in Clostridium difficile transmission. PLoS Comput. Biol. 9, e1003059 (2013).

    147. Astrovskaya, I. et al. Inferring viral quasispecies spectra from 454 pyrosequencing reads. BMC Bioinformatics 12 (Suppl. 6), S1 (2011).

    148. Yang, X., Charlebois, P., Macalalad, A., Henn, M. R. & Zody, M. C. V-Phaser 2: variant inference for viral populations. BMC Genomics 14, 674 (2013).

    149. Töpfer, A. et al. Viral quasispecies assembly via maximal clique enumeration. PLoS Comput. Biol. 10, e1003515 (2014).

    150. Töpfer, A. et al. Probabilistic inference of viral quasispecies subject to recombination. J. Comput. Biol. 20, 113–123 (2013).

    151. Prabhakaran, S., Rey, M., Zagordi, O., Beerenwinkel, N. & Roth, V. HIV haplotype inference using a constraint-based Dirichlet process mixture model. Machine Learning in Computational Biology NIPS Workshop (2010).

    152. Pandey, R. V., Kofler, R., Orozco-terWengel, P., Nolte, V. & Schlötterer, C. PoPoolation DB: a user-friendly web-based database for the retrieval of natural polymorphisms in Drosophila. BMC Genet. 12, 27 (2011).

    153. Chen, X., Listman, J. B., Slack, F. J., Gelernter, J. & Zhao, H. Biases and errors on allele frequency estimation and disease association tests of next-generation sequencing of pooled samples. Genet. Epidemiol. 36, 549–560 (2012).

    154. Roth, A. et al. PyClone: statistical inference of clonal population structure in cancer. Nature Methods 11, 396–398 (2014).

    Acknowledgements

    The authors apologize to all colleagues who were not cited owing to space limitations. They are grateful to all colleagues who shared unpublished manuscripts, especially D. Kessner, Q. Long, M. Pérez Enciso, A. S. Fiston-Lavier and K. Schneeberger for comments and discussions. They thank members of the Institut für Populationsgenetik, in particular A. Betancourt, M. Dolezal, A. Futschik and A. Kalinka for discussion and comments on earlier versions of the manuscript. This work has been supported by the ERC (ArchAdapt) and the Austrian Science Funds (FWF, W1225).

    Competing interests statement

    The authors declare no competing interests.

    FURTHER INFORMATION

    Bayenv2: gcbias.org/bayenv/

    CloudMap: hobertlab.org/cloudmap/

    CRISP: sites.google.com/site/vibansal/software/crisp

    Dindel: www.sanger.ac.uk/resources/software/dindel/

    eALPS: sourceforge.net/projects/ealps/

    EBM: sites.google.com/site/zhouby98/ebm

    EM-SNP: www-rcf.usc.edu/~fsun/Programs/EM-SNP/EM-SNP.html

    Fishyskeleton: fishbonelab.org/

    FreeBayes: github.com/ekg/freebayes

    GATK Unified Genotyper: www.broadinstitute.org/gatk/

    HaploClique: github.com/armintoepfer/haploclique

    harp: bitbucket.org/dkessner/harp

    KimTree: www1.montpellier.inra.fr/CBGP/software/kimtree/index.html

    LDx: sourceforge.net/projects/ldx