Genotype Calling from Population-Genomic Sequencing Data...inferred (called) genotype of the...

INVESTIGATION

Genotype Calling from Population-GenomicSequencing DataTakahiro Maruki1 and Michael LynchDepartment of Biology Indiana University Bloomington Indiana 47405

ABSTRACT Genotype calling plays important roles in population-genomic studies which have beengreatly accelerated by sequencing technologies To take full advantage of the resultant information wehave developed maximum-likelihood (ML) methods for calling genotypes from high-throughput sequencingdata As the statistical uncertainties associated with sequencing data depend on depths of coverage wehave developed two types of genotype callers One approach is appropriate for low-coverage sequencingdata and incorporates population-level information on genotype frequencies and error rates pre-estimatedby an ML method Performance evaluation using computer simulations and human data shows that theproposed framework yields less biased estimates of allele frequencies and more accurate genotype callsthan current widely used methods Another type of genotype caller applies to high-coverage sequencingdata requires no prior genotype-frequency estimates and makes no assumption on the number of alleles ata polymorphic site Using computer simulations we determine the depth of coverage necessary toaccurately characterize polymorphisms using this second method We applied the proposed method tohigh-coverage (mean 18middot) sequencing data of 83 clones from a population of Daphnia pulex The resultsshow that the proposed method enables conservative and reasonably powerful detection of polymorphismswith arbitrary numbers of alleles We have extended the proposed method to the analysis of genomic datafor polyploid organisms showing that calling accurate polyploid genotypes requires much higher coveragethan diploid genotypes

KEYWORDS

genotype callpolymorphismpopulationgenomics

Whenwe carry out population-genomic analyses identifying individualgenotypes is often necessary For example in order to identify putativeloci associated with a phenotype in genome-wide association studiescalling genotypes is necessary tofindwhich allele each individual carriesat each locus Moreover calling individual genotypes is a first step insome population-genetic analyses For example many statistical meth-ods for haplotype phasing (eg Scheet and Stephens 2006 Browningand Browning 2007 Li et al 2010) start from genotype calls at eachSNP site In addition when accurate genotypes are called at each SNP

site traditional statistical methods including the four-gamete test(Hudson and Kaplan 1985) and composite disequilibrium measures(Cockerham and Weir 1977 Weir 1996) can be used to examine thepattern of linkage disequilibrium

Despite the advantages some difficulties are associated with high-throughput sequencing technologies One of the main difficulties is thehigh sequencing error rateswhich typically range from0001 to 001 perread per site with commonly used sequencing platforms (Glenn 2011Quail et al 2012) Second because sequencing occurs randomly amongsites individuals and chromosomes in diploid organisms depths ofcoverage are variable at all levels As a result when depths of coverageare low there are often missing data which introduces biases in sub-sequent population-genetic analyses unless they are statistically accountedfor (Pool et al 2010)

To call genotypes from high-throughput sequencing data manystatistical methods have been recently developed (eg Li et al 20082009b Hohenlohe et al 2010 Martin et al 2010 McKenna et al2010 Catchen et al 2011 2013 DePristo et al 2011 Li 2011Nielsen et al 2012 Vieira et al 2013) The performance of thewidely used genotype callers in population-genomic analyses isnot well understood especially when the population deviates from

Copyright copy 2017 Maruki and Lynchdoi httpsdoiorg101534g3117039008Manuscript received September 2 2016 accepted for publication January 6 2017published Early Online January 19 2017This is an open-access article distributed under the terms of the Creative CommonsAttribution 40 International License (httpcreativecommonsorglicensesby40)which permits unrestricted use distribution and reproduction in any medium pro-vided the original work is properly citedSupplemental material is available online at wwwg3journalorglookupsuppldoi101534g3117039008-DC11Corresponding author Department of Biology Indiana University 1001 E 3rd StBloomington IN 47405 E-mail tmarukiindianaedu

Volume 7 | May 2017 | 1393

the HardyndashWeinberg equilibrium (HWE) Recent studies (Kimet al 2011 Han et al 2014) found that allele frequencies estimateddirectly from the sequence reads are unbiased whereas those es-timated via genotype calling are biased when depths of coverageare low These studies assumed a population in HWE Vieira et al(2013) recently showed that the performance of genotype callingcan be improved by first estimating inbreeding coefficients fromthe sequencing data and then calling genotypes incorporating theinformation on estimated inbreeding coefficients Unfortunatelytheir method is applicable only when the inbreeding coefficientsare non-negative (Maruki and Lynch 2015) and does not alwaystake full advantage of the population-level information Negativeinbreeding coefficients are common in some organisms includingasexual aphids (Delmotte et al 2002) Daphnia in permanentpondslakes (Hebert 1978) fruit bats (Storz et al 2001) partiallyinbreeding plant species (Brown 1979) Laysan finches (Tarr et al1998) prairie dogs (Foltz and Hoogland 1983) rhesus monkeys(Melnick 1987) and water voles (Aars et al 2006) Furthermoresome regions under balancing selection may contain excess het-erozygotes and therefore show negative inbreeding coefficients(Black and Salzano 1981 Markow et al 1993 Hedrick 1998Black et al 2001 Ferreira and Amos 2006 Tollenaere et al 2008)

In this study we develop a maximum-likelihood (ML) method forcalling genotypes from high-throughput sequencing data that incorpo-rates the prior information from a genotype-frequency estimator (GFE)(Maruki and Lynch 2015) We examine the performance of the pro-posed method using computer simulations under different geneticconditions including those where HWE is violated and compare theperformance with that of other widely used methods The results showthat our method yields more accurate genotype calls with low or mod-erately high depths of coverage than the current widely used methodswhich is supported by analysis of human data In addition we developanotherMLmethod for calling genotypes from high-coverage sequenc-ing data which relaxes the assumption of biallelic polymorphismsmade in many existing methods

We also examine the necessary depth of coverage for identifyingtriallelic sites and accurate genotype calling with the proposed methodusing computer simulations Taking the results of the performanceevaluation into account we apply the proposed method to high-throughput sequencing data of 83 clones from a population of themicrocrustacean Daphnia pulex which have reasonably high depths ofcoverage (mean 18middot per site per individual) (Lynch et al 2016) Theresults show that the proposed method enables accurate and rapididentification of polymorphic sites with arbitrary numbers of allelesFurthermore we show that the proposed method can be applied toanalyses of polyploid data

METHODSWe develop ML methods for calling individual genotypes fromhigh-throughput sequencing data We develop two types of meth-ods for calling genotypes one for low-coverage sequencing data andthe other for high-coverage sequencing data because the degree ofuncertainty associated with sequencing data depends on depths ofcoverage In both types of methods we statistically test the signif-icance of polymorphisms Also we do not call the genotype of anindividual when more than one genotype has equivalent likelihoodvalues for the observed data

Genotype calling from low-coverage sequencing dataWhen depths of coverage are low only one of the two parentalchromosomes might be sampled and sequence errors can resemble

true variants In such cases the statistical uncertainties of indi-vidual sequence data can be high thus calling accurate genotypesis not easy However when sequence data for multiple individualsfrom a population are available the accuracy of the called geno-types can be improved by incorporating the population-levelinformation on genotype frequencies and error rates into thegenotype-calling process using Bayesrsquo theorem [see Martinet al (2010) Nielsen et al (2012) Vieira et al (2013) for similarmethods using the expectation-maximization algorithm for esti-mating priors]

Here the genotype of an individual at a single site is called from thenucleotide read quartet (counts of A C G and T) of high-throughputsequencing data at the site by an ML method This is achieved bymaximizing the likelihood of the observed data as a function of thegenotype of the individual g The two most abundant nucleotide readsin the population sample are considered to be candidates for alleles atthe site

Given the genotype of the individual g = 1 (major homozygoteMM) 2 (heterozygote Mm) or 3 (minor homozygote mm) and se-quencing error rate per read per site e the log-likelihood of the observedsite-specific read quartet consisting of the observed counts of the mostabundant (major) nucleotide readM (eg C) (nM) second most abun-dant (minor) nucleotide read m (eg T) (nm) and other nucleotidereads (eg in this case A and G) (ne1 and ne2) in the population samplelnLethnM nm ne1 ne2jg eTHORN is given by the following multinomial distri-bution formula

ln LethnM nm ne1 ne2jg eTHORN frac14 lnethnTHORN=ethnM nmne1ne2THORNPgethMTHORNnM

middot PgethmTHORNnmPgethe1THORNne1Pgethe2THORNne2

(1)

where n frac14 nM thorn nm thorn ne1 thorn ne2 (the depth of coverage) PgethMTHORN is aprobability of observed nucleotide read M with genotype g It isa function of e and is given by summing conditional probabil-ities of the observed nucleotide read given the true nucleotideon the sequenced chromosome chosen from the pair (Table 1)The other P terms are similarly defined For example the prob-ability of nucleotide read M with genotype 2 (Mm) P2ethMTHORN iseth1=2THORNeth12 eTHORN thorn eth1=2THORNethe=3THORN assuming the error occurs at an equalrate from the true nucleotide to one of the other three nucleo-tides Because the multinomial coefficient in Equation 1 is con-stant regardless of the parameter values for computationalefficiency it is ignored as follows

ln LethnM nm ne1 ne2jg eTHORN frac14 lnPgethMTHORNnMPgethmTHORNnmPgethe1THORNne1Pgethe2THORNne2

(2)

When the genotype frequencies and error rate at the site areestimated from the population sample of nucleotide reads theycan be incorporated into Equation 2 as Bayesrsquo priorsWe previouslydeveloped an ML method to estimate the site-specific genotypefrequencies and error rate from the nucleotide read quartets(Maruki and Lynch 2015) This method yields essentially unbiasedgenotype-frequency estimates even with moderate depths of cov-erage which we use as the Bayesrsquo priors necessary for improvingthe accuracy of genotype calls Specifically given the estimates ofthe genotype frequencies g1 (frequency of major homozygotes) g2

(frequency of heterozygotes) and g3 (frequency of minor homo-zygotes) and error rate (e) predetermined by the ML method ofMaruki and Lynch (2015) the log-likelihood of the observed dataare given as follows

1394 | T Maruki and M Lynch

ln LethnM nm ne1 ne2jg g1 g2 g3 eTHORN

frac14 ln

8lthggPgethMTHORNnMPgethmTHORNnmPgethe1THORNne1Pgethe2THORNne2

i

X3jfrac141

hgjPjethMTHORNnMPjethmTHORNnmPjethe1THORNne1Pjethe2THORNne2

i9= (3)

where theML error-rate estimate e is substituted into the P terms Theinferred (called) genotype of the individual is the genotype that max-imizes Equation 3 To avoid calling genotypes at false polymorphicsites we only call genotypes at significantly polymorphic sites whichare identified beforehand using the likelihood-ratio test by thepopulation-level genotype-frequency estimator (GFE) (Marukiand Lynch 2015)

Genotype calling from high-coverage sequencing dataWhen depths of coverage are high both parental chromosomes aresequenced with high probability and nucleotide reads derived from truevariants are reliably abundant comparedwith thosedue to sequence errorsIn such cases if the confounding effects of binomial sampling of parentalchromosomes and sequence errors are statistically accounted for accurateand rapid genotype calls may be made from sequence data on eachindividual separately relaxing the assumption of biallelic polymorphismsmade in themethod for low-coverage sequencing data and eliminating theneed for prior population-level estimates of genotype frequencies

Given thenucleotide read quartet of an individual at a site we call thegenotype of the individual by finding the genotype that maximizes thelikelihood of the observed data For example if the nucleotide readquartet of an individual contains only A C and G we examine thelikelihoods of six candidate genotypes (AA AC AG CC CG and GG)Suppose for example that the nucleotide read quartet of an individualcontains nonzero counts for all four nucleotides LettingnA nC nG andnT denote the counts of A C G and T respectively the log-likelihoodfor genotype AA is

ln LethnA nC nG nT jAA eTHORN frac14 lnfrac12eth12eTHORNnAethe=3THORNn2nA (4)

where n is the sum of the nucleotide read counts (depth of coverage)and e is the sequence error rate per read per site This equation is amultinomial distribution formula where the constant multinomialcoefficient is ignored for computational efficiency as in Equation 2The subsequent likelihood functions are derived and shown in asimilar way By taking the derivative of Equation 4 with respect to eand equating it to zero e is analytically estimated as

e frac14 ethn2 nATHORN=n (5)

and this estimate is substituted into Equation 4 to find the likelihood forgenotypeAAAs another example the log-likelihood for genotypeAC is

ln LethnA nC nG nT jAC eTHORN frac14 lneth1=22e=3THORNnAthornnC ethe=3THORNnGthornnT

(6)

where e is similarly estimated as

e frac14 eth3=2THORN frac12ethnG thorn nTTHORN=n (7)

and again this estimate is substituted into Equation 6 to find thelikelihood for genotype AC The likelihoods for the other eightcandidate genotypes are similarly calculated

Statistical tests of called genotypes in the high-coverage genotype callerBecause of high sequencing error rates genotypes called by the high-coverage genotype caller (HGC) might sometimes falsely suggestpolymorphisms To minimize analyzing false polymorphisms andmisidentifying sites containing three or four alleles we examine statis-tical significance of called genotypes by likelihood-ratio tests (Kendalland Stuart 1979) Specifically we examine the statistical significance ofcalled genotypes with respect to the genotype homozygous for the mostabundant nucleotide in the population sample M (MM)

Under the null hypothesis ofmonomorphism the site is fixed forMUnder the alternative hypothesis of polymorphism at least one indi-vidual has a genotype different fromMM Therefore we reject the nullhypothesis of population monomorphism if the likelihood of at leastone non-MM called genotype is significantly greater than that of theMM genotype Specifically lettingLL0 andLL1 denote the log-likelihoodsof the observed data of an individual for theMM genotype and that for anon-MM called genotype the likelihood-ratio test statistic (LRT) for theindividual is

LRT frac14 2ethLL1 2 LL0THORN (8)

This test statistic is expected to be asymptotically x2-distributed withone degree of freedom We reject the null hypothesis of populationmonomorphism when LRT is significant for at least one of the non-MM called genotypes at a user-specified level

We estimate the number of alleles by examining the nucleotidescontained in significant genotypes To minimize misidentification ofsites containing three or four alleles when the significant genotype isheterozygouswecompare the likelihoodof the calledgenotype to thatofthe genotype homozygous for the more abundant nucleotide in theindividual and consider the genotype heterozygous only if the former issignificantlygreater than the latter at the specified levelby the likelihood-ratio test with one degree of freedom Otherwise the genotype isconsidered homozygous for estimating the number of alleles

Genotype calling from triploid sequencing dataThe HGC explained above can be extended to triploid data Specifi-cally assuming that sequencing occurs randomly among the three

n Table 1 Probability of an observed nucleotide read as a function of the individual genotype g and error rate e

GenotypeNucleotide Read

M m e1 e2

1 (MM) 12 e e=3 e=3 e=32 (Mm) frac12eth1=2THORN eth12 eTHORN thorn frac12eth1=2THORN ethe=3THORN frac12eth1=2THORN ethe=3THORN thorn frac12eth1=2THORN eth12 eTHORN e=3 e=33 (mm) e=3 12 e e=3 e=3

M and m denote candidate alleles (the two most abundant nucleotide reads in the population sample eg C and T) and e1 and e2 denote other nucleotide reads(eg in this case A and G)

Volume 7 May 2017 | Genotype Calling | 1395

chromosomes and equal error rates occur from the true nucleotide toone of the other three the probability of anobservednucleotide read as afunction of the genotype of an individual is found in a way analogous tothat for diploid data (Supplemental Material Table S1) Then thelikelihood of the observed nucleotide read quartet of an individualcan be formulated As with diploid data we choose the genotype thatmaximizes the likelihood to call the genotype of the individual Theproposed method can be applied to low-coverage sequencing dataalthough high coverage is needed to call accurate genotypes

As an example suppose that the nucleotide read quartet of anindividual contains nonzero counts for all four nucleotides LettingnA nC nG and nT denote the counts of A C G and T respectively thelog-likelihood for genotype AAA is

ln LethnA nC nG nT jAAA eTHORN frac14 lnfrac12eth12eTHORNnAethe=3THORNn2nA (9)

where n is the sum of the nucleotide read counts (depth of coverage)and e is the sequence error rate per read per site By taking the de-rivative of Equation 9 with respect to e and equating it to zero e isanalytically estimated as


and this estimate is substituted intoEquation9 tofind the likelihood forgenotype AAA As another example the log-likelihood for genotypeACC is

ln LethnA nC nG nT jACC eTHORNfrac14 ln

frac12eth1=3THORN2ethe=9THORNnA frac12eth2=3THORN2eth5=9THORN eTHORNnC ethe=3THORNnGthornnT

(11)

where e is estimated as

e frac146nthorn 9nC thorn 15nG thorn 15nT

2ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffieth6nthorn 9nC thorn 15nG thorn 15nTTHORN2 2 360nethnG thorn nTTHORN

q 10n

(12)

and again this estimate is substituted into Equation 11 to find thelikelihood for genotypeACCAs a final example the log-likelihood forgenotype ACG is

ln LethnA nC nG nT jACG eTHORN frac14 lneth1=32e=9THORNnAthornnCthornnGethe=3THORNnT

(13)


e frac14 3 frac12ethnTTHORN=n (14)

and again this estimate is substituted into Equation 13 to find thelikelihood for genotypeACGTo avoid analyzing false polymorphismsdue to sequencing errors the statistical significance of called genotypesis examined by likelihood-ratio tests analogous to those with diploiddata and the number of alleles is based on significant genotypes

Genotype calling from tetraploid sequencing dataBecause tetraploid species are common in plants and some animals wealso formulate likelihood functions for a tetraploid genotype callerAssuming that sequencing occurs randomly among the four chromo-somes and equal error rates occur from the true nucleotide to one of theother three the probability of an observed nucleotide read as a function

of the genotype of an individual is similarly found (Table S2) As withtriploid data we find the genotype that maximizes the likelihood of theobserved nucleotide read quartet of an individual to call the genotype ofthe individual

As an example suppose that the nucleotide read quartet of anindividual contains nonzero counts for all four nucleotides LettingnA nC nG and nT denote the counts of A C G andT respectively thelog-likelihood for genotype AAAA is

ln LethnA nC nG nT jAAAA eTHORN frac14 lnfrac12eth12eTHORNnAethe=3THORNn2nA (15)


e frac14 frac12ethn2 nATHORN=n (16)

and this estimate is substituted into Equation 15 to find the likelihoodfor genotype AAAA As another example the log-likelihood for ge-notype ACCC is

ln LethnA nC nG nT jACCC eTHORNfrac14 ln

eth1=4THORNnA frac12eth3=4THORN2eth2=3THORN eTHORNnC ethe=3THORNnGthornnT (17)


e frac14 eth9=8THORN frac12ethnG thorn nTTHORN=ethnC thorn nG thorn nTTHORN (18)

and again this estimate is substituted into Equation 17 to find thelikelihood for genotypeACCCAs another example the log-likelihoodfor genotype AACC is

ln LethnA nC nG nT jAACC eTHORNfrac14 ln

frac12eth1=2THORN2ethe=3THORNnAthornnC ethe=3THORNnGthornnT (19)



and again this estimate is substituted into Equation 19 to find thelikelihood for genotype AACC As a final example the log-likelihoodfor genotype AACG is

ln LethnA nC nG nT jAACG eTHORNfrac14 ln

frac12eth1=2THORN2ethe=3THORNnAeth1=4THORNnCthornnGethe=3THORNnT (21)


e frac14 eth3=2THORN frac12nT=ethnA thorn nTTHORN (22)

and again this estimate is substituted into Equation 21 to find thelikelihood for genotype AACG Again we report the number of allelesat a site based on significant genotypes to avoid analyzing falsepolymorphisms due to sequencing errors

Generation of diploid nucleotide read data at biallelicsites using computer simulationsTo examine the performance of the genotype callers at biallelic sites wegenerated nucleotide read data for N diploid individuals by computersimulations and called individual genotypes from the simulated data Inthe simulations the probability of sampling an individual with a par-ticular genotype was equal to its relative frequency in the population


The population frequencies of major and minor homozygotes werespecified by g1 and g3 respectively To compare the called genotypeswith the true genotypes we recorded the true genotype of each indi-vidual The depths of coverage were assumed to be Poisson-distributedwith mean m among the individuals and were specified as

cethXmTHORN frac14hethmTHORNXe2m

iX (23)

where X is a particular value of the coverage for an individual and c isa probability mass function of X The nucleotide reads from eachindividual were randomly chosen from its genotype allowing for er-rors Sequence errors were randomly introduced at rate e per read persite from the true nucleotide to one of the other three nucleotides

Comparison of the Bayesian genotype caller with otherwidely used methodsTo compare the performance of our Bayesian genotype caller (BGC)with that of other widely used methods for calling genotypes we calledindividual genotypes usingGATK (version 34-0) (McKenna et al 2010DePristo et al 2011 Van der Auwera et al 2013) and Samtools (version0119) (Li 2011) from the same simulated nucleotide read data gener-ated by the method described above and compared the accuracy of thegenotypes called by different methods Although both GATK and Sam-tools use Bayesian genotype-frequency priors GATK uses the samepriors for all sites and Samtools assumes HWE and therefore theirapproaches differ from ours In addition to these genotype-callingmethods we also compared our performance with that of the corre-sponding genotype-calling function in ANGSD (version 0911)(Korneliussen et al 2014) which is also designed especially for low-coverage sequencing data To make the generated data applicable to allmethods including ours we made BAM files of sequence read datamapped to a simulated reference sequence

First we generated a simulated reference sequence consisting ofrandom nucleotides Each of a total of 10000 biallelic sites was sur-rounded by 50 fixed nucleotides on both sides so that sequence reads of101bpareuniquelymapped to the referencesequenceWeoutputted the101-bp sequence reads in the FASTA format with nucleotide reads atbiallelic sites characterized in themannerdescribedaboveThe sequencereads were mapped to the reference sequence using Novoalign (version30002) (wwwnovocraftcom) We converted SAM files of mappedreads to BAM files using Samtools (Li et al 2009a)

To call genotypes with GATK we made a dictionary file of thereference sequence and added read groups to the BAMfiles using Picard(version 1134) (httpbroadinstitutegithubiopicard) Thenwe sortedand indexed the BAM files using Samtools We generated individualGVCF files which contain information of called genotypes from theBAM files using HaplotypeCaller which individually calls genotypesassigning an arbitrarily chosen uniform base quality score (30) to everybase Changing the base quality score however from 30 to 20 did notnoticeably influence the results (results not shown) Then we calledgenotypes from the GVCF files using Genotype_GVCFs which refinesthe called genotypes using the population-level information We calledgenotypes with Samtools from the BAM files using mpileup andbcftools to use the population-level information To call genotypes withANGSD we first converted the BAM files to SAM files using SamtoolsThen we assigned the same uniform base quality score of 30 to everybase using a custom Perl script The resulting SAM files with basequality scores were converted back to BAM files as input for ANGSDusing Samtools We used a Bayesian genotype-calling method corre-sponding to ours in ANGSD setting the value of the -doPost option at

one using Samtools genotype likelihoods and setting the statisticalsignificance for calling SNPs at the 5 level Unlike other methodsour method does not rely just on read quality and estimates error ratesfrom sequence data themselves This unique feature of our method isimportant because errors can result from other factors including thoseintroduced during sample preparation

TheML estimates of genotype frequencies and error rates necessaryfor calling genotypes with BGC were obtained using GFE (availablefrom httpsgithubcomTakahiro-MarukiPackage-GFE an updatedversion with a function to prepare the input file of the BGC and itsdocumentation are available as File S1 and File S2) (Maruki and Lynch2015) The individual pro files of nucleotide read quartets were gener-ated using sam2pro (version 06) (httpguanineevolbiompgdemlRhosam2pro_06tgz) from mpileup files generated from the BAMfiles by Samtools

In addition to examining the performance of genotype calling wecompared the allele-frequency estimates by GFE to those by othermethods The allele-frequency estimates via genotype calling by GATKand Samtools were calculated from the VCF files using VCFtools(version 0111) (Danecek et al 2011) The allele-frequency estimatesby ANGSDwere estimated directly from sequence reads by the methodof Kim et al (2011)

To examine how different genotype-calling methods perform whenapplied to real sequencing data we applied them to low-coverage (mean76middot per site per individual) sequencing data of the phase 3 1000 Ge-nomes project (1000 Genomes Project Consortium 2015) on chromo-some 11 in the CEU population To assess the accuracy of genotypecalls by different methods we compared the called genotypes with thecorresponding genotypes in phase II and III data of the InternationalHapMap project (International HapMap Consortium 2007 Interna-tional HapMap 3 Consortium 2010) Specifically we downloaded theBAM files of the phase 3 1000 Genomes sequencing data in 99 CEUindividuals on chromosome 11 from ftp1000genomesebiacukvol1ftpphase3data We also downloaded the FASTA file of the referencesequence (human_g1k_v37fasta) from ftp1000genomesebiacukvol1ftptechnicalreferenceWe downloaded the corresponding geno-types in the HapMap phase II and III data from ftpncbinlmnihgovhapmapgenotypes2010-08_phaseII+IIIforward We found that 94out of the 99 CEU individuals had both 1000 Genomes and HapMapdata and therefore used the 94 individuals with the necessary data inthe subsequent analyses

To prepare high-quality input data for calling genotypes wemarkedduplicate reads and locally realigned sequences around indels in theBAM files using GATK (version 34-0) (McKenna et al 2010 DePristoet al 2011) and Picard (version 1107) (httpbroadinstitutegithubiopicard) followingGATKbest practices (Van der Auwera et al 2013) Inaddition we clipped overlapping read pairs in the BAM files usingBamUtil (version 1013) (httpgenomesphumicheduwikiBamUtil)We made mpileup files of the 94 individuals from the processed BAMfiles using Samtools The pro files of nucleotide read quartets were madefrom thempileupfiles using sam2pro (version 08) (httpguanineevolbiompgdemlRhosam2pro_08tgz) The file of nucleotide read quartets of94 individuals necessary for the proposed method was made from the profiles using GFE (Maruki and Lynch 2015)

To call genotypes with GATK we first sorted and indexed the BAMfiles using Samtools Next we generated individual GVCF files from theBAM files using HaplotypeCaller Then we called genotypes from theGVCFfilesusingGenotype_GVCFsWecalledgenotypeswithSamtoolsfrom the BAM files using mpileup and bcftools using the population-level information To call genotypes with ANGSD from the BAM fileswe used a Bayesian genotype-calling method corresponding to ours in


ANGSD setting the value of the -doPost option at one using Samtoolsgenotype likelihoods and setting the statistical significance for callingSNPs at the 5 level In addition to these genotype-calling methods weexamined how the popular genotype-imputation method performs byimputing missing genotypes in GATK calls using Beagle (version 41)(Browning and Browning 2016) Because preexisting high-quality ref-erence panels (genotypeshaplotypes) do not exist in the vast majorityof the organisms except a few model organisms such as humans andflies we imputed the genotypes without a reference panel Here weimputed genotypes using genotype likelihoods instead of called geno-types to take the uncertainty of genotype calls into account

We calculated the correct-call rate among individuals as a fraction ofindividuals with genotype calls identical to HapMap genotypes assum-ing that HapMap genotypes are correct To examine the accuracy ofactually called genotypes we also calculated the correct-call rate amongcalled genotypes where the fraction is calculated among called geno-typesWe converted theNCBI build 36 coordinates in theHapMapdatato the GRCh37 coordinates in the 1000 Genomes data using the UCSCliftOver (Speir et al 2016) to make the coordinates in the two data setsconsistent with each other To minimize the confounding effect ofmismapping we excluded sites involved in putative repetitive regionsidentified by RepeatMasker (httpwwwrepeatmaskerorg) and thosewith population coverage (sum of the coverage over the individuals)less than half the mean or greater than one and a half means from theanalyses We downloaded the repeat-masked reference sequence fromthe Ensemblwebsite (ftpensemblorgpubrelease-75fastahomo_sapiensdna) to identify sites involved in putative repetitive regions

Performance evaluation of the HGC as a functionof coverageBecause of the random sampling of parental chromosomes only one ofthe two chromosomesmight be sampledwhen the depth of coverage foran individual is lowTherefore theabilityof agenotypecaller tocorrectlyinfer individual genotypes is mainly limited by its ability to call hetero-zygoteswhen thedepthof coverage is low Inadditionwhen thedepthofcoverage is low homozygotes can appear heterozygous due to sequenc-ing errorsTherefore examining the effect of coverage on the accuracyofcalledgenotypes is important forfinding theoptimal sequencingstrategyfor genotype ascertainment

To examine how much coverage is needed to accurately call geno-types using the HGC we examined its rates for correctly callinghomozygotes and heterozygotes as functions of the depth of coverageSpecificallywegeneratednucleotidereaddata fromanindividualhavinga homozygous or heterozygous genotype with a fixed depth of coverageand error rate e and called the genotype of the individual with HGCWe repeated this process for 10000 simulation replications and calcu-lated the correct-call rate as a fraction of simulation replications wherethe genotype was correctly called by HGC

Performance evaluation of genotype callers at triallelicsites using computer simulationsTo examine the performance of theHGC for identifying sites withmorethan two alleles and calling genotypes in population samples wegenerated BAM files of sequence read data in a way similar to that atbiallelic sites where nucleotide read data at each of a total of 10000triallelic sites for N diploid individuals were generated according totheir genotype frequencies introducing errors at rate e Here for sim-plicity we specified the population genotype frequencies by assigningpopulation allele frequencies p q and r to the most abundant secondmost abundant and the rarest allele respectively and assuming HWE

although our method makes no assumption on the mating system AsGATK and Samtools are both capable of calling genotypes at triallelicsites we applied them to the BAM files using HaplotypeCaller andGenotypeGVCF and mpileup and bcftools with the -m option respec-tively and compared their correct-call rates to that ofHGC In additionwe applied our BGC to the BAM files and examined the correspondingcorrect-call rate to find the effect of incorrectly assuming at most twoalleles on the accuracy of genotype calls at triallelic sites

Application of the HGC to empirical dataTo examine the performance of the HGCwith real data we applied it tohigh-throughput Illumina sequencing data of 83 D pulex clones fromKickapond (Lynch et al 2016) We mapped the sequencing data to thePA42 reference genome (version 30) (Z Ye S Xu K SpitzeJ Asselman X Jiang M E Pfrender and M Lynch unpublishedresults) using Novoalign (version 30211) (wwwnovocraftcom) withthe ldquo-r Nonerdquo option to prevent it from mapping a read if it matchedmore than one location We converted the SAM files of mapped se-quencing data to BAM files using Samtools (version 0118) (Li et al2009a) Then we marked duplicate reads and locally realigned se-quences around indels using GATK (version 34-0) (McKenna et al2010 DePristo et al 2011) and Picard (version 1107) (httpbroad-institutegithubiopicard) following GATK best practices (Van derAuwera et al 2013) In addition we clipped overlapping read pairsusing BamUtil (version 1013) (httpgenomesphumicheduwikiBamUtil) We made mpileup files of the 83 clones from the processedBAM files using Samtools The pro files of nucleotide read quartets weremade from the mpileup files using sam2pro (version 08) (httpgua-nineevolbiompgdemlRhosam2pro_08tgz) The input file of nucle-otide read quartets of 83 clones was made from the pro files using GFE(Maruki and Lynch 2015) To avoid analyzing misassembled regionswe excluded regions considered to bemisassembled by REAPR (version1134) (Hunt et al 2013) from our analyses We set the minimumcoverage required to call a genotype of an individual at six Tominimizeanalyzing sites with data on a small number of individuals or those withmismapped reads we only analyzed sites with the population coverage(sum of the coverage over the individuals) at least half the mean and atmost 15middot of the mean Furthermore we excluded sites involved inputative repetitive regions identified by RepeatMasker (version 405)(httpwwwrepeatmaskerorg) with the RepeatMasker library (Jurkaet al 2005) made on August 7 2015 In addition to further reduceanalyzing sites with mismapped reads we excluded sites with meanerror-rate estimates among clones of 001

Performance evaluation of the triploid genotype calleras a function of coverageTo examine the effect of coverage on the accuracy of genotypes called bythe triploid genotype caller (TRI) we examined the correct-call rate as afunctionof the depth of coverage in away similar to that for diploid dataHere we examined the correct-call rate for three different types ofgenotypes (homozygotes heterozygotes containing two different nu-cleotides and heterozygotes containing three different nucleotides)among a total of 10000 simulation replications

Comparison of the TRI with GATKTo compare the performance of the TRI to that of other widely usedmethodsweappliedourTRIandGATK(version34-0) (McKenna etal2010 DePristo et al 2011) to BAM files of simulated sequence read datagenerated in a way similar to that for diploid data and compared thecorrect-call rates Here we generated fixed-coverage read data from an


individual with a particular genotype 10000 times To call triploidgenotypes with GATK we set the ploidy of HaplotypeCaller at three

Data availabilitySource codes written in C++ implementing the proposedmethods andtheir documentation are available as supporting information (File S1File S2 File S3 File S4 File S5 File S6 File S7 File S8 File S9 File S10File S11 and File S12)

RESULTSThe performance of the BGCwhen applied to low-coverage sequencingdata at biallelic sites was evaluated with simulated data and comparedwith the corresponding performance of GATK Samtools andANGSDIn addition the allele-frequency estimates by the GFE were comparedwith those via genotype calling by GATK and Samtools and those by amethod of Kim et al (2011) as implemented in ANGSD To examinethe performance under the worst situation the error rate was set at 001which is typically the upper bound with commonly used sequencing plat-forms We evaluated the performance under three different genetic con-ditions where the inbreeding coefficient fwas zero (HWE) minimized ormaximized given minor-allele frequency q When f is minimized thefrequencies of major homozygotes and minor homozygotes g1 and g3are 12 2q and 0 respectively When f is maximized g1 and g3 are 12 qand q respectively

When the depth of coverage is low (mean 3middot) the allele-frequencyestimates by GFE are unbiased whereas those via genotype calling arebiased under all examined conditions (Table 2) The allele-frequency

estimates by ANGSD are similar to ours although they are slightlybiased when f is minimized or maximized The allele-frequency esti-mates via genotype calling by GATK are more biased than those bySamtools when f is zero or minimized On the other hand when f ismaximized the allele-frequency estimates via genotype calling by Sam-tools aremore biased than those by GATK The correct-call rate amongindividuals by the proposed method (BGC) is higher than that byGATK and lower than that by Samtools and ANGSD when f is zeroor minimized When f is maximized the correct-call rate by BGC is thehighest The highest correct-call rate among individuals by Samtoolsand ANGSD when f is zero or minimized is mainly because bothmethods always call individual genotypes regardless of the depth ofcoverage even when there is no read In fact the correct-call rateamong called genotypes by BGC is the highest under the majority ofthe examined conditions

When the depth of coverage is moderately high (mean 10middot) theallele-frequency estimates by all methods are similar to each other andnearly unbiased under all examined conditions (Table S3) Consistentwith this the correct-call rates by all methods are high under all exam-ined conditions However the correct-call rate by BGC is the highestunder all examined conditions We confirmed that we correctly gener-ated the simulated data by examining the realized parameter values inthe population samples (Table S4)

Our conclusions on the performance of the genotype-calling meth-ods based on computer simulations are further supported by qualita-tively similar results of the corresponding performance evaluation usinghuman data (Table 3) Similar to the simulation-based results of theperformance evaluation when the population is inHWE the correct-call

n Table 2 Comparison of the allele-frequency estimates and called genotypes by different methods with low depths of coverage

Method q f q (mean 6 2 SEM)Correct-Call Rate among

Individuals (mean 6 2 SEM)Correct-Call Rate among

Called Genotypes (mean 6 2 SEM)

Proposed 01 0 (HWE) 010 6 000050 084 6 000064 091 6 000049GATK 01 0 (HWE) 006 6 000042 080 6 000073 088 6 000057Samtools 01 0 (HWE) 008 6 000039 090 6 000045 090 6 000045ANGSD 01 0 (HWE) 010 6 000050 090 6 000045 090 6 000045Proposed 01 Minimized 010 6 000050 087 6 000068 095 6 000049GATK 01 Minimized 006 6 000041 082 6 000077 091 6 000061Samtools 01 Minimized 008 6 000039 094 6 000047 094 6 000047ANGSD 01 Minimized 011 6 000050 094 6 000046 094 6 000046Proposed 01 Maximized 010 6 000063 095 6 000046 100 6 000015GATK 01 Maximized 009 6 000058 093 6 000051 100 6 000006Samtools 01 Maximized 006 6 000046 091 6 000047 091 6 000047ANGSD 01 Maximized 008 6 000057 091 6 000047 091 6 000047Proposed 03 0 (HWE) 030 6 000077 077 6 000081 088 6 000080GATK 03 0 (HWE) 022 6 000077 068 6 000094 079 6 000088Samtools 03 0 (HWE) 025 6 000096 084 6 000074 084 6 000074ANGSD 03 0 (HWE) 030 6 000076 084 6 000074 084 6 000074Proposed 03 Minimized 030 6 000063 074 6 000087 087 6 000075GATK 03 Minimized 019 6 000067 058 6 000098 070 6 000101Samtools 03 Minimized 026 6 000085 083 6 000078 083 6 000078ANGSD 03 Minimized 031 6 000064 083 6 000078 083 6 000078Proposed 03 Maximized 030 6 000094 094 6 000047 100 6 000017GATK 03 Maximized 026 6 000093 090 6 000061 100 6 000011Samtools 03 Maximized 024 6 000118 087 6 000067 087 6 000067ANGSD 03 Maximized 028 6 000102 087 6 000067 087 6 000067

q q and f are the minor-allele frequency its estimate and inbreeding coefficient respectively q by the proposed method and ANGSD are directly estimated fromsequence read data by the genotype-frequency estimator (Maruki and Lynch 2015) and Kim et alrsquos (2011) method respectively Called genotypes by the proposedmethod are by the Bayesian genotype caller The correct-call rate among individuals is a fraction of individuals with correctly called genotypes among N =100 individuals where missing genotype calls are considered incorrect On the other hand the correct-call rate among called genotypes is calculated only amongindividuals with called genotypes Mean depth of coverage m = 3 error rate e = 001 Results are based on a total of 10000 simulation replications for each parameterset HWE HardyndashWeinberg equilibrium


rate among individuals of BGC was lower than that by ANGSD andSamtools and higher than that by GATK Again this is mainly becausethe former twomethods call individual genotypes even when there is noread and the correct-call rate among called genotypes by BGC is thehighest The correct-call rate among individuals of the Beagle imputa-tion of genotypes called by GATK was higher than that by GATK andlower than that of the other methods Interestingly the correct-call rateamong called genotypes of the Beagle imputation of genotypes called byGATK was lower than that by GATK indicating that genotype impu-tation does not necessarily improve the accuracy of genotype calls atleast when preexisting high-quality reference panels are not availableThe observation here is consistent with a recent finding that population-genetic parameters estimated via genotype imputation from low-coverage sequencing data are biased (Fu 2014)

To examinewhen the depth of coverage is high enough to accuratelycall individual genotypes using the HGC we examined the correct-callrate as a function of coverage when the true genotype is homozygous orheterozygous Overall the rate for correctly calling homozygotes is high(Figure 1A) It decreases with increased coverage when the coverage isfive or less and approaches one when the coverage is over 5 The overallrate for correctly calling heterozygotes increases with increased cover-age although it somewhat decreases when the coverage increases fromfive to six (Figure 1B) This decrease is consistent with the suddenincrease in the rate for correctly calling homozygotes when the coverageincreases from five to six These patterns are observed because whentwo nucleotides have nonzero read counts one of which has just oneread in the quartet both of them are considered to be from an allelewithout error when the coverage is five or less and the one with a singleread count is considered to be due to a sequence error when the cov-erage is over five by HGC as the likelihood that the single read is due toa sequence error becomes greater than that without error In addition tothe valley of the correct-call rate at coverage equal to six there isanother small valley at coverage equal to 11 This is because whentwo nucleotides have nonzero read counts one of which has just two

reads in the quartet both of them are considered to be from an allelewithout error when the coverage is11 and the one with double readsis considered to be due to sequence errors when the coverage is$11 byHGC as the likelihood that such reads are due to sequencing errorsbecomes greater than that without error

To compare the performance ofHGC in calling diploid genotypes attriallelic sites with that of other existing methods we compared thecorrect-call rate of HGC to that of GATK and Samtools (Table 4) Inaddition to examine the effect of incorrectly assuming biallelic poly-morphisms on the accuracy of genotype calls at triallelic sites we ap-plied our BGC to the same data and examined its correct-call rate Thecorrect-call rate of HGC is higher than that by Samtools and slightlylower than that by GATK Compared with genotype calls by Samtoolsgenotype calls by HGC and GATK quickly become more accurate withhigher depths of coverage The correct-call rate of BGC is the lowestThis is because BGC assumes at most two alleles as many other meth-ods including ANGSD do and fails to correctly call genotypes contain-ing the rarest allele We confirmed that we correctly generated thesimulated data by examining the realized parameter values in the pop-ulation samples (Table S5)

To evaluate the power of HGC for detecting polymorphisms weexamined the false-positive and -negative rates as functions of themeandepth of coveragem The false-positive rate of detecting polymorphismsis low when m is moderately high (10) and is essentially zero withm 10 (Figure 2A) The false-negative rate of detecting polymor-phisms is reasonably low when m is moderately high (10) and ap-proaches the theoretical minimum possible value which is theprobability that only one of the two alleles is sampled in a finitesample whenm is 30 (Figure 2B)We note that the high false-negativerates with low minor-allele frequencies are mainly due to the finitesample size (100) which limits sampling rare alleles and they remainhigh even with infinite coverage with any method The power fordetecting three alleles is essentially the same as that for detectingpolymorphisms the false-positive rate is low (Figure 2C) and false-negative

n Table 3 Comparison of the performance of the genotype-calling methods with human data

MethodCorrect-Call Rate among



BGC 0954 6 00004 0969 6 00004GATK 0923 6 00004 0943 6 00004Samtools 0966 6 00004 0966 6 00004ANGSD 0967 6 00004 0967 6 00004GATK + Beagle 0941 6 00004 0941 6 00004

The correct-call rate among individuals is calculated among individuals with HapMap genotypes where missing genotype calls are considered incorrect On the otherhand the correct-call rate among called genotypes is calculated only among individuals with both HapMap genotypes and called genotypes respectively

Figure 1 Correct-call rate of the high-coverage geno-type caller as a function of the depth of coverage (A)Correct-call rate when the true genotype of the indi-vidual is homozygous (B) Correct-call rate when thetrue genotype of the individual is heterozygous Errorrate e = 001 Results are based on a total of 10000simulation replications for each parameter set


rate is reasonably low (Figure 2D) These results indicate that HGCaccurately describes polymorphisms with arbitrary numbers of alleleswhen the depth of coverage is high

We applied the HGC to high-throughput sequencing data of 83 Dpulex clones fromKickapond to identify sites containingmore than twoalleles We set the minimum coverage required to call a genotype of anindividual at six so that the rate of falsely calling homozygotes asheterozygotes is low (Figure 1A) We identified a total of 4403303significantly polymorphic sites at the 5 level from the sequence datafiltered by multiple procedures to minimize analyzing misassembledregions and sites with mismapped reads The vast majority (9707) of

these were considered to contain two alleles However 127167 (288)and 2030 (005) sites were considered to contain three and four allelesrespectively

The performance evaluation using computer simulations shows thatcalling accurate triploid genotypes requires much higher coverage thancalling diploid genotypes (Figure 3) In particular high coverage isneeded to correctly call heterozygotes (Figure 3 B and C) This isbecause higher coverage is required to sequence all three chromosomesrather than just two chromosomes Furthermore it is difficult to dis-tinguish the two alternative heterozygotes containing two alleles (egAAC vsACC) unless the coverage is very high Overall the correct-call

n Table 4 Performance of different genotype callers for calling diploid genotypes from nucleotide data at triallelic sites

Method Mean Coverage Correct-Call Rate (mean 6 2 SEM)Correct-Call Rate among CalledGenotypes (mean 6 2 SEM)

HGC 10 097 6 000037 097 6 000037GATK 10 097 6 000032 098 6 000031Samtools 10 096 6 000040 096 6 000040BGC 10 078 6 000099 079 6 000099HGC 15 099 6 000022 099 6 000022GATK 15 100 6 000013 100 6 000013Samtools 15 096 6 000040 096 6 000040BGC 15 080 6 000088 081 6 000087HGC 20 100 6 000014 100 6 000014GATK 20 100 6 000007 100 6 000007Samtools 20 097 6 000031 097 6 000031BGC 20 080 6 000084 081 6 000083HGC 30 100 6 000005 100 6 000005GATK 30 100 6 000002 100 6 000002Samtools 30 099 6 000015 099 6 000015BGC 30 081 6 000080 081 6 000080

Allele frequencies p q and r are 07 02 and 01 and the population is in HardyndashWeinberg equilibrium Correct-call rate and that among called genotypes arecalculated among N = 100 individuals and those with called genotypes respectively Error rate e = 001 Results are based on a total of 10000 simulation replicationsfor each parameter set

Figure 2 Power analysis of the high-coverage geno-type caller (A) False-positive rate of polymorphismdetection as a function of the mean depth of coverage(B) False-negative rate of detecting polymorphisms atbiallelic sites as a function of the minor-allele frequencyThe solid curve shows the theoretical minimum value asthe probability that just one of the alleles is sampled in afinite sample of sizeN given by q2N thorn eth12qTHORN2N where q isthe minor-allele frequency (C) False-positive rate ofdetecting triallelic sites as a function of the mean depthof coverage with q = 01 (ie the site is biallelic) (D)False-negative rate of inferring triallelic sites as a functionof the frequency of the rarest allele The frequency of thesecond most abundant allele is 02 The statistical signif-icance of the likelihood-ratio tests is set at the 5 level inall panels N = 100 error rate e = 001 inbreeding co-efficient f = 0 (HardyndashWeinberg equilibrium) Results arebased on a total of 10000 simulation replications foreach parameter set


rate increases with increased coverage for all three types of genotypesalthough there are a few local exceptions The valley of the rate forcorrectly calling homozygotes at coverage equal to eight (Figure 3A) isobserved because when two nucleotides have nonzero read counts oneof which has just one read in the quartet both of them are consideredto be from an allele without error when the coverage is eight or less andthe one with a single read is considered to be due to a sequence errorwhen the coverage is over eight by the TRI as the likelihood that thesingle read is due to a sequence error becomes greater than that withouterror The sudden decrease in the rate for correctly calling heterozy-gotes containing three different nucleotides when coverage increasesfrom six to seven (Figure 3C) is observed because when three nucleo-tides have nonzero read counts one of which has just one read in thequartet all of them are considered to be from an allele without errorwhen the coverage is six or less and the one with a single read isconsidered to be due to a sequence error when the coverage is oversix by the TRI as the likelihood that the single read is due to a sequenceerror becomes greater than that without error These observationshighlight the inherent difficulty in calling polyploid genotypes not onlyfor our method

Comparison of the performance of our TRI to that of GATK showsthatTRIyieldsmoreaccurate calledgenotypeswhen the truegenotype isa homozygote or heterozygote with two different nucleotides whereasGATKyieldsmoreaccurate calledgenotypeswhen the truegenotype is aheterozygote with three different nucleotides (Table S6) Because theformer two genotypes are generally far more abundant than the latter innatural populations our method is likely to yield more accurate geno-type calls than GATK when applied to real data We also note that ourmethod is more efficient than GATK in terms of both memory usageand computation time TRI takes 2749 sec whereas GATK takes397708 sec under the same computing environment to analyze simu-lated data with fixed coverage of 30 at 1010000 sites in an individualThis 145-fold difference in computation time becomes huge whenanalyzing organisms with a large genome size

Because it takes additional time to prepare the input file of TRI fromtheBAMfilemake adictionaryfile of the reference andadd read groupsto the BAM file to use GATK we also compared the computation timebetween TRI andGATK taking these additional times into account Theadditional times taken for the above analysis are 22374 and8863 sec forTRI and GATK respectively making the total time 25123 and 406571sec for TRI and GATK respectively Therefore the total time taken forGATK is 16 times greater than that for TRI

DISCUSSIONGiven the rapid emergence of the field of population genomics there is aneed to systematically examine the performance of genotype callersRecent studies (eg Lynch 2009 Buerkle and Gompert 2013 Marukiand Lynch 2014) found that sequencing asmany individuals as possible

with low depths of coverage is the optimal strategy for estimatingpopulation-level parameters without calling genotypes with limitedbudgets However when depths of coverage are low statistical uncer-tainty of individual sequence data are high and calling genotypes isdifficult Therefore when inferring accurate genotypes is critical for astudy it becomes necessary to sequence a more limited number ofindividuals with higher depths of coverage

In this study we investigated how estimating genotype frequenciesand error rates from a population sample and incorporating thepopulation-level information into genotype-calling processes improvesthe performance of genotype calling especially when the populationdeviates from HWE We also examined the situation when individualcoverage is high enough to accurately call genotypes without priorpopulation-level estimatesTopromote theuseof ourmethodswe freelyprovide C++ programs implementing our genotype callers along withtheir documentation (File S3 File S4 File S5 File S6 File S7 File S8 FileS9 and File S10)

Figure 3 Correct-call rate of the triploid genotype caller as a functionof the depth of coverage (A) Correct-call rate when the true genotypeof the individual is homozygous (B) Correct-call rate when the truegenotype of the individual is heterozygous containing two differentnucleotides (C) Correct-call rate when the true genotype of theindividual is heterozygous containing three different nucleotides Errorrate e = 001 Results are based on a total of 10000 simulation repli-cations for each parameter set

n Table 5 Summary of the proposed methods

Method Ploidy Advantage Disadvantage

BGC Two More accurate for biallelic SNPs Assumes at most two allelesHGC Two Highly efficient Genotype-frequency information not used

Arbitrary numbers of alleles Less accurate for biallelic SNPsTRI Three Highly efficient Genotype-frequency information not used

Arbitrary numbers of allelesTET Four Highly efficient Genotype-frequency information not used

Arbitrary numbers of alleles


Consistent with previous results (Kim et al 2011 Han et al 2014)we find that allele-frequency estimates via genotype calling are biasedwhereas those directly estimated from sequence read data are unbiasedwhen depths of coverage are low Given the importance of unbiasedestimates of allele frequencies for subsequent population-genetic anal-yses this reinforces the importance of estimating allele frequencies di-rectly from sequence reads with low depths of coverage as in Marukiand Lynch (2015) Our BGC takes advantage of prior population-levelestimates of genotype frequencies and error rates to improve the accu-racy of genotype calls in a population with an arbitrary mating systemand internal population structure The higher accuracy of called geno-types by BGC than obtainedwith currently widely used genotype callersin the majority of examined cases shows that our approach improvesthe ability to call genotypes while also providing unbiased estimates ofallele frequencies

Our performance evaluation of the HGC revealed that the rate ofcorrectly calling homozygotes can be high if the minimum coveragerequired for calling genotypes is set at six or greater By raisingthe minimum coverage cutoff to eight or greater the rate for correctlycalling heterozygotes can also be reasonably high The power analyses ofHGC for detecting polymorphisms and triallelic sites also indicate thatthemethod is conservative and reasonably sensitive The comparison ofthe performance of HGC to that of other existing methods showed thatourmethod yieldsmore accurate or as accurate diploid genotype calls atsites with more than two alleles

The application of HGC to high-throughput sequencing data of83 Daphnia clones from a population indicates that a non-negligiblefraction of polymorphisms is triallelic and even tetra-allelic inD pulexThere is growing evidence that triallelic polymorphisms exist in thehuman genome (eg Hodgkinson and Eyre-Walker 2010 Nelson et al2012 Cao et al 2015) In particular a recent study sequencing manysamples with high depths of coverage by Nelson et al (2012) found thatthe fraction of triallelic sites is much higher than that previouslyexpected Furthermore a recent study (Jenkins et al 2014) found tri-allelic polymorphisms may be more informative for demographic in-ferences than biallelic polymorphisms Therefore it is important torelax the assumption of biallelic polymorphisms Our method providesan efficient flexible and statistically rigorous framework for identifyingpolymorphic sites containing arbitrary numbers of alleles

Because of its efficiency and flexibility our method can be extendedto analyses of population-genomic data from polyploid organisms Ourperformanceevaluationof theTRI revealed thatmuchhigher coverage isneeded for correctly calling triploid genotypes than in the case ofdiploids This conclusion is extended to genotype calling of organismswith higher ploidy as the difficulties we found become greater Theseresults will help researchers to design sequencing strategies for pop-ulation-genomic analyses of polyploid organisms

As guidance on proper usage of the proposed four different geno-type-callingmethods we provided a summary of the proposedmethods(Table 5) BGC and HGC are both intended for calling diploid geno-types Users should first apply HGC to population-genomic data ofdiploid organisms to identify sites with more than two alleles Thenusers should apply BGC to the data at biallelic sites as BGC yields moreaccurate genotype calls than HGC at biallelic sites unless the depth ofcoverage is extremely high To facilitate this procedure we provide a C++program and documentation for setting the coverage of all individuals atzero at sites with more than two alleles as identified by HGC (File S11and File S12) TRI and TET are extensions of HGC to triploid andtetraploid data respectively They enable highly efficient genotype callingat sites with arbitrary numbers of alleles If future studies enable thepopulation-level estimation of the genotype frequencies and error rates

in diploid data at sites with arbitrary numbers of alleles and triploid andtetraploid data we can in principle improve the accuracy of genotypecalls from these data although it will be very computationally expensiveto estimate frequencies of many genotypes in these cases

ACKNOWLEDGMENTSWe thank two anonymous reviewers whose comments improved ourmanuscript This work was supported by National Science Foundationgrant DEB-1257806 and National Institutes of Health grant NIH-NIGMS R01-GM101672 It was also supported by the National Centerfor Genome Analysis Support funded by National Science Founda-tion grant DBI-1458641 to Indiana University and Indiana UniversityResearch Technologyrsquos computational resources

Note added in proof See Ye et al 2017 (pp 1405) in this issue andAckerman et al 2017 (pp 105) and Lynch et al 2017 (pp 315) in theGENETICS May issue for related work

LITERATURE CITEDAars J J F Dallas S B Piertney F Marshall J L Gow et al

2006 Widespread gene flow and high genetic variability in populationsof water voles Arvicola terrestris in patchy habitats Mol Ecol 15 1455ndash1466

Black F L and F M Salzano 1981 Evidence for heterosis in the HLAsystem Am J Hum Genet 33 894ndash899

Black W C IV C F Baer M F Antolin and N M DuTeau2001 Population genomics genome-wide sampling of insect popula-tions Annu Rev Entomol 46 441ndash469

Brown A H D 1979 Enzyme polymorphism in plant-populations TheorPopul Biol 15 1ndash42

Browning B L and S R Browning 2016 Genotype imputation withmillions of reference samples Am J Hum Genet 98 116ndash126

Browning S R and B L Browning 2007 Rapid and accurate haplotypephasing and missing-data inference for whole-genome association studiesby use of localized haplotype clustering Am J Hum Genet 81 1084ndash1097

Buerkle C A and Z Gompert 2013 Population genomics basedon low coverage sequencing how low should we go Mol Ecol 223028ndash3035

Cao M J Shi J Wang J Hong B Cui et al 2015 Analysis of humantriallelic SNPs by next-generation sequencing Ann Hum Genet 79275ndash281

Catchen J P A Hohenlohe S Bassham A Amores and W A Cresko2013 Stacks an analysis tool set for population genomics Mol Ecol 223124ndash3140

Catchen J M A Amores P Hohenlohe W Cresko and J H Postlethwait2011 Stacks building and genotyping loci de novo from short-readsequences G3 1 171ndash182

Cockerham C C and B S Weir 1977 Digenic descent measures for finitepopulations Genet Res 30 121ndash147

Danecek P A Auton G Abecasis C A Albers E Banks et al 2011 Thevariant call format and VCFtools Bioinformatics 27 2156ndash2158

Delmotte F N Leterme J P Gauthier C Rispe and J C Simon 2002 Geneticarchitecture of sexual and asexual populations of the aphid Rhopalosiphumpadi based on allozyme and microsatellite markers Mol Ecol 11 711ndash723

DePristo M A E Banks R Poplin K V Garimella J R Maguire et al2011 A framework for variation discovery and genotyping using next-generation DNA sequencing data Nat Genet 43 491ndash498

Ferreira A G and W Amos 2006 Inbreeding depression and multipleregions showing heterozygote advantage in Drosophila melanogasterexposed to stress Mol Ecol 15 3885ndash3893

Foltz D W and J L Hoogland 1983 Genetic-evidence of outbreeding in theblack-tailed prairie dog (Cynomys-Ludovicianus) Evolution 37 273ndash281

Fu Y B 2014 Genetic diversity analysis of highly incomplete SNP geno-type data with imputations an empirical assessment G3 4 891ndash900


1000 Genomes Project ConsortiumAuton A L D Brooks R M DurbinE P Garrison et al 2015 A global reference for human genetic vari-ation Nature 526 68ndash74

Glenn T C 2011 Field guide to next-generation DNA sequencers MolEcol Resour 11 759ndash769

Han E J S Sinsheimer and J Novembre 2014 Characterizing bias inpopulation genetic inferences from low-coverage sequencing data MolBiol Evol 31 723ndash735

Hebert P D N 1978 Population biology of Daphnia (Crustacea Daph-nidae) Biol Rev Camb Philos Soc 53 387ndash426

Hedrick P W 1998 Balancing selection and MHC Genetica 104 207ndash214Hodgkinson A and A Eyre-Walker 2010 Human triallelic sites evidence

for a new mutational mechanism Genetics 184 233ndash241Hohenlohe P A S Bassham P D Etter N Stiffler E A Johnson et al

2010 Population genomics of parallel adaptation in threespine stickle-back using sequenced RAD tags PLoS Genet 6 e1000862

Hudson R R and N L Kaplan 1985 Statistical properties of the numberof recombination events in the history of a sample of DNA sequencesGenetics 111 147ndash164

Hunt M T Kikuchi M Sanders C Newbold M Berriman et al2013 REAPR a universal tool for genome assembly evaluation GenomeBiol 14 R47

International HapMap ConsortiumFrazer K A D G Ballinger D R CoxD A Hinds et al 2007 A second generation human haplotype map ofover 31 million SNPs Nature 449 851ndash861

International HapMap 3 ConsortiumAltshuler D M R A Gibbs L PeltonenD M Altshuler et al 2010 Integrating common and rare genetic varia-tion in diverse human populations Nature 467 52ndash58

Jenkins P A J W Mueller and Y S Song 2014 General triallelic fre-quency spectrum under demographic models with variable populationsize Genetics 196 295ndash311

Jurka J V V Kapitonov A Pavlicek P Klonowski O Kohany et al2005 Repbase update a database of eukaryotic repetitive elementsCytogenet Genome Res 110 462ndash467

Kendall M and A Stuart 1979 The Advanced Theory of Statistics Vol 2Ed 4 Charles Griffin amp Co London

Kim S Y K E Lohmueller A Albrechtsen Y Li T Korneliussen et al2011 Estimation of allele frequency and association mapping usingnext-generation sequencing data BMC Bioinformatics 12 231

Korneliussen T S A Albrechtsen and R Nielsen 2014 ANGSD analysisof next generation sequencing data BMC Bioinformatics 15 356

Li H 2011 A statistical framework for SNP calling mutation discoveryassociation mapping and population genetical parameter estimation fromsequencing data Bioinformatics 27 2987ndash2993

Li H J Ruan and R Durbin 2008 Mapping short DNA sequencingreads and calling variants using mapping quality scores Genome Res18 1851ndash1858

Li H B Handsaker A Wysoker T Fennell J Ruan et al 2009a The sequencealignmentmap format and samtools Bioinformatics 25 2078ndash2079

Li R Y Li X Fang H Yang J Wang et al 2009b SNP detection for massivelyparallel whole-genome resequencing Genome Res 19 1124ndash1132

Li Y C J Willer J Ding P Scheet and G R Abecasis 2010 MaCH usingsequence and genotype data to estimate haplotypes and unobserved ge-notypes Genet Epidemiol 34 816ndash834

Lynch M 2009 Estimation of allele frequencies from high-coverage genome-sequencing projects Genetics 182 295ndash301

Lynch M R Gutenkunst M S Ackerman K Spitze Z Ye et al2016 Population genomics of Daphnia pulex Genetics 206315ndash332

Markow T P W Hedrick K Zuerlein J Danilovs J Martin et al1993 HLA polymorphism in the Havasupai evidence for balancingselection Am J Hum Genet 53 943ndash952

Martin E R D D Kinnamon M A Schmidt E H Powell S Zuchneret al 2010 SeqEM an adaptive genotype-calling approach for next-generation sequencing studies Bioinformatics 26 2803ndash2810

Maruki T and M Lynch 2014 Genome-wide estimation of linkage dis-equilibrium from population-level high-throughput sequencing dataGenetics 197 1303ndash1313

Maruki T and M Lynch 2015 Genotype-frequency estimation from high-throughput sequencing data Genetics 201 473ndash486

McKenna A M Hanna E Banks A Sivachenko K Cibulskis et al2010 The Genome Analysis Toolkit a MapReduce framework for an-alyzing next-generation DNA sequencing data Genome Res 20 1297ndash1303

Melnick D J 1987 The genetic consequences of primate social organiza-tion a review of macaques baboons and vervet monkeys Genetica 73117ndash135

Nelson M R D Wegmann M G Ehm D Kessner P St Jean et al2012 An abundance of rare functional variants in 202 drug target genessequenced in 14002 people Science 337 100ndash104

Nielsen R T Korneliussen A Albrechtsen Y Li and J Wang 2012 SNPcalling genotype calling and sample allele frequency estimation fromnew-generation sequencing data PLoS One 7 e37558

Pool J E I Hellmann J D Jensen and R Nielsen 2010 Population ge-netic inference from genomic sequence variation Genome Res 20 291ndash300

Quail M A M Smith P Coupland T D Otto S R Harris et al 2012 Atale of three next generation sequencing platforms comparison of IonTorrent Pacific Biosciences and Illumina MiSeq sequencers BMC Ge-nomics 13 341

Scheet P and M Stephens 2006 A fast and flexible statistical model forlarge-scale population genotype data applications to inferring missinggenotypes and haplotypic phase Am J Hum Genet 78 629ndash644

Speir M L A S Zweig K R Rosenbloom B J Raney B Paten et al2016 The UCSC genome browser database 2016 update Nucleic AcidsRes 44 D717ndashD725

Storz J F H R Bhat and T H Kunz 2001 Genetic consequences ofpolygyny and social structure in an Indian fruit bat Cynopterus sphinxII Variance in male mating success and effective population size Evo-lution 55 1224ndash1232

Tarr C L S Conant and R C Fleischer 1998 Founder events and var-iation at microsatellite loci in an insular passerine bird the Laysan finch(Telespiza cantans) Mol Ecol 7 719ndash731

Tollenaere C J Bryja M Galan P Cadet J Deter et al 2008 Multipleparasites mediate balancing selection at two MHC class II genes in thefossorial water vole insights from multivariate analyses and populationgenetics J Evol Biol 21 1307ndash1320

Van der Auwera G A M O Carneiro C Hartl R Poplin G Del Angelet al 2013 From FastQ data to high confidence variant calls the Ge-nome Analysis Toolkit best practices pipeline Curr Protoc Bioinfor-matics 11 11101ndash11033

Vieira F G M Fumagalli A Albrechtsen and R Nielsen 2013 Estimatinginbreeding coefficients from NGS data impact on genotype calling andallele frequency estimation Genome Res 23 1852ndash1861

Weir B S 1996 Genetic Data Analysis II Methods for Discrete PopulationGenetic Data Sinauer Associates Sunderland MA

Communicating editor S I Wright


the HardyndashWeinberg equilibrium (HWE) Recent studies (Kimet al 2011 Han et al 2014) found that allele frequencies estimateddirectly from the sequence reads are unbiased whereas those es-timated via genotype calling are biased when depths of coverageare low These studies assumed a population in HWE Vieira et al(2013) recently showed that the performance of genotype callingcan be improved by first estimating inbreeding coefficients fromthe sequencing data and then calling genotypes incorporating theinformation on estimated inbreeding coefficients Unfortunatelytheir method is applicable only when the inbreeding coefficientsare non-negative (Maruki and Lynch 2015) and does not alwaystake full advantage of the population-level information Negativeinbreeding coefficients are common in some organisms includingasexual aphids (Delmotte et al 2002) Daphnia in permanentpondslakes (Hebert 1978) fruit bats (Storz et al 2001) partiallyinbreeding plant species (Brown 1979) Laysan finches (Tarr et al1998) prairie dogs (Foltz and Hoogland 1983) rhesus monkeys(Melnick 1987) and water voles (Aars et al 2006) Furthermoresome regions under balancing selection may contain excess het-erozygotes and therefore show negative inbreeding coefficients(Black and Salzano 1981 Markow et al 1993 Hedrick 1998Black et al 2001 Ferreira and Amos 2006 Tollenaere et al 2008)

In this study we develop a maximum-likelihood (ML) method forcalling genotypes from high-throughput sequencing data that incorpo-rates the prior information from a genotype-frequency estimator (GFE)(Maruki and Lynch 2015) We examine the performance of the pro-posed method using computer simulations under different geneticconditions including those where HWE is violated and compare theperformance with that of other widely used methods The results showthat our method yields more accurate genotype calls with low or mod-erately high depths of coverage than the current widely used methodswhich is supported by analysis of human data In addition we developanotherMLmethod for calling genotypes from high-coverage sequenc-ing data which relaxes the assumption of biallelic polymorphismsmade in many existing methods

We also examine the necessary depth of coverage for identifyingtriallelic sites and accurate genotype calling with the proposed methodusing computer simulations Taking the results of the performanceevaluation into account we apply the proposed method to high-throughput sequencing data of 83 clones from a population of themicrocrustacean Daphnia pulex which have reasonably high depths ofcoverage (mean 18middot per site per individual) (Lynch et al 2016) Theresults show that the proposed method enables accurate and rapididentification of polymorphic sites with arbitrary numbers of allelesFurthermore we show that the proposed method can be applied toanalyses of polyploid data

METHODSWe develop ML methods for calling individual genotypes fromhigh-throughput sequencing data We develop two types of meth-ods for calling genotypes one for low-coverage sequencing data andthe other for high-coverage sequencing data because the degree ofuncertainty associated with sequencing data depends on depths ofcoverage In both types of methods we statistically test the signif-icance of polymorphisms Also we do not call the genotype of anindividual when more than one genotype has equivalent likelihoodvalues for the observed data

Genotype calling from low-coverage sequencing dataWhen depths of coverage are low only one of the two parentalchromosomes might be sampled and sequence errors can resemble

true variants In such cases the statistical uncertainties of indi-vidual sequence data can be high thus calling accurate genotypesis not easy However when sequence data for multiple individualsfrom a population are available the accuracy of the called geno-types can be improved by incorporating the population-levelinformation on genotype frequencies and error rates into thegenotype-calling process using Bayesrsquo theorem [see Martinet al (2010) Nielsen et al (2012) Vieira et al (2013) for similarmethods using the expectation-maximization algorithm for esti-mating priors]

Here the genotype of an individual at a single site is called from thenucleotide read quartet (counts of A C G and T) of high-throughputsequencing data at the site by an ML method This is achieved bymaximizing the likelihood of the observed data as a function of thegenotype of the individual g The two most abundant nucleotide readsin the population sample are considered to be candidates for alleles atthe site

Given the genotype of the individual g = 1 (major homozygoteMM) 2 (heterozygote Mm) or 3 (minor homozygote mm) and se-quencing error rate per read per site e the log-likelihood of the observedsite-specific read quartet consisting of the observed counts of the mostabundant (major) nucleotide readM (eg C) (nM) second most abun-dant (minor) nucleotide read m (eg T) (nm) and other nucleotidereads (eg in this case A and G) (ne1 and ne2) in the population samplelnLethnM nm ne1 ne2jg eTHORN is given by the following multinomial distri-bution formula

ln LethnM nm ne1 ne2jg eTHORN frac14 lnethnTHORN=ethnM nmne1ne2THORNPgethMTHORNnM

middot PgethmTHORNnmPgethe1THORNne1Pgethe2THORNne2

(1)

where n frac14 nM thorn nm thorn ne1 thorn ne2 (the depth of coverage) PgethMTHORN is aprobability of observed nucleotide read M with genotype g It isa function of e and is given by summing conditional probabil-ities of the observed nucleotide read given the true nucleotideon the sequenced chromosome chosen from the pair (Table 1)The other P terms are similarly defined For example the prob-ability of nucleotide read M with genotype 2 (Mm) P2ethMTHORN iseth1=2THORNeth12 eTHORN thorn eth1=2THORNethe=3THORN assuming the error occurs at an equalrate from the true nucleotide to one of the other three nucleo-tides Because the multinomial coefficient in Equation 1 is con-stant regardless of the parameter values for computationalefficiency it is ignored as follows

ln LethnM nm ne1 ne2jg eTHORN frac14 lnPgethMTHORNnMPgethmTHORNnmPgethe1THORNne1Pgethe2THORNne2

(2)

When the genotype frequencies and error rate at the site areestimated from the population sample of nucleotide reads theycan be incorporated into Equation 2 as Bayesrsquo priorsWe previouslydeveloped an ML method to estimate the site-specific genotypefrequencies and error rate from the nucleotide read quartets(Maruki and Lynch 2015) This method yields essentially unbiasedgenotype-frequency estimates even with moderate depths of cov-erage which we use as the Bayesrsquo priors necessary for improvingthe accuracy of genotype calls Specifically given the estimates ofthe genotype frequencies g1 (frequency of major homozygotes) g2

(frequency of heterozygotes) and g3 (frequency of minor homo-zygotes) and error rate (e) predetermined by the ML method ofMaruki and Lynch (2015) the log-likelihood of the observed dataare given as follows



frac14 ln


i

X3jfrac141


i9= (3)









(6)












M m e1 e2












(11)




q 10n

(12)



(13)






























iX (23)
















































































































































frac14 ln


i

X3jfrac141


i9= (3)









(6)












M m e1 e2












(11)




q 10n

(12)



(13)






























iX (23)























































































































































(11)




q 10n

(12)



(13)






























iX (23)

















































































































































iX (23)















































































































































Genotype Calling from Population-Genomic Sequencing Data...inferred (called) genotype of the...

Documents

Transcript of Genotype Calling from Population-Genomic Sequencing Data...inferred (called) genotype of the...