Assembly errors cause false tandem duplicate regions in the chicken (Gallus gallus) genome sequence


RESEARCH ARTICLE

Assembly errors cause false tandem duplicate regions in the chicken (Gallus gallus) genome sequence

Qu Zhang & Niclas Backström

Received: 6 September 2013 / Revised: 25 October 2013 / Accepted: 28 October 2013
© Springer-Verlag Berlin Heidelberg 2013

Abstract The complexity of eukaryote genomes makes assembly errors inevitable in the process of constructing reference genomes. Next-generation sequencing (NGS) could provide an efficient way to validate previously assembled genomes. Here, we exploited NGS data to interrogate the chicken reference genome and identified 35 pairs of nearly identical regions with >99.5 % sequence similarity and a median size of 109 kb. Several lines of evidence, including read depth, the composition of junction sequences, and sequence similarity, suggest that these regions represent genome assembly errors and should be excluded from forthcoming genomic studies.

Keywords Next-generation sequencing . Read depth . Assembly error . Chicken genome

Introduction

The whole-genome shotgun strategy is commonly used to generate genome-wide DNA sequence data of eukaryotic species. Although various sophisticated algorithms have been developed to assemble such data into full genome sequences, many errors may still remain even in announced “complete” genomes (Kelley and Salzberg 2010). One of the most challenging problems in the process is to correctly assemble duplicated regions, especially large segmental duplications. This is a serious issue for evolutionary studies since segmental duplications may play major functional roles as well as give important information about relationships between taxonomic groups (Bailey et al. 2002). One major problem with extensive duplicated regions is that assembly algorithms have difficulty separating individual regions if there is high sequence similarity between duplicates, and such regions tend to be collapsed in the final assembly (Salzberg and Yorke 2005). Conversely, and equally problematic, the assembly process produces an increased rate of “false” duplicated regions in parts of the genome with substantial sequence heterozygosity (Kelley and Salzberg 2010). Both of these erroneous assembly outcomes can make it difficult to characterize genomic properties accurately and to identify DNA polymorphisms.

A validated genome assembly must be consistent with statistical properties of the data-generating process (Myers 1995). Based on this theory, different methods have been developed to examine the quality of a genome assembly using for example mate-pair information, repeat information, micro-heterogeneities, and/or coverage information (Phillippy et al. 2008). The development of high-throughput next-generation sequencing (NGS) technology provides an unprecedented opportunity in genomic studies, including detection of errors in previously available reference genome assemblies. Despite the fact that NGS sequence reads are relatively short and sometimes unpaired, deep sequencing can generate high genome coverage, which is highly suitable for coverage-based validation. In this study, we analyzed NGS data from the red jungle fowl (Gallus gallus) female that was used to reconstruct the chicken reference genome. By interrogating read coverage patterns across the genome of this individual and an additional independent sample of eight pooled male domestic chickens, we find large nearly identical regions (NIRs) which are very likely the result of errors in the initial genome assembly. We report the details of these regions and recommend that they should be excluded from future evolutionary genomics analyses using the chicken reference genome sequence.

Electronic supplementary material The online version of this article (doi:10.1007/s00412-013-0443-8) contains supplementary material, which is available to authorized users.

Q. Zhang (*)
Department of Human Evolutionary Biology, Harvard University, 11 Divinity Avenue, Cambridge, MA 02138, USA
e-mail: [email protected]

N. Backström
Department of Evolutionary Biology, Evolutionary Biology Centre (EBC), Uppsala University, Norbyvägen 18D, 752 36 Uppsala, Sweden
e-mail: [email protected]

Q. Zhang
Pioneer Hi-Bred International, A DuPont Business, Johnston, IA 50131, USA

Chromosoma
DOI 10.1007/s00412-013-0443-8

Materials and methods

Chicken whole genome re-sequencing data (Rubin et al. 2010) generated by the ABI SOLiD platform technology was retrieved from the NCBI short read archive (http://www.ncbi.nlm.nih.gov/sra). We downloaded data from two samples, a red jungle fowl female from the partly inbred UCD 001 line used to generate the chicken genome sequence (accession SRX016464) and a sample of eight pooled red jungle fowl males from two captive populations in Sweden (accession SRX016463).

Short reads were preprocessed to remove low-quality reads with mean base quality <10. Since SOLiD sequencing encodes dinucleotides in “color space” and an error in the first bases of a read can corrupt the decoding of the rest of the sequence, we also removed reads with poor quality (<15) in any of the first five bases to avoid erroneous alignments. Filtered reads were then aligned to the chicken reference genome [Ensembl release 71 (Flicek et al. 2013)] using the BWA aligner (Li and Durbin 2009), with the following parameter settings: -l 25 -k 2 -n 5. The whole genome alignment was binned into 1,000 bp windows with 200 bp overlap, and the number of aligned reads starting within each window was tallied. The repeat-masked chicken reference genome was downloaded from the Ensembl database, and the percentage of bases belonging to repetitive elements was recorded for each window.
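The windowing step described above can be illustrated with a minimal sketch. It assumes that a 200 bp overlap between 1,000 bp windows corresponds to an 800 bp step, that coordinates are 0-based, and that only full-length windows are kept; the function name `count_read_starts` is hypothetical and not taken from the original pipeline.

```python
WINDOW = 1000             # window size in bp (from the text)
OVERLAP = 200             # overlap between adjacent windows in bp (from the text)
STEP = WINDOW - OVERLAP   # assumption: a 200-bp overlap implies an 800-bp step


def count_read_starts(read_starts, chrom_length, window=WINDOW, step=STEP):
    """Tally read start positions (0-based) per overlapping window.

    Window i covers [i * step, i * step + window). A read start that lies
    in the overlap between two windows is counted in both of them.
    """
    if chrom_length < window:
        return []
    n_windows = (chrom_length - window) // step + 1
    counts = [0] * n_windows
    for pos in read_starts:
        # First and last window indices whose interval contains pos.
        first = max(0, (pos - window) // step + 1)
        last = min(n_windows - 1, pos // step)
        for i in range(first, last + 1):
            counts[i] += 1
    return counts


if __name__ == "__main__":
    # Toy chromosome of 2,600 bp gives windows starting at 0, 800, and 1,600.
    print(count_read_starts([0, 850, 900, 1700, 2599], 2600))
```

Counting read starts rather than covered bases keeps each read's contribution independent of read length, which matches the tally described in the text.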

To identify regions with potential copy number changes, we first removed windows with more than 5 % ambiguous bases or more than 10 % repetitive bases. Since variations in GC content in different regions introduce bias in read depth, we corrected for that using a previously developed method (Yoon et al. 2009). Finally, we estimated the mean and standard deviation of read counts for all windows on the autosomes and defined copy number variable regions (CNVRs) as those with more than five consecutive windows showing coverage values at least two standard deviations away from the mean.
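The GC correction and the CNVR definition above can be sketched as follows. The correction shown is a median-ratio adjustment (each window count scaled by the overall median divided by the median of windows with the same GC percentage), which is a common form of read-depth GC correction; whether it matches the authors' exact implementation is an assumption. The function names, the use of the population standard deviation, and the choice to treat gains and losses as separate runs are likewise illustrative assumptions.

```python
from statistics import mean, median, pstdev


def gc_correct(counts, gc_percent):
    """Median-ratio GC adjustment: count * overall median / median for that GC bin."""
    overall = median(counts)
    by_gc = {}
    for c, g in zip(counts, gc_percent):
        by_gc.setdefault(g, []).append(c)
    gc_median = {g: median(v) for g, v in by_gc.items()}
    return [c * overall / gc_median[g] if gc_median[g] > 0 else 0.0
            for c, g in zip(counts, gc_percent)]


def call_cnvrs(depth, n_sd=2.0, min_run=6):
    """Return (first_idx, last_idx, 'gain' | 'loss') for runs of deviating windows.

    min_run=6 encodes "more than five consecutive windows"; n_sd=2.0 encodes
    "at least two standard deviations away from the mean".
    """
    mu, sd = mean(depth), pstdev(depth)
    if sd == 0:
        return []

    def state(x):
        if x >= mu + n_sd * sd:
            return "gain"
        if x <= mu - n_sd * sd:
            return "loss"
        return None

    regions = []
    run_start, run_state = 0, None
    # A trailing None acts as a sentinel that closes any run reaching the end.
    for i, x in enumerate(list(depth) + [None]):
        s = state(x) if x is not None else None
        if s != run_state:
            if run_state is not None and i - run_start >= min_run:
                regions.append((run_start, i - 1, run_state))
            run_start, run_state = i, s
    return regions


if __name__ == "__main__":
    toy = [100] * 50 + [50] * 6 + [100] * 144  # six half-depth windows
    print(call_cnvrs(toy))
```

In practice the window filter (ambiguous and repetitive bases) would be applied before both functions, and windows removed by the filter would break runs; that bookkeeping is omitted here for brevity.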

Results and discussion

In total, we identified 346 CNVRs in the UCD sample, including 59 regions with copy number gains (CNGs) and 287 regions with copy number losses (CNLs) (Supplementary Table 1). The size of the CNVRs ranged from 4 to 20 kb, with a median size of 5 kb. The observation of CNVRs indicates the presence of mis-assemblies in the chicken reference genome. Since CNGs are probably due to erroneously collapsed regions and methods based on read depth may be inadequate to characterize them in detail, we focused only on CNLs in this study. On closer examination of the initially detected CNLs, we found that they were considerably clustered into larger regions spanning tens to hundreds of kb. As briefly mentioned above, one possible source of this type of error is the incorporation of artificial duplications into the genome assembly. To test whether this explains the CNLs detected in this study, we identified all paralogous regions for the detected CNL clusters using BLAT (Kent 2002) and manually determined the start and end positions for each pair of duplications. Intriguingly, we found that each cluster was composed of two or more tandem duplications with extremely high sequence similarity, and they could be separated into 35 tandem duplicate pairs nearly identical in sequence (nearly identical regions, NIRs, Fig. 1 and Supplementary Table 2), ranging from 7 to 309 kb in size with a median of 109 kb. In total, they summed up to 8 Mb, corresponding to 0.4 % of the chicken reference genome sequence. Several additional lines of evidence indicate that these large segmental duplications result from the erroneous addition of an extra stretch of sequence to the reference genome assembly: (1) the comparison between read depth on autosomes and the Z chromosome verified that the power to distinguish one-copy from two-copy regions using our pipeline is considerable (Fig. 2a); (2) detailed examination of the sequence similarity between tandem duplicates found that all of these CNL regions are composed of highly similar, or even identical, tandem duplicates (99.95 % similarity on average, Supplementary Table 2), which is extremely unlikely given their large sizes; (3) the average read depths for each respective copy of a duplicate pair are very similar (Supplementary Table 2); (4) for a majority of the NIRs (30/35), the stretch of sequence separating two tandem duplications is composed of ambiguous bases (N) only (Fig. 2b); (5) all of the NIRs display a similar coverage pattern in the pooled sample of eight red jungle fowl males (Fig. 2b). When intersecting our detected CNL clusters with gene annotation information, we found 136 protein coding genes (∼0.9 % of total coding genes) and 10 miRNAs (∼1 % of total miRNAs) located within these regions.
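Three of the lines of evidence listed above (sequence similarity, similar and reduced read depth in both copies, and N-only junctions) lend themselves to a simple programmatic check. The sketch below is illustrative only: the function names are hypothetical, the 0.35 to 0.65 depth-ratio bounds are arbitrary assumptions standing in for "about half of the autosomal mean", and the 99.5 % identity threshold is borrowed from the figure quoted in the abstract. It also assumes the two copies have already been aligned to equal length.

```python
def junction_is_all_n(junction_seq):
    """True if the sequence separating two copies consists only of N."""
    return len(junction_seq) > 0 and set(junction_seq.upper()) == {"N"}


def percent_identity(aligned_a, aligned_b):
    """Percent identical columns between two aligned sequences.

    Columns containing a gap ('-') or an ambiguous base ('N') are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0


def looks_like_false_duplicate(depth_a, depth_b, autosomal_mean, junction_seq,
                               aligned_a, aligned_b,
                               ratio_bounds=(0.35, 0.65), min_identity=99.5):
    """Combine the three checks; all thresholds are illustrative assumptions."""
    lo, hi = ratio_bounds
    half_depth = all(lo <= d / autosomal_mean <= hi for d in (depth_a, depth_b))
    return (half_depth
            and junction_is_all_n(junction_seq)
            and percent_identity(aligned_a, aligned_b) >= min_identity)


if __name__ == "__main__":
    copy = "ACGT" * 250
    print(looks_like_false_duplicate(50, 52, 100, "NNNN", copy, copy))
```

The depth criterion reflects the reasoning in the text: if a single genomic region is represented twice in the assembly, its reads are divided between the two copies, so each copy shows roughly half of the autosomal average depth.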

Fig. 1 An illustrative example of a region of false duplications (chr18:7017519-7637909). Four chicken RefSeq genes (NOL11, PSMD12, CACNG1, and CACNG4) are embedded within this region with two identical copies. The boundary between two duplicated genomic regions is composed of ambiguous bases (Ns) and is indicated by a triangle. The orthologous regions in turkey (Meleagris gallopavo) and zebra finch (Taeniopygia guttata) are also presented with only one copy for each gene, respectively

Fig. 2 Read depth analyses across the chicken reference genome where each point represents the total read count within a 1-kb window, blue dashed lines denote the median read depth for all autosomes, and red dashed lines denote the median read depth for the Z chromosome. a Read depth is highly consistent with true copy number; coverage on autosomes is twice as high as coverage on the Z chromosome in the single UCD female but similar to the coverage on the Z chromosome in the pooled RJF sample. Note that only chromosome 4 was selected to represent autosomes. b NIRs were defined as regions deviating more than two standard deviations from the global mean coverage of the chromosome class. Illustrated is a case on chromosome 18 with two duplicate tandem regions with half of the autosomal global average read depth in both the single UCD female and the pooled RJF sample. This most likely represents an error in the assembly where a single genomic region has been interpreted as two tandem duplicates. The majority of NIRs are separated by ambiguous bases (Ns), further supporting mis-assembly rather than true duplicates. UCD the single female individual used to generate the chicken genome assembly, RJF pooled sample of eight red jungle fowl males

It has been well acknowledged that mis-assemblies are present in published “complete” genomes (Kelley and Salzberg 2010). As demonstrated in previous studies, a substantial proportion of highly similar intrachromosomal duplications can be false duplications caused by genome mis-assembly (Bailey et al. 2002; Cheung et al. 2003; Kelley and Salzberg 2010), probably as a result of high regional heterozygosity in the corresponding chromosome region. Although the use of highly inbred samples can alleviate the problem to some extent, poorly assembled regions still exist in these cases (Kelley and Salzberg 2010). For example, Kelley and Salzberg developed a contig-centric algorithm and identified 8,001 mis-assembled contigs in the chicken genome (Kelley and Salzberg 2010). However, when we intersected the chromosomal locations of these contigs with the positions of the NIRs identified in our study, none of the NIR pairs was among them. We therefore conclude that the NIRs reported here supplement current knowledge of mis-assembled regions in the chicken genome. Although the underlying cause of this type of mis-assembly remains unclear, extra care must be taken when interpreting observed copy number variations in chicken, and we propose that these regions should be excluded from forthcoming genomic studies involving the chicken reference assembly.

Acknowledgments QZ was supported by the Department of Human Evolutionary Biology, Harvard University. NB acknowledges postdoctoral research funding from the Swedish Research Council (VR grant 2009-693). We thank the anonymous reviewers for the helpful comments on an earlier version of this manuscript.

References

Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, Schwartz S, Adams MD, Myers EW, Li PW, Eichler EE (2002) Recent segmental duplications in the human genome. Science 297(5583):1003–1007

Cheung J, Estivill X, Khaja R, MacDonald JR, Lau K, Tsui LC, Scherer SW (2003) Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence. Genome Biol 4(4):R25

Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, Garcia-Giron C, Gordon L, Hourlier T, Hunt S, Juettemann T, Kahari AK, Keenan S, Komorowska M, Kulesha E, Longden I, Maurel T, McLaren WM, Muffato M, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, Riat HS, Ritchie GR, Ruffier M, Schuster M, Sheppard D, Sobral D, Taylor K, Thormann A, Trevanion S, White S, Wilder SP, Aken BL, Birney E, Cunningham F, Dunham I, Harrow J, Herrero J, Hubbard TJ, Johnson N, Kinsella R, Parker A, Spudich G, Yates A, Zadissa A, Searle SM (2013) Ensembl 2013. Nucleic Acids Res 41(Database issue):D48–D55

Kelley DR, Salzberg SL (2010) Detection and correction of false segmental duplications caused by genome mis-assembly. Genome Biol 11(3):R28

Kent WJ (2002) BLAT—the BLAST-like alignment tool. Genome Res 12(4):656–664

Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754–1760

Myers EW (1995) Toward simplifying and accurately formulating fragment assembly. J Comput Biol 2(2):275–290

Phillippy AM, Schatz MC, Pop M (2008) Genome assembly forensics: finding the elusive mis-assembly. Genome Biol 9(3):R55

Rubin CJ, Zody MC, Eriksson J, Meadows JR, Sherwood E, Webster MT, Jiang L, Ingman M, Sharpe T, Ka S, Hallbook F, Besnier F, Carlborg O, Bed’hom B, Tixier-Boichard M, Jensen P, Siegel P, Lindblad-Toh K, Andersson L (2010) Whole-genome resequencing reveals loci under selection during chicken domestication. Nature 464(7288):587–591

Salzberg SL, Yorke JA (2005) Beware of mis-assembled genomes. Bioinformatics 21(24):4320–4321

Yoon S, Xuan Z, Makarov V, Ye K, Sebat J (2009) Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res 19(9):1586–1592
