Abundance and Distribution of Transposable Elements in Two ...

17
Article Abundance and Distribution of Transposable Elements in Two Drosophila QTL Mapping Resources Julie M. Cridland,* ,1 Stuart J. Macdonald, 2 Anthony D. Long, 1 and Kevin R. Thornton* ,1 1 Department of Ecology, Evolution and Physiology, University of California, Irvine 2 Department of Molecular Biosciences, University of Kansas *Corresponding author: E-mail: [email protected]; [email protected]. Associate editor: Naoko Takezaki All software and TE calls resulting from our pipeline are available at http://www.molpopgen.org/Data/. Abstract Here we present computational machinery to efficiently and accurately identify transposable element (TE) insertions in 146 next-generation sequenced inbred strains of Drosophila melanogaster. The panel of lines we use in our study is composed of strains from a pair of genetic mapping resources: the Drosophila Genetic Reference Panel (DGRP) and the Drosophila Synthetic Population Resource (DSPR). We identified 23,087 TE insertions in these lines, of which 83.3% are found in only one line. There are marked differences in the distribution of elements over the genome, with TEs found at higher densities on the X chromosome, and in regions of low recombination. We also identified many more TEs per base pair of intronic sequence and fewer TEs per base pair of exonic sequence than expected if TEs are located at random locations in the euchromatic genome. There was substantial variation in TE load across genes. For example, the paralogs derailed and derailed-2 show a significant difference in the number of TE insertions, potentially reflecting differences in the selection acting on these loci. When considering TE families, we find a very weak effect of gene family size on TE insertions per gene, indicating that as gene family size increases the number of TE insertions in a given gene within that family also increases. TEs are known to be associated with certain phenotypes, and our data will allow investigators using the DGRP and DSPR to assess the functional role of TE insertions in complex trait variation more generally. Notably, because most TEs are very rare and often private to a single line, causative TEs resulting in phenotypic differences among individuals may typically fail to replicate across mapping panels since individual elements are unlikely to segregate in both panels. Our data suggest that “burden tests” that test for the effect of TEs as a class may be more fruitful. Key words: transposable element, DGRP, DSPR, genomics, population genetics. Introduction Transposable elements are common, naturally occurring sources of genetic variation known to play diverse roles in genome evolution (Bennetzen 2000; reviewed in Kazazian 2004), influencing chromosomal rearrangements (Lonnig and Saedler 2002; Biemont and Vieira 2006), genome size (Kidwell 2002) and gene duplication (Schmidt et al. 2010). They also contribute both to functional variation between individuals (Daborn et al. 2001; Aminetzach et al. 2005) and to tissue-specific gene expression (Sackton et al. 2009). TEs can also contribute to variation in quantitative traits such as bris- tle number variation (Mackay 1984; Shrimpton et al. 1990; Mackay et al. 1992) and fitness (Mackay 1989). However, this variation is not limited to the production of null alleles, but instead TEs can produce a wide variety of changes in gene expression. TEs can act as enhancers (Chung et al. 2007), repressors (Zachar and Bingham 1982), or regulators of more complex expression patterns acting either in cis or in trans (Smith and Corces 1991). Additionally, different TE in- sertions into the same gene do not necessarily produce the same effect (Zachar and Bingham 1982; Birchler et al. 1989; Birchler and Hiebert 1989). In Drosophila, a large portion of phenotypic variation is likely due to rare alleles maintained through mutation-selection bal- ance (Mackay 2010). TEs are potentially good candidates to be rare, causative mutations contributing to a wide variety of phenotypic variation. The population frequency of most TE insertions in Drosophila is low (Charlesworth and Langley 1989), resulting in rare variants that could potentially have phenotypic consequences. TE insertions are often deleterious and host genomes have evolved a variety of methods to reg- ulate TE replication in their genomes (reviewed in Slotkin and Martienssen 2007). Still, many transposable elements are known to be active in Drosophila melanogaster (Deloger et al. 2009) , with insertion rates ranging between 10 3 and 10 5 elements per generation (Nuzhdin and Mackay 1994), suggesting high rates of ongoing activity in many different TE families continuing to produce rare variants. Individually rare transposable element insertions as a class of mutations have also been associated with quantitative traits in Drosophila (Mackay and Langley 1990; Long et al. 2000). TEs as a class have been found to be causative muta- tions in association studies examining both the Enhancer-of- split gene complex and the achaete-scute complex in ß The Author 2013. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http:// creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] Open Access Mol. Biol. Evol. 30(10):2311–2327 doi:10.1093/molbev/mst129 Advance Access publication July 24, 2013 2311

Transcript of Abundance and Distribution of Transposable Elements in Two ...

Page 1: Abundance and Distribution of Transposable Elements in Two ...

Article

Abundance and Distribution of Transposable Elements in TwoDrosophila QTL Mapping ResourcesJulie M Cridland1 Stuart J Macdonald2 Anthony D Long1 and Kevin R Thornton1

1Department of Ecology Evolution and Physiology University of California Irvine2Department of Molecular Biosciences University of Kansas

Corresponding author E-mail jcridlanuciedu krthorntuciedu

Associate editor Naoko Takezaki

All software and TE calls resulting from our pipeline are available at httpwwwmolpopgenorgData

Abstract

Here we present computational machinery to efficiently and accurately identify transposable element (TE) insertions in146 next-generation sequenced inbred strains of Drosophila melanogaster The panel of lines we use in our study iscomposed of strains from a pair of genetic mapping resources the Drosophila Genetic Reference Panel (DGRP) and theDrosophila Synthetic Population Resource (DSPR) We identified 23087 TE insertions in these lines of which 833 arefound in only one line There are marked differences in the distribution of elements over the genome with TEs found athigher densities on the X chromosome and in regions of low recombination We also identified many more TEs per basepair of intronic sequence and fewer TEs per base pair of exonic sequence than expected if TEs are located at randomlocations in the euchromatic genome There was substantial variation in TE load across genes For example the paralogsderailed and derailed-2 show a significant difference in the number of TE insertions potentially reflecting differences inthe selection acting on these loci When considering TE families we find a very weak effect of gene family size on TEinsertions per gene indicating that as gene family size increases the number of TE insertions in a given gene within thatfamily also increases TEs are known to be associated with certain phenotypes and our data will allow investigators usingthe DGRP and DSPR to assess the functional role of TE insertions in complex trait variation more generally Notablybecause most TEs are very rare and often private to a single line causative TEs resulting in phenotypic differences amongindividuals may typically fail to replicate across mapping panels since individual elements are unlikely to segregate inboth panels Our data suggest that ldquoburden testsrdquo that test for the effect of TEs as a class may be more fruitful

Key words transposable element DGRP DSPR genomics population genetics

IntroductionTransposable elements are common naturally occurringsources of genetic variation known to play diverse roles ingenome evolution (Bennetzen 2000 reviewed in Kazazian2004) influencing chromosomal rearrangements (Lonnigand Saedler 2002 Biemont and Vieira 2006) genome size(Kidwell 2002) and gene duplication (Schmidt et al 2010)They also contribute both to functional variation betweenindividuals (Daborn et al 2001 Aminetzach et al 2005) and totissue-specific gene expression (Sackton et al 2009) TEs canalso contribute to variation in quantitative traits such as bris-tle number variation (Mackay 1984 Shrimpton et al 1990Mackay et al 1992) and fitness (Mackay 1989) However thisvariation is not limited to the production of null alleles butinstead TEs can produce a wide variety of changes in geneexpression TEs can act as enhancers (Chung et al 2007)repressors (Zachar and Bingham 1982) or regulators ofmore complex expression patterns acting either in cis or intrans (Smith and Corces 1991) Additionally different TE in-sertions into the same gene do not necessarily produce thesame effect (Zachar and Bingham 1982 Birchler et al 1989Birchler and Hiebert 1989)

In Drosophila a large portion of phenotypic variation is likelydue to rare alleles maintained through mutation-selection bal-ance (Mackay 2010) TEs are potentially good candidates to berare causative mutations contributing to a wide variety ofphenotypic variation The population frequency of most TEinsertions in Drosophila is low (Charlesworth and Langley1989) resulting in rare variants that could potentially havephenotypic consequences TE insertions are often deleteriousand host genomes have evolved a variety of methods to reg-ulate TE replication in their genomes (reviewed in Slotkin andMartienssen 2007) Still many transposable elements areknown to be active in Drosophila melanogaster (Delogeret al 2009) with insertion rates ranging between 103 and105 elements per generation (Nuzhdin and Mackay 1994)suggesting high rates of ongoing activity in many different TEfamilies continuing to produce rare variants

Individually rare transposable element insertions as a classof mutations have also been associated with quantitativetraits in Drosophila (Mackay and Langley 1990 Long et al2000) TEs as a class have been found to be causative muta-tions in association studies examining both the Enhancer-of-split gene complex and the achaete-scute complex in

The Author 2013 Published by Oxford University Press on behalf of the Society for Molecular Biology and EvolutionThis is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (httpcreativecommonsorglicensesby-nc30) which permits non-commercial re-use distribution and reproduction in any mediumprovided the original work is properly cited For commercial re-use please contact journalspermissionsoupcom Open AccessMol Biol Evol 30(10)2311ndash2327 doi101093molbevmst129 Advance Access publication July 24 2013 2311

Drosophila (Mackay and Langley 1990 Long et al 2000Macdonald et al 2005 Gruber et al 2007) Future studiesmay reveal many more examples of TEs as a class as causativemutations

Recently two reference panels for the mapping of quanti-tative trait loci (QTL) have been unveiled the DrosophilaGenetic Reference Panel (DGRP) (Mackay et al 2012) andthe Drosophila Synthetic Population Resource (DSPR) (Kinget al 2012a 2012b) The DGRP is a set of 168 inbred isofemalelines derived from individuals collected in Raleigh NC in 2003The DSPR is a collection of ~1700 Recombinant Inbred Linesderived from 15 highly inbred founder lines The founderswere collected from many different geographic locationsand were all been in laboratory conditions for 40 + years(Macdonald and Long 2007) These resources are intendedfor use as platforms to map QTL in D melanogaster and havebeen characterized with respect to single-nucleotide poly-morphism (SNP) genotypes (King et al 2012a Mackay et al2012) In addition an initial pass at calling transposable ele-ment (TE) insertions in the DGRP was presented in Mackayet al (2012) and in Linheiro and Bergman (2012) This studyrevealed hundreds of rare TE insertions per line and in caseswhere TE insertions are found in the same gene in multiplelines the insertions are usually at different positions withinthe gene (Mackay et al 2012) These findings suggest thatmarker-based associations where the causative mutation is aTE insertion may fail to replicate and that tests for the effectsof TE insertions as a class of mutations such as in Mackay andLangley (1990) Long et al (2000) and Macdonald et al(2005) may be more fruitful

Replicability of genotypendashphenotype associations is a crit-ical step for establishing causative variation and genome-wideassociation studies in humans use replication in multiple datasets as a standard of quality (NCI-NHGRI Working Group onReplication in Association Studies 2007) Future studies usingthe DGRP and DSPR must also consider differences in thepattern of TE insertions between panels when designing ex-periments utilizing these resources because a causative TEsegregating in one panel but not in the other would affectthe replicability of a study It is therefore important to knowthe TE genotype of the individual fly lines used in QTL studiesin Drosophila and an accurate characterization of the TE con-tent of both panels is desirable for future QTL mapping stud-ies in Drosophila Currently in D melanogaster there are onlya few replicated associations One set involves TE insertions atthe achaete-scute complex and their effect on bristle number(Mackay and Langley 1990 Long et al 2000) There is oneexample of a replicated association involving a SNP and thegene Egfr in Drosophila (Palsson and Gibson 2004 Dworkinet al 2005) although the nature of the replication is complexbecause the specific phenotype associated with the replicatedSNP differs between studies

Here we describe the TE content of two QTL mappingresources identifying patterns of TE abundance and distribu-tion We find that TEs are generally rare existing in only oneline and that there is an excess of rare TEs compared with thestandard neutral model and SNPs We find substantial varia-tion between genes in term of TE load as well as a weak effect

of gene family size on average TE load on genes TEs are also athigher densities in regions of low recombination as well as onthe X chromosome

New ApproachesHere we present a whole-genome TE calling method which isan improvement to the method used to call TEs in Mackayet al (2012) and apply it to two QTL mapping resources(supplementary fig S1 Supplementary Material online)Briefly TEs are identified by read-pairs generated via next-generation paired-end sequencing Informative read-pairs arecharacterized by a pattern of one read in the pair aligninguniquely to the genome and the other read in the pair align-ing to multiple genomic regions Individual insertion eventsare identified by two sets of read pairs one set anchored byuniquely aligning reads upstream of the insertion and one setanchored by uniquely aligning reads downstream of the in-sertion Following identification of informative read pairs weuse Phrap (Ewing and Green 1998) to reconstruct the localarea in genomic regions suggestive of the presence of a TEReconstructed sequences are then aligned back to the refer-ence genome using BlastN (Altschul et al 1990) to validatethe presence of a TE breakpoint and to classify the type of TEidentified

There are four major improvements to this method overthe previous method used in Mackay et al (2012) First wehave switched to using the aligner BWA (Li and Durbin 2009)which is more accurate at distinguishing unique from non-unique alignments to a reference genome than the Mosaik 10(httpcodegooglecompmosaik-alignerdownloadsdetailname=Mosaik201020Documentationpdf last accessedJune 30 2012) aligner used in Mackay et al (2012) for TEdetection Second we have incorporated bedtools (Quinianand Hall 2010) into our pipeline which has improved thespeed at which we can detect events Third we have alsoincorporated definitive absence calls to the pipeline wherewe attempt to positively call the presence or absence of a TEat a given position in a given genotype (cf Mackay et al 2012)Finally we have performed both computational and poly-merase chain reaction (PCR)-based validation of our methoddemonstrating the methodrsquos precision and sensitivity

Results

Identity by Descent in the DGRP Resource

Because association studies rely on the assumption that indi-viduals are unrelated (Voight and Pritchard 2005) we used theSNP calls from Mackay et al (2012) to scan for regions ofextensive identity by descent (IBD) in the DGRP samplegt95 similarity in 1 Mb windows with 100 kb steps Webegan with a total of 148 DGRP lines for which we have se-quence data We identified many large regions of IBD betweenpairs of lines (fig 1A) and removed from the analysis a total of13 DGRP lines which were identified as 95 IBD with an-other line over 50 of their genomes When two linesshowed such genome-wide IBD the line with the highest av-erage coverage was retained for further analyses Similarly re-gions in remaining DGRP lines which were 95 IBD to

2312

Cridland et al doi101093molbevmst129 MBE

another line were masked in the line with the lowest averagegenome coverage In addition to the regions removed due toIBD we removed four lines from the DGRP set which had anaverage genome coverage of lt10 (supplementary table S1Supplementary Material online) This filtering resulted in asample size of 131 DGRP lines some of which were partiallymasked to be used in all further analyses This removal of somelines and masking of regions in other lines resulted in variationin sample sizes across the genome For 98 of sites in thegenome the sample size was 68 For the remainder of sitescoverage was low in a substantial number of lines indicative ofgenomic regions where alignment of short reads is difficult

Coalescent simulations of an equilibrium WrightndashFishermodel showed that the probability of more than four pairsof lines with a single 1 Mb region of IBD matching our IBDcriteria is lt103 Therefore our observation of hundreds of1 Mb regions of IBD between pairs of lines (fig 1A) indicatedthat there was substantially more IBD in the DGRP thanexpected in a panel of randomly chosen individuals sampledfrom an idealized population However the extent of IBD wassimilar between the 15 DSPR founder lines and the 15 DGRPlines with the highest average sequence coverage (fig 1Band C) and was equal to a few dozen megabytes of IBDincluding a large proportion of centromeric regions It istherefore unclear whether the extensive IBD seen in 148lines of the DGRP (fig 1A) is a general property of largesamples of cosmopolitan D melanogaster or a specificproperty of the DGRP line collection

Data Filtering

Following the IBD analysis we identified transposable ele-ment insertions in a total of 146 lines 131 DGRP and 15DSPR founder lines TEs identified at a specific genomic loca-tion were either present in the D melanogaster referencesequence (version 513 downloaded from wwwflybaseorglast accessed January 30 2009) hereafter referred to as refer-ence TEs or not present in the reference sequence hereafterreferred to as novel TEs Centromeric and telomeric regionsdefined in the Materials and Methods section were excludedfrom the analysis as identification of TEs in these regions canbe problematic since their very high TE densities result in fewinformative read-pairs In addition we wished to focus on theeuchromatic regions of the genome since most genes arelocated in these regions We also removed from the datasets reference TEs which were lt75 of a full length copyof the element as these elements are unlikely to be activecopies These filtering steps reduced the number of referenceTEs considered in later analysis to 607 from a starting total of6003

Validation of Transposable Element PresenceAbsence CallsSimulationDetecting a set of inserted elements in silico showed that atan average genome coverage of 50 we were able to detect913 of elements that were at least 75 of the length of the

0 20 40 60 80 100 120

05

1015

0 20 40 60 80 100 120

05

1015

0 20 40 60 80 100 120

050

100

150

Descending Order of Coverage DGRP

X 2L 2R 3L 3R

A B

C

DGRP Top 15 by Coverage

DSPR

Mb Mb

Mb

Line

s (R

anke

d by

Mea

n C

over

age)

X

X

2L 2R 3L 3R

2L 2R 3L 3R

FIG 1 (A) Identity by descent in 148 DGRP lines (B) IBD in the top 15 DGRP lines by average sequence coverage (C) IBD in the 15 DSPR lines Maskedregions indicate regions of IDB 95 When two lines were considered IBD in a region the line with lower mean coverage was masked

2313

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

canonical element These are the set of events directly com-parable to the set of reference TEs we kept for our analysissince these are the set of TEs that are likely to be active copiesin the genome A drop in average genome coverage did notaffect our rate of identification in simulated data sets Wedetected 956 of elements at 25 coverage and 913 ofelements at 15 coverage

When we looked at the detection of all elements we foundthat we were able to identify the insertion locations for 70 ofall elements at 50 coverage 724 of elements at 25 cov-erage and 64 of elements at 15 coverage

PCR ValidationWe compared our pipelinersquos TE calls with PCR data from1687 PCR calls of TE insertions in 9 of the DGRP lines(Blumenstiel JP Chen X He M Bergman CM unpublisheddata httparxivorgabs12093456 last accessed June 262013) The comparison of our pipeline to PCR-validated in-sertions showed a 924 overall agreement between the twomethods (supplementary table S2 Supplementary Materialonline) when comparing presence and absence calls We didnot include the few examples of heterozygous insertions inthis analysis Our pipeline was unable to make a presence orabsence call for the set of PCR-validated insertions 6 of thetime and disagreed with the PCR data 16 of the time Failureto validate via PCR was equally likely for presence or absencecalls suggesting that detection of TEs was unbiased with re-spect to presence versus absence We estimate a pipelinesensitivity of 982 and a specificity of 94 based on thissubset of the data Our estimation of the validation rate isconservative requiring us to reconstruct a contig that eithercontains the TE breakpoint or spans over the insertion site inorder to call an event as either present or absent respectivelyFor some cases where our pipeline makes no call we did seeread pairs that are suggestive of the state of the presence orabsence of the TE but are not sufficient in number andoruniqueness for reconstruction the event de novo

Computational ValidationThe initial phase of our pipeline detected transposable ele-ment insertions in individual lines Following this initial de-tection phase our pipeline surveys every line at each genomiclocation where a TE insertion was identified in any other lineincluding lines in different data sets In this manner weattempted to computationally verify the presence or absenceof TE insertions at a total of 2596370 genomic locationsOur pipeline made presence or absence calls between9717 and 9998 of the time (supplementary table S3Supplementary Material online) Positive absence callsmade by our pipeline can also be considered a type of vali-dation since to make an absence call our pipeline must detectreads that span the junction where the insertion would belocated (supplementary table S3 Supplementary Materialonline)

The presence of regions of IBD amongst DGRP lines (fig 1)provided an opportunity to cross-validate TE calls in regionsmasked as IBD between different line pairs because we ex-pected TE genotypes to be identical within regions identified

as IBD We find a mean of 98 agreement for TE calls (bothpresence and absence) between pairs of regions identified asIBD with a standard deviation of 0015

Distribution and Abundance of TransposableElements

A total of 7104 transposable element insertions were foundin the DSPR and 17639 in the DGRP A comparison of the TEinsertion content of the DGRP and DSPR reveals that eachpanel contained a large number of TE insertions specific tothat panel only 1656 insertions were shared between theDSPR and DGRP data sets out of a total of 23087 insertionsdetected (ie 72 shared insertions) All transposable ele-ment insertion locations are listed in supplementarytables S4 and S5 Supplementary Material online A subsetof the DGRP data lines with an average sequence coverage25 mean sequence coverage 335 standard deviation51 hereafter DGRP25 (table 1) contained 5855 insertionsThe DGRP25 set of lines was used in some analyses to bemore directly comparable to the DSPR lines which all had amean genome coverage of 60 (fig 2)

In both resources TE insertions were generally at low fre-quency Overall we found a larger total number of insertionsper line in DSPR than in DGRP (fig 2) This was presumablydue to the much higher sequencing coverage of the formerleading to a higher power to make calls of complex insertionsMean sequence coverage for the DGRP lines was 195with astandard deviation of 64 mean sequence coverage for theDSPR was 671 with a standard deviation of 42 Howeverthe increased amount of time the DSPR lines have remained inlaboratory conditions may also have contributed to the ele-vated number of TEs seen in these lines (Nuzhdin et al 1997)

TEs may be located at random throughout the genome orthey may be found preferentially in certain genomic regionssuch as in intergenic sequence We used the binomial distri-bution as a null model for the distribution of TEs in thegenome assuming TEs insert and remain at random withrespect to these genomic features (table 2) We found thatin general the distribution of TEs observed differed from whatwe would expect if TEs were distributed at random through-out the genome We found significantly fewer TEs in exons50 UTRs and 30 UTRs than expected and significantly moreTEs than expected in intergenic and intronic regions (table 2)which was consistent with previous observations as TEs ingenes are much more likely to be deleterious (Kaminkeret al 2002 Lipatov et al 2005) This pattern held true forthe DGRP the DGRP25 and the DSPR suggesting that selec-tion may be acting universally to remove TEs that disruptexonic sequences

TE Identification

Information about the class of TE that has been inserted isuseful for two reasons First this information will allow forrapid assays for the presence of this TE in other strainsSecond insertions of different families of TEs can producedifferent phenotypic consequences (Smith and Corces1991) We were able to identify the TE class of the majority

2314

Cridland et al doi101093molbevmst129 MBE

of insertions (table 1) RNA elements were much morecommon than DNA elements and within RNA elementsover half were long terminal repeat (LTR) retrotransposonelements We were unable to identify element class for347 of the DGRP insertions and 320 of the DSPR inser-tions In cases where we were unable to identify the TE familyfor an insertion this was due to either conflict between themost likely TE as identified by the two reconstructed contigsfor each event or because we had limited data on the TEsequence for that event In cases where we have conflictinginformation this was due to 1) the degradation of the TEsequence so that there were many poor matches to multiplefamilies 2) conserved sequence between families and there-fore good matches to multiple families 3) because we iden-tified a nest of elements where multiple elements insert invery close proximity or within one another or 4) the amountof TE sequence we were able to reconstruct was less than

80 bp and therefore below our annotation threshold lengthBecause of the frequently short length of TE sequence that weattempted to identify we did not attempt to identify thestrand of the TE insertion Although TE orientation is impor-tant and influences selection upon the insertion (Cutter et al2005) our data were frequently insufficient to resolve strand

Factors Influencing TE Density

If TE insertions are deleterious and recessive with respect tofitness then insertions are also predicted to be morecommon on the autosomes since deleterious insertions onthe X will be eliminated via selection against hemizygousmales (Charlesworth and Langley 1989) Contrary to this pre-diction an analysis of variance (ANOVA) indicated a signifi-cant effect of chromosome on mean TE density per line whenboth resources were included in the same analysis (table 3)This difference was due to an increased mean density of TEson the X versus the autosomes in regions of high recombi-nation This pattern was the same both for autosomes takenas a group and for each autosome individually (table 4)

TEs are predicted to accumulate in regions of low recom-bination where they are less likely to produce deleteriousrearrangements via ectopic exchange (Charlesworth andLangley 1989) An ANOVA indicates a highly significanteffect of recombination rate on mean TE density per line(table 3) TE density was also found to be greater in regionsof low recombination than in high recombination through-out the genome which was consistent with the predictions ofCharlesworth and Langley (1989)

Because the factors that influence TE density may differgreatly from family to family we repeated this analysis in 14families of TEs that appear in moderate to high abundance inthe two resources (table 5) Many of these families were pre-viously examined by Montgomery et al (1987) and Bartolomeet al (2002)

We found that in 10 families (412 roo 176 F Bari1 copiahopper INE-1 mdg1 springer) there was a highly significantdifference in mean TE density between the X and the auto-somes (P value 0026 in all cases) although the direction ofthe difference varied among families Within these familiesthere were substantial differences In two of the familiesINE-1 and roo we saw an increased density in the X overthe autosomes in regions of high recombination in boththe DSPR and DGRP25 resource (table 5) However for 412F mdg1 springer Bari1 and copia we saw the opposite pat-tern This suggests that there may be substantial variation inTE insertion site preferences among families

Sample Counts of Transposable Elements

The power to detect an association is dependent both on thepenetrance and the frequency of the causal allele (Hirschhornand Daly 2005) We found that the majority of transposableelements in the DGRP25 and the DSPR are found in only oneline (fig 3) Comparisons between the observed TE data andSNPs from the same population found in introns 86 bp(Haddrill et al 2005) indicated an excess of TEs (DGRP252 test Pamp 0 df = 6 DSPR 2 test Pamp 0 df = 6)

0 20 40 60 80 100

020

040

060

080

010

00

Coverage

TE

s

DSPRDGRP gt 25x coverageDGRP lt 25x coverage

FIG 2 Total number of TEs identified versus coverage for the DGRPDGRP25 and DSPR lines

Table 1 Summary of TE Insertions

DGRP DGRP25 DSPR

Total 17639 5855 7104

Not present in reference 17346 5615 6812

Present in reference 293 240 292

X recombination 2 cMMb 418 149 203

X recombinationgt 2 cMMb 2741 948 1107

Autosomes recombination 2 cMMb 6205 2131 2694

Autosomes recombinationgt 2 cMMb 8177 2544 3019

4 all 98 83 81

Exon 1158 378 633

Intron 8595 2870 3310

30 UTR 477 159 234

50 UTR 190 63 81

Intergenic 7219 2680 3269

DNA elements 1388 514 748

RNA elements 10133 3004 4085

Indeterminate 6118 2336 2271

2315

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Comparisons between the observed data and the expectednumber of events at the same count under an infinite sitesmodel of an equilibrium Wright-Fisher population (Wakeley2009 p 54ndash56) also indicated an excess of rare alleles(DGRP25 2 test Pamp 0 df = 6 DSPR 2 test Pamp 0df = 6) In addition to overall site count spectra we also gen-erated spectra for the X autosomes regions of high and lowrecombination DNA elements and RNA elements In eachcase we observed the same pattern as seen in the overallspectrum (data not shown) We also compared the distribu-tion of counts in the DGRP25 with the rescaled distribution ofcounts in the DSPR rescaled to the same sample size as theDGRP25 A KolmogorovndashSmirnov test comparing these twodistributions showed that they did not differ (P = 021) Wedid see more overall TEs in every count category in therescaled DSPR data over the DGRP25 data set

Patterns of Transposable Element Insertions betweenGenes

For association studies genotyping TE insertions withincoding regions is of interest due to the potentially large

effect of an insertion on gene function A total of 2931genes (213 of annotated genes) had insertions in theDGRP resource These are insertions which exist anywherewithin the span of the gene from 50 UTR to 30 UTR Onethousand nine hundred fifteen genes 139 had insertions inthe DSPR with 1317 genes having insertions in bothresources The number of TE insertions within genes variedsubstantially between individual genes up to a maximum of78 insertions though the majority of genes with insertionshad only one insertion 631 in the DSPR and 527 in theDGRP The number and locations of individual insertions in agene varied tremendously though the majority of insertionsthat fell within gene regions were in introns 777 DSPR and824 DGRP TE number was strongly correlated with intronsize (Pearsonrsquos correlation coefficient = 079 for DSPR 087for DGRP Pamp 0 for both tests) (supplementary fig S2Supplementary Material online)

Examining the pattern of insertions in individual genes inthe DSPR and DGRP data sets we found that RNA-bindingprotein 6 had the highest number of insertions 62 in theDGRP lines and 21 in the DSPR lines These insertions were

Table 3 ANOVAs for DSRP and DGRP25 Coverage and Comparison between the Two Data Sets

df Sum Sq Mean Sq F value Pr(gtF)

DSPR vs DGRP25

Set 1 54125 54125 6486975 lt2e16

Line 36 2848 791 94817 lt2e16

Chromosome 4 11858 2965 355313 lt2e16

Recombination rate (high vs low) 1 25575 25575 3065207 lt2e16

Chromosomerecombination rate 4 10335 2584 309663 lt2e16

Residuals 333 27784 083

DSPR

Line 14 26599 19 17894 004699

Chromosome 4 39632 9908 93316 123e06

Recombination rate (high vs low) 1 141337 141337 1331143 lt2e16

Chromosomerecombination rate 4 39712 9928 93504 119e06

Residuals 126 133783 1062

DGRP25

Line 22 258202 11736 1723 lt2e16

Chromosome 4 80126 20031 29409 lt2e16

Recombination rate (high vs low) 1 120011 120011 176191 lt2e16

Chromosomerecombination rate 4 66056 16514 24244 243e16

Residuals 198 134866 0681

Table 2 Differences between Observed and Expected TE Counts

DGRP DGRP25 DSPRObserved vs Expected P Value Observed vs Expected P Value Observed vs Expected P Value

X recombination 2 cMMb Decrease 329E11 Decrease 187E03 Decrease 522E02

X recombinationgt 2 cMMb Decrease 174E01 Increase 701E02 Increase 416E01

Autosomes recombination 2 cMMb Decrease 370E01 Increase 100E + 00 Increase 687E04

Autosomes recombinationgt 2 cMMb Increase 392E02 Decrease 100E + 00 Decrease 100E + 00

Exon Decrease 000E + 00 Decrease 247E323 Decrease 103E283

Intron Increase 354E103 Increase 000E + 00 Increase 000E + 00

30 UTR Decrease 699E16 Decrease 313E06 Decrease 126E02

50 UTR Decrease 143E51 Decrease 282E18 Decrease 498E20

Intergenic Increase 112E32 Increase 0 Increase 000E + 00

2316

Cridland et al doi101093molbevmst129 MBE

typically unique to a single line though there were 3 insertionsin more than one line in the DSPR and 14 insertions in morethan one line in the DGRP

Some genes accumulated many TE insertions such asklarsicht (fig 4A gene images from the UCSC genome brow-ser Meyer et al 2013) which had 29 total insertions This genealso had a hotspot of insertions with seven independent in-sertions each at low frequency located within 36 kb of each

other Only one of these insertions was present in any givenline and these could be functionally equivalent though inde-pendently arising mutations Other genes had only a few in-sertions such as Notch (fig 4B) and Delta (fig 4C) Theinsertions in these genes were also at low frequency andthe few present were located in intronic regions

While most insertions exist in only one panel some werefound in both Cyp6a20 (fig 4D) contains a high frequencynon-reference TE insertions present in both data sets in 4lines out of 121 lines where we were able to make a presenceor absence call in the DGRP (hereafter shown as 4121) and314 lines in the DSPR Both Cyp6a20 and klarsicht are exam-ples of genes with TE insertions in exons which was

00

02

04

06

08

10

1 2 3 4 5 6 7+

Observed TEsObserved SNPsExpected

DSPR

00

02

04

06

08

10

1 2 3 4 5 6 7+

Observed TEsObserved SNPsExpected

DGRP

FIG 3 Derived allele count spectra for the DSRP lines and the DGRP25lines where a positive presence or absence call was made for eachinsertion in each line 6613 insertions in the DSPR and 3274 in theDGRP25 Count spectra for SNPs is from SNPs in introns 86 bp 2

tests between observed and expected distributions result in Pamp 0 forcomparisons between TEs and the neutral model as well as between TEsand SNPs for both data sets

Table 5 TE Density for 15 Individual Families of Elements

Mean Density (TEMb)

Element Resource X High X Low Auto High Auto Low

roo DGRP25 039 037 032 033DSPR 080 088 065 068

297 DGRP25 005 000 002 001DSPR 005 000 004 000

412 DGRP25 003 001 003 007DSPR 006 008 009 012

F DGRP25 002 006 004 006DSPR 007 008 012 013

176 DGRP25 001 000 001 002DSPR 002 000 002 006

Bari1 DGRP25 001 000 004 001DSPR 002 002 006 002

copia DGRP25 001 000 002 003DSPR 009 002 014 019

H DGRP25 001 000 001 000DSPR 000 000 001 000

hopper DGRP25 010 001 000 000DSPR 010 004 002 002

INE-1 DGRP25 042 188 003 048DSPR 045 203 004 055

jockey DGRP25 015 012 019 013DSPR 035 034 040 033

mdg1 DGRP25 003 002 003 010DSPR 008 004 012 018

pogo DGRP25 006 004 007 005DSPR 013 021 015 013

springer DGRP25 000 001 001 003DSPR 002 000 001 005

Table 4 TE Density in the X and Autosomes

TEMb

DGRP25 DSPR Both

X all 382 638 483

Autosomes all 351 593 447

X high recombination 371 600 461

X low recombination 393 676 505

2L high recombination 227 473 324

2L low recombination 323 589 428

2R high recombination 282 489 364

2R low recombination 617 863 714

3L high recombination 296 523 386

3L low recombination 474 726 573

3R high recombination 281 468 355

3R low recombination 372 670 490

2317

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

uncommon both in this data set and in previous studies(Kaminker et al 2002 Lipatov et al 2005)

Differences in the strength of selection against closelyrelated genes can also be illustrated by patterns of TE inser-tions The paralogs derailed (fig 4F) and derailed-2 (fig 4E)showed very different patterns of TE insertions though thesame pattern was seen in both resources derailed-2 locatedon chromosome 2R is in a region of moderate recombination199 cMMb whereas derailed is located towards the distalend of 2L in a region of low recombination 044 cMMb

Given the context of the recombination rates the expectationwould be that derailed would have a higher TE load sincedeleterious alleles are removed more efficiently in regions ofhigh recombination (Hudson and Kaplan 1995) but theopposite pattern was observed

While both genes play a role in Wnt5 signaling mutationsin derailed can cause major phenotypic changes in Drosophilanervous system development resulting in the loss of normalfunction (Yoshikawa et al 2003) derailed-2 when mutatedproduces only minor differences in neuron positioning

115 115115 215 115115

114 1122

11181120 11181124

2117

21182119

1122

1119 1118

2118

1121 3114

1115

1114

1119

1113

1119

1120

1120

1119

1120

3119

1119

A klarsichtDGRP

DSPR

Insertion in an Exon

314 1195

4121D Cyp6a20

315

1123

1415

56125

115115

415

115615

10124

1124 1124 1123

2124E derailed-2

B Notch

115 11211122

C Delta1114 1120

191

F derailed

Same insertion in both resources

FIG 4 Transposable element insertions in genes The frequency above each insertion is the number of lines in which the element is present over thenumber of lines in which the element is validated as either present or absent Gene images are from the UCSC genome browser (httpgenomeucscedu last accessed June 31 2012)

2318

Cridland et al doi101093molbevmst129 MBE

(Sakurai et al 2009) This suggests that either the strength ofselection may be much higher against insertions in derailed orthe insertion rates are unequal between the paralogs

Gene Family Size and TEs

We used available information on gene families in DrosophilaHahn et al (2007) grouped into families based on sequencesimilarity and found a very weak but significant effect of genefamily size on the average number of TE insertions per kbin both resources (P value = 00003 R2 = 0001 DSPRP value = 0002 R2 = 0007 DGRP) (fig 5A and B) Althoughthere was a large variance in the number of insertions per kb

in both resources (fig 5A and B) the average number ofTEskb increased as family size increased up to a moderatelylarge family size

Patterns of Variation between Resources

To replicate an association with a single causative allele thatallele must both exist in both panels and also be at highenough frequency to detect the association Given thatmost TEs were singletons and only 45 of insertion eventswere identified in both panels and were within genes thereplicability of associations with these alleles is likely to below if alleles are considered individually We examined sharedinsertions segregating at different frequencies between panelsand found only two TE insertions in genes that were segre-gating at different frequencies in the DGRP versus DSPRresources following Bonferroni correction (Fisherrsquos exacttests P 000007) One of these was in an exon of bothCG13175 and CG33964 a pair of overlapping genes Herethe TE insertion was at much higher frequency in theDGRP than in the DSPR (89122 vs 215) The other insertionwas in an intron of Caliban and the frequency of this insertionwas higher in the DSPR than the DGRP (1115 vs 7116)

If insertions are considered as a class of mutation thenthere may be better power to detect an association assumingthat different insertions contribute in the same way to thephenotype We also examined genes with TE insertions(139 of genes) though not necessarily the same insertionin both resources Treating all insertion alleles within a gene asequivalent a comparison between resources showed thatthere were 21 genes which contain TEs at different frequen-cies between the DGRP versus DSPR following Bonferronicorrection (Fisherrsquos exact tests P 0000038197) includingderailed-2 (supplementary table S6 Supplementary Materialonline)

Discussion

Distribution and Abundance of TransposableElements

Previous work on transposable elements has generally beenfocused on a few families of TEs (Montgomery et al 1987Charlesworth and Lapid 1989 Charlesworth et al 1992) orfocused only on a single genome (Bartolome et al 2002Kaminker et al 2002 Rizzon et al 2002) We have presenteda genomic analysis of TEs across a large number of lines andtwo data sets We found that transposable elements in bothresources are primarily found as singletons with a few mod-erate to high frequency insertions Transposable elements alsosegregated at lower allele frequencies than do presumed non-functional SNPs drawn from the same data set suggestingthat negative selection is acting upon TEs However theremay be substantial variation in the age of transposable ele-ments in different families many LTR elements are thought tobe young insertions (Bowen and McDonald 2001 Bergmanand Bensasson 2007) while other families like INE-1 arethought to be quite old (Kapitonov and Jurka 2003) Thismay create a situation where the young age of insertions isresponsible for the excess of rare alleles We found a

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

DSPR

DGRPA

B

FIG 5 log10(TE density) versus log10(Gene family size)

2319

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

non-random distribution of TE insertions across the genomewith increases of TEs in both the intergenic and intronicregions of the genome and decreases in exonic and UTRregions (Kaminker et al 2002 Lipatov et al 2005) Similar toprevious studies we found variation in TE density betweenchromosomes and higher densities of TEs in regions of lowrecombination (Charlesworth et al 1994 Bartolome et al2002) We also found increased TE densities on the X relativeto the autosomes both in overall insertions and in someindividual TE families (table 5)

Detection method and the size of the data set are likely tobe contributing factors to differences between previous workand this study Many previous studies have used in situ meth-ods to ascertain TE number in samples and have looked atelement accumulation at the base versus the midsections ofchromosomes (Montgomery et al 1987 Charlesworth et al1992) or have looked only at one genome (Bartolome et al2002 Rizzon et al 2002) A previous PCR-based detectionstudy found no difference in TE frequency between the Xand the autosomes in natural populations (Petrov et al2011) but this study was biased in that it only looked atTEs shared with the reference sequence and thus did notcapture the full picture of TEs in these populations

While our analysis does disagree with previous work thesedisagreements may be mainly due to the way in which regionsof the genome are divided for examination Rizzon et al(2002) found an increase in copies of LTR retrotransposonson the X when they excluded pericentromeric telomeric andchromosome 4 from their analysis though they found a def-icit of TEs on the X when these regions are included Thisfinding is most directly comparable to our own finding for theentire TE data set since they exclude a very similar set ofgenomic regions We do agree with previous work in ourfinding of higher densities of TEs in regions of low recombi-nation than in high recombination (Charlesworth et al 1994Bartolome et al 2002)

Examining individual element families Carr et al (2002)found an increase of mdg-3 elements and 297 elements on theX than would be expected under a random insertion modelMontgomery et al (1987) found no evidence for a reductionin TEs on the X for both roo and 297 elements but foundfewer TEs on the X when looking at the 412 family HoweverCharlesworth et al (1992) observed the opposite pattern forroo and 297 Montgomery et al (1991) also found that moreroo elements in lab kept lines than they had previously ob-served in natural populations We found significantly in-creased densities on the X for roo and significantlyincreased densities on the autosomes for 412

Differences between previous studies of individual elementfamilies and our results may be due to a variety of reasonsFirst we were able to annotate only about two-thirds of ourTE insertions This was primarily due to having insufficientsequence to reliably annotate the element or because inde-pendent annotation using sequence reconstructed fromeither end of the insertion did not yield the same elementfamily annotation This means that our density calculationsfor individual elements may be skewed by missing annotationinformation Second many previous studies (Montgomery

et al 1987 Charlesworth et al 1992 Carr et al 2002) examinedonly LTR elements These elements are young (Bowen andMcDonald 2001 Bergman and Bensasson 2007) and thus in-dividual copies may not yet have been removed from thegenome thereby displaying different patterns of insertionthan older active elements

What is clear is that there is great variability in TE densityboth within and between element families as well as withinand between chromosomes It may also be that the overallpattern is driven by a small group of families In our data setINE-1 shows the largest pattern of difference between the Xand autosomes with an order of magnitude increase in meanTE density per line on the X over the autosomes possibly dueto an increased likelihood of fixation of insertions (Omettoet al 2005) specifically on the X chromosome (Presgraves2006) These INE-1 elements are a subset of the largest TEfamily in D melanogaster though only several dozen of themare in the euchromatin INE-1 elements are also considered tobe inactive in D melanogaster and likely have not been mobilefor gt3 million years (Kapitonov and Jurka 2003) Consistentwith this most of the INE-1 elements we detected were fixedor nearly fixed in our data set We did see some examples oflow frequency insertions which may be due to more recenttranspositional events or incorrect TE annotation due toshort sequences The old age of INE-1 insertions also suggeststhat the extant elements are unlikely to be selected againstAn increased density of TEs on the X compared with theautosomes suggests that it is not the fitness effects of inser-tions which govern TE copy number Fitness effects wouldresult in the more rapid removal of insertions on the X thaninsertions on the autosomes due to the increased selectivepressures experienced on the X (Montgomery et al 1987)Background selection also predicts that the X should be lessburdened by deleterious recessives since it is hemizygous onethird of the time (Charlesworth 1994)

However it is unclear whether our observation of in-creased densities of TEs on the X relative to the autosomesfits with the ectopic exchange model This model predictsthat TEs will accumulate in regions of low ectopic exchange(Langley et al 1988) The general assumption in the literatureis that ectopic exchange is positively correlated with recom-bination rate (Langley et al 1988 Montgomery et al 1991)The X chromosome experiences higher levels of recombina-tion than the autosomes with an average of 36 cMMb forthe X in regions of high recombination (cMMbgt 2) whereasthe autosomes have a mean of 336 cMMb If recombinationrate and ectopic exchange are positively correlated through-out the genome then the expectation would be fewer TEs inthe middle regions of the X where recombination rates areknown to be high This is the opposite of what we observe Ifrecombination rate and the rate of ectopic exchange are notpositively correlated then the ectopic exchange model maybe correct but data are lacking to properly address this issueCurrently there are no direct measurements of the rate ofectopic exchange on a genome-wide scale in Drosophiladuring meiosis Montgomery et al (1991) observed littleectopic exchange between homozygous chromosomes inDrosophila melanogaster on the X chromosome in the

2320

Cridland et al doi101093molbevmst129 MBE

region around the white locus They also found that whilerecombination rate was reduced in both centromeric andtelomeric regions only the centromeric region displayed anincreased density of TEs This information is suggestive ofdifferences between these regions in the rate of ectopicexchange

The DGRP and DSPR are also composed of inbred lineswhich have been kept in laboratory conditions for some time~9 years in the case of the DGRP lines and 40+ years in thecase of the DSPR Given that TE insertion mutations accumu-late rapidly and in an increasing manner over time (Nuzhdinet al 1997) the increased density of TEs on the X versus theautosomes may not be surprising if there is a highertransposition rate to the X The only direct study to datefocused on roo elements in D melanogaster and did observehigher rates of insertion in the X than the autosomes(Vazquez et al 2007)

Transposable Elements in Genes

The majority of genes 813 do not have any TE insertions inany line studied and intron length is a good predictor of thenumber of TEs within a gene However a few genes accumu-late many insertions even after correcting for gene size andintron length and some genes have very few TEs even thoughthey have large introns The reasons why different genes mayaccumulate TE insertions may be quite different for individualgenes In some cases there may be hot spots for insertionswhich if insertions are neutral or only slightly deleterious mayresult in large number of insertions in a small region In othergenes the lack of TE accumulation may suggest selectivepressures against TE accumulation

Gene family size may also play a role in TE accumulationWe see an increase in TE density correlated with an increase ingene family size This may indicate differences in selectivepressures for genes in different sized families Single-copygenes may experience more constraint whereas in moder-ately sized gene families these constraints may be more re-laxed perhaps due to redundancy

In the cytochrome p450 functional family of genes we findthat 3789 of these genes have TE insertions 2589 in theDGRP and 2089 in the DSPR Cytochrome p450 is also foundto be in an enriched term in the DSPR for the set of genescontaining TEs The cytochrome p450 gene Cyp6g1 is knownto confer resistance to DDT (Daborn et al 2001) and TE-mediated copy number variation is associated with increasingresistance to DDT (Schmidt et al 2010) This pattern of TEinsertions in the DSPR may be reflective of the locationswhere DSPR lines were originally collected The DSPR lineswere gathered from various locations worldwide at a timewhen different pesticides were in use The DGRP lines weregathered from a single geographic area in North America in2003 Presumably these sets of lines have experienced differ-ent selective pressures with respect to pesticide resistance InCyp6a20 (fig 4D) where TEs are segregating at different fre-quencies in the two resources some similar effect may beoccurring

TE Detection Pipelines

The TE detection strategy used here is an improvement uponthe strategy used in Mackay et al (2012) The most notabledifferences between the data reported in Mackay et al (2012)and the DGRP data reported here is substantial amount ofnew information First our pipeline improvements agree wellwith the previous implementation with 812 of previous TEpresence calls identified by the new pipeline However thesecalls represent on 182 of the presence calls made by theimproved pipeline demonstrating that we have identifiedsignificantly more calls than before The incorporation of TEabsence calls also allows us to provide a much more completepicture of TEs in euchromatic sequence in the DGRP

The TE detection pipeline used here is similar to otherpipelines which have been employed but it is useful to de-scribe briefly here the differences in approaches and why wefollowed the approach presented here A useful discussion ofcommon approaches to TE detection in many species can befound in Xing et al (2013) One common method for TEdetection is the split-read mapping approach which hasbeen employed in Drosophila by Linheiro and Bergman(2012) This approach identifies TE insertions by locating in-dividual sequence reads which span TE insertion breakpointsThis approach is the strategy of choice for next-generationsequencing experiments which produce only single-endreads However in a paired-end read situation this approachmay not take full advantage of the available data Specificallythe size of window around the TE insertion breakpoint whereinformative reads can be found is governed by read length in asingle-end experiment and fragment length in a paired-endexperiment Linheiro and Bergman (2012) discuss the impor-tance of read length on TE detection and suggest that in-creased read length improves TE detection In situationswhere read length is long and coverage is deep enough thesplit-read method may be preferable to the paired-end de-tection strategy However for the data sets analyzed herewhere read length is much shorter than fragment lengththe paired-end detection strategy will capture moreinformation

The pipeline used here also has similarities to the pipelineused in Kofler et al (2012) However there are two majordifferences in these studies First Kofler et al (2012) do notattempt to reconstruct their insertion breakpoints andsecond the population studied by Kofler et al (2012) wassequenced as a pool rather than as individual lines Becausetheir pooled sequence data reflect on average less than 1sequence coverage per line Kofler et al (2012) are primarilyable to detect intermediate to high frequency TE insertionstheir own analysis concludes that they cannot reliably detectinsertions at frequency lt7 (see supplementary fig S2Supplementary Material online in their supplementary dis-cussion) However the literature to date (Charlesworth andLangley 1989 Charlesworth and Lapid 1989 Charlesworthet al 1992) as well as the results of this study indicate thatthe majority of TEs in a population are found at very lowfrequency often in only one individual in a population andtherefore beyond the scope of detection of Kofler et al (2012)

2321

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Caveats

The D melanogaster reference genome to which we alignedour data is a P-element-free genome Therefore the use of thisreference will be biased against detecting any P elementsexisting in the data sets In the case of the DSPR data setthis is not an issue since these lines are also all P-element-free(Macdonald and Long 2007) However it is likely that theDGRP lines do harbor P elements For the purpose of theseanalyses we decided to ignore P elements and focus on theset of element families common to the two resources

This situation does highlight the importance of having anappropriate reference sequence To detect P elements in thegenome of D melanogaster an artificially constructed refer-ence sequence or a different assembled sequence containingthe P element would need to be utilized This also brings upthe question about as yet undetected elements that exist atlow frequency If these elements do not have sufficient sim-ilarity to existing elements to be aligned to the referencegenome these elements will go undetected

Conclusions

We have presented a description of transposable elements intwo recently released QTL resources Transposable elementscan play important roles in gene function and regulation andincorporation of TE insertion information into any analysisperformed with either of these two resources will present amore complete picture of genomic variation and its contri-bution to complex traits In addition it will be important forassociation studies attempting to replicate genotypendashpheno-type associations utilizing these resources to be aware of thepotential contributions of TEs

We have also demonstrated a next-generation high-throughput sequencing analysis pipeline that is capable ofdetecting this type variation with a high rate of precisionand specificity We suggest that future studies using next-generation sequencing data utilize this technique

Materials and Methods

Data Sets

Two data sets were used in this analysis The first set consistsof 131 inbred lines from the Drosophila Genetic ReferencePanel (DGRP) (Mackay et al 2012) SRA accessionSRP000694 The second set consists of 15 lines used to es-tablish the Drosophila Synthetic Population Resource (DSPR)(King et al 2012a) initially described in Macdonald and Long(2007) SRA accession SRA051316 For the DGRP resourceboth 454 and Illumina data were available whereas onlyIllumina data are available for the DSPR We utilized onlythe Illumina sequence data to ensure consistency in datacollection across data sets We calculated a mean coverageof 21 for the DGRP lines and 50 for the DSPR linesVariation in coverage for the DGRP lines was substantial be-tween lines ranging from ~4 average coverage to nearly50 average coverage We therefore dropped from the anal-ysis four lines with an average coverage of 10 or less sincelow coverage can cause difficulties with assembly and TEidentification Sequence data for both data sets consisted of

paired-end data with 54 bp reads for the DSPR and paired-end 100 bp reads trimmed to 75 bp for DGRP

IBD in DGRP Data Set

We downloaded the SNP tables generated by the DGRPproject from httpwwwhgscbcmtmceduprojectsdgrpfreeze1_July_2010snp_callsIllumina (last accessed July 302010) to examine identity by descent in this data set Weperformed an all by all comparison between lines examiningsliding windows of 1 Mb across the genome with 100 kb stepsbetween windows We take a simple definition of IBD hereWhenever we identified two lines that weregt95 identity atSNP positions in a 1 Mb region we labeled the pair as IBD inthe identified region and masked that region of the genomein the line with lower coverage We also dropped entirelyfrom our analysis any lines where gt50 of the genome wasdetermined to be IBD with other lines in the population(supplementary table S1 Supplementary Material online) re-moving from the analysis the line with lower average se-quence coverage

We modeled 1 Mb regions of 200 chromosomes in a coa-lescent simulation using the parameters theta = 001site and4Nr = 10theta to determine the number of regions of95IBD under a Wright-Fisher population model using the MACSsoftware (Chen et al 2009) We simulated 1000 replicates ofthis simulation and tabulated the number of 1 Mb regionsshowing 95 IBD in each replicate

Alignment

We aligned reads to the D melanogaster reference genome(version 513 downloaded from FlyBase flybaseorg) using thealigner BWA (version 059 Li and Durbin 2009) We used thefollowing parameters (aln -t 8 -l 13 -m 50000000 -R 5000)followed by the command ldquosamperdquo to resolve paired endmappings (using parameters -a 5000 -N 5000 -n 500) It isimportant when detecting TEs that the -R during the initialalignment and the -N parameter during the ldquosamperdquo phase beset high otherwise highly repetitive sequences that would beinformative of the presence of a common element will beexcluded from the data by the aligner We also used the -Ioption for the DGRP lines but not the DSPR due to the dif-ferences in Illumina quality score output format as the twodata sets were sequenced at different times The -m parameterfrom the initial alignment step and the -n parameter from thepaired-end resolution step deal with how BWA treats multiplyaligning reads An additional caveat is that some TE familiesmost notably the P-element family is absent from the Dmelanogaster reference sequence Such families in the se-quenced genomes will not align to the reference genome weused even if they are present in the sampled lines

Transposable Element Detection

Transposable elements were detected by first identifying allreads that were aligned to an annotated TE in the referencegenome (supplementary fig S1 Supplementary Material on-line) We then identified the mates of these reads and selectedonly those mates which did not align to an annotated TE

2322

Cridland et al doi101093molbevmst129 MBE

These reads identify the set of read-pairs which span a TEinsertion The uniquely mapping non-TE reads were thenclustered based on their start position and chromosomalstrand to identify putative TEs Clusters on the plus strandwere matched to clusters on the minus strand to producepairs of clusters indicating either end of the putative TE Onceclusters were identified we filtered out all events which con-tained fewer than three read-pairs consistent with a giveninsertion for each cluster This means that putative eventsat this stage have a minimum of six read-pairs indicating theinsertion three on the plus strand and three on the minusstrand

We then extracted all reads and their mates that aligned tothe identified regions of the genome and attempted toreconstruct the local area including the breakpoint of theTE Because many TEs have repetitive regions at the ends ofthe element we reconstructed the two breakpoints of the TEseparately (see also supplementary fig 23 from Mackay et al2012) Local reconstruction was done by running Phrap (ver-sion 1090518 Ewing and Green 1998) with the followingparameters (-forcelevel 10 -minscore 10 -minmatch 10)This set of parameters causes Phrap to do a Smith-Waterman comparison when it finds matches of at least 10characters but then relaxes the minimum alignment scoreand the final assembly parameters This is necessary to recon-struct TEs because repetitive sequence within the TEs canresult in a failure to reconstruct a contig if stricter assemblyparameters are used All code we developed for the detectionof TEs will be freely available at wwwmolpopgenorgData(last accessed August 1 2013)

Following reconstruction we used BlastN (verson 2222) tocompare reconstructed contigs to the D melanogaster refer-ence to confirm the presence of a TE In most cases we wereable to fully reconstruct both breakpoints of the TE Fullyreconstructed indicates that we could reconstruct a singlecontig which contained the breakpoint of the TE In mostcases the two breakpoints of the TE did not resolve to thesame nucleotide but instead identify the location of thetarget site duplication for the insertion (Linheiro andBergman 2012 similar to fig 1) These are cases where theupstream estimate of the breakpoint is slightly higher thanthe downstream estimate of the breakpoint and the differ-ence in position can be seen as the likely span of the target siteduplication However sometimes we were not able to fullyreconstruct a contig but instead generated two contigs onecontaining uniquely aligning sequence and the other contain-ing TE sequence These contigs are known to belong togetherbecause reads forming one contig have their mates in theother contig In these cases we have an approximate but notprecise estimate of the TE breakpoint and are unable toidentify the target site duplication

After the first round of detection we performed a secondround of detection where we examined every location wherea novel TE was detected in any line in all other lines Thisincluded examining positions in one data set that were iden-tified in the other data set In addition we examined thelocations of all known TEs in the D melanogaster genomeFor each location queried in each line we extracted reads

aligning to that location and their mates and used Phrap toreconstruct a contig This allowed us to produce presenceabsence or missing data calls for all positions where TE wasdetected in all lines The absence of a TE is determined byidentifying reconstructed contigs that spanned the insertionposition of the TE by at least 15 bp on either side Theseabsence calls were not included in Mackay et al (2012)This also allows us to identify TEs that may be in regions ofthe genome with lower average sequence coverage comparedwith the rest of the genome provided that insertion is sharedwith another line where it was possible to detect the insertion

Transposable Element Annotation

Annotation of novel TEs was performed by aligning recon-structed contigs with BlastN to the set of TEs found in theD melanogaster reference We chose the best match basedon overall length and BLAST reported e-value and called thenovel TE a TE of that family We also required a minimum of80 bp of contig matching a TE to make an annotation callThis is an additional annotation analysis to the one we per-formed in Mackay et al (2012) For each of the two contigsproduced for each insertion we determined the best matchand then compared these to each other TEs are classified asper the classification scheme put forth in Wicker et al (2007)We annotated each TE insertion to the highest level of agree-ment between the two contigs if the two contigs disagreedwith each other at every level of classification we called thatelement ldquoundeterminedrdquo This could result from nested in-sertions of elements where the 50 end of the insertion is truly adifferent element from the 30 end or because the TE sequenceis degraded and differs substantially from the referencesequence or because there was not enough TE sequence tomeet our annotation guidelines Because we can only recon-struct a maximum of few hundred bases at either end of theinsertion it is difficult to distinguish between these issues Wealso did not attempt to determine the strand of the TE

Validation

A subset of 190 TE insertion sites shared with the D melano-gaster reference were validated via PCR in a subset of nineof the DGRP lines by Blumenstiel JP Chen X He M BergmanCM (unpublished data httparxivorgabs12093456 lastaccessed June 26 2013) To independently check IBD callsusing SNP data we also determined if pairs of lines wereIBD for TEs in the same regions In this comparison we onlyincluded TE insertions where both lines have a positive pres-ence or absence call This can also serve as a validation of ourTE calling pipeline since if two segments are IBD they shouldshare the same set of TEs

Simulation

We selected a set of elements of different sizes and of TEclasses from the set of TEs in the reference sequence andinserted them semi-randomly into the D melanogaster refer-ence genome in silico Insertions were semi-random since weavoided areas already occupied by reference TEs and the cen-tromeric and telomeric regions Read pairs were then

2323

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

generated to construct fastq files We then ran our TE detec-tion pipeline as above through the initial annotation stepThis process was repeated three times to generate a simula-tion of 15 25 and 50 coverage

Recombination Rates

We wrote an R routine to estimate local recombination ratesas a function of physical position Briefly we used ldquocyto-ge-netic-seqtsvrdquo from wwwflybaseorg (last accessed January 302009) to assign a set of physical landmarks to genetic posi-tions For each chromosome arm we then fit a local polyno-mial relating cumulative cM to cumulative basepair (usinglocpoly) Unlike many prior local polynomial fits we madethe added assumption that this polynomial must be a mono-tonically increasing function (using monoproc) an assump-tion that is true for real data We then fit a smooth spline tothe above curve (using smoothspline) The advantage ofsmoothspline is that the resulting curves can be differentiated(and hence estimates of rates obtained directly) Finally we fita smooth spline to the first derivatives obtained from theprior smooth spline largely to make the recombinationrates less noisy near centromeres and telomeres TheR code is freely available at wwwmolpopgenorgData (lastaccessed August 1 2013)

Data Set Restriction

Analyses performed on these data sets were restricted in twoways First we restricted our analyses to regions of thegenome with low TE density and moderate recombinationrates We restricted our data set in this way because regions ofhigh TE density make it more difficult to make accurate callsBecause our pipeline requires unique reads to positionallylocate the TE insertion regions with high TE density canlack enough unique sequence to generate these reads Theregions included in the analyses are X300000ndash208000002L200000ndash20100000 2R2300000ndash21000000 3L100000ndash21900000 3R600000ndash27800000 which are not centromericand telomeric regions This excludes the majority of TEs an-notated in the reference genome since these TEs are largelyclustered together in centromeric regions However this doesnot affect our analysis since we are interested in TEs in theeuchromatic portions of the genome

Second to restrict the set of reference TEs studied to full-length copies we removed from the analysis all reference TEsthat had a length that was less than 75 of the canonicalannotated length of the TE While these elements may stillproduce phenotypic effects these elements are unlikely to beactive copies This reduced the number of reference TEs in-cluded in the analysis to 607 from 1085 TEs that are present inthe reference genome in the regions included in our analysis

Additionally some analyses were performed on the subsetof 23 lines in the DGRP data set which had an average se-quencing coverage of 25 or higher hereafter referred to asDGRP25 This restriction was to mitigate the effect of differ-ences in coverage between the two data sets Simulations oftransposable element detection via our pipeline data not

shown indicate that TE detection at 25 is comparable todetection at 50

Site Count Spectra

We generated derived allele count spectra for TEs in the DSPRdata set and the DGRP25 set of lines We included only siteswhere we were able to make a positive presence or absencecall for each line in the data set so that the number of lineswas normalized for each resource for this analysis Observedvalues were compared with expected values at counts 1 2 34 5 6 and 7 under an infinite sites model via a 2 testCategories 7 and higher were grouped due to low counts insome of these categories We also calculated count spectra forSNPs present in small introns (introns86 bp [Haddrill et al2005]) a group unlikely to have functional effects We alsoused SoFoS a site frequency rescaler (Hufford et al 2012) torescale the DSPR data set to the sample size of the DGRP25from 15 to 23 samples to see if the rescaled data show adifferent distribution of element counts from the observedDGRP25 data We then did a KolmogorovndashSmirnov test tocompare the two distributions as a test of different selectiveeffects acting on the populations

Genomic Context of TEs

Transposable elements were identified in regions of low(lt2 cMMb) or high recombination (2 cMMb) FASTAfiles containing the annotated coordinates of 50 UTRs30 UTRs intronic intergenic and full gene span regions forversion 513 of the D melanogaster reference were down-loaded from flybase (wwwflybaseorg last accessed January30 2009) Version 513 of the reference annotation was usedfor comparative purposes with the DGRP analysis (Mackayet al 2012) Because alternative transcripts can result in thesame genomic position being classified in multiple categorieswe labeled each TE as every appropriate category This oc-curred with 52 of TEs in the DGRP and 60 of TEs in theDSPR When calculating the percentage of the referencegenome in our restricted areas of analysis that were classifiedas each genomic type we followed the same strategy

We compared our observations of TE insertions to a bino-mial distribution to determine whether TEs were distributedrandomly throughout the genome We used the pbinomfunction in R (R Development Core Team 2008) setting thenumber of trials to the total number of TE insertion locationsin the data set and the probability of success to the propor-tion of the genome defined as each context (intron exon30 UTR 50 UTR or intergenic) We did one tailed tests in eachcase For exon 30 UTR and 50 UTR tests we calculated theprobability of observing the observed value or fewer TEs Forintron and intergenic tests we calculated the probability ofobserving the observed value or more TEs

Variables Affecting TE Density

For the DGRP25 and DSPR data sets we calculatedTE density in terms of TEsMb For this analysis we alsoexcluded chromosome 4 We examined the effects of recom-bination rate chromosome and TE element family on TE

2324

Cridland et al doi101093molbevmst129 MBE

density using the standard linear model in R We usedthe lm() function to define our model as followsTEsMb ~ Line + Recombination Rate + Chromosome +Recombination Rate Chromosome We performed thisanalysis both for the set of all TEs and also for 15 TE familieswith moderate to high number of copies in our data set(supplementary table S4 Supplementary Material online)We selected individual TE families for this additional analysisbased on their level of abundance in the data sets and theirhaving been previously studied by Montgomery et al (1987)and Bartolome et al (2002)

Gene Families

We downloaded the Dfam database from (httpwwwindianaedu~hahnlabflyDfamDBdrosophila_frbhtml last acc-essed September 1 2012) This database describes all mem-bers of all gene families clustered using a fuzzy reciprocalBLAST method as described in Hahn et al (2007) for 12Drosophila species We used R to perform a linear regressionusing a simple model where the square root (TE density) in agene is the random variable and log10 (gene family size) is theindependent variable

Variation between Resources

For each TE present in both data sets we compared thefrequency of the TE in each population using Fisherrsquos exacttest We excluded from this analysis all insertions where wewere unable to make a presenceabsence call in at least 100 ofthe DGRP lines or 13 of the DSPR lines For each gene with TEinsertions in both data sets we compared the frequency atwhich the gene contains TEs between data sets using aFisherrsquos exact test

Supplementary MaterialSupplementary tables S1ndashS6 and figures S1 and S2 are avail-able at Molecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Casey Bergman and Miaomiao He for earlyaccess to their data as well as helpful comments from CBThey also thank Rebekah Rogers for help with editing themanuscript This work was supported by the NationalInstitute of Health grant NIH R01-GM085183 to KRT theNational Library of Medicine - National Institute of Healthtraining grant LM007443 to JMC and the National Instituteof Health grant NIH R01-OD010974 to ADL and SJM

ReferencesAltschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic local

alignment search tool J Mol Biol 215403ndash410Aminetzach YT Macpherson JM Petrov DA 2005 Pesticide resistance

via transposition-mediated adaptive gene truncation in DrosophilaScience 309764ndash767

Bartolome C Maside X Charlesworth B 2002 On the abundance anddistribution of transposable elements in the genome of Drosophilamelanogaster Mol Biol Evol 19926ndash937

Bennetzen JL 2000 Transposable element contributions to plant geneand genome evolution Plant Mol Biol 42251ndash269

Bergman CM Bensasson D 2007 Recent LTR retrotransposon insertioncontrasts with waves of non-LTR insertion since speciation inDrosophila melanogaster Proc Natl Acad Sci U S A 10411340ndash11345

Biemont C Vieira C 2006 Junk DNA as an evolutionary force Nature443521ndash524

Birchler JA Hiebert JC 1989 Interaction of the Enhancer of white-apricotwith transposable element alleles at the white locus in Drosophilamelanogaster Genetics 122129ndash138

Birchler JA Hiebert JC Rabinow L 1989 Interaction of the mottler ofwhite with transposable element alleles at the white locus inDrosophila melanogaster Genes Dev 373ndash84

Bowen NJ McDonald JF 2001 Drosophila euchromatic LTR retrotran-sposons are much younger than the host species in which theyreside Genome Res 111527ndash1540

Carr M Soloway JR Robinson TE Brookfield JF 2002 Mechanisms reg-ulating copy numbers of six LTR retrotransposons in the genome ofDrosophila melanogaster Chromosoma 110511ndash518

Charlesworth B 1994 The effect of background selection against dele-terious mutations on weakly selected linked variants Genet Res 63213ndash227

Charlesworth B Jarne P Assimacopoulos S 1994 The distribution oftransposable elements within and between chromosomes in a pop-ulation of Drosophila melanogaster III Element abundances in het-erochromatin Genet Res 64183ndash197

Charlesworth B Langley CH 1989 The population genetics ofDrosophila transposable elements Annu Rev Genet 23251ndash287

Charlesworth B Lapid A 1989 A study of ten families of transposableelements on X chromosomes from a population of Drosophila mel-anogaster Genet Res 54113ndash125

Charlesworth B Lapid A Canada D 1992 The distribution of transpos-able elements within and between chromosomes in a population ofDrosophila melanogaster I Element frequencies and distributionGenet Res 60103ndash114

Chen GK Marjoram P Wall JD 2009 Fast and flexible simulation ofDNA sequence data Genome Res 19136ndash142

Chung H Bogwitz MR McCart C Andrianopoulos A Ffrench-ConstantRH Batterham P Daborn PJ 2007 Cis-regulatory elements in theaccord retrotransposon result in tissue-specific expression of theDrosophila melanogaster insecticide resistance gene Cyp6g1Genetics 1751071ndash1077

Cutter AD Good JM Pappas CT Saunders MA Starrett DM Wheeler TJ2005 Transposable element orientation bias in the Drosophila mel-anogaster genome J Mol Evol 6733ndash741

Daborn P Boundy S Yen J Pittendrigh B Ffrench-Constant R 2001DDT resistance in Drosophila correlates with Cyp6g1 over-expressionand confers cross-resistance to the neonicotinoid imidacloprid MolGenet Genomics 266556ndash563

Deloger M Cavalli FMG Lerat E Biemont C Sagot M-F Vieira C 2009Identification of expressed transposable element insertionsin the sequenced genome of Drosophila melanogaster Gene 43955ndash62

Dworkin I Palsson A Gibson G 2005 Replication of an Egfr-wing shapeassociation in a wild-caught cohort of Drosophila melanogasterGenetics 1692115ndash2125

Ewing B Green P 1998 Base-calling of automated sequencer tracesusing Phred II Error probabilities Genome Res 8186ndash194

Gruber JD Genissel A Macdonald SJ Long AD 2007 How repeatable areassociations between polymorphisms in achaete-scute and bristlenumber variation in Drosophila Genetics 1751987ndash1997

Haddrill PR Charlesworth B Halligan DL Andolfatto P 2005 Patterns ofintron sequence evolution in Drosophila are dependent upon lengthand GC content Genome Biol 6R67

Hahn MW Han MV Han S-G 2007 Gene family evolution across 12Drosophila genomes PLoS Genet 3e197

Hirschhorn JN Daly MJ 2005 Genome-wide association stud-ies for common diseases and complex traits Nat Rev Genet 695ndash108

2325

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 2: Abundance and Distribution of Transposable Elements in Two ...

Drosophila (Mackay and Langley 1990 Long et al 2000Macdonald et al 2005 Gruber et al 2007) Future studiesmay reveal many more examples of TEs as a class as causativemutations

Recently two reference panels for the mapping of quanti-tative trait loci (QTL) have been unveiled the DrosophilaGenetic Reference Panel (DGRP) (Mackay et al 2012) andthe Drosophila Synthetic Population Resource (DSPR) (Kinget al 2012a 2012b) The DGRP is a set of 168 inbred isofemalelines derived from individuals collected in Raleigh NC in 2003The DSPR is a collection of ~1700 Recombinant Inbred Linesderived from 15 highly inbred founder lines The founderswere collected from many different geographic locationsand were all been in laboratory conditions for 40 + years(Macdonald and Long 2007) These resources are intendedfor use as platforms to map QTL in D melanogaster and havebeen characterized with respect to single-nucleotide poly-morphism (SNP) genotypes (King et al 2012a Mackay et al2012) In addition an initial pass at calling transposable ele-ment (TE) insertions in the DGRP was presented in Mackayet al (2012) and in Linheiro and Bergman (2012) This studyrevealed hundreds of rare TE insertions per line and in caseswhere TE insertions are found in the same gene in multiplelines the insertions are usually at different positions withinthe gene (Mackay et al 2012) These findings suggest thatmarker-based associations where the causative mutation is aTE insertion may fail to replicate and that tests for the effectsof TE insertions as a class of mutations such as in Mackay andLangley (1990) Long et al (2000) and Macdonald et al(2005) may be more fruitful

Replicability of genotypendashphenotype associations is a crit-ical step for establishing causative variation and genome-wideassociation studies in humans use replication in multiple datasets as a standard of quality (NCI-NHGRI Working Group onReplication in Association Studies 2007) Future studies usingthe DGRP and DSPR must also consider differences in thepattern of TE insertions between panels when designing ex-periments utilizing these resources because a causative TEsegregating in one panel but not in the other would affectthe replicability of a study It is therefore important to knowthe TE genotype of the individual fly lines used in QTL studiesin Drosophila and an accurate characterization of the TE con-tent of both panels is desirable for future QTL mapping stud-ies in Drosophila Currently in D melanogaster there are onlya few replicated associations One set involves TE insertions atthe achaete-scute complex and their effect on bristle number(Mackay and Langley 1990 Long et al 2000) There is oneexample of a replicated association involving a SNP and thegene Egfr in Drosophila (Palsson and Gibson 2004 Dworkinet al 2005) although the nature of the replication is complexbecause the specific phenotype associated with the replicatedSNP differs between studies

Here we describe the TE content of two QTL mappingresources identifying patterns of TE abundance and distribu-tion We find that TEs are generally rare existing in only oneline and that there is an excess of rare TEs compared with thestandard neutral model and SNPs We find substantial varia-tion between genes in term of TE load as well as a weak effect

of gene family size on average TE load on genes TEs are also athigher densities in regions of low recombination as well as onthe X chromosome

New ApproachesHere we present a whole-genome TE calling method which isan improvement to the method used to call TEs in Mackayet al (2012) and apply it to two QTL mapping resources(supplementary fig S1 Supplementary Material online)Briefly TEs are identified by read-pairs generated via next-generation paired-end sequencing Informative read-pairs arecharacterized by a pattern of one read in the pair aligninguniquely to the genome and the other read in the pair align-ing to multiple genomic regions Individual insertion eventsare identified by two sets of read pairs one set anchored byuniquely aligning reads upstream of the insertion and one setanchored by uniquely aligning reads downstream of the in-sertion Following identification of informative read pairs weuse Phrap (Ewing and Green 1998) to reconstruct the localarea in genomic regions suggestive of the presence of a TEReconstructed sequences are then aligned back to the refer-ence genome using BlastN (Altschul et al 1990) to validatethe presence of a TE breakpoint and to classify the type of TEidentified

There are four major improvements to this method overthe previous method used in Mackay et al (2012) First wehave switched to using the aligner BWA (Li and Durbin 2009)which is more accurate at distinguishing unique from non-unique alignments to a reference genome than the Mosaik 10(httpcodegooglecompmosaik-alignerdownloadsdetailname=Mosaik201020Documentationpdf last accessedJune 30 2012) aligner used in Mackay et al (2012) for TEdetection Second we have incorporated bedtools (Quinianand Hall 2010) into our pipeline which has improved thespeed at which we can detect events Third we have alsoincorporated definitive absence calls to the pipeline wherewe attempt to positively call the presence or absence of a TEat a given position in a given genotype (cf Mackay et al 2012)Finally we have performed both computational and poly-merase chain reaction (PCR)-based validation of our methoddemonstrating the methodrsquos precision and sensitivity

Results

Identity by Descent in the DGRP Resource

Because association studies rely on the assumption that indi-viduals are unrelated (Voight and Pritchard 2005) we used theSNP calls from Mackay et al (2012) to scan for regions ofextensive identity by descent (IBD) in the DGRP samplegt95 similarity in 1 Mb windows with 100 kb steps Webegan with a total of 148 DGRP lines for which we have se-quence data We identified many large regions of IBD betweenpairs of lines (fig 1A) and removed from the analysis a total of13 DGRP lines which were identified as 95 IBD with an-other line over 50 of their genomes When two linesshowed such genome-wide IBD the line with the highest av-erage coverage was retained for further analyses Similarly re-gions in remaining DGRP lines which were 95 IBD to

2312

Cridland et al doi101093molbevmst129 MBE

another line were masked in the line with the lowest averagegenome coverage In addition to the regions removed due toIBD we removed four lines from the DGRP set which had anaverage genome coverage of lt10 (supplementary table S1Supplementary Material online) This filtering resulted in asample size of 131 DGRP lines some of which were partiallymasked to be used in all further analyses This removal of somelines and masking of regions in other lines resulted in variationin sample sizes across the genome For 98 of sites in thegenome the sample size was 68 For the remainder of sitescoverage was low in a substantial number of lines indicative ofgenomic regions where alignment of short reads is difficult

Coalescent simulations of an equilibrium WrightndashFishermodel showed that the probability of more than four pairsof lines with a single 1 Mb region of IBD matching our IBDcriteria is lt103 Therefore our observation of hundreds of1 Mb regions of IBD between pairs of lines (fig 1A) indicatedthat there was substantially more IBD in the DGRP thanexpected in a panel of randomly chosen individuals sampledfrom an idealized population However the extent of IBD wassimilar between the 15 DSPR founder lines and the 15 DGRPlines with the highest average sequence coverage (fig 1Band C) and was equal to a few dozen megabytes of IBDincluding a large proportion of centromeric regions It istherefore unclear whether the extensive IBD seen in 148lines of the DGRP (fig 1A) is a general property of largesamples of cosmopolitan D melanogaster or a specificproperty of the DGRP line collection

Data Filtering

Following the IBD analysis we identified transposable ele-ment insertions in a total of 146 lines 131 DGRP and 15DSPR founder lines TEs identified at a specific genomic loca-tion were either present in the D melanogaster referencesequence (version 513 downloaded from wwwflybaseorglast accessed January 30 2009) hereafter referred to as refer-ence TEs or not present in the reference sequence hereafterreferred to as novel TEs Centromeric and telomeric regionsdefined in the Materials and Methods section were excludedfrom the analysis as identification of TEs in these regions canbe problematic since their very high TE densities result in fewinformative read-pairs In addition we wished to focus on theeuchromatic regions of the genome since most genes arelocated in these regions We also removed from the datasets reference TEs which were lt75 of a full length copyof the element as these elements are unlikely to be activecopies These filtering steps reduced the number of referenceTEs considered in later analysis to 607 from a starting total of6003

Validation of Transposable Element PresenceAbsence CallsSimulationDetecting a set of inserted elements in silico showed that atan average genome coverage of 50 we were able to detect913 of elements that were at least 75 of the length of the

0 20 40 60 80 100 120

05

1015

0 20 40 60 80 100 120

05

1015

0 20 40 60 80 100 120

050

100

150

Descending Order of Coverage DGRP

X 2L 2R 3L 3R

A B

C

DGRP Top 15 by Coverage

DSPR

Mb Mb

Mb

Line

s (R

anke

d by

Mea

n C

over

age)

X

X

2L 2R 3L 3R

2L 2R 3L 3R

FIG 1 (A) Identity by descent in 148 DGRP lines (B) IBD in the top 15 DGRP lines by average sequence coverage (C) IBD in the 15 DSPR lines Maskedregions indicate regions of IDB 95 When two lines were considered IBD in a region the line with lower mean coverage was masked

2313

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

canonical element These are the set of events directly com-parable to the set of reference TEs we kept for our analysissince these are the set of TEs that are likely to be active copiesin the genome A drop in average genome coverage did notaffect our rate of identification in simulated data sets Wedetected 956 of elements at 25 coverage and 913 ofelements at 15 coverage

When we looked at the detection of all elements we foundthat we were able to identify the insertion locations for 70 ofall elements at 50 coverage 724 of elements at 25 cov-erage and 64 of elements at 15 coverage

PCR ValidationWe compared our pipelinersquos TE calls with PCR data from1687 PCR calls of TE insertions in 9 of the DGRP lines(Blumenstiel JP Chen X He M Bergman CM unpublisheddata httparxivorgabs12093456 last accessed June 262013) The comparison of our pipeline to PCR-validated in-sertions showed a 924 overall agreement between the twomethods (supplementary table S2 Supplementary Materialonline) when comparing presence and absence calls We didnot include the few examples of heterozygous insertions inthis analysis Our pipeline was unable to make a presence orabsence call for the set of PCR-validated insertions 6 of thetime and disagreed with the PCR data 16 of the time Failureto validate via PCR was equally likely for presence or absencecalls suggesting that detection of TEs was unbiased with re-spect to presence versus absence We estimate a pipelinesensitivity of 982 and a specificity of 94 based on thissubset of the data Our estimation of the validation rate isconservative requiring us to reconstruct a contig that eithercontains the TE breakpoint or spans over the insertion site inorder to call an event as either present or absent respectivelyFor some cases where our pipeline makes no call we did seeread pairs that are suggestive of the state of the presence orabsence of the TE but are not sufficient in number andoruniqueness for reconstruction the event de novo

Computational ValidationThe initial phase of our pipeline detected transposable ele-ment insertions in individual lines Following this initial de-tection phase our pipeline surveys every line at each genomiclocation where a TE insertion was identified in any other lineincluding lines in different data sets In this manner weattempted to computationally verify the presence or absenceof TE insertions at a total of 2596370 genomic locationsOur pipeline made presence or absence calls between9717 and 9998 of the time (supplementary table S3Supplementary Material online) Positive absence callsmade by our pipeline can also be considered a type of vali-dation since to make an absence call our pipeline must detectreads that span the junction where the insertion would belocated (supplementary table S3 Supplementary Materialonline)

The presence of regions of IBD amongst DGRP lines (fig 1)provided an opportunity to cross-validate TE calls in regionsmasked as IBD between different line pairs because we ex-pected TE genotypes to be identical within regions identified

as IBD We find a mean of 98 agreement for TE calls (bothpresence and absence) between pairs of regions identified asIBD with a standard deviation of 0015

Distribution and Abundance of TransposableElements

A total of 7104 transposable element insertions were foundin the DSPR and 17639 in the DGRP A comparison of the TEinsertion content of the DGRP and DSPR reveals that eachpanel contained a large number of TE insertions specific tothat panel only 1656 insertions were shared between theDSPR and DGRP data sets out of a total of 23087 insertionsdetected (ie 72 shared insertions) All transposable ele-ment insertion locations are listed in supplementarytables S4 and S5 Supplementary Material online A subsetof the DGRP data lines with an average sequence coverage25 mean sequence coverage 335 standard deviation51 hereafter DGRP25 (table 1) contained 5855 insertionsThe DGRP25 set of lines was used in some analyses to bemore directly comparable to the DSPR lines which all had amean genome coverage of 60 (fig 2)

In both resources TE insertions were generally at low fre-quency Overall we found a larger total number of insertionsper line in DSPR than in DGRP (fig 2) This was presumablydue to the much higher sequencing coverage of the formerleading to a higher power to make calls of complex insertionsMean sequence coverage for the DGRP lines was 195with astandard deviation of 64 mean sequence coverage for theDSPR was 671 with a standard deviation of 42 Howeverthe increased amount of time the DSPR lines have remained inlaboratory conditions may also have contributed to the ele-vated number of TEs seen in these lines (Nuzhdin et al 1997)

TEs may be located at random throughout the genome orthey may be found preferentially in certain genomic regionssuch as in intergenic sequence We used the binomial distri-bution as a null model for the distribution of TEs in thegenome assuming TEs insert and remain at random withrespect to these genomic features (table 2) We found thatin general the distribution of TEs observed differed from whatwe would expect if TEs were distributed at random through-out the genome We found significantly fewer TEs in exons50 UTRs and 30 UTRs than expected and significantly moreTEs than expected in intergenic and intronic regions (table 2)which was consistent with previous observations as TEs ingenes are much more likely to be deleterious (Kaminkeret al 2002 Lipatov et al 2005) This pattern held true forthe DGRP the DGRP25 and the DSPR suggesting that selec-tion may be acting universally to remove TEs that disruptexonic sequences

TE Identification

Information about the class of TE that has been inserted isuseful for two reasons First this information will allow forrapid assays for the presence of this TE in other strainsSecond insertions of different families of TEs can producedifferent phenotypic consequences (Smith and Corces1991) We were able to identify the TE class of the majority

2314

Cridland et al doi101093molbevmst129 MBE

of insertions (table 1) RNA elements were much morecommon than DNA elements and within RNA elementsover half were long terminal repeat (LTR) retrotransposonelements We were unable to identify element class for347 of the DGRP insertions and 320 of the DSPR inser-tions In cases where we were unable to identify the TE familyfor an insertion this was due to either conflict between themost likely TE as identified by the two reconstructed contigsfor each event or because we had limited data on the TEsequence for that event In cases where we have conflictinginformation this was due to 1) the degradation of the TEsequence so that there were many poor matches to multiplefamilies 2) conserved sequence between families and there-fore good matches to multiple families 3) because we iden-tified a nest of elements where multiple elements insert invery close proximity or within one another or 4) the amountof TE sequence we were able to reconstruct was less than

80 bp and therefore below our annotation threshold lengthBecause of the frequently short length of TE sequence that weattempted to identify we did not attempt to identify thestrand of the TE insertion Although TE orientation is impor-tant and influences selection upon the insertion (Cutter et al2005) our data were frequently insufficient to resolve strand

Factors Influencing TE Density

If TE insertions are deleterious and recessive with respect tofitness then insertions are also predicted to be morecommon on the autosomes since deleterious insertions onthe X will be eliminated via selection against hemizygousmales (Charlesworth and Langley 1989) Contrary to this pre-diction an analysis of variance (ANOVA) indicated a signifi-cant effect of chromosome on mean TE density per line whenboth resources were included in the same analysis (table 3)This difference was due to an increased mean density of TEson the X versus the autosomes in regions of high recombi-nation This pattern was the same both for autosomes takenas a group and for each autosome individually (table 4)

TEs are predicted to accumulate in regions of low recom-bination where they are less likely to produce deleteriousrearrangements via ectopic exchange (Charlesworth andLangley 1989) An ANOVA indicates a highly significanteffect of recombination rate on mean TE density per line(table 3) TE density was also found to be greater in regionsof low recombination than in high recombination through-out the genome which was consistent with the predictions ofCharlesworth and Langley (1989)

Because the factors that influence TE density may differgreatly from family to family we repeated this analysis in 14families of TEs that appear in moderate to high abundance inthe two resources (table 5) Many of these families were pre-viously examined by Montgomery et al (1987) and Bartolomeet al (2002)

We found that in 10 families (412 roo 176 F Bari1 copiahopper INE-1 mdg1 springer) there was a highly significantdifference in mean TE density between the X and the auto-somes (P value 0026 in all cases) although the direction ofthe difference varied among families Within these familiesthere were substantial differences In two of the familiesINE-1 and roo we saw an increased density in the X overthe autosomes in regions of high recombination in boththe DSPR and DGRP25 resource (table 5) However for 412F mdg1 springer Bari1 and copia we saw the opposite pat-tern This suggests that there may be substantial variation inTE insertion site preferences among families

Sample Counts of Transposable Elements

The power to detect an association is dependent both on thepenetrance and the frequency of the causal allele (Hirschhornand Daly 2005) We found that the majority of transposableelements in the DGRP25 and the DSPR are found in only oneline (fig 3) Comparisons between the observed TE data andSNPs from the same population found in introns 86 bp(Haddrill et al 2005) indicated an excess of TEs (DGRP252 test Pamp 0 df = 6 DSPR 2 test Pamp 0 df = 6)

0 20 40 60 80 100

020

040

060

080

010

00

Coverage

TE

s

DSPRDGRP gt 25x coverageDGRP lt 25x coverage

FIG 2 Total number of TEs identified versus coverage for the DGRPDGRP25 and DSPR lines

Table 1 Summary of TE Insertions

DGRP DGRP25 DSPR

Total 17639 5855 7104

Not present in reference 17346 5615 6812

Present in reference 293 240 292

X recombination 2 cMMb 418 149 203

X recombinationgt 2 cMMb 2741 948 1107

Autosomes recombination 2 cMMb 6205 2131 2694

Autosomes recombinationgt 2 cMMb 8177 2544 3019

4 all 98 83 81

Exon 1158 378 633

Intron 8595 2870 3310

30 UTR 477 159 234

50 UTR 190 63 81

Intergenic 7219 2680 3269

DNA elements 1388 514 748

RNA elements 10133 3004 4085

Indeterminate 6118 2336 2271

2315

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Comparisons between the observed data and the expectednumber of events at the same count under an infinite sitesmodel of an equilibrium Wright-Fisher population (Wakeley2009 p 54ndash56) also indicated an excess of rare alleles(DGRP25 2 test Pamp 0 df = 6 DSPR 2 test Pamp 0df = 6) In addition to overall site count spectra we also gen-erated spectra for the X autosomes regions of high and lowrecombination DNA elements and RNA elements In eachcase we observed the same pattern as seen in the overallspectrum (data not shown) We also compared the distribu-tion of counts in the DGRP25 with the rescaled distribution ofcounts in the DSPR rescaled to the same sample size as theDGRP25 A KolmogorovndashSmirnov test comparing these twodistributions showed that they did not differ (P = 021) Wedid see more overall TEs in every count category in therescaled DSPR data over the DGRP25 data set

Patterns of Transposable Element Insertions betweenGenes

For association studies genotyping TE insertions withincoding regions is of interest due to the potentially large

effect of an insertion on gene function A total of 2931genes (213 of annotated genes) had insertions in theDGRP resource These are insertions which exist anywherewithin the span of the gene from 50 UTR to 30 UTR Onethousand nine hundred fifteen genes 139 had insertions inthe DSPR with 1317 genes having insertions in bothresources The number of TE insertions within genes variedsubstantially between individual genes up to a maximum of78 insertions though the majority of genes with insertionshad only one insertion 631 in the DSPR and 527 in theDGRP The number and locations of individual insertions in agene varied tremendously though the majority of insertionsthat fell within gene regions were in introns 777 DSPR and824 DGRP TE number was strongly correlated with intronsize (Pearsonrsquos correlation coefficient = 079 for DSPR 087for DGRP Pamp 0 for both tests) (supplementary fig S2Supplementary Material online)

Examining the pattern of insertions in individual genes inthe DSPR and DGRP data sets we found that RNA-bindingprotein 6 had the highest number of insertions 62 in theDGRP lines and 21 in the DSPR lines These insertions were

Table 3 ANOVAs for DSRP and DGRP25 Coverage and Comparison between the Two Data Sets

df Sum Sq Mean Sq F value Pr(gtF)

DSPR vs DGRP25

Set 1 54125 54125 6486975 lt2e16

Line 36 2848 791 94817 lt2e16

Chromosome 4 11858 2965 355313 lt2e16

Recombination rate (high vs low) 1 25575 25575 3065207 lt2e16

Chromosomerecombination rate 4 10335 2584 309663 lt2e16

Residuals 333 27784 083

DSPR

Line 14 26599 19 17894 004699

Chromosome 4 39632 9908 93316 123e06

Recombination rate (high vs low) 1 141337 141337 1331143 lt2e16

Chromosomerecombination rate 4 39712 9928 93504 119e06

Residuals 126 133783 1062

DGRP25

Line 22 258202 11736 1723 lt2e16

Chromosome 4 80126 20031 29409 lt2e16

Recombination rate (high vs low) 1 120011 120011 176191 lt2e16

Chromosomerecombination rate 4 66056 16514 24244 243e16

Residuals 198 134866 0681

Table 2 Differences between Observed and Expected TE Counts

DGRP DGRP25 DSPRObserved vs Expected P Value Observed vs Expected P Value Observed vs Expected P Value

X recombination 2 cMMb Decrease 329E11 Decrease 187E03 Decrease 522E02

X recombinationgt 2 cMMb Decrease 174E01 Increase 701E02 Increase 416E01

Autosomes recombination 2 cMMb Decrease 370E01 Increase 100E + 00 Increase 687E04

Autosomes recombinationgt 2 cMMb Increase 392E02 Decrease 100E + 00 Decrease 100E + 00

Exon Decrease 000E + 00 Decrease 247E323 Decrease 103E283

Intron Increase 354E103 Increase 000E + 00 Increase 000E + 00

30 UTR Decrease 699E16 Decrease 313E06 Decrease 126E02

50 UTR Decrease 143E51 Decrease 282E18 Decrease 498E20

Intergenic Increase 112E32 Increase 0 Increase 000E + 00

2316

Cridland et al doi101093molbevmst129 MBE

typically unique to a single line though there were 3 insertionsin more than one line in the DSPR and 14 insertions in morethan one line in the DGRP

Some genes accumulated many TE insertions such asklarsicht (fig 4A gene images from the UCSC genome brow-ser Meyer et al 2013) which had 29 total insertions This genealso had a hotspot of insertions with seven independent in-sertions each at low frequency located within 36 kb of each

other Only one of these insertions was present in any givenline and these could be functionally equivalent though inde-pendently arising mutations Other genes had only a few in-sertions such as Notch (fig 4B) and Delta (fig 4C) Theinsertions in these genes were also at low frequency andthe few present were located in intronic regions

While most insertions exist in only one panel some werefound in both Cyp6a20 (fig 4D) contains a high frequencynon-reference TE insertions present in both data sets in 4lines out of 121 lines where we were able to make a presenceor absence call in the DGRP (hereafter shown as 4121) and314 lines in the DSPR Both Cyp6a20 and klarsicht are exam-ples of genes with TE insertions in exons which was

00

02

04

06

08

10

1 2 3 4 5 6 7+

Observed TEsObserved SNPsExpected

DSPR

00

02

04

06

08

10

1 2 3 4 5 6 7+

Observed TEsObserved SNPsExpected

DGRP

FIG 3 Derived allele count spectra for the DSRP lines and the DGRP25lines where a positive presence or absence call was made for eachinsertion in each line 6613 insertions in the DSPR and 3274 in theDGRP25 Count spectra for SNPs is from SNPs in introns 86 bp 2

tests between observed and expected distributions result in Pamp 0 forcomparisons between TEs and the neutral model as well as between TEsand SNPs for both data sets

Table 5 TE Density for 15 Individual Families of Elements

Mean Density (TEMb)

Element Resource X High X Low Auto High Auto Low

roo DGRP25 039 037 032 033DSPR 080 088 065 068

297 DGRP25 005 000 002 001DSPR 005 000 004 000

412 DGRP25 003 001 003 007DSPR 006 008 009 012

F DGRP25 002 006 004 006DSPR 007 008 012 013

176 DGRP25 001 000 001 002DSPR 002 000 002 006

Bari1 DGRP25 001 000 004 001DSPR 002 002 006 002

copia DGRP25 001 000 002 003DSPR 009 002 014 019

H DGRP25 001 000 001 000DSPR 000 000 001 000

hopper DGRP25 010 001 000 000DSPR 010 004 002 002

INE-1 DGRP25 042 188 003 048DSPR 045 203 004 055

jockey DGRP25 015 012 019 013DSPR 035 034 040 033

mdg1 DGRP25 003 002 003 010DSPR 008 004 012 018

pogo DGRP25 006 004 007 005DSPR 013 021 015 013

springer DGRP25 000 001 001 003DSPR 002 000 001 005

Table 4 TE Density in the X and Autosomes

TEMb

DGRP25 DSPR Both

X all 382 638 483

Autosomes all 351 593 447

X high recombination 371 600 461

X low recombination 393 676 505

2L high recombination 227 473 324

2L low recombination 323 589 428

2R high recombination 282 489 364

2R low recombination 617 863 714

3L high recombination 296 523 386

3L low recombination 474 726 573

3R high recombination 281 468 355

3R low recombination 372 670 490

2317

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

uncommon both in this data set and in previous studies(Kaminker et al 2002 Lipatov et al 2005)

Differences in the strength of selection against closelyrelated genes can also be illustrated by patterns of TE inser-tions The paralogs derailed (fig 4F) and derailed-2 (fig 4E)showed very different patterns of TE insertions though thesame pattern was seen in both resources derailed-2 locatedon chromosome 2R is in a region of moderate recombination199 cMMb whereas derailed is located towards the distalend of 2L in a region of low recombination 044 cMMb

Given the context of the recombination rates the expectationwould be that derailed would have a higher TE load sincedeleterious alleles are removed more efficiently in regions ofhigh recombination (Hudson and Kaplan 1995) but theopposite pattern was observed

While both genes play a role in Wnt5 signaling mutationsin derailed can cause major phenotypic changes in Drosophilanervous system development resulting in the loss of normalfunction (Yoshikawa et al 2003) derailed-2 when mutatedproduces only minor differences in neuron positioning

115 115115 215 115115

114 1122

11181120 11181124

2117

21182119

1122

1119 1118

2118

1121 3114

1115

1114

1119

1113

1119

1120

1120

1119

1120

3119

1119

A klarsichtDGRP

DSPR

Insertion in an Exon

314 1195

4121D Cyp6a20

315

1123

1415

56125

115115

415

115615

10124

1124 1124 1123

2124E derailed-2

B Notch

115 11211122

C Delta1114 1120

191

F derailed

Same insertion in both resources

FIG 4 Transposable element insertions in genes The frequency above each insertion is the number of lines in which the element is present over thenumber of lines in which the element is validated as either present or absent Gene images are from the UCSC genome browser (httpgenomeucscedu last accessed June 31 2012)

2318

Cridland et al doi101093molbevmst129 MBE

(Sakurai et al 2009) This suggests that either the strength ofselection may be much higher against insertions in derailed orthe insertion rates are unequal between the paralogs

Gene Family Size and TEs

We used available information on gene families in DrosophilaHahn et al (2007) grouped into families based on sequencesimilarity and found a very weak but significant effect of genefamily size on the average number of TE insertions per kbin both resources (P value = 00003 R2 = 0001 DSPRP value = 0002 R2 = 0007 DGRP) (fig 5A and B) Althoughthere was a large variance in the number of insertions per kb

in both resources (fig 5A and B) the average number ofTEskb increased as family size increased up to a moderatelylarge family size

Patterns of Variation between Resources

To replicate an association with a single causative allele thatallele must both exist in both panels and also be at highenough frequency to detect the association Given thatmost TEs were singletons and only 45 of insertion eventswere identified in both panels and were within genes thereplicability of associations with these alleles is likely to below if alleles are considered individually We examined sharedinsertions segregating at different frequencies between panelsand found only two TE insertions in genes that were segre-gating at different frequencies in the DGRP versus DSPRresources following Bonferroni correction (Fisherrsquos exacttests P 000007) One of these was in an exon of bothCG13175 and CG33964 a pair of overlapping genes Herethe TE insertion was at much higher frequency in theDGRP than in the DSPR (89122 vs 215) The other insertionwas in an intron of Caliban and the frequency of this insertionwas higher in the DSPR than the DGRP (1115 vs 7116)

If insertions are considered as a class of mutation thenthere may be better power to detect an association assumingthat different insertions contribute in the same way to thephenotype We also examined genes with TE insertions(139 of genes) though not necessarily the same insertionin both resources Treating all insertion alleles within a gene asequivalent a comparison between resources showed thatthere were 21 genes which contain TEs at different frequen-cies between the DGRP versus DSPR following Bonferronicorrection (Fisherrsquos exact tests P 0000038197) includingderailed-2 (supplementary table S6 Supplementary Materialonline)

Discussion

Distribution and Abundance of TransposableElements

Previous work on transposable elements has generally beenfocused on a few families of TEs (Montgomery et al 1987Charlesworth and Lapid 1989 Charlesworth et al 1992) orfocused only on a single genome (Bartolome et al 2002Kaminker et al 2002 Rizzon et al 2002) We have presenteda genomic analysis of TEs across a large number of lines andtwo data sets We found that transposable elements in bothresources are primarily found as singletons with a few mod-erate to high frequency insertions Transposable elements alsosegregated at lower allele frequencies than do presumed non-functional SNPs drawn from the same data set suggestingthat negative selection is acting upon TEs However theremay be substantial variation in the age of transposable ele-ments in different families many LTR elements are thought tobe young insertions (Bowen and McDonald 2001 Bergmanand Bensasson 2007) while other families like INE-1 arethought to be quite old (Kapitonov and Jurka 2003) Thismay create a situation where the young age of insertions isresponsible for the excess of rare alleles We found a

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

DSPR

DGRPA

B

FIG 5 log10(TE density) versus log10(Gene family size)

2319

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

non-random distribution of TE insertions across the genomewith increases of TEs in both the intergenic and intronicregions of the genome and decreases in exonic and UTRregions (Kaminker et al 2002 Lipatov et al 2005) Similar toprevious studies we found variation in TE density betweenchromosomes and higher densities of TEs in regions of lowrecombination (Charlesworth et al 1994 Bartolome et al2002) We also found increased TE densities on the X relativeto the autosomes both in overall insertions and in someindividual TE families (table 5)

Detection method and the size of the data set are likely tobe contributing factors to differences between previous workand this study Many previous studies have used in situ meth-ods to ascertain TE number in samples and have looked atelement accumulation at the base versus the midsections ofchromosomes (Montgomery et al 1987 Charlesworth et al1992) or have looked only at one genome (Bartolome et al2002 Rizzon et al 2002) A previous PCR-based detectionstudy found no difference in TE frequency between the Xand the autosomes in natural populations (Petrov et al2011) but this study was biased in that it only looked atTEs shared with the reference sequence and thus did notcapture the full picture of TEs in these populations

While our analysis does disagree with previous work thesedisagreements may be mainly due to the way in which regionsof the genome are divided for examination Rizzon et al(2002) found an increase in copies of LTR retrotransposonson the X when they excluded pericentromeric telomeric andchromosome 4 from their analysis though they found a def-icit of TEs on the X when these regions are included Thisfinding is most directly comparable to our own finding for theentire TE data set since they exclude a very similar set ofgenomic regions We do agree with previous work in ourfinding of higher densities of TEs in regions of low recombi-nation than in high recombination (Charlesworth et al 1994Bartolome et al 2002)

Examining individual element families Carr et al (2002)found an increase of mdg-3 elements and 297 elements on theX than would be expected under a random insertion modelMontgomery et al (1987) found no evidence for a reductionin TEs on the X for both roo and 297 elements but foundfewer TEs on the X when looking at the 412 family HoweverCharlesworth et al (1992) observed the opposite pattern forroo and 297 Montgomery et al (1991) also found that moreroo elements in lab kept lines than they had previously ob-served in natural populations We found significantly in-creased densities on the X for roo and significantlyincreased densities on the autosomes for 412

Differences between previous studies of individual elementfamilies and our results may be due to a variety of reasonsFirst we were able to annotate only about two-thirds of ourTE insertions This was primarily due to having insufficientsequence to reliably annotate the element or because inde-pendent annotation using sequence reconstructed fromeither end of the insertion did not yield the same elementfamily annotation This means that our density calculationsfor individual elements may be skewed by missing annotationinformation Second many previous studies (Montgomery

et al 1987 Charlesworth et al 1992 Carr et al 2002) examinedonly LTR elements These elements are young (Bowen andMcDonald 2001 Bergman and Bensasson 2007) and thus in-dividual copies may not yet have been removed from thegenome thereby displaying different patterns of insertionthan older active elements

What is clear is that there is great variability in TE densityboth within and between element families as well as withinand between chromosomes It may also be that the overallpattern is driven by a small group of families In our data setINE-1 shows the largest pattern of difference between the Xand autosomes with an order of magnitude increase in meanTE density per line on the X over the autosomes possibly dueto an increased likelihood of fixation of insertions (Omettoet al 2005) specifically on the X chromosome (Presgraves2006) These INE-1 elements are a subset of the largest TEfamily in D melanogaster though only several dozen of themare in the euchromatin INE-1 elements are also considered tobe inactive in D melanogaster and likely have not been mobilefor gt3 million years (Kapitonov and Jurka 2003) Consistentwith this most of the INE-1 elements we detected were fixedor nearly fixed in our data set We did see some examples oflow frequency insertions which may be due to more recenttranspositional events or incorrect TE annotation due toshort sequences The old age of INE-1 insertions also suggeststhat the extant elements are unlikely to be selected againstAn increased density of TEs on the X compared with theautosomes suggests that it is not the fitness effects of inser-tions which govern TE copy number Fitness effects wouldresult in the more rapid removal of insertions on the X thaninsertions on the autosomes due to the increased selectivepressures experienced on the X (Montgomery et al 1987)Background selection also predicts that the X should be lessburdened by deleterious recessives since it is hemizygous onethird of the time (Charlesworth 1994)

However it is unclear whether our observation of in-creased densities of TEs on the X relative to the autosomesfits with the ectopic exchange model This model predictsthat TEs will accumulate in regions of low ectopic exchange(Langley et al 1988) The general assumption in the literatureis that ectopic exchange is positively correlated with recom-bination rate (Langley et al 1988 Montgomery et al 1991)The X chromosome experiences higher levels of recombina-tion than the autosomes with an average of 36 cMMb forthe X in regions of high recombination (cMMbgt 2) whereasthe autosomes have a mean of 336 cMMb If recombinationrate and ectopic exchange are positively correlated through-out the genome then the expectation would be fewer TEs inthe middle regions of the X where recombination rates areknown to be high This is the opposite of what we observe Ifrecombination rate and the rate of ectopic exchange are notpositively correlated then the ectopic exchange model maybe correct but data are lacking to properly address this issueCurrently there are no direct measurements of the rate ofectopic exchange on a genome-wide scale in Drosophiladuring meiosis Montgomery et al (1991) observed littleectopic exchange between homozygous chromosomes inDrosophila melanogaster on the X chromosome in the

2320

Cridland et al doi101093molbevmst129 MBE

region around the white locus They also found that whilerecombination rate was reduced in both centromeric andtelomeric regions only the centromeric region displayed anincreased density of TEs This information is suggestive ofdifferences between these regions in the rate of ectopicexchange

The DGRP and DSPR are also composed of inbred lineswhich have been kept in laboratory conditions for some time~9 years in the case of the DGRP lines and 40+ years in thecase of the DSPR Given that TE insertion mutations accumu-late rapidly and in an increasing manner over time (Nuzhdinet al 1997) the increased density of TEs on the X versus theautosomes may not be surprising if there is a highertransposition rate to the X The only direct study to datefocused on roo elements in D melanogaster and did observehigher rates of insertion in the X than the autosomes(Vazquez et al 2007)

Transposable Elements in Genes

The majority of genes 813 do not have any TE insertions inany line studied and intron length is a good predictor of thenumber of TEs within a gene However a few genes accumu-late many insertions even after correcting for gene size andintron length and some genes have very few TEs even thoughthey have large introns The reasons why different genes mayaccumulate TE insertions may be quite different for individualgenes In some cases there may be hot spots for insertionswhich if insertions are neutral or only slightly deleterious mayresult in large number of insertions in a small region In othergenes the lack of TE accumulation may suggest selectivepressures against TE accumulation

Gene family size may also play a role in TE accumulationWe see an increase in TE density correlated with an increase ingene family size This may indicate differences in selectivepressures for genes in different sized families Single-copygenes may experience more constraint whereas in moder-ately sized gene families these constraints may be more re-laxed perhaps due to redundancy

In the cytochrome p450 functional family of genes we findthat 3789 of these genes have TE insertions 2589 in theDGRP and 2089 in the DSPR Cytochrome p450 is also foundto be in an enriched term in the DSPR for the set of genescontaining TEs The cytochrome p450 gene Cyp6g1 is knownto confer resistance to DDT (Daborn et al 2001) and TE-mediated copy number variation is associated with increasingresistance to DDT (Schmidt et al 2010) This pattern of TEinsertions in the DSPR may be reflective of the locationswhere DSPR lines were originally collected The DSPR lineswere gathered from various locations worldwide at a timewhen different pesticides were in use The DGRP lines weregathered from a single geographic area in North America in2003 Presumably these sets of lines have experienced differ-ent selective pressures with respect to pesticide resistance InCyp6a20 (fig 4D) where TEs are segregating at different fre-quencies in the two resources some similar effect may beoccurring

TE Detection Pipelines

The TE detection strategy used here is an improvement uponthe strategy used in Mackay et al (2012) The most notabledifferences between the data reported in Mackay et al (2012)and the DGRP data reported here is substantial amount ofnew information First our pipeline improvements agree wellwith the previous implementation with 812 of previous TEpresence calls identified by the new pipeline However thesecalls represent on 182 of the presence calls made by theimproved pipeline demonstrating that we have identifiedsignificantly more calls than before The incorporation of TEabsence calls also allows us to provide a much more completepicture of TEs in euchromatic sequence in the DGRP

The TE detection pipeline used here is similar to otherpipelines which have been employed but it is useful to de-scribe briefly here the differences in approaches and why wefollowed the approach presented here A useful discussion ofcommon approaches to TE detection in many species can befound in Xing et al (2013) One common method for TEdetection is the split-read mapping approach which hasbeen employed in Drosophila by Linheiro and Bergman(2012) This approach identifies TE insertions by locating in-dividual sequence reads which span TE insertion breakpointsThis approach is the strategy of choice for next-generationsequencing experiments which produce only single-endreads However in a paired-end read situation this approachmay not take full advantage of the available data Specificallythe size of window around the TE insertion breakpoint whereinformative reads can be found is governed by read length in asingle-end experiment and fragment length in a paired-endexperiment Linheiro and Bergman (2012) discuss the impor-tance of read length on TE detection and suggest that in-creased read length improves TE detection In situationswhere read length is long and coverage is deep enough thesplit-read method may be preferable to the paired-end de-tection strategy However for the data sets analyzed herewhere read length is much shorter than fragment lengththe paired-end detection strategy will capture moreinformation

The pipeline used here also has similarities to the pipelineused in Kofler et al (2012) However there are two majordifferences in these studies First Kofler et al (2012) do notattempt to reconstruct their insertion breakpoints andsecond the population studied by Kofler et al (2012) wassequenced as a pool rather than as individual lines Becausetheir pooled sequence data reflect on average less than 1sequence coverage per line Kofler et al (2012) are primarilyable to detect intermediate to high frequency TE insertionstheir own analysis concludes that they cannot reliably detectinsertions at frequency lt7 (see supplementary fig S2Supplementary Material online in their supplementary dis-cussion) However the literature to date (Charlesworth andLangley 1989 Charlesworth and Lapid 1989 Charlesworthet al 1992) as well as the results of this study indicate thatthe majority of TEs in a population are found at very lowfrequency often in only one individual in a population andtherefore beyond the scope of detection of Kofler et al (2012)

2321

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Caveats

The D melanogaster reference genome to which we alignedour data is a P-element-free genome Therefore the use of thisreference will be biased against detecting any P elementsexisting in the data sets In the case of the DSPR data setthis is not an issue since these lines are also all P-element-free(Macdonald and Long 2007) However it is likely that theDGRP lines do harbor P elements For the purpose of theseanalyses we decided to ignore P elements and focus on theset of element families common to the two resources

This situation does highlight the importance of having anappropriate reference sequence To detect P elements in thegenome of D melanogaster an artificially constructed refer-ence sequence or a different assembled sequence containingthe P element would need to be utilized This also brings upthe question about as yet undetected elements that exist atlow frequency If these elements do not have sufficient sim-ilarity to existing elements to be aligned to the referencegenome these elements will go undetected

Conclusions

We have presented a description of transposable elements intwo recently released QTL resources Transposable elementscan play important roles in gene function and regulation andincorporation of TE insertion information into any analysisperformed with either of these two resources will present amore complete picture of genomic variation and its contri-bution to complex traits In addition it will be important forassociation studies attempting to replicate genotypendashpheno-type associations utilizing these resources to be aware of thepotential contributions of TEs

We have also demonstrated a next-generation high-throughput sequencing analysis pipeline that is capable ofdetecting this type variation with a high rate of precisionand specificity We suggest that future studies using next-generation sequencing data utilize this technique

Materials and Methods

Data Sets

Two data sets were used in this analysis The first set consistsof 131 inbred lines from the Drosophila Genetic ReferencePanel (DGRP) (Mackay et al 2012) SRA accessionSRP000694 The second set consists of 15 lines used to es-tablish the Drosophila Synthetic Population Resource (DSPR)(King et al 2012a) initially described in Macdonald and Long(2007) SRA accession SRA051316 For the DGRP resourceboth 454 and Illumina data were available whereas onlyIllumina data are available for the DSPR We utilized onlythe Illumina sequence data to ensure consistency in datacollection across data sets We calculated a mean coverageof 21 for the DGRP lines and 50 for the DSPR linesVariation in coverage for the DGRP lines was substantial be-tween lines ranging from ~4 average coverage to nearly50 average coverage We therefore dropped from the anal-ysis four lines with an average coverage of 10 or less sincelow coverage can cause difficulties with assembly and TEidentification Sequence data for both data sets consisted of

paired-end data with 54 bp reads for the DSPR and paired-end 100 bp reads trimmed to 75 bp for DGRP

IBD in DGRP Data Set

We downloaded the SNP tables generated by the DGRPproject from httpwwwhgscbcmtmceduprojectsdgrpfreeze1_July_2010snp_callsIllumina (last accessed July 302010) to examine identity by descent in this data set Weperformed an all by all comparison between lines examiningsliding windows of 1 Mb across the genome with 100 kb stepsbetween windows We take a simple definition of IBD hereWhenever we identified two lines that weregt95 identity atSNP positions in a 1 Mb region we labeled the pair as IBD inthe identified region and masked that region of the genomein the line with lower coverage We also dropped entirelyfrom our analysis any lines where gt50 of the genome wasdetermined to be IBD with other lines in the population(supplementary table S1 Supplementary Material online) re-moving from the analysis the line with lower average se-quence coverage

We modeled 1 Mb regions of 200 chromosomes in a coa-lescent simulation using the parameters theta = 001site and4Nr = 10theta to determine the number of regions of95IBD under a Wright-Fisher population model using the MACSsoftware (Chen et al 2009) We simulated 1000 replicates ofthis simulation and tabulated the number of 1 Mb regionsshowing 95 IBD in each replicate

Alignment

We aligned reads to the D melanogaster reference genome(version 513 downloaded from FlyBase flybaseorg) using thealigner BWA (version 059 Li and Durbin 2009) We used thefollowing parameters (aln -t 8 -l 13 -m 50000000 -R 5000)followed by the command ldquosamperdquo to resolve paired endmappings (using parameters -a 5000 -N 5000 -n 500) It isimportant when detecting TEs that the -R during the initialalignment and the -N parameter during the ldquosamperdquo phase beset high otherwise highly repetitive sequences that would beinformative of the presence of a common element will beexcluded from the data by the aligner We also used the -Ioption for the DGRP lines but not the DSPR due to the dif-ferences in Illumina quality score output format as the twodata sets were sequenced at different times The -m parameterfrom the initial alignment step and the -n parameter from thepaired-end resolution step deal with how BWA treats multiplyaligning reads An additional caveat is that some TE familiesmost notably the P-element family is absent from the Dmelanogaster reference sequence Such families in the se-quenced genomes will not align to the reference genome weused even if they are present in the sampled lines

Transposable Element Detection

Transposable elements were detected by first identifying allreads that were aligned to an annotated TE in the referencegenome (supplementary fig S1 Supplementary Material on-line) We then identified the mates of these reads and selectedonly those mates which did not align to an annotated TE

2322

Cridland et al doi101093molbevmst129 MBE

These reads identify the set of read-pairs which span a TEinsertion The uniquely mapping non-TE reads were thenclustered based on their start position and chromosomalstrand to identify putative TEs Clusters on the plus strandwere matched to clusters on the minus strand to producepairs of clusters indicating either end of the putative TE Onceclusters were identified we filtered out all events which con-tained fewer than three read-pairs consistent with a giveninsertion for each cluster This means that putative eventsat this stage have a minimum of six read-pairs indicating theinsertion three on the plus strand and three on the minusstrand

We then extracted all reads and their mates that aligned tothe identified regions of the genome and attempted toreconstruct the local area including the breakpoint of theTE Because many TEs have repetitive regions at the ends ofthe element we reconstructed the two breakpoints of the TEseparately (see also supplementary fig 23 from Mackay et al2012) Local reconstruction was done by running Phrap (ver-sion 1090518 Ewing and Green 1998) with the followingparameters (-forcelevel 10 -minscore 10 -minmatch 10)This set of parameters causes Phrap to do a Smith-Waterman comparison when it finds matches of at least 10characters but then relaxes the minimum alignment scoreand the final assembly parameters This is necessary to recon-struct TEs because repetitive sequence within the TEs canresult in a failure to reconstruct a contig if stricter assemblyparameters are used All code we developed for the detectionof TEs will be freely available at wwwmolpopgenorgData(last accessed August 1 2013)

Following reconstruction we used BlastN (verson 2222) tocompare reconstructed contigs to the D melanogaster refer-ence to confirm the presence of a TE In most cases we wereable to fully reconstruct both breakpoints of the TE Fullyreconstructed indicates that we could reconstruct a singlecontig which contained the breakpoint of the TE In mostcases the two breakpoints of the TE did not resolve to thesame nucleotide but instead identify the location of thetarget site duplication for the insertion (Linheiro andBergman 2012 similar to fig 1) These are cases where theupstream estimate of the breakpoint is slightly higher thanthe downstream estimate of the breakpoint and the differ-ence in position can be seen as the likely span of the target siteduplication However sometimes we were not able to fullyreconstruct a contig but instead generated two contigs onecontaining uniquely aligning sequence and the other contain-ing TE sequence These contigs are known to belong togetherbecause reads forming one contig have their mates in theother contig In these cases we have an approximate but notprecise estimate of the TE breakpoint and are unable toidentify the target site duplication

After the first round of detection we performed a secondround of detection where we examined every location wherea novel TE was detected in any line in all other lines Thisincluded examining positions in one data set that were iden-tified in the other data set In addition we examined thelocations of all known TEs in the D melanogaster genomeFor each location queried in each line we extracted reads

aligning to that location and their mates and used Phrap toreconstruct a contig This allowed us to produce presenceabsence or missing data calls for all positions where TE wasdetected in all lines The absence of a TE is determined byidentifying reconstructed contigs that spanned the insertionposition of the TE by at least 15 bp on either side Theseabsence calls were not included in Mackay et al (2012)This also allows us to identify TEs that may be in regions ofthe genome with lower average sequence coverage comparedwith the rest of the genome provided that insertion is sharedwith another line where it was possible to detect the insertion

Transposable Element Annotation

Annotation of novel TEs was performed by aligning recon-structed contigs with BlastN to the set of TEs found in theD melanogaster reference We chose the best match basedon overall length and BLAST reported e-value and called thenovel TE a TE of that family We also required a minimum of80 bp of contig matching a TE to make an annotation callThis is an additional annotation analysis to the one we per-formed in Mackay et al (2012) For each of the two contigsproduced for each insertion we determined the best matchand then compared these to each other TEs are classified asper the classification scheme put forth in Wicker et al (2007)We annotated each TE insertion to the highest level of agree-ment between the two contigs if the two contigs disagreedwith each other at every level of classification we called thatelement ldquoundeterminedrdquo This could result from nested in-sertions of elements where the 50 end of the insertion is truly adifferent element from the 30 end or because the TE sequenceis degraded and differs substantially from the referencesequence or because there was not enough TE sequence tomeet our annotation guidelines Because we can only recon-struct a maximum of few hundred bases at either end of theinsertion it is difficult to distinguish between these issues Wealso did not attempt to determine the strand of the TE

Validation

A subset of 190 TE insertion sites shared with the D melano-gaster reference were validated via PCR in a subset of nineof the DGRP lines by Blumenstiel JP Chen X He M BergmanCM (unpublished data httparxivorgabs12093456 lastaccessed June 26 2013) To independently check IBD callsusing SNP data we also determined if pairs of lines wereIBD for TEs in the same regions In this comparison we onlyincluded TE insertions where both lines have a positive pres-ence or absence call This can also serve as a validation of ourTE calling pipeline since if two segments are IBD they shouldshare the same set of TEs

Simulation

We selected a set of elements of different sizes and of TEclasses from the set of TEs in the reference sequence andinserted them semi-randomly into the D melanogaster refer-ence genome in silico Insertions were semi-random since weavoided areas already occupied by reference TEs and the cen-tromeric and telomeric regions Read pairs were then

2323

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

generated to construct fastq files We then ran our TE detec-tion pipeline as above through the initial annotation stepThis process was repeated three times to generate a simula-tion of 15 25 and 50 coverage

Recombination Rates

We wrote an R routine to estimate local recombination ratesas a function of physical position Briefly we used ldquocyto-ge-netic-seqtsvrdquo from wwwflybaseorg (last accessed January 302009) to assign a set of physical landmarks to genetic posi-tions For each chromosome arm we then fit a local polyno-mial relating cumulative cM to cumulative basepair (usinglocpoly) Unlike many prior local polynomial fits we madethe added assumption that this polynomial must be a mono-tonically increasing function (using monoproc) an assump-tion that is true for real data We then fit a smooth spline tothe above curve (using smoothspline) The advantage ofsmoothspline is that the resulting curves can be differentiated(and hence estimates of rates obtained directly) Finally we fita smooth spline to the first derivatives obtained from theprior smooth spline largely to make the recombinationrates less noisy near centromeres and telomeres TheR code is freely available at wwwmolpopgenorgData (lastaccessed August 1 2013)

Data Set Restriction

Analyses performed on these data sets were restricted in twoways First we restricted our analyses to regions of thegenome with low TE density and moderate recombinationrates We restricted our data set in this way because regions ofhigh TE density make it more difficult to make accurate callsBecause our pipeline requires unique reads to positionallylocate the TE insertion regions with high TE density canlack enough unique sequence to generate these reads Theregions included in the analyses are X300000ndash208000002L200000ndash20100000 2R2300000ndash21000000 3L100000ndash21900000 3R600000ndash27800000 which are not centromericand telomeric regions This excludes the majority of TEs an-notated in the reference genome since these TEs are largelyclustered together in centromeric regions However this doesnot affect our analysis since we are interested in TEs in theeuchromatic portions of the genome

Second to restrict the set of reference TEs studied to full-length copies we removed from the analysis all reference TEsthat had a length that was less than 75 of the canonicalannotated length of the TE While these elements may stillproduce phenotypic effects these elements are unlikely to beactive copies This reduced the number of reference TEs in-cluded in the analysis to 607 from 1085 TEs that are present inthe reference genome in the regions included in our analysis

Additionally some analyses were performed on the subsetof 23 lines in the DGRP data set which had an average se-quencing coverage of 25 or higher hereafter referred to asDGRP25 This restriction was to mitigate the effect of differ-ences in coverage between the two data sets Simulations oftransposable element detection via our pipeline data not

shown indicate that TE detection at 25 is comparable todetection at 50

Site Count Spectra

We generated derived allele count spectra for TEs in the DSPRdata set and the DGRP25 set of lines We included only siteswhere we were able to make a positive presence or absencecall for each line in the data set so that the number of lineswas normalized for each resource for this analysis Observedvalues were compared with expected values at counts 1 2 34 5 6 and 7 under an infinite sites model via a 2 testCategories 7 and higher were grouped due to low counts insome of these categories We also calculated count spectra forSNPs present in small introns (introns86 bp [Haddrill et al2005]) a group unlikely to have functional effects We alsoused SoFoS a site frequency rescaler (Hufford et al 2012) torescale the DSPR data set to the sample size of the DGRP25from 15 to 23 samples to see if the rescaled data show adifferent distribution of element counts from the observedDGRP25 data We then did a KolmogorovndashSmirnov test tocompare the two distributions as a test of different selectiveeffects acting on the populations

Genomic Context of TEs

Transposable elements were identified in regions of low(lt2 cMMb) or high recombination (2 cMMb) FASTAfiles containing the annotated coordinates of 50 UTRs30 UTRs intronic intergenic and full gene span regions forversion 513 of the D melanogaster reference were down-loaded from flybase (wwwflybaseorg last accessed January30 2009) Version 513 of the reference annotation was usedfor comparative purposes with the DGRP analysis (Mackayet al 2012) Because alternative transcripts can result in thesame genomic position being classified in multiple categorieswe labeled each TE as every appropriate category This oc-curred with 52 of TEs in the DGRP and 60 of TEs in theDSPR When calculating the percentage of the referencegenome in our restricted areas of analysis that were classifiedas each genomic type we followed the same strategy

We compared our observations of TE insertions to a bino-mial distribution to determine whether TEs were distributedrandomly throughout the genome We used the pbinomfunction in R (R Development Core Team 2008) setting thenumber of trials to the total number of TE insertion locationsin the data set and the probability of success to the propor-tion of the genome defined as each context (intron exon30 UTR 50 UTR or intergenic) We did one tailed tests in eachcase For exon 30 UTR and 50 UTR tests we calculated theprobability of observing the observed value or fewer TEs Forintron and intergenic tests we calculated the probability ofobserving the observed value or more TEs

Variables Affecting TE Density

For the DGRP25 and DSPR data sets we calculatedTE density in terms of TEsMb For this analysis we alsoexcluded chromosome 4 We examined the effects of recom-bination rate chromosome and TE element family on TE

2324

Cridland et al doi101093molbevmst129 MBE

density using the standard linear model in R We usedthe lm() function to define our model as followsTEsMb ~ Line + Recombination Rate + Chromosome +Recombination Rate Chromosome We performed thisanalysis both for the set of all TEs and also for 15 TE familieswith moderate to high number of copies in our data set(supplementary table S4 Supplementary Material online)We selected individual TE families for this additional analysisbased on their level of abundance in the data sets and theirhaving been previously studied by Montgomery et al (1987)and Bartolome et al (2002)

Gene Families

We downloaded the Dfam database from (httpwwwindianaedu~hahnlabflyDfamDBdrosophila_frbhtml last acc-essed September 1 2012) This database describes all mem-bers of all gene families clustered using a fuzzy reciprocalBLAST method as described in Hahn et al (2007) for 12Drosophila species We used R to perform a linear regressionusing a simple model where the square root (TE density) in agene is the random variable and log10 (gene family size) is theindependent variable

Variation between Resources

For each TE present in both data sets we compared thefrequency of the TE in each population using Fisherrsquos exacttest We excluded from this analysis all insertions where wewere unable to make a presenceabsence call in at least 100 ofthe DGRP lines or 13 of the DSPR lines For each gene with TEinsertions in both data sets we compared the frequency atwhich the gene contains TEs between data sets using aFisherrsquos exact test

Supplementary MaterialSupplementary tables S1ndashS6 and figures S1 and S2 are avail-able at Molecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Casey Bergman and Miaomiao He for earlyaccess to their data as well as helpful comments from CBThey also thank Rebekah Rogers for help with editing themanuscript This work was supported by the NationalInstitute of Health grant NIH R01-GM085183 to KRT theNational Library of Medicine - National Institute of Healthtraining grant LM007443 to JMC and the National Instituteof Health grant NIH R01-OD010974 to ADL and SJM

ReferencesAltschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic local

alignment search tool J Mol Biol 215403ndash410Aminetzach YT Macpherson JM Petrov DA 2005 Pesticide resistance

via transposition-mediated adaptive gene truncation in DrosophilaScience 309764ndash767

Bartolome C Maside X Charlesworth B 2002 On the abundance anddistribution of transposable elements in the genome of Drosophilamelanogaster Mol Biol Evol 19926ndash937

Bennetzen JL 2000 Transposable element contributions to plant geneand genome evolution Plant Mol Biol 42251ndash269

Bergman CM Bensasson D 2007 Recent LTR retrotransposon insertioncontrasts with waves of non-LTR insertion since speciation inDrosophila melanogaster Proc Natl Acad Sci U S A 10411340ndash11345

Biemont C Vieira C 2006 Junk DNA as an evolutionary force Nature443521ndash524

Birchler JA Hiebert JC 1989 Interaction of the Enhancer of white-apricotwith transposable element alleles at the white locus in Drosophilamelanogaster Genetics 122129ndash138

Birchler JA Hiebert JC Rabinow L 1989 Interaction of the mottler ofwhite with transposable element alleles at the white locus inDrosophila melanogaster Genes Dev 373ndash84

Bowen NJ McDonald JF 2001 Drosophila euchromatic LTR retrotran-sposons are much younger than the host species in which theyreside Genome Res 111527ndash1540

Carr M Soloway JR Robinson TE Brookfield JF 2002 Mechanisms reg-ulating copy numbers of six LTR retrotransposons in the genome ofDrosophila melanogaster Chromosoma 110511ndash518

Charlesworth B 1994 The effect of background selection against dele-terious mutations on weakly selected linked variants Genet Res 63213ndash227

Charlesworth B Jarne P Assimacopoulos S 1994 The distribution oftransposable elements within and between chromosomes in a pop-ulation of Drosophila melanogaster III Element abundances in het-erochromatin Genet Res 64183ndash197

Charlesworth B Langley CH 1989 The population genetics ofDrosophila transposable elements Annu Rev Genet 23251ndash287

Charlesworth B Lapid A 1989 A study of ten families of transposableelements on X chromosomes from a population of Drosophila mel-anogaster Genet Res 54113ndash125

Charlesworth B Lapid A Canada D 1992 The distribution of transpos-able elements within and between chromosomes in a population ofDrosophila melanogaster I Element frequencies and distributionGenet Res 60103ndash114

Chen GK Marjoram P Wall JD 2009 Fast and flexible simulation ofDNA sequence data Genome Res 19136ndash142

Chung H Bogwitz MR McCart C Andrianopoulos A Ffrench-ConstantRH Batterham P Daborn PJ 2007 Cis-regulatory elements in theaccord retrotransposon result in tissue-specific expression of theDrosophila melanogaster insecticide resistance gene Cyp6g1Genetics 1751071ndash1077

Cutter AD Good JM Pappas CT Saunders MA Starrett DM Wheeler TJ2005 Transposable element orientation bias in the Drosophila mel-anogaster genome J Mol Evol 6733ndash741

Daborn P Boundy S Yen J Pittendrigh B Ffrench-Constant R 2001DDT resistance in Drosophila correlates with Cyp6g1 over-expressionand confers cross-resistance to the neonicotinoid imidacloprid MolGenet Genomics 266556ndash563

Deloger M Cavalli FMG Lerat E Biemont C Sagot M-F Vieira C 2009Identification of expressed transposable element insertionsin the sequenced genome of Drosophila melanogaster Gene 43955ndash62

Dworkin I Palsson A Gibson G 2005 Replication of an Egfr-wing shapeassociation in a wild-caught cohort of Drosophila melanogasterGenetics 1692115ndash2125

Ewing B Green P 1998 Base-calling of automated sequencer tracesusing Phred II Error probabilities Genome Res 8186ndash194

Gruber JD Genissel A Macdonald SJ Long AD 2007 How repeatable areassociations between polymorphisms in achaete-scute and bristlenumber variation in Drosophila Genetics 1751987ndash1997

Haddrill PR Charlesworth B Halligan DL Andolfatto P 2005 Patterns ofintron sequence evolution in Drosophila are dependent upon lengthand GC content Genome Biol 6R67

Hahn MW Han MV Han S-G 2007 Gene family evolution across 12Drosophila genomes PLoS Genet 3e197

Hirschhorn JN Daly MJ 2005 Genome-wide association stud-ies for common diseases and complex traits Nat Rev Genet 695ndash108

2325

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 3: Abundance and Distribution of Transposable Elements in Two ...

another line were masked in the line with the lowest averagegenome coverage In addition to the regions removed due toIBD we removed four lines from the DGRP set which had anaverage genome coverage of lt10 (supplementary table S1Supplementary Material online) This filtering resulted in asample size of 131 DGRP lines some of which were partiallymasked to be used in all further analyses This removal of somelines and masking of regions in other lines resulted in variationin sample sizes across the genome For 98 of sites in thegenome the sample size was 68 For the remainder of sitescoverage was low in a substantial number of lines indicative ofgenomic regions where alignment of short reads is difficult

Coalescent simulations of an equilibrium WrightndashFishermodel showed that the probability of more than four pairsof lines with a single 1 Mb region of IBD matching our IBDcriteria is lt103 Therefore our observation of hundreds of1 Mb regions of IBD between pairs of lines (fig 1A) indicatedthat there was substantially more IBD in the DGRP thanexpected in a panel of randomly chosen individuals sampledfrom an idealized population However the extent of IBD wassimilar between the 15 DSPR founder lines and the 15 DGRPlines with the highest average sequence coverage (fig 1Band C) and was equal to a few dozen megabytes of IBDincluding a large proportion of centromeric regions It istherefore unclear whether the extensive IBD seen in 148lines of the DGRP (fig 1A) is a general property of largesamples of cosmopolitan D melanogaster or a specificproperty of the DGRP line collection

Data Filtering

Following the IBD analysis we identified transposable ele-ment insertions in a total of 146 lines 131 DGRP and 15DSPR founder lines TEs identified at a specific genomic loca-tion were either present in the D melanogaster referencesequence (version 513 downloaded from wwwflybaseorglast accessed January 30 2009) hereafter referred to as refer-ence TEs or not present in the reference sequence hereafterreferred to as novel TEs Centromeric and telomeric regionsdefined in the Materials and Methods section were excludedfrom the analysis as identification of TEs in these regions canbe problematic since their very high TE densities result in fewinformative read-pairs In addition we wished to focus on theeuchromatic regions of the genome since most genes arelocated in these regions We also removed from the datasets reference TEs which were lt75 of a full length copyof the element as these elements are unlikely to be activecopies These filtering steps reduced the number of referenceTEs considered in later analysis to 607 from a starting total of6003

Validation of Transposable Element PresenceAbsence CallsSimulationDetecting a set of inserted elements in silico showed that atan average genome coverage of 50 we were able to detect913 of elements that were at least 75 of the length of the

0 20 40 60 80 100 120

05

1015

0 20 40 60 80 100 120

05

1015

0 20 40 60 80 100 120

050

100

150

Descending Order of Coverage DGRP

X 2L 2R 3L 3R

A B

C

DGRP Top 15 by Coverage

DSPR

Mb Mb

Mb

Line

s (R

anke

d by

Mea

n C

over

age)

X

X

2L 2R 3L 3R

2L 2R 3L 3R

FIG 1 (A) Identity by descent in 148 DGRP lines (B) IBD in the top 15 DGRP lines by average sequence coverage (C) IBD in the 15 DSPR lines Maskedregions indicate regions of IDB 95 When two lines were considered IBD in a region the line with lower mean coverage was masked

2313

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

canonical element These are the set of events directly com-parable to the set of reference TEs we kept for our analysissince these are the set of TEs that are likely to be active copiesin the genome A drop in average genome coverage did notaffect our rate of identification in simulated data sets Wedetected 956 of elements at 25 coverage and 913 ofelements at 15 coverage

When we looked at the detection of all elements we foundthat we were able to identify the insertion locations for 70 ofall elements at 50 coverage 724 of elements at 25 cov-erage and 64 of elements at 15 coverage

PCR ValidationWe compared our pipelinersquos TE calls with PCR data from1687 PCR calls of TE insertions in 9 of the DGRP lines(Blumenstiel JP Chen X He M Bergman CM unpublisheddata httparxivorgabs12093456 last accessed June 262013) The comparison of our pipeline to PCR-validated in-sertions showed a 924 overall agreement between the twomethods (supplementary table S2 Supplementary Materialonline) when comparing presence and absence calls We didnot include the few examples of heterozygous insertions inthis analysis Our pipeline was unable to make a presence orabsence call for the set of PCR-validated insertions 6 of thetime and disagreed with the PCR data 16 of the time Failureto validate via PCR was equally likely for presence or absencecalls suggesting that detection of TEs was unbiased with re-spect to presence versus absence We estimate a pipelinesensitivity of 982 and a specificity of 94 based on thissubset of the data Our estimation of the validation rate isconservative requiring us to reconstruct a contig that eithercontains the TE breakpoint or spans over the insertion site inorder to call an event as either present or absent respectivelyFor some cases where our pipeline makes no call we did seeread pairs that are suggestive of the state of the presence orabsence of the TE but are not sufficient in number andoruniqueness for reconstruction the event de novo

Computational ValidationThe initial phase of our pipeline detected transposable ele-ment insertions in individual lines Following this initial de-tection phase our pipeline surveys every line at each genomiclocation where a TE insertion was identified in any other lineincluding lines in different data sets In this manner weattempted to computationally verify the presence or absenceof TE insertions at a total of 2596370 genomic locationsOur pipeline made presence or absence calls between9717 and 9998 of the time (supplementary table S3Supplementary Material online) Positive absence callsmade by our pipeline can also be considered a type of vali-dation since to make an absence call our pipeline must detectreads that span the junction where the insertion would belocated (supplementary table S3 Supplementary Materialonline)

The presence of regions of IBD amongst DGRP lines (fig 1)provided an opportunity to cross-validate TE calls in regionsmasked as IBD between different line pairs because we ex-pected TE genotypes to be identical within regions identified

as IBD We find a mean of 98 agreement for TE calls (bothpresence and absence) between pairs of regions identified asIBD with a standard deviation of 0015

Distribution and Abundance of TransposableElements

A total of 7104 transposable element insertions were foundin the DSPR and 17639 in the DGRP A comparison of the TEinsertion content of the DGRP and DSPR reveals that eachpanel contained a large number of TE insertions specific tothat panel only 1656 insertions were shared between theDSPR and DGRP data sets out of a total of 23087 insertionsdetected (ie 72 shared insertions) All transposable ele-ment insertion locations are listed in supplementarytables S4 and S5 Supplementary Material online A subsetof the DGRP data lines with an average sequence coverage25 mean sequence coverage 335 standard deviation51 hereafter DGRP25 (table 1) contained 5855 insertionsThe DGRP25 set of lines was used in some analyses to bemore directly comparable to the DSPR lines which all had amean genome coverage of 60 (fig 2)

In both resources TE insertions were generally at low fre-quency Overall we found a larger total number of insertionsper line in DSPR than in DGRP (fig 2) This was presumablydue to the much higher sequencing coverage of the formerleading to a higher power to make calls of complex insertionsMean sequence coverage for the DGRP lines was 195with astandard deviation of 64 mean sequence coverage for theDSPR was 671 with a standard deviation of 42 Howeverthe increased amount of time the DSPR lines have remained inlaboratory conditions may also have contributed to the ele-vated number of TEs seen in these lines (Nuzhdin et al 1997)

TEs may be located at random throughout the genome orthey may be found preferentially in certain genomic regionssuch as in intergenic sequence We used the binomial distri-bution as a null model for the distribution of TEs in thegenome assuming TEs insert and remain at random withrespect to these genomic features (table 2) We found thatin general the distribution of TEs observed differed from whatwe would expect if TEs were distributed at random through-out the genome We found significantly fewer TEs in exons50 UTRs and 30 UTRs than expected and significantly moreTEs than expected in intergenic and intronic regions (table 2)which was consistent with previous observations as TEs ingenes are much more likely to be deleterious (Kaminkeret al 2002 Lipatov et al 2005) This pattern held true forthe DGRP the DGRP25 and the DSPR suggesting that selec-tion may be acting universally to remove TEs that disruptexonic sequences

TE Identification

Information about the class of TE that has been inserted isuseful for two reasons First this information will allow forrapid assays for the presence of this TE in other strainsSecond insertions of different families of TEs can producedifferent phenotypic consequences (Smith and Corces1991) We were able to identify the TE class of the majority

2314

Cridland et al doi101093molbevmst129 MBE

of insertions (table 1) RNA elements were much morecommon than DNA elements and within RNA elementsover half were long terminal repeat (LTR) retrotransposonelements We were unable to identify element class for347 of the DGRP insertions and 320 of the DSPR inser-tions In cases where we were unable to identify the TE familyfor an insertion this was due to either conflict between themost likely TE as identified by the two reconstructed contigsfor each event or because we had limited data on the TEsequence for that event In cases where we have conflictinginformation this was due to 1) the degradation of the TEsequence so that there were many poor matches to multiplefamilies 2) conserved sequence between families and there-fore good matches to multiple families 3) because we iden-tified a nest of elements where multiple elements insert invery close proximity or within one another or 4) the amountof TE sequence we were able to reconstruct was less than

80 bp and therefore below our annotation threshold lengthBecause of the frequently short length of TE sequence that weattempted to identify we did not attempt to identify thestrand of the TE insertion Although TE orientation is impor-tant and influences selection upon the insertion (Cutter et al2005) our data were frequently insufficient to resolve strand

Factors Influencing TE Density

If TE insertions are deleterious and recessive with respect tofitness then insertions are also predicted to be morecommon on the autosomes since deleterious insertions onthe X will be eliminated via selection against hemizygousmales (Charlesworth and Langley 1989) Contrary to this pre-diction an analysis of variance (ANOVA) indicated a signifi-cant effect of chromosome on mean TE density per line whenboth resources were included in the same analysis (table 3)This difference was due to an increased mean density of TEson the X versus the autosomes in regions of high recombi-nation This pattern was the same both for autosomes takenas a group and for each autosome individually (table 4)

TEs are predicted to accumulate in regions of low recom-bination where they are less likely to produce deleteriousrearrangements via ectopic exchange (Charlesworth andLangley 1989) An ANOVA indicates a highly significanteffect of recombination rate on mean TE density per line(table 3) TE density was also found to be greater in regionsof low recombination than in high recombination through-out the genome which was consistent with the predictions ofCharlesworth and Langley (1989)

Because the factors that influence TE density may differgreatly from family to family we repeated this analysis in 14families of TEs that appear in moderate to high abundance inthe two resources (table 5) Many of these families were pre-viously examined by Montgomery et al (1987) and Bartolomeet al (2002)

We found that in 10 families (412 roo 176 F Bari1 copiahopper INE-1 mdg1 springer) there was a highly significantdifference in mean TE density between the X and the auto-somes (P value 0026 in all cases) although the direction ofthe difference varied among families Within these familiesthere were substantial differences In two of the familiesINE-1 and roo we saw an increased density in the X overthe autosomes in regions of high recombination in boththe DSPR and DGRP25 resource (table 5) However for 412F mdg1 springer Bari1 and copia we saw the opposite pat-tern This suggests that there may be substantial variation inTE insertion site preferences among families

Sample Counts of Transposable Elements

The power to detect an association is dependent both on thepenetrance and the frequency of the causal allele (Hirschhornand Daly 2005) We found that the majority of transposableelements in the DGRP25 and the DSPR are found in only oneline (fig 3) Comparisons between the observed TE data andSNPs from the same population found in introns 86 bp(Haddrill et al 2005) indicated an excess of TEs (DGRP252 test Pamp 0 df = 6 DSPR 2 test Pamp 0 df = 6)

0 20 40 60 80 100

020

040

060

080

010

00

Coverage

TE

s

DSPRDGRP gt 25x coverageDGRP lt 25x coverage

FIG 2 Total number of TEs identified versus coverage for the DGRPDGRP25 and DSPR lines

Table 1 Summary of TE Insertions

DGRP DGRP25 DSPR

Total 17639 5855 7104

Not present in reference 17346 5615 6812

Present in reference 293 240 292

X recombination 2 cMMb 418 149 203

X recombinationgt 2 cMMb 2741 948 1107

Autosomes recombination 2 cMMb 6205 2131 2694

Autosomes recombinationgt 2 cMMb 8177 2544 3019

4 all 98 83 81

Exon 1158 378 633

Intron 8595 2870 3310

30 UTR 477 159 234

50 UTR 190 63 81

Intergenic 7219 2680 3269

DNA elements 1388 514 748

RNA elements 10133 3004 4085

Indeterminate 6118 2336 2271

2315

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Comparisons between the observed data and the expectednumber of events at the same count under an infinite sitesmodel of an equilibrium Wright-Fisher population (Wakeley2009 p 54ndash56) also indicated an excess of rare alleles(DGRP25 2 test Pamp 0 df = 6 DSPR 2 test Pamp 0df = 6) In addition to overall site count spectra we also gen-erated spectra for the X autosomes regions of high and lowrecombination DNA elements and RNA elements In eachcase we observed the same pattern as seen in the overallspectrum (data not shown) We also compared the distribu-tion of counts in the DGRP25 with the rescaled distribution ofcounts in the DSPR rescaled to the same sample size as theDGRP25 A KolmogorovndashSmirnov test comparing these twodistributions showed that they did not differ (P = 021) Wedid see more overall TEs in every count category in therescaled DSPR data over the DGRP25 data set

Patterns of Transposable Element Insertions betweenGenes

For association studies genotyping TE insertions withincoding regions is of interest due to the potentially large

effect of an insertion on gene function A total of 2931genes (213 of annotated genes) had insertions in theDGRP resource These are insertions which exist anywherewithin the span of the gene from 50 UTR to 30 UTR Onethousand nine hundred fifteen genes 139 had insertions inthe DSPR with 1317 genes having insertions in bothresources The number of TE insertions within genes variedsubstantially between individual genes up to a maximum of78 insertions though the majority of genes with insertionshad only one insertion 631 in the DSPR and 527 in theDGRP The number and locations of individual insertions in agene varied tremendously though the majority of insertionsthat fell within gene regions were in introns 777 DSPR and824 DGRP TE number was strongly correlated with intronsize (Pearsonrsquos correlation coefficient = 079 for DSPR 087for DGRP Pamp 0 for both tests) (supplementary fig S2Supplementary Material online)

Examining the pattern of insertions in individual genes inthe DSPR and DGRP data sets we found that RNA-bindingprotein 6 had the highest number of insertions 62 in theDGRP lines and 21 in the DSPR lines These insertions were

Table 3 ANOVAs for DSRP and DGRP25 Coverage and Comparison between the Two Data Sets

df Sum Sq Mean Sq F value Pr(gtF)

DSPR vs DGRP25

Set 1 54125 54125 6486975 lt2e16

Line 36 2848 791 94817 lt2e16

Chromosome 4 11858 2965 355313 lt2e16

Recombination rate (high vs low) 1 25575 25575 3065207 lt2e16

Chromosomerecombination rate 4 10335 2584 309663 lt2e16

Residuals 333 27784 083

DSPR

Line 14 26599 19 17894 004699

Chromosome 4 39632 9908 93316 123e06

Recombination rate (high vs low) 1 141337 141337 1331143 lt2e16

Chromosomerecombination rate 4 39712 9928 93504 119e06

Residuals 126 133783 1062

DGRP25

Line 22 258202 11736 1723 lt2e16

Chromosome 4 80126 20031 29409 lt2e16

Recombination rate (high vs low) 1 120011 120011 176191 lt2e16

Chromosomerecombination rate 4 66056 16514 24244 243e16

Residuals 198 134866 0681

Table 2 Differences between Observed and Expected TE Counts

DGRP DGRP25 DSPRObserved vs Expected P Value Observed vs Expected P Value Observed vs Expected P Value

X recombination 2 cMMb Decrease 329E11 Decrease 187E03 Decrease 522E02

X recombinationgt 2 cMMb Decrease 174E01 Increase 701E02 Increase 416E01

Autosomes recombination 2 cMMb Decrease 370E01 Increase 100E + 00 Increase 687E04

Autosomes recombinationgt 2 cMMb Increase 392E02 Decrease 100E + 00 Decrease 100E + 00

Exon Decrease 000E + 00 Decrease 247E323 Decrease 103E283

Intron Increase 354E103 Increase 000E + 00 Increase 000E + 00

30 UTR Decrease 699E16 Decrease 313E06 Decrease 126E02

50 UTR Decrease 143E51 Decrease 282E18 Decrease 498E20

Intergenic Increase 112E32 Increase 0 Increase 000E + 00

2316

Cridland et al doi101093molbevmst129 MBE

typically unique to a single line though there were 3 insertionsin more than one line in the DSPR and 14 insertions in morethan one line in the DGRP

Some genes accumulated many TE insertions such asklarsicht (fig 4A gene images from the UCSC genome brow-ser Meyer et al 2013) which had 29 total insertions This genealso had a hotspot of insertions with seven independent in-sertions each at low frequency located within 36 kb of each

other Only one of these insertions was present in any givenline and these could be functionally equivalent though inde-pendently arising mutations Other genes had only a few in-sertions such as Notch (fig 4B) and Delta (fig 4C) Theinsertions in these genes were also at low frequency andthe few present were located in intronic regions

While most insertions exist in only one panel some werefound in both Cyp6a20 (fig 4D) contains a high frequencynon-reference TE insertions present in both data sets in 4lines out of 121 lines where we were able to make a presenceor absence call in the DGRP (hereafter shown as 4121) and314 lines in the DSPR Both Cyp6a20 and klarsicht are exam-ples of genes with TE insertions in exons which was

00

02

04

06

08

10

1 2 3 4 5 6 7+

Observed TEsObserved SNPsExpected

DSPR

00

02

04

06

08

10

1 2 3 4 5 6 7+

Observed TEsObserved SNPsExpected

DGRP

FIG 3 Derived allele count spectra for the DSRP lines and the DGRP25lines where a positive presence or absence call was made for eachinsertion in each line 6613 insertions in the DSPR and 3274 in theDGRP25 Count spectra for SNPs is from SNPs in introns 86 bp 2

tests between observed and expected distributions result in Pamp 0 forcomparisons between TEs and the neutral model as well as between TEsand SNPs for both data sets

Table 5 TE Density for 15 Individual Families of Elements

Mean Density (TEMb)

Element Resource X High X Low Auto High Auto Low

roo DGRP25 039 037 032 033DSPR 080 088 065 068

297 DGRP25 005 000 002 001DSPR 005 000 004 000

412 DGRP25 003 001 003 007DSPR 006 008 009 012

F DGRP25 002 006 004 006DSPR 007 008 012 013

176 DGRP25 001 000 001 002DSPR 002 000 002 006

Bari1 DGRP25 001 000 004 001DSPR 002 002 006 002

copia DGRP25 001 000 002 003DSPR 009 002 014 019

H DGRP25 001 000 001 000DSPR 000 000 001 000

hopper DGRP25 010 001 000 000DSPR 010 004 002 002

INE-1 DGRP25 042 188 003 048DSPR 045 203 004 055

jockey DGRP25 015 012 019 013DSPR 035 034 040 033

mdg1 DGRP25 003 002 003 010DSPR 008 004 012 018

pogo DGRP25 006 004 007 005DSPR 013 021 015 013

springer DGRP25 000 001 001 003DSPR 002 000 001 005

Table 4 TE Density in the X and Autosomes

TEMb

DGRP25 DSPR Both

X all 382 638 483

Autosomes all 351 593 447

X high recombination 371 600 461

X low recombination 393 676 505

2L high recombination 227 473 324

2L low recombination 323 589 428

2R high recombination 282 489 364

2R low recombination 617 863 714

3L high recombination 296 523 386

3L low recombination 474 726 573

3R high recombination 281 468 355

3R low recombination 372 670 490

2317

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

uncommon both in this data set and in previous studies(Kaminker et al 2002 Lipatov et al 2005)

Differences in the strength of selection against closelyrelated genes can also be illustrated by patterns of TE inser-tions The paralogs derailed (fig 4F) and derailed-2 (fig 4E)showed very different patterns of TE insertions though thesame pattern was seen in both resources derailed-2 locatedon chromosome 2R is in a region of moderate recombination199 cMMb whereas derailed is located towards the distalend of 2L in a region of low recombination 044 cMMb

Given the context of the recombination rates the expectationwould be that derailed would have a higher TE load sincedeleterious alleles are removed more efficiently in regions ofhigh recombination (Hudson and Kaplan 1995) but theopposite pattern was observed

While both genes play a role in Wnt5 signaling mutationsin derailed can cause major phenotypic changes in Drosophilanervous system development resulting in the loss of normalfunction (Yoshikawa et al 2003) derailed-2 when mutatedproduces only minor differences in neuron positioning

115 115115 215 115115

114 1122

11181120 11181124

2117

21182119

1122

1119 1118

2118

1121 3114

1115

1114

1119

1113

1119

1120

1120

1119

1120

3119

1119

A klarsichtDGRP

DSPR

Insertion in an Exon

314 1195

4121D Cyp6a20

315

1123

1415

56125

115115

415

115615

10124

1124 1124 1123

2124E derailed-2

B Notch

115 11211122

C Delta1114 1120

191

F derailed

Same insertion in both resources

FIG 4 Transposable element insertions in genes The frequency above each insertion is the number of lines in which the element is present over thenumber of lines in which the element is validated as either present or absent Gene images are from the UCSC genome browser (httpgenomeucscedu last accessed June 31 2012)

2318

Cridland et al doi101093molbevmst129 MBE

(Sakurai et al 2009) This suggests that either the strength ofselection may be much higher against insertions in derailed orthe insertion rates are unequal between the paralogs

Gene Family Size and TEs

We used available information on gene families in DrosophilaHahn et al (2007) grouped into families based on sequencesimilarity and found a very weak but significant effect of genefamily size on the average number of TE insertions per kbin both resources (P value = 00003 R2 = 0001 DSPRP value = 0002 R2 = 0007 DGRP) (fig 5A and B) Althoughthere was a large variance in the number of insertions per kb

in both resources (fig 5A and B) the average number ofTEskb increased as family size increased up to a moderatelylarge family size

Patterns of Variation between Resources

To replicate an association with a single causative allele thatallele must both exist in both panels and also be at highenough frequency to detect the association Given thatmost TEs were singletons and only 45 of insertion eventswere identified in both panels and were within genes thereplicability of associations with these alleles is likely to below if alleles are considered individually We examined sharedinsertions segregating at different frequencies between panelsand found only two TE insertions in genes that were segre-gating at different frequencies in the DGRP versus DSPRresources following Bonferroni correction (Fisherrsquos exacttests P 000007) One of these was in an exon of bothCG13175 and CG33964 a pair of overlapping genes Herethe TE insertion was at much higher frequency in theDGRP than in the DSPR (89122 vs 215) The other insertionwas in an intron of Caliban and the frequency of this insertionwas higher in the DSPR than the DGRP (1115 vs 7116)

If insertions are considered as a class of mutation thenthere may be better power to detect an association assumingthat different insertions contribute in the same way to thephenotype We also examined genes with TE insertions(139 of genes) though not necessarily the same insertionin both resources Treating all insertion alleles within a gene asequivalent a comparison between resources showed thatthere were 21 genes which contain TEs at different frequen-cies between the DGRP versus DSPR following Bonferronicorrection (Fisherrsquos exact tests P 0000038197) includingderailed-2 (supplementary table S6 Supplementary Materialonline)

Discussion

Distribution and Abundance of TransposableElements

Previous work on transposable elements has generally beenfocused on a few families of TEs (Montgomery et al 1987Charlesworth and Lapid 1989 Charlesworth et al 1992) orfocused only on a single genome (Bartolome et al 2002Kaminker et al 2002 Rizzon et al 2002) We have presenteda genomic analysis of TEs across a large number of lines andtwo data sets We found that transposable elements in bothresources are primarily found as singletons with a few mod-erate to high frequency insertions Transposable elements alsosegregated at lower allele frequencies than do presumed non-functional SNPs drawn from the same data set suggestingthat negative selection is acting upon TEs However theremay be substantial variation in the age of transposable ele-ments in different families many LTR elements are thought tobe young insertions (Bowen and McDonald 2001 Bergmanand Bensasson 2007) while other families like INE-1 arethought to be quite old (Kapitonov and Jurka 2003) Thismay create a situation where the young age of insertions isresponsible for the excess of rare alleles We found a

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

DSPR

DGRPA

B

FIG 5 log10(TE density) versus log10(Gene family size)

2319

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

non-random distribution of TE insertions across the genomewith increases of TEs in both the intergenic and intronicregions of the genome and decreases in exonic and UTRregions (Kaminker et al 2002 Lipatov et al 2005) Similar toprevious studies we found variation in TE density betweenchromosomes and higher densities of TEs in regions of lowrecombination (Charlesworth et al 1994 Bartolome et al2002) We also found increased TE densities on the X relativeto the autosomes both in overall insertions and in someindividual TE families (table 5)

Detection method and the size of the data set are likely tobe contributing factors to differences between previous workand this study Many previous studies have used in situ meth-ods to ascertain TE number in samples and have looked atelement accumulation at the base versus the midsections ofchromosomes (Montgomery et al 1987 Charlesworth et al1992) or have looked only at one genome (Bartolome et al2002 Rizzon et al 2002) A previous PCR-based detectionstudy found no difference in TE frequency between the Xand the autosomes in natural populations (Petrov et al2011) but this study was biased in that it only looked atTEs shared with the reference sequence and thus did notcapture the full picture of TEs in these populations

While our analysis does disagree with previous work thesedisagreements may be mainly due to the way in which regionsof the genome are divided for examination Rizzon et al(2002) found an increase in copies of LTR retrotransposonson the X when they excluded pericentromeric telomeric andchromosome 4 from their analysis though they found a def-icit of TEs on the X when these regions are included Thisfinding is most directly comparable to our own finding for theentire TE data set since they exclude a very similar set ofgenomic regions We do agree with previous work in ourfinding of higher densities of TEs in regions of low recombi-nation than in high recombination (Charlesworth et al 1994Bartolome et al 2002)

Examining individual element families Carr et al (2002)found an increase of mdg-3 elements and 297 elements on theX than would be expected under a random insertion modelMontgomery et al (1987) found no evidence for a reductionin TEs on the X for both roo and 297 elements but foundfewer TEs on the X when looking at the 412 family HoweverCharlesworth et al (1992) observed the opposite pattern forroo and 297 Montgomery et al (1991) also found that moreroo elements in lab kept lines than they had previously ob-served in natural populations We found significantly in-creased densities on the X for roo and significantlyincreased densities on the autosomes for 412

Differences between previous studies of individual elementfamilies and our results may be due to a variety of reasonsFirst we were able to annotate only about two-thirds of ourTE insertions This was primarily due to having insufficientsequence to reliably annotate the element or because inde-pendent annotation using sequence reconstructed fromeither end of the insertion did not yield the same elementfamily annotation This means that our density calculationsfor individual elements may be skewed by missing annotationinformation Second many previous studies (Montgomery

et al 1987 Charlesworth et al 1992 Carr et al 2002) examinedonly LTR elements These elements are young (Bowen andMcDonald 2001 Bergman and Bensasson 2007) and thus in-dividual copies may not yet have been removed from thegenome thereby displaying different patterns of insertionthan older active elements

What is clear is that there is great variability in TE densityboth within and between element families as well as withinand between chromosomes It may also be that the overallpattern is driven by a small group of families In our data setINE-1 shows the largest pattern of difference between the Xand autosomes with an order of magnitude increase in meanTE density per line on the X over the autosomes possibly dueto an increased likelihood of fixation of insertions (Omettoet al 2005) specifically on the X chromosome (Presgraves2006) These INE-1 elements are a subset of the largest TEfamily in D melanogaster though only several dozen of themare in the euchromatin INE-1 elements are also considered tobe inactive in D melanogaster and likely have not been mobilefor gt3 million years (Kapitonov and Jurka 2003) Consistentwith this most of the INE-1 elements we detected were fixedor nearly fixed in our data set We did see some examples oflow frequency insertions which may be due to more recenttranspositional events or incorrect TE annotation due toshort sequences The old age of INE-1 insertions also suggeststhat the extant elements are unlikely to be selected againstAn increased density of TEs on the X compared with theautosomes suggests that it is not the fitness effects of inser-tions which govern TE copy number Fitness effects wouldresult in the more rapid removal of insertions on the X thaninsertions on the autosomes due to the increased selectivepressures experienced on the X (Montgomery et al 1987)Background selection also predicts that the X should be lessburdened by deleterious recessives since it is hemizygous onethird of the time (Charlesworth 1994)

However it is unclear whether our observation of in-creased densities of TEs on the X relative to the autosomesfits with the ectopic exchange model This model predictsthat TEs will accumulate in regions of low ectopic exchange(Langley et al 1988) The general assumption in the literatureis that ectopic exchange is positively correlated with recom-bination rate (Langley et al 1988 Montgomery et al 1991)The X chromosome experiences higher levels of recombina-tion than the autosomes with an average of 36 cMMb forthe X in regions of high recombination (cMMbgt 2) whereasthe autosomes have a mean of 336 cMMb If recombinationrate and ectopic exchange are positively correlated through-out the genome then the expectation would be fewer TEs inthe middle regions of the X where recombination rates areknown to be high This is the opposite of what we observe Ifrecombination rate and the rate of ectopic exchange are notpositively correlated then the ectopic exchange model maybe correct but data are lacking to properly address this issueCurrently there are no direct measurements of the rate ofectopic exchange on a genome-wide scale in Drosophiladuring meiosis Montgomery et al (1991) observed littleectopic exchange between homozygous chromosomes inDrosophila melanogaster on the X chromosome in the

2320

Cridland et al doi101093molbevmst129 MBE

region around the white locus They also found that whilerecombination rate was reduced in both centromeric andtelomeric regions only the centromeric region displayed anincreased density of TEs This information is suggestive ofdifferences between these regions in the rate of ectopicexchange

The DGRP and DSPR are also composed of inbred lineswhich have been kept in laboratory conditions for some time~9 years in the case of the DGRP lines and 40+ years in thecase of the DSPR Given that TE insertion mutations accumu-late rapidly and in an increasing manner over time (Nuzhdinet al 1997) the increased density of TEs on the X versus theautosomes may not be surprising if there is a highertransposition rate to the X The only direct study to datefocused on roo elements in D melanogaster and did observehigher rates of insertion in the X than the autosomes(Vazquez et al 2007)

Transposable Elements in Genes

The majority of genes 813 do not have any TE insertions inany line studied and intron length is a good predictor of thenumber of TEs within a gene However a few genes accumu-late many insertions even after correcting for gene size andintron length and some genes have very few TEs even thoughthey have large introns The reasons why different genes mayaccumulate TE insertions may be quite different for individualgenes In some cases there may be hot spots for insertionswhich if insertions are neutral or only slightly deleterious mayresult in large number of insertions in a small region In othergenes the lack of TE accumulation may suggest selectivepressures against TE accumulation

Gene family size may also play a role in TE accumulationWe see an increase in TE density correlated with an increase ingene family size This may indicate differences in selectivepressures for genes in different sized families Single-copygenes may experience more constraint whereas in moder-ately sized gene families these constraints may be more re-laxed perhaps due to redundancy

In the cytochrome p450 functional family of genes we findthat 3789 of these genes have TE insertions 2589 in theDGRP and 2089 in the DSPR Cytochrome p450 is also foundto be in an enriched term in the DSPR for the set of genescontaining TEs The cytochrome p450 gene Cyp6g1 is knownto confer resistance to DDT (Daborn et al 2001) and TE-mediated copy number variation is associated with increasingresistance to DDT (Schmidt et al 2010) This pattern of TEinsertions in the DSPR may be reflective of the locationswhere DSPR lines were originally collected The DSPR lineswere gathered from various locations worldwide at a timewhen different pesticides were in use The DGRP lines weregathered from a single geographic area in North America in2003 Presumably these sets of lines have experienced differ-ent selective pressures with respect to pesticide resistance InCyp6a20 (fig 4D) where TEs are segregating at different fre-quencies in the two resources some similar effect may beoccurring

TE Detection Pipelines

The TE detection strategy used here is an improvement uponthe strategy used in Mackay et al (2012) The most notabledifferences between the data reported in Mackay et al (2012)and the DGRP data reported here is substantial amount ofnew information First our pipeline improvements agree wellwith the previous implementation with 812 of previous TEpresence calls identified by the new pipeline However thesecalls represent on 182 of the presence calls made by theimproved pipeline demonstrating that we have identifiedsignificantly more calls than before The incorporation of TEabsence calls also allows us to provide a much more completepicture of TEs in euchromatic sequence in the DGRP

The TE detection pipeline used here is similar to otherpipelines which have been employed but it is useful to de-scribe briefly here the differences in approaches and why wefollowed the approach presented here A useful discussion ofcommon approaches to TE detection in many species can befound in Xing et al (2013) One common method for TEdetection is the split-read mapping approach which hasbeen employed in Drosophila by Linheiro and Bergman(2012) This approach identifies TE insertions by locating in-dividual sequence reads which span TE insertion breakpointsThis approach is the strategy of choice for next-generationsequencing experiments which produce only single-endreads However in a paired-end read situation this approachmay not take full advantage of the available data Specificallythe size of window around the TE insertion breakpoint whereinformative reads can be found is governed by read length in asingle-end experiment and fragment length in a paired-endexperiment Linheiro and Bergman (2012) discuss the impor-tance of read length on TE detection and suggest that in-creased read length improves TE detection In situationswhere read length is long and coverage is deep enough thesplit-read method may be preferable to the paired-end de-tection strategy However for the data sets analyzed herewhere read length is much shorter than fragment lengththe paired-end detection strategy will capture moreinformation

The pipeline used here also has similarities to the pipelineused in Kofler et al (2012) However there are two majordifferences in these studies First Kofler et al (2012) do notattempt to reconstruct their insertion breakpoints andsecond the population studied by Kofler et al (2012) wassequenced as a pool rather than as individual lines Becausetheir pooled sequence data reflect on average less than 1sequence coverage per line Kofler et al (2012) are primarilyable to detect intermediate to high frequency TE insertionstheir own analysis concludes that they cannot reliably detectinsertions at frequency lt7 (see supplementary fig S2Supplementary Material online in their supplementary dis-cussion) However the literature to date (Charlesworth andLangley 1989 Charlesworth and Lapid 1989 Charlesworthet al 1992) as well as the results of this study indicate thatthe majority of TEs in a population are found at very lowfrequency often in only one individual in a population andtherefore beyond the scope of detection of Kofler et al (2012)

2321

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Caveats

The D melanogaster reference genome to which we alignedour data is a P-element-free genome Therefore the use of thisreference will be biased against detecting any P elementsexisting in the data sets In the case of the DSPR data setthis is not an issue since these lines are also all P-element-free(Macdonald and Long 2007) However it is likely that theDGRP lines do harbor P elements For the purpose of theseanalyses we decided to ignore P elements and focus on theset of element families common to the two resources

This situation does highlight the importance of having anappropriate reference sequence To detect P elements in thegenome of D melanogaster an artificially constructed refer-ence sequence or a different assembled sequence containingthe P element would need to be utilized This also brings upthe question about as yet undetected elements that exist atlow frequency If these elements do not have sufficient sim-ilarity to existing elements to be aligned to the referencegenome these elements will go undetected

Conclusions

We have presented a description of transposable elements intwo recently released QTL resources Transposable elementscan play important roles in gene function and regulation andincorporation of TE insertion information into any analysisperformed with either of these two resources will present amore complete picture of genomic variation and its contri-bution to complex traits In addition it will be important forassociation studies attempting to replicate genotypendashpheno-type associations utilizing these resources to be aware of thepotential contributions of TEs

We have also demonstrated a next-generation high-throughput sequencing analysis pipeline that is capable ofdetecting this type variation with a high rate of precisionand specificity We suggest that future studies using next-generation sequencing data utilize this technique

Materials and Methods

Data Sets

Two data sets were used in this analysis The first set consistsof 131 inbred lines from the Drosophila Genetic ReferencePanel (DGRP) (Mackay et al 2012) SRA accessionSRP000694 The second set consists of 15 lines used to es-tablish the Drosophila Synthetic Population Resource (DSPR)(King et al 2012a) initially described in Macdonald and Long(2007) SRA accession SRA051316 For the DGRP resourceboth 454 and Illumina data were available whereas onlyIllumina data are available for the DSPR We utilized onlythe Illumina sequence data to ensure consistency in datacollection across data sets We calculated a mean coverageof 21 for the DGRP lines and 50 for the DSPR linesVariation in coverage for the DGRP lines was substantial be-tween lines ranging from ~4 average coverage to nearly50 average coverage We therefore dropped from the anal-ysis four lines with an average coverage of 10 or less sincelow coverage can cause difficulties with assembly and TEidentification Sequence data for both data sets consisted of

paired-end data with 54 bp reads for the DSPR and paired-end 100 bp reads trimmed to 75 bp for DGRP

IBD in DGRP Data Set

We downloaded the SNP tables generated by the DGRPproject from httpwwwhgscbcmtmceduprojectsdgrpfreeze1_July_2010snp_callsIllumina (last accessed July 302010) to examine identity by descent in this data set Weperformed an all by all comparison between lines examiningsliding windows of 1 Mb across the genome with 100 kb stepsbetween windows We take a simple definition of IBD hereWhenever we identified two lines that weregt95 identity atSNP positions in a 1 Mb region we labeled the pair as IBD inthe identified region and masked that region of the genomein the line with lower coverage We also dropped entirelyfrom our analysis any lines where gt50 of the genome wasdetermined to be IBD with other lines in the population(supplementary table S1 Supplementary Material online) re-moving from the analysis the line with lower average se-quence coverage

We modeled 1 Mb regions of 200 chromosomes in a coa-lescent simulation using the parameters theta = 001site and4Nr = 10theta to determine the number of regions of95IBD under a Wright-Fisher population model using the MACSsoftware (Chen et al 2009) We simulated 1000 replicates ofthis simulation and tabulated the number of 1 Mb regionsshowing 95 IBD in each replicate

Alignment

We aligned reads to the D melanogaster reference genome(version 513 downloaded from FlyBase flybaseorg) using thealigner BWA (version 059 Li and Durbin 2009) We used thefollowing parameters (aln -t 8 -l 13 -m 50000000 -R 5000)followed by the command ldquosamperdquo to resolve paired endmappings (using parameters -a 5000 -N 5000 -n 500) It isimportant when detecting TEs that the -R during the initialalignment and the -N parameter during the ldquosamperdquo phase beset high otherwise highly repetitive sequences that would beinformative of the presence of a common element will beexcluded from the data by the aligner We also used the -Ioption for the DGRP lines but not the DSPR due to the dif-ferences in Illumina quality score output format as the twodata sets were sequenced at different times The -m parameterfrom the initial alignment step and the -n parameter from thepaired-end resolution step deal with how BWA treats multiplyaligning reads An additional caveat is that some TE familiesmost notably the P-element family is absent from the Dmelanogaster reference sequence Such families in the se-quenced genomes will not align to the reference genome weused even if they are present in the sampled lines

Transposable Element Detection

Transposable elements were detected by first identifying allreads that were aligned to an annotated TE in the referencegenome (supplementary fig S1 Supplementary Material on-line) We then identified the mates of these reads and selectedonly those mates which did not align to an annotated TE

2322

Cridland et al doi101093molbevmst129 MBE

These reads identify the set of read-pairs which span a TEinsertion The uniquely mapping non-TE reads were thenclustered based on their start position and chromosomalstrand to identify putative TEs Clusters on the plus strandwere matched to clusters on the minus strand to producepairs of clusters indicating either end of the putative TE Onceclusters were identified we filtered out all events which con-tained fewer than three read-pairs consistent with a giveninsertion for each cluster This means that putative eventsat this stage have a minimum of six read-pairs indicating theinsertion three on the plus strand and three on the minusstrand

We then extracted all reads and their mates that aligned tothe identified regions of the genome and attempted toreconstruct the local area including the breakpoint of theTE Because many TEs have repetitive regions at the ends ofthe element we reconstructed the two breakpoints of the TEseparately (see also supplementary fig 23 from Mackay et al2012) Local reconstruction was done by running Phrap (ver-sion 1090518 Ewing and Green 1998) with the followingparameters (-forcelevel 10 -minscore 10 -minmatch 10)This set of parameters causes Phrap to do a Smith-Waterman comparison when it finds matches of at least 10characters but then relaxes the minimum alignment scoreand the final assembly parameters This is necessary to recon-struct TEs because repetitive sequence within the TEs canresult in a failure to reconstruct a contig if stricter assemblyparameters are used All code we developed for the detectionof TEs will be freely available at wwwmolpopgenorgData(last accessed August 1 2013)

Following reconstruction we used BlastN (verson 2222) tocompare reconstructed contigs to the D melanogaster refer-ence to confirm the presence of a TE In most cases we wereable to fully reconstruct both breakpoints of the TE Fullyreconstructed indicates that we could reconstruct a singlecontig which contained the breakpoint of the TE In mostcases the two breakpoints of the TE did not resolve to thesame nucleotide but instead identify the location of thetarget site duplication for the insertion (Linheiro andBergman 2012 similar to fig 1) These are cases where theupstream estimate of the breakpoint is slightly higher thanthe downstream estimate of the breakpoint and the differ-ence in position can be seen as the likely span of the target siteduplication However sometimes we were not able to fullyreconstruct a contig but instead generated two contigs onecontaining uniquely aligning sequence and the other contain-ing TE sequence These contigs are known to belong togetherbecause reads forming one contig have their mates in theother contig In these cases we have an approximate but notprecise estimate of the TE breakpoint and are unable toidentify the target site duplication

After the first round of detection we performed a secondround of detection where we examined every location wherea novel TE was detected in any line in all other lines Thisincluded examining positions in one data set that were iden-tified in the other data set In addition we examined thelocations of all known TEs in the D melanogaster genomeFor each location queried in each line we extracted reads

aligning to that location and their mates and used Phrap toreconstruct a contig This allowed us to produce presenceabsence or missing data calls for all positions where TE wasdetected in all lines The absence of a TE is determined byidentifying reconstructed contigs that spanned the insertionposition of the TE by at least 15 bp on either side Theseabsence calls were not included in Mackay et al (2012)This also allows us to identify TEs that may be in regions ofthe genome with lower average sequence coverage comparedwith the rest of the genome provided that insertion is sharedwith another line where it was possible to detect the insertion

Transposable Element Annotation

Annotation of novel TEs was performed by aligning recon-structed contigs with BlastN to the set of TEs found in theD melanogaster reference We chose the best match basedon overall length and BLAST reported e-value and called thenovel TE a TE of that family We also required a minimum of80 bp of contig matching a TE to make an annotation callThis is an additional annotation analysis to the one we per-formed in Mackay et al (2012) For each of the two contigsproduced for each insertion we determined the best matchand then compared these to each other TEs are classified asper the classification scheme put forth in Wicker et al (2007)We annotated each TE insertion to the highest level of agree-ment between the two contigs if the two contigs disagreedwith each other at every level of classification we called thatelement ldquoundeterminedrdquo This could result from nested in-sertions of elements where the 50 end of the insertion is truly adifferent element from the 30 end or because the TE sequenceis degraded and differs substantially from the referencesequence or because there was not enough TE sequence tomeet our annotation guidelines Because we can only recon-struct a maximum of few hundred bases at either end of theinsertion it is difficult to distinguish between these issues Wealso did not attempt to determine the strand of the TE

Validation

A subset of 190 TE insertion sites shared with the D melano-gaster reference were validated via PCR in a subset of nineof the DGRP lines by Blumenstiel JP Chen X He M BergmanCM (unpublished data httparxivorgabs12093456 lastaccessed June 26 2013) To independently check IBD callsusing SNP data we also determined if pairs of lines wereIBD for TEs in the same regions In this comparison we onlyincluded TE insertions where both lines have a positive pres-ence or absence call This can also serve as a validation of ourTE calling pipeline since if two segments are IBD they shouldshare the same set of TEs

Simulation

We selected a set of elements of different sizes and of TEclasses from the set of TEs in the reference sequence andinserted them semi-randomly into the D melanogaster refer-ence genome in silico Insertions were semi-random since weavoided areas already occupied by reference TEs and the cen-tromeric and telomeric regions Read pairs were then

2323

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

generated to construct fastq files We then ran our TE detec-tion pipeline as above through the initial annotation stepThis process was repeated three times to generate a simula-tion of 15 25 and 50 coverage

Recombination Rates

We wrote an R routine to estimate local recombination ratesas a function of physical position Briefly we used ldquocyto-ge-netic-seqtsvrdquo from wwwflybaseorg (last accessed January 302009) to assign a set of physical landmarks to genetic posi-tions For each chromosome arm we then fit a local polyno-mial relating cumulative cM to cumulative basepair (usinglocpoly) Unlike many prior local polynomial fits we madethe added assumption that this polynomial must be a mono-tonically increasing function (using monoproc) an assump-tion that is true for real data We then fit a smooth spline tothe above curve (using smoothspline) The advantage ofsmoothspline is that the resulting curves can be differentiated(and hence estimates of rates obtained directly) Finally we fita smooth spline to the first derivatives obtained from theprior smooth spline largely to make the recombinationrates less noisy near centromeres and telomeres TheR code is freely available at wwwmolpopgenorgData (lastaccessed August 1 2013)

Data Set Restriction

Analyses performed on these data sets were restricted in twoways First we restricted our analyses to regions of thegenome with low TE density and moderate recombinationrates We restricted our data set in this way because regions ofhigh TE density make it more difficult to make accurate callsBecause our pipeline requires unique reads to positionallylocate the TE insertion regions with high TE density canlack enough unique sequence to generate these reads Theregions included in the analyses are X300000ndash208000002L200000ndash20100000 2R2300000ndash21000000 3L100000ndash21900000 3R600000ndash27800000 which are not centromericand telomeric regions This excludes the majority of TEs an-notated in the reference genome since these TEs are largelyclustered together in centromeric regions However this doesnot affect our analysis since we are interested in TEs in theeuchromatic portions of the genome

Second to restrict the set of reference TEs studied to full-length copies we removed from the analysis all reference TEsthat had a length that was less than 75 of the canonicalannotated length of the TE While these elements may stillproduce phenotypic effects these elements are unlikely to beactive copies This reduced the number of reference TEs in-cluded in the analysis to 607 from 1085 TEs that are present inthe reference genome in the regions included in our analysis

Additionally some analyses were performed on the subsetof 23 lines in the DGRP data set which had an average se-quencing coverage of 25 or higher hereafter referred to asDGRP25 This restriction was to mitigate the effect of differ-ences in coverage between the two data sets Simulations oftransposable element detection via our pipeline data not

shown indicate that TE detection at 25 is comparable todetection at 50

Site Count Spectra

We generated derived allele count spectra for TEs in the DSPRdata set and the DGRP25 set of lines We included only siteswhere we were able to make a positive presence or absencecall for each line in the data set so that the number of lineswas normalized for each resource for this analysis Observedvalues were compared with expected values at counts 1 2 34 5 6 and 7 under an infinite sites model via a 2 testCategories 7 and higher were grouped due to low counts insome of these categories We also calculated count spectra forSNPs present in small introns (introns86 bp [Haddrill et al2005]) a group unlikely to have functional effects We alsoused SoFoS a site frequency rescaler (Hufford et al 2012) torescale the DSPR data set to the sample size of the DGRP25from 15 to 23 samples to see if the rescaled data show adifferent distribution of element counts from the observedDGRP25 data We then did a KolmogorovndashSmirnov test tocompare the two distributions as a test of different selectiveeffects acting on the populations

Genomic Context of TEs

Transposable elements were identified in regions of low(lt2 cMMb) or high recombination (2 cMMb) FASTAfiles containing the annotated coordinates of 50 UTRs30 UTRs intronic intergenic and full gene span regions forversion 513 of the D melanogaster reference were down-loaded from flybase (wwwflybaseorg last accessed January30 2009) Version 513 of the reference annotation was usedfor comparative purposes with the DGRP analysis (Mackayet al 2012) Because alternative transcripts can result in thesame genomic position being classified in multiple categorieswe labeled each TE as every appropriate category This oc-curred with 52 of TEs in the DGRP and 60 of TEs in theDSPR When calculating the percentage of the referencegenome in our restricted areas of analysis that were classifiedas each genomic type we followed the same strategy

We compared our observations of TE insertions to a bino-mial distribution to determine whether TEs were distributedrandomly throughout the genome We used the pbinomfunction in R (R Development Core Team 2008) setting thenumber of trials to the total number of TE insertion locationsin the data set and the probability of success to the propor-tion of the genome defined as each context (intron exon30 UTR 50 UTR or intergenic) We did one tailed tests in eachcase For exon 30 UTR and 50 UTR tests we calculated theprobability of observing the observed value or fewer TEs Forintron and intergenic tests we calculated the probability ofobserving the observed value or more TEs

Variables Affecting TE Density

For the DGRP25 and DSPR data sets we calculatedTE density in terms of TEsMb For this analysis we alsoexcluded chromosome 4 We examined the effects of recom-bination rate chromosome and TE element family on TE

2324

Cridland et al doi101093molbevmst129 MBE

density using the standard linear model in R We usedthe lm() function to define our model as followsTEsMb ~ Line + Recombination Rate + Chromosome +Recombination Rate Chromosome We performed thisanalysis both for the set of all TEs and also for 15 TE familieswith moderate to high number of copies in our data set(supplementary table S4 Supplementary Material online)We selected individual TE families for this additional analysisbased on their level of abundance in the data sets and theirhaving been previously studied by Montgomery et al (1987)and Bartolome et al (2002)

Gene Families

We downloaded the Dfam database from (httpwwwindianaedu~hahnlabflyDfamDBdrosophila_frbhtml last acc-essed September 1 2012) This database describes all mem-bers of all gene families clustered using a fuzzy reciprocalBLAST method as described in Hahn et al (2007) for 12Drosophila species We used R to perform a linear regressionusing a simple model where the square root (TE density) in agene is the random variable and log10 (gene family size) is theindependent variable

Variation between Resources

For each TE present in both data sets we compared thefrequency of the TE in each population using Fisherrsquos exacttest We excluded from this analysis all insertions where wewere unable to make a presenceabsence call in at least 100 ofthe DGRP lines or 13 of the DSPR lines For each gene with TEinsertions in both data sets we compared the frequency atwhich the gene contains TEs between data sets using aFisherrsquos exact test

Supplementary MaterialSupplementary tables S1ndashS6 and figures S1 and S2 are avail-able at Molecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Casey Bergman and Miaomiao He for earlyaccess to their data as well as helpful comments from CBThey also thank Rebekah Rogers for help with editing themanuscript This work was supported by the NationalInstitute of Health grant NIH R01-GM085183 to KRT theNational Library of Medicine - National Institute of Healthtraining grant LM007443 to JMC and the National Instituteof Health grant NIH R01-OD010974 to ADL and SJM

ReferencesAltschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic local

alignment search tool J Mol Biol 215403ndash410Aminetzach YT Macpherson JM Petrov DA 2005 Pesticide resistance

via transposition-mediated adaptive gene truncation in DrosophilaScience 309764ndash767

Bartolome C Maside X Charlesworth B 2002 On the abundance anddistribution of transposable elements in the genome of Drosophilamelanogaster Mol Biol Evol 19926ndash937

Bennetzen JL 2000 Transposable element contributions to plant geneand genome evolution Plant Mol Biol 42251ndash269

Bergman CM Bensasson D 2007 Recent LTR retrotransposon insertioncontrasts with waves of non-LTR insertion since speciation inDrosophila melanogaster Proc Natl Acad Sci U S A 10411340ndash11345

Biemont C Vieira C 2006 Junk DNA as an evolutionary force Nature443521ndash524

Birchler JA Hiebert JC 1989 Interaction of the Enhancer of white-apricotwith transposable element alleles at the white locus in Drosophilamelanogaster Genetics 122129ndash138

Birchler JA Hiebert JC Rabinow L 1989 Interaction of the mottler ofwhite with transposable element alleles at the white locus inDrosophila melanogaster Genes Dev 373ndash84

Bowen NJ McDonald JF 2001 Drosophila euchromatic LTR retrotran-sposons are much younger than the host species in which theyreside Genome Res 111527ndash1540

Carr M Soloway JR Robinson TE Brookfield JF 2002 Mechanisms reg-ulating copy numbers of six LTR retrotransposons in the genome ofDrosophila melanogaster Chromosoma 110511ndash518

Charlesworth B 1994 The effect of background selection against dele-terious mutations on weakly selected linked variants Genet Res 63213ndash227

Charlesworth B Jarne P Assimacopoulos S 1994 The distribution oftransposable elements within and between chromosomes in a pop-ulation of Drosophila melanogaster III Element abundances in het-erochromatin Genet Res 64183ndash197

Charlesworth B Langley CH 1989 The population genetics ofDrosophila transposable elements Annu Rev Genet 23251ndash287

Charlesworth B Lapid A 1989 A study of ten families of transposableelements on X chromosomes from a population of Drosophila mel-anogaster Genet Res 54113ndash125

Charlesworth B Lapid A Canada D 1992 The distribution of transpos-able elements within and between chromosomes in a population ofDrosophila melanogaster I Element frequencies and distributionGenet Res 60103ndash114

Chen GK Marjoram P Wall JD 2009 Fast and flexible simulation ofDNA sequence data Genome Res 19136ndash142

Chung H Bogwitz MR McCart C Andrianopoulos A Ffrench-ConstantRH Batterham P Daborn PJ 2007 Cis-regulatory elements in theaccord retrotransposon result in tissue-specific expression of theDrosophila melanogaster insecticide resistance gene Cyp6g1Genetics 1751071ndash1077

Cutter AD Good JM Pappas CT Saunders MA Starrett DM Wheeler TJ2005 Transposable element orientation bias in the Drosophila mel-anogaster genome J Mol Evol 6733ndash741

Daborn P Boundy S Yen J Pittendrigh B Ffrench-Constant R 2001DDT resistance in Drosophila correlates with Cyp6g1 over-expressionand confers cross-resistance to the neonicotinoid imidacloprid MolGenet Genomics 266556ndash563

Deloger M Cavalli FMG Lerat E Biemont C Sagot M-F Vieira C 2009Identification of expressed transposable element insertionsin the sequenced genome of Drosophila melanogaster Gene 43955ndash62

Dworkin I Palsson A Gibson G 2005 Replication of an Egfr-wing shapeassociation in a wild-caught cohort of Drosophila melanogasterGenetics 1692115ndash2125

Ewing B Green P 1998 Base-calling of automated sequencer tracesusing Phred II Error probabilities Genome Res 8186ndash194

Gruber JD Genissel A Macdonald SJ Long AD 2007 How repeatable areassociations between polymorphisms in achaete-scute and bristlenumber variation in Drosophila Genetics 1751987ndash1997

Haddrill PR Charlesworth B Halligan DL Andolfatto P 2005 Patterns ofintron sequence evolution in Drosophila are dependent upon lengthand GC content Genome Biol 6R67

Hahn MW Han MV Han S-G 2007 Gene family evolution across 12Drosophila genomes PLoS Genet 3e197

Hirschhorn JN Daly MJ 2005 Genome-wide association stud-ies for common diseases and complex traits Nat Rev Genet 695ndash108

2325

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 4: Abundance and Distribution of Transposable Elements in Two ...

canonical element These are the set of events directly com-parable to the set of reference TEs we kept for our analysissince these are the set of TEs that are likely to be active copiesin the genome A drop in average genome coverage did notaffect our rate of identification in simulated data sets Wedetected 956 of elements at 25 coverage and 913 ofelements at 15 coverage

When we looked at the detection of all elements we foundthat we were able to identify the insertion locations for 70 ofall elements at 50 coverage 724 of elements at 25 cov-erage and 64 of elements at 15 coverage

PCR ValidationWe compared our pipelinersquos TE calls with PCR data from1687 PCR calls of TE insertions in 9 of the DGRP lines(Blumenstiel JP Chen X He M Bergman CM unpublisheddata httparxivorgabs12093456 last accessed June 262013) The comparison of our pipeline to PCR-validated in-sertions showed a 924 overall agreement between the twomethods (supplementary table S2 Supplementary Materialonline) when comparing presence and absence calls We didnot include the few examples of heterozygous insertions inthis analysis Our pipeline was unable to make a presence orabsence call for the set of PCR-validated insertions 6 of thetime and disagreed with the PCR data 16 of the time Failureto validate via PCR was equally likely for presence or absencecalls suggesting that detection of TEs was unbiased with re-spect to presence versus absence We estimate a pipelinesensitivity of 982 and a specificity of 94 based on thissubset of the data Our estimation of the validation rate isconservative requiring us to reconstruct a contig that eithercontains the TE breakpoint or spans over the insertion site inorder to call an event as either present or absent respectivelyFor some cases where our pipeline makes no call we did seeread pairs that are suggestive of the state of the presence orabsence of the TE but are not sufficient in number andoruniqueness for reconstruction the event de novo

Computational ValidationThe initial phase of our pipeline detected transposable ele-ment insertions in individual lines Following this initial de-tection phase our pipeline surveys every line at each genomiclocation where a TE insertion was identified in any other lineincluding lines in different data sets In this manner weattempted to computationally verify the presence or absenceof TE insertions at a total of 2596370 genomic locationsOur pipeline made presence or absence calls between9717 and 9998 of the time (supplementary table S3Supplementary Material online) Positive absence callsmade by our pipeline can also be considered a type of vali-dation since to make an absence call our pipeline must detectreads that span the junction where the insertion would belocated (supplementary table S3 Supplementary Materialonline)

The presence of regions of IBD amongst DGRP lines (fig 1)provided an opportunity to cross-validate TE calls in regionsmasked as IBD between different line pairs because we ex-pected TE genotypes to be identical within regions identified

as IBD We find a mean of 98 agreement for TE calls (bothpresence and absence) between pairs of regions identified asIBD with a standard deviation of 0015

Distribution and Abundance of TransposableElements

A total of 7104 transposable element insertions were foundin the DSPR and 17639 in the DGRP A comparison of the TEinsertion content of the DGRP and DSPR reveals that eachpanel contained a large number of TE insertions specific tothat panel only 1656 insertions were shared between theDSPR and DGRP data sets out of a total of 23087 insertionsdetected (ie 72 shared insertions) All transposable ele-ment insertion locations are listed in supplementarytables S4 and S5 Supplementary Material online A subsetof the DGRP data lines with an average sequence coverage25 mean sequence coverage 335 standard deviation51 hereafter DGRP25 (table 1) contained 5855 insertionsThe DGRP25 set of lines was used in some analyses to bemore directly comparable to the DSPR lines which all had amean genome coverage of 60 (fig 2)

In both resources TE insertions were generally at low fre-quency Overall we found a larger total number of insertionsper line in DSPR than in DGRP (fig 2) This was presumablydue to the much higher sequencing coverage of the formerleading to a higher power to make calls of complex insertionsMean sequence coverage for the DGRP lines was 195with astandard deviation of 64 mean sequence coverage for theDSPR was 671 with a standard deviation of 42 Howeverthe increased amount of time the DSPR lines have remained inlaboratory conditions may also have contributed to the ele-vated number of TEs seen in these lines (Nuzhdin et al 1997)

TEs may be located at random throughout the genome orthey may be found preferentially in certain genomic regionssuch as in intergenic sequence We used the binomial distri-bution as a null model for the distribution of TEs in thegenome assuming TEs insert and remain at random withrespect to these genomic features (table 2) We found thatin general the distribution of TEs observed differed from whatwe would expect if TEs were distributed at random through-out the genome We found significantly fewer TEs in exons50 UTRs and 30 UTRs than expected and significantly moreTEs than expected in intergenic and intronic regions (table 2)which was consistent with previous observations as TEs ingenes are much more likely to be deleterious (Kaminkeret al 2002 Lipatov et al 2005) This pattern held true forthe DGRP the DGRP25 and the DSPR suggesting that selec-tion may be acting universally to remove TEs that disruptexonic sequences

TE Identification

Information about the class of TE that has been inserted isuseful for two reasons First this information will allow forrapid assays for the presence of this TE in other strainsSecond insertions of different families of TEs can producedifferent phenotypic consequences (Smith and Corces1991) We were able to identify the TE class of the majority

2314

Cridland et al doi101093molbevmst129 MBE

of insertions (table 1) RNA elements were much morecommon than DNA elements and within RNA elementsover half were long terminal repeat (LTR) retrotransposonelements We were unable to identify element class for347 of the DGRP insertions and 320 of the DSPR inser-tions In cases where we were unable to identify the TE familyfor an insertion this was due to either conflict between themost likely TE as identified by the two reconstructed contigsfor each event or because we had limited data on the TEsequence for that event In cases where we have conflictinginformation this was due to 1) the degradation of the TEsequence so that there were many poor matches to multiplefamilies 2) conserved sequence between families and there-fore good matches to multiple families 3) because we iden-tified a nest of elements where multiple elements insert invery close proximity or within one another or 4) the amountof TE sequence we were able to reconstruct was less than

80 bp and therefore below our annotation threshold lengthBecause of the frequently short length of TE sequence that weattempted to identify we did not attempt to identify thestrand of the TE insertion Although TE orientation is impor-tant and influences selection upon the insertion (Cutter et al2005) our data were frequently insufficient to resolve strand

Factors Influencing TE Density

If TE insertions are deleterious and recessive with respect tofitness then insertions are also predicted to be morecommon on the autosomes since deleterious insertions onthe X will be eliminated via selection against hemizygousmales (Charlesworth and Langley 1989) Contrary to this pre-diction an analysis of variance (ANOVA) indicated a signifi-cant effect of chromosome on mean TE density per line whenboth resources were included in the same analysis (table 3)This difference was due to an increased mean density of TEson the X versus the autosomes in regions of high recombi-nation This pattern was the same both for autosomes takenas a group and for each autosome individually (table 4)

TEs are predicted to accumulate in regions of low recom-bination where they are less likely to produce deleteriousrearrangements via ectopic exchange (Charlesworth andLangley 1989) An ANOVA indicates a highly significanteffect of recombination rate on mean TE density per line(table 3) TE density was also found to be greater in regionsof low recombination than in high recombination through-out the genome which was consistent with the predictions ofCharlesworth and Langley (1989)

Because the factors that influence TE density may differgreatly from family to family we repeated this analysis in 14families of TEs that appear in moderate to high abundance inthe two resources (table 5) Many of these families were pre-viously examined by Montgomery et al (1987) and Bartolomeet al (2002)

We found that in 10 families (412 roo 176 F Bari1 copiahopper INE-1 mdg1 springer) there was a highly significantdifference in mean TE density between the X and the auto-somes (P value 0026 in all cases) although the direction ofthe difference varied among families Within these familiesthere were substantial differences In two of the familiesINE-1 and roo we saw an increased density in the X overthe autosomes in regions of high recombination in boththe DSPR and DGRP25 resource (table 5) However for 412F mdg1 springer Bari1 and copia we saw the opposite pat-tern This suggests that there may be substantial variation inTE insertion site preferences among families

Sample Counts of Transposable Elements

The power to detect an association is dependent both on thepenetrance and the frequency of the causal allele (Hirschhornand Daly 2005) We found that the majority of transposableelements in the DGRP25 and the DSPR are found in only oneline (fig 3) Comparisons between the observed TE data andSNPs from the same population found in introns 86 bp(Haddrill et al 2005) indicated an excess of TEs (DGRP252 test Pamp 0 df = 6 DSPR 2 test Pamp 0 df = 6)

0 20 40 60 80 100

020

040

060

080

010

00

Coverage

TE

s

DSPRDGRP gt 25x coverageDGRP lt 25x coverage

FIG 2 Total number of TEs identified versus coverage for the DGRPDGRP25 and DSPR lines

Table 1 Summary of TE Insertions

DGRP DGRP25 DSPR

Total 17639 5855 7104

Not present in reference 17346 5615 6812

Present in reference 293 240 292

X recombination 2 cMMb 418 149 203

X recombinationgt 2 cMMb 2741 948 1107

Autosomes recombination 2 cMMb 6205 2131 2694

Autosomes recombinationgt 2 cMMb 8177 2544 3019

4 all 98 83 81

Exon 1158 378 633

Intron 8595 2870 3310

30 UTR 477 159 234

50 UTR 190 63 81

Intergenic 7219 2680 3269

DNA elements 1388 514 748

RNA elements 10133 3004 4085

Indeterminate 6118 2336 2271

2315

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Comparisons between the observed data and the expectednumber of events at the same count under an infinite sitesmodel of an equilibrium Wright-Fisher population (Wakeley2009 p 54ndash56) also indicated an excess of rare alleles(DGRP25 2 test Pamp 0 df = 6 DSPR 2 test Pamp 0df = 6) In addition to overall site count spectra we also gen-erated spectra for the X autosomes regions of high and lowrecombination DNA elements and RNA elements In eachcase we observed the same pattern as seen in the overallspectrum (data not shown) We also compared the distribu-tion of counts in the DGRP25 with the rescaled distribution ofcounts in the DSPR rescaled to the same sample size as theDGRP25 A KolmogorovndashSmirnov test comparing these twodistributions showed that they did not differ (P = 021) Wedid see more overall TEs in every count category in therescaled DSPR data over the DGRP25 data set

Patterns of Transposable Element Insertions betweenGenes

For association studies genotyping TE insertions withincoding regions is of interest due to the potentially large

effect of an insertion on gene function A total of 2931genes (213 of annotated genes) had insertions in theDGRP resource These are insertions which exist anywherewithin the span of the gene from 50 UTR to 30 UTR Onethousand nine hundred fifteen genes 139 had insertions inthe DSPR with 1317 genes having insertions in bothresources The number of TE insertions within genes variedsubstantially between individual genes up to a maximum of78 insertions though the majority of genes with insertionshad only one insertion 631 in the DSPR and 527 in theDGRP The number and locations of individual insertions in agene varied tremendously though the majority of insertionsthat fell within gene regions were in introns 777 DSPR and824 DGRP TE number was strongly correlated with intronsize (Pearsonrsquos correlation coefficient = 079 for DSPR 087for DGRP Pamp 0 for both tests) (supplementary fig S2Supplementary Material online)

Examining the pattern of insertions in individual genes inthe DSPR and DGRP data sets we found that RNA-bindingprotein 6 had the highest number of insertions 62 in theDGRP lines and 21 in the DSPR lines These insertions were

Table 3 ANOVAs for DSRP and DGRP25 Coverage and Comparison between the Two Data Sets

df Sum Sq Mean Sq F value Pr(gtF)

DSPR vs DGRP25

Set 1 54125 54125 6486975 lt2e16

Line 36 2848 791 94817 lt2e16

Chromosome 4 11858 2965 355313 lt2e16

Recombination rate (high vs low) 1 25575 25575 3065207 lt2e16

Chromosomerecombination rate 4 10335 2584 309663 lt2e16

Residuals 333 27784 083

DSPR

Line 14 26599 19 17894 004699

Chromosome 4 39632 9908 93316 123e06

Recombination rate (high vs low) 1 141337 141337 1331143 lt2e16

Chromosomerecombination rate 4 39712 9928 93504 119e06

Residuals 126 133783 1062

DGRP25

Line 22 258202 11736 1723 lt2e16

Chromosome 4 80126 20031 29409 lt2e16

Recombination rate (high vs low) 1 120011 120011 176191 lt2e16

Chromosomerecombination rate 4 66056 16514 24244 243e16

Residuals 198 134866 0681

Table 2 Differences between Observed and Expected TE Counts

DGRP DGRP25 DSPRObserved vs Expected P Value Observed vs Expected P Value Observed vs Expected P Value

X recombination 2 cMMb Decrease 329E11 Decrease 187E03 Decrease 522E02

X recombinationgt 2 cMMb Decrease 174E01 Increase 701E02 Increase 416E01

Autosomes recombination 2 cMMb Decrease 370E01 Increase 100E + 00 Increase 687E04

Autosomes recombinationgt 2 cMMb Increase 392E02 Decrease 100E + 00 Decrease 100E + 00

Exon Decrease 000E + 00 Decrease 247E323 Decrease 103E283

Intron Increase 354E103 Increase 000E + 00 Increase 000E + 00

30 UTR Decrease 699E16 Decrease 313E06 Decrease 126E02

50 UTR Decrease 143E51 Decrease 282E18 Decrease 498E20

Intergenic Increase 112E32 Increase 0 Increase 000E + 00

2316

Cridland et al doi101093molbevmst129 MBE

typically unique to a single line though there were 3 insertionsin more than one line in the DSPR and 14 insertions in morethan one line in the DGRP

Some genes accumulated many TE insertions such asklarsicht (fig 4A gene images from the UCSC genome brow-ser Meyer et al 2013) which had 29 total insertions This genealso had a hotspot of insertions with seven independent in-sertions each at low frequency located within 36 kb of each

other Only one of these insertions was present in any givenline and these could be functionally equivalent though inde-pendently arising mutations Other genes had only a few in-sertions such as Notch (fig 4B) and Delta (fig 4C) Theinsertions in these genes were also at low frequency andthe few present were located in intronic regions

While most insertions exist in only one panel some werefound in both Cyp6a20 (fig 4D) contains a high frequencynon-reference TE insertions present in both data sets in 4lines out of 121 lines where we were able to make a presenceor absence call in the DGRP (hereafter shown as 4121) and314 lines in the DSPR Both Cyp6a20 and klarsicht are exam-ples of genes with TE insertions in exons which was

00

02

04

06

08

10

1 2 3 4 5 6 7+

Observed TEsObserved SNPsExpected

DSPR

00

02

04

06

08

10

1 2 3 4 5 6 7+

Observed TEsObserved SNPsExpected

DGRP

FIG 3 Derived allele count spectra for the DSRP lines and the DGRP25lines where a positive presence or absence call was made for eachinsertion in each line 6613 insertions in the DSPR and 3274 in theDGRP25 Count spectra for SNPs is from SNPs in introns 86 bp 2

tests between observed and expected distributions result in Pamp 0 forcomparisons between TEs and the neutral model as well as between TEsand SNPs for both data sets

Table 5 TE Density for 15 Individual Families of Elements

Mean Density (TEMb)

Element Resource X High X Low Auto High Auto Low

roo DGRP25 039 037 032 033DSPR 080 088 065 068

297 DGRP25 005 000 002 001DSPR 005 000 004 000

412 DGRP25 003 001 003 007DSPR 006 008 009 012

F DGRP25 002 006 004 006DSPR 007 008 012 013

176 DGRP25 001 000 001 002DSPR 002 000 002 006

Bari1 DGRP25 001 000 004 001DSPR 002 002 006 002

copia DGRP25 001 000 002 003DSPR 009 002 014 019

H DGRP25 001 000 001 000DSPR 000 000 001 000

hopper DGRP25 010 001 000 000DSPR 010 004 002 002

INE-1 DGRP25 042 188 003 048DSPR 045 203 004 055

jockey DGRP25 015 012 019 013DSPR 035 034 040 033

mdg1 DGRP25 003 002 003 010DSPR 008 004 012 018

pogo DGRP25 006 004 007 005DSPR 013 021 015 013

springer DGRP25 000 001 001 003DSPR 002 000 001 005

Table 4 TE Density in the X and Autosomes

TEMb

DGRP25 DSPR Both

X all 382 638 483

Autosomes all 351 593 447

X high recombination 371 600 461

X low recombination 393 676 505

2L high recombination 227 473 324

2L low recombination 323 589 428

2R high recombination 282 489 364

2R low recombination 617 863 714

3L high recombination 296 523 386

3L low recombination 474 726 573

3R high recombination 281 468 355

3R low recombination 372 670 490

2317

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

uncommon both in this data set and in previous studies(Kaminker et al 2002 Lipatov et al 2005)

Differences in the strength of selection against closelyrelated genes can also be illustrated by patterns of TE inser-tions The paralogs derailed (fig 4F) and derailed-2 (fig 4E)showed very different patterns of TE insertions though thesame pattern was seen in both resources derailed-2 locatedon chromosome 2R is in a region of moderate recombination199 cMMb whereas derailed is located towards the distalend of 2L in a region of low recombination 044 cMMb

Given the context of the recombination rates the expectationwould be that derailed would have a higher TE load sincedeleterious alleles are removed more efficiently in regions ofhigh recombination (Hudson and Kaplan 1995) but theopposite pattern was observed

While both genes play a role in Wnt5 signaling mutationsin derailed can cause major phenotypic changes in Drosophilanervous system development resulting in the loss of normalfunction (Yoshikawa et al 2003) derailed-2 when mutatedproduces only minor differences in neuron positioning

115 115115 215 115115

114 1122

11181120 11181124

2117

21182119

1122

1119 1118

2118

1121 3114

1115

1114

1119

1113

1119

1120

1120

1119

1120

3119

1119

A klarsichtDGRP

DSPR

Insertion in an Exon

314 1195

4121D Cyp6a20

315

1123

1415

56125

115115

415

115615

10124

1124 1124 1123

2124E derailed-2

B Notch

115 11211122

C Delta1114 1120

191

F derailed

Same insertion in both resources

FIG 4 Transposable element insertions in genes The frequency above each insertion is the number of lines in which the element is present over thenumber of lines in which the element is validated as either present or absent Gene images are from the UCSC genome browser (httpgenomeucscedu last accessed June 31 2012)

2318

Cridland et al doi101093molbevmst129 MBE

(Sakurai et al 2009) This suggests that either the strength ofselection may be much higher against insertions in derailed orthe insertion rates are unequal between the paralogs

Gene Family Size and TEs

We used available information on gene families in DrosophilaHahn et al (2007) grouped into families based on sequencesimilarity and found a very weak but significant effect of genefamily size on the average number of TE insertions per kbin both resources (P value = 00003 R2 = 0001 DSPRP value = 0002 R2 = 0007 DGRP) (fig 5A and B) Althoughthere was a large variance in the number of insertions per kb

in both resources (fig 5A and B) the average number ofTEskb increased as family size increased up to a moderatelylarge family size

Patterns of Variation between Resources

To replicate an association with a single causative allele thatallele must both exist in both panels and also be at highenough frequency to detect the association Given thatmost TEs were singletons and only 45 of insertion eventswere identified in both panels and were within genes thereplicability of associations with these alleles is likely to below if alleles are considered individually We examined sharedinsertions segregating at different frequencies between panelsand found only two TE insertions in genes that were segre-gating at different frequencies in the DGRP versus DSPRresources following Bonferroni correction (Fisherrsquos exacttests P 000007) One of these was in an exon of bothCG13175 and CG33964 a pair of overlapping genes Herethe TE insertion was at much higher frequency in theDGRP than in the DSPR (89122 vs 215) The other insertionwas in an intron of Caliban and the frequency of this insertionwas higher in the DSPR than the DGRP (1115 vs 7116)

If insertions are considered as a class of mutation thenthere may be better power to detect an association assumingthat different insertions contribute in the same way to thephenotype We also examined genes with TE insertions(139 of genes) though not necessarily the same insertionin both resources Treating all insertion alleles within a gene asequivalent a comparison between resources showed thatthere were 21 genes which contain TEs at different frequen-cies between the DGRP versus DSPR following Bonferronicorrection (Fisherrsquos exact tests P 0000038197) includingderailed-2 (supplementary table S6 Supplementary Materialonline)

Discussion

Distribution and Abundance of TransposableElements

Previous work on transposable elements has generally beenfocused on a few families of TEs (Montgomery et al 1987Charlesworth and Lapid 1989 Charlesworth et al 1992) orfocused only on a single genome (Bartolome et al 2002Kaminker et al 2002 Rizzon et al 2002) We have presenteda genomic analysis of TEs across a large number of lines andtwo data sets We found that transposable elements in bothresources are primarily found as singletons with a few mod-erate to high frequency insertions Transposable elements alsosegregated at lower allele frequencies than do presumed non-functional SNPs drawn from the same data set suggestingthat negative selection is acting upon TEs However theremay be substantial variation in the age of transposable ele-ments in different families many LTR elements are thought tobe young insertions (Bowen and McDonald 2001 Bergmanand Bensasson 2007) while other families like INE-1 arethought to be quite old (Kapitonov and Jurka 2003) Thismay create a situation where the young age of insertions isresponsible for the excess of rare alleles We found a

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

DSPR

DGRPA

B

FIG 5 log10(TE density) versus log10(Gene family size)

2319

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

non-random distribution of TE insertions across the genomewith increases of TEs in both the intergenic and intronicregions of the genome and decreases in exonic and UTRregions (Kaminker et al 2002 Lipatov et al 2005) Similar toprevious studies we found variation in TE density betweenchromosomes and higher densities of TEs in regions of lowrecombination (Charlesworth et al 1994 Bartolome et al2002) We also found increased TE densities on the X relativeto the autosomes both in overall insertions and in someindividual TE families (table 5)

Detection method and the size of the data set are likely tobe contributing factors to differences between previous workand this study Many previous studies have used in situ meth-ods to ascertain TE number in samples and have looked atelement accumulation at the base versus the midsections ofchromosomes (Montgomery et al 1987 Charlesworth et al1992) or have looked only at one genome (Bartolome et al2002 Rizzon et al 2002) A previous PCR-based detectionstudy found no difference in TE frequency between the Xand the autosomes in natural populations (Petrov et al2011) but this study was biased in that it only looked atTEs shared with the reference sequence and thus did notcapture the full picture of TEs in these populations

While our analysis does disagree with previous work thesedisagreements may be mainly due to the way in which regionsof the genome are divided for examination Rizzon et al(2002) found an increase in copies of LTR retrotransposonson the X when they excluded pericentromeric telomeric andchromosome 4 from their analysis though they found a def-icit of TEs on the X when these regions are included Thisfinding is most directly comparable to our own finding for theentire TE data set since they exclude a very similar set ofgenomic regions We do agree with previous work in ourfinding of higher densities of TEs in regions of low recombi-nation than in high recombination (Charlesworth et al 1994Bartolome et al 2002)

Examining individual element families Carr et al (2002)found an increase of mdg-3 elements and 297 elements on theX than would be expected under a random insertion modelMontgomery et al (1987) found no evidence for a reductionin TEs on the X for both roo and 297 elements but foundfewer TEs on the X when looking at the 412 family HoweverCharlesworth et al (1992) observed the opposite pattern forroo and 297 Montgomery et al (1991) also found that moreroo elements in lab kept lines than they had previously ob-served in natural populations We found significantly in-creased densities on the X for roo and significantlyincreased densities on the autosomes for 412

Differences between previous studies of individual elementfamilies and our results may be due to a variety of reasonsFirst we were able to annotate only about two-thirds of ourTE insertions This was primarily due to having insufficientsequence to reliably annotate the element or because inde-pendent annotation using sequence reconstructed fromeither end of the insertion did not yield the same elementfamily annotation This means that our density calculationsfor individual elements may be skewed by missing annotationinformation Second many previous studies (Montgomery

et al 1987 Charlesworth et al 1992 Carr et al 2002) examinedonly LTR elements These elements are young (Bowen andMcDonald 2001 Bergman and Bensasson 2007) and thus in-dividual copies may not yet have been removed from thegenome thereby displaying different patterns of insertionthan older active elements

What is clear is that there is great variability in TE densityboth within and between element families as well as withinand between chromosomes It may also be that the overallpattern is driven by a small group of families In our data setINE-1 shows the largest pattern of difference between the Xand autosomes with an order of magnitude increase in meanTE density per line on the X over the autosomes possibly dueto an increased likelihood of fixation of insertions (Omettoet al 2005) specifically on the X chromosome (Presgraves2006) These INE-1 elements are a subset of the largest TEfamily in D melanogaster though only several dozen of themare in the euchromatin INE-1 elements are also considered tobe inactive in D melanogaster and likely have not been mobilefor gt3 million years (Kapitonov and Jurka 2003) Consistentwith this most of the INE-1 elements we detected were fixedor nearly fixed in our data set We did see some examples oflow frequency insertions which may be due to more recenttranspositional events or incorrect TE annotation due toshort sequences The old age of INE-1 insertions also suggeststhat the extant elements are unlikely to be selected againstAn increased density of TEs on the X compared with theautosomes suggests that it is not the fitness effects of inser-tions which govern TE copy number Fitness effects wouldresult in the more rapid removal of insertions on the X thaninsertions on the autosomes due to the increased selectivepressures experienced on the X (Montgomery et al 1987)Background selection also predicts that the X should be lessburdened by deleterious recessives since it is hemizygous onethird of the time (Charlesworth 1994)

However it is unclear whether our observation of in-creased densities of TEs on the X relative to the autosomesfits with the ectopic exchange model This model predictsthat TEs will accumulate in regions of low ectopic exchange(Langley et al 1988) The general assumption in the literatureis that ectopic exchange is positively correlated with recom-bination rate (Langley et al 1988 Montgomery et al 1991)The X chromosome experiences higher levels of recombina-tion than the autosomes with an average of 36 cMMb forthe X in regions of high recombination (cMMbgt 2) whereasthe autosomes have a mean of 336 cMMb If recombinationrate and ectopic exchange are positively correlated through-out the genome then the expectation would be fewer TEs inthe middle regions of the X where recombination rates areknown to be high This is the opposite of what we observe Ifrecombination rate and the rate of ectopic exchange are notpositively correlated then the ectopic exchange model maybe correct but data are lacking to properly address this issueCurrently there are no direct measurements of the rate ofectopic exchange on a genome-wide scale in Drosophiladuring meiosis Montgomery et al (1991) observed littleectopic exchange between homozygous chromosomes inDrosophila melanogaster on the X chromosome in the

2320

Cridland et al doi101093molbevmst129 MBE

region around the white locus They also found that whilerecombination rate was reduced in both centromeric andtelomeric regions only the centromeric region displayed anincreased density of TEs This information is suggestive ofdifferences between these regions in the rate of ectopicexchange

The DGRP and DSPR are also composed of inbred lineswhich have been kept in laboratory conditions for some time~9 years in the case of the DGRP lines and 40+ years in thecase of the DSPR Given that TE insertion mutations accumu-late rapidly and in an increasing manner over time (Nuzhdinet al 1997) the increased density of TEs on the X versus theautosomes may not be surprising if there is a highertransposition rate to the X The only direct study to datefocused on roo elements in D melanogaster and did observehigher rates of insertion in the X than the autosomes(Vazquez et al 2007)

Transposable Elements in Genes

The majority of genes 813 do not have any TE insertions inany line studied and intron length is a good predictor of thenumber of TEs within a gene However a few genes accumu-late many insertions even after correcting for gene size andintron length and some genes have very few TEs even thoughthey have large introns The reasons why different genes mayaccumulate TE insertions may be quite different for individualgenes In some cases there may be hot spots for insertionswhich if insertions are neutral or only slightly deleterious mayresult in large number of insertions in a small region In othergenes the lack of TE accumulation may suggest selectivepressures against TE accumulation

Gene family size may also play a role in TE accumulationWe see an increase in TE density correlated with an increase ingene family size This may indicate differences in selectivepressures for genes in different sized families Single-copygenes may experience more constraint whereas in moder-ately sized gene families these constraints may be more re-laxed perhaps due to redundancy

In the cytochrome p450 functional family of genes we findthat 3789 of these genes have TE insertions 2589 in theDGRP and 2089 in the DSPR Cytochrome p450 is also foundto be in an enriched term in the DSPR for the set of genescontaining TEs The cytochrome p450 gene Cyp6g1 is knownto confer resistance to DDT (Daborn et al 2001) and TE-mediated copy number variation is associated with increasingresistance to DDT (Schmidt et al 2010) This pattern of TEinsertions in the DSPR may be reflective of the locationswhere DSPR lines were originally collected The DSPR lineswere gathered from various locations worldwide at a timewhen different pesticides were in use The DGRP lines weregathered from a single geographic area in North America in2003 Presumably these sets of lines have experienced differ-ent selective pressures with respect to pesticide resistance InCyp6a20 (fig 4D) where TEs are segregating at different fre-quencies in the two resources some similar effect may beoccurring

TE Detection Pipelines

The TE detection strategy used here is an improvement uponthe strategy used in Mackay et al (2012) The most notabledifferences between the data reported in Mackay et al (2012)and the DGRP data reported here is substantial amount ofnew information First our pipeline improvements agree wellwith the previous implementation with 812 of previous TEpresence calls identified by the new pipeline However thesecalls represent on 182 of the presence calls made by theimproved pipeline demonstrating that we have identifiedsignificantly more calls than before The incorporation of TEabsence calls also allows us to provide a much more completepicture of TEs in euchromatic sequence in the DGRP

The TE detection pipeline used here is similar to otherpipelines which have been employed but it is useful to de-scribe briefly here the differences in approaches and why wefollowed the approach presented here A useful discussion ofcommon approaches to TE detection in many species can befound in Xing et al (2013) One common method for TEdetection is the split-read mapping approach which hasbeen employed in Drosophila by Linheiro and Bergman(2012) This approach identifies TE insertions by locating in-dividual sequence reads which span TE insertion breakpointsThis approach is the strategy of choice for next-generationsequencing experiments which produce only single-endreads However in a paired-end read situation this approachmay not take full advantage of the available data Specificallythe size of window around the TE insertion breakpoint whereinformative reads can be found is governed by read length in asingle-end experiment and fragment length in a paired-endexperiment Linheiro and Bergman (2012) discuss the impor-tance of read length on TE detection and suggest that in-creased read length improves TE detection In situationswhere read length is long and coverage is deep enough thesplit-read method may be preferable to the paired-end de-tection strategy However for the data sets analyzed herewhere read length is much shorter than fragment lengththe paired-end detection strategy will capture moreinformation

The pipeline used here also has similarities to the pipelineused in Kofler et al (2012) However there are two majordifferences in these studies First Kofler et al (2012) do notattempt to reconstruct their insertion breakpoints andsecond the population studied by Kofler et al (2012) wassequenced as a pool rather than as individual lines Becausetheir pooled sequence data reflect on average less than 1sequence coverage per line Kofler et al (2012) are primarilyable to detect intermediate to high frequency TE insertionstheir own analysis concludes that they cannot reliably detectinsertions at frequency lt7 (see supplementary fig S2Supplementary Material online in their supplementary dis-cussion) However the literature to date (Charlesworth andLangley 1989 Charlesworth and Lapid 1989 Charlesworthet al 1992) as well as the results of this study indicate thatthe majority of TEs in a population are found at very lowfrequency often in only one individual in a population andtherefore beyond the scope of detection of Kofler et al (2012)

2321

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Caveats

The D melanogaster reference genome to which we alignedour data is a P-element-free genome Therefore the use of thisreference will be biased against detecting any P elementsexisting in the data sets In the case of the DSPR data setthis is not an issue since these lines are also all P-element-free(Macdonald and Long 2007) However it is likely that theDGRP lines do harbor P elements For the purpose of theseanalyses we decided to ignore P elements and focus on theset of element families common to the two resources

This situation does highlight the importance of having anappropriate reference sequence To detect P elements in thegenome of D melanogaster an artificially constructed refer-ence sequence or a different assembled sequence containingthe P element would need to be utilized This also brings upthe question about as yet undetected elements that exist atlow frequency If these elements do not have sufficient sim-ilarity to existing elements to be aligned to the referencegenome these elements will go undetected

Conclusions

We have presented a description of transposable elements intwo recently released QTL resources Transposable elementscan play important roles in gene function and regulation andincorporation of TE insertion information into any analysisperformed with either of these two resources will present amore complete picture of genomic variation and its contri-bution to complex traits In addition it will be important forassociation studies attempting to replicate genotypendashpheno-type associations utilizing these resources to be aware of thepotential contributions of TEs

We have also demonstrated a next-generation high-throughput sequencing analysis pipeline that is capable ofdetecting this type variation with a high rate of precisionand specificity We suggest that future studies using next-generation sequencing data utilize this technique

Materials and Methods

Data Sets

Two data sets were used in this analysis The first set consistsof 131 inbred lines from the Drosophila Genetic ReferencePanel (DGRP) (Mackay et al 2012) SRA accessionSRP000694 The second set consists of 15 lines used to es-tablish the Drosophila Synthetic Population Resource (DSPR)(King et al 2012a) initially described in Macdonald and Long(2007) SRA accession SRA051316 For the DGRP resourceboth 454 and Illumina data were available whereas onlyIllumina data are available for the DSPR We utilized onlythe Illumina sequence data to ensure consistency in datacollection across data sets We calculated a mean coverageof 21 for the DGRP lines and 50 for the DSPR linesVariation in coverage for the DGRP lines was substantial be-tween lines ranging from ~4 average coverage to nearly50 average coverage We therefore dropped from the anal-ysis four lines with an average coverage of 10 or less sincelow coverage can cause difficulties with assembly and TEidentification Sequence data for both data sets consisted of

paired-end data with 54 bp reads for the DSPR and paired-end 100 bp reads trimmed to 75 bp for DGRP

IBD in DGRP Data Set

We downloaded the SNP tables generated by the DGRPproject from httpwwwhgscbcmtmceduprojectsdgrpfreeze1_July_2010snp_callsIllumina (last accessed July 302010) to examine identity by descent in this data set Weperformed an all by all comparison between lines examiningsliding windows of 1 Mb across the genome with 100 kb stepsbetween windows We take a simple definition of IBD hereWhenever we identified two lines that weregt95 identity atSNP positions in a 1 Mb region we labeled the pair as IBD inthe identified region and masked that region of the genomein the line with lower coverage We also dropped entirelyfrom our analysis any lines where gt50 of the genome wasdetermined to be IBD with other lines in the population(supplementary table S1 Supplementary Material online) re-moving from the analysis the line with lower average se-quence coverage

We modeled 1 Mb regions of 200 chromosomes in a coa-lescent simulation using the parameters theta = 001site and4Nr = 10theta to determine the number of regions of95IBD under a Wright-Fisher population model using the MACSsoftware (Chen et al 2009) We simulated 1000 replicates ofthis simulation and tabulated the number of 1 Mb regionsshowing 95 IBD in each replicate

Alignment

We aligned reads to the D melanogaster reference genome(version 513 downloaded from FlyBase flybaseorg) using thealigner BWA (version 059 Li and Durbin 2009) We used thefollowing parameters (aln -t 8 -l 13 -m 50000000 -R 5000)followed by the command ldquosamperdquo to resolve paired endmappings (using parameters -a 5000 -N 5000 -n 500) It isimportant when detecting TEs that the -R during the initialalignment and the -N parameter during the ldquosamperdquo phase beset high otherwise highly repetitive sequences that would beinformative of the presence of a common element will beexcluded from the data by the aligner We also used the -Ioption for the DGRP lines but not the DSPR due to the dif-ferences in Illumina quality score output format as the twodata sets were sequenced at different times The -m parameterfrom the initial alignment step and the -n parameter from thepaired-end resolution step deal with how BWA treats multiplyaligning reads An additional caveat is that some TE familiesmost notably the P-element family is absent from the Dmelanogaster reference sequence Such families in the se-quenced genomes will not align to the reference genome weused even if they are present in the sampled lines

Transposable Element Detection

Transposable elements were detected by first identifying allreads that were aligned to an annotated TE in the referencegenome (supplementary fig S1 Supplementary Material on-line) We then identified the mates of these reads and selectedonly those mates which did not align to an annotated TE

2322

Cridland et al doi101093molbevmst129 MBE

These reads identify the set of read-pairs which span a TEinsertion The uniquely mapping non-TE reads were thenclustered based on their start position and chromosomalstrand to identify putative TEs Clusters on the plus strandwere matched to clusters on the minus strand to producepairs of clusters indicating either end of the putative TE Onceclusters were identified we filtered out all events which con-tained fewer than three read-pairs consistent with a giveninsertion for each cluster This means that putative eventsat this stage have a minimum of six read-pairs indicating theinsertion three on the plus strand and three on the minusstrand

We then extracted all reads and their mates that aligned tothe identified regions of the genome and attempted toreconstruct the local area including the breakpoint of theTE Because many TEs have repetitive regions at the ends ofthe element we reconstructed the two breakpoints of the TEseparately (see also supplementary fig 23 from Mackay et al2012) Local reconstruction was done by running Phrap (ver-sion 1090518 Ewing and Green 1998) with the followingparameters (-forcelevel 10 -minscore 10 -minmatch 10)This set of parameters causes Phrap to do a Smith-Waterman comparison when it finds matches of at least 10characters but then relaxes the minimum alignment scoreand the final assembly parameters This is necessary to recon-struct TEs because repetitive sequence within the TEs canresult in a failure to reconstruct a contig if stricter assemblyparameters are used All code we developed for the detectionof TEs will be freely available at wwwmolpopgenorgData(last accessed August 1 2013)

Following reconstruction we used BlastN (verson 2222) tocompare reconstructed contigs to the D melanogaster refer-ence to confirm the presence of a TE In most cases we wereable to fully reconstruct both breakpoints of the TE Fullyreconstructed indicates that we could reconstruct a singlecontig which contained the breakpoint of the TE In mostcases the two breakpoints of the TE did not resolve to thesame nucleotide but instead identify the location of thetarget site duplication for the insertion (Linheiro andBergman 2012 similar to fig 1) These are cases where theupstream estimate of the breakpoint is slightly higher thanthe downstream estimate of the breakpoint and the differ-ence in position can be seen as the likely span of the target siteduplication However sometimes we were not able to fullyreconstruct a contig but instead generated two contigs onecontaining uniquely aligning sequence and the other contain-ing TE sequence These contigs are known to belong togetherbecause reads forming one contig have their mates in theother contig In these cases we have an approximate but notprecise estimate of the TE breakpoint and are unable toidentify the target site duplication

After the first round of detection we performed a secondround of detection where we examined every location wherea novel TE was detected in any line in all other lines Thisincluded examining positions in one data set that were iden-tified in the other data set In addition we examined thelocations of all known TEs in the D melanogaster genomeFor each location queried in each line we extracted reads

aligning to that location and their mates and used Phrap toreconstruct a contig This allowed us to produce presenceabsence or missing data calls for all positions where TE wasdetected in all lines The absence of a TE is determined byidentifying reconstructed contigs that spanned the insertionposition of the TE by at least 15 bp on either side Theseabsence calls were not included in Mackay et al (2012)This also allows us to identify TEs that may be in regions ofthe genome with lower average sequence coverage comparedwith the rest of the genome provided that insertion is sharedwith another line where it was possible to detect the insertion

Transposable Element Annotation

Annotation of novel TEs was performed by aligning recon-structed contigs with BlastN to the set of TEs found in theD melanogaster reference We chose the best match basedon overall length and BLAST reported e-value and called thenovel TE a TE of that family We also required a minimum of80 bp of contig matching a TE to make an annotation callThis is an additional annotation analysis to the one we per-formed in Mackay et al (2012) For each of the two contigsproduced for each insertion we determined the best matchand then compared these to each other TEs are classified asper the classification scheme put forth in Wicker et al (2007)We annotated each TE insertion to the highest level of agree-ment between the two contigs if the two contigs disagreedwith each other at every level of classification we called thatelement ldquoundeterminedrdquo This could result from nested in-sertions of elements where the 50 end of the insertion is truly adifferent element from the 30 end or because the TE sequenceis degraded and differs substantially from the referencesequence or because there was not enough TE sequence tomeet our annotation guidelines Because we can only recon-struct a maximum of few hundred bases at either end of theinsertion it is difficult to distinguish between these issues Wealso did not attempt to determine the strand of the TE

Validation

A subset of 190 TE insertion sites shared with the D melano-gaster reference were validated via PCR in a subset of nineof the DGRP lines by Blumenstiel JP Chen X He M BergmanCM (unpublished data httparxivorgabs12093456 lastaccessed June 26 2013) To independently check IBD callsusing SNP data we also determined if pairs of lines wereIBD for TEs in the same regions In this comparison we onlyincluded TE insertions where both lines have a positive pres-ence or absence call This can also serve as a validation of ourTE calling pipeline since if two segments are IBD they shouldshare the same set of TEs

Simulation

We selected a set of elements of different sizes and of TEclasses from the set of TEs in the reference sequence andinserted them semi-randomly into the D melanogaster refer-ence genome in silico Insertions were semi-random since weavoided areas already occupied by reference TEs and the cen-tromeric and telomeric regions Read pairs were then

2323

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

generated to construct fastq files We then ran our TE detec-tion pipeline as above through the initial annotation stepThis process was repeated three times to generate a simula-tion of 15 25 and 50 coverage

Recombination Rates

We wrote an R routine to estimate local recombination ratesas a function of physical position Briefly we used ldquocyto-ge-netic-seqtsvrdquo from wwwflybaseorg (last accessed January 302009) to assign a set of physical landmarks to genetic posi-tions For each chromosome arm we then fit a local polyno-mial relating cumulative cM to cumulative basepair (usinglocpoly) Unlike many prior local polynomial fits we madethe added assumption that this polynomial must be a mono-tonically increasing function (using monoproc) an assump-tion that is true for real data We then fit a smooth spline tothe above curve (using smoothspline) The advantage ofsmoothspline is that the resulting curves can be differentiated(and hence estimates of rates obtained directly) Finally we fita smooth spline to the first derivatives obtained from theprior smooth spline largely to make the recombinationrates less noisy near centromeres and telomeres TheR code is freely available at wwwmolpopgenorgData (lastaccessed August 1 2013)

Data Set Restriction

Analyses performed on these data sets were restricted in twoways First we restricted our analyses to regions of thegenome with low TE density and moderate recombinationrates We restricted our data set in this way because regions ofhigh TE density make it more difficult to make accurate callsBecause our pipeline requires unique reads to positionallylocate the TE insertion regions with high TE density canlack enough unique sequence to generate these reads Theregions included in the analyses are X300000ndash208000002L200000ndash20100000 2R2300000ndash21000000 3L100000ndash21900000 3R600000ndash27800000 which are not centromericand telomeric regions This excludes the majority of TEs an-notated in the reference genome since these TEs are largelyclustered together in centromeric regions However this doesnot affect our analysis since we are interested in TEs in theeuchromatic portions of the genome

Second to restrict the set of reference TEs studied to full-length copies we removed from the analysis all reference TEsthat had a length that was less than 75 of the canonicalannotated length of the TE While these elements may stillproduce phenotypic effects these elements are unlikely to beactive copies This reduced the number of reference TEs in-cluded in the analysis to 607 from 1085 TEs that are present inthe reference genome in the regions included in our analysis

Additionally some analyses were performed on the subsetof 23 lines in the DGRP data set which had an average se-quencing coverage of 25 or higher hereafter referred to asDGRP25 This restriction was to mitigate the effect of differ-ences in coverage between the two data sets Simulations oftransposable element detection via our pipeline data not

shown indicate that TE detection at 25 is comparable todetection at 50

Site Count Spectra

We generated derived allele count spectra for TEs in the DSPRdata set and the DGRP25 set of lines We included only siteswhere we were able to make a positive presence or absencecall for each line in the data set so that the number of lineswas normalized for each resource for this analysis Observedvalues were compared with expected values at counts 1 2 34 5 6 and 7 under an infinite sites model via a 2 testCategories 7 and higher were grouped due to low counts insome of these categories We also calculated count spectra forSNPs present in small introns (introns86 bp [Haddrill et al2005]) a group unlikely to have functional effects We alsoused SoFoS a site frequency rescaler (Hufford et al 2012) torescale the DSPR data set to the sample size of the DGRP25from 15 to 23 samples to see if the rescaled data show adifferent distribution of element counts from the observedDGRP25 data We then did a KolmogorovndashSmirnov test tocompare the two distributions as a test of different selectiveeffects acting on the populations

Genomic Context of TEs

Transposable elements were identified in regions of low(lt2 cMMb) or high recombination (2 cMMb) FASTAfiles containing the annotated coordinates of 50 UTRs30 UTRs intronic intergenic and full gene span regions forversion 513 of the D melanogaster reference were down-loaded from flybase (wwwflybaseorg last accessed January30 2009) Version 513 of the reference annotation was usedfor comparative purposes with the DGRP analysis (Mackayet al 2012) Because alternative transcripts can result in thesame genomic position being classified in multiple categorieswe labeled each TE as every appropriate category This oc-curred with 52 of TEs in the DGRP and 60 of TEs in theDSPR When calculating the percentage of the referencegenome in our restricted areas of analysis that were classifiedas each genomic type we followed the same strategy

We compared our observations of TE insertions to a bino-mial distribution to determine whether TEs were distributedrandomly throughout the genome We used the pbinomfunction in R (R Development Core Team 2008) setting thenumber of trials to the total number of TE insertion locationsin the data set and the probability of success to the propor-tion of the genome defined as each context (intron exon30 UTR 50 UTR or intergenic) We did one tailed tests in eachcase For exon 30 UTR and 50 UTR tests we calculated theprobability of observing the observed value or fewer TEs Forintron and intergenic tests we calculated the probability ofobserving the observed value or more TEs

Variables Affecting TE Density

For the DGRP25 and DSPR data sets we calculatedTE density in terms of TEsMb For this analysis we alsoexcluded chromosome 4 We examined the effects of recom-bination rate chromosome and TE element family on TE

2324

Cridland et al doi101093molbevmst129 MBE

density using the standard linear model in R We usedthe lm() function to define our model as followsTEsMb ~ Line + Recombination Rate + Chromosome +Recombination Rate Chromosome We performed thisanalysis both for the set of all TEs and also for 15 TE familieswith moderate to high number of copies in our data set(supplementary table S4 Supplementary Material online)We selected individual TE families for this additional analysisbased on their level of abundance in the data sets and theirhaving been previously studied by Montgomery et al (1987)and Bartolome et al (2002)

Gene Families

We downloaded the Dfam database from (httpwwwindianaedu~hahnlabflyDfamDBdrosophila_frbhtml last acc-essed September 1 2012) This database describes all mem-bers of all gene families clustered using a fuzzy reciprocalBLAST method as described in Hahn et al (2007) for 12Drosophila species We used R to perform a linear regressionusing a simple model where the square root (TE density) in agene is the random variable and log10 (gene family size) is theindependent variable

Variation between Resources

For each TE present in both data sets we compared thefrequency of the TE in each population using Fisherrsquos exacttest We excluded from this analysis all insertions where wewere unable to make a presenceabsence call in at least 100 ofthe DGRP lines or 13 of the DSPR lines For each gene with TEinsertions in both data sets we compared the frequency atwhich the gene contains TEs between data sets using aFisherrsquos exact test

Supplementary MaterialSupplementary tables S1ndashS6 and figures S1 and S2 are avail-able at Molecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Casey Bergman and Miaomiao He for earlyaccess to their data as well as helpful comments from CBThey also thank Rebekah Rogers for help with editing themanuscript This work was supported by the NationalInstitute of Health grant NIH R01-GM085183 to KRT theNational Library of Medicine - National Institute of Healthtraining grant LM007443 to JMC and the National Instituteof Health grant NIH R01-OD010974 to ADL and SJM

ReferencesAltschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic local

alignment search tool J Mol Biol 215403ndash410Aminetzach YT Macpherson JM Petrov DA 2005 Pesticide resistance

via transposition-mediated adaptive gene truncation in DrosophilaScience 309764ndash767

Bartolome C Maside X Charlesworth B 2002 On the abundance anddistribution of transposable elements in the genome of Drosophilamelanogaster Mol Biol Evol 19926ndash937

Bennetzen JL 2000 Transposable element contributions to plant geneand genome evolution Plant Mol Biol 42251ndash269

Bergman CM Bensasson D 2007 Recent LTR retrotransposon insertioncontrasts with waves of non-LTR insertion since speciation inDrosophila melanogaster Proc Natl Acad Sci U S A 10411340ndash11345

Biemont C Vieira C 2006 Junk DNA as an evolutionary force Nature443521ndash524

Birchler JA Hiebert JC 1989 Interaction of the Enhancer of white-apricotwith transposable element alleles at the white locus in Drosophilamelanogaster Genetics 122129ndash138

Birchler JA Hiebert JC Rabinow L 1989 Interaction of the mottler ofwhite with transposable element alleles at the white locus inDrosophila melanogaster Genes Dev 373ndash84

Bowen NJ McDonald JF 2001 Drosophila euchromatic LTR retrotran-sposons are much younger than the host species in which theyreside Genome Res 111527ndash1540

Carr M Soloway JR Robinson TE Brookfield JF 2002 Mechanisms reg-ulating copy numbers of six LTR retrotransposons in the genome ofDrosophila melanogaster Chromosoma 110511ndash518

Charlesworth B 1994 The effect of background selection against dele-terious mutations on weakly selected linked variants Genet Res 63213ndash227

Charlesworth B Jarne P Assimacopoulos S 1994 The distribution oftransposable elements within and between chromosomes in a pop-ulation of Drosophila melanogaster III Element abundances in het-erochromatin Genet Res 64183ndash197

Charlesworth B Langley CH 1989 The population genetics ofDrosophila transposable elements Annu Rev Genet 23251ndash287

Charlesworth B Lapid A 1989 A study of ten families of transposableelements on X chromosomes from a population of Drosophila mel-anogaster Genet Res 54113ndash125

Charlesworth B Lapid A Canada D 1992 The distribution of transpos-able elements within and between chromosomes in a population ofDrosophila melanogaster I Element frequencies and distributionGenet Res 60103ndash114

Chen GK Marjoram P Wall JD 2009 Fast and flexible simulation ofDNA sequence data Genome Res 19136ndash142

Chung H Bogwitz MR McCart C Andrianopoulos A Ffrench-ConstantRH Batterham P Daborn PJ 2007 Cis-regulatory elements in theaccord retrotransposon result in tissue-specific expression of theDrosophila melanogaster insecticide resistance gene Cyp6g1Genetics 1751071ndash1077

Cutter AD Good JM Pappas CT Saunders MA Starrett DM Wheeler TJ2005 Transposable element orientation bias in the Drosophila mel-anogaster genome J Mol Evol 6733ndash741

Daborn P Boundy S Yen J Pittendrigh B Ffrench-Constant R 2001DDT resistance in Drosophila correlates with Cyp6g1 over-expressionand confers cross-resistance to the neonicotinoid imidacloprid MolGenet Genomics 266556ndash563

Deloger M Cavalli FMG Lerat E Biemont C Sagot M-F Vieira C 2009Identification of expressed transposable element insertionsin the sequenced genome of Drosophila melanogaster Gene 43955ndash62

Dworkin I Palsson A Gibson G 2005 Replication of an Egfr-wing shapeassociation in a wild-caught cohort of Drosophila melanogasterGenetics 1692115ndash2125

Ewing B Green P 1998 Base-calling of automated sequencer tracesusing Phred II Error probabilities Genome Res 8186ndash194

Gruber JD Genissel A Macdonald SJ Long AD 2007 How repeatable areassociations between polymorphisms in achaete-scute and bristlenumber variation in Drosophila Genetics 1751987ndash1997

Haddrill PR Charlesworth B Halligan DL Andolfatto P 2005 Patterns ofintron sequence evolution in Drosophila are dependent upon lengthand GC content Genome Biol 6R67

Hahn MW Han MV Han S-G 2007 Gene family evolution across 12Drosophila genomes PLoS Genet 3e197

Hirschhorn JN Daly MJ 2005 Genome-wide association stud-ies for common diseases and complex traits Nat Rev Genet 695ndash108

2325

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 5: Abundance and Distribution of Transposable Elements in Two ...

of insertions (table 1) RNA elements were much morecommon than DNA elements and within RNA elementsover half were long terminal repeat (LTR) retrotransposonelements We were unable to identify element class for347 of the DGRP insertions and 320 of the DSPR inser-tions In cases where we were unable to identify the TE familyfor an insertion this was due to either conflict between themost likely TE as identified by the two reconstructed contigsfor each event or because we had limited data on the TEsequence for that event In cases where we have conflictinginformation this was due to 1) the degradation of the TEsequence so that there were many poor matches to multiplefamilies 2) conserved sequence between families and there-fore good matches to multiple families 3) because we iden-tified a nest of elements where multiple elements insert invery close proximity or within one another or 4) the amountof TE sequence we were able to reconstruct was less than

80 bp and therefore below our annotation threshold lengthBecause of the frequently short length of TE sequence that weattempted to identify we did not attempt to identify thestrand of the TE insertion Although TE orientation is impor-tant and influences selection upon the insertion (Cutter et al2005) our data were frequently insufficient to resolve strand

Factors Influencing TE Density

If TE insertions are deleterious and recessive with respect tofitness then insertions are also predicted to be morecommon on the autosomes since deleterious insertions onthe X will be eliminated via selection against hemizygousmales (Charlesworth and Langley 1989) Contrary to this pre-diction an analysis of variance (ANOVA) indicated a signifi-cant effect of chromosome on mean TE density per line whenboth resources were included in the same analysis (table 3)This difference was due to an increased mean density of TEson the X versus the autosomes in regions of high recombi-nation This pattern was the same both for autosomes takenas a group and for each autosome individually (table 4)

TEs are predicted to accumulate in regions of low recom-bination where they are less likely to produce deleteriousrearrangements via ectopic exchange (Charlesworth andLangley 1989) An ANOVA indicates a highly significanteffect of recombination rate on mean TE density per line(table 3) TE density was also found to be greater in regionsof low recombination than in high recombination through-out the genome which was consistent with the predictions ofCharlesworth and Langley (1989)

Because the factors that influence TE density may differgreatly from family to family we repeated this analysis in 14families of TEs that appear in moderate to high abundance inthe two resources (table 5) Many of these families were pre-viously examined by Montgomery et al (1987) and Bartolomeet al (2002)

We found that in 10 families (412 roo 176 F Bari1 copiahopper INE-1 mdg1 springer) there was a highly significantdifference in mean TE density between the X and the auto-somes (P value 0026 in all cases) although the direction ofthe difference varied among families Within these familiesthere were substantial differences In two of the familiesINE-1 and roo we saw an increased density in the X overthe autosomes in regions of high recombination in boththe DSPR and DGRP25 resource (table 5) However for 412F mdg1 springer Bari1 and copia we saw the opposite pat-tern This suggests that there may be substantial variation inTE insertion site preferences among families

Sample Counts of Transposable Elements

The power to detect an association is dependent both on thepenetrance and the frequency of the causal allele (Hirschhornand Daly 2005) We found that the majority of transposableelements in the DGRP25 and the DSPR are found in only oneline (fig 3) Comparisons between the observed TE data andSNPs from the same population found in introns 86 bp(Haddrill et al 2005) indicated an excess of TEs (DGRP252 test Pamp 0 df = 6 DSPR 2 test Pamp 0 df = 6)

0 20 40 60 80 100

020

040

060

080

010

00

Coverage

TE

s

DSPRDGRP gt 25x coverageDGRP lt 25x coverage

FIG 2 Total number of TEs identified versus coverage for the DGRPDGRP25 and DSPR lines

Table 1 Summary of TE Insertions

DGRP DGRP25 DSPR

Total 17639 5855 7104

Not present in reference 17346 5615 6812

Present in reference 293 240 292

X recombination 2 cMMb 418 149 203

X recombinationgt 2 cMMb 2741 948 1107

Autosomes recombination 2 cMMb 6205 2131 2694

Autosomes recombinationgt 2 cMMb 8177 2544 3019

4 all 98 83 81

Exon 1158 378 633

Intron 8595 2870 3310

30 UTR 477 159 234

50 UTR 190 63 81

Intergenic 7219 2680 3269

DNA elements 1388 514 748

RNA elements 10133 3004 4085

Indeterminate 6118 2336 2271

2315

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Comparisons between the observed data and the expectednumber of events at the same count under an infinite sitesmodel of an equilibrium Wright-Fisher population (Wakeley2009 p 54ndash56) also indicated an excess of rare alleles(DGRP25 2 test Pamp 0 df = 6 DSPR 2 test Pamp 0df = 6) In addition to overall site count spectra we also gen-erated spectra for the X autosomes regions of high and lowrecombination DNA elements and RNA elements In eachcase we observed the same pattern as seen in the overallspectrum (data not shown) We also compared the distribu-tion of counts in the DGRP25 with the rescaled distribution ofcounts in the DSPR rescaled to the same sample size as theDGRP25 A KolmogorovndashSmirnov test comparing these twodistributions showed that they did not differ (P = 021) Wedid see more overall TEs in every count category in therescaled DSPR data over the DGRP25 data set

Patterns of Transposable Element Insertions betweenGenes

For association studies genotyping TE insertions withincoding regions is of interest due to the potentially large

effect of an insertion on gene function A total of 2931genes (213 of annotated genes) had insertions in theDGRP resource These are insertions which exist anywherewithin the span of the gene from 50 UTR to 30 UTR Onethousand nine hundred fifteen genes 139 had insertions inthe DSPR with 1317 genes having insertions in bothresources The number of TE insertions within genes variedsubstantially between individual genes up to a maximum of78 insertions though the majority of genes with insertionshad only one insertion 631 in the DSPR and 527 in theDGRP The number and locations of individual insertions in agene varied tremendously though the majority of insertionsthat fell within gene regions were in introns 777 DSPR and824 DGRP TE number was strongly correlated with intronsize (Pearsonrsquos correlation coefficient = 079 for DSPR 087for DGRP Pamp 0 for both tests) (supplementary fig S2Supplementary Material online)

Examining the pattern of insertions in individual genes inthe DSPR and DGRP data sets we found that RNA-bindingprotein 6 had the highest number of insertions 62 in theDGRP lines and 21 in the DSPR lines These insertions were

Table 3 ANOVAs for DSRP and DGRP25 Coverage and Comparison between the Two Data Sets

df Sum Sq Mean Sq F value Pr(gtF)

DSPR vs DGRP25

Set 1 54125 54125 6486975 lt2e16

Line 36 2848 791 94817 lt2e16

Chromosome 4 11858 2965 355313 lt2e16

Recombination rate (high vs low) 1 25575 25575 3065207 lt2e16

Chromosomerecombination rate 4 10335 2584 309663 lt2e16

Residuals 333 27784 083

DSPR

Line 14 26599 19 17894 004699

Chromosome 4 39632 9908 93316 123e06

Recombination rate (high vs low) 1 141337 141337 1331143 lt2e16

Chromosomerecombination rate 4 39712 9928 93504 119e06

Residuals 126 133783 1062

DGRP25

Line 22 258202 11736 1723 lt2e16

Chromosome 4 80126 20031 29409 lt2e16

Recombination rate (high vs low) 1 120011 120011 176191 lt2e16

Chromosomerecombination rate 4 66056 16514 24244 243e16

Residuals 198 134866 0681

Table 2 Differences between Observed and Expected TE Counts

DGRP DGRP25 DSPRObserved vs Expected P Value Observed vs Expected P Value Observed vs Expected P Value

X recombination 2 cMMb Decrease 329E11 Decrease 187E03 Decrease 522E02

X recombinationgt 2 cMMb Decrease 174E01 Increase 701E02 Increase 416E01

Autosomes recombination 2 cMMb Decrease 370E01 Increase 100E + 00 Increase 687E04

Autosomes recombinationgt 2 cMMb Increase 392E02 Decrease 100E + 00 Decrease 100E + 00

Exon Decrease 000E + 00 Decrease 247E323 Decrease 103E283

Intron Increase 354E103 Increase 000E + 00 Increase 000E + 00

30 UTR Decrease 699E16 Decrease 313E06 Decrease 126E02

50 UTR Decrease 143E51 Decrease 282E18 Decrease 498E20

Intergenic Increase 112E32 Increase 0 Increase 000E + 00

2316

Cridland et al doi101093molbevmst129 MBE

typically unique to a single line though there were 3 insertionsin more than one line in the DSPR and 14 insertions in morethan one line in the DGRP

Some genes accumulated many TE insertions such asklarsicht (fig 4A gene images from the UCSC genome brow-ser Meyer et al 2013) which had 29 total insertions This genealso had a hotspot of insertions with seven independent in-sertions each at low frequency located within 36 kb of each

other Only one of these insertions was present in any givenline and these could be functionally equivalent though inde-pendently arising mutations Other genes had only a few in-sertions such as Notch (fig 4B) and Delta (fig 4C) Theinsertions in these genes were also at low frequency andthe few present were located in intronic regions

While most insertions exist in only one panel some werefound in both Cyp6a20 (fig 4D) contains a high frequencynon-reference TE insertions present in both data sets in 4lines out of 121 lines where we were able to make a presenceor absence call in the DGRP (hereafter shown as 4121) and314 lines in the DSPR Both Cyp6a20 and klarsicht are exam-ples of genes with TE insertions in exons which was

00

02

04

06

08

10

1 2 3 4 5 6 7+

Observed TEsObserved SNPsExpected

DSPR

00

02

04

06

08

10

1 2 3 4 5 6 7+

Observed TEsObserved SNPsExpected

DGRP

FIG 3 Derived allele count spectra for the DSRP lines and the DGRP25lines where a positive presence or absence call was made for eachinsertion in each line 6613 insertions in the DSPR and 3274 in theDGRP25 Count spectra for SNPs is from SNPs in introns 86 bp 2

tests between observed and expected distributions result in Pamp 0 forcomparisons between TEs and the neutral model as well as between TEsand SNPs for both data sets

Table 5 TE Density for 15 Individual Families of Elements

Mean Density (TEMb)

Element Resource X High X Low Auto High Auto Low

roo DGRP25 039 037 032 033DSPR 080 088 065 068

297 DGRP25 005 000 002 001DSPR 005 000 004 000

412 DGRP25 003 001 003 007DSPR 006 008 009 012

F DGRP25 002 006 004 006DSPR 007 008 012 013

176 DGRP25 001 000 001 002DSPR 002 000 002 006

Bari1 DGRP25 001 000 004 001DSPR 002 002 006 002

copia DGRP25 001 000 002 003DSPR 009 002 014 019

H DGRP25 001 000 001 000DSPR 000 000 001 000

hopper DGRP25 010 001 000 000DSPR 010 004 002 002

INE-1 DGRP25 042 188 003 048DSPR 045 203 004 055

jockey DGRP25 015 012 019 013DSPR 035 034 040 033

mdg1 DGRP25 003 002 003 010DSPR 008 004 012 018

pogo DGRP25 006 004 007 005DSPR 013 021 015 013

springer DGRP25 000 001 001 003DSPR 002 000 001 005

Table 4 TE Density in the X and Autosomes

TEMb

DGRP25 DSPR Both

X all 382 638 483

Autosomes all 351 593 447

X high recombination 371 600 461

X low recombination 393 676 505

2L high recombination 227 473 324

2L low recombination 323 589 428

2R high recombination 282 489 364

2R low recombination 617 863 714

3L high recombination 296 523 386

3L low recombination 474 726 573

3R high recombination 281 468 355

3R low recombination 372 670 490

2317

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

uncommon both in this data set and in previous studies(Kaminker et al 2002 Lipatov et al 2005)

Differences in the strength of selection against closelyrelated genes can also be illustrated by patterns of TE inser-tions The paralogs derailed (fig 4F) and derailed-2 (fig 4E)showed very different patterns of TE insertions though thesame pattern was seen in both resources derailed-2 locatedon chromosome 2R is in a region of moderate recombination199 cMMb whereas derailed is located towards the distalend of 2L in a region of low recombination 044 cMMb

Given the context of the recombination rates the expectationwould be that derailed would have a higher TE load sincedeleterious alleles are removed more efficiently in regions ofhigh recombination (Hudson and Kaplan 1995) but theopposite pattern was observed

While both genes play a role in Wnt5 signaling mutationsin derailed can cause major phenotypic changes in Drosophilanervous system development resulting in the loss of normalfunction (Yoshikawa et al 2003) derailed-2 when mutatedproduces only minor differences in neuron positioning

115 115115 215 115115

114 1122

11181120 11181124

2117

21182119

1122

1119 1118

2118

1121 3114

1115

1114

1119

1113

1119

1120

1120

1119

1120

3119

1119

A klarsichtDGRP

DSPR

Insertion in an Exon

314 1195

4121D Cyp6a20

315

1123

1415

56125

115115

415

115615

10124

1124 1124 1123

2124E derailed-2

B Notch

115 11211122

C Delta1114 1120

191

F derailed

Same insertion in both resources

FIG 4 Transposable element insertions in genes The frequency above each insertion is the number of lines in which the element is present over thenumber of lines in which the element is validated as either present or absent Gene images are from the UCSC genome browser (httpgenomeucscedu last accessed June 31 2012)

2318

Cridland et al doi101093molbevmst129 MBE

(Sakurai et al 2009) This suggests that either the strength ofselection may be much higher against insertions in derailed orthe insertion rates are unequal between the paralogs

Gene Family Size and TEs

We used available information on gene families in DrosophilaHahn et al (2007) grouped into families based on sequencesimilarity and found a very weak but significant effect of genefamily size on the average number of TE insertions per kbin both resources (P value = 00003 R2 = 0001 DSPRP value = 0002 R2 = 0007 DGRP) (fig 5A and B) Althoughthere was a large variance in the number of insertions per kb

in both resources (fig 5A and B) the average number ofTEskb increased as family size increased up to a moderatelylarge family size

Patterns of Variation between Resources

To replicate an association with a single causative allele thatallele must both exist in both panels and also be at highenough frequency to detect the association Given thatmost TEs were singletons and only 45 of insertion eventswere identified in both panels and were within genes thereplicability of associations with these alleles is likely to below if alleles are considered individually We examined sharedinsertions segregating at different frequencies between panelsand found only two TE insertions in genes that were segre-gating at different frequencies in the DGRP versus DSPRresources following Bonferroni correction (Fisherrsquos exacttests P 000007) One of these was in an exon of bothCG13175 and CG33964 a pair of overlapping genes Herethe TE insertion was at much higher frequency in theDGRP than in the DSPR (89122 vs 215) The other insertionwas in an intron of Caliban and the frequency of this insertionwas higher in the DSPR than the DGRP (1115 vs 7116)

If insertions are considered as a class of mutation thenthere may be better power to detect an association assumingthat different insertions contribute in the same way to thephenotype We also examined genes with TE insertions(139 of genes) though not necessarily the same insertionin both resources Treating all insertion alleles within a gene asequivalent a comparison between resources showed thatthere were 21 genes which contain TEs at different frequen-cies between the DGRP versus DSPR following Bonferronicorrection (Fisherrsquos exact tests P 0000038197) includingderailed-2 (supplementary table S6 Supplementary Materialonline)

Discussion

Distribution and Abundance of TransposableElements

Previous work on transposable elements has generally beenfocused on a few families of TEs (Montgomery et al 1987Charlesworth and Lapid 1989 Charlesworth et al 1992) orfocused only on a single genome (Bartolome et al 2002Kaminker et al 2002 Rizzon et al 2002) We have presenteda genomic analysis of TEs across a large number of lines andtwo data sets We found that transposable elements in bothresources are primarily found as singletons with a few mod-erate to high frequency insertions Transposable elements alsosegregated at lower allele frequencies than do presumed non-functional SNPs drawn from the same data set suggestingthat negative selection is acting upon TEs However theremay be substantial variation in the age of transposable ele-ments in different families many LTR elements are thought tobe young insertions (Bowen and McDonald 2001 Bergmanand Bensasson 2007) while other families like INE-1 arethought to be quite old (Kapitonov and Jurka 2003) Thismay create a situation where the young age of insertions isresponsible for the excess of rare alleles We found a

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

DSPR

DGRPA

B

FIG 5 log10(TE density) versus log10(Gene family size)

2319

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

non-random distribution of TE insertions across the genomewith increases of TEs in both the intergenic and intronicregions of the genome and decreases in exonic and UTRregions (Kaminker et al 2002 Lipatov et al 2005) Similar toprevious studies we found variation in TE density betweenchromosomes and higher densities of TEs in regions of lowrecombination (Charlesworth et al 1994 Bartolome et al2002) We also found increased TE densities on the X relativeto the autosomes both in overall insertions and in someindividual TE families (table 5)

Detection method and the size of the data set are likely tobe contributing factors to differences between previous workand this study Many previous studies have used in situ meth-ods to ascertain TE number in samples and have looked atelement accumulation at the base versus the midsections ofchromosomes (Montgomery et al 1987 Charlesworth et al1992) or have looked only at one genome (Bartolome et al2002 Rizzon et al 2002) A previous PCR-based detectionstudy found no difference in TE frequency between the Xand the autosomes in natural populations (Petrov et al2011) but this study was biased in that it only looked atTEs shared with the reference sequence and thus did notcapture the full picture of TEs in these populations

While our analysis does disagree with previous work thesedisagreements may be mainly due to the way in which regionsof the genome are divided for examination Rizzon et al(2002) found an increase in copies of LTR retrotransposonson the X when they excluded pericentromeric telomeric andchromosome 4 from their analysis though they found a def-icit of TEs on the X when these regions are included Thisfinding is most directly comparable to our own finding for theentire TE data set since they exclude a very similar set ofgenomic regions We do agree with previous work in ourfinding of higher densities of TEs in regions of low recombi-nation than in high recombination (Charlesworth et al 1994Bartolome et al 2002)

Examining individual element families Carr et al (2002)found an increase of mdg-3 elements and 297 elements on theX than would be expected under a random insertion modelMontgomery et al (1987) found no evidence for a reductionin TEs on the X for both roo and 297 elements but foundfewer TEs on the X when looking at the 412 family HoweverCharlesworth et al (1992) observed the opposite pattern forroo and 297 Montgomery et al (1991) also found that moreroo elements in lab kept lines than they had previously ob-served in natural populations We found significantly in-creased densities on the X for roo and significantlyincreased densities on the autosomes for 412

Differences between previous studies of individual elementfamilies and our results may be due to a variety of reasonsFirst we were able to annotate only about two-thirds of ourTE insertions This was primarily due to having insufficientsequence to reliably annotate the element or because inde-pendent annotation using sequence reconstructed fromeither end of the insertion did not yield the same elementfamily annotation This means that our density calculationsfor individual elements may be skewed by missing annotationinformation Second many previous studies (Montgomery

et al 1987 Charlesworth et al 1992 Carr et al 2002) examinedonly LTR elements These elements are young (Bowen andMcDonald 2001 Bergman and Bensasson 2007) and thus in-dividual copies may not yet have been removed from thegenome thereby displaying different patterns of insertionthan older active elements

What is clear is that there is great variability in TE densityboth within and between element families as well as withinand between chromosomes It may also be that the overallpattern is driven by a small group of families In our data setINE-1 shows the largest pattern of difference between the Xand autosomes with an order of magnitude increase in meanTE density per line on the X over the autosomes possibly dueto an increased likelihood of fixation of insertions (Omettoet al 2005) specifically on the X chromosome (Presgraves2006) These INE-1 elements are a subset of the largest TEfamily in D melanogaster though only several dozen of themare in the euchromatin INE-1 elements are also considered tobe inactive in D melanogaster and likely have not been mobilefor gt3 million years (Kapitonov and Jurka 2003) Consistentwith this most of the INE-1 elements we detected were fixedor nearly fixed in our data set We did see some examples oflow frequency insertions which may be due to more recenttranspositional events or incorrect TE annotation due toshort sequences The old age of INE-1 insertions also suggeststhat the extant elements are unlikely to be selected againstAn increased density of TEs on the X compared with theautosomes suggests that it is not the fitness effects of inser-tions which govern TE copy number Fitness effects wouldresult in the more rapid removal of insertions on the X thaninsertions on the autosomes due to the increased selectivepressures experienced on the X (Montgomery et al 1987)Background selection also predicts that the X should be lessburdened by deleterious recessives since it is hemizygous onethird of the time (Charlesworth 1994)

However it is unclear whether our observation of in-creased densities of TEs on the X relative to the autosomesfits with the ectopic exchange model This model predictsthat TEs will accumulate in regions of low ectopic exchange(Langley et al 1988) The general assumption in the literatureis that ectopic exchange is positively correlated with recom-bination rate (Langley et al 1988 Montgomery et al 1991)The X chromosome experiences higher levels of recombina-tion than the autosomes with an average of 36 cMMb forthe X in regions of high recombination (cMMbgt 2) whereasthe autosomes have a mean of 336 cMMb If recombinationrate and ectopic exchange are positively correlated through-out the genome then the expectation would be fewer TEs inthe middle regions of the X where recombination rates areknown to be high This is the opposite of what we observe Ifrecombination rate and the rate of ectopic exchange are notpositively correlated then the ectopic exchange model maybe correct but data are lacking to properly address this issueCurrently there are no direct measurements of the rate ofectopic exchange on a genome-wide scale in Drosophiladuring meiosis Montgomery et al (1991) observed littleectopic exchange between homozygous chromosomes inDrosophila melanogaster on the X chromosome in the

2320

Cridland et al doi101093molbevmst129 MBE

region around the white locus They also found that whilerecombination rate was reduced in both centromeric andtelomeric regions only the centromeric region displayed anincreased density of TEs This information is suggestive ofdifferences between these regions in the rate of ectopicexchange

The DGRP and DSPR are also composed of inbred lineswhich have been kept in laboratory conditions for some time~9 years in the case of the DGRP lines and 40+ years in thecase of the DSPR Given that TE insertion mutations accumu-late rapidly and in an increasing manner over time (Nuzhdinet al 1997) the increased density of TEs on the X versus theautosomes may not be surprising if there is a highertransposition rate to the X The only direct study to datefocused on roo elements in D melanogaster and did observehigher rates of insertion in the X than the autosomes(Vazquez et al 2007)

Transposable Elements in Genes

The majority of genes 813 do not have any TE insertions inany line studied and intron length is a good predictor of thenumber of TEs within a gene However a few genes accumu-late many insertions even after correcting for gene size andintron length and some genes have very few TEs even thoughthey have large introns The reasons why different genes mayaccumulate TE insertions may be quite different for individualgenes In some cases there may be hot spots for insertionswhich if insertions are neutral or only slightly deleterious mayresult in large number of insertions in a small region In othergenes the lack of TE accumulation may suggest selectivepressures against TE accumulation

Gene family size may also play a role in TE accumulationWe see an increase in TE density correlated with an increase ingene family size This may indicate differences in selectivepressures for genes in different sized families Single-copygenes may experience more constraint whereas in moder-ately sized gene families these constraints may be more re-laxed perhaps due to redundancy

In the cytochrome p450 functional family of genes we findthat 3789 of these genes have TE insertions 2589 in theDGRP and 2089 in the DSPR Cytochrome p450 is also foundto be in an enriched term in the DSPR for the set of genescontaining TEs The cytochrome p450 gene Cyp6g1 is knownto confer resistance to DDT (Daborn et al 2001) and TE-mediated copy number variation is associated with increasingresistance to DDT (Schmidt et al 2010) This pattern of TEinsertions in the DSPR may be reflective of the locationswhere DSPR lines were originally collected The DSPR lineswere gathered from various locations worldwide at a timewhen different pesticides were in use The DGRP lines weregathered from a single geographic area in North America in2003 Presumably these sets of lines have experienced differ-ent selective pressures with respect to pesticide resistance InCyp6a20 (fig 4D) where TEs are segregating at different fre-quencies in the two resources some similar effect may beoccurring

TE Detection Pipelines

The TE detection strategy used here is an improvement uponthe strategy used in Mackay et al (2012) The most notabledifferences between the data reported in Mackay et al (2012)and the DGRP data reported here is substantial amount ofnew information First our pipeline improvements agree wellwith the previous implementation with 812 of previous TEpresence calls identified by the new pipeline However thesecalls represent on 182 of the presence calls made by theimproved pipeline demonstrating that we have identifiedsignificantly more calls than before The incorporation of TEabsence calls also allows us to provide a much more completepicture of TEs in euchromatic sequence in the DGRP

The TE detection pipeline used here is similar to otherpipelines which have been employed but it is useful to de-scribe briefly here the differences in approaches and why wefollowed the approach presented here A useful discussion ofcommon approaches to TE detection in many species can befound in Xing et al (2013) One common method for TEdetection is the split-read mapping approach which hasbeen employed in Drosophila by Linheiro and Bergman(2012) This approach identifies TE insertions by locating in-dividual sequence reads which span TE insertion breakpointsThis approach is the strategy of choice for next-generationsequencing experiments which produce only single-endreads However in a paired-end read situation this approachmay not take full advantage of the available data Specificallythe size of window around the TE insertion breakpoint whereinformative reads can be found is governed by read length in asingle-end experiment and fragment length in a paired-endexperiment Linheiro and Bergman (2012) discuss the impor-tance of read length on TE detection and suggest that in-creased read length improves TE detection In situationswhere read length is long and coverage is deep enough thesplit-read method may be preferable to the paired-end de-tection strategy However for the data sets analyzed herewhere read length is much shorter than fragment lengththe paired-end detection strategy will capture moreinformation

The pipeline used here also has similarities to the pipelineused in Kofler et al (2012) However there are two majordifferences in these studies First Kofler et al (2012) do notattempt to reconstruct their insertion breakpoints andsecond the population studied by Kofler et al (2012) wassequenced as a pool rather than as individual lines Becausetheir pooled sequence data reflect on average less than 1sequence coverage per line Kofler et al (2012) are primarilyable to detect intermediate to high frequency TE insertionstheir own analysis concludes that they cannot reliably detectinsertions at frequency lt7 (see supplementary fig S2Supplementary Material online in their supplementary dis-cussion) However the literature to date (Charlesworth andLangley 1989 Charlesworth and Lapid 1989 Charlesworthet al 1992) as well as the results of this study indicate thatthe majority of TEs in a population are found at very lowfrequency often in only one individual in a population andtherefore beyond the scope of detection of Kofler et al (2012)

2321

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Caveats

The D melanogaster reference genome to which we alignedour data is a P-element-free genome Therefore the use of thisreference will be biased against detecting any P elementsexisting in the data sets In the case of the DSPR data setthis is not an issue since these lines are also all P-element-free(Macdonald and Long 2007) However it is likely that theDGRP lines do harbor P elements For the purpose of theseanalyses we decided to ignore P elements and focus on theset of element families common to the two resources

This situation does highlight the importance of having anappropriate reference sequence To detect P elements in thegenome of D melanogaster an artificially constructed refer-ence sequence or a different assembled sequence containingthe P element would need to be utilized This also brings upthe question about as yet undetected elements that exist atlow frequency If these elements do not have sufficient sim-ilarity to existing elements to be aligned to the referencegenome these elements will go undetected

Conclusions

We have presented a description of transposable elements intwo recently released QTL resources Transposable elementscan play important roles in gene function and regulation andincorporation of TE insertion information into any analysisperformed with either of these two resources will present amore complete picture of genomic variation and its contri-bution to complex traits In addition it will be important forassociation studies attempting to replicate genotypendashpheno-type associations utilizing these resources to be aware of thepotential contributions of TEs

We have also demonstrated a next-generation high-throughput sequencing analysis pipeline that is capable ofdetecting this type variation with a high rate of precisionand specificity We suggest that future studies using next-generation sequencing data utilize this technique

Materials and Methods

Data Sets

Two data sets were used in this analysis The first set consistsof 131 inbred lines from the Drosophila Genetic ReferencePanel (DGRP) (Mackay et al 2012) SRA accessionSRP000694 The second set consists of 15 lines used to es-tablish the Drosophila Synthetic Population Resource (DSPR)(King et al 2012a) initially described in Macdonald and Long(2007) SRA accession SRA051316 For the DGRP resourceboth 454 and Illumina data were available whereas onlyIllumina data are available for the DSPR We utilized onlythe Illumina sequence data to ensure consistency in datacollection across data sets We calculated a mean coverageof 21 for the DGRP lines and 50 for the DSPR linesVariation in coverage for the DGRP lines was substantial be-tween lines ranging from ~4 average coverage to nearly50 average coverage We therefore dropped from the anal-ysis four lines with an average coverage of 10 or less sincelow coverage can cause difficulties with assembly and TEidentification Sequence data for both data sets consisted of

paired-end data with 54 bp reads for the DSPR and paired-end 100 bp reads trimmed to 75 bp for DGRP

IBD in DGRP Data Set

We downloaded the SNP tables generated by the DGRPproject from httpwwwhgscbcmtmceduprojectsdgrpfreeze1_July_2010snp_callsIllumina (last accessed July 302010) to examine identity by descent in this data set Weperformed an all by all comparison between lines examiningsliding windows of 1 Mb across the genome with 100 kb stepsbetween windows We take a simple definition of IBD hereWhenever we identified two lines that weregt95 identity atSNP positions in a 1 Mb region we labeled the pair as IBD inthe identified region and masked that region of the genomein the line with lower coverage We also dropped entirelyfrom our analysis any lines where gt50 of the genome wasdetermined to be IBD with other lines in the population(supplementary table S1 Supplementary Material online) re-moving from the analysis the line with lower average se-quence coverage

We modeled 1 Mb regions of 200 chromosomes in a coa-lescent simulation using the parameters theta = 001site and4Nr = 10theta to determine the number of regions of95IBD under a Wright-Fisher population model using the MACSsoftware (Chen et al 2009) We simulated 1000 replicates ofthis simulation and tabulated the number of 1 Mb regionsshowing 95 IBD in each replicate

Alignment

We aligned reads to the D melanogaster reference genome(version 513 downloaded from FlyBase flybaseorg) using thealigner BWA (version 059 Li and Durbin 2009) We used thefollowing parameters (aln -t 8 -l 13 -m 50000000 -R 5000)followed by the command ldquosamperdquo to resolve paired endmappings (using parameters -a 5000 -N 5000 -n 500) It isimportant when detecting TEs that the -R during the initialalignment and the -N parameter during the ldquosamperdquo phase beset high otherwise highly repetitive sequences that would beinformative of the presence of a common element will beexcluded from the data by the aligner We also used the -Ioption for the DGRP lines but not the DSPR due to the dif-ferences in Illumina quality score output format as the twodata sets were sequenced at different times The -m parameterfrom the initial alignment step and the -n parameter from thepaired-end resolution step deal with how BWA treats multiplyaligning reads An additional caveat is that some TE familiesmost notably the P-element family is absent from the Dmelanogaster reference sequence Such families in the se-quenced genomes will not align to the reference genome weused even if they are present in the sampled lines

Transposable Element Detection

Transposable elements were detected by first identifying allreads that were aligned to an annotated TE in the referencegenome (supplementary fig S1 Supplementary Material on-line) We then identified the mates of these reads and selectedonly those mates which did not align to an annotated TE

2322

Cridland et al doi101093molbevmst129 MBE

These reads identify the set of read-pairs which span a TEinsertion The uniquely mapping non-TE reads were thenclustered based on their start position and chromosomalstrand to identify putative TEs Clusters on the plus strandwere matched to clusters on the minus strand to producepairs of clusters indicating either end of the putative TE Onceclusters were identified we filtered out all events which con-tained fewer than three read-pairs consistent with a giveninsertion for each cluster This means that putative eventsat this stage have a minimum of six read-pairs indicating theinsertion three on the plus strand and three on the minusstrand

We then extracted all reads and their mates that aligned tothe identified regions of the genome and attempted toreconstruct the local area including the breakpoint of theTE Because many TEs have repetitive regions at the ends ofthe element we reconstructed the two breakpoints of the TEseparately (see also supplementary fig 23 from Mackay et al2012) Local reconstruction was done by running Phrap (ver-sion 1090518 Ewing and Green 1998) with the followingparameters (-forcelevel 10 -minscore 10 -minmatch 10)This set of parameters causes Phrap to do a Smith-Waterman comparison when it finds matches of at least 10characters but then relaxes the minimum alignment scoreand the final assembly parameters This is necessary to recon-struct TEs because repetitive sequence within the TEs canresult in a failure to reconstruct a contig if stricter assemblyparameters are used All code we developed for the detectionof TEs will be freely available at wwwmolpopgenorgData(last accessed August 1 2013)

Following reconstruction we used BlastN (verson 2222) tocompare reconstructed contigs to the D melanogaster refer-ence to confirm the presence of a TE In most cases we wereable to fully reconstruct both breakpoints of the TE Fullyreconstructed indicates that we could reconstruct a singlecontig which contained the breakpoint of the TE In mostcases the two breakpoints of the TE did not resolve to thesame nucleotide but instead identify the location of thetarget site duplication for the insertion (Linheiro andBergman 2012 similar to fig 1) These are cases where theupstream estimate of the breakpoint is slightly higher thanthe downstream estimate of the breakpoint and the differ-ence in position can be seen as the likely span of the target siteduplication However sometimes we were not able to fullyreconstruct a contig but instead generated two contigs onecontaining uniquely aligning sequence and the other contain-ing TE sequence These contigs are known to belong togetherbecause reads forming one contig have their mates in theother contig In these cases we have an approximate but notprecise estimate of the TE breakpoint and are unable toidentify the target site duplication

After the first round of detection we performed a secondround of detection where we examined every location wherea novel TE was detected in any line in all other lines Thisincluded examining positions in one data set that were iden-tified in the other data set In addition we examined thelocations of all known TEs in the D melanogaster genomeFor each location queried in each line we extracted reads

aligning to that location and their mates and used Phrap toreconstruct a contig This allowed us to produce presenceabsence or missing data calls for all positions where TE wasdetected in all lines The absence of a TE is determined byidentifying reconstructed contigs that spanned the insertionposition of the TE by at least 15 bp on either side Theseabsence calls were not included in Mackay et al (2012)This also allows us to identify TEs that may be in regions ofthe genome with lower average sequence coverage comparedwith the rest of the genome provided that insertion is sharedwith another line where it was possible to detect the insertion

Transposable Element Annotation

Annotation of novel TEs was performed by aligning recon-structed contigs with BlastN to the set of TEs found in theD melanogaster reference We chose the best match basedon overall length and BLAST reported e-value and called thenovel TE a TE of that family We also required a minimum of80 bp of contig matching a TE to make an annotation callThis is an additional annotation analysis to the one we per-formed in Mackay et al (2012) For each of the two contigsproduced for each insertion we determined the best matchand then compared these to each other TEs are classified asper the classification scheme put forth in Wicker et al (2007)We annotated each TE insertion to the highest level of agree-ment between the two contigs if the two contigs disagreedwith each other at every level of classification we called thatelement ldquoundeterminedrdquo This could result from nested in-sertions of elements where the 50 end of the insertion is truly adifferent element from the 30 end or because the TE sequenceis degraded and differs substantially from the referencesequence or because there was not enough TE sequence tomeet our annotation guidelines Because we can only recon-struct a maximum of few hundred bases at either end of theinsertion it is difficult to distinguish between these issues Wealso did not attempt to determine the strand of the TE

Validation

A subset of 190 TE insertion sites shared with the D melano-gaster reference were validated via PCR in a subset of nineof the DGRP lines by Blumenstiel JP Chen X He M BergmanCM (unpublished data httparxivorgabs12093456 lastaccessed June 26 2013) To independently check IBD callsusing SNP data we also determined if pairs of lines wereIBD for TEs in the same regions In this comparison we onlyincluded TE insertions where both lines have a positive pres-ence or absence call This can also serve as a validation of ourTE calling pipeline since if two segments are IBD they shouldshare the same set of TEs

Simulation

We selected a set of elements of different sizes and of TEclasses from the set of TEs in the reference sequence andinserted them semi-randomly into the D melanogaster refer-ence genome in silico Insertions were semi-random since weavoided areas already occupied by reference TEs and the cen-tromeric and telomeric regions Read pairs were then

2323

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

generated to construct fastq files We then ran our TE detec-tion pipeline as above through the initial annotation stepThis process was repeated three times to generate a simula-tion of 15 25 and 50 coverage

Recombination Rates

We wrote an R routine to estimate local recombination ratesas a function of physical position Briefly we used ldquocyto-ge-netic-seqtsvrdquo from wwwflybaseorg (last accessed January 302009) to assign a set of physical landmarks to genetic posi-tions For each chromosome arm we then fit a local polyno-mial relating cumulative cM to cumulative basepair (usinglocpoly) Unlike many prior local polynomial fits we madethe added assumption that this polynomial must be a mono-tonically increasing function (using monoproc) an assump-tion that is true for real data We then fit a smooth spline tothe above curve (using smoothspline) The advantage ofsmoothspline is that the resulting curves can be differentiated(and hence estimates of rates obtained directly) Finally we fita smooth spline to the first derivatives obtained from theprior smooth spline largely to make the recombinationrates less noisy near centromeres and telomeres TheR code is freely available at wwwmolpopgenorgData (lastaccessed August 1 2013)

Data Set Restriction

Analyses performed on these data sets were restricted in twoways First we restricted our analyses to regions of thegenome with low TE density and moderate recombinationrates We restricted our data set in this way because regions ofhigh TE density make it more difficult to make accurate callsBecause our pipeline requires unique reads to positionallylocate the TE insertion regions with high TE density canlack enough unique sequence to generate these reads Theregions included in the analyses are X300000ndash208000002L200000ndash20100000 2R2300000ndash21000000 3L100000ndash21900000 3R600000ndash27800000 which are not centromericand telomeric regions This excludes the majority of TEs an-notated in the reference genome since these TEs are largelyclustered together in centromeric regions However this doesnot affect our analysis since we are interested in TEs in theeuchromatic portions of the genome

Second to restrict the set of reference TEs studied to full-length copies we removed from the analysis all reference TEsthat had a length that was less than 75 of the canonicalannotated length of the TE While these elements may stillproduce phenotypic effects these elements are unlikely to beactive copies This reduced the number of reference TEs in-cluded in the analysis to 607 from 1085 TEs that are present inthe reference genome in the regions included in our analysis

Additionally some analyses were performed on the subsetof 23 lines in the DGRP data set which had an average se-quencing coverage of 25 or higher hereafter referred to asDGRP25 This restriction was to mitigate the effect of differ-ences in coverage between the two data sets Simulations oftransposable element detection via our pipeline data not

shown indicate that TE detection at 25 is comparable todetection at 50

Site Count Spectra

We generated derived allele count spectra for TEs in the DSPRdata set and the DGRP25 set of lines We included only siteswhere we were able to make a positive presence or absencecall for each line in the data set so that the number of lineswas normalized for each resource for this analysis Observedvalues were compared with expected values at counts 1 2 34 5 6 and 7 under an infinite sites model via a 2 testCategories 7 and higher were grouped due to low counts insome of these categories We also calculated count spectra forSNPs present in small introns (introns86 bp [Haddrill et al2005]) a group unlikely to have functional effects We alsoused SoFoS a site frequency rescaler (Hufford et al 2012) torescale the DSPR data set to the sample size of the DGRP25from 15 to 23 samples to see if the rescaled data show adifferent distribution of element counts from the observedDGRP25 data We then did a KolmogorovndashSmirnov test tocompare the two distributions as a test of different selectiveeffects acting on the populations

Genomic Context of TEs

Transposable elements were identified in regions of low(lt2 cMMb) or high recombination (2 cMMb) FASTAfiles containing the annotated coordinates of 50 UTRs30 UTRs intronic intergenic and full gene span regions forversion 513 of the D melanogaster reference were down-loaded from flybase (wwwflybaseorg last accessed January30 2009) Version 513 of the reference annotation was usedfor comparative purposes with the DGRP analysis (Mackayet al 2012) Because alternative transcripts can result in thesame genomic position being classified in multiple categorieswe labeled each TE as every appropriate category This oc-curred with 52 of TEs in the DGRP and 60 of TEs in theDSPR When calculating the percentage of the referencegenome in our restricted areas of analysis that were classifiedas each genomic type we followed the same strategy

We compared our observations of TE insertions to a bino-mial distribution to determine whether TEs were distributedrandomly throughout the genome We used the pbinomfunction in R (R Development Core Team 2008) setting thenumber of trials to the total number of TE insertion locationsin the data set and the probability of success to the propor-tion of the genome defined as each context (intron exon30 UTR 50 UTR or intergenic) We did one tailed tests in eachcase For exon 30 UTR and 50 UTR tests we calculated theprobability of observing the observed value or fewer TEs Forintron and intergenic tests we calculated the probability ofobserving the observed value or more TEs

Variables Affecting TE Density

For the DGRP25 and DSPR data sets we calculatedTE density in terms of TEsMb For this analysis we alsoexcluded chromosome 4 We examined the effects of recom-bination rate chromosome and TE element family on TE

2324

Cridland et al doi101093molbevmst129 MBE

density using the standard linear model in R We usedthe lm() function to define our model as followsTEsMb ~ Line + Recombination Rate + Chromosome +Recombination Rate Chromosome We performed thisanalysis both for the set of all TEs and also for 15 TE familieswith moderate to high number of copies in our data set(supplementary table S4 Supplementary Material online)We selected individual TE families for this additional analysisbased on their level of abundance in the data sets and theirhaving been previously studied by Montgomery et al (1987)and Bartolome et al (2002)

Gene Families

We downloaded the Dfam database from (httpwwwindianaedu~hahnlabflyDfamDBdrosophila_frbhtml last acc-essed September 1 2012) This database describes all mem-bers of all gene families clustered using a fuzzy reciprocalBLAST method as described in Hahn et al (2007) for 12Drosophila species We used R to perform a linear regressionusing a simple model where the square root (TE density) in agene is the random variable and log10 (gene family size) is theindependent variable

Variation between Resources

For each TE present in both data sets we compared thefrequency of the TE in each population using Fisherrsquos exacttest We excluded from this analysis all insertions where wewere unable to make a presenceabsence call in at least 100 ofthe DGRP lines or 13 of the DSPR lines For each gene with TEinsertions in both data sets we compared the frequency atwhich the gene contains TEs between data sets using aFisherrsquos exact test

Supplementary MaterialSupplementary tables S1ndashS6 and figures S1 and S2 are avail-able at Molecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Casey Bergman and Miaomiao He for earlyaccess to their data as well as helpful comments from CBThey also thank Rebekah Rogers for help with editing themanuscript This work was supported by the NationalInstitute of Health grant NIH R01-GM085183 to KRT theNational Library of Medicine - National Institute of Healthtraining grant LM007443 to JMC and the National Instituteof Health grant NIH R01-OD010974 to ADL and SJM

ReferencesAltschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic local

alignment search tool J Mol Biol 215403ndash410Aminetzach YT Macpherson JM Petrov DA 2005 Pesticide resistance

via transposition-mediated adaptive gene truncation in DrosophilaScience 309764ndash767

Bartolome C Maside X Charlesworth B 2002 On the abundance anddistribution of transposable elements in the genome of Drosophilamelanogaster Mol Biol Evol 19926ndash937

Bennetzen JL 2000 Transposable element contributions to plant geneand genome evolution Plant Mol Biol 42251ndash269

Bergman CM Bensasson D 2007 Recent LTR retrotransposon insertioncontrasts with waves of non-LTR insertion since speciation inDrosophila melanogaster Proc Natl Acad Sci U S A 10411340ndash11345

Biemont C Vieira C 2006 Junk DNA as an evolutionary force Nature443521ndash524

Birchler JA Hiebert JC 1989 Interaction of the Enhancer of white-apricotwith transposable element alleles at the white locus in Drosophilamelanogaster Genetics 122129ndash138

Birchler JA Hiebert JC Rabinow L 1989 Interaction of the mottler ofwhite with transposable element alleles at the white locus inDrosophila melanogaster Genes Dev 373ndash84

Bowen NJ McDonald JF 2001 Drosophila euchromatic LTR retrotran-sposons are much younger than the host species in which theyreside Genome Res 111527ndash1540

Carr M Soloway JR Robinson TE Brookfield JF 2002 Mechanisms reg-ulating copy numbers of six LTR retrotransposons in the genome ofDrosophila melanogaster Chromosoma 110511ndash518

Charlesworth B 1994 The effect of background selection against dele-terious mutations on weakly selected linked variants Genet Res 63213ndash227

Charlesworth B Jarne P Assimacopoulos S 1994 The distribution oftransposable elements within and between chromosomes in a pop-ulation of Drosophila melanogaster III Element abundances in het-erochromatin Genet Res 64183ndash197

Charlesworth B Langley CH 1989 The population genetics ofDrosophila transposable elements Annu Rev Genet 23251ndash287

Charlesworth B Lapid A 1989 A study of ten families of transposableelements on X chromosomes from a population of Drosophila mel-anogaster Genet Res 54113ndash125

Charlesworth B Lapid A Canada D 1992 The distribution of transpos-able elements within and between chromosomes in a population ofDrosophila melanogaster I Element frequencies and distributionGenet Res 60103ndash114

Chen GK Marjoram P Wall JD 2009 Fast and flexible simulation ofDNA sequence data Genome Res 19136ndash142

Chung H Bogwitz MR McCart C Andrianopoulos A Ffrench-ConstantRH Batterham P Daborn PJ 2007 Cis-regulatory elements in theaccord retrotransposon result in tissue-specific expression of theDrosophila melanogaster insecticide resistance gene Cyp6g1Genetics 1751071ndash1077

Cutter AD Good JM Pappas CT Saunders MA Starrett DM Wheeler TJ2005 Transposable element orientation bias in the Drosophila mel-anogaster genome J Mol Evol 6733ndash741

Daborn P Boundy S Yen J Pittendrigh B Ffrench-Constant R 2001DDT resistance in Drosophila correlates with Cyp6g1 over-expressionand confers cross-resistance to the neonicotinoid imidacloprid MolGenet Genomics 266556ndash563

Deloger M Cavalli FMG Lerat E Biemont C Sagot M-F Vieira C 2009Identification of expressed transposable element insertionsin the sequenced genome of Drosophila melanogaster Gene 43955ndash62

Dworkin I Palsson A Gibson G 2005 Replication of an Egfr-wing shapeassociation in a wild-caught cohort of Drosophila melanogasterGenetics 1692115ndash2125

Ewing B Green P 1998 Base-calling of automated sequencer tracesusing Phred II Error probabilities Genome Res 8186ndash194

Gruber JD Genissel A Macdonald SJ Long AD 2007 How repeatable areassociations between polymorphisms in achaete-scute and bristlenumber variation in Drosophila Genetics 1751987ndash1997

Haddrill PR Charlesworth B Halligan DL Andolfatto P 2005 Patterns ofintron sequence evolution in Drosophila are dependent upon lengthand GC content Genome Biol 6R67

Hahn MW Han MV Han S-G 2007 Gene family evolution across 12Drosophila genomes PLoS Genet 3e197

Hirschhorn JN Daly MJ 2005 Genome-wide association stud-ies for common diseases and complex traits Nat Rev Genet 695ndash108

2325

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 6: Abundance and Distribution of Transposable Elements in Two ...

Comparisons between the observed data and the expectednumber of events at the same count under an infinite sitesmodel of an equilibrium Wright-Fisher population (Wakeley2009 p 54ndash56) also indicated an excess of rare alleles(DGRP25 2 test Pamp 0 df = 6 DSPR 2 test Pamp 0df = 6) In addition to overall site count spectra we also gen-erated spectra for the X autosomes regions of high and lowrecombination DNA elements and RNA elements In eachcase we observed the same pattern as seen in the overallspectrum (data not shown) We also compared the distribu-tion of counts in the DGRP25 with the rescaled distribution ofcounts in the DSPR rescaled to the same sample size as theDGRP25 A KolmogorovndashSmirnov test comparing these twodistributions showed that they did not differ (P = 021) Wedid see more overall TEs in every count category in therescaled DSPR data over the DGRP25 data set

Patterns of Transposable Element Insertions betweenGenes

For association studies genotyping TE insertions withincoding regions is of interest due to the potentially large

effect of an insertion on gene function A total of 2931genes (213 of annotated genes) had insertions in theDGRP resource These are insertions which exist anywherewithin the span of the gene from 50 UTR to 30 UTR Onethousand nine hundred fifteen genes 139 had insertions inthe DSPR with 1317 genes having insertions in bothresources The number of TE insertions within genes variedsubstantially between individual genes up to a maximum of78 insertions though the majority of genes with insertionshad only one insertion 631 in the DSPR and 527 in theDGRP The number and locations of individual insertions in agene varied tremendously though the majority of insertionsthat fell within gene regions were in introns 777 DSPR and824 DGRP TE number was strongly correlated with intronsize (Pearsonrsquos correlation coefficient = 079 for DSPR 087for DGRP Pamp 0 for both tests) (supplementary fig S2Supplementary Material online)

Examining the pattern of insertions in individual genes inthe DSPR and DGRP data sets we found that RNA-bindingprotein 6 had the highest number of insertions 62 in theDGRP lines and 21 in the DSPR lines These insertions were

Table 3 ANOVAs for DSRP and DGRP25 Coverage and Comparison between the Two Data Sets

df Sum Sq Mean Sq F value Pr(gtF)

DSPR vs DGRP25

Set 1 54125 54125 6486975 lt2e16

Line 36 2848 791 94817 lt2e16

Chromosome 4 11858 2965 355313 lt2e16

Recombination rate (high vs low) 1 25575 25575 3065207 lt2e16

Chromosomerecombination rate 4 10335 2584 309663 lt2e16

Residuals 333 27784 083

DSPR

Line 14 26599 19 17894 004699

Chromosome 4 39632 9908 93316 123e06

Recombination rate (high vs low) 1 141337 141337 1331143 lt2e16

Chromosomerecombination rate 4 39712 9928 93504 119e06

Residuals 126 133783 1062

DGRP25

Line 22 258202 11736 1723 lt2e16

Chromosome 4 80126 20031 29409 lt2e16

Recombination rate (high vs low) 1 120011 120011 176191 lt2e16

Chromosomerecombination rate 4 66056 16514 24244 243e16

Residuals 198 134866 0681

Table 2 Differences between Observed and Expected TE Counts

DGRP DGRP25 DSPRObserved vs Expected P Value Observed vs Expected P Value Observed vs Expected P Value

X recombination 2 cMMb Decrease 329E11 Decrease 187E03 Decrease 522E02

X recombinationgt 2 cMMb Decrease 174E01 Increase 701E02 Increase 416E01

Autosomes recombination 2 cMMb Decrease 370E01 Increase 100E + 00 Increase 687E04

Autosomes recombinationgt 2 cMMb Increase 392E02 Decrease 100E + 00 Decrease 100E + 00

Exon Decrease 000E + 00 Decrease 247E323 Decrease 103E283

Intron Increase 354E103 Increase 000E + 00 Increase 000E + 00

30 UTR Decrease 699E16 Decrease 313E06 Decrease 126E02

50 UTR Decrease 143E51 Decrease 282E18 Decrease 498E20

Intergenic Increase 112E32 Increase 0 Increase 000E + 00

2316

Cridland et al doi101093molbevmst129 MBE

typically unique to a single line though there were 3 insertionsin more than one line in the DSPR and 14 insertions in morethan one line in the DGRP

Some genes accumulated many TE insertions such asklarsicht (fig 4A gene images from the UCSC genome brow-ser Meyer et al 2013) which had 29 total insertions This genealso had a hotspot of insertions with seven independent in-sertions each at low frequency located within 36 kb of each

other Only one of these insertions was present in any givenline and these could be functionally equivalent though inde-pendently arising mutations Other genes had only a few in-sertions such as Notch (fig 4B) and Delta (fig 4C) Theinsertions in these genes were also at low frequency andthe few present were located in intronic regions

While most insertions exist in only one panel some werefound in both Cyp6a20 (fig 4D) contains a high frequencynon-reference TE insertions present in both data sets in 4lines out of 121 lines where we were able to make a presenceor absence call in the DGRP (hereafter shown as 4121) and314 lines in the DSPR Both Cyp6a20 and klarsicht are exam-ples of genes with TE insertions in exons which was

00

02

04

06

08

10

1 2 3 4 5 6 7+

Observed TEsObserved SNPsExpected

DSPR

00

02

04

06

08

10

1 2 3 4 5 6 7+

Observed TEsObserved SNPsExpected

DGRP

FIG 3 Derived allele count spectra for the DSRP lines and the DGRP25lines where a positive presence or absence call was made for eachinsertion in each line 6613 insertions in the DSPR and 3274 in theDGRP25 Count spectra for SNPs is from SNPs in introns 86 bp 2

tests between observed and expected distributions result in Pamp 0 forcomparisons between TEs and the neutral model as well as between TEsand SNPs for both data sets

Table 5 TE Density for 15 Individual Families of Elements

Mean Density (TEMb)

Element Resource X High X Low Auto High Auto Low

roo DGRP25 039 037 032 033DSPR 080 088 065 068

297 DGRP25 005 000 002 001DSPR 005 000 004 000

412 DGRP25 003 001 003 007DSPR 006 008 009 012

F DGRP25 002 006 004 006DSPR 007 008 012 013

176 DGRP25 001 000 001 002DSPR 002 000 002 006

Bari1 DGRP25 001 000 004 001DSPR 002 002 006 002

copia DGRP25 001 000 002 003DSPR 009 002 014 019

H DGRP25 001 000 001 000DSPR 000 000 001 000

hopper DGRP25 010 001 000 000DSPR 010 004 002 002

INE-1 DGRP25 042 188 003 048DSPR 045 203 004 055

jockey DGRP25 015 012 019 013DSPR 035 034 040 033

mdg1 DGRP25 003 002 003 010DSPR 008 004 012 018

pogo DGRP25 006 004 007 005DSPR 013 021 015 013

springer DGRP25 000 001 001 003DSPR 002 000 001 005

Table 4 TE Density in the X and Autosomes

TEMb

DGRP25 DSPR Both

X all 382 638 483

Autosomes all 351 593 447

X high recombination 371 600 461

X low recombination 393 676 505

2L high recombination 227 473 324

2L low recombination 323 589 428

2R high recombination 282 489 364

2R low recombination 617 863 714

3L high recombination 296 523 386

3L low recombination 474 726 573

3R high recombination 281 468 355

3R low recombination 372 670 490

2317

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

uncommon both in this data set and in previous studies(Kaminker et al 2002 Lipatov et al 2005)

Differences in the strength of selection against closelyrelated genes can also be illustrated by patterns of TE inser-tions The paralogs derailed (fig 4F) and derailed-2 (fig 4E)showed very different patterns of TE insertions though thesame pattern was seen in both resources derailed-2 locatedon chromosome 2R is in a region of moderate recombination199 cMMb whereas derailed is located towards the distalend of 2L in a region of low recombination 044 cMMb

Given the context of the recombination rates the expectationwould be that derailed would have a higher TE load sincedeleterious alleles are removed more efficiently in regions ofhigh recombination (Hudson and Kaplan 1995) but theopposite pattern was observed

While both genes play a role in Wnt5 signaling mutationsin derailed can cause major phenotypic changes in Drosophilanervous system development resulting in the loss of normalfunction (Yoshikawa et al 2003) derailed-2 when mutatedproduces only minor differences in neuron positioning

115 115115 215 115115

114 1122

11181120 11181124

2117

21182119

1122

1119 1118

2118

1121 3114

1115

1114

1119

1113

1119

1120

1120

1119

1120

3119

1119

A klarsichtDGRP

DSPR

Insertion in an Exon

314 1195

4121D Cyp6a20

315

1123

1415

56125

115115

415

115615

10124

1124 1124 1123

2124E derailed-2

B Notch

115 11211122

C Delta1114 1120

191

F derailed

Same insertion in both resources

FIG 4 Transposable element insertions in genes The frequency above each insertion is the number of lines in which the element is present over thenumber of lines in which the element is validated as either present or absent Gene images are from the UCSC genome browser (httpgenomeucscedu last accessed June 31 2012)

2318

Cridland et al doi101093molbevmst129 MBE

(Sakurai et al 2009) This suggests that either the strength ofselection may be much higher against insertions in derailed orthe insertion rates are unequal between the paralogs

Gene Family Size and TEs

We used available information on gene families in DrosophilaHahn et al (2007) grouped into families based on sequencesimilarity and found a very weak but significant effect of genefamily size on the average number of TE insertions per kbin both resources (P value = 00003 R2 = 0001 DSPRP value = 0002 R2 = 0007 DGRP) (fig 5A and B) Althoughthere was a large variance in the number of insertions per kb

in both resources (fig 5A and B) the average number ofTEskb increased as family size increased up to a moderatelylarge family size

Patterns of Variation between Resources

To replicate an association with a single causative allele thatallele must both exist in both panels and also be at highenough frequency to detect the association Given thatmost TEs were singletons and only 45 of insertion eventswere identified in both panels and were within genes thereplicability of associations with these alleles is likely to below if alleles are considered individually We examined sharedinsertions segregating at different frequencies between panelsand found only two TE insertions in genes that were segre-gating at different frequencies in the DGRP versus DSPRresources following Bonferroni correction (Fisherrsquos exacttests P 000007) One of these was in an exon of bothCG13175 and CG33964 a pair of overlapping genes Herethe TE insertion was at much higher frequency in theDGRP than in the DSPR (89122 vs 215) The other insertionwas in an intron of Caliban and the frequency of this insertionwas higher in the DSPR than the DGRP (1115 vs 7116)

If insertions are considered as a class of mutation thenthere may be better power to detect an association assumingthat different insertions contribute in the same way to thephenotype We also examined genes with TE insertions(139 of genes) though not necessarily the same insertionin both resources Treating all insertion alleles within a gene asequivalent a comparison between resources showed thatthere were 21 genes which contain TEs at different frequen-cies between the DGRP versus DSPR following Bonferronicorrection (Fisherrsquos exact tests P 0000038197) includingderailed-2 (supplementary table S6 Supplementary Materialonline)

Discussion

Distribution and Abundance of TransposableElements

Previous work on transposable elements has generally beenfocused on a few families of TEs (Montgomery et al 1987Charlesworth and Lapid 1989 Charlesworth et al 1992) orfocused only on a single genome (Bartolome et al 2002Kaminker et al 2002 Rizzon et al 2002) We have presenteda genomic analysis of TEs across a large number of lines andtwo data sets We found that transposable elements in bothresources are primarily found as singletons with a few mod-erate to high frequency insertions Transposable elements alsosegregated at lower allele frequencies than do presumed non-functional SNPs drawn from the same data set suggestingthat negative selection is acting upon TEs However theremay be substantial variation in the age of transposable ele-ments in different families many LTR elements are thought tobe young insertions (Bowen and McDonald 2001 Bergmanand Bensasson 2007) while other families like INE-1 arethought to be quite old (Kapitonov and Jurka 2003) Thismay create a situation where the young age of insertions isresponsible for the excess of rare alleles We found a

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

DSPR

DGRPA

B

FIG 5 log10(TE density) versus log10(Gene family size)

2319

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

non-random distribution of TE insertions across the genomewith increases of TEs in both the intergenic and intronicregions of the genome and decreases in exonic and UTRregions (Kaminker et al 2002 Lipatov et al 2005) Similar toprevious studies we found variation in TE density betweenchromosomes and higher densities of TEs in regions of lowrecombination (Charlesworth et al 1994 Bartolome et al2002) We also found increased TE densities on the X relativeto the autosomes both in overall insertions and in someindividual TE families (table 5)

Detection method and the size of the data set are likely tobe contributing factors to differences between previous workand this study Many previous studies have used in situ meth-ods to ascertain TE number in samples and have looked atelement accumulation at the base versus the midsections ofchromosomes (Montgomery et al 1987 Charlesworth et al1992) or have looked only at one genome (Bartolome et al2002 Rizzon et al 2002) A previous PCR-based detectionstudy found no difference in TE frequency between the Xand the autosomes in natural populations (Petrov et al2011) but this study was biased in that it only looked atTEs shared with the reference sequence and thus did notcapture the full picture of TEs in these populations

While our analysis does disagree with previous work thesedisagreements may be mainly due to the way in which regionsof the genome are divided for examination Rizzon et al(2002) found an increase in copies of LTR retrotransposonson the X when they excluded pericentromeric telomeric andchromosome 4 from their analysis though they found a def-icit of TEs on the X when these regions are included Thisfinding is most directly comparable to our own finding for theentire TE data set since they exclude a very similar set ofgenomic regions We do agree with previous work in ourfinding of higher densities of TEs in regions of low recombi-nation than in high recombination (Charlesworth et al 1994Bartolome et al 2002)

Examining individual element families Carr et al (2002)found an increase of mdg-3 elements and 297 elements on theX than would be expected under a random insertion modelMontgomery et al (1987) found no evidence for a reductionin TEs on the X for both roo and 297 elements but foundfewer TEs on the X when looking at the 412 family HoweverCharlesworth et al (1992) observed the opposite pattern forroo and 297 Montgomery et al (1991) also found that moreroo elements in lab kept lines than they had previously ob-served in natural populations We found significantly in-creased densities on the X for roo and significantlyincreased densities on the autosomes for 412

Differences between previous studies of individual elementfamilies and our results may be due to a variety of reasonsFirst we were able to annotate only about two-thirds of ourTE insertions This was primarily due to having insufficientsequence to reliably annotate the element or because inde-pendent annotation using sequence reconstructed fromeither end of the insertion did not yield the same elementfamily annotation This means that our density calculationsfor individual elements may be skewed by missing annotationinformation Second many previous studies (Montgomery

et al 1987 Charlesworth et al 1992 Carr et al 2002) examinedonly LTR elements These elements are young (Bowen andMcDonald 2001 Bergman and Bensasson 2007) and thus in-dividual copies may not yet have been removed from thegenome thereby displaying different patterns of insertionthan older active elements

What is clear is that there is great variability in TE densityboth within and between element families as well as withinand between chromosomes It may also be that the overallpattern is driven by a small group of families In our data setINE-1 shows the largest pattern of difference between the Xand autosomes with an order of magnitude increase in meanTE density per line on the X over the autosomes possibly dueto an increased likelihood of fixation of insertions (Omettoet al 2005) specifically on the X chromosome (Presgraves2006) These INE-1 elements are a subset of the largest TEfamily in D melanogaster though only several dozen of themare in the euchromatin INE-1 elements are also considered tobe inactive in D melanogaster and likely have not been mobilefor gt3 million years (Kapitonov and Jurka 2003) Consistentwith this most of the INE-1 elements we detected were fixedor nearly fixed in our data set We did see some examples oflow frequency insertions which may be due to more recenttranspositional events or incorrect TE annotation due toshort sequences The old age of INE-1 insertions also suggeststhat the extant elements are unlikely to be selected againstAn increased density of TEs on the X compared with theautosomes suggests that it is not the fitness effects of inser-tions which govern TE copy number Fitness effects wouldresult in the more rapid removal of insertions on the X thaninsertions on the autosomes due to the increased selectivepressures experienced on the X (Montgomery et al 1987)Background selection also predicts that the X should be lessburdened by deleterious recessives since it is hemizygous onethird of the time (Charlesworth 1994)

However it is unclear whether our observation of in-creased densities of TEs on the X relative to the autosomesfits with the ectopic exchange model This model predictsthat TEs will accumulate in regions of low ectopic exchange(Langley et al 1988) The general assumption in the literatureis that ectopic exchange is positively correlated with recom-bination rate (Langley et al 1988 Montgomery et al 1991)The X chromosome experiences higher levels of recombina-tion than the autosomes with an average of 36 cMMb forthe X in regions of high recombination (cMMbgt 2) whereasthe autosomes have a mean of 336 cMMb If recombinationrate and ectopic exchange are positively correlated through-out the genome then the expectation would be fewer TEs inthe middle regions of the X where recombination rates areknown to be high This is the opposite of what we observe Ifrecombination rate and the rate of ectopic exchange are notpositively correlated then the ectopic exchange model maybe correct but data are lacking to properly address this issueCurrently there are no direct measurements of the rate ofectopic exchange on a genome-wide scale in Drosophiladuring meiosis Montgomery et al (1991) observed littleectopic exchange between homozygous chromosomes inDrosophila melanogaster on the X chromosome in the

2320

Cridland et al doi101093molbevmst129 MBE

region around the white locus They also found that whilerecombination rate was reduced in both centromeric andtelomeric regions only the centromeric region displayed anincreased density of TEs This information is suggestive ofdifferences between these regions in the rate of ectopicexchange

The DGRP and DSPR are also composed of inbred lineswhich have been kept in laboratory conditions for some time~9 years in the case of the DGRP lines and 40+ years in thecase of the DSPR Given that TE insertion mutations accumu-late rapidly and in an increasing manner over time (Nuzhdinet al 1997) the increased density of TEs on the X versus theautosomes may not be surprising if there is a highertransposition rate to the X The only direct study to datefocused on roo elements in D melanogaster and did observehigher rates of insertion in the X than the autosomes(Vazquez et al 2007)

Transposable Elements in Genes

The majority of genes 813 do not have any TE insertions inany line studied and intron length is a good predictor of thenumber of TEs within a gene However a few genes accumu-late many insertions even after correcting for gene size andintron length and some genes have very few TEs even thoughthey have large introns The reasons why different genes mayaccumulate TE insertions may be quite different for individualgenes In some cases there may be hot spots for insertionswhich if insertions are neutral or only slightly deleterious mayresult in large number of insertions in a small region In othergenes the lack of TE accumulation may suggest selectivepressures against TE accumulation

Gene family size may also play a role in TE accumulationWe see an increase in TE density correlated with an increase ingene family size This may indicate differences in selectivepressures for genes in different sized families Single-copygenes may experience more constraint whereas in moder-ately sized gene families these constraints may be more re-laxed perhaps due to redundancy

In the cytochrome p450 functional family of genes we findthat 3789 of these genes have TE insertions 2589 in theDGRP and 2089 in the DSPR Cytochrome p450 is also foundto be in an enriched term in the DSPR for the set of genescontaining TEs The cytochrome p450 gene Cyp6g1 is knownto confer resistance to DDT (Daborn et al 2001) and TE-mediated copy number variation is associated with increasingresistance to DDT (Schmidt et al 2010) This pattern of TEinsertions in the DSPR may be reflective of the locationswhere DSPR lines were originally collected The DSPR lineswere gathered from various locations worldwide at a timewhen different pesticides were in use The DGRP lines weregathered from a single geographic area in North America in2003 Presumably these sets of lines have experienced differ-ent selective pressures with respect to pesticide resistance InCyp6a20 (fig 4D) where TEs are segregating at different fre-quencies in the two resources some similar effect may beoccurring

TE Detection Pipelines

The TE detection strategy used here is an improvement uponthe strategy used in Mackay et al (2012) The most notabledifferences between the data reported in Mackay et al (2012)and the DGRP data reported here is substantial amount ofnew information First our pipeline improvements agree wellwith the previous implementation with 812 of previous TEpresence calls identified by the new pipeline However thesecalls represent on 182 of the presence calls made by theimproved pipeline demonstrating that we have identifiedsignificantly more calls than before The incorporation of TEabsence calls also allows us to provide a much more completepicture of TEs in euchromatic sequence in the DGRP

The TE detection pipeline used here is similar to otherpipelines which have been employed but it is useful to de-scribe briefly here the differences in approaches and why wefollowed the approach presented here A useful discussion ofcommon approaches to TE detection in many species can befound in Xing et al (2013) One common method for TEdetection is the split-read mapping approach which hasbeen employed in Drosophila by Linheiro and Bergman(2012) This approach identifies TE insertions by locating in-dividual sequence reads which span TE insertion breakpointsThis approach is the strategy of choice for next-generationsequencing experiments which produce only single-endreads However in a paired-end read situation this approachmay not take full advantage of the available data Specificallythe size of window around the TE insertion breakpoint whereinformative reads can be found is governed by read length in asingle-end experiment and fragment length in a paired-endexperiment Linheiro and Bergman (2012) discuss the impor-tance of read length on TE detection and suggest that in-creased read length improves TE detection In situationswhere read length is long and coverage is deep enough thesplit-read method may be preferable to the paired-end de-tection strategy However for the data sets analyzed herewhere read length is much shorter than fragment lengththe paired-end detection strategy will capture moreinformation

The pipeline used here also has similarities to the pipelineused in Kofler et al (2012) However there are two majordifferences in these studies First Kofler et al (2012) do notattempt to reconstruct their insertion breakpoints andsecond the population studied by Kofler et al (2012) wassequenced as a pool rather than as individual lines Becausetheir pooled sequence data reflect on average less than 1sequence coverage per line Kofler et al (2012) are primarilyable to detect intermediate to high frequency TE insertionstheir own analysis concludes that they cannot reliably detectinsertions at frequency lt7 (see supplementary fig S2Supplementary Material online in their supplementary dis-cussion) However the literature to date (Charlesworth andLangley 1989 Charlesworth and Lapid 1989 Charlesworthet al 1992) as well as the results of this study indicate thatthe majority of TEs in a population are found at very lowfrequency often in only one individual in a population andtherefore beyond the scope of detection of Kofler et al (2012)

2321

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Caveats

The D melanogaster reference genome to which we alignedour data is a P-element-free genome Therefore the use of thisreference will be biased against detecting any P elementsexisting in the data sets In the case of the DSPR data setthis is not an issue since these lines are also all P-element-free(Macdonald and Long 2007) However it is likely that theDGRP lines do harbor P elements For the purpose of theseanalyses we decided to ignore P elements and focus on theset of element families common to the two resources

This situation does highlight the importance of having anappropriate reference sequence To detect P elements in thegenome of D melanogaster an artificially constructed refer-ence sequence or a different assembled sequence containingthe P element would need to be utilized This also brings upthe question about as yet undetected elements that exist atlow frequency If these elements do not have sufficient sim-ilarity to existing elements to be aligned to the referencegenome these elements will go undetected

Conclusions

We have presented a description of transposable elements intwo recently released QTL resources Transposable elementscan play important roles in gene function and regulation andincorporation of TE insertion information into any analysisperformed with either of these two resources will present amore complete picture of genomic variation and its contri-bution to complex traits In addition it will be important forassociation studies attempting to replicate genotypendashpheno-type associations utilizing these resources to be aware of thepotential contributions of TEs

We have also demonstrated a next-generation high-throughput sequencing analysis pipeline that is capable ofdetecting this type variation with a high rate of precisionand specificity We suggest that future studies using next-generation sequencing data utilize this technique

Materials and Methods

Data Sets

Two data sets were used in this analysis The first set consistsof 131 inbred lines from the Drosophila Genetic ReferencePanel (DGRP) (Mackay et al 2012) SRA accessionSRP000694 The second set consists of 15 lines used to es-tablish the Drosophila Synthetic Population Resource (DSPR)(King et al 2012a) initially described in Macdonald and Long(2007) SRA accession SRA051316 For the DGRP resourceboth 454 and Illumina data were available whereas onlyIllumina data are available for the DSPR We utilized onlythe Illumina sequence data to ensure consistency in datacollection across data sets We calculated a mean coverageof 21 for the DGRP lines and 50 for the DSPR linesVariation in coverage for the DGRP lines was substantial be-tween lines ranging from ~4 average coverage to nearly50 average coverage We therefore dropped from the anal-ysis four lines with an average coverage of 10 or less sincelow coverage can cause difficulties with assembly and TEidentification Sequence data for both data sets consisted of

paired-end data with 54 bp reads for the DSPR and paired-end 100 bp reads trimmed to 75 bp for DGRP

IBD in DGRP Data Set

We downloaded the SNP tables generated by the DGRPproject from httpwwwhgscbcmtmceduprojectsdgrpfreeze1_July_2010snp_callsIllumina (last accessed July 302010) to examine identity by descent in this data set Weperformed an all by all comparison between lines examiningsliding windows of 1 Mb across the genome with 100 kb stepsbetween windows We take a simple definition of IBD hereWhenever we identified two lines that weregt95 identity atSNP positions in a 1 Mb region we labeled the pair as IBD inthe identified region and masked that region of the genomein the line with lower coverage We also dropped entirelyfrom our analysis any lines where gt50 of the genome wasdetermined to be IBD with other lines in the population(supplementary table S1 Supplementary Material online) re-moving from the analysis the line with lower average se-quence coverage

We modeled 1 Mb regions of 200 chromosomes in a coa-lescent simulation using the parameters theta = 001site and4Nr = 10theta to determine the number of regions of95IBD under a Wright-Fisher population model using the MACSsoftware (Chen et al 2009) We simulated 1000 replicates ofthis simulation and tabulated the number of 1 Mb regionsshowing 95 IBD in each replicate

Alignment

We aligned reads to the D melanogaster reference genome(version 513 downloaded from FlyBase flybaseorg) using thealigner BWA (version 059 Li and Durbin 2009) We used thefollowing parameters (aln -t 8 -l 13 -m 50000000 -R 5000)followed by the command ldquosamperdquo to resolve paired endmappings (using parameters -a 5000 -N 5000 -n 500) It isimportant when detecting TEs that the -R during the initialalignment and the -N parameter during the ldquosamperdquo phase beset high otherwise highly repetitive sequences that would beinformative of the presence of a common element will beexcluded from the data by the aligner We also used the -Ioption for the DGRP lines but not the DSPR due to the dif-ferences in Illumina quality score output format as the twodata sets were sequenced at different times The -m parameterfrom the initial alignment step and the -n parameter from thepaired-end resolution step deal with how BWA treats multiplyaligning reads An additional caveat is that some TE familiesmost notably the P-element family is absent from the Dmelanogaster reference sequence Such families in the se-quenced genomes will not align to the reference genome weused even if they are present in the sampled lines

Transposable Element Detection

Transposable elements were detected by first identifying allreads that were aligned to an annotated TE in the referencegenome (supplementary fig S1 Supplementary Material on-line) We then identified the mates of these reads and selectedonly those mates which did not align to an annotated TE

2322

Cridland et al doi101093molbevmst129 MBE

These reads identify the set of read-pairs which span a TEinsertion The uniquely mapping non-TE reads were thenclustered based on their start position and chromosomalstrand to identify putative TEs Clusters on the plus strandwere matched to clusters on the minus strand to producepairs of clusters indicating either end of the putative TE Onceclusters were identified we filtered out all events which con-tained fewer than three read-pairs consistent with a giveninsertion for each cluster This means that putative eventsat this stage have a minimum of six read-pairs indicating theinsertion three on the plus strand and three on the minusstrand

We then extracted all reads and their mates that aligned tothe identified regions of the genome and attempted toreconstruct the local area including the breakpoint of theTE Because many TEs have repetitive regions at the ends ofthe element we reconstructed the two breakpoints of the TEseparately (see also supplementary fig 23 from Mackay et al2012) Local reconstruction was done by running Phrap (ver-sion 1090518 Ewing and Green 1998) with the followingparameters (-forcelevel 10 -minscore 10 -minmatch 10)This set of parameters causes Phrap to do a Smith-Waterman comparison when it finds matches of at least 10characters but then relaxes the minimum alignment scoreand the final assembly parameters This is necessary to recon-struct TEs because repetitive sequence within the TEs canresult in a failure to reconstruct a contig if stricter assemblyparameters are used All code we developed for the detectionof TEs will be freely available at wwwmolpopgenorgData(last accessed August 1 2013)

Following reconstruction we used BlastN (verson 2222) tocompare reconstructed contigs to the D melanogaster refer-ence to confirm the presence of a TE In most cases we wereable to fully reconstruct both breakpoints of the TE Fullyreconstructed indicates that we could reconstruct a singlecontig which contained the breakpoint of the TE In mostcases the two breakpoints of the TE did not resolve to thesame nucleotide but instead identify the location of thetarget site duplication for the insertion (Linheiro andBergman 2012 similar to fig 1) These are cases where theupstream estimate of the breakpoint is slightly higher thanthe downstream estimate of the breakpoint and the differ-ence in position can be seen as the likely span of the target siteduplication However sometimes we were not able to fullyreconstruct a contig but instead generated two contigs onecontaining uniquely aligning sequence and the other contain-ing TE sequence These contigs are known to belong togetherbecause reads forming one contig have their mates in theother contig In these cases we have an approximate but notprecise estimate of the TE breakpoint and are unable toidentify the target site duplication

After the first round of detection we performed a secondround of detection where we examined every location wherea novel TE was detected in any line in all other lines Thisincluded examining positions in one data set that were iden-tified in the other data set In addition we examined thelocations of all known TEs in the D melanogaster genomeFor each location queried in each line we extracted reads

aligning to that location and their mates and used Phrap toreconstruct a contig This allowed us to produce presenceabsence or missing data calls for all positions where TE wasdetected in all lines The absence of a TE is determined byidentifying reconstructed contigs that spanned the insertionposition of the TE by at least 15 bp on either side Theseabsence calls were not included in Mackay et al (2012)This also allows us to identify TEs that may be in regions ofthe genome with lower average sequence coverage comparedwith the rest of the genome provided that insertion is sharedwith another line where it was possible to detect the insertion

Transposable Element Annotation

Annotation of novel TEs was performed by aligning recon-structed contigs with BlastN to the set of TEs found in theD melanogaster reference We chose the best match basedon overall length and BLAST reported e-value and called thenovel TE a TE of that family We also required a minimum of80 bp of contig matching a TE to make an annotation callThis is an additional annotation analysis to the one we per-formed in Mackay et al (2012) For each of the two contigsproduced for each insertion we determined the best matchand then compared these to each other TEs are classified asper the classification scheme put forth in Wicker et al (2007)We annotated each TE insertion to the highest level of agree-ment between the two contigs if the two contigs disagreedwith each other at every level of classification we called thatelement ldquoundeterminedrdquo This could result from nested in-sertions of elements where the 50 end of the insertion is truly adifferent element from the 30 end or because the TE sequenceis degraded and differs substantially from the referencesequence or because there was not enough TE sequence tomeet our annotation guidelines Because we can only recon-struct a maximum of few hundred bases at either end of theinsertion it is difficult to distinguish between these issues Wealso did not attempt to determine the strand of the TE

Validation

A subset of 190 TE insertion sites shared with the D melano-gaster reference were validated via PCR in a subset of nineof the DGRP lines by Blumenstiel JP Chen X He M BergmanCM (unpublished data httparxivorgabs12093456 lastaccessed June 26 2013) To independently check IBD callsusing SNP data we also determined if pairs of lines wereIBD for TEs in the same regions In this comparison we onlyincluded TE insertions where both lines have a positive pres-ence or absence call This can also serve as a validation of ourTE calling pipeline since if two segments are IBD they shouldshare the same set of TEs

Simulation

We selected a set of elements of different sizes and of TEclasses from the set of TEs in the reference sequence andinserted them semi-randomly into the D melanogaster refer-ence genome in silico Insertions were semi-random since weavoided areas already occupied by reference TEs and the cen-tromeric and telomeric regions Read pairs were then

2323

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

generated to construct fastq files We then ran our TE detec-tion pipeline as above through the initial annotation stepThis process was repeated three times to generate a simula-tion of 15 25 and 50 coverage

Recombination Rates

We wrote an R routine to estimate local recombination ratesas a function of physical position Briefly we used ldquocyto-ge-netic-seqtsvrdquo from wwwflybaseorg (last accessed January 302009) to assign a set of physical landmarks to genetic posi-tions For each chromosome arm we then fit a local polyno-mial relating cumulative cM to cumulative basepair (usinglocpoly) Unlike many prior local polynomial fits we madethe added assumption that this polynomial must be a mono-tonically increasing function (using monoproc) an assump-tion that is true for real data We then fit a smooth spline tothe above curve (using smoothspline) The advantage ofsmoothspline is that the resulting curves can be differentiated(and hence estimates of rates obtained directly) Finally we fita smooth spline to the first derivatives obtained from theprior smooth spline largely to make the recombinationrates less noisy near centromeres and telomeres TheR code is freely available at wwwmolpopgenorgData (lastaccessed August 1 2013)

Data Set Restriction

Analyses performed on these data sets were restricted in twoways First we restricted our analyses to regions of thegenome with low TE density and moderate recombinationrates We restricted our data set in this way because regions ofhigh TE density make it more difficult to make accurate callsBecause our pipeline requires unique reads to positionallylocate the TE insertion regions with high TE density canlack enough unique sequence to generate these reads Theregions included in the analyses are X300000ndash208000002L200000ndash20100000 2R2300000ndash21000000 3L100000ndash21900000 3R600000ndash27800000 which are not centromericand telomeric regions This excludes the majority of TEs an-notated in the reference genome since these TEs are largelyclustered together in centromeric regions However this doesnot affect our analysis since we are interested in TEs in theeuchromatic portions of the genome

Second to restrict the set of reference TEs studied to full-length copies we removed from the analysis all reference TEsthat had a length that was less than 75 of the canonicalannotated length of the TE While these elements may stillproduce phenotypic effects these elements are unlikely to beactive copies This reduced the number of reference TEs in-cluded in the analysis to 607 from 1085 TEs that are present inthe reference genome in the regions included in our analysis

Additionally some analyses were performed on the subsetof 23 lines in the DGRP data set which had an average se-quencing coverage of 25 or higher hereafter referred to asDGRP25 This restriction was to mitigate the effect of differ-ences in coverage between the two data sets Simulations oftransposable element detection via our pipeline data not

shown indicate that TE detection at 25 is comparable todetection at 50

Site Count Spectra

We generated derived allele count spectra for TEs in the DSPRdata set and the DGRP25 set of lines We included only siteswhere we were able to make a positive presence or absencecall for each line in the data set so that the number of lineswas normalized for each resource for this analysis Observedvalues were compared with expected values at counts 1 2 34 5 6 and 7 under an infinite sites model via a 2 testCategories 7 and higher were grouped due to low counts insome of these categories We also calculated count spectra forSNPs present in small introns (introns86 bp [Haddrill et al2005]) a group unlikely to have functional effects We alsoused SoFoS a site frequency rescaler (Hufford et al 2012) torescale the DSPR data set to the sample size of the DGRP25from 15 to 23 samples to see if the rescaled data show adifferent distribution of element counts from the observedDGRP25 data We then did a KolmogorovndashSmirnov test tocompare the two distributions as a test of different selectiveeffects acting on the populations

Genomic Context of TEs

Transposable elements were identified in regions of low(lt2 cMMb) or high recombination (2 cMMb) FASTAfiles containing the annotated coordinates of 50 UTRs30 UTRs intronic intergenic and full gene span regions forversion 513 of the D melanogaster reference were down-loaded from flybase (wwwflybaseorg last accessed January30 2009) Version 513 of the reference annotation was usedfor comparative purposes with the DGRP analysis (Mackayet al 2012) Because alternative transcripts can result in thesame genomic position being classified in multiple categorieswe labeled each TE as every appropriate category This oc-curred with 52 of TEs in the DGRP and 60 of TEs in theDSPR When calculating the percentage of the referencegenome in our restricted areas of analysis that were classifiedas each genomic type we followed the same strategy

We compared our observations of TE insertions to a bino-mial distribution to determine whether TEs were distributedrandomly throughout the genome We used the pbinomfunction in R (R Development Core Team 2008) setting thenumber of trials to the total number of TE insertion locationsin the data set and the probability of success to the propor-tion of the genome defined as each context (intron exon30 UTR 50 UTR or intergenic) We did one tailed tests in eachcase For exon 30 UTR and 50 UTR tests we calculated theprobability of observing the observed value or fewer TEs Forintron and intergenic tests we calculated the probability ofobserving the observed value or more TEs

Variables Affecting TE Density

For the DGRP25 and DSPR data sets we calculatedTE density in terms of TEsMb For this analysis we alsoexcluded chromosome 4 We examined the effects of recom-bination rate chromosome and TE element family on TE

2324

Cridland et al doi101093molbevmst129 MBE

density using the standard linear model in R We usedthe lm() function to define our model as followsTEsMb ~ Line + Recombination Rate + Chromosome +Recombination Rate Chromosome We performed thisanalysis both for the set of all TEs and also for 15 TE familieswith moderate to high number of copies in our data set(supplementary table S4 Supplementary Material online)We selected individual TE families for this additional analysisbased on their level of abundance in the data sets and theirhaving been previously studied by Montgomery et al (1987)and Bartolome et al (2002)

Gene Families

We downloaded the Dfam database from (httpwwwindianaedu~hahnlabflyDfamDBdrosophila_frbhtml last acc-essed September 1 2012) This database describes all mem-bers of all gene families clustered using a fuzzy reciprocalBLAST method as described in Hahn et al (2007) for 12Drosophila species We used R to perform a linear regressionusing a simple model where the square root (TE density) in agene is the random variable and log10 (gene family size) is theindependent variable

Variation between Resources

For each TE present in both data sets we compared thefrequency of the TE in each population using Fisherrsquos exacttest We excluded from this analysis all insertions where wewere unable to make a presenceabsence call in at least 100 ofthe DGRP lines or 13 of the DSPR lines For each gene with TEinsertions in both data sets we compared the frequency atwhich the gene contains TEs between data sets using aFisherrsquos exact test

Supplementary MaterialSupplementary tables S1ndashS6 and figures S1 and S2 are avail-able at Molecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Casey Bergman and Miaomiao He for earlyaccess to their data as well as helpful comments from CBThey also thank Rebekah Rogers for help with editing themanuscript This work was supported by the NationalInstitute of Health grant NIH R01-GM085183 to KRT theNational Library of Medicine - National Institute of Healthtraining grant LM007443 to JMC and the National Instituteof Health grant NIH R01-OD010974 to ADL and SJM

ReferencesAltschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic local

alignment search tool J Mol Biol 215403ndash410Aminetzach YT Macpherson JM Petrov DA 2005 Pesticide resistance

via transposition-mediated adaptive gene truncation in DrosophilaScience 309764ndash767

Bartolome C Maside X Charlesworth B 2002 On the abundance anddistribution of transposable elements in the genome of Drosophilamelanogaster Mol Biol Evol 19926ndash937

Bennetzen JL 2000 Transposable element contributions to plant geneand genome evolution Plant Mol Biol 42251ndash269

Bergman CM Bensasson D 2007 Recent LTR retrotransposon insertioncontrasts with waves of non-LTR insertion since speciation inDrosophila melanogaster Proc Natl Acad Sci U S A 10411340ndash11345

Biemont C Vieira C 2006 Junk DNA as an evolutionary force Nature443521ndash524

Birchler JA Hiebert JC 1989 Interaction of the Enhancer of white-apricotwith transposable element alleles at the white locus in Drosophilamelanogaster Genetics 122129ndash138

Birchler JA Hiebert JC Rabinow L 1989 Interaction of the mottler ofwhite with transposable element alleles at the white locus inDrosophila melanogaster Genes Dev 373ndash84

Bowen NJ McDonald JF 2001 Drosophila euchromatic LTR retrotran-sposons are much younger than the host species in which theyreside Genome Res 111527ndash1540

Carr M Soloway JR Robinson TE Brookfield JF 2002 Mechanisms reg-ulating copy numbers of six LTR retrotransposons in the genome ofDrosophila melanogaster Chromosoma 110511ndash518

Charlesworth B 1994 The effect of background selection against dele-terious mutations on weakly selected linked variants Genet Res 63213ndash227

Charlesworth B Jarne P Assimacopoulos S 1994 The distribution oftransposable elements within and between chromosomes in a pop-ulation of Drosophila melanogaster III Element abundances in het-erochromatin Genet Res 64183ndash197

Charlesworth B Langley CH 1989 The population genetics ofDrosophila transposable elements Annu Rev Genet 23251ndash287

Charlesworth B Lapid A 1989 A study of ten families of transposableelements on X chromosomes from a population of Drosophila mel-anogaster Genet Res 54113ndash125

Charlesworth B Lapid A Canada D 1992 The distribution of transpos-able elements within and between chromosomes in a population ofDrosophila melanogaster I Element frequencies and distributionGenet Res 60103ndash114

Chen GK Marjoram P Wall JD 2009 Fast and flexible simulation ofDNA sequence data Genome Res 19136ndash142

Chung H Bogwitz MR McCart C Andrianopoulos A Ffrench-ConstantRH Batterham P Daborn PJ 2007 Cis-regulatory elements in theaccord retrotransposon result in tissue-specific expression of theDrosophila melanogaster insecticide resistance gene Cyp6g1Genetics 1751071ndash1077

Cutter AD Good JM Pappas CT Saunders MA Starrett DM Wheeler TJ2005 Transposable element orientation bias in the Drosophila mel-anogaster genome J Mol Evol 6733ndash741

Daborn P Boundy S Yen J Pittendrigh B Ffrench-Constant R 2001DDT resistance in Drosophila correlates with Cyp6g1 over-expressionand confers cross-resistance to the neonicotinoid imidacloprid MolGenet Genomics 266556ndash563

Deloger M Cavalli FMG Lerat E Biemont C Sagot M-F Vieira C 2009Identification of expressed transposable element insertionsin the sequenced genome of Drosophila melanogaster Gene 43955ndash62

Dworkin I Palsson A Gibson G 2005 Replication of an Egfr-wing shapeassociation in a wild-caught cohort of Drosophila melanogasterGenetics 1692115ndash2125

Ewing B Green P 1998 Base-calling of automated sequencer tracesusing Phred II Error probabilities Genome Res 8186ndash194

Gruber JD Genissel A Macdonald SJ Long AD 2007 How repeatable areassociations between polymorphisms in achaete-scute and bristlenumber variation in Drosophila Genetics 1751987ndash1997

Haddrill PR Charlesworth B Halligan DL Andolfatto P 2005 Patterns ofintron sequence evolution in Drosophila are dependent upon lengthand GC content Genome Biol 6R67

Hahn MW Han MV Han S-G 2007 Gene family evolution across 12Drosophila genomes PLoS Genet 3e197

Hirschhorn JN Daly MJ 2005 Genome-wide association stud-ies for common diseases and complex traits Nat Rev Genet 695ndash108

2325

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 7: Abundance and Distribution of Transposable Elements in Two ...

typically unique to a single line though there were 3 insertionsin more than one line in the DSPR and 14 insertions in morethan one line in the DGRP

Some genes accumulated many TE insertions such asklarsicht (fig 4A gene images from the UCSC genome brow-ser Meyer et al 2013) which had 29 total insertions This genealso had a hotspot of insertions with seven independent in-sertions each at low frequency located within 36 kb of each

other Only one of these insertions was present in any givenline and these could be functionally equivalent though inde-pendently arising mutations Other genes had only a few in-sertions such as Notch (fig 4B) and Delta (fig 4C) Theinsertions in these genes were also at low frequency andthe few present were located in intronic regions

While most insertions exist in only one panel some werefound in both Cyp6a20 (fig 4D) contains a high frequencynon-reference TE insertions present in both data sets in 4lines out of 121 lines where we were able to make a presenceor absence call in the DGRP (hereafter shown as 4121) and314 lines in the DSPR Both Cyp6a20 and klarsicht are exam-ples of genes with TE insertions in exons which was

00

02

04

06

08

10

1 2 3 4 5 6 7+

Observed TEsObserved SNPsExpected

DSPR

00

02

04

06

08

10

1 2 3 4 5 6 7+

Observed TEsObserved SNPsExpected

DGRP

FIG 3 Derived allele count spectra for the DSRP lines and the DGRP25lines where a positive presence or absence call was made for eachinsertion in each line 6613 insertions in the DSPR and 3274 in theDGRP25 Count spectra for SNPs is from SNPs in introns 86 bp 2

tests between observed and expected distributions result in Pamp 0 forcomparisons between TEs and the neutral model as well as between TEsand SNPs for both data sets

Table 5 TE Density for 15 Individual Families of Elements

Mean Density (TEMb)

Element Resource X High X Low Auto High Auto Low

roo DGRP25 039 037 032 033DSPR 080 088 065 068

297 DGRP25 005 000 002 001DSPR 005 000 004 000

412 DGRP25 003 001 003 007DSPR 006 008 009 012

F DGRP25 002 006 004 006DSPR 007 008 012 013

176 DGRP25 001 000 001 002DSPR 002 000 002 006

Bari1 DGRP25 001 000 004 001DSPR 002 002 006 002

copia DGRP25 001 000 002 003DSPR 009 002 014 019

H DGRP25 001 000 001 000DSPR 000 000 001 000

hopper DGRP25 010 001 000 000DSPR 010 004 002 002

INE-1 DGRP25 042 188 003 048DSPR 045 203 004 055

jockey DGRP25 015 012 019 013DSPR 035 034 040 033

mdg1 DGRP25 003 002 003 010DSPR 008 004 012 018

pogo DGRP25 006 004 007 005DSPR 013 021 015 013

springer DGRP25 000 001 001 003DSPR 002 000 001 005

Table 4 TE Density in the X and Autosomes

TEMb

DGRP25 DSPR Both

X all 382 638 483

Autosomes all 351 593 447

X high recombination 371 600 461

X low recombination 393 676 505

2L high recombination 227 473 324

2L low recombination 323 589 428

2R high recombination 282 489 364

2R low recombination 617 863 714

3L high recombination 296 523 386

3L low recombination 474 726 573

3R high recombination 281 468 355

3R low recombination 372 670 490

2317

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

uncommon both in this data set and in previous studies(Kaminker et al 2002 Lipatov et al 2005)

Differences in the strength of selection against closelyrelated genes can also be illustrated by patterns of TE inser-tions The paralogs derailed (fig 4F) and derailed-2 (fig 4E)showed very different patterns of TE insertions though thesame pattern was seen in both resources derailed-2 locatedon chromosome 2R is in a region of moderate recombination199 cMMb whereas derailed is located towards the distalend of 2L in a region of low recombination 044 cMMb

Given the context of the recombination rates the expectationwould be that derailed would have a higher TE load sincedeleterious alleles are removed more efficiently in regions ofhigh recombination (Hudson and Kaplan 1995) but theopposite pattern was observed

While both genes play a role in Wnt5 signaling mutationsin derailed can cause major phenotypic changes in Drosophilanervous system development resulting in the loss of normalfunction (Yoshikawa et al 2003) derailed-2 when mutatedproduces only minor differences in neuron positioning

115 115115 215 115115

114 1122

11181120 11181124

2117

21182119

1122

1119 1118

2118

1121 3114

1115

1114

1119

1113

1119

1120

1120

1119

1120

3119

1119

A klarsichtDGRP

DSPR

Insertion in an Exon

314 1195

4121D Cyp6a20

315

1123

1415

56125

115115

415

115615

10124

1124 1124 1123

2124E derailed-2

B Notch

115 11211122

C Delta1114 1120

191

F derailed

Same insertion in both resources

FIG 4 Transposable element insertions in genes The frequency above each insertion is the number of lines in which the element is present over thenumber of lines in which the element is validated as either present or absent Gene images are from the UCSC genome browser (httpgenomeucscedu last accessed June 31 2012)

2318

Cridland et al doi101093molbevmst129 MBE

(Sakurai et al 2009) This suggests that either the strength ofselection may be much higher against insertions in derailed orthe insertion rates are unequal between the paralogs

Gene Family Size and TEs

We used available information on gene families in DrosophilaHahn et al (2007) grouped into families based on sequencesimilarity and found a very weak but significant effect of genefamily size on the average number of TE insertions per kbin both resources (P value = 00003 R2 = 0001 DSPRP value = 0002 R2 = 0007 DGRP) (fig 5A and B) Althoughthere was a large variance in the number of insertions per kb

in both resources (fig 5A and B) the average number ofTEskb increased as family size increased up to a moderatelylarge family size

Patterns of Variation between Resources

To replicate an association with a single causative allele thatallele must both exist in both panels and also be at highenough frequency to detect the association Given thatmost TEs were singletons and only 45 of insertion eventswere identified in both panels and were within genes thereplicability of associations with these alleles is likely to below if alleles are considered individually We examined sharedinsertions segregating at different frequencies between panelsand found only two TE insertions in genes that were segre-gating at different frequencies in the DGRP versus DSPRresources following Bonferroni correction (Fisherrsquos exacttests P 000007) One of these was in an exon of bothCG13175 and CG33964 a pair of overlapping genes Herethe TE insertion was at much higher frequency in theDGRP than in the DSPR (89122 vs 215) The other insertionwas in an intron of Caliban and the frequency of this insertionwas higher in the DSPR than the DGRP (1115 vs 7116)

If insertions are considered as a class of mutation thenthere may be better power to detect an association assumingthat different insertions contribute in the same way to thephenotype We also examined genes with TE insertions(139 of genes) though not necessarily the same insertionin both resources Treating all insertion alleles within a gene asequivalent a comparison between resources showed thatthere were 21 genes which contain TEs at different frequen-cies between the DGRP versus DSPR following Bonferronicorrection (Fisherrsquos exact tests P 0000038197) includingderailed-2 (supplementary table S6 Supplementary Materialonline)

Discussion

Distribution and Abundance of TransposableElements

Previous work on transposable elements has generally beenfocused on a few families of TEs (Montgomery et al 1987Charlesworth and Lapid 1989 Charlesworth et al 1992) orfocused only on a single genome (Bartolome et al 2002Kaminker et al 2002 Rizzon et al 2002) We have presenteda genomic analysis of TEs across a large number of lines andtwo data sets We found that transposable elements in bothresources are primarily found as singletons with a few mod-erate to high frequency insertions Transposable elements alsosegregated at lower allele frequencies than do presumed non-functional SNPs drawn from the same data set suggestingthat negative selection is acting upon TEs However theremay be substantial variation in the age of transposable ele-ments in different families many LTR elements are thought tobe young insertions (Bowen and McDonald 2001 Bergmanand Bensasson 2007) while other families like INE-1 arethought to be quite old (Kapitonov and Jurka 2003) Thismay create a situation where the young age of insertions isresponsible for the excess of rare alleles We found a

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

DSPR

DGRPA

B

FIG 5 log10(TE density) versus log10(Gene family size)

2319

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

non-random distribution of TE insertions across the genomewith increases of TEs in both the intergenic and intronicregions of the genome and decreases in exonic and UTRregions (Kaminker et al 2002 Lipatov et al 2005) Similar toprevious studies we found variation in TE density betweenchromosomes and higher densities of TEs in regions of lowrecombination (Charlesworth et al 1994 Bartolome et al2002) We also found increased TE densities on the X relativeto the autosomes both in overall insertions and in someindividual TE families (table 5)

Detection method and the size of the data set are likely tobe contributing factors to differences between previous workand this study Many previous studies have used in situ meth-ods to ascertain TE number in samples and have looked atelement accumulation at the base versus the midsections ofchromosomes (Montgomery et al 1987 Charlesworth et al1992) or have looked only at one genome (Bartolome et al2002 Rizzon et al 2002) A previous PCR-based detectionstudy found no difference in TE frequency between the Xand the autosomes in natural populations (Petrov et al2011) but this study was biased in that it only looked atTEs shared with the reference sequence and thus did notcapture the full picture of TEs in these populations

While our analysis does disagree with previous work thesedisagreements may be mainly due to the way in which regionsof the genome are divided for examination Rizzon et al(2002) found an increase in copies of LTR retrotransposonson the X when they excluded pericentromeric telomeric andchromosome 4 from their analysis though they found a def-icit of TEs on the X when these regions are included Thisfinding is most directly comparable to our own finding for theentire TE data set since they exclude a very similar set ofgenomic regions We do agree with previous work in ourfinding of higher densities of TEs in regions of low recombi-nation than in high recombination (Charlesworth et al 1994Bartolome et al 2002)

Examining individual element families Carr et al (2002)found an increase of mdg-3 elements and 297 elements on theX than would be expected under a random insertion modelMontgomery et al (1987) found no evidence for a reductionin TEs on the X for both roo and 297 elements but foundfewer TEs on the X when looking at the 412 family HoweverCharlesworth et al (1992) observed the opposite pattern forroo and 297 Montgomery et al (1991) also found that moreroo elements in lab kept lines than they had previously ob-served in natural populations We found significantly in-creased densities on the X for roo and significantlyincreased densities on the autosomes for 412

Differences between previous studies of individual elementfamilies and our results may be due to a variety of reasonsFirst we were able to annotate only about two-thirds of ourTE insertions This was primarily due to having insufficientsequence to reliably annotate the element or because inde-pendent annotation using sequence reconstructed fromeither end of the insertion did not yield the same elementfamily annotation This means that our density calculationsfor individual elements may be skewed by missing annotationinformation Second many previous studies (Montgomery

et al 1987 Charlesworth et al 1992 Carr et al 2002) examinedonly LTR elements These elements are young (Bowen andMcDonald 2001 Bergman and Bensasson 2007) and thus in-dividual copies may not yet have been removed from thegenome thereby displaying different patterns of insertionthan older active elements

What is clear is that there is great variability in TE densityboth within and between element families as well as withinand between chromosomes It may also be that the overallpattern is driven by a small group of families In our data setINE-1 shows the largest pattern of difference between the Xand autosomes with an order of magnitude increase in meanTE density per line on the X over the autosomes possibly dueto an increased likelihood of fixation of insertions (Omettoet al 2005) specifically on the X chromosome (Presgraves2006) These INE-1 elements are a subset of the largest TEfamily in D melanogaster though only several dozen of themare in the euchromatin INE-1 elements are also considered tobe inactive in D melanogaster and likely have not been mobilefor gt3 million years (Kapitonov and Jurka 2003) Consistentwith this most of the INE-1 elements we detected were fixedor nearly fixed in our data set We did see some examples oflow frequency insertions which may be due to more recenttranspositional events or incorrect TE annotation due toshort sequences The old age of INE-1 insertions also suggeststhat the extant elements are unlikely to be selected againstAn increased density of TEs on the X compared with theautosomes suggests that it is not the fitness effects of inser-tions which govern TE copy number Fitness effects wouldresult in the more rapid removal of insertions on the X thaninsertions on the autosomes due to the increased selectivepressures experienced on the X (Montgomery et al 1987)Background selection also predicts that the X should be lessburdened by deleterious recessives since it is hemizygous onethird of the time (Charlesworth 1994)

However it is unclear whether our observation of in-creased densities of TEs on the X relative to the autosomesfits with the ectopic exchange model This model predictsthat TEs will accumulate in regions of low ectopic exchange(Langley et al 1988) The general assumption in the literatureis that ectopic exchange is positively correlated with recom-bination rate (Langley et al 1988 Montgomery et al 1991)The X chromosome experiences higher levels of recombina-tion than the autosomes with an average of 36 cMMb forthe X in regions of high recombination (cMMbgt 2) whereasthe autosomes have a mean of 336 cMMb If recombinationrate and ectopic exchange are positively correlated through-out the genome then the expectation would be fewer TEs inthe middle regions of the X where recombination rates areknown to be high This is the opposite of what we observe Ifrecombination rate and the rate of ectopic exchange are notpositively correlated then the ectopic exchange model maybe correct but data are lacking to properly address this issueCurrently there are no direct measurements of the rate ofectopic exchange on a genome-wide scale in Drosophiladuring meiosis Montgomery et al (1991) observed littleectopic exchange between homozygous chromosomes inDrosophila melanogaster on the X chromosome in the

2320

Cridland et al doi101093molbevmst129 MBE

region around the white locus They also found that whilerecombination rate was reduced in both centromeric andtelomeric regions only the centromeric region displayed anincreased density of TEs This information is suggestive ofdifferences between these regions in the rate of ectopicexchange

The DGRP and DSPR are also composed of inbred lineswhich have been kept in laboratory conditions for some time~9 years in the case of the DGRP lines and 40+ years in thecase of the DSPR Given that TE insertion mutations accumu-late rapidly and in an increasing manner over time (Nuzhdinet al 1997) the increased density of TEs on the X versus theautosomes may not be surprising if there is a highertransposition rate to the X The only direct study to datefocused on roo elements in D melanogaster and did observehigher rates of insertion in the X than the autosomes(Vazquez et al 2007)

Transposable Elements in Genes

The majority of genes 813 do not have any TE insertions inany line studied and intron length is a good predictor of thenumber of TEs within a gene However a few genes accumu-late many insertions even after correcting for gene size andintron length and some genes have very few TEs even thoughthey have large introns The reasons why different genes mayaccumulate TE insertions may be quite different for individualgenes In some cases there may be hot spots for insertionswhich if insertions are neutral or only slightly deleterious mayresult in large number of insertions in a small region In othergenes the lack of TE accumulation may suggest selectivepressures against TE accumulation

Gene family size may also play a role in TE accumulationWe see an increase in TE density correlated with an increase ingene family size This may indicate differences in selectivepressures for genes in different sized families Single-copygenes may experience more constraint whereas in moder-ately sized gene families these constraints may be more re-laxed perhaps due to redundancy

In the cytochrome p450 functional family of genes we findthat 3789 of these genes have TE insertions 2589 in theDGRP and 2089 in the DSPR Cytochrome p450 is also foundto be in an enriched term in the DSPR for the set of genescontaining TEs The cytochrome p450 gene Cyp6g1 is knownto confer resistance to DDT (Daborn et al 2001) and TE-mediated copy number variation is associated with increasingresistance to DDT (Schmidt et al 2010) This pattern of TEinsertions in the DSPR may be reflective of the locationswhere DSPR lines were originally collected The DSPR lineswere gathered from various locations worldwide at a timewhen different pesticides were in use The DGRP lines weregathered from a single geographic area in North America in2003 Presumably these sets of lines have experienced differ-ent selective pressures with respect to pesticide resistance InCyp6a20 (fig 4D) where TEs are segregating at different fre-quencies in the two resources some similar effect may beoccurring

TE Detection Pipelines

The TE detection strategy used here is an improvement uponthe strategy used in Mackay et al (2012) The most notabledifferences between the data reported in Mackay et al (2012)and the DGRP data reported here is substantial amount ofnew information First our pipeline improvements agree wellwith the previous implementation with 812 of previous TEpresence calls identified by the new pipeline However thesecalls represent on 182 of the presence calls made by theimproved pipeline demonstrating that we have identifiedsignificantly more calls than before The incorporation of TEabsence calls also allows us to provide a much more completepicture of TEs in euchromatic sequence in the DGRP

The TE detection pipeline used here is similar to otherpipelines which have been employed but it is useful to de-scribe briefly here the differences in approaches and why wefollowed the approach presented here A useful discussion ofcommon approaches to TE detection in many species can befound in Xing et al (2013) One common method for TEdetection is the split-read mapping approach which hasbeen employed in Drosophila by Linheiro and Bergman(2012) This approach identifies TE insertions by locating in-dividual sequence reads which span TE insertion breakpointsThis approach is the strategy of choice for next-generationsequencing experiments which produce only single-endreads However in a paired-end read situation this approachmay not take full advantage of the available data Specificallythe size of window around the TE insertion breakpoint whereinformative reads can be found is governed by read length in asingle-end experiment and fragment length in a paired-endexperiment Linheiro and Bergman (2012) discuss the impor-tance of read length on TE detection and suggest that in-creased read length improves TE detection In situationswhere read length is long and coverage is deep enough thesplit-read method may be preferable to the paired-end de-tection strategy However for the data sets analyzed herewhere read length is much shorter than fragment lengththe paired-end detection strategy will capture moreinformation

The pipeline used here also has similarities to the pipelineused in Kofler et al (2012) However there are two majordifferences in these studies First Kofler et al (2012) do notattempt to reconstruct their insertion breakpoints andsecond the population studied by Kofler et al (2012) wassequenced as a pool rather than as individual lines Becausetheir pooled sequence data reflect on average less than 1sequence coverage per line Kofler et al (2012) are primarilyable to detect intermediate to high frequency TE insertionstheir own analysis concludes that they cannot reliably detectinsertions at frequency lt7 (see supplementary fig S2Supplementary Material online in their supplementary dis-cussion) However the literature to date (Charlesworth andLangley 1989 Charlesworth and Lapid 1989 Charlesworthet al 1992) as well as the results of this study indicate thatthe majority of TEs in a population are found at very lowfrequency often in only one individual in a population andtherefore beyond the scope of detection of Kofler et al (2012)

2321

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Caveats

The D melanogaster reference genome to which we alignedour data is a P-element-free genome Therefore the use of thisreference will be biased against detecting any P elementsexisting in the data sets In the case of the DSPR data setthis is not an issue since these lines are also all P-element-free(Macdonald and Long 2007) However it is likely that theDGRP lines do harbor P elements For the purpose of theseanalyses we decided to ignore P elements and focus on theset of element families common to the two resources

This situation does highlight the importance of having anappropriate reference sequence To detect P elements in thegenome of D melanogaster an artificially constructed refer-ence sequence or a different assembled sequence containingthe P element would need to be utilized This also brings upthe question about as yet undetected elements that exist atlow frequency If these elements do not have sufficient sim-ilarity to existing elements to be aligned to the referencegenome these elements will go undetected

Conclusions

We have presented a description of transposable elements intwo recently released QTL resources Transposable elementscan play important roles in gene function and regulation andincorporation of TE insertion information into any analysisperformed with either of these two resources will present amore complete picture of genomic variation and its contri-bution to complex traits In addition it will be important forassociation studies attempting to replicate genotypendashpheno-type associations utilizing these resources to be aware of thepotential contributions of TEs

We have also demonstrated a next-generation high-throughput sequencing analysis pipeline that is capable ofdetecting this type variation with a high rate of precisionand specificity We suggest that future studies using next-generation sequencing data utilize this technique

Materials and Methods

Data Sets

Two data sets were used in this analysis The first set consistsof 131 inbred lines from the Drosophila Genetic ReferencePanel (DGRP) (Mackay et al 2012) SRA accessionSRP000694 The second set consists of 15 lines used to es-tablish the Drosophila Synthetic Population Resource (DSPR)(King et al 2012a) initially described in Macdonald and Long(2007) SRA accession SRA051316 For the DGRP resourceboth 454 and Illumina data were available whereas onlyIllumina data are available for the DSPR We utilized onlythe Illumina sequence data to ensure consistency in datacollection across data sets We calculated a mean coverageof 21 for the DGRP lines and 50 for the DSPR linesVariation in coverage for the DGRP lines was substantial be-tween lines ranging from ~4 average coverage to nearly50 average coverage We therefore dropped from the anal-ysis four lines with an average coverage of 10 or less sincelow coverage can cause difficulties with assembly and TEidentification Sequence data for both data sets consisted of

paired-end data with 54 bp reads for the DSPR and paired-end 100 bp reads trimmed to 75 bp for DGRP

IBD in DGRP Data Set

We downloaded the SNP tables generated by the DGRPproject from httpwwwhgscbcmtmceduprojectsdgrpfreeze1_July_2010snp_callsIllumina (last accessed July 302010) to examine identity by descent in this data set Weperformed an all by all comparison between lines examiningsliding windows of 1 Mb across the genome with 100 kb stepsbetween windows We take a simple definition of IBD hereWhenever we identified two lines that weregt95 identity atSNP positions in a 1 Mb region we labeled the pair as IBD inthe identified region and masked that region of the genomein the line with lower coverage We also dropped entirelyfrom our analysis any lines where gt50 of the genome wasdetermined to be IBD with other lines in the population(supplementary table S1 Supplementary Material online) re-moving from the analysis the line with lower average se-quence coverage

We modeled 1 Mb regions of 200 chromosomes in a coa-lescent simulation using the parameters theta = 001site and4Nr = 10theta to determine the number of regions of95IBD under a Wright-Fisher population model using the MACSsoftware (Chen et al 2009) We simulated 1000 replicates ofthis simulation and tabulated the number of 1 Mb regionsshowing 95 IBD in each replicate

Alignment

We aligned reads to the D melanogaster reference genome(version 513 downloaded from FlyBase flybaseorg) using thealigner BWA (version 059 Li and Durbin 2009) We used thefollowing parameters (aln -t 8 -l 13 -m 50000000 -R 5000)followed by the command ldquosamperdquo to resolve paired endmappings (using parameters -a 5000 -N 5000 -n 500) It isimportant when detecting TEs that the -R during the initialalignment and the -N parameter during the ldquosamperdquo phase beset high otherwise highly repetitive sequences that would beinformative of the presence of a common element will beexcluded from the data by the aligner We also used the -Ioption for the DGRP lines but not the DSPR due to the dif-ferences in Illumina quality score output format as the twodata sets were sequenced at different times The -m parameterfrom the initial alignment step and the -n parameter from thepaired-end resolution step deal with how BWA treats multiplyaligning reads An additional caveat is that some TE familiesmost notably the P-element family is absent from the Dmelanogaster reference sequence Such families in the se-quenced genomes will not align to the reference genome weused even if they are present in the sampled lines

Transposable Element Detection

Transposable elements were detected by first identifying allreads that were aligned to an annotated TE in the referencegenome (supplementary fig S1 Supplementary Material on-line) We then identified the mates of these reads and selectedonly those mates which did not align to an annotated TE

2322

Cridland et al doi101093molbevmst129 MBE

These reads identify the set of read-pairs which span a TEinsertion The uniquely mapping non-TE reads were thenclustered based on their start position and chromosomalstrand to identify putative TEs Clusters on the plus strandwere matched to clusters on the minus strand to producepairs of clusters indicating either end of the putative TE Onceclusters were identified we filtered out all events which con-tained fewer than three read-pairs consistent with a giveninsertion for each cluster This means that putative eventsat this stage have a minimum of six read-pairs indicating theinsertion three on the plus strand and three on the minusstrand

We then extracted all reads and their mates that aligned tothe identified regions of the genome and attempted toreconstruct the local area including the breakpoint of theTE Because many TEs have repetitive regions at the ends ofthe element we reconstructed the two breakpoints of the TEseparately (see also supplementary fig 23 from Mackay et al2012) Local reconstruction was done by running Phrap (ver-sion 1090518 Ewing and Green 1998) with the followingparameters (-forcelevel 10 -minscore 10 -minmatch 10)This set of parameters causes Phrap to do a Smith-Waterman comparison when it finds matches of at least 10characters but then relaxes the minimum alignment scoreand the final assembly parameters This is necessary to recon-struct TEs because repetitive sequence within the TEs canresult in a failure to reconstruct a contig if stricter assemblyparameters are used All code we developed for the detectionof TEs will be freely available at wwwmolpopgenorgData(last accessed August 1 2013)

Following reconstruction we used BlastN (verson 2222) tocompare reconstructed contigs to the D melanogaster refer-ence to confirm the presence of a TE In most cases we wereable to fully reconstruct both breakpoints of the TE Fullyreconstructed indicates that we could reconstruct a singlecontig which contained the breakpoint of the TE In mostcases the two breakpoints of the TE did not resolve to thesame nucleotide but instead identify the location of thetarget site duplication for the insertion (Linheiro andBergman 2012 similar to fig 1) These are cases where theupstream estimate of the breakpoint is slightly higher thanthe downstream estimate of the breakpoint and the differ-ence in position can be seen as the likely span of the target siteduplication However sometimes we were not able to fullyreconstruct a contig but instead generated two contigs onecontaining uniquely aligning sequence and the other contain-ing TE sequence These contigs are known to belong togetherbecause reads forming one contig have their mates in theother contig In these cases we have an approximate but notprecise estimate of the TE breakpoint and are unable toidentify the target site duplication

After the first round of detection we performed a secondround of detection where we examined every location wherea novel TE was detected in any line in all other lines Thisincluded examining positions in one data set that were iden-tified in the other data set In addition we examined thelocations of all known TEs in the D melanogaster genomeFor each location queried in each line we extracted reads

aligning to that location and their mates and used Phrap toreconstruct a contig This allowed us to produce presenceabsence or missing data calls for all positions where TE wasdetected in all lines The absence of a TE is determined byidentifying reconstructed contigs that spanned the insertionposition of the TE by at least 15 bp on either side Theseabsence calls were not included in Mackay et al (2012)This also allows us to identify TEs that may be in regions ofthe genome with lower average sequence coverage comparedwith the rest of the genome provided that insertion is sharedwith another line where it was possible to detect the insertion

Transposable Element Annotation

Annotation of novel TEs was performed by aligning recon-structed contigs with BlastN to the set of TEs found in theD melanogaster reference We chose the best match basedon overall length and BLAST reported e-value and called thenovel TE a TE of that family We also required a minimum of80 bp of contig matching a TE to make an annotation callThis is an additional annotation analysis to the one we per-formed in Mackay et al (2012) For each of the two contigsproduced for each insertion we determined the best matchand then compared these to each other TEs are classified asper the classification scheme put forth in Wicker et al (2007)We annotated each TE insertion to the highest level of agree-ment between the two contigs if the two contigs disagreedwith each other at every level of classification we called thatelement ldquoundeterminedrdquo This could result from nested in-sertions of elements where the 50 end of the insertion is truly adifferent element from the 30 end or because the TE sequenceis degraded and differs substantially from the referencesequence or because there was not enough TE sequence tomeet our annotation guidelines Because we can only recon-struct a maximum of few hundred bases at either end of theinsertion it is difficult to distinguish between these issues Wealso did not attempt to determine the strand of the TE

Validation

A subset of 190 TE insertion sites shared with the D melano-gaster reference were validated via PCR in a subset of nineof the DGRP lines by Blumenstiel JP Chen X He M BergmanCM (unpublished data httparxivorgabs12093456 lastaccessed June 26 2013) To independently check IBD callsusing SNP data we also determined if pairs of lines wereIBD for TEs in the same regions In this comparison we onlyincluded TE insertions where both lines have a positive pres-ence or absence call This can also serve as a validation of ourTE calling pipeline since if two segments are IBD they shouldshare the same set of TEs

Simulation

We selected a set of elements of different sizes and of TEclasses from the set of TEs in the reference sequence andinserted them semi-randomly into the D melanogaster refer-ence genome in silico Insertions were semi-random since weavoided areas already occupied by reference TEs and the cen-tromeric and telomeric regions Read pairs were then

2323

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

generated to construct fastq files We then ran our TE detec-tion pipeline as above through the initial annotation stepThis process was repeated three times to generate a simula-tion of 15 25 and 50 coverage

Recombination Rates

We wrote an R routine to estimate local recombination ratesas a function of physical position Briefly we used ldquocyto-ge-netic-seqtsvrdquo from wwwflybaseorg (last accessed January 302009) to assign a set of physical landmarks to genetic posi-tions For each chromosome arm we then fit a local polyno-mial relating cumulative cM to cumulative basepair (usinglocpoly) Unlike many prior local polynomial fits we madethe added assumption that this polynomial must be a mono-tonically increasing function (using monoproc) an assump-tion that is true for real data We then fit a smooth spline tothe above curve (using smoothspline) The advantage ofsmoothspline is that the resulting curves can be differentiated(and hence estimates of rates obtained directly) Finally we fita smooth spline to the first derivatives obtained from theprior smooth spline largely to make the recombinationrates less noisy near centromeres and telomeres TheR code is freely available at wwwmolpopgenorgData (lastaccessed August 1 2013)

Data Set Restriction

Analyses performed on these data sets were restricted in twoways First we restricted our analyses to regions of thegenome with low TE density and moderate recombinationrates We restricted our data set in this way because regions ofhigh TE density make it more difficult to make accurate callsBecause our pipeline requires unique reads to positionallylocate the TE insertion regions with high TE density canlack enough unique sequence to generate these reads Theregions included in the analyses are X300000ndash208000002L200000ndash20100000 2R2300000ndash21000000 3L100000ndash21900000 3R600000ndash27800000 which are not centromericand telomeric regions This excludes the majority of TEs an-notated in the reference genome since these TEs are largelyclustered together in centromeric regions However this doesnot affect our analysis since we are interested in TEs in theeuchromatic portions of the genome

Second to restrict the set of reference TEs studied to full-length copies we removed from the analysis all reference TEsthat had a length that was less than 75 of the canonicalannotated length of the TE While these elements may stillproduce phenotypic effects these elements are unlikely to beactive copies This reduced the number of reference TEs in-cluded in the analysis to 607 from 1085 TEs that are present inthe reference genome in the regions included in our analysis

Additionally some analyses were performed on the subsetof 23 lines in the DGRP data set which had an average se-quencing coverage of 25 or higher hereafter referred to asDGRP25 This restriction was to mitigate the effect of differ-ences in coverage between the two data sets Simulations oftransposable element detection via our pipeline data not

shown indicate that TE detection at 25 is comparable todetection at 50

Site Count Spectra

We generated derived allele count spectra for TEs in the DSPRdata set and the DGRP25 set of lines We included only siteswhere we were able to make a positive presence or absencecall for each line in the data set so that the number of lineswas normalized for each resource for this analysis Observedvalues were compared with expected values at counts 1 2 34 5 6 and 7 under an infinite sites model via a 2 testCategories 7 and higher were grouped due to low counts insome of these categories We also calculated count spectra forSNPs present in small introns (introns86 bp [Haddrill et al2005]) a group unlikely to have functional effects We alsoused SoFoS a site frequency rescaler (Hufford et al 2012) torescale the DSPR data set to the sample size of the DGRP25from 15 to 23 samples to see if the rescaled data show adifferent distribution of element counts from the observedDGRP25 data We then did a KolmogorovndashSmirnov test tocompare the two distributions as a test of different selectiveeffects acting on the populations

Genomic Context of TEs

Transposable elements were identified in regions of low(lt2 cMMb) or high recombination (2 cMMb) FASTAfiles containing the annotated coordinates of 50 UTRs30 UTRs intronic intergenic and full gene span regions forversion 513 of the D melanogaster reference were down-loaded from flybase (wwwflybaseorg last accessed January30 2009) Version 513 of the reference annotation was usedfor comparative purposes with the DGRP analysis (Mackayet al 2012) Because alternative transcripts can result in thesame genomic position being classified in multiple categorieswe labeled each TE as every appropriate category This oc-curred with 52 of TEs in the DGRP and 60 of TEs in theDSPR When calculating the percentage of the referencegenome in our restricted areas of analysis that were classifiedas each genomic type we followed the same strategy

We compared our observations of TE insertions to a bino-mial distribution to determine whether TEs were distributedrandomly throughout the genome We used the pbinomfunction in R (R Development Core Team 2008) setting thenumber of trials to the total number of TE insertion locationsin the data set and the probability of success to the propor-tion of the genome defined as each context (intron exon30 UTR 50 UTR or intergenic) We did one tailed tests in eachcase For exon 30 UTR and 50 UTR tests we calculated theprobability of observing the observed value or fewer TEs Forintron and intergenic tests we calculated the probability ofobserving the observed value or more TEs

Variables Affecting TE Density

For the DGRP25 and DSPR data sets we calculatedTE density in terms of TEsMb For this analysis we alsoexcluded chromosome 4 We examined the effects of recom-bination rate chromosome and TE element family on TE

2324

Cridland et al doi101093molbevmst129 MBE

density using the standard linear model in R We usedthe lm() function to define our model as followsTEsMb ~ Line + Recombination Rate + Chromosome +Recombination Rate Chromosome We performed thisanalysis both for the set of all TEs and also for 15 TE familieswith moderate to high number of copies in our data set(supplementary table S4 Supplementary Material online)We selected individual TE families for this additional analysisbased on their level of abundance in the data sets and theirhaving been previously studied by Montgomery et al (1987)and Bartolome et al (2002)

Gene Families

We downloaded the Dfam database from (httpwwwindianaedu~hahnlabflyDfamDBdrosophila_frbhtml last acc-essed September 1 2012) This database describes all mem-bers of all gene families clustered using a fuzzy reciprocalBLAST method as described in Hahn et al (2007) for 12Drosophila species We used R to perform a linear regressionusing a simple model where the square root (TE density) in agene is the random variable and log10 (gene family size) is theindependent variable

Variation between Resources

For each TE present in both data sets we compared thefrequency of the TE in each population using Fisherrsquos exacttest We excluded from this analysis all insertions where wewere unable to make a presenceabsence call in at least 100 ofthe DGRP lines or 13 of the DSPR lines For each gene with TEinsertions in both data sets we compared the frequency atwhich the gene contains TEs between data sets using aFisherrsquos exact test

Supplementary MaterialSupplementary tables S1ndashS6 and figures S1 and S2 are avail-able at Molecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Casey Bergman and Miaomiao He for earlyaccess to their data as well as helpful comments from CBThey also thank Rebekah Rogers for help with editing themanuscript This work was supported by the NationalInstitute of Health grant NIH R01-GM085183 to KRT theNational Library of Medicine - National Institute of Healthtraining grant LM007443 to JMC and the National Instituteof Health grant NIH R01-OD010974 to ADL and SJM

ReferencesAltschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic local

alignment search tool J Mol Biol 215403ndash410Aminetzach YT Macpherson JM Petrov DA 2005 Pesticide resistance

via transposition-mediated adaptive gene truncation in DrosophilaScience 309764ndash767

Bartolome C Maside X Charlesworth B 2002 On the abundance anddistribution of transposable elements in the genome of Drosophilamelanogaster Mol Biol Evol 19926ndash937

Bennetzen JL 2000 Transposable element contributions to plant geneand genome evolution Plant Mol Biol 42251ndash269

Bergman CM Bensasson D 2007 Recent LTR retrotransposon insertioncontrasts with waves of non-LTR insertion since speciation inDrosophila melanogaster Proc Natl Acad Sci U S A 10411340ndash11345

Biemont C Vieira C 2006 Junk DNA as an evolutionary force Nature443521ndash524

Birchler JA Hiebert JC 1989 Interaction of the Enhancer of white-apricotwith transposable element alleles at the white locus in Drosophilamelanogaster Genetics 122129ndash138

Birchler JA Hiebert JC Rabinow L 1989 Interaction of the mottler ofwhite with transposable element alleles at the white locus inDrosophila melanogaster Genes Dev 373ndash84

Bowen NJ McDonald JF 2001 Drosophila euchromatic LTR retrotran-sposons are much younger than the host species in which theyreside Genome Res 111527ndash1540

Carr M Soloway JR Robinson TE Brookfield JF 2002 Mechanisms reg-ulating copy numbers of six LTR retrotransposons in the genome ofDrosophila melanogaster Chromosoma 110511ndash518

Charlesworth B 1994 The effect of background selection against dele-terious mutations on weakly selected linked variants Genet Res 63213ndash227

Charlesworth B Jarne P Assimacopoulos S 1994 The distribution oftransposable elements within and between chromosomes in a pop-ulation of Drosophila melanogaster III Element abundances in het-erochromatin Genet Res 64183ndash197

Charlesworth B Langley CH 1989 The population genetics ofDrosophila transposable elements Annu Rev Genet 23251ndash287

Charlesworth B Lapid A 1989 A study of ten families of transposableelements on X chromosomes from a population of Drosophila mel-anogaster Genet Res 54113ndash125

Charlesworth B Lapid A Canada D 1992 The distribution of transpos-able elements within and between chromosomes in a population ofDrosophila melanogaster I Element frequencies and distributionGenet Res 60103ndash114

Chen GK Marjoram P Wall JD 2009 Fast and flexible simulation ofDNA sequence data Genome Res 19136ndash142

Chung H Bogwitz MR McCart C Andrianopoulos A Ffrench-ConstantRH Batterham P Daborn PJ 2007 Cis-regulatory elements in theaccord retrotransposon result in tissue-specific expression of theDrosophila melanogaster insecticide resistance gene Cyp6g1Genetics 1751071ndash1077

Cutter AD Good JM Pappas CT Saunders MA Starrett DM Wheeler TJ2005 Transposable element orientation bias in the Drosophila mel-anogaster genome J Mol Evol 6733ndash741

Daborn P Boundy S Yen J Pittendrigh B Ffrench-Constant R 2001DDT resistance in Drosophila correlates with Cyp6g1 over-expressionand confers cross-resistance to the neonicotinoid imidacloprid MolGenet Genomics 266556ndash563

Deloger M Cavalli FMG Lerat E Biemont C Sagot M-F Vieira C 2009Identification of expressed transposable element insertionsin the sequenced genome of Drosophila melanogaster Gene 43955ndash62

Dworkin I Palsson A Gibson G 2005 Replication of an Egfr-wing shapeassociation in a wild-caught cohort of Drosophila melanogasterGenetics 1692115ndash2125

Ewing B Green P 1998 Base-calling of automated sequencer tracesusing Phred II Error probabilities Genome Res 8186ndash194

Gruber JD Genissel A Macdonald SJ Long AD 2007 How repeatable areassociations between polymorphisms in achaete-scute and bristlenumber variation in Drosophila Genetics 1751987ndash1997

Haddrill PR Charlesworth B Halligan DL Andolfatto P 2005 Patterns ofintron sequence evolution in Drosophila are dependent upon lengthand GC content Genome Biol 6R67

Hahn MW Han MV Han S-G 2007 Gene family evolution across 12Drosophila genomes PLoS Genet 3e197

Hirschhorn JN Daly MJ 2005 Genome-wide association stud-ies for common diseases and complex traits Nat Rev Genet 695ndash108

2325

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 8: Abundance and Distribution of Transposable Elements in Two ...

uncommon both in this data set and in previous studies(Kaminker et al 2002 Lipatov et al 2005)

Differences in the strength of selection against closelyrelated genes can also be illustrated by patterns of TE inser-tions The paralogs derailed (fig 4F) and derailed-2 (fig 4E)showed very different patterns of TE insertions though thesame pattern was seen in both resources derailed-2 locatedon chromosome 2R is in a region of moderate recombination199 cMMb whereas derailed is located towards the distalend of 2L in a region of low recombination 044 cMMb

Given the context of the recombination rates the expectationwould be that derailed would have a higher TE load sincedeleterious alleles are removed more efficiently in regions ofhigh recombination (Hudson and Kaplan 1995) but theopposite pattern was observed

While both genes play a role in Wnt5 signaling mutationsin derailed can cause major phenotypic changes in Drosophilanervous system development resulting in the loss of normalfunction (Yoshikawa et al 2003) derailed-2 when mutatedproduces only minor differences in neuron positioning

115 115115 215 115115

114 1122

11181120 11181124

2117

21182119

1122

1119 1118

2118

1121 3114

1115

1114

1119

1113

1119

1120

1120

1119

1120

3119

1119

A klarsichtDGRP

DSPR

Insertion in an Exon

314 1195

4121D Cyp6a20

315

1123

1415

56125

115115

415

115615

10124

1124 1124 1123

2124E derailed-2

B Notch

115 11211122

C Delta1114 1120

191

F derailed

Same insertion in both resources

FIG 4 Transposable element insertions in genes The frequency above each insertion is the number of lines in which the element is present over thenumber of lines in which the element is validated as either present or absent Gene images are from the UCSC genome browser (httpgenomeucscedu last accessed June 31 2012)

2318

Cridland et al doi101093molbevmst129 MBE

(Sakurai et al 2009) This suggests that either the strength ofselection may be much higher against insertions in derailed orthe insertion rates are unequal between the paralogs

Gene Family Size and TEs

We used available information on gene families in DrosophilaHahn et al (2007) grouped into families based on sequencesimilarity and found a very weak but significant effect of genefamily size on the average number of TE insertions per kbin both resources (P value = 00003 R2 = 0001 DSPRP value = 0002 R2 = 0007 DGRP) (fig 5A and B) Althoughthere was a large variance in the number of insertions per kb

in both resources (fig 5A and B) the average number ofTEskb increased as family size increased up to a moderatelylarge family size

Patterns of Variation between Resources

To replicate an association with a single causative allele thatallele must both exist in both panels and also be at highenough frequency to detect the association Given thatmost TEs were singletons and only 45 of insertion eventswere identified in both panels and were within genes thereplicability of associations with these alleles is likely to below if alleles are considered individually We examined sharedinsertions segregating at different frequencies between panelsand found only two TE insertions in genes that were segre-gating at different frequencies in the DGRP versus DSPRresources following Bonferroni correction (Fisherrsquos exacttests P 000007) One of these was in an exon of bothCG13175 and CG33964 a pair of overlapping genes Herethe TE insertion was at much higher frequency in theDGRP than in the DSPR (89122 vs 215) The other insertionwas in an intron of Caliban and the frequency of this insertionwas higher in the DSPR than the DGRP (1115 vs 7116)

If insertions are considered as a class of mutation thenthere may be better power to detect an association assumingthat different insertions contribute in the same way to thephenotype We also examined genes with TE insertions(139 of genes) though not necessarily the same insertionin both resources Treating all insertion alleles within a gene asequivalent a comparison between resources showed thatthere were 21 genes which contain TEs at different frequen-cies between the DGRP versus DSPR following Bonferronicorrection (Fisherrsquos exact tests P 0000038197) includingderailed-2 (supplementary table S6 Supplementary Materialonline)

Discussion

Distribution and Abundance of TransposableElements

Previous work on transposable elements has generally beenfocused on a few families of TEs (Montgomery et al 1987Charlesworth and Lapid 1989 Charlesworth et al 1992) orfocused only on a single genome (Bartolome et al 2002Kaminker et al 2002 Rizzon et al 2002) We have presenteda genomic analysis of TEs across a large number of lines andtwo data sets We found that transposable elements in bothresources are primarily found as singletons with a few mod-erate to high frequency insertions Transposable elements alsosegregated at lower allele frequencies than do presumed non-functional SNPs drawn from the same data set suggestingthat negative selection is acting upon TEs However theremay be substantial variation in the age of transposable ele-ments in different families many LTR elements are thought tobe young insertions (Bowen and McDonald 2001 Bergmanand Bensasson 2007) while other families like INE-1 arethought to be quite old (Kapitonov and Jurka 2003) Thismay create a situation where the young age of insertions isresponsible for the excess of rare alleles We found a

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

DSPR

DGRPA

B

FIG 5 log10(TE density) versus log10(Gene family size)

2319

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

non-random distribution of TE insertions across the genomewith increases of TEs in both the intergenic and intronicregions of the genome and decreases in exonic and UTRregions (Kaminker et al 2002 Lipatov et al 2005) Similar toprevious studies we found variation in TE density betweenchromosomes and higher densities of TEs in regions of lowrecombination (Charlesworth et al 1994 Bartolome et al2002) We also found increased TE densities on the X relativeto the autosomes both in overall insertions and in someindividual TE families (table 5)

Detection method and the size of the data set are likely tobe contributing factors to differences between previous workand this study Many previous studies have used in situ meth-ods to ascertain TE number in samples and have looked atelement accumulation at the base versus the midsections ofchromosomes (Montgomery et al 1987 Charlesworth et al1992) or have looked only at one genome (Bartolome et al2002 Rizzon et al 2002) A previous PCR-based detectionstudy found no difference in TE frequency between the Xand the autosomes in natural populations (Petrov et al2011) but this study was biased in that it only looked atTEs shared with the reference sequence and thus did notcapture the full picture of TEs in these populations

While our analysis does disagree with previous work thesedisagreements may be mainly due to the way in which regionsof the genome are divided for examination Rizzon et al(2002) found an increase in copies of LTR retrotransposonson the X when they excluded pericentromeric telomeric andchromosome 4 from their analysis though they found a def-icit of TEs on the X when these regions are included Thisfinding is most directly comparable to our own finding for theentire TE data set since they exclude a very similar set ofgenomic regions We do agree with previous work in ourfinding of higher densities of TEs in regions of low recombi-nation than in high recombination (Charlesworth et al 1994Bartolome et al 2002)

Examining individual element families Carr et al (2002)found an increase of mdg-3 elements and 297 elements on theX than would be expected under a random insertion modelMontgomery et al (1987) found no evidence for a reductionin TEs on the X for both roo and 297 elements but foundfewer TEs on the X when looking at the 412 family HoweverCharlesworth et al (1992) observed the opposite pattern forroo and 297 Montgomery et al (1991) also found that moreroo elements in lab kept lines than they had previously ob-served in natural populations We found significantly in-creased densities on the X for roo and significantlyincreased densities on the autosomes for 412

Differences between previous studies of individual elementfamilies and our results may be due to a variety of reasonsFirst we were able to annotate only about two-thirds of ourTE insertions This was primarily due to having insufficientsequence to reliably annotate the element or because inde-pendent annotation using sequence reconstructed fromeither end of the insertion did not yield the same elementfamily annotation This means that our density calculationsfor individual elements may be skewed by missing annotationinformation Second many previous studies (Montgomery

et al 1987 Charlesworth et al 1992 Carr et al 2002) examinedonly LTR elements These elements are young (Bowen andMcDonald 2001 Bergman and Bensasson 2007) and thus in-dividual copies may not yet have been removed from thegenome thereby displaying different patterns of insertionthan older active elements

What is clear is that there is great variability in TE densityboth within and between element families as well as withinand between chromosomes It may also be that the overallpattern is driven by a small group of families In our data setINE-1 shows the largest pattern of difference between the Xand autosomes with an order of magnitude increase in meanTE density per line on the X over the autosomes possibly dueto an increased likelihood of fixation of insertions (Omettoet al 2005) specifically on the X chromosome (Presgraves2006) These INE-1 elements are a subset of the largest TEfamily in D melanogaster though only several dozen of themare in the euchromatin INE-1 elements are also considered tobe inactive in D melanogaster and likely have not been mobilefor gt3 million years (Kapitonov and Jurka 2003) Consistentwith this most of the INE-1 elements we detected were fixedor nearly fixed in our data set We did see some examples oflow frequency insertions which may be due to more recenttranspositional events or incorrect TE annotation due toshort sequences The old age of INE-1 insertions also suggeststhat the extant elements are unlikely to be selected againstAn increased density of TEs on the X compared with theautosomes suggests that it is not the fitness effects of inser-tions which govern TE copy number Fitness effects wouldresult in the more rapid removal of insertions on the X thaninsertions on the autosomes due to the increased selectivepressures experienced on the X (Montgomery et al 1987)Background selection also predicts that the X should be lessburdened by deleterious recessives since it is hemizygous onethird of the time (Charlesworth 1994)

However it is unclear whether our observation of in-creased densities of TEs on the X relative to the autosomesfits with the ectopic exchange model This model predictsthat TEs will accumulate in regions of low ectopic exchange(Langley et al 1988) The general assumption in the literatureis that ectopic exchange is positively correlated with recom-bination rate (Langley et al 1988 Montgomery et al 1991)The X chromosome experiences higher levels of recombina-tion than the autosomes with an average of 36 cMMb forthe X in regions of high recombination (cMMbgt 2) whereasthe autosomes have a mean of 336 cMMb If recombinationrate and ectopic exchange are positively correlated through-out the genome then the expectation would be fewer TEs inthe middle regions of the X where recombination rates areknown to be high This is the opposite of what we observe Ifrecombination rate and the rate of ectopic exchange are notpositively correlated then the ectopic exchange model maybe correct but data are lacking to properly address this issueCurrently there are no direct measurements of the rate ofectopic exchange on a genome-wide scale in Drosophiladuring meiosis Montgomery et al (1991) observed littleectopic exchange between homozygous chromosomes inDrosophila melanogaster on the X chromosome in the

2320

Cridland et al doi101093molbevmst129 MBE

region around the white locus They also found that whilerecombination rate was reduced in both centromeric andtelomeric regions only the centromeric region displayed anincreased density of TEs This information is suggestive ofdifferences between these regions in the rate of ectopicexchange

The DGRP and DSPR are also composed of inbred lineswhich have been kept in laboratory conditions for some time~9 years in the case of the DGRP lines and 40+ years in thecase of the DSPR Given that TE insertion mutations accumu-late rapidly and in an increasing manner over time (Nuzhdinet al 1997) the increased density of TEs on the X versus theautosomes may not be surprising if there is a highertransposition rate to the X The only direct study to datefocused on roo elements in D melanogaster and did observehigher rates of insertion in the X than the autosomes(Vazquez et al 2007)

Transposable Elements in Genes

The majority of genes 813 do not have any TE insertions inany line studied and intron length is a good predictor of thenumber of TEs within a gene However a few genes accumu-late many insertions even after correcting for gene size andintron length and some genes have very few TEs even thoughthey have large introns The reasons why different genes mayaccumulate TE insertions may be quite different for individualgenes In some cases there may be hot spots for insertionswhich if insertions are neutral or only slightly deleterious mayresult in large number of insertions in a small region In othergenes the lack of TE accumulation may suggest selectivepressures against TE accumulation

Gene family size may also play a role in TE accumulationWe see an increase in TE density correlated with an increase ingene family size This may indicate differences in selectivepressures for genes in different sized families Single-copygenes may experience more constraint whereas in moder-ately sized gene families these constraints may be more re-laxed perhaps due to redundancy

In the cytochrome p450 functional family of genes we findthat 3789 of these genes have TE insertions 2589 in theDGRP and 2089 in the DSPR Cytochrome p450 is also foundto be in an enriched term in the DSPR for the set of genescontaining TEs The cytochrome p450 gene Cyp6g1 is knownto confer resistance to DDT (Daborn et al 2001) and TE-mediated copy number variation is associated with increasingresistance to DDT (Schmidt et al 2010) This pattern of TEinsertions in the DSPR may be reflective of the locationswhere DSPR lines were originally collected The DSPR lineswere gathered from various locations worldwide at a timewhen different pesticides were in use The DGRP lines weregathered from a single geographic area in North America in2003 Presumably these sets of lines have experienced differ-ent selective pressures with respect to pesticide resistance InCyp6a20 (fig 4D) where TEs are segregating at different fre-quencies in the two resources some similar effect may beoccurring

TE Detection Pipelines

The TE detection strategy used here is an improvement uponthe strategy used in Mackay et al (2012) The most notabledifferences between the data reported in Mackay et al (2012)and the DGRP data reported here is substantial amount ofnew information First our pipeline improvements agree wellwith the previous implementation with 812 of previous TEpresence calls identified by the new pipeline However thesecalls represent on 182 of the presence calls made by theimproved pipeline demonstrating that we have identifiedsignificantly more calls than before The incorporation of TEabsence calls also allows us to provide a much more completepicture of TEs in euchromatic sequence in the DGRP

The TE detection pipeline used here is similar to otherpipelines which have been employed but it is useful to de-scribe briefly here the differences in approaches and why wefollowed the approach presented here A useful discussion ofcommon approaches to TE detection in many species can befound in Xing et al (2013) One common method for TEdetection is the split-read mapping approach which hasbeen employed in Drosophila by Linheiro and Bergman(2012) This approach identifies TE insertions by locating in-dividual sequence reads which span TE insertion breakpointsThis approach is the strategy of choice for next-generationsequencing experiments which produce only single-endreads However in a paired-end read situation this approachmay not take full advantage of the available data Specificallythe size of window around the TE insertion breakpoint whereinformative reads can be found is governed by read length in asingle-end experiment and fragment length in a paired-endexperiment Linheiro and Bergman (2012) discuss the impor-tance of read length on TE detection and suggest that in-creased read length improves TE detection In situationswhere read length is long and coverage is deep enough thesplit-read method may be preferable to the paired-end de-tection strategy However for the data sets analyzed herewhere read length is much shorter than fragment lengththe paired-end detection strategy will capture moreinformation

The pipeline used here also has similarities to the pipelineused in Kofler et al (2012) However there are two majordifferences in these studies First Kofler et al (2012) do notattempt to reconstruct their insertion breakpoints andsecond the population studied by Kofler et al (2012) wassequenced as a pool rather than as individual lines Becausetheir pooled sequence data reflect on average less than 1sequence coverage per line Kofler et al (2012) are primarilyable to detect intermediate to high frequency TE insertionstheir own analysis concludes that they cannot reliably detectinsertions at frequency lt7 (see supplementary fig S2Supplementary Material online in their supplementary dis-cussion) However the literature to date (Charlesworth andLangley 1989 Charlesworth and Lapid 1989 Charlesworthet al 1992) as well as the results of this study indicate thatthe majority of TEs in a population are found at very lowfrequency often in only one individual in a population andtherefore beyond the scope of detection of Kofler et al (2012)

2321

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Caveats

The D melanogaster reference genome to which we alignedour data is a P-element-free genome Therefore the use of thisreference will be biased against detecting any P elementsexisting in the data sets In the case of the DSPR data setthis is not an issue since these lines are also all P-element-free(Macdonald and Long 2007) However it is likely that theDGRP lines do harbor P elements For the purpose of theseanalyses we decided to ignore P elements and focus on theset of element families common to the two resources

This situation does highlight the importance of having anappropriate reference sequence To detect P elements in thegenome of D melanogaster an artificially constructed refer-ence sequence or a different assembled sequence containingthe P element would need to be utilized This also brings upthe question about as yet undetected elements that exist atlow frequency If these elements do not have sufficient sim-ilarity to existing elements to be aligned to the referencegenome these elements will go undetected

Conclusions

We have presented a description of transposable elements intwo recently released QTL resources Transposable elementscan play important roles in gene function and regulation andincorporation of TE insertion information into any analysisperformed with either of these two resources will present amore complete picture of genomic variation and its contri-bution to complex traits In addition it will be important forassociation studies attempting to replicate genotypendashpheno-type associations utilizing these resources to be aware of thepotential contributions of TEs

We have also demonstrated a next-generation high-throughput sequencing analysis pipeline that is capable ofdetecting this type variation with a high rate of precisionand specificity We suggest that future studies using next-generation sequencing data utilize this technique

Materials and Methods

Data Sets

Two data sets were used in this analysis The first set consistsof 131 inbred lines from the Drosophila Genetic ReferencePanel (DGRP) (Mackay et al 2012) SRA accessionSRP000694 The second set consists of 15 lines used to es-tablish the Drosophila Synthetic Population Resource (DSPR)(King et al 2012a) initially described in Macdonald and Long(2007) SRA accession SRA051316 For the DGRP resourceboth 454 and Illumina data were available whereas onlyIllumina data are available for the DSPR We utilized onlythe Illumina sequence data to ensure consistency in datacollection across data sets We calculated a mean coverageof 21 for the DGRP lines and 50 for the DSPR linesVariation in coverage for the DGRP lines was substantial be-tween lines ranging from ~4 average coverage to nearly50 average coverage We therefore dropped from the anal-ysis four lines with an average coverage of 10 or less sincelow coverage can cause difficulties with assembly and TEidentification Sequence data for both data sets consisted of

paired-end data with 54 bp reads for the DSPR and paired-end 100 bp reads trimmed to 75 bp for DGRP

IBD in DGRP Data Set

We downloaded the SNP tables generated by the DGRPproject from httpwwwhgscbcmtmceduprojectsdgrpfreeze1_July_2010snp_callsIllumina (last accessed July 302010) to examine identity by descent in this data set Weperformed an all by all comparison between lines examiningsliding windows of 1 Mb across the genome with 100 kb stepsbetween windows We take a simple definition of IBD hereWhenever we identified two lines that weregt95 identity atSNP positions in a 1 Mb region we labeled the pair as IBD inthe identified region and masked that region of the genomein the line with lower coverage We also dropped entirelyfrom our analysis any lines where gt50 of the genome wasdetermined to be IBD with other lines in the population(supplementary table S1 Supplementary Material online) re-moving from the analysis the line with lower average se-quence coverage

We modeled 1 Mb regions of 200 chromosomes in a coa-lescent simulation using the parameters theta = 001site and4Nr = 10theta to determine the number of regions of95IBD under a Wright-Fisher population model using the MACSsoftware (Chen et al 2009) We simulated 1000 replicates ofthis simulation and tabulated the number of 1 Mb regionsshowing 95 IBD in each replicate

Alignment

We aligned reads to the D melanogaster reference genome(version 513 downloaded from FlyBase flybaseorg) using thealigner BWA (version 059 Li and Durbin 2009) We used thefollowing parameters (aln -t 8 -l 13 -m 50000000 -R 5000)followed by the command ldquosamperdquo to resolve paired endmappings (using parameters -a 5000 -N 5000 -n 500) It isimportant when detecting TEs that the -R during the initialalignment and the -N parameter during the ldquosamperdquo phase beset high otherwise highly repetitive sequences that would beinformative of the presence of a common element will beexcluded from the data by the aligner We also used the -Ioption for the DGRP lines but not the DSPR due to the dif-ferences in Illumina quality score output format as the twodata sets were sequenced at different times The -m parameterfrom the initial alignment step and the -n parameter from thepaired-end resolution step deal with how BWA treats multiplyaligning reads An additional caveat is that some TE familiesmost notably the P-element family is absent from the Dmelanogaster reference sequence Such families in the se-quenced genomes will not align to the reference genome weused even if they are present in the sampled lines

Transposable Element Detection

Transposable elements were detected by first identifying allreads that were aligned to an annotated TE in the referencegenome (supplementary fig S1 Supplementary Material on-line) We then identified the mates of these reads and selectedonly those mates which did not align to an annotated TE

2322

Cridland et al doi101093molbevmst129 MBE

These reads identify the set of read-pairs which span a TEinsertion The uniquely mapping non-TE reads were thenclustered based on their start position and chromosomalstrand to identify putative TEs Clusters on the plus strandwere matched to clusters on the minus strand to producepairs of clusters indicating either end of the putative TE Onceclusters were identified we filtered out all events which con-tained fewer than three read-pairs consistent with a giveninsertion for each cluster This means that putative eventsat this stage have a minimum of six read-pairs indicating theinsertion three on the plus strand and three on the minusstrand

We then extracted all reads and their mates that aligned tothe identified regions of the genome and attempted toreconstruct the local area including the breakpoint of theTE Because many TEs have repetitive regions at the ends ofthe element we reconstructed the two breakpoints of the TEseparately (see also supplementary fig 23 from Mackay et al2012) Local reconstruction was done by running Phrap (ver-sion 1090518 Ewing and Green 1998) with the followingparameters (-forcelevel 10 -minscore 10 -minmatch 10)This set of parameters causes Phrap to do a Smith-Waterman comparison when it finds matches of at least 10characters but then relaxes the minimum alignment scoreand the final assembly parameters This is necessary to recon-struct TEs because repetitive sequence within the TEs canresult in a failure to reconstruct a contig if stricter assemblyparameters are used All code we developed for the detectionof TEs will be freely available at wwwmolpopgenorgData(last accessed August 1 2013)

Following reconstruction we used BlastN (verson 2222) tocompare reconstructed contigs to the D melanogaster refer-ence to confirm the presence of a TE In most cases we wereable to fully reconstruct both breakpoints of the TE Fullyreconstructed indicates that we could reconstruct a singlecontig which contained the breakpoint of the TE In mostcases the two breakpoints of the TE did not resolve to thesame nucleotide but instead identify the location of thetarget site duplication for the insertion (Linheiro andBergman 2012 similar to fig 1) These are cases where theupstream estimate of the breakpoint is slightly higher thanthe downstream estimate of the breakpoint and the differ-ence in position can be seen as the likely span of the target siteduplication However sometimes we were not able to fullyreconstruct a contig but instead generated two contigs onecontaining uniquely aligning sequence and the other contain-ing TE sequence These contigs are known to belong togetherbecause reads forming one contig have their mates in theother contig In these cases we have an approximate but notprecise estimate of the TE breakpoint and are unable toidentify the target site duplication

After the first round of detection we performed a secondround of detection where we examined every location wherea novel TE was detected in any line in all other lines Thisincluded examining positions in one data set that were iden-tified in the other data set In addition we examined thelocations of all known TEs in the D melanogaster genomeFor each location queried in each line we extracted reads

aligning to that location and their mates and used Phrap toreconstruct a contig This allowed us to produce presenceabsence or missing data calls for all positions where TE wasdetected in all lines The absence of a TE is determined byidentifying reconstructed contigs that spanned the insertionposition of the TE by at least 15 bp on either side Theseabsence calls were not included in Mackay et al (2012)This also allows us to identify TEs that may be in regions ofthe genome with lower average sequence coverage comparedwith the rest of the genome provided that insertion is sharedwith another line where it was possible to detect the insertion

Transposable Element Annotation

Annotation of novel TEs was performed by aligning recon-structed contigs with BlastN to the set of TEs found in theD melanogaster reference We chose the best match basedon overall length and BLAST reported e-value and called thenovel TE a TE of that family We also required a minimum of80 bp of contig matching a TE to make an annotation callThis is an additional annotation analysis to the one we per-formed in Mackay et al (2012) For each of the two contigsproduced for each insertion we determined the best matchand then compared these to each other TEs are classified asper the classification scheme put forth in Wicker et al (2007)We annotated each TE insertion to the highest level of agree-ment between the two contigs if the two contigs disagreedwith each other at every level of classification we called thatelement ldquoundeterminedrdquo This could result from nested in-sertions of elements where the 50 end of the insertion is truly adifferent element from the 30 end or because the TE sequenceis degraded and differs substantially from the referencesequence or because there was not enough TE sequence tomeet our annotation guidelines Because we can only recon-struct a maximum of few hundred bases at either end of theinsertion it is difficult to distinguish between these issues Wealso did not attempt to determine the strand of the TE

Validation

A subset of 190 TE insertion sites shared with the D melano-gaster reference were validated via PCR in a subset of nineof the DGRP lines by Blumenstiel JP Chen X He M BergmanCM (unpublished data httparxivorgabs12093456 lastaccessed June 26 2013) To independently check IBD callsusing SNP data we also determined if pairs of lines wereIBD for TEs in the same regions In this comparison we onlyincluded TE insertions where both lines have a positive pres-ence or absence call This can also serve as a validation of ourTE calling pipeline since if two segments are IBD they shouldshare the same set of TEs

Simulation

We selected a set of elements of different sizes and of TEclasses from the set of TEs in the reference sequence andinserted them semi-randomly into the D melanogaster refer-ence genome in silico Insertions were semi-random since weavoided areas already occupied by reference TEs and the cen-tromeric and telomeric regions Read pairs were then

2323

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

generated to construct fastq files We then ran our TE detec-tion pipeline as above through the initial annotation stepThis process was repeated three times to generate a simula-tion of 15 25 and 50 coverage

Recombination Rates

We wrote an R routine to estimate local recombination ratesas a function of physical position Briefly we used ldquocyto-ge-netic-seqtsvrdquo from wwwflybaseorg (last accessed January 302009) to assign a set of physical landmarks to genetic posi-tions For each chromosome arm we then fit a local polyno-mial relating cumulative cM to cumulative basepair (usinglocpoly) Unlike many prior local polynomial fits we madethe added assumption that this polynomial must be a mono-tonically increasing function (using monoproc) an assump-tion that is true for real data We then fit a smooth spline tothe above curve (using smoothspline) The advantage ofsmoothspline is that the resulting curves can be differentiated(and hence estimates of rates obtained directly) Finally we fita smooth spline to the first derivatives obtained from theprior smooth spline largely to make the recombinationrates less noisy near centromeres and telomeres TheR code is freely available at wwwmolpopgenorgData (lastaccessed August 1 2013)

Data Set Restriction

Analyses performed on these data sets were restricted in twoways First we restricted our analyses to regions of thegenome with low TE density and moderate recombinationrates We restricted our data set in this way because regions ofhigh TE density make it more difficult to make accurate callsBecause our pipeline requires unique reads to positionallylocate the TE insertion regions with high TE density canlack enough unique sequence to generate these reads Theregions included in the analyses are X300000ndash208000002L200000ndash20100000 2R2300000ndash21000000 3L100000ndash21900000 3R600000ndash27800000 which are not centromericand telomeric regions This excludes the majority of TEs an-notated in the reference genome since these TEs are largelyclustered together in centromeric regions However this doesnot affect our analysis since we are interested in TEs in theeuchromatic portions of the genome

Second to restrict the set of reference TEs studied to full-length copies we removed from the analysis all reference TEsthat had a length that was less than 75 of the canonicalannotated length of the TE While these elements may stillproduce phenotypic effects these elements are unlikely to beactive copies This reduced the number of reference TEs in-cluded in the analysis to 607 from 1085 TEs that are present inthe reference genome in the regions included in our analysis

Additionally some analyses were performed on the subsetof 23 lines in the DGRP data set which had an average se-quencing coverage of 25 or higher hereafter referred to asDGRP25 This restriction was to mitigate the effect of differ-ences in coverage between the two data sets Simulations oftransposable element detection via our pipeline data not

shown indicate that TE detection at 25 is comparable todetection at 50

Site Count Spectra

We generated derived allele count spectra for TEs in the DSPRdata set and the DGRP25 set of lines We included only siteswhere we were able to make a positive presence or absencecall for each line in the data set so that the number of lineswas normalized for each resource for this analysis Observedvalues were compared with expected values at counts 1 2 34 5 6 and 7 under an infinite sites model via a 2 testCategories 7 and higher were grouped due to low counts insome of these categories We also calculated count spectra forSNPs present in small introns (introns86 bp [Haddrill et al2005]) a group unlikely to have functional effects We alsoused SoFoS a site frequency rescaler (Hufford et al 2012) torescale the DSPR data set to the sample size of the DGRP25from 15 to 23 samples to see if the rescaled data show adifferent distribution of element counts from the observedDGRP25 data We then did a KolmogorovndashSmirnov test tocompare the two distributions as a test of different selectiveeffects acting on the populations

Genomic Context of TEs

Transposable elements were identified in regions of low(lt2 cMMb) or high recombination (2 cMMb) FASTAfiles containing the annotated coordinates of 50 UTRs30 UTRs intronic intergenic and full gene span regions forversion 513 of the D melanogaster reference were down-loaded from flybase (wwwflybaseorg last accessed January30 2009) Version 513 of the reference annotation was usedfor comparative purposes with the DGRP analysis (Mackayet al 2012) Because alternative transcripts can result in thesame genomic position being classified in multiple categorieswe labeled each TE as every appropriate category This oc-curred with 52 of TEs in the DGRP and 60 of TEs in theDSPR When calculating the percentage of the referencegenome in our restricted areas of analysis that were classifiedas each genomic type we followed the same strategy

We compared our observations of TE insertions to a bino-mial distribution to determine whether TEs were distributedrandomly throughout the genome We used the pbinomfunction in R (R Development Core Team 2008) setting thenumber of trials to the total number of TE insertion locationsin the data set and the probability of success to the propor-tion of the genome defined as each context (intron exon30 UTR 50 UTR or intergenic) We did one tailed tests in eachcase For exon 30 UTR and 50 UTR tests we calculated theprobability of observing the observed value or fewer TEs Forintron and intergenic tests we calculated the probability ofobserving the observed value or more TEs

Variables Affecting TE Density

For the DGRP25 and DSPR data sets we calculatedTE density in terms of TEsMb For this analysis we alsoexcluded chromosome 4 We examined the effects of recom-bination rate chromosome and TE element family on TE

2324

Cridland et al doi101093molbevmst129 MBE

density using the standard linear model in R We usedthe lm() function to define our model as followsTEsMb ~ Line + Recombination Rate + Chromosome +Recombination Rate Chromosome We performed thisanalysis both for the set of all TEs and also for 15 TE familieswith moderate to high number of copies in our data set(supplementary table S4 Supplementary Material online)We selected individual TE families for this additional analysisbased on their level of abundance in the data sets and theirhaving been previously studied by Montgomery et al (1987)and Bartolome et al (2002)

Gene Families

We downloaded the Dfam database from (httpwwwindianaedu~hahnlabflyDfamDBdrosophila_frbhtml last acc-essed September 1 2012) This database describes all mem-bers of all gene families clustered using a fuzzy reciprocalBLAST method as described in Hahn et al (2007) for 12Drosophila species We used R to perform a linear regressionusing a simple model where the square root (TE density) in agene is the random variable and log10 (gene family size) is theindependent variable

Variation between Resources

For each TE present in both data sets we compared thefrequency of the TE in each population using Fisherrsquos exacttest We excluded from this analysis all insertions where wewere unable to make a presenceabsence call in at least 100 ofthe DGRP lines or 13 of the DSPR lines For each gene with TEinsertions in both data sets we compared the frequency atwhich the gene contains TEs between data sets using aFisherrsquos exact test

Supplementary MaterialSupplementary tables S1ndashS6 and figures S1 and S2 are avail-able at Molecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Casey Bergman and Miaomiao He for earlyaccess to their data as well as helpful comments from CBThey also thank Rebekah Rogers for help with editing themanuscript This work was supported by the NationalInstitute of Health grant NIH R01-GM085183 to KRT theNational Library of Medicine - National Institute of Healthtraining grant LM007443 to JMC and the National Instituteof Health grant NIH R01-OD010974 to ADL and SJM

ReferencesAltschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic local

alignment search tool J Mol Biol 215403ndash410Aminetzach YT Macpherson JM Petrov DA 2005 Pesticide resistance

via transposition-mediated adaptive gene truncation in DrosophilaScience 309764ndash767

Bartolome C Maside X Charlesworth B 2002 On the abundance anddistribution of transposable elements in the genome of Drosophilamelanogaster Mol Biol Evol 19926ndash937

Bennetzen JL 2000 Transposable element contributions to plant geneand genome evolution Plant Mol Biol 42251ndash269

Bergman CM Bensasson D 2007 Recent LTR retrotransposon insertioncontrasts with waves of non-LTR insertion since speciation inDrosophila melanogaster Proc Natl Acad Sci U S A 10411340ndash11345

Biemont C Vieira C 2006 Junk DNA as an evolutionary force Nature443521ndash524

Birchler JA Hiebert JC 1989 Interaction of the Enhancer of white-apricotwith transposable element alleles at the white locus in Drosophilamelanogaster Genetics 122129ndash138

Birchler JA Hiebert JC Rabinow L 1989 Interaction of the mottler ofwhite with transposable element alleles at the white locus inDrosophila melanogaster Genes Dev 373ndash84

Bowen NJ McDonald JF 2001 Drosophila euchromatic LTR retrotran-sposons are much younger than the host species in which theyreside Genome Res 111527ndash1540

Carr M Soloway JR Robinson TE Brookfield JF 2002 Mechanisms reg-ulating copy numbers of six LTR retrotransposons in the genome ofDrosophila melanogaster Chromosoma 110511ndash518

Charlesworth B 1994 The effect of background selection against dele-terious mutations on weakly selected linked variants Genet Res 63213ndash227

Charlesworth B Jarne P Assimacopoulos S 1994 The distribution oftransposable elements within and between chromosomes in a pop-ulation of Drosophila melanogaster III Element abundances in het-erochromatin Genet Res 64183ndash197

Charlesworth B Langley CH 1989 The population genetics ofDrosophila transposable elements Annu Rev Genet 23251ndash287

Charlesworth B Lapid A 1989 A study of ten families of transposableelements on X chromosomes from a population of Drosophila mel-anogaster Genet Res 54113ndash125

Charlesworth B Lapid A Canada D 1992 The distribution of transpos-able elements within and between chromosomes in a population ofDrosophila melanogaster I Element frequencies and distributionGenet Res 60103ndash114

Chen GK Marjoram P Wall JD 2009 Fast and flexible simulation ofDNA sequence data Genome Res 19136ndash142

Chung H Bogwitz MR McCart C Andrianopoulos A Ffrench-ConstantRH Batterham P Daborn PJ 2007 Cis-regulatory elements in theaccord retrotransposon result in tissue-specific expression of theDrosophila melanogaster insecticide resistance gene Cyp6g1Genetics 1751071ndash1077

Cutter AD Good JM Pappas CT Saunders MA Starrett DM Wheeler TJ2005 Transposable element orientation bias in the Drosophila mel-anogaster genome J Mol Evol 6733ndash741

Daborn P Boundy S Yen J Pittendrigh B Ffrench-Constant R 2001DDT resistance in Drosophila correlates with Cyp6g1 over-expressionand confers cross-resistance to the neonicotinoid imidacloprid MolGenet Genomics 266556ndash563

Deloger M Cavalli FMG Lerat E Biemont C Sagot M-F Vieira C 2009Identification of expressed transposable element insertionsin the sequenced genome of Drosophila melanogaster Gene 43955ndash62

Dworkin I Palsson A Gibson G 2005 Replication of an Egfr-wing shapeassociation in a wild-caught cohort of Drosophila melanogasterGenetics 1692115ndash2125

Ewing B Green P 1998 Base-calling of automated sequencer tracesusing Phred II Error probabilities Genome Res 8186ndash194

Gruber JD Genissel A Macdonald SJ Long AD 2007 How repeatable areassociations between polymorphisms in achaete-scute and bristlenumber variation in Drosophila Genetics 1751987ndash1997

Haddrill PR Charlesworth B Halligan DL Andolfatto P 2005 Patterns ofintron sequence evolution in Drosophila are dependent upon lengthand GC content Genome Biol 6R67

Hahn MW Han MV Han S-G 2007 Gene family evolution across 12Drosophila genomes PLoS Genet 3e197

Hirschhorn JN Daly MJ 2005 Genome-wide association stud-ies for common diseases and complex traits Nat Rev Genet 695ndash108

2325

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 9: Abundance and Distribution of Transposable Elements in Two ...

(Sakurai et al 2009) This suggests that either the strength ofselection may be much higher against insertions in derailed orthe insertion rates are unequal between the paralogs

Gene Family Size and TEs

We used available information on gene families in DrosophilaHahn et al (2007) grouped into families based on sequencesimilarity and found a very weak but significant effect of genefamily size on the average number of TE insertions per kbin both resources (P value = 00003 R2 = 0001 DSPRP value = 0002 R2 = 0007 DGRP) (fig 5A and B) Althoughthere was a large variance in the number of insertions per kb

in both resources (fig 5A and B) the average number ofTEskb increased as family size increased up to a moderatelylarge family size

Patterns of Variation between Resources

To replicate an association with a single causative allele thatallele must both exist in both panels and also be at highenough frequency to detect the association Given thatmost TEs were singletons and only 45 of insertion eventswere identified in both panels and were within genes thereplicability of associations with these alleles is likely to below if alleles are considered individually We examined sharedinsertions segregating at different frequencies between panelsand found only two TE insertions in genes that were segre-gating at different frequencies in the DGRP versus DSPRresources following Bonferroni correction (Fisherrsquos exacttests P 000007) One of these was in an exon of bothCG13175 and CG33964 a pair of overlapping genes Herethe TE insertion was at much higher frequency in theDGRP than in the DSPR (89122 vs 215) The other insertionwas in an intron of Caliban and the frequency of this insertionwas higher in the DSPR than the DGRP (1115 vs 7116)

If insertions are considered as a class of mutation thenthere may be better power to detect an association assumingthat different insertions contribute in the same way to thephenotype We also examined genes with TE insertions(139 of genes) though not necessarily the same insertionin both resources Treating all insertion alleles within a gene asequivalent a comparison between resources showed thatthere were 21 genes which contain TEs at different frequen-cies between the DGRP versus DSPR following Bonferronicorrection (Fisherrsquos exact tests P 0000038197) includingderailed-2 (supplementary table S6 Supplementary Materialonline)

Discussion

Distribution and Abundance of TransposableElements

Previous work on transposable elements has generally beenfocused on a few families of TEs (Montgomery et al 1987Charlesworth and Lapid 1989 Charlesworth et al 1992) orfocused only on a single genome (Bartolome et al 2002Kaminker et al 2002 Rizzon et al 2002) We have presenteda genomic analysis of TEs across a large number of lines andtwo data sets We found that transposable elements in bothresources are primarily found as singletons with a few mod-erate to high frequency insertions Transposable elements alsosegregated at lower allele frequencies than do presumed non-functional SNPs drawn from the same data set suggestingthat negative selection is acting upon TEs However theremay be substantial variation in the age of transposable ele-ments in different families many LTR elements are thought tobe young insertions (Bowen and McDonald 2001 Bergmanand Bensasson 2007) while other families like INE-1 arethought to be quite old (Kapitonov and Jurka 2003) Thismay create a situation where the young age of insertions isresponsible for the excess of rare alleles We found a

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

0

1

2

3

00 05 10 15 20log10(Gene family size)

TE

den

sity

DSPR

DGRPA

B

FIG 5 log10(TE density) versus log10(Gene family size)

2319

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

non-random distribution of TE insertions across the genomewith increases of TEs in both the intergenic and intronicregions of the genome and decreases in exonic and UTRregions (Kaminker et al 2002 Lipatov et al 2005) Similar toprevious studies we found variation in TE density betweenchromosomes and higher densities of TEs in regions of lowrecombination (Charlesworth et al 1994 Bartolome et al2002) We also found increased TE densities on the X relativeto the autosomes both in overall insertions and in someindividual TE families (table 5)

Detection method and the size of the data set are likely tobe contributing factors to differences between previous workand this study Many previous studies have used in situ meth-ods to ascertain TE number in samples and have looked atelement accumulation at the base versus the midsections ofchromosomes (Montgomery et al 1987 Charlesworth et al1992) or have looked only at one genome (Bartolome et al2002 Rizzon et al 2002) A previous PCR-based detectionstudy found no difference in TE frequency between the Xand the autosomes in natural populations (Petrov et al2011) but this study was biased in that it only looked atTEs shared with the reference sequence and thus did notcapture the full picture of TEs in these populations

While our analysis does disagree with previous work thesedisagreements may be mainly due to the way in which regionsof the genome are divided for examination Rizzon et al(2002) found an increase in copies of LTR retrotransposonson the X when they excluded pericentromeric telomeric andchromosome 4 from their analysis though they found a def-icit of TEs on the X when these regions are included Thisfinding is most directly comparable to our own finding for theentire TE data set since they exclude a very similar set ofgenomic regions We do agree with previous work in ourfinding of higher densities of TEs in regions of low recombi-nation than in high recombination (Charlesworth et al 1994Bartolome et al 2002)

Examining individual element families Carr et al (2002)found an increase of mdg-3 elements and 297 elements on theX than would be expected under a random insertion modelMontgomery et al (1987) found no evidence for a reductionin TEs on the X for both roo and 297 elements but foundfewer TEs on the X when looking at the 412 family HoweverCharlesworth et al (1992) observed the opposite pattern forroo and 297 Montgomery et al (1991) also found that moreroo elements in lab kept lines than they had previously ob-served in natural populations We found significantly in-creased densities on the X for roo and significantlyincreased densities on the autosomes for 412

Differences between previous studies of individual elementfamilies and our results may be due to a variety of reasonsFirst we were able to annotate only about two-thirds of ourTE insertions This was primarily due to having insufficientsequence to reliably annotate the element or because inde-pendent annotation using sequence reconstructed fromeither end of the insertion did not yield the same elementfamily annotation This means that our density calculationsfor individual elements may be skewed by missing annotationinformation Second many previous studies (Montgomery

et al 1987 Charlesworth et al 1992 Carr et al 2002) examinedonly LTR elements These elements are young (Bowen andMcDonald 2001 Bergman and Bensasson 2007) and thus in-dividual copies may not yet have been removed from thegenome thereby displaying different patterns of insertionthan older active elements

What is clear is that there is great variability in TE densityboth within and between element families as well as withinand between chromosomes It may also be that the overallpattern is driven by a small group of families In our data setINE-1 shows the largest pattern of difference between the Xand autosomes with an order of magnitude increase in meanTE density per line on the X over the autosomes possibly dueto an increased likelihood of fixation of insertions (Omettoet al 2005) specifically on the X chromosome (Presgraves2006) These INE-1 elements are a subset of the largest TEfamily in D melanogaster though only several dozen of themare in the euchromatin INE-1 elements are also considered tobe inactive in D melanogaster and likely have not been mobilefor gt3 million years (Kapitonov and Jurka 2003) Consistentwith this most of the INE-1 elements we detected were fixedor nearly fixed in our data set We did see some examples oflow frequency insertions which may be due to more recenttranspositional events or incorrect TE annotation due toshort sequences The old age of INE-1 insertions also suggeststhat the extant elements are unlikely to be selected againstAn increased density of TEs on the X compared with theautosomes suggests that it is not the fitness effects of inser-tions which govern TE copy number Fitness effects wouldresult in the more rapid removal of insertions on the X thaninsertions on the autosomes due to the increased selectivepressures experienced on the X (Montgomery et al 1987)Background selection also predicts that the X should be lessburdened by deleterious recessives since it is hemizygous onethird of the time (Charlesworth 1994)

However it is unclear whether our observation of in-creased densities of TEs on the X relative to the autosomesfits with the ectopic exchange model This model predictsthat TEs will accumulate in regions of low ectopic exchange(Langley et al 1988) The general assumption in the literatureis that ectopic exchange is positively correlated with recom-bination rate (Langley et al 1988 Montgomery et al 1991)The X chromosome experiences higher levels of recombina-tion than the autosomes with an average of 36 cMMb forthe X in regions of high recombination (cMMbgt 2) whereasthe autosomes have a mean of 336 cMMb If recombinationrate and ectopic exchange are positively correlated through-out the genome then the expectation would be fewer TEs inthe middle regions of the X where recombination rates areknown to be high This is the opposite of what we observe Ifrecombination rate and the rate of ectopic exchange are notpositively correlated then the ectopic exchange model maybe correct but data are lacking to properly address this issueCurrently there are no direct measurements of the rate ofectopic exchange on a genome-wide scale in Drosophiladuring meiosis Montgomery et al (1991) observed littleectopic exchange between homozygous chromosomes inDrosophila melanogaster on the X chromosome in the

2320

Cridland et al doi101093molbevmst129 MBE

region around the white locus They also found that whilerecombination rate was reduced in both centromeric andtelomeric regions only the centromeric region displayed anincreased density of TEs This information is suggestive ofdifferences between these regions in the rate of ectopicexchange

The DGRP and DSPR are also composed of inbred lineswhich have been kept in laboratory conditions for some time~9 years in the case of the DGRP lines and 40+ years in thecase of the DSPR Given that TE insertion mutations accumu-late rapidly and in an increasing manner over time (Nuzhdinet al 1997) the increased density of TEs on the X versus theautosomes may not be surprising if there is a highertransposition rate to the X The only direct study to datefocused on roo elements in D melanogaster and did observehigher rates of insertion in the X than the autosomes(Vazquez et al 2007)

Transposable Elements in Genes

The majority of genes 813 do not have any TE insertions inany line studied and intron length is a good predictor of thenumber of TEs within a gene However a few genes accumu-late many insertions even after correcting for gene size andintron length and some genes have very few TEs even thoughthey have large introns The reasons why different genes mayaccumulate TE insertions may be quite different for individualgenes In some cases there may be hot spots for insertionswhich if insertions are neutral or only slightly deleterious mayresult in large number of insertions in a small region In othergenes the lack of TE accumulation may suggest selectivepressures against TE accumulation

Gene family size may also play a role in TE accumulationWe see an increase in TE density correlated with an increase ingene family size This may indicate differences in selectivepressures for genes in different sized families Single-copygenes may experience more constraint whereas in moder-ately sized gene families these constraints may be more re-laxed perhaps due to redundancy

In the cytochrome p450 functional family of genes we findthat 3789 of these genes have TE insertions 2589 in theDGRP and 2089 in the DSPR Cytochrome p450 is also foundto be in an enriched term in the DSPR for the set of genescontaining TEs The cytochrome p450 gene Cyp6g1 is knownto confer resistance to DDT (Daborn et al 2001) and TE-mediated copy number variation is associated with increasingresistance to DDT (Schmidt et al 2010) This pattern of TEinsertions in the DSPR may be reflective of the locationswhere DSPR lines were originally collected The DSPR lineswere gathered from various locations worldwide at a timewhen different pesticides were in use The DGRP lines weregathered from a single geographic area in North America in2003 Presumably these sets of lines have experienced differ-ent selective pressures with respect to pesticide resistance InCyp6a20 (fig 4D) where TEs are segregating at different fre-quencies in the two resources some similar effect may beoccurring

TE Detection Pipelines

The TE detection strategy used here is an improvement uponthe strategy used in Mackay et al (2012) The most notabledifferences between the data reported in Mackay et al (2012)and the DGRP data reported here is substantial amount ofnew information First our pipeline improvements agree wellwith the previous implementation with 812 of previous TEpresence calls identified by the new pipeline However thesecalls represent on 182 of the presence calls made by theimproved pipeline demonstrating that we have identifiedsignificantly more calls than before The incorporation of TEabsence calls also allows us to provide a much more completepicture of TEs in euchromatic sequence in the DGRP

The TE detection pipeline used here is similar to otherpipelines which have been employed but it is useful to de-scribe briefly here the differences in approaches and why wefollowed the approach presented here A useful discussion ofcommon approaches to TE detection in many species can befound in Xing et al (2013) One common method for TEdetection is the split-read mapping approach which hasbeen employed in Drosophila by Linheiro and Bergman(2012) This approach identifies TE insertions by locating in-dividual sequence reads which span TE insertion breakpointsThis approach is the strategy of choice for next-generationsequencing experiments which produce only single-endreads However in a paired-end read situation this approachmay not take full advantage of the available data Specificallythe size of window around the TE insertion breakpoint whereinformative reads can be found is governed by read length in asingle-end experiment and fragment length in a paired-endexperiment Linheiro and Bergman (2012) discuss the impor-tance of read length on TE detection and suggest that in-creased read length improves TE detection In situationswhere read length is long and coverage is deep enough thesplit-read method may be preferable to the paired-end de-tection strategy However for the data sets analyzed herewhere read length is much shorter than fragment lengththe paired-end detection strategy will capture moreinformation

The pipeline used here also has similarities to the pipelineused in Kofler et al (2012) However there are two majordifferences in these studies First Kofler et al (2012) do notattempt to reconstruct their insertion breakpoints andsecond the population studied by Kofler et al (2012) wassequenced as a pool rather than as individual lines Becausetheir pooled sequence data reflect on average less than 1sequence coverage per line Kofler et al (2012) are primarilyable to detect intermediate to high frequency TE insertionstheir own analysis concludes that they cannot reliably detectinsertions at frequency lt7 (see supplementary fig S2Supplementary Material online in their supplementary dis-cussion) However the literature to date (Charlesworth andLangley 1989 Charlesworth and Lapid 1989 Charlesworthet al 1992) as well as the results of this study indicate thatthe majority of TEs in a population are found at very lowfrequency often in only one individual in a population andtherefore beyond the scope of detection of Kofler et al (2012)

2321

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Caveats

The D melanogaster reference genome to which we alignedour data is a P-element-free genome Therefore the use of thisreference will be biased against detecting any P elementsexisting in the data sets In the case of the DSPR data setthis is not an issue since these lines are also all P-element-free(Macdonald and Long 2007) However it is likely that theDGRP lines do harbor P elements For the purpose of theseanalyses we decided to ignore P elements and focus on theset of element families common to the two resources

This situation does highlight the importance of having anappropriate reference sequence To detect P elements in thegenome of D melanogaster an artificially constructed refer-ence sequence or a different assembled sequence containingthe P element would need to be utilized This also brings upthe question about as yet undetected elements that exist atlow frequency If these elements do not have sufficient sim-ilarity to existing elements to be aligned to the referencegenome these elements will go undetected

Conclusions

We have presented a description of transposable elements intwo recently released QTL resources Transposable elementscan play important roles in gene function and regulation andincorporation of TE insertion information into any analysisperformed with either of these two resources will present amore complete picture of genomic variation and its contri-bution to complex traits In addition it will be important forassociation studies attempting to replicate genotypendashpheno-type associations utilizing these resources to be aware of thepotential contributions of TEs

We have also demonstrated a next-generation high-throughput sequencing analysis pipeline that is capable ofdetecting this type variation with a high rate of precisionand specificity We suggest that future studies using next-generation sequencing data utilize this technique

Materials and Methods

Data Sets

Two data sets were used in this analysis The first set consistsof 131 inbred lines from the Drosophila Genetic ReferencePanel (DGRP) (Mackay et al 2012) SRA accessionSRP000694 The second set consists of 15 lines used to es-tablish the Drosophila Synthetic Population Resource (DSPR)(King et al 2012a) initially described in Macdonald and Long(2007) SRA accession SRA051316 For the DGRP resourceboth 454 and Illumina data were available whereas onlyIllumina data are available for the DSPR We utilized onlythe Illumina sequence data to ensure consistency in datacollection across data sets We calculated a mean coverageof 21 for the DGRP lines and 50 for the DSPR linesVariation in coverage for the DGRP lines was substantial be-tween lines ranging from ~4 average coverage to nearly50 average coverage We therefore dropped from the anal-ysis four lines with an average coverage of 10 or less sincelow coverage can cause difficulties with assembly and TEidentification Sequence data for both data sets consisted of

paired-end data with 54 bp reads for the DSPR and paired-end 100 bp reads trimmed to 75 bp for DGRP

IBD in DGRP Data Set

We downloaded the SNP tables generated by the DGRPproject from httpwwwhgscbcmtmceduprojectsdgrpfreeze1_July_2010snp_callsIllumina (last accessed July 302010) to examine identity by descent in this data set Weperformed an all by all comparison between lines examiningsliding windows of 1 Mb across the genome with 100 kb stepsbetween windows We take a simple definition of IBD hereWhenever we identified two lines that weregt95 identity atSNP positions in a 1 Mb region we labeled the pair as IBD inthe identified region and masked that region of the genomein the line with lower coverage We also dropped entirelyfrom our analysis any lines where gt50 of the genome wasdetermined to be IBD with other lines in the population(supplementary table S1 Supplementary Material online) re-moving from the analysis the line with lower average se-quence coverage

We modeled 1 Mb regions of 200 chromosomes in a coa-lescent simulation using the parameters theta = 001site and4Nr = 10theta to determine the number of regions of95IBD under a Wright-Fisher population model using the MACSsoftware (Chen et al 2009) We simulated 1000 replicates ofthis simulation and tabulated the number of 1 Mb regionsshowing 95 IBD in each replicate

Alignment

We aligned reads to the D melanogaster reference genome(version 513 downloaded from FlyBase flybaseorg) using thealigner BWA (version 059 Li and Durbin 2009) We used thefollowing parameters (aln -t 8 -l 13 -m 50000000 -R 5000)followed by the command ldquosamperdquo to resolve paired endmappings (using parameters -a 5000 -N 5000 -n 500) It isimportant when detecting TEs that the -R during the initialalignment and the -N parameter during the ldquosamperdquo phase beset high otherwise highly repetitive sequences that would beinformative of the presence of a common element will beexcluded from the data by the aligner We also used the -Ioption for the DGRP lines but not the DSPR due to the dif-ferences in Illumina quality score output format as the twodata sets were sequenced at different times The -m parameterfrom the initial alignment step and the -n parameter from thepaired-end resolution step deal with how BWA treats multiplyaligning reads An additional caveat is that some TE familiesmost notably the P-element family is absent from the Dmelanogaster reference sequence Such families in the se-quenced genomes will not align to the reference genome weused even if they are present in the sampled lines

Transposable Element Detection

Transposable elements were detected by first identifying allreads that were aligned to an annotated TE in the referencegenome (supplementary fig S1 Supplementary Material on-line) We then identified the mates of these reads and selectedonly those mates which did not align to an annotated TE

2322

Cridland et al doi101093molbevmst129 MBE

These reads identify the set of read-pairs which span a TEinsertion The uniquely mapping non-TE reads were thenclustered based on their start position and chromosomalstrand to identify putative TEs Clusters on the plus strandwere matched to clusters on the minus strand to producepairs of clusters indicating either end of the putative TE Onceclusters were identified we filtered out all events which con-tained fewer than three read-pairs consistent with a giveninsertion for each cluster This means that putative eventsat this stage have a minimum of six read-pairs indicating theinsertion three on the plus strand and three on the minusstrand

We then extracted all reads and their mates that aligned tothe identified regions of the genome and attempted toreconstruct the local area including the breakpoint of theTE Because many TEs have repetitive regions at the ends ofthe element we reconstructed the two breakpoints of the TEseparately (see also supplementary fig 23 from Mackay et al2012) Local reconstruction was done by running Phrap (ver-sion 1090518 Ewing and Green 1998) with the followingparameters (-forcelevel 10 -minscore 10 -minmatch 10)This set of parameters causes Phrap to do a Smith-Waterman comparison when it finds matches of at least 10characters but then relaxes the minimum alignment scoreand the final assembly parameters This is necessary to recon-struct TEs because repetitive sequence within the TEs canresult in a failure to reconstruct a contig if stricter assemblyparameters are used All code we developed for the detectionof TEs will be freely available at wwwmolpopgenorgData(last accessed August 1 2013)

Following reconstruction we used BlastN (verson 2222) tocompare reconstructed contigs to the D melanogaster refer-ence to confirm the presence of a TE In most cases we wereable to fully reconstruct both breakpoints of the TE Fullyreconstructed indicates that we could reconstruct a singlecontig which contained the breakpoint of the TE In mostcases the two breakpoints of the TE did not resolve to thesame nucleotide but instead identify the location of thetarget site duplication for the insertion (Linheiro andBergman 2012 similar to fig 1) These are cases where theupstream estimate of the breakpoint is slightly higher thanthe downstream estimate of the breakpoint and the differ-ence in position can be seen as the likely span of the target siteduplication However sometimes we were not able to fullyreconstruct a contig but instead generated two contigs onecontaining uniquely aligning sequence and the other contain-ing TE sequence These contigs are known to belong togetherbecause reads forming one contig have their mates in theother contig In these cases we have an approximate but notprecise estimate of the TE breakpoint and are unable toidentify the target site duplication

After the first round of detection we performed a secondround of detection where we examined every location wherea novel TE was detected in any line in all other lines Thisincluded examining positions in one data set that were iden-tified in the other data set In addition we examined thelocations of all known TEs in the D melanogaster genomeFor each location queried in each line we extracted reads

aligning to that location and their mates and used Phrap toreconstruct a contig This allowed us to produce presenceabsence or missing data calls for all positions where TE wasdetected in all lines The absence of a TE is determined byidentifying reconstructed contigs that spanned the insertionposition of the TE by at least 15 bp on either side Theseabsence calls were not included in Mackay et al (2012)This also allows us to identify TEs that may be in regions ofthe genome with lower average sequence coverage comparedwith the rest of the genome provided that insertion is sharedwith another line where it was possible to detect the insertion

Transposable Element Annotation

Annotation of novel TEs was performed by aligning recon-structed contigs with BlastN to the set of TEs found in theD melanogaster reference We chose the best match basedon overall length and BLAST reported e-value and called thenovel TE a TE of that family We also required a minimum of80 bp of contig matching a TE to make an annotation callThis is an additional annotation analysis to the one we per-formed in Mackay et al (2012) For each of the two contigsproduced for each insertion we determined the best matchand then compared these to each other TEs are classified asper the classification scheme put forth in Wicker et al (2007)We annotated each TE insertion to the highest level of agree-ment between the two contigs if the two contigs disagreedwith each other at every level of classification we called thatelement ldquoundeterminedrdquo This could result from nested in-sertions of elements where the 50 end of the insertion is truly adifferent element from the 30 end or because the TE sequenceis degraded and differs substantially from the referencesequence or because there was not enough TE sequence tomeet our annotation guidelines Because we can only recon-struct a maximum of few hundred bases at either end of theinsertion it is difficult to distinguish between these issues Wealso did not attempt to determine the strand of the TE

Validation

A subset of 190 TE insertion sites shared with the D melano-gaster reference were validated via PCR in a subset of nineof the DGRP lines by Blumenstiel JP Chen X He M BergmanCM (unpublished data httparxivorgabs12093456 lastaccessed June 26 2013) To independently check IBD callsusing SNP data we also determined if pairs of lines wereIBD for TEs in the same regions In this comparison we onlyincluded TE insertions where both lines have a positive pres-ence or absence call This can also serve as a validation of ourTE calling pipeline since if two segments are IBD they shouldshare the same set of TEs

Simulation

We selected a set of elements of different sizes and of TEclasses from the set of TEs in the reference sequence andinserted them semi-randomly into the D melanogaster refer-ence genome in silico Insertions were semi-random since weavoided areas already occupied by reference TEs and the cen-tromeric and telomeric regions Read pairs were then

2323

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

generated to construct fastq files We then ran our TE detec-tion pipeline as above through the initial annotation stepThis process was repeated three times to generate a simula-tion of 15 25 and 50 coverage

Recombination Rates

We wrote an R routine to estimate local recombination ratesas a function of physical position Briefly we used ldquocyto-ge-netic-seqtsvrdquo from wwwflybaseorg (last accessed January 302009) to assign a set of physical landmarks to genetic posi-tions For each chromosome arm we then fit a local polyno-mial relating cumulative cM to cumulative basepair (usinglocpoly) Unlike many prior local polynomial fits we madethe added assumption that this polynomial must be a mono-tonically increasing function (using monoproc) an assump-tion that is true for real data We then fit a smooth spline tothe above curve (using smoothspline) The advantage ofsmoothspline is that the resulting curves can be differentiated(and hence estimates of rates obtained directly) Finally we fita smooth spline to the first derivatives obtained from theprior smooth spline largely to make the recombinationrates less noisy near centromeres and telomeres TheR code is freely available at wwwmolpopgenorgData (lastaccessed August 1 2013)

Data Set Restriction

Analyses performed on these data sets were restricted in twoways First we restricted our analyses to regions of thegenome with low TE density and moderate recombinationrates We restricted our data set in this way because regions ofhigh TE density make it more difficult to make accurate callsBecause our pipeline requires unique reads to positionallylocate the TE insertion regions with high TE density canlack enough unique sequence to generate these reads Theregions included in the analyses are X300000ndash208000002L200000ndash20100000 2R2300000ndash21000000 3L100000ndash21900000 3R600000ndash27800000 which are not centromericand telomeric regions This excludes the majority of TEs an-notated in the reference genome since these TEs are largelyclustered together in centromeric regions However this doesnot affect our analysis since we are interested in TEs in theeuchromatic portions of the genome

Second to restrict the set of reference TEs studied to full-length copies we removed from the analysis all reference TEsthat had a length that was less than 75 of the canonicalannotated length of the TE While these elements may stillproduce phenotypic effects these elements are unlikely to beactive copies This reduced the number of reference TEs in-cluded in the analysis to 607 from 1085 TEs that are present inthe reference genome in the regions included in our analysis

Additionally some analyses were performed on the subsetof 23 lines in the DGRP data set which had an average se-quencing coverage of 25 or higher hereafter referred to asDGRP25 This restriction was to mitigate the effect of differ-ences in coverage between the two data sets Simulations oftransposable element detection via our pipeline data not

shown indicate that TE detection at 25 is comparable todetection at 50

Site Count Spectra

We generated derived allele count spectra for TEs in the DSPRdata set and the DGRP25 set of lines We included only siteswhere we were able to make a positive presence or absencecall for each line in the data set so that the number of lineswas normalized for each resource for this analysis Observedvalues were compared with expected values at counts 1 2 34 5 6 and 7 under an infinite sites model via a 2 testCategories 7 and higher were grouped due to low counts insome of these categories We also calculated count spectra forSNPs present in small introns (introns86 bp [Haddrill et al2005]) a group unlikely to have functional effects We alsoused SoFoS a site frequency rescaler (Hufford et al 2012) torescale the DSPR data set to the sample size of the DGRP25from 15 to 23 samples to see if the rescaled data show adifferent distribution of element counts from the observedDGRP25 data We then did a KolmogorovndashSmirnov test tocompare the two distributions as a test of different selectiveeffects acting on the populations

Genomic Context of TEs

Transposable elements were identified in regions of low(lt2 cMMb) or high recombination (2 cMMb) FASTAfiles containing the annotated coordinates of 50 UTRs30 UTRs intronic intergenic and full gene span regions forversion 513 of the D melanogaster reference were down-loaded from flybase (wwwflybaseorg last accessed January30 2009) Version 513 of the reference annotation was usedfor comparative purposes with the DGRP analysis (Mackayet al 2012) Because alternative transcripts can result in thesame genomic position being classified in multiple categorieswe labeled each TE as every appropriate category This oc-curred with 52 of TEs in the DGRP and 60 of TEs in theDSPR When calculating the percentage of the referencegenome in our restricted areas of analysis that were classifiedas each genomic type we followed the same strategy

We compared our observations of TE insertions to a bino-mial distribution to determine whether TEs were distributedrandomly throughout the genome We used the pbinomfunction in R (R Development Core Team 2008) setting thenumber of trials to the total number of TE insertion locationsin the data set and the probability of success to the propor-tion of the genome defined as each context (intron exon30 UTR 50 UTR or intergenic) We did one tailed tests in eachcase For exon 30 UTR and 50 UTR tests we calculated theprobability of observing the observed value or fewer TEs Forintron and intergenic tests we calculated the probability ofobserving the observed value or more TEs

Variables Affecting TE Density

For the DGRP25 and DSPR data sets we calculatedTE density in terms of TEsMb For this analysis we alsoexcluded chromosome 4 We examined the effects of recom-bination rate chromosome and TE element family on TE

2324

Cridland et al doi101093molbevmst129 MBE

density using the standard linear model in R We usedthe lm() function to define our model as followsTEsMb ~ Line + Recombination Rate + Chromosome +Recombination Rate Chromosome We performed thisanalysis both for the set of all TEs and also for 15 TE familieswith moderate to high number of copies in our data set(supplementary table S4 Supplementary Material online)We selected individual TE families for this additional analysisbased on their level of abundance in the data sets and theirhaving been previously studied by Montgomery et al (1987)and Bartolome et al (2002)

Gene Families

We downloaded the Dfam database from (httpwwwindianaedu~hahnlabflyDfamDBdrosophila_frbhtml last acc-essed September 1 2012) This database describes all mem-bers of all gene families clustered using a fuzzy reciprocalBLAST method as described in Hahn et al (2007) for 12Drosophila species We used R to perform a linear regressionusing a simple model where the square root (TE density) in agene is the random variable and log10 (gene family size) is theindependent variable

Variation between Resources

For each TE present in both data sets we compared thefrequency of the TE in each population using Fisherrsquos exacttest We excluded from this analysis all insertions where wewere unable to make a presenceabsence call in at least 100 ofthe DGRP lines or 13 of the DSPR lines For each gene with TEinsertions in both data sets we compared the frequency atwhich the gene contains TEs between data sets using aFisherrsquos exact test

Supplementary MaterialSupplementary tables S1ndashS6 and figures S1 and S2 are avail-able at Molecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Casey Bergman and Miaomiao He for earlyaccess to their data as well as helpful comments from CBThey also thank Rebekah Rogers for help with editing themanuscript This work was supported by the NationalInstitute of Health grant NIH R01-GM085183 to KRT theNational Library of Medicine - National Institute of Healthtraining grant LM007443 to JMC and the National Instituteof Health grant NIH R01-OD010974 to ADL and SJM

ReferencesAltschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic local

alignment search tool J Mol Biol 215403ndash410Aminetzach YT Macpherson JM Petrov DA 2005 Pesticide resistance

via transposition-mediated adaptive gene truncation in DrosophilaScience 309764ndash767

Bartolome C Maside X Charlesworth B 2002 On the abundance anddistribution of transposable elements in the genome of Drosophilamelanogaster Mol Biol Evol 19926ndash937

Bennetzen JL 2000 Transposable element contributions to plant geneand genome evolution Plant Mol Biol 42251ndash269

Bergman CM Bensasson D 2007 Recent LTR retrotransposon insertioncontrasts with waves of non-LTR insertion since speciation inDrosophila melanogaster Proc Natl Acad Sci U S A 10411340ndash11345

Biemont C Vieira C 2006 Junk DNA as an evolutionary force Nature443521ndash524

Birchler JA Hiebert JC 1989 Interaction of the Enhancer of white-apricotwith transposable element alleles at the white locus in Drosophilamelanogaster Genetics 122129ndash138

Birchler JA Hiebert JC Rabinow L 1989 Interaction of the mottler ofwhite with transposable element alleles at the white locus inDrosophila melanogaster Genes Dev 373ndash84

Bowen NJ McDonald JF 2001 Drosophila euchromatic LTR retrotran-sposons are much younger than the host species in which theyreside Genome Res 111527ndash1540

Carr M Soloway JR Robinson TE Brookfield JF 2002 Mechanisms reg-ulating copy numbers of six LTR retrotransposons in the genome ofDrosophila melanogaster Chromosoma 110511ndash518

Charlesworth B 1994 The effect of background selection against dele-terious mutations on weakly selected linked variants Genet Res 63213ndash227

Charlesworth B Jarne P Assimacopoulos S 1994 The distribution oftransposable elements within and between chromosomes in a pop-ulation of Drosophila melanogaster III Element abundances in het-erochromatin Genet Res 64183ndash197

Charlesworth B Langley CH 1989 The population genetics ofDrosophila transposable elements Annu Rev Genet 23251ndash287

Charlesworth B Lapid A 1989 A study of ten families of transposableelements on X chromosomes from a population of Drosophila mel-anogaster Genet Res 54113ndash125

Charlesworth B Lapid A Canada D 1992 The distribution of transpos-able elements within and between chromosomes in a population ofDrosophila melanogaster I Element frequencies and distributionGenet Res 60103ndash114

Chen GK Marjoram P Wall JD 2009 Fast and flexible simulation ofDNA sequence data Genome Res 19136ndash142

Chung H Bogwitz MR McCart C Andrianopoulos A Ffrench-ConstantRH Batterham P Daborn PJ 2007 Cis-regulatory elements in theaccord retrotransposon result in tissue-specific expression of theDrosophila melanogaster insecticide resistance gene Cyp6g1Genetics 1751071ndash1077

Cutter AD Good JM Pappas CT Saunders MA Starrett DM Wheeler TJ2005 Transposable element orientation bias in the Drosophila mel-anogaster genome J Mol Evol 6733ndash741

Daborn P Boundy S Yen J Pittendrigh B Ffrench-Constant R 2001DDT resistance in Drosophila correlates with Cyp6g1 over-expressionand confers cross-resistance to the neonicotinoid imidacloprid MolGenet Genomics 266556ndash563

Deloger M Cavalli FMG Lerat E Biemont C Sagot M-F Vieira C 2009Identification of expressed transposable element insertionsin the sequenced genome of Drosophila melanogaster Gene 43955ndash62

Dworkin I Palsson A Gibson G 2005 Replication of an Egfr-wing shapeassociation in a wild-caught cohort of Drosophila melanogasterGenetics 1692115ndash2125

Ewing B Green P 1998 Base-calling of automated sequencer tracesusing Phred II Error probabilities Genome Res 8186ndash194

Gruber JD Genissel A Macdonald SJ Long AD 2007 How repeatable areassociations between polymorphisms in achaete-scute and bristlenumber variation in Drosophila Genetics 1751987ndash1997

Haddrill PR Charlesworth B Halligan DL Andolfatto P 2005 Patterns ofintron sequence evolution in Drosophila are dependent upon lengthand GC content Genome Biol 6R67

Hahn MW Han MV Han S-G 2007 Gene family evolution across 12Drosophila genomes PLoS Genet 3e197

Hirschhorn JN Daly MJ 2005 Genome-wide association stud-ies for common diseases and complex traits Nat Rev Genet 695ndash108

2325

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 10: Abundance and Distribution of Transposable Elements in Two ...

non-random distribution of TE insertions across the genomewith increases of TEs in both the intergenic and intronicregions of the genome and decreases in exonic and UTRregions (Kaminker et al 2002 Lipatov et al 2005) Similar toprevious studies we found variation in TE density betweenchromosomes and higher densities of TEs in regions of lowrecombination (Charlesworth et al 1994 Bartolome et al2002) We also found increased TE densities on the X relativeto the autosomes both in overall insertions and in someindividual TE families (table 5)

Detection method and the size of the data set are likely tobe contributing factors to differences between previous workand this study Many previous studies have used in situ meth-ods to ascertain TE number in samples and have looked atelement accumulation at the base versus the midsections ofchromosomes (Montgomery et al 1987 Charlesworth et al1992) or have looked only at one genome (Bartolome et al2002 Rizzon et al 2002) A previous PCR-based detectionstudy found no difference in TE frequency between the Xand the autosomes in natural populations (Petrov et al2011) but this study was biased in that it only looked atTEs shared with the reference sequence and thus did notcapture the full picture of TEs in these populations

While our analysis does disagree with previous work thesedisagreements may be mainly due to the way in which regionsof the genome are divided for examination Rizzon et al(2002) found an increase in copies of LTR retrotransposonson the X when they excluded pericentromeric telomeric andchromosome 4 from their analysis though they found a def-icit of TEs on the X when these regions are included Thisfinding is most directly comparable to our own finding for theentire TE data set since they exclude a very similar set ofgenomic regions We do agree with previous work in ourfinding of higher densities of TEs in regions of low recombi-nation than in high recombination (Charlesworth et al 1994Bartolome et al 2002)

Examining individual element families Carr et al (2002)found an increase of mdg-3 elements and 297 elements on theX than would be expected under a random insertion modelMontgomery et al (1987) found no evidence for a reductionin TEs on the X for both roo and 297 elements but foundfewer TEs on the X when looking at the 412 family HoweverCharlesworth et al (1992) observed the opposite pattern forroo and 297 Montgomery et al (1991) also found that moreroo elements in lab kept lines than they had previously ob-served in natural populations We found significantly in-creased densities on the X for roo and significantlyincreased densities on the autosomes for 412

Differences between previous studies of individual elementfamilies and our results may be due to a variety of reasonsFirst we were able to annotate only about two-thirds of ourTE insertions This was primarily due to having insufficientsequence to reliably annotate the element or because inde-pendent annotation using sequence reconstructed fromeither end of the insertion did not yield the same elementfamily annotation This means that our density calculationsfor individual elements may be skewed by missing annotationinformation Second many previous studies (Montgomery

et al 1987 Charlesworth et al 1992 Carr et al 2002) examinedonly LTR elements These elements are young (Bowen andMcDonald 2001 Bergman and Bensasson 2007) and thus in-dividual copies may not yet have been removed from thegenome thereby displaying different patterns of insertionthan older active elements

What is clear is that there is great variability in TE densityboth within and between element families as well as withinand between chromosomes It may also be that the overallpattern is driven by a small group of families In our data setINE-1 shows the largest pattern of difference between the Xand autosomes with an order of magnitude increase in meanTE density per line on the X over the autosomes possibly dueto an increased likelihood of fixation of insertions (Omettoet al 2005) specifically on the X chromosome (Presgraves2006) These INE-1 elements are a subset of the largest TEfamily in D melanogaster though only several dozen of themare in the euchromatin INE-1 elements are also considered tobe inactive in D melanogaster and likely have not been mobilefor gt3 million years (Kapitonov and Jurka 2003) Consistentwith this most of the INE-1 elements we detected were fixedor nearly fixed in our data set We did see some examples oflow frequency insertions which may be due to more recenttranspositional events or incorrect TE annotation due toshort sequences The old age of INE-1 insertions also suggeststhat the extant elements are unlikely to be selected againstAn increased density of TEs on the X compared with theautosomes suggests that it is not the fitness effects of inser-tions which govern TE copy number Fitness effects wouldresult in the more rapid removal of insertions on the X thaninsertions on the autosomes due to the increased selectivepressures experienced on the X (Montgomery et al 1987)Background selection also predicts that the X should be lessburdened by deleterious recessives since it is hemizygous onethird of the time (Charlesworth 1994)

However it is unclear whether our observation of in-creased densities of TEs on the X relative to the autosomesfits with the ectopic exchange model This model predictsthat TEs will accumulate in regions of low ectopic exchange(Langley et al 1988) The general assumption in the literatureis that ectopic exchange is positively correlated with recom-bination rate (Langley et al 1988 Montgomery et al 1991)The X chromosome experiences higher levels of recombina-tion than the autosomes with an average of 36 cMMb forthe X in regions of high recombination (cMMbgt 2) whereasthe autosomes have a mean of 336 cMMb If recombinationrate and ectopic exchange are positively correlated through-out the genome then the expectation would be fewer TEs inthe middle regions of the X where recombination rates areknown to be high This is the opposite of what we observe Ifrecombination rate and the rate of ectopic exchange are notpositively correlated then the ectopic exchange model maybe correct but data are lacking to properly address this issueCurrently there are no direct measurements of the rate ofectopic exchange on a genome-wide scale in Drosophiladuring meiosis Montgomery et al (1991) observed littleectopic exchange between homozygous chromosomes inDrosophila melanogaster on the X chromosome in the

2320

Cridland et al doi101093molbevmst129 MBE

region around the white locus They also found that whilerecombination rate was reduced in both centromeric andtelomeric regions only the centromeric region displayed anincreased density of TEs This information is suggestive ofdifferences between these regions in the rate of ectopicexchange

The DGRP and DSPR are also composed of inbred lineswhich have been kept in laboratory conditions for some time~9 years in the case of the DGRP lines and 40+ years in thecase of the DSPR Given that TE insertion mutations accumu-late rapidly and in an increasing manner over time (Nuzhdinet al 1997) the increased density of TEs on the X versus theautosomes may not be surprising if there is a highertransposition rate to the X The only direct study to datefocused on roo elements in D melanogaster and did observehigher rates of insertion in the X than the autosomes(Vazquez et al 2007)

Transposable Elements in Genes

The majority of genes 813 do not have any TE insertions inany line studied and intron length is a good predictor of thenumber of TEs within a gene However a few genes accumu-late many insertions even after correcting for gene size andintron length and some genes have very few TEs even thoughthey have large introns The reasons why different genes mayaccumulate TE insertions may be quite different for individualgenes In some cases there may be hot spots for insertionswhich if insertions are neutral or only slightly deleterious mayresult in large number of insertions in a small region In othergenes the lack of TE accumulation may suggest selectivepressures against TE accumulation

Gene family size may also play a role in TE accumulationWe see an increase in TE density correlated with an increase ingene family size This may indicate differences in selectivepressures for genes in different sized families Single-copygenes may experience more constraint whereas in moder-ately sized gene families these constraints may be more re-laxed perhaps due to redundancy

In the cytochrome p450 functional family of genes we findthat 3789 of these genes have TE insertions 2589 in theDGRP and 2089 in the DSPR Cytochrome p450 is also foundto be in an enriched term in the DSPR for the set of genescontaining TEs The cytochrome p450 gene Cyp6g1 is knownto confer resistance to DDT (Daborn et al 2001) and TE-mediated copy number variation is associated with increasingresistance to DDT (Schmidt et al 2010) This pattern of TEinsertions in the DSPR may be reflective of the locationswhere DSPR lines were originally collected The DSPR lineswere gathered from various locations worldwide at a timewhen different pesticides were in use The DGRP lines weregathered from a single geographic area in North America in2003 Presumably these sets of lines have experienced differ-ent selective pressures with respect to pesticide resistance InCyp6a20 (fig 4D) where TEs are segregating at different fre-quencies in the two resources some similar effect may beoccurring

TE Detection Pipelines

The TE detection strategy used here is an improvement uponthe strategy used in Mackay et al (2012) The most notabledifferences between the data reported in Mackay et al (2012)and the DGRP data reported here is substantial amount ofnew information First our pipeline improvements agree wellwith the previous implementation with 812 of previous TEpresence calls identified by the new pipeline However thesecalls represent on 182 of the presence calls made by theimproved pipeline demonstrating that we have identifiedsignificantly more calls than before The incorporation of TEabsence calls also allows us to provide a much more completepicture of TEs in euchromatic sequence in the DGRP

The TE detection pipeline used here is similar to otherpipelines which have been employed but it is useful to de-scribe briefly here the differences in approaches and why wefollowed the approach presented here A useful discussion ofcommon approaches to TE detection in many species can befound in Xing et al (2013) One common method for TEdetection is the split-read mapping approach which hasbeen employed in Drosophila by Linheiro and Bergman(2012) This approach identifies TE insertions by locating in-dividual sequence reads which span TE insertion breakpointsThis approach is the strategy of choice for next-generationsequencing experiments which produce only single-endreads However in a paired-end read situation this approachmay not take full advantage of the available data Specificallythe size of window around the TE insertion breakpoint whereinformative reads can be found is governed by read length in asingle-end experiment and fragment length in a paired-endexperiment Linheiro and Bergman (2012) discuss the impor-tance of read length on TE detection and suggest that in-creased read length improves TE detection In situationswhere read length is long and coverage is deep enough thesplit-read method may be preferable to the paired-end de-tection strategy However for the data sets analyzed herewhere read length is much shorter than fragment lengththe paired-end detection strategy will capture moreinformation

The pipeline used here also has similarities to the pipelineused in Kofler et al (2012) However there are two majordifferences in these studies First Kofler et al (2012) do notattempt to reconstruct their insertion breakpoints andsecond the population studied by Kofler et al (2012) wassequenced as a pool rather than as individual lines Becausetheir pooled sequence data reflect on average less than 1sequence coverage per line Kofler et al (2012) are primarilyable to detect intermediate to high frequency TE insertionstheir own analysis concludes that they cannot reliably detectinsertions at frequency lt7 (see supplementary fig S2Supplementary Material online in their supplementary dis-cussion) However the literature to date (Charlesworth andLangley 1989 Charlesworth and Lapid 1989 Charlesworthet al 1992) as well as the results of this study indicate thatthe majority of TEs in a population are found at very lowfrequency often in only one individual in a population andtherefore beyond the scope of detection of Kofler et al (2012)

2321

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Caveats

The D melanogaster reference genome to which we alignedour data is a P-element-free genome Therefore the use of thisreference will be biased against detecting any P elementsexisting in the data sets In the case of the DSPR data setthis is not an issue since these lines are also all P-element-free(Macdonald and Long 2007) However it is likely that theDGRP lines do harbor P elements For the purpose of theseanalyses we decided to ignore P elements and focus on theset of element families common to the two resources

This situation does highlight the importance of having anappropriate reference sequence To detect P elements in thegenome of D melanogaster an artificially constructed refer-ence sequence or a different assembled sequence containingthe P element would need to be utilized This also brings upthe question about as yet undetected elements that exist atlow frequency If these elements do not have sufficient sim-ilarity to existing elements to be aligned to the referencegenome these elements will go undetected

Conclusions

We have presented a description of transposable elements intwo recently released QTL resources Transposable elementscan play important roles in gene function and regulation andincorporation of TE insertion information into any analysisperformed with either of these two resources will present amore complete picture of genomic variation and its contri-bution to complex traits In addition it will be important forassociation studies attempting to replicate genotypendashpheno-type associations utilizing these resources to be aware of thepotential contributions of TEs

We have also demonstrated a next-generation high-throughput sequencing analysis pipeline that is capable ofdetecting this type variation with a high rate of precisionand specificity We suggest that future studies using next-generation sequencing data utilize this technique

Materials and Methods

Data Sets

Two data sets were used in this analysis The first set consistsof 131 inbred lines from the Drosophila Genetic ReferencePanel (DGRP) (Mackay et al 2012) SRA accessionSRP000694 The second set consists of 15 lines used to es-tablish the Drosophila Synthetic Population Resource (DSPR)(King et al 2012a) initially described in Macdonald and Long(2007) SRA accession SRA051316 For the DGRP resourceboth 454 and Illumina data were available whereas onlyIllumina data are available for the DSPR We utilized onlythe Illumina sequence data to ensure consistency in datacollection across data sets We calculated a mean coverageof 21 for the DGRP lines and 50 for the DSPR linesVariation in coverage for the DGRP lines was substantial be-tween lines ranging from ~4 average coverage to nearly50 average coverage We therefore dropped from the anal-ysis four lines with an average coverage of 10 or less sincelow coverage can cause difficulties with assembly and TEidentification Sequence data for both data sets consisted of

paired-end data with 54 bp reads for the DSPR and paired-end 100 bp reads trimmed to 75 bp for DGRP

IBD in DGRP Data Set

We downloaded the SNP tables generated by the DGRPproject from httpwwwhgscbcmtmceduprojectsdgrpfreeze1_July_2010snp_callsIllumina (last accessed July 302010) to examine identity by descent in this data set Weperformed an all by all comparison between lines examiningsliding windows of 1 Mb across the genome with 100 kb stepsbetween windows We take a simple definition of IBD hereWhenever we identified two lines that weregt95 identity atSNP positions in a 1 Mb region we labeled the pair as IBD inthe identified region and masked that region of the genomein the line with lower coverage We also dropped entirelyfrom our analysis any lines where gt50 of the genome wasdetermined to be IBD with other lines in the population(supplementary table S1 Supplementary Material online) re-moving from the analysis the line with lower average se-quence coverage

We modeled 1 Mb regions of 200 chromosomes in a coa-lescent simulation using the parameters theta = 001site and4Nr = 10theta to determine the number of regions of95IBD under a Wright-Fisher population model using the MACSsoftware (Chen et al 2009) We simulated 1000 replicates ofthis simulation and tabulated the number of 1 Mb regionsshowing 95 IBD in each replicate

Alignment

We aligned reads to the D melanogaster reference genome(version 513 downloaded from FlyBase flybaseorg) using thealigner BWA (version 059 Li and Durbin 2009) We used thefollowing parameters (aln -t 8 -l 13 -m 50000000 -R 5000)followed by the command ldquosamperdquo to resolve paired endmappings (using parameters -a 5000 -N 5000 -n 500) It isimportant when detecting TEs that the -R during the initialalignment and the -N parameter during the ldquosamperdquo phase beset high otherwise highly repetitive sequences that would beinformative of the presence of a common element will beexcluded from the data by the aligner We also used the -Ioption for the DGRP lines but not the DSPR due to the dif-ferences in Illumina quality score output format as the twodata sets were sequenced at different times The -m parameterfrom the initial alignment step and the -n parameter from thepaired-end resolution step deal with how BWA treats multiplyaligning reads An additional caveat is that some TE familiesmost notably the P-element family is absent from the Dmelanogaster reference sequence Such families in the se-quenced genomes will not align to the reference genome weused even if they are present in the sampled lines

Transposable Element Detection

Transposable elements were detected by first identifying allreads that were aligned to an annotated TE in the referencegenome (supplementary fig S1 Supplementary Material on-line) We then identified the mates of these reads and selectedonly those mates which did not align to an annotated TE

2322

Cridland et al doi101093molbevmst129 MBE

These reads identify the set of read-pairs which span a TEinsertion The uniquely mapping non-TE reads were thenclustered based on their start position and chromosomalstrand to identify putative TEs Clusters on the plus strandwere matched to clusters on the minus strand to producepairs of clusters indicating either end of the putative TE Onceclusters were identified we filtered out all events which con-tained fewer than three read-pairs consistent with a giveninsertion for each cluster This means that putative eventsat this stage have a minimum of six read-pairs indicating theinsertion three on the plus strand and three on the minusstrand

We then extracted all reads and their mates that aligned tothe identified regions of the genome and attempted toreconstruct the local area including the breakpoint of theTE Because many TEs have repetitive regions at the ends ofthe element we reconstructed the two breakpoints of the TEseparately (see also supplementary fig 23 from Mackay et al2012) Local reconstruction was done by running Phrap (ver-sion 1090518 Ewing and Green 1998) with the followingparameters (-forcelevel 10 -minscore 10 -minmatch 10)This set of parameters causes Phrap to do a Smith-Waterman comparison when it finds matches of at least 10characters but then relaxes the minimum alignment scoreand the final assembly parameters This is necessary to recon-struct TEs because repetitive sequence within the TEs canresult in a failure to reconstruct a contig if stricter assemblyparameters are used All code we developed for the detectionof TEs will be freely available at wwwmolpopgenorgData(last accessed August 1 2013)

Following reconstruction we used BlastN (verson 2222) tocompare reconstructed contigs to the D melanogaster refer-ence to confirm the presence of a TE In most cases we wereable to fully reconstruct both breakpoints of the TE Fullyreconstructed indicates that we could reconstruct a singlecontig which contained the breakpoint of the TE In mostcases the two breakpoints of the TE did not resolve to thesame nucleotide but instead identify the location of thetarget site duplication for the insertion (Linheiro andBergman 2012 similar to fig 1) These are cases where theupstream estimate of the breakpoint is slightly higher thanthe downstream estimate of the breakpoint and the differ-ence in position can be seen as the likely span of the target siteduplication However sometimes we were not able to fullyreconstruct a contig but instead generated two contigs onecontaining uniquely aligning sequence and the other contain-ing TE sequence These contigs are known to belong togetherbecause reads forming one contig have their mates in theother contig In these cases we have an approximate but notprecise estimate of the TE breakpoint and are unable toidentify the target site duplication

After the first round of detection we performed a secondround of detection where we examined every location wherea novel TE was detected in any line in all other lines Thisincluded examining positions in one data set that were iden-tified in the other data set In addition we examined thelocations of all known TEs in the D melanogaster genomeFor each location queried in each line we extracted reads

aligning to that location and their mates and used Phrap toreconstruct a contig This allowed us to produce presenceabsence or missing data calls for all positions where TE wasdetected in all lines The absence of a TE is determined byidentifying reconstructed contigs that spanned the insertionposition of the TE by at least 15 bp on either side Theseabsence calls were not included in Mackay et al (2012)This also allows us to identify TEs that may be in regions ofthe genome with lower average sequence coverage comparedwith the rest of the genome provided that insertion is sharedwith another line where it was possible to detect the insertion

Transposable Element Annotation

Annotation of novel TEs was performed by aligning recon-structed contigs with BlastN to the set of TEs found in theD melanogaster reference We chose the best match basedon overall length and BLAST reported e-value and called thenovel TE a TE of that family We also required a minimum of80 bp of contig matching a TE to make an annotation callThis is an additional annotation analysis to the one we per-formed in Mackay et al (2012) For each of the two contigsproduced for each insertion we determined the best matchand then compared these to each other TEs are classified asper the classification scheme put forth in Wicker et al (2007)We annotated each TE insertion to the highest level of agree-ment between the two contigs if the two contigs disagreedwith each other at every level of classification we called thatelement ldquoundeterminedrdquo This could result from nested in-sertions of elements where the 50 end of the insertion is truly adifferent element from the 30 end or because the TE sequenceis degraded and differs substantially from the referencesequence or because there was not enough TE sequence tomeet our annotation guidelines Because we can only recon-struct a maximum of few hundred bases at either end of theinsertion it is difficult to distinguish between these issues Wealso did not attempt to determine the strand of the TE

Validation

A subset of 190 TE insertion sites shared with the D melano-gaster reference were validated via PCR in a subset of nineof the DGRP lines by Blumenstiel JP Chen X He M BergmanCM (unpublished data httparxivorgabs12093456 lastaccessed June 26 2013) To independently check IBD callsusing SNP data we also determined if pairs of lines wereIBD for TEs in the same regions In this comparison we onlyincluded TE insertions where both lines have a positive pres-ence or absence call This can also serve as a validation of ourTE calling pipeline since if two segments are IBD they shouldshare the same set of TEs

Simulation

We selected a set of elements of different sizes and of TEclasses from the set of TEs in the reference sequence andinserted them semi-randomly into the D melanogaster refer-ence genome in silico Insertions were semi-random since weavoided areas already occupied by reference TEs and the cen-tromeric and telomeric regions Read pairs were then

2323

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

generated to construct fastq files We then ran our TE detec-tion pipeline as above through the initial annotation stepThis process was repeated three times to generate a simula-tion of 15 25 and 50 coverage

Recombination Rates

We wrote an R routine to estimate local recombination ratesas a function of physical position Briefly we used ldquocyto-ge-netic-seqtsvrdquo from wwwflybaseorg (last accessed January 302009) to assign a set of physical landmarks to genetic posi-tions For each chromosome arm we then fit a local polyno-mial relating cumulative cM to cumulative basepair (usinglocpoly) Unlike many prior local polynomial fits we madethe added assumption that this polynomial must be a mono-tonically increasing function (using monoproc) an assump-tion that is true for real data We then fit a smooth spline tothe above curve (using smoothspline) The advantage ofsmoothspline is that the resulting curves can be differentiated(and hence estimates of rates obtained directly) Finally we fita smooth spline to the first derivatives obtained from theprior smooth spline largely to make the recombinationrates less noisy near centromeres and telomeres TheR code is freely available at wwwmolpopgenorgData (lastaccessed August 1 2013)

Data Set Restriction

Analyses performed on these data sets were restricted in twoways First we restricted our analyses to regions of thegenome with low TE density and moderate recombinationrates We restricted our data set in this way because regions ofhigh TE density make it more difficult to make accurate callsBecause our pipeline requires unique reads to positionallylocate the TE insertion regions with high TE density canlack enough unique sequence to generate these reads Theregions included in the analyses are X300000ndash208000002L200000ndash20100000 2R2300000ndash21000000 3L100000ndash21900000 3R600000ndash27800000 which are not centromericand telomeric regions This excludes the majority of TEs an-notated in the reference genome since these TEs are largelyclustered together in centromeric regions However this doesnot affect our analysis since we are interested in TEs in theeuchromatic portions of the genome

Second to restrict the set of reference TEs studied to full-length copies we removed from the analysis all reference TEsthat had a length that was less than 75 of the canonicalannotated length of the TE While these elements may stillproduce phenotypic effects these elements are unlikely to beactive copies This reduced the number of reference TEs in-cluded in the analysis to 607 from 1085 TEs that are present inthe reference genome in the regions included in our analysis

Additionally some analyses were performed on the subsetof 23 lines in the DGRP data set which had an average se-quencing coverage of 25 or higher hereafter referred to asDGRP25 This restriction was to mitigate the effect of differ-ences in coverage between the two data sets Simulations oftransposable element detection via our pipeline data not

shown indicate that TE detection at 25 is comparable todetection at 50

Site Count Spectra

We generated derived allele count spectra for TEs in the DSPRdata set and the DGRP25 set of lines We included only siteswhere we were able to make a positive presence or absencecall for each line in the data set so that the number of lineswas normalized for each resource for this analysis Observedvalues were compared with expected values at counts 1 2 34 5 6 and 7 under an infinite sites model via a 2 testCategories 7 and higher were grouped due to low counts insome of these categories We also calculated count spectra forSNPs present in small introns (introns86 bp [Haddrill et al2005]) a group unlikely to have functional effects We alsoused SoFoS a site frequency rescaler (Hufford et al 2012) torescale the DSPR data set to the sample size of the DGRP25from 15 to 23 samples to see if the rescaled data show adifferent distribution of element counts from the observedDGRP25 data We then did a KolmogorovndashSmirnov test tocompare the two distributions as a test of different selectiveeffects acting on the populations

Genomic Context of TEs

Transposable elements were identified in regions of low(lt2 cMMb) or high recombination (2 cMMb) FASTAfiles containing the annotated coordinates of 50 UTRs30 UTRs intronic intergenic and full gene span regions forversion 513 of the D melanogaster reference were down-loaded from flybase (wwwflybaseorg last accessed January30 2009) Version 513 of the reference annotation was usedfor comparative purposes with the DGRP analysis (Mackayet al 2012) Because alternative transcripts can result in thesame genomic position being classified in multiple categorieswe labeled each TE as every appropriate category This oc-curred with 52 of TEs in the DGRP and 60 of TEs in theDSPR When calculating the percentage of the referencegenome in our restricted areas of analysis that were classifiedas each genomic type we followed the same strategy

We compared our observations of TE insertions to a bino-mial distribution to determine whether TEs were distributedrandomly throughout the genome We used the pbinomfunction in R (R Development Core Team 2008) setting thenumber of trials to the total number of TE insertion locationsin the data set and the probability of success to the propor-tion of the genome defined as each context (intron exon30 UTR 50 UTR or intergenic) We did one tailed tests in eachcase For exon 30 UTR and 50 UTR tests we calculated theprobability of observing the observed value or fewer TEs Forintron and intergenic tests we calculated the probability ofobserving the observed value or more TEs

Variables Affecting TE Density

For the DGRP25 and DSPR data sets we calculatedTE density in terms of TEsMb For this analysis we alsoexcluded chromosome 4 We examined the effects of recom-bination rate chromosome and TE element family on TE

2324

Cridland et al doi101093molbevmst129 MBE

density using the standard linear model in R We usedthe lm() function to define our model as followsTEsMb ~ Line + Recombination Rate + Chromosome +Recombination Rate Chromosome We performed thisanalysis both for the set of all TEs and also for 15 TE familieswith moderate to high number of copies in our data set(supplementary table S4 Supplementary Material online)We selected individual TE families for this additional analysisbased on their level of abundance in the data sets and theirhaving been previously studied by Montgomery et al (1987)and Bartolome et al (2002)

Gene Families

We downloaded the Dfam database from (httpwwwindianaedu~hahnlabflyDfamDBdrosophila_frbhtml last acc-essed September 1 2012) This database describes all mem-bers of all gene families clustered using a fuzzy reciprocalBLAST method as described in Hahn et al (2007) for 12Drosophila species We used R to perform a linear regressionusing a simple model where the square root (TE density) in agene is the random variable and log10 (gene family size) is theindependent variable

Variation between Resources

For each TE present in both data sets we compared thefrequency of the TE in each population using Fisherrsquos exacttest We excluded from this analysis all insertions where wewere unable to make a presenceabsence call in at least 100 ofthe DGRP lines or 13 of the DSPR lines For each gene with TEinsertions in both data sets we compared the frequency atwhich the gene contains TEs between data sets using aFisherrsquos exact test

Supplementary MaterialSupplementary tables S1ndashS6 and figures S1 and S2 are avail-able at Molecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Casey Bergman and Miaomiao He for earlyaccess to their data as well as helpful comments from CBThey also thank Rebekah Rogers for help with editing themanuscript This work was supported by the NationalInstitute of Health grant NIH R01-GM085183 to KRT theNational Library of Medicine - National Institute of Healthtraining grant LM007443 to JMC and the National Instituteof Health grant NIH R01-OD010974 to ADL and SJM

ReferencesAltschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic local

alignment search tool J Mol Biol 215403ndash410Aminetzach YT Macpherson JM Petrov DA 2005 Pesticide resistance

via transposition-mediated adaptive gene truncation in DrosophilaScience 309764ndash767

Bartolome C Maside X Charlesworth B 2002 On the abundance anddistribution of transposable elements in the genome of Drosophilamelanogaster Mol Biol Evol 19926ndash937

Bennetzen JL 2000 Transposable element contributions to plant geneand genome evolution Plant Mol Biol 42251ndash269

Bergman CM Bensasson D 2007 Recent LTR retrotransposon insertioncontrasts with waves of non-LTR insertion since speciation inDrosophila melanogaster Proc Natl Acad Sci U S A 10411340ndash11345

Biemont C Vieira C 2006 Junk DNA as an evolutionary force Nature443521ndash524

Birchler JA Hiebert JC 1989 Interaction of the Enhancer of white-apricotwith transposable element alleles at the white locus in Drosophilamelanogaster Genetics 122129ndash138

Birchler JA Hiebert JC Rabinow L 1989 Interaction of the mottler ofwhite with transposable element alleles at the white locus inDrosophila melanogaster Genes Dev 373ndash84

Bowen NJ McDonald JF 2001 Drosophila euchromatic LTR retrotran-sposons are much younger than the host species in which theyreside Genome Res 111527ndash1540

Carr M Soloway JR Robinson TE Brookfield JF 2002 Mechanisms reg-ulating copy numbers of six LTR retrotransposons in the genome ofDrosophila melanogaster Chromosoma 110511ndash518

Charlesworth B 1994 The effect of background selection against dele-terious mutations on weakly selected linked variants Genet Res 63213ndash227

Charlesworth B Jarne P Assimacopoulos S 1994 The distribution oftransposable elements within and between chromosomes in a pop-ulation of Drosophila melanogaster III Element abundances in het-erochromatin Genet Res 64183ndash197

Charlesworth B Langley CH 1989 The population genetics ofDrosophila transposable elements Annu Rev Genet 23251ndash287

Charlesworth B Lapid A 1989 A study of ten families of transposableelements on X chromosomes from a population of Drosophila mel-anogaster Genet Res 54113ndash125

Charlesworth B Lapid A Canada D 1992 The distribution of transpos-able elements within and between chromosomes in a population ofDrosophila melanogaster I Element frequencies and distributionGenet Res 60103ndash114

Chen GK Marjoram P Wall JD 2009 Fast and flexible simulation ofDNA sequence data Genome Res 19136ndash142

Chung H Bogwitz MR McCart C Andrianopoulos A Ffrench-ConstantRH Batterham P Daborn PJ 2007 Cis-regulatory elements in theaccord retrotransposon result in tissue-specific expression of theDrosophila melanogaster insecticide resistance gene Cyp6g1Genetics 1751071ndash1077

Cutter AD Good JM Pappas CT Saunders MA Starrett DM Wheeler TJ2005 Transposable element orientation bias in the Drosophila mel-anogaster genome J Mol Evol 6733ndash741

Daborn P Boundy S Yen J Pittendrigh B Ffrench-Constant R 2001DDT resistance in Drosophila correlates with Cyp6g1 over-expressionand confers cross-resistance to the neonicotinoid imidacloprid MolGenet Genomics 266556ndash563

Deloger M Cavalli FMG Lerat E Biemont C Sagot M-F Vieira C 2009Identification of expressed transposable element insertionsin the sequenced genome of Drosophila melanogaster Gene 43955ndash62

Dworkin I Palsson A Gibson G 2005 Replication of an Egfr-wing shapeassociation in a wild-caught cohort of Drosophila melanogasterGenetics 1692115ndash2125

Ewing B Green P 1998 Base-calling of automated sequencer tracesusing Phred II Error probabilities Genome Res 8186ndash194

Gruber JD Genissel A Macdonald SJ Long AD 2007 How repeatable areassociations between polymorphisms in achaete-scute and bristlenumber variation in Drosophila Genetics 1751987ndash1997

Haddrill PR Charlesworth B Halligan DL Andolfatto P 2005 Patterns ofintron sequence evolution in Drosophila are dependent upon lengthand GC content Genome Biol 6R67

Hahn MW Han MV Han S-G 2007 Gene family evolution across 12Drosophila genomes PLoS Genet 3e197

Hirschhorn JN Daly MJ 2005 Genome-wide association stud-ies for common diseases and complex traits Nat Rev Genet 695ndash108

2325

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 11: Abundance and Distribution of Transposable Elements in Two ...

region around the white locus They also found that whilerecombination rate was reduced in both centromeric andtelomeric regions only the centromeric region displayed anincreased density of TEs This information is suggestive ofdifferences between these regions in the rate of ectopicexchange

The DGRP and DSPR are also composed of inbred lineswhich have been kept in laboratory conditions for some time~9 years in the case of the DGRP lines and 40+ years in thecase of the DSPR Given that TE insertion mutations accumu-late rapidly and in an increasing manner over time (Nuzhdinet al 1997) the increased density of TEs on the X versus theautosomes may not be surprising if there is a highertransposition rate to the X The only direct study to datefocused on roo elements in D melanogaster and did observehigher rates of insertion in the X than the autosomes(Vazquez et al 2007)

Transposable Elements in Genes

The majority of genes 813 do not have any TE insertions inany line studied and intron length is a good predictor of thenumber of TEs within a gene However a few genes accumu-late many insertions even after correcting for gene size andintron length and some genes have very few TEs even thoughthey have large introns The reasons why different genes mayaccumulate TE insertions may be quite different for individualgenes In some cases there may be hot spots for insertionswhich if insertions are neutral or only slightly deleterious mayresult in large number of insertions in a small region In othergenes the lack of TE accumulation may suggest selectivepressures against TE accumulation

Gene family size may also play a role in TE accumulationWe see an increase in TE density correlated with an increase ingene family size This may indicate differences in selectivepressures for genes in different sized families Single-copygenes may experience more constraint whereas in moder-ately sized gene families these constraints may be more re-laxed perhaps due to redundancy

In the cytochrome p450 functional family of genes we findthat 3789 of these genes have TE insertions 2589 in theDGRP and 2089 in the DSPR Cytochrome p450 is also foundto be in an enriched term in the DSPR for the set of genescontaining TEs The cytochrome p450 gene Cyp6g1 is knownto confer resistance to DDT (Daborn et al 2001) and TE-mediated copy number variation is associated with increasingresistance to DDT (Schmidt et al 2010) This pattern of TEinsertions in the DSPR may be reflective of the locationswhere DSPR lines were originally collected The DSPR lineswere gathered from various locations worldwide at a timewhen different pesticides were in use The DGRP lines weregathered from a single geographic area in North America in2003 Presumably these sets of lines have experienced differ-ent selective pressures with respect to pesticide resistance InCyp6a20 (fig 4D) where TEs are segregating at different fre-quencies in the two resources some similar effect may beoccurring

TE Detection Pipelines

The TE detection strategy used here is an improvement uponthe strategy used in Mackay et al (2012) The most notabledifferences between the data reported in Mackay et al (2012)and the DGRP data reported here is substantial amount ofnew information First our pipeline improvements agree wellwith the previous implementation with 812 of previous TEpresence calls identified by the new pipeline However thesecalls represent on 182 of the presence calls made by theimproved pipeline demonstrating that we have identifiedsignificantly more calls than before The incorporation of TEabsence calls also allows us to provide a much more completepicture of TEs in euchromatic sequence in the DGRP

The TE detection pipeline used here is similar to otherpipelines which have been employed but it is useful to de-scribe briefly here the differences in approaches and why wefollowed the approach presented here A useful discussion ofcommon approaches to TE detection in many species can befound in Xing et al (2013) One common method for TEdetection is the split-read mapping approach which hasbeen employed in Drosophila by Linheiro and Bergman(2012) This approach identifies TE insertions by locating in-dividual sequence reads which span TE insertion breakpointsThis approach is the strategy of choice for next-generationsequencing experiments which produce only single-endreads However in a paired-end read situation this approachmay not take full advantage of the available data Specificallythe size of window around the TE insertion breakpoint whereinformative reads can be found is governed by read length in asingle-end experiment and fragment length in a paired-endexperiment Linheiro and Bergman (2012) discuss the impor-tance of read length on TE detection and suggest that in-creased read length improves TE detection In situationswhere read length is long and coverage is deep enough thesplit-read method may be preferable to the paired-end de-tection strategy However for the data sets analyzed herewhere read length is much shorter than fragment lengththe paired-end detection strategy will capture moreinformation

The pipeline used here also has similarities to the pipelineused in Kofler et al (2012) However there are two majordifferences in these studies First Kofler et al (2012) do notattempt to reconstruct their insertion breakpoints andsecond the population studied by Kofler et al (2012) wassequenced as a pool rather than as individual lines Becausetheir pooled sequence data reflect on average less than 1sequence coverage per line Kofler et al (2012) are primarilyable to detect intermediate to high frequency TE insertionstheir own analysis concludes that they cannot reliably detectinsertions at frequency lt7 (see supplementary fig S2Supplementary Material online in their supplementary dis-cussion) However the literature to date (Charlesworth andLangley 1989 Charlesworth and Lapid 1989 Charlesworthet al 1992) as well as the results of this study indicate thatthe majority of TEs in a population are found at very lowfrequency often in only one individual in a population andtherefore beyond the scope of detection of Kofler et al (2012)

2321

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Caveats

The D melanogaster reference genome to which we alignedour data is a P-element-free genome Therefore the use of thisreference will be biased against detecting any P elementsexisting in the data sets In the case of the DSPR data setthis is not an issue since these lines are also all P-element-free(Macdonald and Long 2007) However it is likely that theDGRP lines do harbor P elements For the purpose of theseanalyses we decided to ignore P elements and focus on theset of element families common to the two resources

This situation does highlight the importance of having anappropriate reference sequence To detect P elements in thegenome of D melanogaster an artificially constructed refer-ence sequence or a different assembled sequence containingthe P element would need to be utilized This also brings upthe question about as yet undetected elements that exist atlow frequency If these elements do not have sufficient sim-ilarity to existing elements to be aligned to the referencegenome these elements will go undetected

Conclusions

We have presented a description of transposable elements intwo recently released QTL resources Transposable elementscan play important roles in gene function and regulation andincorporation of TE insertion information into any analysisperformed with either of these two resources will present amore complete picture of genomic variation and its contri-bution to complex traits In addition it will be important forassociation studies attempting to replicate genotypendashpheno-type associations utilizing these resources to be aware of thepotential contributions of TEs

We have also demonstrated a next-generation high-throughput sequencing analysis pipeline that is capable ofdetecting this type variation with a high rate of precisionand specificity We suggest that future studies using next-generation sequencing data utilize this technique

Materials and Methods

Data Sets

Two data sets were used in this analysis The first set consistsof 131 inbred lines from the Drosophila Genetic ReferencePanel (DGRP) (Mackay et al 2012) SRA accessionSRP000694 The second set consists of 15 lines used to es-tablish the Drosophila Synthetic Population Resource (DSPR)(King et al 2012a) initially described in Macdonald and Long(2007) SRA accession SRA051316 For the DGRP resourceboth 454 and Illumina data were available whereas onlyIllumina data are available for the DSPR We utilized onlythe Illumina sequence data to ensure consistency in datacollection across data sets We calculated a mean coverageof 21 for the DGRP lines and 50 for the DSPR linesVariation in coverage for the DGRP lines was substantial be-tween lines ranging from ~4 average coverage to nearly50 average coverage We therefore dropped from the anal-ysis four lines with an average coverage of 10 or less sincelow coverage can cause difficulties with assembly and TEidentification Sequence data for both data sets consisted of

paired-end data with 54 bp reads for the DSPR and paired-end 100 bp reads trimmed to 75 bp for DGRP

IBD in DGRP Data Set

We downloaded the SNP tables generated by the DGRPproject from httpwwwhgscbcmtmceduprojectsdgrpfreeze1_July_2010snp_callsIllumina (last accessed July 302010) to examine identity by descent in this data set Weperformed an all by all comparison between lines examiningsliding windows of 1 Mb across the genome with 100 kb stepsbetween windows We take a simple definition of IBD hereWhenever we identified two lines that weregt95 identity atSNP positions in a 1 Mb region we labeled the pair as IBD inthe identified region and masked that region of the genomein the line with lower coverage We also dropped entirelyfrom our analysis any lines where gt50 of the genome wasdetermined to be IBD with other lines in the population(supplementary table S1 Supplementary Material online) re-moving from the analysis the line with lower average se-quence coverage

We modeled 1 Mb regions of 200 chromosomes in a coa-lescent simulation using the parameters theta = 001site and4Nr = 10theta to determine the number of regions of95IBD under a Wright-Fisher population model using the MACSsoftware (Chen et al 2009) We simulated 1000 replicates ofthis simulation and tabulated the number of 1 Mb regionsshowing 95 IBD in each replicate

Alignment

We aligned reads to the D melanogaster reference genome(version 513 downloaded from FlyBase flybaseorg) using thealigner BWA (version 059 Li and Durbin 2009) We used thefollowing parameters (aln -t 8 -l 13 -m 50000000 -R 5000)followed by the command ldquosamperdquo to resolve paired endmappings (using parameters -a 5000 -N 5000 -n 500) It isimportant when detecting TEs that the -R during the initialalignment and the -N parameter during the ldquosamperdquo phase beset high otherwise highly repetitive sequences that would beinformative of the presence of a common element will beexcluded from the data by the aligner We also used the -Ioption for the DGRP lines but not the DSPR due to the dif-ferences in Illumina quality score output format as the twodata sets were sequenced at different times The -m parameterfrom the initial alignment step and the -n parameter from thepaired-end resolution step deal with how BWA treats multiplyaligning reads An additional caveat is that some TE familiesmost notably the P-element family is absent from the Dmelanogaster reference sequence Such families in the se-quenced genomes will not align to the reference genome weused even if they are present in the sampled lines

Transposable Element Detection

Transposable elements were detected by first identifying allreads that were aligned to an annotated TE in the referencegenome (supplementary fig S1 Supplementary Material on-line) We then identified the mates of these reads and selectedonly those mates which did not align to an annotated TE

2322

Cridland et al doi101093molbevmst129 MBE

These reads identify the set of read-pairs which span a TEinsertion The uniquely mapping non-TE reads were thenclustered based on their start position and chromosomalstrand to identify putative TEs Clusters on the plus strandwere matched to clusters on the minus strand to producepairs of clusters indicating either end of the putative TE Onceclusters were identified we filtered out all events which con-tained fewer than three read-pairs consistent with a giveninsertion for each cluster This means that putative eventsat this stage have a minimum of six read-pairs indicating theinsertion three on the plus strand and three on the minusstrand

We then extracted all reads and their mates that aligned tothe identified regions of the genome and attempted toreconstruct the local area including the breakpoint of theTE Because many TEs have repetitive regions at the ends ofthe element we reconstructed the two breakpoints of the TEseparately (see also supplementary fig 23 from Mackay et al2012) Local reconstruction was done by running Phrap (ver-sion 1090518 Ewing and Green 1998) with the followingparameters (-forcelevel 10 -minscore 10 -minmatch 10)This set of parameters causes Phrap to do a Smith-Waterman comparison when it finds matches of at least 10characters but then relaxes the minimum alignment scoreand the final assembly parameters This is necessary to recon-struct TEs because repetitive sequence within the TEs canresult in a failure to reconstruct a contig if stricter assemblyparameters are used All code we developed for the detectionof TEs will be freely available at wwwmolpopgenorgData(last accessed August 1 2013)

Following reconstruction we used BlastN (verson 2222) tocompare reconstructed contigs to the D melanogaster refer-ence to confirm the presence of a TE In most cases we wereable to fully reconstruct both breakpoints of the TE Fullyreconstructed indicates that we could reconstruct a singlecontig which contained the breakpoint of the TE In mostcases the two breakpoints of the TE did not resolve to thesame nucleotide but instead identify the location of thetarget site duplication for the insertion (Linheiro andBergman 2012 similar to fig 1) These are cases where theupstream estimate of the breakpoint is slightly higher thanthe downstream estimate of the breakpoint and the differ-ence in position can be seen as the likely span of the target siteduplication However sometimes we were not able to fullyreconstruct a contig but instead generated two contigs onecontaining uniquely aligning sequence and the other contain-ing TE sequence These contigs are known to belong togetherbecause reads forming one contig have their mates in theother contig In these cases we have an approximate but notprecise estimate of the TE breakpoint and are unable toidentify the target site duplication

After the first round of detection we performed a secondround of detection where we examined every location wherea novel TE was detected in any line in all other lines Thisincluded examining positions in one data set that were iden-tified in the other data set In addition we examined thelocations of all known TEs in the D melanogaster genomeFor each location queried in each line we extracted reads

aligning to that location and their mates and used Phrap toreconstruct a contig This allowed us to produce presenceabsence or missing data calls for all positions where TE wasdetected in all lines The absence of a TE is determined byidentifying reconstructed contigs that spanned the insertionposition of the TE by at least 15 bp on either side Theseabsence calls were not included in Mackay et al (2012)This also allows us to identify TEs that may be in regions ofthe genome with lower average sequence coverage comparedwith the rest of the genome provided that insertion is sharedwith another line where it was possible to detect the insertion

Transposable Element Annotation

Annotation of novel TEs was performed by aligning recon-structed contigs with BlastN to the set of TEs found in theD melanogaster reference We chose the best match basedon overall length and BLAST reported e-value and called thenovel TE a TE of that family We also required a minimum of80 bp of contig matching a TE to make an annotation callThis is an additional annotation analysis to the one we per-formed in Mackay et al (2012) For each of the two contigsproduced for each insertion we determined the best matchand then compared these to each other TEs are classified asper the classification scheme put forth in Wicker et al (2007)We annotated each TE insertion to the highest level of agree-ment between the two contigs if the two contigs disagreedwith each other at every level of classification we called thatelement ldquoundeterminedrdquo This could result from nested in-sertions of elements where the 50 end of the insertion is truly adifferent element from the 30 end or because the TE sequenceis degraded and differs substantially from the referencesequence or because there was not enough TE sequence tomeet our annotation guidelines Because we can only recon-struct a maximum of few hundred bases at either end of theinsertion it is difficult to distinguish between these issues Wealso did not attempt to determine the strand of the TE

Validation

A subset of 190 TE insertion sites shared with the D melano-gaster reference were validated via PCR in a subset of nineof the DGRP lines by Blumenstiel JP Chen X He M BergmanCM (unpublished data httparxivorgabs12093456 lastaccessed June 26 2013) To independently check IBD callsusing SNP data we also determined if pairs of lines wereIBD for TEs in the same regions In this comparison we onlyincluded TE insertions where both lines have a positive pres-ence or absence call This can also serve as a validation of ourTE calling pipeline since if two segments are IBD they shouldshare the same set of TEs

Simulation

We selected a set of elements of different sizes and of TEclasses from the set of TEs in the reference sequence andinserted them semi-randomly into the D melanogaster refer-ence genome in silico Insertions were semi-random since weavoided areas already occupied by reference TEs and the cen-tromeric and telomeric regions Read pairs were then

2323

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

generated to construct fastq files We then ran our TE detec-tion pipeline as above through the initial annotation stepThis process was repeated three times to generate a simula-tion of 15 25 and 50 coverage

Recombination Rates

We wrote an R routine to estimate local recombination ratesas a function of physical position Briefly we used ldquocyto-ge-netic-seqtsvrdquo from wwwflybaseorg (last accessed January 302009) to assign a set of physical landmarks to genetic posi-tions For each chromosome arm we then fit a local polyno-mial relating cumulative cM to cumulative basepair (usinglocpoly) Unlike many prior local polynomial fits we madethe added assumption that this polynomial must be a mono-tonically increasing function (using monoproc) an assump-tion that is true for real data We then fit a smooth spline tothe above curve (using smoothspline) The advantage ofsmoothspline is that the resulting curves can be differentiated(and hence estimates of rates obtained directly) Finally we fita smooth spline to the first derivatives obtained from theprior smooth spline largely to make the recombinationrates less noisy near centromeres and telomeres TheR code is freely available at wwwmolpopgenorgData (lastaccessed August 1 2013)

Data Set Restriction

Analyses performed on these data sets were restricted in twoways First we restricted our analyses to regions of thegenome with low TE density and moderate recombinationrates We restricted our data set in this way because regions ofhigh TE density make it more difficult to make accurate callsBecause our pipeline requires unique reads to positionallylocate the TE insertion regions with high TE density canlack enough unique sequence to generate these reads Theregions included in the analyses are X300000ndash208000002L200000ndash20100000 2R2300000ndash21000000 3L100000ndash21900000 3R600000ndash27800000 which are not centromericand telomeric regions This excludes the majority of TEs an-notated in the reference genome since these TEs are largelyclustered together in centromeric regions However this doesnot affect our analysis since we are interested in TEs in theeuchromatic portions of the genome

Second to restrict the set of reference TEs studied to full-length copies we removed from the analysis all reference TEsthat had a length that was less than 75 of the canonicalannotated length of the TE While these elements may stillproduce phenotypic effects these elements are unlikely to beactive copies This reduced the number of reference TEs in-cluded in the analysis to 607 from 1085 TEs that are present inthe reference genome in the regions included in our analysis

Additionally some analyses were performed on the subsetof 23 lines in the DGRP data set which had an average se-quencing coverage of 25 or higher hereafter referred to asDGRP25 This restriction was to mitigate the effect of differ-ences in coverage between the two data sets Simulations oftransposable element detection via our pipeline data not

shown indicate that TE detection at 25 is comparable todetection at 50

Site Count Spectra

We generated derived allele count spectra for TEs in the DSPRdata set and the DGRP25 set of lines We included only siteswhere we were able to make a positive presence or absencecall for each line in the data set so that the number of lineswas normalized for each resource for this analysis Observedvalues were compared with expected values at counts 1 2 34 5 6 and 7 under an infinite sites model via a 2 testCategories 7 and higher were grouped due to low counts insome of these categories We also calculated count spectra forSNPs present in small introns (introns86 bp [Haddrill et al2005]) a group unlikely to have functional effects We alsoused SoFoS a site frequency rescaler (Hufford et al 2012) torescale the DSPR data set to the sample size of the DGRP25from 15 to 23 samples to see if the rescaled data show adifferent distribution of element counts from the observedDGRP25 data We then did a KolmogorovndashSmirnov test tocompare the two distributions as a test of different selectiveeffects acting on the populations

Genomic Context of TEs

Transposable elements were identified in regions of low(lt2 cMMb) or high recombination (2 cMMb) FASTAfiles containing the annotated coordinates of 50 UTRs30 UTRs intronic intergenic and full gene span regions forversion 513 of the D melanogaster reference were down-loaded from flybase (wwwflybaseorg last accessed January30 2009) Version 513 of the reference annotation was usedfor comparative purposes with the DGRP analysis (Mackayet al 2012) Because alternative transcripts can result in thesame genomic position being classified in multiple categorieswe labeled each TE as every appropriate category This oc-curred with 52 of TEs in the DGRP and 60 of TEs in theDSPR When calculating the percentage of the referencegenome in our restricted areas of analysis that were classifiedas each genomic type we followed the same strategy

We compared our observations of TE insertions to a bino-mial distribution to determine whether TEs were distributedrandomly throughout the genome We used the pbinomfunction in R (R Development Core Team 2008) setting thenumber of trials to the total number of TE insertion locationsin the data set and the probability of success to the propor-tion of the genome defined as each context (intron exon30 UTR 50 UTR or intergenic) We did one tailed tests in eachcase For exon 30 UTR and 50 UTR tests we calculated theprobability of observing the observed value or fewer TEs Forintron and intergenic tests we calculated the probability ofobserving the observed value or more TEs

Variables Affecting TE Density

For the DGRP25 and DSPR data sets we calculatedTE density in terms of TEsMb For this analysis we alsoexcluded chromosome 4 We examined the effects of recom-bination rate chromosome and TE element family on TE

2324

Cridland et al doi101093molbevmst129 MBE

density using the standard linear model in R We usedthe lm() function to define our model as followsTEsMb ~ Line + Recombination Rate + Chromosome +Recombination Rate Chromosome We performed thisanalysis both for the set of all TEs and also for 15 TE familieswith moderate to high number of copies in our data set(supplementary table S4 Supplementary Material online)We selected individual TE families for this additional analysisbased on their level of abundance in the data sets and theirhaving been previously studied by Montgomery et al (1987)and Bartolome et al (2002)

Gene Families

We downloaded the Dfam database from (httpwwwindianaedu~hahnlabflyDfamDBdrosophila_frbhtml last acc-essed September 1 2012) This database describes all mem-bers of all gene families clustered using a fuzzy reciprocalBLAST method as described in Hahn et al (2007) for 12Drosophila species We used R to perform a linear regressionusing a simple model where the square root (TE density) in agene is the random variable and log10 (gene family size) is theindependent variable

Variation between Resources

For each TE present in both data sets we compared thefrequency of the TE in each population using Fisherrsquos exacttest We excluded from this analysis all insertions where wewere unable to make a presenceabsence call in at least 100 ofthe DGRP lines or 13 of the DSPR lines For each gene with TEinsertions in both data sets we compared the frequency atwhich the gene contains TEs between data sets using aFisherrsquos exact test

Supplementary MaterialSupplementary tables S1ndashS6 and figures S1 and S2 are avail-able at Molecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Casey Bergman and Miaomiao He for earlyaccess to their data as well as helpful comments from CBThey also thank Rebekah Rogers for help with editing themanuscript This work was supported by the NationalInstitute of Health grant NIH R01-GM085183 to KRT theNational Library of Medicine - National Institute of Healthtraining grant LM007443 to JMC and the National Instituteof Health grant NIH R01-OD010974 to ADL and SJM

ReferencesAltschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic local

alignment search tool J Mol Biol 215403ndash410Aminetzach YT Macpherson JM Petrov DA 2005 Pesticide resistance

via transposition-mediated adaptive gene truncation in DrosophilaScience 309764ndash767

Bartolome C Maside X Charlesworth B 2002 On the abundance anddistribution of transposable elements in the genome of Drosophilamelanogaster Mol Biol Evol 19926ndash937

Bennetzen JL 2000 Transposable element contributions to plant geneand genome evolution Plant Mol Biol 42251ndash269

Bergman CM Bensasson D 2007 Recent LTR retrotransposon insertioncontrasts with waves of non-LTR insertion since speciation inDrosophila melanogaster Proc Natl Acad Sci U S A 10411340ndash11345

Biemont C Vieira C 2006 Junk DNA as an evolutionary force Nature443521ndash524

Birchler JA Hiebert JC 1989 Interaction of the Enhancer of white-apricotwith transposable element alleles at the white locus in Drosophilamelanogaster Genetics 122129ndash138

Birchler JA Hiebert JC Rabinow L 1989 Interaction of the mottler ofwhite with transposable element alleles at the white locus inDrosophila melanogaster Genes Dev 373ndash84

Bowen NJ McDonald JF 2001 Drosophila euchromatic LTR retrotran-sposons are much younger than the host species in which theyreside Genome Res 111527ndash1540

Carr M Soloway JR Robinson TE Brookfield JF 2002 Mechanisms reg-ulating copy numbers of six LTR retrotransposons in the genome ofDrosophila melanogaster Chromosoma 110511ndash518

Charlesworth B 1994 The effect of background selection against dele-terious mutations on weakly selected linked variants Genet Res 63213ndash227

Charlesworth B Jarne P Assimacopoulos S 1994 The distribution oftransposable elements within and between chromosomes in a pop-ulation of Drosophila melanogaster III Element abundances in het-erochromatin Genet Res 64183ndash197

Charlesworth B Langley CH 1989 The population genetics ofDrosophila transposable elements Annu Rev Genet 23251ndash287

Charlesworth B Lapid A 1989 A study of ten families of transposableelements on X chromosomes from a population of Drosophila mel-anogaster Genet Res 54113ndash125

Charlesworth B Lapid A Canada D 1992 The distribution of transpos-able elements within and between chromosomes in a population ofDrosophila melanogaster I Element frequencies and distributionGenet Res 60103ndash114

Chen GK Marjoram P Wall JD 2009 Fast and flexible simulation ofDNA sequence data Genome Res 19136ndash142

Chung H Bogwitz MR McCart C Andrianopoulos A Ffrench-ConstantRH Batterham P Daborn PJ 2007 Cis-regulatory elements in theaccord retrotransposon result in tissue-specific expression of theDrosophila melanogaster insecticide resistance gene Cyp6g1Genetics 1751071ndash1077

Cutter AD Good JM Pappas CT Saunders MA Starrett DM Wheeler TJ2005 Transposable element orientation bias in the Drosophila mel-anogaster genome J Mol Evol 6733ndash741

Daborn P Boundy S Yen J Pittendrigh B Ffrench-Constant R 2001DDT resistance in Drosophila correlates with Cyp6g1 over-expressionand confers cross-resistance to the neonicotinoid imidacloprid MolGenet Genomics 266556ndash563

Deloger M Cavalli FMG Lerat E Biemont C Sagot M-F Vieira C 2009Identification of expressed transposable element insertionsin the sequenced genome of Drosophila melanogaster Gene 43955ndash62

Dworkin I Palsson A Gibson G 2005 Replication of an Egfr-wing shapeassociation in a wild-caught cohort of Drosophila melanogasterGenetics 1692115ndash2125

Ewing B Green P 1998 Base-calling of automated sequencer tracesusing Phred II Error probabilities Genome Res 8186ndash194

Gruber JD Genissel A Macdonald SJ Long AD 2007 How repeatable areassociations between polymorphisms in achaete-scute and bristlenumber variation in Drosophila Genetics 1751987ndash1997

Haddrill PR Charlesworth B Halligan DL Andolfatto P 2005 Patterns ofintron sequence evolution in Drosophila are dependent upon lengthand GC content Genome Biol 6R67

Hahn MW Han MV Han S-G 2007 Gene family evolution across 12Drosophila genomes PLoS Genet 3e197

Hirschhorn JN Daly MJ 2005 Genome-wide association stud-ies for common diseases and complex traits Nat Rev Genet 695ndash108

2325

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 12: Abundance and Distribution of Transposable Elements in Two ...

Caveats

The D melanogaster reference genome to which we alignedour data is a P-element-free genome Therefore the use of thisreference will be biased against detecting any P elementsexisting in the data sets In the case of the DSPR data setthis is not an issue since these lines are also all P-element-free(Macdonald and Long 2007) However it is likely that theDGRP lines do harbor P elements For the purpose of theseanalyses we decided to ignore P elements and focus on theset of element families common to the two resources

This situation does highlight the importance of having anappropriate reference sequence To detect P elements in thegenome of D melanogaster an artificially constructed refer-ence sequence or a different assembled sequence containingthe P element would need to be utilized This also brings upthe question about as yet undetected elements that exist atlow frequency If these elements do not have sufficient sim-ilarity to existing elements to be aligned to the referencegenome these elements will go undetected

Conclusions

We have presented a description of transposable elements intwo recently released QTL resources Transposable elementscan play important roles in gene function and regulation andincorporation of TE insertion information into any analysisperformed with either of these two resources will present amore complete picture of genomic variation and its contri-bution to complex traits In addition it will be important forassociation studies attempting to replicate genotypendashpheno-type associations utilizing these resources to be aware of thepotential contributions of TEs

We have also demonstrated a next-generation high-throughput sequencing analysis pipeline that is capable ofdetecting this type variation with a high rate of precisionand specificity We suggest that future studies using next-generation sequencing data utilize this technique

Materials and Methods

Data Sets

Two data sets were used in this analysis The first set consistsof 131 inbred lines from the Drosophila Genetic ReferencePanel (DGRP) (Mackay et al 2012) SRA accessionSRP000694 The second set consists of 15 lines used to es-tablish the Drosophila Synthetic Population Resource (DSPR)(King et al 2012a) initially described in Macdonald and Long(2007) SRA accession SRA051316 For the DGRP resourceboth 454 and Illumina data were available whereas onlyIllumina data are available for the DSPR We utilized onlythe Illumina sequence data to ensure consistency in datacollection across data sets We calculated a mean coverageof 21 for the DGRP lines and 50 for the DSPR linesVariation in coverage for the DGRP lines was substantial be-tween lines ranging from ~4 average coverage to nearly50 average coverage We therefore dropped from the anal-ysis four lines with an average coverage of 10 or less sincelow coverage can cause difficulties with assembly and TEidentification Sequence data for both data sets consisted of

paired-end data with 54 bp reads for the DSPR and paired-end 100 bp reads trimmed to 75 bp for DGRP

IBD in DGRP Data Set

We downloaded the SNP tables generated by the DGRPproject from httpwwwhgscbcmtmceduprojectsdgrpfreeze1_July_2010snp_callsIllumina (last accessed July 302010) to examine identity by descent in this data set Weperformed an all by all comparison between lines examiningsliding windows of 1 Mb across the genome with 100 kb stepsbetween windows We take a simple definition of IBD hereWhenever we identified two lines that weregt95 identity atSNP positions in a 1 Mb region we labeled the pair as IBD inthe identified region and masked that region of the genomein the line with lower coverage We also dropped entirelyfrom our analysis any lines where gt50 of the genome wasdetermined to be IBD with other lines in the population(supplementary table S1 Supplementary Material online) re-moving from the analysis the line with lower average se-quence coverage

We modeled 1 Mb regions of 200 chromosomes in a coa-lescent simulation using the parameters theta = 001site and4Nr = 10theta to determine the number of regions of95IBD under a Wright-Fisher population model using the MACSsoftware (Chen et al 2009) We simulated 1000 replicates ofthis simulation and tabulated the number of 1 Mb regionsshowing 95 IBD in each replicate

Alignment

We aligned reads to the D melanogaster reference genome(version 513 downloaded from FlyBase flybaseorg) using thealigner BWA (version 059 Li and Durbin 2009) We used thefollowing parameters (aln -t 8 -l 13 -m 50000000 -R 5000)followed by the command ldquosamperdquo to resolve paired endmappings (using parameters -a 5000 -N 5000 -n 500) It isimportant when detecting TEs that the -R during the initialalignment and the -N parameter during the ldquosamperdquo phase beset high otherwise highly repetitive sequences that would beinformative of the presence of a common element will beexcluded from the data by the aligner We also used the -Ioption for the DGRP lines but not the DSPR due to the dif-ferences in Illumina quality score output format as the twodata sets were sequenced at different times The -m parameterfrom the initial alignment step and the -n parameter from thepaired-end resolution step deal with how BWA treats multiplyaligning reads An additional caveat is that some TE familiesmost notably the P-element family is absent from the Dmelanogaster reference sequence Such families in the se-quenced genomes will not align to the reference genome weused even if they are present in the sampled lines

Transposable Element Detection

Transposable elements were detected by first identifying allreads that were aligned to an annotated TE in the referencegenome (supplementary fig S1 Supplementary Material on-line) We then identified the mates of these reads and selectedonly those mates which did not align to an annotated TE

2322

Cridland et al doi101093molbevmst129 MBE

These reads identify the set of read-pairs which span a TEinsertion The uniquely mapping non-TE reads were thenclustered based on their start position and chromosomalstrand to identify putative TEs Clusters on the plus strandwere matched to clusters on the minus strand to producepairs of clusters indicating either end of the putative TE Onceclusters were identified we filtered out all events which con-tained fewer than three read-pairs consistent with a giveninsertion for each cluster This means that putative eventsat this stage have a minimum of six read-pairs indicating theinsertion three on the plus strand and three on the minusstrand

We then extracted all reads and their mates that aligned tothe identified regions of the genome and attempted toreconstruct the local area including the breakpoint of theTE Because many TEs have repetitive regions at the ends ofthe element we reconstructed the two breakpoints of the TEseparately (see also supplementary fig 23 from Mackay et al2012) Local reconstruction was done by running Phrap (ver-sion 1090518 Ewing and Green 1998) with the followingparameters (-forcelevel 10 -minscore 10 -minmatch 10)This set of parameters causes Phrap to do a Smith-Waterman comparison when it finds matches of at least 10characters but then relaxes the minimum alignment scoreand the final assembly parameters This is necessary to recon-struct TEs because repetitive sequence within the TEs canresult in a failure to reconstruct a contig if stricter assemblyparameters are used All code we developed for the detectionof TEs will be freely available at wwwmolpopgenorgData(last accessed August 1 2013)

Following reconstruction we used BlastN (verson 2222) tocompare reconstructed contigs to the D melanogaster refer-ence to confirm the presence of a TE In most cases we wereable to fully reconstruct both breakpoints of the TE Fullyreconstructed indicates that we could reconstruct a singlecontig which contained the breakpoint of the TE In mostcases the two breakpoints of the TE did not resolve to thesame nucleotide but instead identify the location of thetarget site duplication for the insertion (Linheiro andBergman 2012 similar to fig 1) These are cases where theupstream estimate of the breakpoint is slightly higher thanthe downstream estimate of the breakpoint and the differ-ence in position can be seen as the likely span of the target siteduplication However sometimes we were not able to fullyreconstruct a contig but instead generated two contigs onecontaining uniquely aligning sequence and the other contain-ing TE sequence These contigs are known to belong togetherbecause reads forming one contig have their mates in theother contig In these cases we have an approximate but notprecise estimate of the TE breakpoint and are unable toidentify the target site duplication

After the first round of detection we performed a secondround of detection where we examined every location wherea novel TE was detected in any line in all other lines Thisincluded examining positions in one data set that were iden-tified in the other data set In addition we examined thelocations of all known TEs in the D melanogaster genomeFor each location queried in each line we extracted reads

aligning to that location and their mates and used Phrap toreconstruct a contig This allowed us to produce presenceabsence or missing data calls for all positions where TE wasdetected in all lines The absence of a TE is determined byidentifying reconstructed contigs that spanned the insertionposition of the TE by at least 15 bp on either side Theseabsence calls were not included in Mackay et al (2012)This also allows us to identify TEs that may be in regions ofthe genome with lower average sequence coverage comparedwith the rest of the genome provided that insertion is sharedwith another line where it was possible to detect the insertion

Transposable Element Annotation

Annotation of novel TEs was performed by aligning recon-structed contigs with BlastN to the set of TEs found in theD melanogaster reference We chose the best match basedon overall length and BLAST reported e-value and called thenovel TE a TE of that family We also required a minimum of80 bp of contig matching a TE to make an annotation callThis is an additional annotation analysis to the one we per-formed in Mackay et al (2012) For each of the two contigsproduced for each insertion we determined the best matchand then compared these to each other TEs are classified asper the classification scheme put forth in Wicker et al (2007)We annotated each TE insertion to the highest level of agree-ment between the two contigs if the two contigs disagreedwith each other at every level of classification we called thatelement ldquoundeterminedrdquo This could result from nested in-sertions of elements where the 50 end of the insertion is truly adifferent element from the 30 end or because the TE sequenceis degraded and differs substantially from the referencesequence or because there was not enough TE sequence tomeet our annotation guidelines Because we can only recon-struct a maximum of few hundred bases at either end of theinsertion it is difficult to distinguish between these issues Wealso did not attempt to determine the strand of the TE

Validation

A subset of 190 TE insertion sites shared with the D melano-gaster reference were validated via PCR in a subset of nineof the DGRP lines by Blumenstiel JP Chen X He M BergmanCM (unpublished data httparxivorgabs12093456 lastaccessed June 26 2013) To independently check IBD callsusing SNP data we also determined if pairs of lines wereIBD for TEs in the same regions In this comparison we onlyincluded TE insertions where both lines have a positive pres-ence or absence call This can also serve as a validation of ourTE calling pipeline since if two segments are IBD they shouldshare the same set of TEs

Simulation

We selected a set of elements of different sizes and of TEclasses from the set of TEs in the reference sequence andinserted them semi-randomly into the D melanogaster refer-ence genome in silico Insertions were semi-random since weavoided areas already occupied by reference TEs and the cen-tromeric and telomeric regions Read pairs were then

2323

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

generated to construct fastq files We then ran our TE detec-tion pipeline as above through the initial annotation stepThis process was repeated three times to generate a simula-tion of 15 25 and 50 coverage

Recombination Rates

We wrote an R routine to estimate local recombination ratesas a function of physical position Briefly we used ldquocyto-ge-netic-seqtsvrdquo from wwwflybaseorg (last accessed January 302009) to assign a set of physical landmarks to genetic posi-tions For each chromosome arm we then fit a local polyno-mial relating cumulative cM to cumulative basepair (usinglocpoly) Unlike many prior local polynomial fits we madethe added assumption that this polynomial must be a mono-tonically increasing function (using monoproc) an assump-tion that is true for real data We then fit a smooth spline tothe above curve (using smoothspline) The advantage ofsmoothspline is that the resulting curves can be differentiated(and hence estimates of rates obtained directly) Finally we fita smooth spline to the first derivatives obtained from theprior smooth spline largely to make the recombinationrates less noisy near centromeres and telomeres TheR code is freely available at wwwmolpopgenorgData (lastaccessed August 1 2013)

Data Set Restriction

Analyses performed on these data sets were restricted in twoways First we restricted our analyses to regions of thegenome with low TE density and moderate recombinationrates We restricted our data set in this way because regions ofhigh TE density make it more difficult to make accurate callsBecause our pipeline requires unique reads to positionallylocate the TE insertion regions with high TE density canlack enough unique sequence to generate these reads Theregions included in the analyses are X300000ndash208000002L200000ndash20100000 2R2300000ndash21000000 3L100000ndash21900000 3R600000ndash27800000 which are not centromericand telomeric regions This excludes the majority of TEs an-notated in the reference genome since these TEs are largelyclustered together in centromeric regions However this doesnot affect our analysis since we are interested in TEs in theeuchromatic portions of the genome

Second to restrict the set of reference TEs studied to full-length copies we removed from the analysis all reference TEsthat had a length that was less than 75 of the canonicalannotated length of the TE While these elements may stillproduce phenotypic effects these elements are unlikely to beactive copies This reduced the number of reference TEs in-cluded in the analysis to 607 from 1085 TEs that are present inthe reference genome in the regions included in our analysis

Additionally some analyses were performed on the subsetof 23 lines in the DGRP data set which had an average se-quencing coverage of 25 or higher hereafter referred to asDGRP25 This restriction was to mitigate the effect of differ-ences in coverage between the two data sets Simulations oftransposable element detection via our pipeline data not

shown indicate that TE detection at 25 is comparable todetection at 50

Site Count Spectra

We generated derived allele count spectra for TEs in the DSPRdata set and the DGRP25 set of lines We included only siteswhere we were able to make a positive presence or absencecall for each line in the data set so that the number of lineswas normalized for each resource for this analysis Observedvalues were compared with expected values at counts 1 2 34 5 6 and 7 under an infinite sites model via a 2 testCategories 7 and higher were grouped due to low counts insome of these categories We also calculated count spectra forSNPs present in small introns (introns86 bp [Haddrill et al2005]) a group unlikely to have functional effects We alsoused SoFoS a site frequency rescaler (Hufford et al 2012) torescale the DSPR data set to the sample size of the DGRP25from 15 to 23 samples to see if the rescaled data show adifferent distribution of element counts from the observedDGRP25 data We then did a KolmogorovndashSmirnov test tocompare the two distributions as a test of different selectiveeffects acting on the populations

Genomic Context of TEs

Transposable elements were identified in regions of low(lt2 cMMb) or high recombination (2 cMMb) FASTAfiles containing the annotated coordinates of 50 UTRs30 UTRs intronic intergenic and full gene span regions forversion 513 of the D melanogaster reference were down-loaded from flybase (wwwflybaseorg last accessed January30 2009) Version 513 of the reference annotation was usedfor comparative purposes with the DGRP analysis (Mackayet al 2012) Because alternative transcripts can result in thesame genomic position being classified in multiple categorieswe labeled each TE as every appropriate category This oc-curred with 52 of TEs in the DGRP and 60 of TEs in theDSPR When calculating the percentage of the referencegenome in our restricted areas of analysis that were classifiedas each genomic type we followed the same strategy

We compared our observations of TE insertions to a bino-mial distribution to determine whether TEs were distributedrandomly throughout the genome We used the pbinomfunction in R (R Development Core Team 2008) setting thenumber of trials to the total number of TE insertion locationsin the data set and the probability of success to the propor-tion of the genome defined as each context (intron exon30 UTR 50 UTR or intergenic) We did one tailed tests in eachcase For exon 30 UTR and 50 UTR tests we calculated theprobability of observing the observed value or fewer TEs Forintron and intergenic tests we calculated the probability ofobserving the observed value or more TEs

Variables Affecting TE Density

For the DGRP25 and DSPR data sets we calculatedTE density in terms of TEsMb For this analysis we alsoexcluded chromosome 4 We examined the effects of recom-bination rate chromosome and TE element family on TE

2324

Cridland et al doi101093molbevmst129 MBE

density using the standard linear model in R We usedthe lm() function to define our model as followsTEsMb ~ Line + Recombination Rate + Chromosome +Recombination Rate Chromosome We performed thisanalysis both for the set of all TEs and also for 15 TE familieswith moderate to high number of copies in our data set(supplementary table S4 Supplementary Material online)We selected individual TE families for this additional analysisbased on their level of abundance in the data sets and theirhaving been previously studied by Montgomery et al (1987)and Bartolome et al (2002)

Gene Families

We downloaded the Dfam database from (httpwwwindianaedu~hahnlabflyDfamDBdrosophila_frbhtml last acc-essed September 1 2012) This database describes all mem-bers of all gene families clustered using a fuzzy reciprocalBLAST method as described in Hahn et al (2007) for 12Drosophila species We used R to perform a linear regressionusing a simple model where the square root (TE density) in agene is the random variable and log10 (gene family size) is theindependent variable

Variation between Resources

For each TE present in both data sets we compared thefrequency of the TE in each population using Fisherrsquos exacttest We excluded from this analysis all insertions where wewere unable to make a presenceabsence call in at least 100 ofthe DGRP lines or 13 of the DSPR lines For each gene with TEinsertions in both data sets we compared the frequency atwhich the gene contains TEs between data sets using aFisherrsquos exact test

Supplementary MaterialSupplementary tables S1ndashS6 and figures S1 and S2 are avail-able at Molecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Casey Bergman and Miaomiao He for earlyaccess to their data as well as helpful comments from CBThey also thank Rebekah Rogers for help with editing themanuscript This work was supported by the NationalInstitute of Health grant NIH R01-GM085183 to KRT theNational Library of Medicine - National Institute of Healthtraining grant LM007443 to JMC and the National Instituteof Health grant NIH R01-OD010974 to ADL and SJM

ReferencesAltschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic local

alignment search tool J Mol Biol 215403ndash410Aminetzach YT Macpherson JM Petrov DA 2005 Pesticide resistance

via transposition-mediated adaptive gene truncation in DrosophilaScience 309764ndash767

Bartolome C Maside X Charlesworth B 2002 On the abundance anddistribution of transposable elements in the genome of Drosophilamelanogaster Mol Biol Evol 19926ndash937

Bennetzen JL 2000 Transposable element contributions to plant geneand genome evolution Plant Mol Biol 42251ndash269

Bergman CM Bensasson D 2007 Recent LTR retrotransposon insertioncontrasts with waves of non-LTR insertion since speciation inDrosophila melanogaster Proc Natl Acad Sci U S A 10411340ndash11345

Biemont C Vieira C 2006 Junk DNA as an evolutionary force Nature443521ndash524

Birchler JA Hiebert JC 1989 Interaction of the Enhancer of white-apricotwith transposable element alleles at the white locus in Drosophilamelanogaster Genetics 122129ndash138

Birchler JA Hiebert JC Rabinow L 1989 Interaction of the mottler ofwhite with transposable element alleles at the white locus inDrosophila melanogaster Genes Dev 373ndash84

Bowen NJ McDonald JF 2001 Drosophila euchromatic LTR retrotran-sposons are much younger than the host species in which theyreside Genome Res 111527ndash1540

Carr M Soloway JR Robinson TE Brookfield JF 2002 Mechanisms reg-ulating copy numbers of six LTR retrotransposons in the genome ofDrosophila melanogaster Chromosoma 110511ndash518

Charlesworth B 1994 The effect of background selection against dele-terious mutations on weakly selected linked variants Genet Res 63213ndash227

Charlesworth B Jarne P Assimacopoulos S 1994 The distribution oftransposable elements within and between chromosomes in a pop-ulation of Drosophila melanogaster III Element abundances in het-erochromatin Genet Res 64183ndash197

Charlesworth B Langley CH 1989 The population genetics ofDrosophila transposable elements Annu Rev Genet 23251ndash287

Charlesworth B Lapid A 1989 A study of ten families of transposableelements on X chromosomes from a population of Drosophila mel-anogaster Genet Res 54113ndash125

Charlesworth B Lapid A Canada D 1992 The distribution of transpos-able elements within and between chromosomes in a population ofDrosophila melanogaster I Element frequencies and distributionGenet Res 60103ndash114

Chen GK Marjoram P Wall JD 2009 Fast and flexible simulation ofDNA sequence data Genome Res 19136ndash142

Chung H Bogwitz MR McCart C Andrianopoulos A Ffrench-ConstantRH Batterham P Daborn PJ 2007 Cis-regulatory elements in theaccord retrotransposon result in tissue-specific expression of theDrosophila melanogaster insecticide resistance gene Cyp6g1Genetics 1751071ndash1077

Cutter AD Good JM Pappas CT Saunders MA Starrett DM Wheeler TJ2005 Transposable element orientation bias in the Drosophila mel-anogaster genome J Mol Evol 6733ndash741

Daborn P Boundy S Yen J Pittendrigh B Ffrench-Constant R 2001DDT resistance in Drosophila correlates with Cyp6g1 over-expressionand confers cross-resistance to the neonicotinoid imidacloprid MolGenet Genomics 266556ndash563

Deloger M Cavalli FMG Lerat E Biemont C Sagot M-F Vieira C 2009Identification of expressed transposable element insertionsin the sequenced genome of Drosophila melanogaster Gene 43955ndash62

Dworkin I Palsson A Gibson G 2005 Replication of an Egfr-wing shapeassociation in a wild-caught cohort of Drosophila melanogasterGenetics 1692115ndash2125

Ewing B Green P 1998 Base-calling of automated sequencer tracesusing Phred II Error probabilities Genome Res 8186ndash194

Gruber JD Genissel A Macdonald SJ Long AD 2007 How repeatable areassociations between polymorphisms in achaete-scute and bristlenumber variation in Drosophila Genetics 1751987ndash1997

Haddrill PR Charlesworth B Halligan DL Andolfatto P 2005 Patterns ofintron sequence evolution in Drosophila are dependent upon lengthand GC content Genome Biol 6R67

Hahn MW Han MV Han S-G 2007 Gene family evolution across 12Drosophila genomes PLoS Genet 3e197

Hirschhorn JN Daly MJ 2005 Genome-wide association stud-ies for common diseases and complex traits Nat Rev Genet 695ndash108

2325

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 13: Abundance and Distribution of Transposable Elements in Two ...

These reads identify the set of read-pairs which span a TEinsertion The uniquely mapping non-TE reads were thenclustered based on their start position and chromosomalstrand to identify putative TEs Clusters on the plus strandwere matched to clusters on the minus strand to producepairs of clusters indicating either end of the putative TE Onceclusters were identified we filtered out all events which con-tained fewer than three read-pairs consistent with a giveninsertion for each cluster This means that putative eventsat this stage have a minimum of six read-pairs indicating theinsertion three on the plus strand and three on the minusstrand

We then extracted all reads and their mates that aligned tothe identified regions of the genome and attempted toreconstruct the local area including the breakpoint of theTE Because many TEs have repetitive regions at the ends ofthe element we reconstructed the two breakpoints of the TEseparately (see also supplementary fig 23 from Mackay et al2012) Local reconstruction was done by running Phrap (ver-sion 1090518 Ewing and Green 1998) with the followingparameters (-forcelevel 10 -minscore 10 -minmatch 10)This set of parameters causes Phrap to do a Smith-Waterman comparison when it finds matches of at least 10characters but then relaxes the minimum alignment scoreand the final assembly parameters This is necessary to recon-struct TEs because repetitive sequence within the TEs canresult in a failure to reconstruct a contig if stricter assemblyparameters are used All code we developed for the detectionof TEs will be freely available at wwwmolpopgenorgData(last accessed August 1 2013)

Following reconstruction we used BlastN (verson 2222) tocompare reconstructed contigs to the D melanogaster refer-ence to confirm the presence of a TE In most cases we wereable to fully reconstruct both breakpoints of the TE Fullyreconstructed indicates that we could reconstruct a singlecontig which contained the breakpoint of the TE In mostcases the two breakpoints of the TE did not resolve to thesame nucleotide but instead identify the location of thetarget site duplication for the insertion (Linheiro andBergman 2012 similar to fig 1) These are cases where theupstream estimate of the breakpoint is slightly higher thanthe downstream estimate of the breakpoint and the differ-ence in position can be seen as the likely span of the target siteduplication However sometimes we were not able to fullyreconstruct a contig but instead generated two contigs onecontaining uniquely aligning sequence and the other contain-ing TE sequence These contigs are known to belong togetherbecause reads forming one contig have their mates in theother contig In these cases we have an approximate but notprecise estimate of the TE breakpoint and are unable toidentify the target site duplication

After the first round of detection we performed a secondround of detection where we examined every location wherea novel TE was detected in any line in all other lines Thisincluded examining positions in one data set that were iden-tified in the other data set In addition we examined thelocations of all known TEs in the D melanogaster genomeFor each location queried in each line we extracted reads

aligning to that location and their mates and used Phrap toreconstruct a contig This allowed us to produce presenceabsence or missing data calls for all positions where TE wasdetected in all lines The absence of a TE is determined byidentifying reconstructed contigs that spanned the insertionposition of the TE by at least 15 bp on either side Theseabsence calls were not included in Mackay et al (2012)This also allows us to identify TEs that may be in regions ofthe genome with lower average sequence coverage comparedwith the rest of the genome provided that insertion is sharedwith another line where it was possible to detect the insertion

Transposable Element Annotation

Annotation of novel TEs was performed by aligning recon-structed contigs with BlastN to the set of TEs found in theD melanogaster reference We chose the best match basedon overall length and BLAST reported e-value and called thenovel TE a TE of that family We also required a minimum of80 bp of contig matching a TE to make an annotation callThis is an additional annotation analysis to the one we per-formed in Mackay et al (2012) For each of the two contigsproduced for each insertion we determined the best matchand then compared these to each other TEs are classified asper the classification scheme put forth in Wicker et al (2007)We annotated each TE insertion to the highest level of agree-ment between the two contigs if the two contigs disagreedwith each other at every level of classification we called thatelement ldquoundeterminedrdquo This could result from nested in-sertions of elements where the 50 end of the insertion is truly adifferent element from the 30 end or because the TE sequenceis degraded and differs substantially from the referencesequence or because there was not enough TE sequence tomeet our annotation guidelines Because we can only recon-struct a maximum of few hundred bases at either end of theinsertion it is difficult to distinguish between these issues Wealso did not attempt to determine the strand of the TE

Validation

A subset of 190 TE insertion sites shared with the D melano-gaster reference were validated via PCR in a subset of nineof the DGRP lines by Blumenstiel JP Chen X He M BergmanCM (unpublished data httparxivorgabs12093456 lastaccessed June 26 2013) To independently check IBD callsusing SNP data we also determined if pairs of lines wereIBD for TEs in the same regions In this comparison we onlyincluded TE insertions where both lines have a positive pres-ence or absence call This can also serve as a validation of ourTE calling pipeline since if two segments are IBD they shouldshare the same set of TEs

Simulation

We selected a set of elements of different sizes and of TEclasses from the set of TEs in the reference sequence andinserted them semi-randomly into the D melanogaster refer-ence genome in silico Insertions were semi-random since weavoided areas already occupied by reference TEs and the cen-tromeric and telomeric regions Read pairs were then

2323

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

generated to construct fastq files We then ran our TE detec-tion pipeline as above through the initial annotation stepThis process was repeated three times to generate a simula-tion of 15 25 and 50 coverage

Recombination Rates

We wrote an R routine to estimate local recombination ratesas a function of physical position Briefly we used ldquocyto-ge-netic-seqtsvrdquo from wwwflybaseorg (last accessed January 302009) to assign a set of physical landmarks to genetic posi-tions For each chromosome arm we then fit a local polyno-mial relating cumulative cM to cumulative basepair (usinglocpoly) Unlike many prior local polynomial fits we madethe added assumption that this polynomial must be a mono-tonically increasing function (using monoproc) an assump-tion that is true for real data We then fit a smooth spline tothe above curve (using smoothspline) The advantage ofsmoothspline is that the resulting curves can be differentiated(and hence estimates of rates obtained directly) Finally we fita smooth spline to the first derivatives obtained from theprior smooth spline largely to make the recombinationrates less noisy near centromeres and telomeres TheR code is freely available at wwwmolpopgenorgData (lastaccessed August 1 2013)

Data Set Restriction

Analyses performed on these data sets were restricted in twoways First we restricted our analyses to regions of thegenome with low TE density and moderate recombinationrates We restricted our data set in this way because regions ofhigh TE density make it more difficult to make accurate callsBecause our pipeline requires unique reads to positionallylocate the TE insertion regions with high TE density canlack enough unique sequence to generate these reads Theregions included in the analyses are X300000ndash208000002L200000ndash20100000 2R2300000ndash21000000 3L100000ndash21900000 3R600000ndash27800000 which are not centromericand telomeric regions This excludes the majority of TEs an-notated in the reference genome since these TEs are largelyclustered together in centromeric regions However this doesnot affect our analysis since we are interested in TEs in theeuchromatic portions of the genome

Second to restrict the set of reference TEs studied to full-length copies we removed from the analysis all reference TEsthat had a length that was less than 75 of the canonicalannotated length of the TE While these elements may stillproduce phenotypic effects these elements are unlikely to beactive copies This reduced the number of reference TEs in-cluded in the analysis to 607 from 1085 TEs that are present inthe reference genome in the regions included in our analysis

Additionally some analyses were performed on the subsetof 23 lines in the DGRP data set which had an average se-quencing coverage of 25 or higher hereafter referred to asDGRP25 This restriction was to mitigate the effect of differ-ences in coverage between the two data sets Simulations oftransposable element detection via our pipeline data not

shown indicate that TE detection at 25 is comparable todetection at 50

Site Count Spectra

We generated derived allele count spectra for TEs in the DSPRdata set and the DGRP25 set of lines We included only siteswhere we were able to make a positive presence or absencecall for each line in the data set so that the number of lineswas normalized for each resource for this analysis Observedvalues were compared with expected values at counts 1 2 34 5 6 and 7 under an infinite sites model via a 2 testCategories 7 and higher were grouped due to low counts insome of these categories We also calculated count spectra forSNPs present in small introns (introns86 bp [Haddrill et al2005]) a group unlikely to have functional effects We alsoused SoFoS a site frequency rescaler (Hufford et al 2012) torescale the DSPR data set to the sample size of the DGRP25from 15 to 23 samples to see if the rescaled data show adifferent distribution of element counts from the observedDGRP25 data We then did a KolmogorovndashSmirnov test tocompare the two distributions as a test of different selectiveeffects acting on the populations

Genomic Context of TEs

Transposable elements were identified in regions of low(lt2 cMMb) or high recombination (2 cMMb) FASTAfiles containing the annotated coordinates of 50 UTRs30 UTRs intronic intergenic and full gene span regions forversion 513 of the D melanogaster reference were down-loaded from flybase (wwwflybaseorg last accessed January30 2009) Version 513 of the reference annotation was usedfor comparative purposes with the DGRP analysis (Mackayet al 2012) Because alternative transcripts can result in thesame genomic position being classified in multiple categorieswe labeled each TE as every appropriate category This oc-curred with 52 of TEs in the DGRP and 60 of TEs in theDSPR When calculating the percentage of the referencegenome in our restricted areas of analysis that were classifiedas each genomic type we followed the same strategy

We compared our observations of TE insertions to a bino-mial distribution to determine whether TEs were distributedrandomly throughout the genome We used the pbinomfunction in R (R Development Core Team 2008) setting thenumber of trials to the total number of TE insertion locationsin the data set and the probability of success to the propor-tion of the genome defined as each context (intron exon30 UTR 50 UTR or intergenic) We did one tailed tests in eachcase For exon 30 UTR and 50 UTR tests we calculated theprobability of observing the observed value or fewer TEs Forintron and intergenic tests we calculated the probability ofobserving the observed value or more TEs

Variables Affecting TE Density

For the DGRP25 and DSPR data sets we calculatedTE density in terms of TEsMb For this analysis we alsoexcluded chromosome 4 We examined the effects of recom-bination rate chromosome and TE element family on TE

2324

Cridland et al doi101093molbevmst129 MBE

density using the standard linear model in R We usedthe lm() function to define our model as followsTEsMb ~ Line + Recombination Rate + Chromosome +Recombination Rate Chromosome We performed thisanalysis both for the set of all TEs and also for 15 TE familieswith moderate to high number of copies in our data set(supplementary table S4 Supplementary Material online)We selected individual TE families for this additional analysisbased on their level of abundance in the data sets and theirhaving been previously studied by Montgomery et al (1987)and Bartolome et al (2002)

Gene Families

We downloaded the Dfam database from (httpwwwindianaedu~hahnlabflyDfamDBdrosophila_frbhtml last acc-essed September 1 2012) This database describes all mem-bers of all gene families clustered using a fuzzy reciprocalBLAST method as described in Hahn et al (2007) for 12Drosophila species We used R to perform a linear regressionusing a simple model where the square root (TE density) in agene is the random variable and log10 (gene family size) is theindependent variable

Variation between Resources

For each TE present in both data sets we compared thefrequency of the TE in each population using Fisherrsquos exacttest We excluded from this analysis all insertions where wewere unable to make a presenceabsence call in at least 100 ofthe DGRP lines or 13 of the DSPR lines For each gene with TEinsertions in both data sets we compared the frequency atwhich the gene contains TEs between data sets using aFisherrsquos exact test

Supplementary MaterialSupplementary tables S1ndashS6 and figures S1 and S2 are avail-able at Molecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Casey Bergman and Miaomiao He for earlyaccess to their data as well as helpful comments from CBThey also thank Rebekah Rogers for help with editing themanuscript This work was supported by the NationalInstitute of Health grant NIH R01-GM085183 to KRT theNational Library of Medicine - National Institute of Healthtraining grant LM007443 to JMC and the National Instituteof Health grant NIH R01-OD010974 to ADL and SJM

ReferencesAltschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic local

alignment search tool J Mol Biol 215403ndash410Aminetzach YT Macpherson JM Petrov DA 2005 Pesticide resistance

via transposition-mediated adaptive gene truncation in DrosophilaScience 309764ndash767

Bartolome C Maside X Charlesworth B 2002 On the abundance anddistribution of transposable elements in the genome of Drosophilamelanogaster Mol Biol Evol 19926ndash937

Bennetzen JL 2000 Transposable element contributions to plant geneand genome evolution Plant Mol Biol 42251ndash269

Bergman CM Bensasson D 2007 Recent LTR retrotransposon insertioncontrasts with waves of non-LTR insertion since speciation inDrosophila melanogaster Proc Natl Acad Sci U S A 10411340ndash11345

Biemont C Vieira C 2006 Junk DNA as an evolutionary force Nature443521ndash524

Birchler JA Hiebert JC 1989 Interaction of the Enhancer of white-apricotwith transposable element alleles at the white locus in Drosophilamelanogaster Genetics 122129ndash138

Birchler JA Hiebert JC Rabinow L 1989 Interaction of the mottler ofwhite with transposable element alleles at the white locus inDrosophila melanogaster Genes Dev 373ndash84

Bowen NJ McDonald JF 2001 Drosophila euchromatic LTR retrotran-sposons are much younger than the host species in which theyreside Genome Res 111527ndash1540

Carr M Soloway JR Robinson TE Brookfield JF 2002 Mechanisms reg-ulating copy numbers of six LTR retrotransposons in the genome ofDrosophila melanogaster Chromosoma 110511ndash518

Charlesworth B 1994 The effect of background selection against dele-terious mutations on weakly selected linked variants Genet Res 63213ndash227

Charlesworth B Jarne P Assimacopoulos S 1994 The distribution oftransposable elements within and between chromosomes in a pop-ulation of Drosophila melanogaster III Element abundances in het-erochromatin Genet Res 64183ndash197

Charlesworth B Langley CH 1989 The population genetics ofDrosophila transposable elements Annu Rev Genet 23251ndash287

Charlesworth B Lapid A 1989 A study of ten families of transposableelements on X chromosomes from a population of Drosophila mel-anogaster Genet Res 54113ndash125

Charlesworth B Lapid A Canada D 1992 The distribution of transpos-able elements within and between chromosomes in a population ofDrosophila melanogaster I Element frequencies and distributionGenet Res 60103ndash114

Chen GK Marjoram P Wall JD 2009 Fast and flexible simulation ofDNA sequence data Genome Res 19136ndash142

Chung H Bogwitz MR McCart C Andrianopoulos A Ffrench-ConstantRH Batterham P Daborn PJ 2007 Cis-regulatory elements in theaccord retrotransposon result in tissue-specific expression of theDrosophila melanogaster insecticide resistance gene Cyp6g1Genetics 1751071ndash1077

Cutter AD Good JM Pappas CT Saunders MA Starrett DM Wheeler TJ2005 Transposable element orientation bias in the Drosophila mel-anogaster genome J Mol Evol 6733ndash741

Daborn P Boundy S Yen J Pittendrigh B Ffrench-Constant R 2001DDT resistance in Drosophila correlates with Cyp6g1 over-expressionand confers cross-resistance to the neonicotinoid imidacloprid MolGenet Genomics 266556ndash563

Deloger M Cavalli FMG Lerat E Biemont C Sagot M-F Vieira C 2009Identification of expressed transposable element insertionsin the sequenced genome of Drosophila melanogaster Gene 43955ndash62

Dworkin I Palsson A Gibson G 2005 Replication of an Egfr-wing shapeassociation in a wild-caught cohort of Drosophila melanogasterGenetics 1692115ndash2125

Ewing B Green P 1998 Base-calling of automated sequencer tracesusing Phred II Error probabilities Genome Res 8186ndash194

Gruber JD Genissel A Macdonald SJ Long AD 2007 How repeatable areassociations between polymorphisms in achaete-scute and bristlenumber variation in Drosophila Genetics 1751987ndash1997

Haddrill PR Charlesworth B Halligan DL Andolfatto P 2005 Patterns ofintron sequence evolution in Drosophila are dependent upon lengthand GC content Genome Biol 6R67

Hahn MW Han MV Han S-G 2007 Gene family evolution across 12Drosophila genomes PLoS Genet 3e197

Hirschhorn JN Daly MJ 2005 Genome-wide association stud-ies for common diseases and complex traits Nat Rev Genet 695ndash108

2325

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 14: Abundance and Distribution of Transposable Elements in Two ...

generated to construct fastq files We then ran our TE detec-tion pipeline as above through the initial annotation stepThis process was repeated three times to generate a simula-tion of 15 25 and 50 coverage

Recombination Rates

We wrote an R routine to estimate local recombination ratesas a function of physical position Briefly we used ldquocyto-ge-netic-seqtsvrdquo from wwwflybaseorg (last accessed January 302009) to assign a set of physical landmarks to genetic posi-tions For each chromosome arm we then fit a local polyno-mial relating cumulative cM to cumulative basepair (usinglocpoly) Unlike many prior local polynomial fits we madethe added assumption that this polynomial must be a mono-tonically increasing function (using monoproc) an assump-tion that is true for real data We then fit a smooth spline tothe above curve (using smoothspline) The advantage ofsmoothspline is that the resulting curves can be differentiated(and hence estimates of rates obtained directly) Finally we fita smooth spline to the first derivatives obtained from theprior smooth spline largely to make the recombinationrates less noisy near centromeres and telomeres TheR code is freely available at wwwmolpopgenorgData (lastaccessed August 1 2013)

Data Set Restriction

Analyses performed on these data sets were restricted in twoways First we restricted our analyses to regions of thegenome with low TE density and moderate recombinationrates We restricted our data set in this way because regions ofhigh TE density make it more difficult to make accurate callsBecause our pipeline requires unique reads to positionallylocate the TE insertion regions with high TE density canlack enough unique sequence to generate these reads Theregions included in the analyses are X300000ndash208000002L200000ndash20100000 2R2300000ndash21000000 3L100000ndash21900000 3R600000ndash27800000 which are not centromericand telomeric regions This excludes the majority of TEs an-notated in the reference genome since these TEs are largelyclustered together in centromeric regions However this doesnot affect our analysis since we are interested in TEs in theeuchromatic portions of the genome

Second to restrict the set of reference TEs studied to full-length copies we removed from the analysis all reference TEsthat had a length that was less than 75 of the canonicalannotated length of the TE While these elements may stillproduce phenotypic effects these elements are unlikely to beactive copies This reduced the number of reference TEs in-cluded in the analysis to 607 from 1085 TEs that are present inthe reference genome in the regions included in our analysis

Additionally some analyses were performed on the subsetof 23 lines in the DGRP data set which had an average se-quencing coverage of 25 or higher hereafter referred to asDGRP25 This restriction was to mitigate the effect of differ-ences in coverage between the two data sets Simulations oftransposable element detection via our pipeline data not

shown indicate that TE detection at 25 is comparable todetection at 50

Site Count Spectra

We generated derived allele count spectra for TEs in the DSPRdata set and the DGRP25 set of lines We included only siteswhere we were able to make a positive presence or absencecall for each line in the data set so that the number of lineswas normalized for each resource for this analysis Observedvalues were compared with expected values at counts 1 2 34 5 6 and 7 under an infinite sites model via a 2 testCategories 7 and higher were grouped due to low counts insome of these categories We also calculated count spectra forSNPs present in small introns (introns86 bp [Haddrill et al2005]) a group unlikely to have functional effects We alsoused SoFoS a site frequency rescaler (Hufford et al 2012) torescale the DSPR data set to the sample size of the DGRP25from 15 to 23 samples to see if the rescaled data show adifferent distribution of element counts from the observedDGRP25 data We then did a KolmogorovndashSmirnov test tocompare the two distributions as a test of different selectiveeffects acting on the populations

Genomic Context of TEs

Transposable elements were identified in regions of low(lt2 cMMb) or high recombination (2 cMMb) FASTAfiles containing the annotated coordinates of 50 UTRs30 UTRs intronic intergenic and full gene span regions forversion 513 of the D melanogaster reference were down-loaded from flybase (wwwflybaseorg last accessed January30 2009) Version 513 of the reference annotation was usedfor comparative purposes with the DGRP analysis (Mackayet al 2012) Because alternative transcripts can result in thesame genomic position being classified in multiple categorieswe labeled each TE as every appropriate category This oc-curred with 52 of TEs in the DGRP and 60 of TEs in theDSPR When calculating the percentage of the referencegenome in our restricted areas of analysis that were classifiedas each genomic type we followed the same strategy

We compared our observations of TE insertions to a bino-mial distribution to determine whether TEs were distributedrandomly throughout the genome We used the pbinomfunction in R (R Development Core Team 2008) setting thenumber of trials to the total number of TE insertion locationsin the data set and the probability of success to the propor-tion of the genome defined as each context (intron exon30 UTR 50 UTR or intergenic) We did one tailed tests in eachcase For exon 30 UTR and 50 UTR tests we calculated theprobability of observing the observed value or fewer TEs Forintron and intergenic tests we calculated the probability ofobserving the observed value or more TEs

Variables Affecting TE Density

For the DGRP25 and DSPR data sets we calculatedTE density in terms of TEsMb For this analysis we alsoexcluded chromosome 4 We examined the effects of recom-bination rate chromosome and TE element family on TE

2324

Cridland et al doi101093molbevmst129 MBE

density using the standard linear model in R We usedthe lm() function to define our model as followsTEsMb ~ Line + Recombination Rate + Chromosome +Recombination Rate Chromosome We performed thisanalysis both for the set of all TEs and also for 15 TE familieswith moderate to high number of copies in our data set(supplementary table S4 Supplementary Material online)We selected individual TE families for this additional analysisbased on their level of abundance in the data sets and theirhaving been previously studied by Montgomery et al (1987)and Bartolome et al (2002)

Gene Families

We downloaded the Dfam database from (httpwwwindianaedu~hahnlabflyDfamDBdrosophila_frbhtml last acc-essed September 1 2012) This database describes all mem-bers of all gene families clustered using a fuzzy reciprocalBLAST method as described in Hahn et al (2007) for 12Drosophila species We used R to perform a linear regressionusing a simple model where the square root (TE density) in agene is the random variable and log10 (gene family size) is theindependent variable

Variation between Resources

For each TE present in both data sets we compared thefrequency of the TE in each population using Fisherrsquos exacttest We excluded from this analysis all insertions where wewere unable to make a presenceabsence call in at least 100 ofthe DGRP lines or 13 of the DSPR lines For each gene with TEinsertions in both data sets we compared the frequency atwhich the gene contains TEs between data sets using aFisherrsquos exact test

Supplementary MaterialSupplementary tables S1ndashS6 and figures S1 and S2 are avail-able at Molecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Casey Bergman and Miaomiao He for earlyaccess to their data as well as helpful comments from CBThey also thank Rebekah Rogers for help with editing themanuscript This work was supported by the NationalInstitute of Health grant NIH R01-GM085183 to KRT theNational Library of Medicine - National Institute of Healthtraining grant LM007443 to JMC and the National Instituteof Health grant NIH R01-OD010974 to ADL and SJM

ReferencesAltschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic local

alignment search tool J Mol Biol 215403ndash410Aminetzach YT Macpherson JM Petrov DA 2005 Pesticide resistance

via transposition-mediated adaptive gene truncation in DrosophilaScience 309764ndash767

Bartolome C Maside X Charlesworth B 2002 On the abundance anddistribution of transposable elements in the genome of Drosophilamelanogaster Mol Biol Evol 19926ndash937

Bennetzen JL 2000 Transposable element contributions to plant geneand genome evolution Plant Mol Biol 42251ndash269

Bergman CM Bensasson D 2007 Recent LTR retrotransposon insertioncontrasts with waves of non-LTR insertion since speciation inDrosophila melanogaster Proc Natl Acad Sci U S A 10411340ndash11345

Biemont C Vieira C 2006 Junk DNA as an evolutionary force Nature443521ndash524

Birchler JA Hiebert JC 1989 Interaction of the Enhancer of white-apricotwith transposable element alleles at the white locus in Drosophilamelanogaster Genetics 122129ndash138

Birchler JA Hiebert JC Rabinow L 1989 Interaction of the mottler ofwhite with transposable element alleles at the white locus inDrosophila melanogaster Genes Dev 373ndash84

Bowen NJ McDonald JF 2001 Drosophila euchromatic LTR retrotran-sposons are much younger than the host species in which theyreside Genome Res 111527ndash1540

Carr M Soloway JR Robinson TE Brookfield JF 2002 Mechanisms reg-ulating copy numbers of six LTR retrotransposons in the genome ofDrosophila melanogaster Chromosoma 110511ndash518

Charlesworth B 1994 The effect of background selection against dele-terious mutations on weakly selected linked variants Genet Res 63213ndash227

Charlesworth B Jarne P Assimacopoulos S 1994 The distribution oftransposable elements within and between chromosomes in a pop-ulation of Drosophila melanogaster III Element abundances in het-erochromatin Genet Res 64183ndash197

Charlesworth B Langley CH 1989 The population genetics ofDrosophila transposable elements Annu Rev Genet 23251ndash287

Charlesworth B Lapid A 1989 A study of ten families of transposableelements on X chromosomes from a population of Drosophila mel-anogaster Genet Res 54113ndash125

Charlesworth B Lapid A Canada D 1992 The distribution of transpos-able elements within and between chromosomes in a population ofDrosophila melanogaster I Element frequencies and distributionGenet Res 60103ndash114

Chen GK Marjoram P Wall JD 2009 Fast and flexible simulation ofDNA sequence data Genome Res 19136ndash142

Chung H Bogwitz MR McCart C Andrianopoulos A Ffrench-ConstantRH Batterham P Daborn PJ 2007 Cis-regulatory elements in theaccord retrotransposon result in tissue-specific expression of theDrosophila melanogaster insecticide resistance gene Cyp6g1Genetics 1751071ndash1077

Cutter AD Good JM Pappas CT Saunders MA Starrett DM Wheeler TJ2005 Transposable element orientation bias in the Drosophila mel-anogaster genome J Mol Evol 6733ndash741

Daborn P Boundy S Yen J Pittendrigh B Ffrench-Constant R 2001DDT resistance in Drosophila correlates with Cyp6g1 over-expressionand confers cross-resistance to the neonicotinoid imidacloprid MolGenet Genomics 266556ndash563

Deloger M Cavalli FMG Lerat E Biemont C Sagot M-F Vieira C 2009Identification of expressed transposable element insertionsin the sequenced genome of Drosophila melanogaster Gene 43955ndash62

Dworkin I Palsson A Gibson G 2005 Replication of an Egfr-wing shapeassociation in a wild-caught cohort of Drosophila melanogasterGenetics 1692115ndash2125

Ewing B Green P 1998 Base-calling of automated sequencer tracesusing Phred II Error probabilities Genome Res 8186ndash194

Gruber JD Genissel A Macdonald SJ Long AD 2007 How repeatable areassociations between polymorphisms in achaete-scute and bristlenumber variation in Drosophila Genetics 1751987ndash1997

Haddrill PR Charlesworth B Halligan DL Andolfatto P 2005 Patterns ofintron sequence evolution in Drosophila are dependent upon lengthand GC content Genome Biol 6R67

Hahn MW Han MV Han S-G 2007 Gene family evolution across 12Drosophila genomes PLoS Genet 3e197

Hirschhorn JN Daly MJ 2005 Genome-wide association stud-ies for common diseases and complex traits Nat Rev Genet 695ndash108

2325

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 15: Abundance and Distribution of Transposable Elements in Two ...

density using the standard linear model in R We usedthe lm() function to define our model as followsTEsMb ~ Line + Recombination Rate + Chromosome +Recombination Rate Chromosome We performed thisanalysis both for the set of all TEs and also for 15 TE familieswith moderate to high number of copies in our data set(supplementary table S4 Supplementary Material online)We selected individual TE families for this additional analysisbased on their level of abundance in the data sets and theirhaving been previously studied by Montgomery et al (1987)and Bartolome et al (2002)

Gene Families

We downloaded the Dfam database from (httpwwwindianaedu~hahnlabflyDfamDBdrosophila_frbhtml last acc-essed September 1 2012) This database describes all mem-bers of all gene families clustered using a fuzzy reciprocalBLAST method as described in Hahn et al (2007) for 12Drosophila species We used R to perform a linear regressionusing a simple model where the square root (TE density) in agene is the random variable and log10 (gene family size) is theindependent variable

Variation between Resources

For each TE present in both data sets we compared thefrequency of the TE in each population using Fisherrsquos exacttest We excluded from this analysis all insertions where wewere unable to make a presenceabsence call in at least 100 ofthe DGRP lines or 13 of the DSPR lines For each gene with TEinsertions in both data sets we compared the frequency atwhich the gene contains TEs between data sets using aFisherrsquos exact test

Supplementary MaterialSupplementary tables S1ndashS6 and figures S1 and S2 are avail-able at Molecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank Casey Bergman and Miaomiao He for earlyaccess to their data as well as helpful comments from CBThey also thank Rebekah Rogers for help with editing themanuscript This work was supported by the NationalInstitute of Health grant NIH R01-GM085183 to KRT theNational Library of Medicine - National Institute of Healthtraining grant LM007443 to JMC and the National Instituteof Health grant NIH R01-OD010974 to ADL and SJM

ReferencesAltschul SF Gish W Miller W Myers EW Lipman DJ 1990 Basic local

alignment search tool J Mol Biol 215403ndash410Aminetzach YT Macpherson JM Petrov DA 2005 Pesticide resistance

via transposition-mediated adaptive gene truncation in DrosophilaScience 309764ndash767

Bartolome C Maside X Charlesworth B 2002 On the abundance anddistribution of transposable elements in the genome of Drosophilamelanogaster Mol Biol Evol 19926ndash937

Bennetzen JL 2000 Transposable element contributions to plant geneand genome evolution Plant Mol Biol 42251ndash269

Bergman CM Bensasson D 2007 Recent LTR retrotransposon insertioncontrasts with waves of non-LTR insertion since speciation inDrosophila melanogaster Proc Natl Acad Sci U S A 10411340ndash11345

Biemont C Vieira C 2006 Junk DNA as an evolutionary force Nature443521ndash524

Birchler JA Hiebert JC 1989 Interaction of the Enhancer of white-apricotwith transposable element alleles at the white locus in Drosophilamelanogaster Genetics 122129ndash138

Birchler JA Hiebert JC Rabinow L 1989 Interaction of the mottler ofwhite with transposable element alleles at the white locus inDrosophila melanogaster Genes Dev 373ndash84

Bowen NJ McDonald JF 2001 Drosophila euchromatic LTR retrotran-sposons are much younger than the host species in which theyreside Genome Res 111527ndash1540

Carr M Soloway JR Robinson TE Brookfield JF 2002 Mechanisms reg-ulating copy numbers of six LTR retrotransposons in the genome ofDrosophila melanogaster Chromosoma 110511ndash518

Charlesworth B 1994 The effect of background selection against dele-terious mutations on weakly selected linked variants Genet Res 63213ndash227

Charlesworth B Jarne P Assimacopoulos S 1994 The distribution oftransposable elements within and between chromosomes in a pop-ulation of Drosophila melanogaster III Element abundances in het-erochromatin Genet Res 64183ndash197

Charlesworth B Langley CH 1989 The population genetics ofDrosophila transposable elements Annu Rev Genet 23251ndash287

Charlesworth B Lapid A 1989 A study of ten families of transposableelements on X chromosomes from a population of Drosophila mel-anogaster Genet Res 54113ndash125

Charlesworth B Lapid A Canada D 1992 The distribution of transpos-able elements within and between chromosomes in a population ofDrosophila melanogaster I Element frequencies and distributionGenet Res 60103ndash114

Chen GK Marjoram P Wall JD 2009 Fast and flexible simulation ofDNA sequence data Genome Res 19136ndash142

Chung H Bogwitz MR McCart C Andrianopoulos A Ffrench-ConstantRH Batterham P Daborn PJ 2007 Cis-regulatory elements in theaccord retrotransposon result in tissue-specific expression of theDrosophila melanogaster insecticide resistance gene Cyp6g1Genetics 1751071ndash1077

Cutter AD Good JM Pappas CT Saunders MA Starrett DM Wheeler TJ2005 Transposable element orientation bias in the Drosophila mel-anogaster genome J Mol Evol 6733ndash741

Daborn P Boundy S Yen J Pittendrigh B Ffrench-Constant R 2001DDT resistance in Drosophila correlates with Cyp6g1 over-expressionand confers cross-resistance to the neonicotinoid imidacloprid MolGenet Genomics 266556ndash563

Deloger M Cavalli FMG Lerat E Biemont C Sagot M-F Vieira C 2009Identification of expressed transposable element insertionsin the sequenced genome of Drosophila melanogaster Gene 43955ndash62

Dworkin I Palsson A Gibson G 2005 Replication of an Egfr-wing shapeassociation in a wild-caught cohort of Drosophila melanogasterGenetics 1692115ndash2125

Ewing B Green P 1998 Base-calling of automated sequencer tracesusing Phred II Error probabilities Genome Res 8186ndash194

Gruber JD Genissel A Macdonald SJ Long AD 2007 How repeatable areassociations between polymorphisms in achaete-scute and bristlenumber variation in Drosophila Genetics 1751987ndash1997

Haddrill PR Charlesworth B Halligan DL Andolfatto P 2005 Patterns ofintron sequence evolution in Drosophila are dependent upon lengthand GC content Genome Biol 6R67

Hahn MW Han MV Han S-G 2007 Gene family evolution across 12Drosophila genomes PLoS Genet 3e197

Hirschhorn JN Daly MJ 2005 Genome-wide association stud-ies for common diseases and complex traits Nat Rev Genet 695ndash108

2325

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 16: Abundance and Distribution of Transposable Elements in Two ...

Hudson RR Kaplan NL 1995 Deleterious background selection withrecombination Genetics 1411605ndash1617

Hufford MB Xu X van Heerwaarden J et al (25 co-authors) 2012Comparative population genomics of maize domestication andimprovement Nat Genet 44808ndash811

Kaminker JS Bergman CM Kronmiller B et al (12 co-authors) 2002The transposable elements of the Drosophila melanogastereuchromatin a genomics perspective Genome Biol3RESEARCH0084

Kapitonov VV Jurka J 2003 Molecular paleontology of transposableelements in the Drosophila melanogaster genome Proc Natl AcadSci U S A 1006569ndash6574

Kazazian HH 2004 Mobile elements drivers of genome evolutionScience 3031626ndash1632

Kidwell MG 2002 Transposable elements and the evolution of genomesize in eukaryotes Genetica 11549ndash63

King EG Macdonald SJ Long AD 2012a Properties and power of theDrosophila synthetic population resource for the routine dissectionof complex traits Genetics 191935ndash949

King EG Merkes CM McNeil CL Hoofer SR Sen S Broman KW LongAD Macdonald SJ 2012b Genetic dissection of a model complextrait using the Drosophila synthetic population resource GenomeRes 221558ndash1566

Kofler R Betancourt AJ Schlotterer C 2012 Sequencing of pooled DNAsamples (Pool-Seq) uncovers complex dynamics of transposableelement insertions in Drosophila melanogaster PLoS Genet 8e1002487

Langley CH Montgomery E Hudson R Kaplan N Charlesworth B 1988On the role of unequal exchange in the containment of transpos-able element copy number Genet Res 52223ndash235

Li H Durbin R 2009 Fast and accurate short readalignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760

Linheiro RS Bergman CM 2012 Whole genome resequencing revealsnatural target site preferences of transposable elements inDrosophila melanogaster PLoS One 7e30008

Lipatov M Lenkov L Petrov DA Bergman CM 2005 Paucity of chimericgene-transposable elements in the Drosophila melanogastergenome BMC Biol 324

Long AD Lyman RF Morgan AH Langley CH Mackay TFC 2000 Bothnaturally occurring insertions of transposable elements and inter-mediate frequency polymorphisms at the achaete-scute complexare associated with bristle number in Drosophila melanogasterGenetics 1541255ndash1269

Lonnig W-E Saedler H 2002 Chromosome rearrangements and trans-posable elements Annu Rev Genet 36389ndash410

Macdonald SJ Long AD 2007 Joint estimates of quantitativetrait locus effect and frequency using synthetic recombinantpopulations of Drosophila melanogaster Genetics 1761261ndash1281

Macdonald SJ Pastinen T Long AD 2005 The effect of polymorphismsin the Enhancer of split gene complex on bristle number variation ina large wild-caught cohort of Drosophila melanogaster Genetics 1711741ndash1756

Mackay TFC 1984 Jumping genes meet abdominal bristles hybrid dys-genesis-induced quantitative variation in Drosophila melanogasterGenet Res 44231ndash237

Mackay TFC 1989 Transposable elements and fitness in Drosophilamelanogaster Genome 31284ndash295

Mackay TFC 2010 Mutations and quantitative genetic variationlessons from Drosophila Philos Trans R Soc Lond B Biol Sci 3651229ndash1239

Mackay TFC Langley CH 1990 Molecular and phenotypic variation inthe achaete-scute region of Drosophila melanogaster Nature 34864ndash66

Mackay TFC Lyman RF Jackson MS 1992 Effects of P element inser-tions on quantitative traits in Drosophila melanogaster Genetics 130315ndash332

Mackay TFC Richards S Stone EA et al (52 co-authors) 2012 TheDrosophila melanogaster genetic reference panel Nature 482173ndash178

Meyer LR Zweig AS Hinrichs AS et al (32 co-authors) 2013 The UCSCgenome browser database extensions and updates 2013 NucleicAcids Res 41D64ndashD69

Montgomery E Charlesworth B Langley CH 1987 A test for the role ofnatural selection in the stabilization of transposable element copynumber in a population of Drosophila melanogaster Genet Res 4931ndash41

Montgomery EA Huang S-M Langley CH Judd BH 1991 Chromosomerearrangement by ectopic recombination in Drosophilamelanogaster genome structure and evolution Genetics 1291085ndash1098

NCI-NHGRI Working Group on Replication in Association Studies2007 Replicating genotype-phenotype associations Nature 447655ndash660

Nuzhdin SV Mackay TFC 1994 The genomic rate of transposable ele-ment movement in Drosophila melanogaster Mol Biol Evol 12180ndash181

Nuzhdin SV Pasyukova EG Mackay TFC 1997 Accumulation of trans-posable elements in laboratory lines of Drosophila melanogasterGenetica 100167ndash175

Ometto L Stephan W Lorenzo DD 2005 Insertiondeletion andnucleotide polymorphism data reveal constraints in Drosophilamelanogaster introns and intergenic regions Genetics 1691521ndash1527

Palsson A Gibson G 2004 Association between nucleotide variation inEgfr and wing shape in Drosophila melanogaster Genetics 1671187ndash1198

Petrov DA Fiston-Lavier A-S Lipatov M Lenkov K Gonzalez J 2011Population genomics of transposable elements in Drosophila mela-nogaster Mol Biol Evol 281633ndash1644

Presgraves DC 2006 Intron length evolution in Drosophila Mol Biol Evol132203ndash2213

Quinian AR Hall IM 2010 BEDTools a flexible suite of utilities forcomparing genomic features Bioinformatics 26841ndash842

R Development Core Team 2008 R a language and environment forstatistical computing Vienna Austria R Foundation for StatisticalComputing [cited 2013 May 1] Available from httpwwwR-projectorg

Rizzon C Marais G Gouy M Biemont C 2002 Recombination rate andthe distribution of transposable elements in the Drosophila mela-nogaster genome Genome Res 12400ndash407

Sackton TB Kulathinal RJ Bergman CM Quinlan AR Dopman EBCarneiro M Marth GT Hartl DL Clark AG 2009 Population geno-mic inferences from sparse high-throughput sequencing of twopopulations of Drosophila melanogaster Genome Biol Evol 1449ndash465

Sakurai M Aoki T Yoshikawa S et al (11 co-authors) 2009 Differentiallyexpressed Drl and Drl-2 play opposing roles in Wnt5 signalingduring Drosophila olfactory system development J Neurosci 294972ndash4980

Schmidt JM Good RT Appleton B et al (11 co-authors) 2010Copy number variation and transposable elements feature inrecent ongoing adaptation at the Cyp6g1 locus PLoS Genet 6e1000998

Shrimpton AE Mackay TFC Brown AJL 1990 Transposable element-in-duced response to artificial selection in Drosophila melanogastermolecular analysis of selection lines Genetics 125803ndash811

Slotkin RK Martienssen R 2007 Transposable elements and the epige-netic regulation of the genome Nat Rev Genet 8272ndash285

Smith PA Corces VG 1991 Drosophila transposable elements mecha-nisms of mutagenesis and interactions with the host genome AdvGenet 29229ndash300

Thornton K 2003 Libsequence a C ++ class library for evolutionarygenetic analysis Bioinformatics 192325ndash2327

Vazquez JF Albornoz J Dominguez A 2007 Direct determination of theeffects of genotype and extreme temperature on the transposition

2326

Cridland et al doi101093molbevmst129 MBE

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE

Page 17: Abundance and Distribution of Transposable Elements in Two ...

of roo in long-term mutation accumulation lines of Drosophilamelanogaster Mol Genet Genomics 278653ndash664

Voight BF Pritchard JK 2005 Confounding from cryptic relatedness incase-control association studies PLoS Genet 1e32

Wakeley J 2009 Coalescent theory an introduction Greenwood VillageRoberts and Company Publishers

Wicker T Sabot F Hua-Van A et al (13 co-authors) 2007 A unifiedclassification system for eukaryotic transposable elements Nat RevGenet 8973ndash982

Xing J Witherspoon DJ Jorde LB 2013 Mobile element biology newpossibilities with high-throughput sequencing Trends Genet 29280ndash289

Yoshikawa S McKinnon RD Kokel M Thomas JB 2003 Wnt-mediated axon guidance via Drosophila derailed receptor Nature422583ndash588

Zachar Z Bingham PM 1982 Regulation of white locus expression thestructure of mutant alleles at the white locus of Drosophila mela-nogaster Cell 30529ndash541

2327

Transposable Elements in Two Drosophila QTL Mapping Resources doi101093molbevmst129 MBE