Primates and mouse NumtS in the UCSC Genome Browser



RESEARCH Open Access

Francesco Maria Calabrese, Domenico Simone, Marcella Attimonelli*

From Eighth Annual Meeting of the Italian Society of Bioinformatics (BITS), Pisa, Italy. 20-22 June 2011

Abstract

Background: NumtS (Nuclear MiTochondrial Sequences) are mitochondrial DNA sequences that, after stress events involving the mitochondrion, colonized the nuclear genome. Accurate mapping of NumtS avoids contamination during mtDNA PCR amplification, thus supplying reliable bases for detecting false heteroplasmies. In addition, since they commonly populate mammalian genomes (especially primates) and are polymorphic, in terms of presence/absence and content of SNPs, they may be used as evolutionary markers in intra- and inter-species population analyses.

Results: The need for an exhaustive NumtS annotation led us to produce the Reference Human NumtS compilation, followed, as reported in this paper, by those for chimpanzee, rhesus macaque and mouse. Identification of NumtS inside the UCSC Genome Browser and their inter-species comparison required the design and the implementation of NumtS tracks, starting from the compilation data. NumtS retrieval through the UCSC Genome Browser, in the species examined, is now feasible at a glance.

Conclusions: Analyses involving NumtS tracks, together with other genome element tracks publicly available at the UCSC Genome Browser, can provide deep insight into genome evolution and comparative genomics, thus improving studies dealing with the mechanisms that drove the generation of NumtS. In addition, the NumtS tracks constitute a useful tool in the design of mitochondrial DNA primers.

* Correspondence: [email protected]
Dipartimento di Bioscienze, Biotecnologie e Scienze Farmacologiche, Università di Bari, Bari, 70126, Italy

Calabrese et al. BMC Bioinformatics 2012, 13(Suppl 4):S15
http://www.biomedcentral.com/1471-2105/13/S4/S15

© 2012 Calabrese et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Colonization of the nuclear genome by mitochondrial DNA fragments began shortly after endosymbiosis and, while still ongoing, has laid the groundwork for a scenario composed of recombination events which generated duplications and de novo insertions. The integration or rather the “capture” of these fragments may occur during repair of double-strand breaks in nuclear DNA by NHEJ (Non-Homologous End-Joining) recombinations. These events arise due to the action of endogenous and exogenous agents such as ionizing radiation and are strictly dependent on the rate of double-strand breaks in nuclear DNA [1,2]. In some organisms, particularly primates, the same mitochondrial region occurs several times along the nuclear genome, showing how duplication events have driven the spread of NumtS among these species [1-3]. This is also indicated by the discrepancies in NumtS content between human and chimpanzee, although they are closely located in the phylogenetic tree. Another important feature of NumtS concerns the intra-specific polymorphic state, both in terms of presence/absence and SNP content: they may occur in homo- or heterozygosis or be absent in various individuals at specific loci. This feature makes them suitable for human and primate population analyses, since the species most enriched in NumtS belong to this branch of the phylogenetic tree [4,5].

Although several clinical studies report data on NumtS involvement in causing heteroplasmy artefacts of mtDNA PCR amplification, no work published so far reports tools that allow NumtS browsing. NumtS content among mammalian and non-mammalian species has also been investigated, but the data published so far do not allow in-depth knowledge of the genomic rearrangements in which NumtS are involved. The construction of annotation tracks within the UCSC Genome Browser will certainly improve such investigations.

Setting the human “NumtSome” catalog on the hg18 build allowed the complete collection of a nuclear-mitochondrial “library” comprising more than 500 NumtS, i.e. RHNumtS.2 [6], and new compilations for other organisms have also been produced. In more detail, the same optimized protocol which allowed the detection of human NumtS, based on in silico hybridization between the human mtDNA reference sequence (rCRS) [7] and the human nuclear genome assembly, was applied to Pan troglodytes and Mus musculus. Our choice fell on those species which, at the time of the inspection (2010), were reported in the NCBI genome statistics pages as mammalian genomes in a complete state of assembly. In order to take into account an outgroup for the hominoid branch, the NumtS annotation in rhesus macaque (Macaca mulatta) was also performed, although its genome was, at the time of the study, in a draft state of assembly.

As regards chimpanzee and mouse, although several studies have published the number of NumtS and their genome coverage [3-5,8,9], no data about genome positions are available. Here we report an improvement in terms of the quantity and quality of data and the complete annotations of NumtS. This kind of data for rhesus macaque has never been published. The NumtS compilations produced for the above-mentioned species allowed the design of tracks inside the UCSC Genome Browser. The growing demand for genomic details in these species, which are commonly used within the UCSC Genome Browser, encouraged us to publish the NumtS mouse (mm9 build) and human (hg19 build) tracks. Annotations about NumtS in the species (chimpanzee and rhesus macaque) which, in our opinion, will be extensively used, can be uploaded as NumtS custom annotation tracks from our local server. Our purpose is to provide proper support to NumtS surveys and to facilitate comparative NumtS analyses, an aspect which has previously been carried out only in chimpanzee/human comparisons [3-5].

Methods

Blasting of nuclear genome versus mitochondrial genome

Bl2seq, which implements the BlastN program (release 2.2.19 of the Blast suite) thus allowing pairwise comparisons, was launched on the entire chromosomal set for each species: 24 for Pan troglodytes (panTro2 build) and 21 each for Macaca mulatta (rheMac2 build) and Mus musculus (mm9 build). The entire chromosome dataset for each build was downloaded from [10]. Job executions were performed on a local server. The accession numbers of the mtDNA reference sequences used as queries are NC_005943 for rhesus macaque, NC_001643 for chimpanzee and NC_005089 for mouse. The same protocol applied to the generation of the human compilation (RHNumtS.2) [6] was used for the other three species, with scoring parameters fixed as follows: 2 for match reward, -3 for mismatch penalty, -5 for gap opening, -2 for gap extension. The expected value (e-value) was fixed at 1e-03. Each fragment of each chromosome aligned with the mtDNA whose e-value was lower than the fixed threshold produced an HSP (High Scoring Pair). Each HSP was associated with a NumtS. The complete list of NumtS loci for each species was annotated in a spreadsheet (NumtS compilation).
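The e-value filter described above can be illustrated with a minimal Python sketch. It is not the authors' script: the seven-column tab-separated layout, the `Hsp` type and the function names are assumptions made for the example, and the two sample records are invented. Only the 1e-03 threshold comes from the text.

```python
from typing import Iterable, List, NamedTuple

E_VALUE_THRESHOLD = 1e-3  # threshold stated in the Methods


class Hsp(NamedTuple):
    """One High Scoring Pair (hypothetical record layout for this sketch)."""
    chrom: str
    chrom_start: int
    chrom_end: int
    mt_start: int
    mt_end: int
    identity: float
    e_value: float


def parse_hsp(line: str) -> Hsp:
    """Parse one tab-separated record in the assumed seven-column layout."""
    chrom, c_start, c_end, m_start, m_end, identity, e_value = (
        line.rstrip("\n").split("\t")
    )
    return Hsp(chrom, int(c_start), int(c_end), int(m_start), int(m_end),
               float(identity), float(e_value))


def significant_hsps(lines: Iterable[str],
                     threshold: float = E_VALUE_THRESHOLD) -> List[Hsp]:
    """Keep only HSPs whose e-value is lower than the threshold."""
    return [h for h in map(parse_hsp, lines) if h.e_value < threshold]


if __name__ == "__main__":
    # Invented records, for illustration only.
    records = [
        "chr1\t1000\t1500\t200\t700\t85.2\t1e-40",
        "chr1\t9000\t9040\t300\t340\t70.0\t0.5",
    ]
    for hsp in significant_hsps(records):
        print(hsp.chrom, hsp.chrom_start, hsp.chrom_end, hsp.e_value)
```

Each record that passes the filter would correspond to one row of the NumtS compilation spreadsheet.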

NumtS assembly

NumtS mapping less than 2000 bp apart on the same chromosome, and corresponding to two mtDNA fragments at most 2000 bp apart and oriented in the same direction, were combined into a single NumtS, called an “assembled NumtS”. This was done by spreadsheet interpolation and manual inspection, well supported by graphical display on the UCSC Genome Browser and with tools available within the Galaxy platform [11-14]. As in the generation of human assembled NumtS, the fragment-joining protocol was slightly modified for HSPs separated by long repetitive elements, in order to compare fragments which followed the same evolutionary history, at least within the primate lineage.
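The joining rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used, which relied on spreadsheet work and manual inspection. The `Hsp` class and function names are hypothetical, only consecutive HSPs along a chromosome are compared, the circularity of the mitochondrial genome is ignored, and the special handling of HSPs separated by long repetitive elements is not modelled.

```python
from dataclasses import dataclass
from typing import List

MAX_GAP = 2000  # bp, distance limit stated in the text


@dataclass
class Hsp:
    chrom: str
    start: int      # nuclear start
    end: int        # nuclear end
    mt_start: int   # mitochondrial start
    mt_end: int     # mitochondrial end
    strand: str     # "+" or "-"


def gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Distance between two intervals; 0 if they overlap or touch."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def can_join(a: Hsp, b: Hsp) -> bool:
    """Same chromosome and orientation, nuclear gap below 2000 bp,
    mitochondrial gap of at most 2000 bp."""
    return (a.chrom == b.chrom
            and a.strand == b.strand
            and gap(a.start, a.end, b.start, b.end) < MAX_GAP
            and gap(a.mt_start, a.mt_end, b.mt_start, b.mt_end) <= MAX_GAP)


def assemble(hsps: List[Hsp]) -> List[List[Hsp]]:
    """Group HSPs into assembled NumtS by chaining neighbours that can be joined."""
    groups: List[List[Hsp]] = []
    for h in sorted(hsps, key=lambda x: (x.chrom, x.start)):
        if groups and can_join(groups[-1][-1], h):
            groups[-1].append(h)
        else:
            groups.append([h])
    return groups
```

A group holding more than one HSP would correspond to an assembled NumtS; singleton groups stay as they are.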

NumtS track implementation

Starting from the NumtS compilation for each species, four tracks were implemented: “NumtS” and “NumtS Assembled” tracks, displaying data from the corresponding compilations; the “NumtS on mitochondrion” track, showing mapping of NumtS on the mitochondrial genome, and the “NumtS on mitochondrion with mismatches” track, showing mismatches between NumtS sequence and mitochondrial genome. Since the mitochondrial chromosome is not available in the UCSC Genome Browser for Macaca mulatta, the “NumtS on mitochondrion” and the “NumtS on mitochondrion with mismatches” tracks were not designed for this species.

Creating BED and BAM tracks

UCSC Genome Browser tracks may be produced and configured in a variety of ways, to highlight the features needed. The format used for “NumtS”, “NumtS Assembled” and “NumtS on mitochondrion” tracks is the Browser Extensible Data format (BED), which is a flexible way of showing genomic annotations. In-house shell and Python scripts were used to produce BED and BAM formatted data and to implement inter-NumtS track links, starting from the UCSC description pages, providing further insights when switching from the nuclear locus of a NumtS to its mitochondrial counterpart. The BAM format, the compressed binary version of the Sequence Alignment/Map (SAM) format, was used for the “NumtS on mitochondrion with mismatches” track generation. The LASTZ software [15] was used to realign each NumtS to its mitochondrial sequence, using default parameters and setting the SAM format as output. SAM files were converted to BAM format with the SAMtools package [16]. Figure 1 shows the flowchart followed to obtain the NumtS tracks, starting from the Blastn run that gives the HSPs and passing through the NumtS compilations.
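As an illustration of the BED step, the sketch below turns one compilation record into a six-column BED line (chrom, chromStart, chromEnd, name, score, strand). It is a hypothetical helper, not one of the in-house scripts. It assumes the input coordinates are 1-based and inclusive, as alignment tools usually report them, whereas BED uses a 0-based start and an exclusive end. The coordinates in the usage line are invented, and the ID only follows the Ptr_NumtS_xxx format.

```python
def bed_line(chrom: str, start_1based: int, end_1based: int,
             name: str, score: int, strand: str) -> str:
    """Format one record as a tab-separated BED6 line.

    Assumes 1-based inclusive input coordinates; BED stores a 0-based
    start and an end that is not included in the feature.
    """
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    if not 0 <= score <= 1000:
        raise ValueError("BED scores range from 0 to 1000")
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("invalid coordinates")
    fields = [chrom, str(start_1based - 1), str(end_1based),
              name, str(score), strand]
    return "\t".join(fields)


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    print(bed_line("chr1", 101, 300, "Ptr_NumtS_001", 900, "+"))
```

Lines produced this way can be pasted into the browser's custom-track form or collected in a file, one record per line.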

Remapping of human tracks on hg19

The revised reference human NumtS compilation (RHNumtS.2), already implemented inside the UCSC Genome Browser according to the hg18 build, was upgraded to the GRCh37 build (hg19). Genomic coordinates were converted by using the “Lift-Over” tool available through the Galaxy suite. The human hg19 compilation is available in additional file 1, RHNumtS.2 compilation.

Comparison with previously published compilations

To compare NumtS intervals with those available in other previously published NumtS compilations, the “Lift-Over” tool, based on the Lift-Over utility and Chain track, and the “Join” tool were used. Both are freely available in the Galaxy suite at the UCSC Genome Browser [14].
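The idea behind an interval join can be sketched in a few lines of Python. This is not the Galaxy “Join” tool, only a simplified stand-in: the tuple layout, the function names and the half-open coordinate convention are assumptions, and both interval sets are taken to be on the same assembly already (that is, after any lift-over).

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def join_by_overlap(ours: List[Interval],
                    published: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Pair every published interval with each of our intervals it overlaps."""
    return [(p, o) for p in published for o in ours if overlaps(p, o)]


def unmatched(ours: List[Interval],
              published: List[Interval]) -> List[Interval]:
    """Published intervals that overlap none of ours."""
    return [p for p in published if not any(overlaps(p, o) for o in ours)]
```

The quadratic scan is enough for compilations of a few hundred NumtS; larger datasets would call for sorted intervals or an interval tree.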

Results

NumtS insertions appear to have been more or less continuous over time in the lineages leading to the human genome [2-5,8]. After RHNumtS.2, in view of the proximity between human and chimpanzee (5-7 million years) and taking into account the level of similarity between their genomes, the NumtS comparison between them is an important step in the evolutionary study of NumtS. Thus, by applying the same protocol used for RHNumtS.2, we produced the RPNumtS, RRNumtS and RMNumtS compilations for Pan troglodytes, Macaca mulatta and Mus musculus, respectively. Detailed analyses of the quantity of the species-specific NumtS insertion/duplication events which occurred after the chimpanzee-human divergence are needed to understand the underlying phylogenetic relationships. We produced the UCSC tracks with this aim in mind.

Gathering NumtS in Pan troglodytes

The RPNumtS (Reference Pan troglodytes NumtS) compilation (additional file 2, RPNumtS compilation) annotates 776 NumtS, including 117 NumtS assembled by applying the criteria described in the Methods section. The nuclear spans range from 33 bp to 8984 bp, with a percentage of similarity ranging from 64.4 to 100 (Figure 2). An ID was assigned to each chimpanzee NumtS according to the format Ptr_NumtS_xxx, where Ptr stands for Pan troglodytes and xxx is a three-digit code.

Gathering NumtS in Macaca mulatta

The number of HSPs returned by the similarity search analysis in rhesus macaque was 751, 113 of which were concatenated. The nuclear spans range from 28 bp to 6642 bp with a percentage of similarity ranging from 64 to 100 (Figure 2). The ID format used for the rhesus macaque NumtS was Rhm_NumtS_xxx, where Rhm stands for rhesus macaque. The RRNumtS (Reference Rhesus NumtS) compilation is available in additional file 3, RRNumtS compilation.

Gathering NumtS in Mus musculus

The number of HSPs returned by the Blastn on the mouse genome is 172, of which 13 were assembled. The nuclear spans range from 33 bp to 4654 bp with a percentage of similarity ranging from 66 to 100 (Figure 2). For mouse, the nomenclature follows the same rule as for the other species: Mms_NumtS_xxx, where Mms stands for Mus musculus. The RMNumtS (Reference Mouse NumtS) compilation is available in additional file 4, RMNumtS compilation.

Figure 1 NumtS track flowchart. Step-by-step protocol, implemented to produce both NumtS compilations and NumtS tracks in the four species examined.


Browsing the NumtS tracks inside the UCSC Genome Browser

Data in additional files 1, 2, 3 and 4 were used to set up the “NumtS Sequence” track group on the UCSC Genome Browser (“Variation and Repeats” section) for human on hg19, chimpanzee, rhesus macaque and mouse, respectively. Implementation of the NumtS compilations on the UCSC Genome Browser for human and mouse data facilitates access to the NumtSome annotations. Once connected to UCSC [17], the presence of any NumtS in a human (hg18 or hg19 build) or murine (mm9) genomic region of interest (selected by typing its coordinates in the “position/term” search box) can be checked; alternatively, by typing a NumtS identifier in the same search box, a list of tracks available for that NumtS appears. Clicking on one of the results displays the selected item in the Genome Browser. For each NumtS, the description page reports a link which guides users in the shift from the NumtS nuclear position to its mitochondrial counterpart and vice versa. Because tracks can easily connect annotation data, they provide a very useful tool in understanding the evolutionary dynamics in which NumtS are involved and characterizing their genomic context.

Pan troglodytes NumtS tracks, available on our local server, can be uploaded as custom annotation tracks starting from the “NumtS on mitochondrion” track with the link reported in the reference section [18]. Starting from the mitochondrial track, clicking on any NumtS shows an external page providing further specifications about that NumtS (together with the others) in UCSC Genome Browser “detail page” fashion. This procedure substitutes the “NumtS Sequence” track activation described above for human and mouse tracks. For M. mulatta custom track activation, the lack of the mitochondrial genome in the rhesus macaque build means that the starting link points to a nuclear location, at the same time activating the NumtS track. The link is reported in the reference section [19].

Comparisons with previously published compilations

Other research groups have already reported NumtS overviews in primates and mouse. In order to demonstrate the improvements obtained with our approach, comparisons between our compilations and publicly available data for chimpanzee [3,5,9], rhesus macaque [3,9] and mouse [8,9] are reported here. The results are listed in additional file 5, NumtS compilations comparison. HSP number and base pair coverage on the entire genome, as well as the same features for the assembled NumtS, are given. Differences in compilation contents for the same organism are due to differences between one release and another and to the Blastn algorithm parameters chosen. The only NumtS locations available are those published in [5] concerning NumtS in Pan troglodytes. Merging of these annotations with our compilation shows that all the items reported in [5] are included within our dataset. Further details about the comparisons of NumtS compilations are given in additional file 6, RPNumtS annotation comparison.

Figure 2 Comparison of the box plots for NumtS lengths and similarity distributions among species examined. a) Box plot based on distribution of NumtS percentage of similarity with respect to corresponding mitochondrial reference genome in three species examined. Mus musculus has the highest values. b) Box plots for NumtS lengths, based on mouse, chimpanzee and rhesus macaque compilations. Rhesus macaque shows highest median and maximum values.


Discussion and conclusions

The above protocol produced 776, 751 and 173 HSPs in chimpanzee, rhesus macaque and mouse, respectively. These data confirm and extend the NumtS catalog for different species, published so far and summarised in [8]. To test the statistical significance of the NumtS lengths in each species, we have used a Pearson correlation matrix (additional file 7, NumtS and genome features correlation matrix), highlighting a strong correlation between genome size and total amount of NumtS. Figures 2a and 2b show that mouse NumtS are the fewest and the shortest, but also the most highly conserved. The availability of the tracks will allow us to go into further detail on this aspect. In order to facilitate NumtS inter- and intra-species studies on the basis of our compilations, we produced the UCSC Genome Browser tracks, representing a reliable and useful way of recognizing and interrelating them with other genomic elements. Among the most important advantages linked to track implementation are i) the comparison of NumtS data with any other type of genomic element annotation through tracks in the browser, thus allowing NumtS to be mapped in syntenic regions as well as in repetitive regions; ii) the possibility of retrieving the entire set of a track data through the “Tables” section, allowing a custom NumtS database to be produced; iii) the possibility of checking for risk of co-amplification in the process of mtDNA amplification [20].

Figure 3 Browsing UCSC NumtS tracks to identify orthologous NumtS. Workflow shown assesses presence of identical NumtS in two species for which the UCSC annotation tracks were produced. Graphs show two human NumtS (HSA_NumtS_006_b1, HSA_NumtS_009_b1) and ChimpNet track, indicating that the former is also present in chimpanzee, whereas the latter is not. Following flowchart (and clicking on links and boxes marked with red arrows), this hypothesis can be verified on Chimp Browser. Alignment provided by Net tracks provides additional evidence of presence or absence of NumtS in species examined. See additional file 8, Figure 2 screenshots for magnified versions of screenshots.

Figure 4 Repeated elements within NumtS: content comparisons among species examined. a) Total amount of RE within NumtS genomic intervals, with and without flanking regions (1000 bp upstream and downstream), calculated for each species. Trend in any of species examined reveals more REs in flanking regions with respect to NumtS region alone, with greater emphasis on mouse genome. b) Total RE content (bp and number) in NumtS loci on total NumtS length and number respectively, calculated as percentage. c) Percentage of RE (bp) within NumtS loci on RE content within whole genome. Although mouse genome is less populated with REs, its NumtS are populated by REs with same percentage observed in other species examined.

Table 1 Chimpanzee NumtS annotation as reported in the compilation limited to the mitochondrial region 1006-1771 considered in the Primer selection example

#chrom  chromStart  chromEnd  NumtS_ID          score  strand
chrM    318         2790      Ptr_NumtS_376_b1  983    +
chrM    0           2116      Ptr_NumtS_443_b1  951    -
chrM    457         2391      Ptr_NumtS_442_b1  944    -
chrM    0           2512      Ptr_NumtS_379_b2  840    +
chrM    0           2512      Ptr_NumtS_315_b2  838    +
chrM    0           2831      Ptr_NumtS_667_b2  808    -
chrM    835         2448      Ptr_NumtS_127_b1  805    +
chrM    0           2586      Ptr_NumtS_192_b2  794    +
chrM    19          2227      Ptr_NumtS_106_b2  790    -
chrM    703         2783      Ptr_NumtS_371_b1  790    +
chrM    21          5619      Ptr_NumtS_039_b1  786    -
chrM    79          2939      Ptr_NumtS_343_b2  785    -
chrM    0           2249      Ptr_NumtS_604_b1  779    -
chrM    0           2236      Ptr_NumtS_423_b1  775    +
chrM    19          5841      Ptr_NumtS_073_b2  774    +
chrM    4           4498      Ptr_NumtS_335_b2  770    +
chrM    84          9112      Ptr_NumtS_194_b2  770    -
chrM    0           1786      Ptr_NumtS_516_b1  767    -
chrM    0           4603      Ptr_NumtS_088_b1  765    -
chrM    0           4321      Ptr_NumtS_699_b2  758    +
chrM    0           4343      Ptr_NumtS_698_b1  758    -
chrM    0           4343      Ptr_NumtS_685_b1  758    -
chrM    0           4321      Ptr_NumtS_697_b2  757    +
chrM    0           4321      Ptr_NumtS_693_b1  757    -
chrM    0           4343      Ptr_NumtS_700_b1  756    -
chrM    0           4321      Ptr_NumtS_695_b1  756    -
chrM    0           4321      Ptr_NumtS_694_b2  756    +
chrM    0           4343      Ptr_NumtS_692_b2  756    +
chrM    0           4343      Ptr_NumtS_691_b1  756    -
chrM    0           4321      Ptr_NumtS_690_b2  756    +
chrM    0           4343      Ptr_NumtS_687_b1  756    -
chrM    0           4343      Ptr_NumtS_623_b3  755    +
chrM    19          4343      Ptr_NumtS_686_b3  746    +
chrM    19          4343      Ptr_NumtS_696_b1  745    -
chrM    19          4343      Ptr_NumtS_688_b3  745    +
chrM    0           2183      Ptr_NumtS_684_b1  708    +

Figure 5 Evidence of risk of coamplification due to NumtS. Chimpanzee NumtS spanning within mitochondrial region from 1006 to 1771; first and second large blocks in red and blue include NumtS on minus and plus strand respectively. Red squared NumtS IDs (Ptr_NumtS_443_b1 and Ptr_NumtS_376_b1) correspond to potentially unintended nuclear regions found by Primer-Blast software, and have fewest mismatches. Red squared NumtS Ptr_NumtS_442_b1, not found by Primer-Blast software, has more mismatches.
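A co-amplification check of this kind can also be run on a table exported from the “Tables” section. The Python sketch below is only an illustration: the function name and parsing are assumptions, four of the rows are copied from Table 1, and the row named Example_outside is invented to show that non-overlapping entries are dropped. It keeps the NumtS overlapping the mitochondrial region 1006-1771 and ranks them by score.

```python
from typing import List, Tuple

# Four rows copied from Table 1, deliberately out of score order,
# plus one invented row ("Example_outside") lying outside the region.
BED_EXCERPT = """\
chrM 0 1786 Ptr_NumtS_516_b1 767 -
chrM 457 2391 Ptr_NumtS_442_b1 944 -
chrM 5000 6000 Example_outside 500 +
chrM 318 2790 Ptr_NumtS_376_b1 983 +
chrM 0 2116 Ptr_NumtS_443_b1 951 -
"""

Row = Tuple[str, int, int, str, int, str]


def numts_in_region(bed_text: str, start: int, end: int) -> List[Row]:
    """Return BED rows overlapping [start, end), highest score first."""
    rows: List[Row] = []
    for line in bed_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        chrom, c_start, c_end, name, score, strand = line.split()
        row = (chrom, int(c_start), int(c_end), name, int(score), strand)
        if row[1] < end and row[2] > start:
            rows.append(row)
    return sorted(rows, key=lambda r: r[4], reverse=True)


if __name__ == "__main__":
    for row in numts_in_region(BED_EXCERPT, 1006, 1771):
        print(row[3], row[4], row[5])
```

With these rows, Ptr_NumtS_376_b1 and Ptr_NumtS_443_b1 rank first, followed by Ptr_NumtS_442_b1.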

Syntenic analyses

The whole UCSC Genome Browser is in fact a container of browsers for various species. It can be used to classify the NumtS of two different species as orthologous or not simply by using genome browsers for which NumtS tracks have been designed and implemented. Starting from a NumtS in one of the species, for which data in the comparative genomic section are available, and shifting to another genome browser for which the NumtS tracks have also been set up, the presence/absence of a gap in the NumtS region can identify species-specific or orthologous NumtS. As an example to demonstrate the advantage of the availability of NumtS tracks for different species, starting from the HSA_NumtS_006 in the human browser, Figure 3 shows the path to be followed to recognize its ortholog, Ptr_NumtS_006, in chimpanzee (for magnified screenshots of Figure 3, see additional file 8, Figure 2 screenshots).

NumtS and RE content among primates and mouse genomes

Figure 4 shows the statistical estimate of the relationship between NumtS and Repetitive Elements (RE). In all the species considered, as the number of RE in the regions, including the flanking region, is significantly high, NumtS insertions may occur preferentially in highly repetitive regions, with a bias in human NumtS, taking into account the fact that the chimpanzee genome has a greater quantity of NumtS. A more detailed analysis in terms of length of both NumtS and RE indicates that chimpanzee NumtS contain a higher density of RE with respect to the others. This is confirmed by the trend of the ratio between the base coverage of RE within NumtS with respect to the same value found in the entire genome (Figures 4b and 4c).

Risk of co-amplification

When a specific PCR primer pair is sought for a chimpanzee mitochondrial region (e.g. range 1000-1800 in the mitochondrial genome), the Primer-Blast software [21] returns two products on potentially unintended nuclear templates for the best primer pair. Browsing the NumtS tracks proves that the two nuclear regions each contain a NumtS. Table 1 lists the NumtS spanning the same chimpanzee mitochondrial region, sorted by score and extracted from the “Ptr_NumtS on mitochondrion” track in the relative table browser. The two highest scores are for Ptr_NumtS_376_b1 and Ptr_NumtS_443_b1, as shown in Figure 5. The third score value corresponds to Ptr_NumtS_442_b1, which contains more differences than the other two with respect to hg18 (and is thus not considered in the Primer-Blast report). The match between the data demonstrates the advantage offered by the NumtS tracks in avoiding NumtS coamplification.

To conclude, as demonstrated in the above discussion, the compilations reported here represent a complete set of primate and mouse NumtS loci.

Additional material

Additional file 1: RHNumtS.2 compilation. Human reference NumtScompilation on hg19 build. Each line refers to a NumtS and reportsNumtS ID, HSP_NumtS ID, chromosome where NumtS is located, andstrand where Blast mapped NumtS ("+” if in same direction as nuclearreference sequence, “-” if in opposite direction), chromosome andmitochondrial locations. For assembled NumtS (right), detailedinformation on each fragment is available. Last column shows identitypercentage as reported in Blast output for each HSP.

Additional file 2: RPNumtS compilation. Chimpanzee reference NumtScompilation on panTro2 build. Each line refers to a NumtS and givesNumtS ID, HSP_NumtS ID, chromosome where NumtS is located andstrand where Blast mapped NumtS ("+” if in same direction as nuclearreference sequence, “-” if in opposite direction), chromosome andmitochondrial locations. For assembled NumtS (right), detailedinformation on each fragment is available. Last column shows identitypercentage as reported in Blast output for each HSP.

Additional file 3: RRNumtS compilation. Macaca mulatta referenceNumtS compilation on rheMac2 build. Each line refers to a NumtS andgives NumtS ID, HSP_NumtS ID, chromosome where NumtS is locatedand strand where Blast mapped NumtS ("+” if in same direction asnuclear reference sequence, “-” if in opposite direction), chromosomeand mitochondrial locations. For assembled NumtS (right), detailedinformation on each fragment is available. Last column shows identitypercentage as reported in Blast output for each HSP.

Additional file 4: RMNumtS compilation. Mus musculus referenceNumtS compilation on mm9 build. Each line refers to a NumtS and givesNumtS ID, HSP_NumtS ID, chromosome where the NumtS is located andstrand where Blast mapped NumtS ("+” if in same direction as nuclearreference sequence, “-” if in opposite direction), chromosome andmitochondrial locations. For assembled NumtS (right), detailedinformation on each fragment is available. Last column shows identitypercentage as reported in Blast output for each HSP.

Additional file 5: NumtS compilation comparisons. Differences inNumtS number and length (bp) in same species are due to differentassembly and differences in parameters used to launch in silicohybridization. Discrepancies observed in both chimpanzee and rhesusmacaque data between our results and those reported in [3], where thesame assemblies and same e-value were used, cannot be explained,because information on Blast running in [3] is not complete. Instead,trend of differences observed in mouse match assembly time andparameters.

Additional file 6: RPNumtS annotation comparisons. Comparisons were made by applying the “Join” tool (UCSC Genome Browser) to compare our NumtS genomic intervals with those extracted from the only previously published NumtS set available. Since this starting dataset contains orthologous NumtS only (human/chimpanzee orthologs), the resulting output is partial. The data reported in [5] exactly match our data.

Additional file 7: NumtS and genome features correlation matrix. A Pearson correlation matrix was calculated to test the significance of NumtS lengths in the species examined with respect to the relative genome lengths.
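As an illustration of the kind of calculation behind Additional file 7, the sketch below computes pairwise Pearson coefficients for a set of named feature vectors. It is a minimal example under stated assumptions, not the authors' script: the function names and the numeric values are hypothetical placeholders and are not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(features):
    """Return {(name_a, name_b): r} for every ordered pair of feature vectors."""
    names = list(features)
    return {(a, b): pearson(features[a], features[b]) for a in names for b in names}


# Placeholder values only (one entry per species); NOT data from the paper.
example = {
    "numts_total_length": [1.0, 2.0, 3.0, 5.0],
    "genome_length": [2.0, 4.0, 6.0, 10.0],
}
matrix = correlation_matrix(example)
```

The matrix is symmetric with ones on the diagonal, so only the off-diagonal entries carry information about how one feature varies with another.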


Additional file 8: Figure 2 screenshots. This file provides magnified versions of the screenshots shown in Figure 2.
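The interval comparison described for Additional file 6 can be approximated outside the browser. The following sketch pairs intervals from two sets when they overlap on the same chromosome. It assumes BED-style half-open coordinates; the function name `join_intervals`, the `min_overlap` parameter and the example coordinates are hypothetical, and the code is not the “Join” tool itself.

```python
from collections import defaultdict


def join_intervals(ours, theirs, min_overlap=1):
    """Pair intervals that overlap on the same chromosome.

    Each interval is a (chrom, start, end) tuple with half-open coordinates.
    Returns a list of (our_interval, their_interval, overlap_in_bases).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in theirs:
        by_chrom[chrom].append((start, end))

    pairs = []
    for chrom, start, end in ours:
        for other_start, other_end in by_chrom.get(chrom, []):
            overlap = min(end, other_end) - max(start, other_start)
            if overlap >= min_overlap:
                pairs.append(
                    ((chrom, start, end), (chrom, other_start, other_end), overlap)
                )
    return pairs


# Hypothetical coordinates for illustration only; NOT taken from the compilations.
our_numts = [("chr1", 100, 200), ("chr2", 50, 80)]
published_numts = [("chr1", 150, 260), ("chr2", 80, 120)]
pairs = join_intervals(our_numts, published_numts)
```

With half-open coordinates, two intervals that merely touch (one ending at 80, the other starting at 80) share no bases and are therefore not joined.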

List of abbreviations used
BAM: Binary Alignment/Map; BED: Browser Extensible Data; Bl2seq: BLAST 2 sequences; e-value: expected value; HSP: High Scoring Pairs; Mms: Mus musculus; NHEJ: Non-Homologous End-Joining; NumtS: Nuclear mitochondrial Sequences; Ptr: Pan troglodytes; rCRS: revised Cambridge Reference Sequence; RE: Repeated Elements; Rhm: Rhesus macaque; RHNumtS: Reference Human NumtS (compilation); RMNumtS: Reference Mus musculus NumtS (compilation); RPNumtS: Reference Pan troglodytes NumtS (compilation); RRNumtS: Reference Rhesus macaque NumtS (compilation); SAM: Sequence Alignment/Map.

Acknowledgements
We are grateful to G. Pesole and E. Picardi for providing access to the PesoleLab server, where most analyses were carried out, and to A. Zweig and C. Li of the UCSC tracks annotation staff, who allowed us to publish the NumtS tracks at the UCSC Genome Browser.
This work was supported by “Fondo di Ateneo” (University of Bari).
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 4, 2012: Italian Society of Bioinformatics (BITS): Annual Meeting 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/13/S4.

Authors’ contributions
FMC carried out the bioinformatics analyses and wrote the manuscript; DS and FMC designed and implemented the UCSC tracks in collaboration with the UCSC staff; MA coordinated and supervised the whole project. All authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Published: 28 March 2012

References
1. Gaziev AI, Shaĭkhaev GO: Nuclear mitochondrial pseudogenes. Mol Biol (Mosk) 2010, 44(3):405-417.

2. Hazkani-Covo E, Covo S: Numt-mediated double-strand break repair mitigates deletions during primate genome evolution. PLoS Genet 2008, 4(10):e1000237.

3. Hazkani-Covo E: Mitochondrial insertions into primate nuclear genomes suggest the use of numts as a tool for phylogeny. Mol Biol Evol 2009, 26(10):2175-2179.

4. Jensen-Seaman MI, Wildschutte JH, Soto-Calderón ID, Anthony NM: A comparative approach shows differences in patterns of numt insertion during hominoid evolution. J Mol Evol 2009, 68(6):688-699.

5. Hazkani-Covo E, Graur D: A comparative analysis of numt evolution in human and chimpanzee. Mol Biol Evol 2007, 24(1):13-18.

6. Simone D, Calabrese FM, Lang M, Gasparre G, Attimonelli M: The reference human nuclear mitochondrial sequences compilation validated and implemented on the UCSC genome browser. BMC Genomics 2011, 12(1):517.

7. Andrews RM, Kubacka I, Chinnery PF, Lightowlers RN, Turnbull DM, Howell N: Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat Genet 1999, 23:147.

8. Hazkani-Covo E, Zeller RM, Martin W: Molecular poltergeists: mitochondrial DNA copies (numts) in sequenced nuclear genomes. PLoS Genet 2010, 6(2):e1000834.

9. Richly E, Leister D: NUMTs in sequenced eukaryotic genomes. Mol Biol Evol 2004, 21(6):1081-1084.

10. UCSC Chromosomes Download Page. [ftp://hgdownload.cse.ucsc.edu/goldenPath/].

11. Goecks J, Nekrutenko A, Taylor J, Galaxy Team: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010, 11(8):R86.

12. Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J: Galaxy: a web-based genome analysis tool for experimentalists. Current Protocols in Molecular Biology 2010, Chapter 19(Unit 19.10):1-21.

13. Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A: Galaxy: a platform for interactive large-scale genome analysis. Genome Res 2005, 15(10):1451-1455.

14. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res 2002, 12(6):996-1006.

15. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-Mouse Alignments with BLASTZ. Genome Res 2003, 13(1):103-107.

16. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics 2009, 25:2078-2079.

17. UCSC Genome Browser Homepage. [http://genome.ucsc.edu/].

18. Chimpanzee Custom Tracks. [http://genome.ucsc.edu/cgi-bin/hgTracks?db=panTro2&position=chrM&hgt.customText=http://dl.dropbox.com/u/31015858/all_PTR_tracks.txt].

19. Rhesus Macaque Custom Tracks. [http://genome.ucsc.edu/cgi-bin/hgTracks?db=rheMac2&position=chr1:5484600-5484695&hgt.customText=http://dl.dropbox.com/u/31015858/all_MMU_tracks.txt].

20. Calvignac S, Konecny L, Malard F, Douady CJ: Preventing the pollution of mitochondrial datasets with nuclear mitochondrial paralogs (numts). Mitochondrion 2011, 11(2):246-254.

21. Primer-Blast Software. [http://www.ncbi.nlm.nih.gov/tools/primer-blast/].

doi:10.1186/1471-2105-13-S4-S15
Cite this article as: Calabrese et al.: Primates and mouse NumtS in the UCSC Genome Browser. BMC Bioinformatics 2012 13(Suppl 4):S15.
