Exploring Overlapping Reading Frames (OvRFs) in viral genomes
INTRODUCTION
Gene overlap occurs when two or more genes share the same region of a nucleotide sequence in a genome. This gene arrangement is found in all kinds of organisms. It is particularly common in viruses, both as a means to increase the information content of genomes that are under selection to remain compact and as a way to regulate gene expression. In this work we elucidate some general trends in the overlapping reading frames of more than 11,000 genomes across all known virus families.
METHODS
First, we downloaded a list of all complete viral genome records from NCBI. We used a Python script to retrieve the database annotations by record accession number, including nucleic acid, topology, taxonomy and host for each viral entry. A second Python script retrieved the coding sequences (CDS) and their coordinates on the genome. We also wrote a script to assign each viral family to its class in the Baltimore classification as displayed on ViralZone (https://viralzone.expasy.org). With this information, we calculated the number and length of the overlaps for every pair of overlapping proteins in each viral genome.
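The overlap calculation described in the methods can be sketched as follows. This is a minimal illustration, not the authors' script: the `CDS` tuple, the function names and the coordinates are hypothetical, coordinates are assumed to be 1-based and inclusive, and CDS that wrap around the origin of a circular genome are not handled.

```python
from itertools import combinations
from typing import NamedTuple


class CDS(NamedTuple):
    """A coding sequence with 1-based, inclusive genome coordinates."""
    name: str
    start: int
    end: int


def overlap_length(a: CDS, b: CDS) -> int:
    """Number of nucleotides shared by two CDS (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def find_overlaps(cds_list):
    """Return (name_a, name_b, length) for every overlapping pair of CDS."""
    ordered = sorted(cds_list, key=lambda c: c.start)
    result = []
    for a, b in combinations(ordered, 2):
        n = overlap_length(a, b)
        if n > 0:
            result.append((a.name, b.name, n))
    return result


# Hypothetical coordinates, for illustration only.
genes = [
    CDS("geneA", 1, 300),
    CDS("geneB", 250, 600),
    CDS("geneC", 597, 900),
    CDS("geneD", 1000, 1200),
]
overlaps = find_overlaps(genes)
for name_a, name_b, length in overlaps:
    print(f"{name_a} / {name_b}: {length} nt")
```

Comparing every pair is quadratic in the number of CDS, which is acceptable for viral genomes with at most a few hundred genes.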
Figure 2. Baltimore classification of the database. The database consists of a total of 11,891 viral entries. The unknown portion corresponds to families of viruses that are not yet present on ViralZone. However, from the information obtained from NCBI when retrieving the data, we do know that 400 of these entries are DNA viruses and 1,552 are RNA viruses.
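A tally like the one in Figure 2 can be sketched with a family-to-class lookup that falls back to "Unknown" for families missing from the table. The table below is a small, hand-written excerpt for illustration only, and the function names and input list are hypothetical; it is not the classification script used for the poster.

```python
from collections import Counter

# Partial family-to-class table, for illustration only. A complete table
# would cover every family listed on ViralZone.
BALTIMORE = {
    "Herpesviridae": "dsDNA",
    "Microviridae": "ssDNA",
    "Reoviridae": "dsRNA",
    "Coronaviridae": "(+)ssRNA",
    "Orthomyxoviridae": "(-)ssRNA",
    "Retroviridae": "RT viruses",
}


def classify(family: str) -> str:
    """Return the Baltimore class of a family, or 'Unknown' if not listed."""
    return BALTIMORE.get(family, "Unknown")


def tally(families):
    """Count entries per Baltimore class."""
    return Counter(classify(f) for f in families)


# Hypothetical input: one family name per genome record.
families = ["Microviridae", "Coronaviridae", "Coronaviridae", "Examplaviridae"]
counts = tally(families)
for baltimore_class, n in counts.most_common():
    print(baltimore_class, n)
```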
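[Figure 1 graphic: genome map of ΦX174 with genes A, B, K, C, D, E, J, F, G and H. Inset: the sequence TATGGTACGCTGGACTTTGTG read in Frame 1 (Y G T L D F V, external scaffolding protein) and in Frame 2 (M V R W T L, cell lysis protein). Frame shift legend: 2, 1, 0, -1, -2.]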
Figure 3. Summary of overlapping regions in the viral genomes. A) Even though the genomes of dsDNA viruses are several times longer than those of other viruses, the mean number of nucleotides per overlap tends to be similar across Baltimore classes. However, some (+)ssRNA viruses, with genomes around 5,000 nucleotides long, have overlapping regions that span close to half of their length. B) As expected, there is a positive correlation between the number of proteins encoded by each genome and the number of overlaps.
[Panel titles: "Average overlap length per Baltimore classification" and "Overlap length per frame shift". Axis labels: Mean overlap length (nt); Overlap length (nt).]
RESULTS
Figure 4. Distribution of overlap lengths across viruses according to Baltimore classification. Even though dsDNA viruses tend to have the longest genomes, the most extreme cases of gene overlap occur in dsRNA and (+)ssRNA viruses. However, dsRNA viruses show three peaks in the distribution (around 1, 4 and 5,000 nucleotides), whereas (+)ssRNA viruses have a homogeneous distribution of overlap lengths that range between 10 and 5,000 nucleotides. On the other hand, ssDNA viruses tend to
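[Figure 2 graphic: pie chart of entries per Baltimore class. Labels: Unknown, dsDNA, ssDNA, (+)ssRNA, (-)ssRNA, dsRNA, RT viruses (208), circular ssRNA (viroids) (39). Remaining slice counts: 3404, 1425, 1777, 1241, 1834, 1963.]
[Bar labels from a second graphic: 38004581, 4934, 35995, 74083. Frame shift legend: 2, 1, 0, -1, -2.]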
Figure 1. Overlapping genes in bacteriophage ΦX174. These genes were first detected following the discovery that the cumulative length of the protein sequences in bacteriophage ΦX174 exceeded the genome length (Barrell et al., 1976).
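The two readings of the Figure 1 sequence can be reproduced with a few lines of Python. This is a minimal sketch that assumes the standard genetic code; the function name `translate` and the variable names are illustrative, and only the nucleotide sequence is taken from the figure.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str, offset: int = 0) -> str:
    """Translate seq starting at the given offset; trailing bases are ignored."""
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


sequence = "TATGGTACGCTGGACTTTGTG"  # sequence shown in Figure 1
frame1 = translate(sequence, offset=0)
frame2 = translate(sequence, offset=1)
print("Frame 1:", frame1)
print("Frame 2:", frame2)
```

Shifting the start by a single nucleotide yields a completely different peptide from the same stretch of DNA, which is what allows two proteins to be encoded in one region.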
Laura Munoz-Baena, Art FY Poon
Department of Microbiology and Immunology, Department of Pathology and Laboratory Medicine, Western University, ON, Canada
Figure 5. Distribution of overlap lengths across ORFs according to frame shift. More than 50% of the overlapping genes analyzed have a frame shift of +2, which usually involves only 1 or 6 nucleotides in the overlap. Overlaps with the other frame shifts are less common but appear to be longer. A frame shift of 0 means that the overlapping gene is translated from the negative strand; interestingly, these overlaps have the longest mean length of all, at 500 nucleotides.
DISCUSSION
Acknowledgement: This work is supported by the Schulich School of Medicine & Dentistry's program in Microbiology and Immunology, and the Poon Lab.
References: [1] Barrell BG, Air GM, Hutchison CA. 1976. Overlapping genes in bacteriophage phiX174. Nature, 264(5581):34-41.
[Legend: RT viruses, dsRNA, (-)ssRNA, (+)ssRNA, ssDNA, dsDNA, Unknown.]
[Figure 3 panels. A) Mean overlap length per genome length; x-axis: [log10] Genome length (nt); y-axis: [log10] Mean number of nucleotides in overlaps. B) Number of overlaps per number of ORFs; x-axis: [log10] Number of ORFs; y-axis: [log10] Number of overlaps.]
[Figure 7 panels: (-)ssRNA, ssDNA, dsRNA, dsDNA, RT viruses, (+)ssRNA; x-axis: Frame shift (−2, −1, 0, 1, 2).]
Figure 7. Distribution of frame shifts according to Baltimore classification. Although all viral groups contain overlaps with a frame shift of 2, in both (-)ssRNA and (+)ssRNA viruses the most common type of overlap has a frame shift of 1. On the other hand, overlaps with a frame shift of 0 (that is, the same three nucleotides encode an amino acid, but in the opposite direction) also seem to be common in RNA viruses but not in DNA viruses.
T A T G G T A C G A T G G A C T T T G T G    Frame shift: 2
T A T G G T A C G A T G G A C T T T G T G    Frame shift: 0
T A T G G T A C G A T G G A C T T T G T G    Frame shift: -1