Exploring Overlapping Reading Frames (OvRFs) in viral genomes

INTRODUCTION

Gene overlap occurs when two or more genes share the same region of a nucleotide sequence in a genome. This gene arrangement is found in all kinds of organisms. It is particularly common in viruses, both as a means to increase the information content of genomes that are under selection to remain compact and as a way to regulate gene expression. In this work we describe general trends in the overlapping reading frames of more than 11,000 genomes across all known virus families.

METHODS

First, we downloaded a list of all complete viral genome records from NCBI. We used a Python script to retrieve the database annotations by record accession number, including nucleic acid, topology, taxonomy and host for each viral entry. A second Python script retrieved the coding sequences (CDS) and their coordinates on the genome. We also created a script to classify each viral family according to the Baltimore classification displayed on ViralZone (https://viralzone.expasy.org). With this information, we calculated the number and length of the overlaps for every pair of overlapping proteins in each viral genome.
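The overlap calculation described in the methods can be illustrated with a minimal sketch. This is not the authors' script: the function names, the 1-based inclusive coordinate convention and the example coordinates are assumptions made for illustration, and the sketch ignores spliced CDS and genes that wrap around the origin of a circular genome.

```python
from itertools import combinations


def overlap_length(a, b):
    """Number of nucleotides shared by two (start, end) intervals.

    Assumes 1-based, inclusive coordinates with start <= end.
    """
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


def find_overlaps(cds):
    """Return (name_a, name_b, length) for every overlapping CDS pair.

    `cds` maps a gene name to its (start, end) coordinates.
    """
    results = []
    for (name_a, a), (name_b, b) in combinations(cds.items(), 2):
        length = overlap_length(a, b)
        if length > 0:
            results.append((name_a, name_b, length))
    return results


# Hypothetical coordinates for illustration only (not real annotations).
example = {
    "geneA": (1, 300),
    "geneB": (250, 600),
    "geneC": (597, 900),
    "geneD": (1000, 1200),
}
overlaps = find_overlaps(example)
print(overlaps)  # [('geneA', 'geneB', 51), ('geneB', 'geneC', 4)]
```

Counting the entries of `overlaps` gives the number of overlaps per genome, and the third element of each tuple gives the overlap length in nucleotides.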

Figure 2. Baltimore classification of the database. The database consists of a total of 11,891 viral entries. The unknown portion corresponds to virus families that are not yet present on ViralZone. However, from the information retrieved from NCBI, we do know that 400 of these entries are DNA viruses and 1,552 are RNA viruses.

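[Figure 1 graphic: genome map of ΦX174 showing genes A, B, C, D, E, F, G, H, J and K, with the sequence TATGGTACGCTGGACTTTGTG translated in two frames. Frame 1: Y G T L D F V (external scaffolding protein). Frame 2: M V R W T L (cell lysis protein). Frame shift scale: 2, 1, 0, -1, -2.]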

Figure 3. Summary of overlapping regions in the viral genomes. A) Even though dsDNA virus genomes are several times longer than those of other viruses, the mean number of nucleotides per overlap tends to be similar across Baltimore classes. However, some (+)ssRNA viruses, with genomes around 5,000 nucleotides long, have long overlapping regions that approach half of their genome length. B) As expected, there is a positive correlation between the number of proteins encoded by each genome and the number of overlaps.

[Plot titles: "Average overlap length per Baltimore classification"; "Overlap length per frame shift". Axis labels: mean overlap length (nt); overlap length (nt).]

RESULTS

Figure 4. Distribution of overlap lengths across viruses according to Baltimore classification. Even though dsDNA viruses tend to have the longest genomes, the most extreme cases of gene overlap occur in dsRNA and (+)ssRNA viruses. However, dsRNA viruses show three peaks in the distribution (around 1, 4 and 5,000 nucleotides), whereas (+)ssRNA viruses have a homogeneous distribution of overlap lengths that varies between 10 and 5,000 nucleotides. On the other hand, ssDNA viruses tend to

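[Figure 2 graphic: pie chart of entries per Baltimore class (dsDNA, ssDNA, dsRNA, +ssRNA, -ssRNA, RT viruses, circular ssRNA (viroids), Unknown). Segment counts shown: 3404, 1425, 208 (RT viruses), 1777, 1241, 39 (circular ssRNA, viroids), 1834, 1963.]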
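As a quick consistency check, the segment counts printed in the Figure 2 chart can be summed and compared with the total of 11,891 entries stated in the caption. The snippet below is only an arithmetic check; it does not assign the unlabeled counts to any Baltimore class.

```python
# Segment counts as printed in the Figure 2 chart.
segment_counts = [3404, 1425, 208, 1777, 1241, 39, 1834, 1963]

total = sum(segment_counts)
print(total)  # 11891, matching the total given in the Figure 2 caption
```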

Figure 1. Overlapping genes in bacteriophage ΦX174. These genes were first detected following the discovery that the cumulative length of the protein sequences in bacteriophage ΦX174 exceeded the genome length (Barrell et al., 1976).
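The two reading frames shown in Figure 1 can be reproduced with a short sketch that translates the same 21-nucleotide sequence from two different starting positions. The `translate` helper and the partial codon table are illustrative assumptions; the table contains only the standard-code codons that occur in this example.

```python
# Only the codons that occur in this example; not a full genetic code table.
CODONS = {
    "TAT": "Y", "GGT": "G", "ACG": "T", "CTG": "L", "GAC": "D",
    "TTT": "F", "GTG": "V", "ATG": "M", "GTA": "V", "CGC": "R",
    "TGG": "W", "ACT": "T", "TTG": "L",
}


def translate(seq, offset):
    """Translate `seq` from a 0-based `offset`, ignoring incomplete trailing codons."""
    codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
    return "".join(CODONS[c] for c in codons)


seq = "TATGGTACGCTGGACTTTGTG"
frame1 = translate(seq, 0)  # external scaffolding protein frame
frame2 = translate(seq, 1)  # cell lysis protein frame

print(frame1)  # YGTLDFV
print(frame2)  # MVRWTL
```

Shifting the start by a single nucleotide yields a completely different peptide from the same stretch of DNA, which is how the total length of the encoded proteins can exceed the genome length.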

Laura Munoz-Baena, Art FY Poon

Department of Microbiology and Immunology, Department of Pathology and Laboratory Medicine, Western University, ON, Canada

Figure 5. Distribution of overlap lengths across ORFs according to frame shift. More than 50% of the overlapping genes analyzed have a frame shift of +2, which usually involves only 1 or 6 nucleotides in the overlap. Overlaps with the other frame shifts tend to be longer, although they are less common. A frame shift of 0 means that the overlapping gene is translated from the negative strand; interestingly, these overlaps have the longest mean length, at 500 nucleotides.

DISCUSSION

Acknowledgement: This work is supported by the Schulich School of Medicine & Dentistry's program in Microbiology and Immunology, and the Poon Lab.

References:

[1] Barrell BG, Air GM, Hutchison CA. 1976. Overlapping genes in bacteriophage phiX174. Nature 264(5581):34-41.

[Figure 3 graphic. Legend: RT viruses, dsRNA, (-)ssRNA, (+)ssRNA, ssDNA, dsDNA, Unknown. Panel "Mean overlap length per genome length": [log10] mean number of nucleotides in overlaps against [log10] genome length (nt). Panel "Number of overlaps per number of ORFs": [log10] number of overlaps against [log10] number of ORFs.]

[Figure 7 graphic. One panel per class ((-)ssRNA, ssDNA, dsRNA, dsDNA, RT viruses, (+)ssRNA), each plotted against frame shift (-2 to 2).]

Figure 7. Distribution of frame shifts according to Baltimore classification. Although all viral groups contain overlaps with a frame shift of 2, both (-)ssRNA and (+)ssRNA viruses have 1 as their most common frame shift. In contrast, a frame shift of 0 (that is, the same three nucleotides encode an amino acid, but in the opposite direction) also appears to be common in RNA viruses but not in DNA viruses.

TATGGTACGATGGACTTTGTG  Frame shift: 2

TATGGTACGATGGACTTTGTG  Frame shift: 0

TATGGTACGATGGACTTTGTG  Frame shift: -1