Conservation of exonic and non-exonic sequences in the vertebrate genome

Exonic and non-exonic genomic sequences

in vertebrates are similarly conserved.

Nhi Hin

INTRODUCTION

Only a small proportion of the vertebrate genome consists of exonic DNA that is transcribed into mRNA

and translated into proteins (2.2% in total in the human genome, according to Frith et al. 2005). Of the

remaining ~98% of non-exonic genomic DNA, a surprisingly large proportion is highly conserved across many

species, suggesting that these non-exonic sequences also have important biological functions.

It is easy to see why many exonic sequences are conserved. Because exons are spliced together to form

mature mRNA that generally encodes proteins, changes to the nucleotide sequence often result in codon

changes that alter the amino acid sequence. In turn, this alters the folding of the protein, affecting its shape

and function. Most random substitutions result in deleterious mutations and are evolutionarily selected

against. This explains why many exonic sequences tend to have low substitution rates, particularly if the

protein they encode requires a very specific structure to properly execute its function. There are many

examples of these proteins, including transcription factors that bind to specific DNA sequences and enzymes

that recognise and bind certain substrates (Drummond et al. 2006). In contrast, non-exonic DNA sequences

do not code for proteins, making the high conservation of many of these non-exonic sequences surprising.

In recent years, however, research has made significant progress in elucidating the possible functions of these

non-exonic sequences. It is now thought that non-exonic sequences are important in the regulation of a diverse

range of functions, including gene expression, chromosome assembly and DNA replication (Ludwig 2002). In

gene regulation, non-exonic DNA sequences are particularly implicated as cis-acting regulatory elements,

including promoters, enhancers and silencers (Jegga & Aronow 2008). Conservation of sequence is thought

to be important in cis-regulatory elements to ensure that they can recognise and bind specific transcription

factors (Ludwig 2002).

In the present analysis, highly conserved genomic sequences comprising both exonic and non-exonic regions

from 100 different vertebrate species are compared with regard to their substitution rates and element lengths

to determine whether they are differently conserved.

EXPERIMENTAL

The analysis followed the Genetics 3111 Practical Manual for ‘Genomics 2 – Conserved Regions’ (Adelson 2016)

without modification.

RESULTS

Table 1 shows a summary of the number of regions at each step of the Galaxy analysis. An intersect on the

MAF blocks of the highly conserved intervals (6) and coding exons (4) was done to isolate the highly

conserved coding exon intervals (8). Next, to isolate the highly conserved non-exonic intervals (9), the exons

(4) were subtracted from the MAF blocks of highly conserved intervals (6). Using this data to generate

datasets 10-13, it was possible to then generate histograms of the distributions of substitution rates (Figure

1) and element length (Figure 2) for these highly conserved intervals.
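The intersect and subtract steps described above can be illustrated with a minimal Python sketch. It is not the Galaxy implementation: the function names, the toy coordinates and the half-open (start, end) convention are all assumptions made for illustration, and the minimum-overlap setting used in the practical is ignored here.

```python
def intersect(a, b):
    """Return the overlapping pieces between two lists of (start, end) intervals."""
    pieces = []
    for a_start, a_end in a:
        for b_start, b_end in b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:  # keep only non-empty overlaps
                pieces.append((start, end))
    return sorted(pieces)


def subtract(a, b):
    """Return the parts of each interval in `a` not covered by any interval in `b`."""
    remaining = []
    for a_start, a_end in a:
        cursor = a_start  # left edge of the part not yet removed
        for b_start, b_end in sorted(b):
            if b_end <= cursor or b_start >= a_end:
                continue  # this interval does not touch the remaining part
            if b_start > cursor:
                remaining.append((cursor, b_start))
            cursor = max(cursor, b_end)
        if cursor < a_end:
            remaining.append((cursor, a_end))
    return remaining


# Hypothetical coordinates on a single chromosome (not taken from the datasets).
conserved = [(100, 400), (1000, 1200)]
exons = [(300, 500), (1150, 1160)]

conserved_exonic = intersect(conserved, exons)
conserved_non_exonic = subtract(conserved, exons)
print(conserved_exonic)
print(conserved_non_exonic)
```

In this toy case a single conserved interval can contribute both an exonic piece and one or more non-exonic pieces, which is one way the two derived datasets can together contain more regions than the dataset they were derived from.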

The histograms in Figure 1 show the distributions of substitution rates in all highly conserved intervals (Figure

1A), all highly conserved coding exonic intervals (Figure 1B) and all highly conserved non-exonic intervals

(Figure 1C). The histograms in Figure 2 show the distribution of element lengths in all highly conserved

intervals (Figure 2A), all highly conserved coding exonic intervals (Figure 2B) and all highly conserved non-

exonic intervals (Figure 2C). To visualise the association between the substitution rate and element length

for highly conserved coding exonic and non-exonic regions, the scatterplot in Figure 3 was constructed.

Finally, Table 2 shows the total number of base pairs in each interval set: all highly conserved intervals, highly

conserved exonic intervals, and highly conserved non-exonic intervals. From this table, it can be seen that

the number of base pairs sampled in the highly conserved intervals was 2,358,210 bp. The non-exonic intervals

(1,882,164 bp) accounted for a much larger proportion of these intervals compared to the exonic intervals

sampled (96,736 bp).
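The proportions implied by Table 2 can be worked out directly. The short Python sketch below uses only the three base pair totals reported in the table; the variable names are illustrative.

```python
# Base pair totals as reported in Table 2.
coverage_bp = {
    "all highly conserved": 2_358_210,
    "coding exonic": 96_736,
    "non-exonic": 1_882_164,
}

total = coverage_bp["all highly conserved"]

# Share of the highly conserved base pairs falling in each subset.
share = {
    name: bp / total
    for name, bp in coverage_bp.items()
    if name != "all highly conserved"
}

# Base pairs in the full set that appear in neither subset total.
unassigned_bp = total - coverage_bp["coding exonic"] - coverage_bp["non-exonic"]

for name, fraction in share.items():
    print(f"{name}: {fraction:.1%}")
print(f"in neither subset total: {unassigned_bp:,} bp")
```

On these figures the coding exonic intervals make up about 4.1% of the highly conserved base pairs and the non-exonic intervals about 79.8%. The two subsets do not sum to the total; the sketch reports the remainder without attempting to explain it.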

Table 1. Number of regions at each step of the Galaxy analysis.

Dataset  Description                                                 Regions
1        Highly conserved intervals                                    4,673
2        Coding exon intervals                                       450,000
3        Merged coding exon intervals                                200,000
4        All exons                                                   560,000
5        Merged all exons                                            240,000
6        MAF blocks of highly conserved intervals                      4,503
7        Substitution rates of highly conserved intervals              4,503
8        Highly conserved coding exon intervals                          482
9        Highly conserved non-exonic intervals                         4,408
10       MAF blocks of highly conserved coding exons                     473
11       MAF blocks of highly conserved non-exonic intervals           4,079
12       Substitution rate of highly conserved coding exons              473
13       Substitution rate of highly conserved non-exonic intervals    4,079

Figure 1. Histograms showing the distributions of substitution rates (nucleotide substitutions per site) of highly

conserved genomic regions across 100 vertebrate species. (A) All highly conserved intervals. (B) Highly conserved

coding exonic intervals. (C) Highly conserved non-exonic intervals. (D) Summary statistics for each distribution.

(D) Substitution rate (substitutions/nucleotide site)

Interval set                          Mean       St. Dev.
(A) All highly conserved intervals    0.273487   0.082704
(B) Coding exonic intervals           0.244096   0.102126
(C) Non-exonic intervals              0.271711   0.096935

Figure 2. Histograms showing the distributions of element lengths (in base pairs) of highly conserved genomic regions

across 100 vertebrate species. (A) All highly conserved intervals. (B) Highly conserved coding exonic intervals. (C)

Highly conserved non-exonic intervals. (D) Summary statistics for each distribution.

(D) Element length (base pairs)

Interval set                          Mean       St. Dev.
(A) All highly conserved intervals    303.776    175.207
(B) Coding exonic intervals           184.679    132.311
(C) Non-exonic intervals              255.186    181.031

Figure 3. Scatterplot showing the association between substitution rate (substitutions/nucleotide site) and element

length (bp) for coding exonic regions (blue) and non-exonic regions (black).

Table 2. Base pair coverage in all highly conserved intervals, highly conserved exonic

intervals, and highly conserved non-exonic intervals.

Interval set                      Base pairs
All highly conserved intervals     2,358,210
Coding exonic intervals               96,736
Non-exonic intervals               1,882,164

DISCUSSION

In general, highly conserved sequence intervals have a low substitution rate and tend to have

short element lengths.

Figure 1A is a distribution of the substitution rates of all highly conserved sequence intervals. Although the

substitution rates range from 0-1.0 substitutions/nucleotide site, the mean substitution rate is 0.273

substitutions/nucleotide site and the majority of substitution rates are clustered around this mean, with

substitution rates ranging from ~0.1 to ~0.4. Only a very small number of highly conserved sequences have

substitution rates greater than 0.5 substitutions/nucleotide site. This makes sense in evolutionary terms, as

highly conserved sequences would be expected to have functional value and play important roles in the

survival of the organism (contribute to maintaining/increasing fitness). Exonic sequences generally encode

proteins, and although the role of non-exonic sequences is still not well defined, many have been found to

have important regulatory roles (discussed in more detail later). Nucleotide substitutions resulting in

deleterious effects to the function of the products of these sequences would be selected against and have

less chance of becoming fixed in the population, resulting in lower substitution rates for these highly

conserved sequences.

Figure 2A shows a distribution of the element lengths of all highly conserved sequence intervals. The shape

of the distribution is clearly right-skewed (positively skewed), indicating that most highly conserved sequences

are short, with element lengths <600 bp. The mean length of these highly conserved sequences is 303.8 bp, and

the majority of highly conserved sequences show lengths that cluster around this mean. There are only a very small

number of sequences with lengths >800 bp. This is also consistent with expectations; regardless of whether

they are exonic or non-exonic, highly conserved sequences are likely to have functional value. The function

of proteins is largely influenced by their structure, which places constraints on the possible encoding DNA

sequences. Similarly, the diverse functions of non-exonic sequences are thought to depend on the secondary

structure of their transcripts or, for cis-regulatory sites, on their ability to bind specific transcription factors. The

specific motif of DNA that is crucial to the function of the transcribed product is likely to be highly conserved,

while surrounding DNA that plays a less important functional role may have a higher substitution rate. This

helps to explain why these functionally-significant sequences of DNA that are highly conserved also tend to

be short.

Overall, coding exons and non-exonic intervals show similar trends in their substitution rates

and element lengths.

The histograms in Figure 1 show the distributions of substitution rates in all highly conserved intervals (Figure

1A), all highly conserved coding exonic intervals (Figure 1B) and all highly conserved non-exonic intervals

(Figure 1C). From these histograms and the summary statistics in Figure 1D, it can be seen that coding exonic

intervals have a slightly lower mean substitution rate (0.244 substitutions/nucleotide site) compared to the

non-exonic intervals (0.272 substitutions/nucleotide site), suggesting coding exons may be more highly

conserved than non-exonic intervals. The mean substitution rate of the non-exonic intervals more closely

resembles that of all highly conserved intervals (0.273 substitutions/nucleotide

site). The spread of substitution rates in each distribution in Figures 1A, 1B and 1C is similar (standard

deviation ranging from 0.08-0.1). However, the non-exonic intervals have a larger range of substitution rates

compared to the exonic intervals, as shown by the tail on the right side, which indicates a small number of

non-exonic intervals with very high substitution rates (0.6-1.0), likely because these sequences are less well

conserved. However, the sample of non-exonic intervals is more than eight times larger than that of coding

exonic intervals (4,079 versus 473), so extreme values had more opportunity to appear. It should be noted that

despite the different mean substitution rates for both coding exonic and non-exonic intervals, the vast

majority of both coding exonic and non-exonic intervals have substitution rates between 0-0.5

substitutions/nucleotide site.
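One way to express how far apart the two mean substitution rates are, relative to the spread within each group, is a standardised mean difference (Cohen's d). This was not part of the original analysis. The Python sketch below uses the means and standard deviations from Figure 1D and assumes that the sample sizes are the region counts of datasets 12 and 13 in Table 1.

```python
from math import sqrt


def standardised_difference(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    pooled_variance = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_b - mean_a) / sqrt(pooled_variance)


# Summary statistics from Figure 1D; sample sizes assumed from Table 1.
d = standardised_difference(
    mean_a=0.244096, sd_a=0.102126, n_a=473,    # coding exonic intervals
    mean_b=0.271711, sd_b=0.096935, n_b=4079,   # non-exonic intervals
)
print(f"standardised difference: {d:.2f}")
```

Under these assumptions the difference in means is roughly 0.28 pooled standard deviations, which is consistent with the heavy overlap between the two histograms.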

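The histograms in Figure 2 show the distribution of element lengths in all highly conserved intervals (Figure

2A), all highly conserved coding exonic intervals (Figure 2B) and all highly conserved non-exonic intervals

(Figure 2C). From these histograms and the summary statistics in Figure 2D, it can be seen that the mean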

length of exonic intervals is 184.7 bp, which is smaller than that of non-exonic intervals (255.2 bp). In addition,

the distribution of element lengths for the exonic intervals shows less variation around the mean, with a

smaller standard deviation of 132.3 bp compared to the non-exonic intervals, which have a standard deviation

of 181.0 bp. The shapes of the element length distributions for coding exonic and non-exonic intervals are

similar in that they are both skewed to the right, suggesting that both coding exonic and non-exonic intervals

tend to be short rather than long. However, the distribution of the non-exonic intervals has a larger range,

and the right tail shows a small number of non-exonic intervals with very large lengths (>900 bp). In contrast,

the element lengths of the coding exonic intervals are all below 900 bp. The variation in length

for non-exonic elements may reflect the diverse regulatory roles they play. Examples include micro RNAs

(miRNA) and long non-coding RNAs (lncRNAs), both of which are heavily involved in the regulation of gene

expression in many species. While miRNA are very small (~22 nt), lncRNA are often larger than 200 nt

(Ludwig 2002). In addition, many non-exonic sequences are cis-regulatory regions such as promoters and

enhancers, which display wide variation in length (~100-2000 nt, with a median of 455 nt, according to

Kristiansson et al. 2009), depending on the downstream gene. Research by Kristiansson et al. (2009) suggested

that genes that respond to environmental stimuli (e.g. temperature shock, oxidative stress, osmotic stress)

tended to have longer promoters, perhaps to provide more specificity and minimise the chance of

unnecessarily activating these genes.

Figure 3 is a scatterplot that shows the association between the substitution rate and element length for

coding exonic and non-exonic regions. There is substantial overlap between the data points of the coding

exonic and non-exonic regions, implying that the substitution rates and element lengths of the coding exonic

intervals do not differ markedly from those of the non-exonic intervals. However, the non-exonic

intervals have a larger range of element lengths and substitution rates (as shown by the greater spread of data

points) compared to the exonic intervals. Figure 3 shows that most highly conserved intervals have short

element lengths (<500 bp), which display a wide range of substitution rates (0-0.8 substitutions/nucleotide

site). Longer elements appear to display a slightly narrower range of substitution rates,

although there is still considerable variation in their substitution rates. Taken together, these results

suggest that element length is not strongly correlated with substitution rate.
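The visual impression from the scatterplot could be checked numerically with a correlation coefficient. The Python sketch below computes Pearson's r from first principles; the element lengths and substitution rates in it are hypothetical placeholder values, not the per-interval data from this analysis, which are not reproduced in this report.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


# Hypothetical values for illustration only.
element_lengths_bp = [120, 180, 250, 310, 420, 600, 850]
substitution_rates = [0.31, 0.18, 0.27, 0.35, 0.22, 0.29, 0.24]

r = pearson_r(element_lengths_bp, substitution_rates)
print(f"Pearson r = {r:.2f}")
```

A value of r near zero on the real data would support the conclusion that element length and substitution rate are not strongly related, whereas a value near 1 or -1 would contradict it.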

Overall, the data suggest that exonic and non-exonic intervals are conserved similarly, in

contrast with the hypothesis that they are differently conserved.

The substitution rates and element lengths of coding exonic and non-exonic intervals were found to be similar

with significant overlap. This suggests that their conservation does not significantly differ, a finding which is in

contrast to the initial hypothesis. On the basis of their similar substitution rates and element lengths, it cannot

be concluded that coding exonic and non-exonic intervals are differently conserved.

Evolutionary basis for conservation of exonic intervals.

Coding exonic intervals can be spliced together to produce mRNA sequences that can be translated to form

proteins. These proteins are often required for basic cellular function, stability or reproduction (Jegga &

Aronow 2008). There is a clear relationship between the original DNA sequence, structure, and function of

a protein. For proteins to perform their specific function, they must possess a suitable shape, and this creates

specific constraints on the underlying DNA sequence. Deleterious mutations often result in less than optimal

protein function and are usually selected against, while synonymous mutations that do not alter amino acid

sequence are often allowed to persist as they tend to have minimal phenotypic effect.
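The distinction between synonymous and non-synonymous substitutions can be made concrete with a small Python sketch. The codon table below is only a five-codon subset of the standard genetic code, chosen for illustration, and the function name is hypothetical.

```python
# A small subset of the standard genetic code (DNA codons, coding strand).
CODON_TO_AMINO_ACID = {
    "GAA": "Glu",
    "GAG": "Glu",
    "GAT": "Asp",
    "GAC": "Asp",
    "GTA": "Val",
}


def classify_substitution(codon_before, codon_after):
    """Label a codon change by whether it alters the encoded amino acid."""
    before = CODON_TO_AMINO_ACID[codon_before]
    after = CODON_TO_AMINO_ACID[codon_after]
    return "synonymous" if before == after else "non-synonymous"


# Third-position change that keeps glutamate.
print(classify_substitution("GAA", "GAG"))
# Second-position change that replaces glutamate with valine.
print(classify_substitution("GAA", "GTA"))
```

Both changes alter a single nucleotide, but only the second alters the protein, which is why selection is expected to act on them differently.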

In this analysis, it was observed that coding exonic sequence intervals displayed a wide range of substitution

rates. It is well established that proteins evolve at vastly different rates, with expression level, type of function

being performed, structural characteristics and intermolecular interactions all contributing to this wide

variation (Drummond et al. 2006, Kim et al. 2006). Additionally, the evolutionary rate also varies significantly

between different amino acids depending on whether they are actively involved in protein function. For

example, substitutions of amino acids in enzyme active sites, structurally important residues, and transcription

factor residues involved in binding to DNA can be particularly deleterious

(Simon et al. 2002). This implies that sequences that are more highly conserved are preferentially involved in

more biologically important roles.

A significant proportion of highly conserved genomic regions are non-exonic.

In addition to protein-coding sequences, vertebrate genomes contain a significant amount of non-exonic DNA

that is highly conserved. This may seem surprising initially as non-exonic sequences are not translated

into proteins. However, the established evolutionary conservation of non-exonic sequences across many

species implies that they have important functions. These functions are diverse; some non-exonic

DNA is transcribed into functional RNA molecules (e.g. regulatory RNAs, transfer RNAs), while many

other non-exonic sequences act as cis-regulatory elements (e.g. enhancers, promoters, insulators, silencers and

matrix or scaffold attachment regions) (Jegga & Aronow 2008). These elements play crucial roles in gene

regulation. For example, enhancers increase transcription from a promoter through allowing transcription

factors to bind to specific conserved sequences. They are usually 100-300 bp long according to Ludwig (2002),

and this length is consistent with the element lengths displayed in the current analysis. Because the structure

and function of these transcription factors is often conserved, this implies that conservation of the sequence

of the cis-regulatory element is also often important to ensure correct recognition and binding. Non-exonic

DNA is considered to play an important role in the complex regulation of gene expression in eukaryotes,

and it has been reported that there is a strong correlation between the amount of non-protein-coding genomic DNA and

biological complexity (Taft et al. 2007).

Factors affecting interpretation of the data

There are several factors that could have influenced interpretation of the data. When intersecting the highly

conserved sequences (PhastCons Elements 100-way intervals) with coding exons, a 50 bp overlap was used.

This means that only sequences that overlapped by at least 50 bp would be included while shorter sequences

with length < 50 bp were automatically excluded, even if they had high sequence similarity. Ba et al. (2012)

identified some highly conserved, short (6-30 nt) DNA sequences that formed specific amino acid motifs in

proteins with similar functions. This may have contributed to the positively skewed distributions of element

lengths, along with the clustering of the majority of scatterplot points towards the left side (smaller element

lengths).
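The effect of the 50 bp minimum overlap can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Galaxy tool itself: the coordinates are hypothetical, intervals are half-open (start, end) pairs, and the sketch assumes that a whole conserved interval is kept when it meets the threshold.

```python
MIN_OVERLAP_BP = 50  # threshold stated in the text


def overlap_length(a, b):
    """Number of base pairs shared by two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def intersect_with_min_overlap(conserved, exons, min_overlap=MIN_OVERLAP_BP):
    """Keep conserved intervals that overlap some exon by at least min_overlap bp."""
    return [
        interval
        for interval in conserved
        if any(overlap_length(interval, exon) >= min_overlap for exon in exons)
    ]


# Hypothetical coordinates for illustration only.
conserved = [(100, 400), (1000, 1030), (2000, 2300)]
exons = [(350, 600), (990, 1100), (2280, 2500)]

kept = intersect_with_min_overlap(conserved, exons)
print(kept)
```

Here the 30 bp element at 1000-1030 lies entirely inside an exon but can never reach a 50 bp overlap, so it is dropped, as is the long element that overlaps an exon by only 20 bp.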

In addition, while conserved DNA sequences are usually indicative of conserved protein function, the

relationship between DNA sequence and non-exonic DNA products is still unclear. In the current analysis,

sequences were considered highly conserved if their nucleotide sequences were very similar (significant

overlap of 50 bp). However, this may have excluded certain elements which have conserved structure and

function while having less well-conserved DNA sequences. For example, Johnsson et al. (2014) found evidence

of long non-coding RNAs (lncRNAs) with well-conserved secondary structures despite often having poor

DNA sequence conservation between species. This is an example of a non-exonic DNA sequence product

that can have highly conserved structure and function despite having a higher substitution rate.

CONCLUSION

Highly conserved sequences in vertebrate genomes comprise both exonic and non-exonic elements. Despite

their differing biological functions, both types of sequences display similar trends in their conservation, with

similar low substitution rates and short lengths. Their evolutionary conservation suggests that these sequences

play biologically important, conserved roles across many vertebrate species.

REFERENCES

Adelson, D 2016, ‘Genomics 2 – Conserved Regions’, practical notes in the course Genetics 3111, University

of Adelaide, viewed 29 May 2016, <https://myuni.adelaide.edu.au/bbcswebdav/pid-6999491-dt-content-rid-

9629976_1/courses/3610_GENETICS_COMBINED_0001/Genomics%202%20%E2%80%93%20Conserved%

20Regions%202016_b.pdf>.

Ba, ANN, Yeh, BJ, van Dyk, D, Davidson, AR, Andrews, BJ, Weiss, EL & Moses, AM 2012, ‘Proteome-wide

discovery of evolutionary conserved sequences in disordered regions’, Science Signaling, vol. 5, no. 215, pp.1-14.

Drummond, DA, Raval, A & Wilke, CO 2006, ‘A single determinant dominates the rate of yeast protein

evolution’, Molecular Biology and Evolution, vol. 23, no. 2, pp.327-337.

Frith, MC, Pheasant, M & Mattick, JS 2005, ‘Genomics: The amazing complexity of the human

transcriptome’, European Journal of Human Genetics, vol. 13, no. 8, pp.894-897.

Jegga, AG & Aronow, BJ 2008, ‘Evolutionarily conserved noncoding DNA’, Encyclopedia of Life Sciences.

Johnsson, P, Lipovich, L, Grandér, D & Morris, KV 2014, ‘Evolutionary conservation of long non-coding RNAs;

sequence, structure, function’, Biochimica et Biophysica Acta (BBA)-General Subjects, vol. 1840, no. 3, pp.1063-1071.

Kim, PM, Lu, LJ, Xia, Y & Gerstein, MB 2006, ‘Relating three-dimensional structures to protein networks

provides evolutionary insights’, Science, vol. 314, no. 5807, pp.1938-1941.

Kristiansson, E, Thorsen, M, Tamás, MJ & Nerman, O 2009, ‘Evolutionary forces act on promoter length:

identification of enriched cis-regulatory elements’, Molecular Biology and Evolution, vol. 26, no. 6, pp.1299-1307.

Ludwig, MZ 2002, ‘Functional evolution of noncoding DNA’, Current Opinion in Genetics & Development, vol. 12,

no. 6, pp.634-639.

Simon, AL, Stone, EA & Sidow, A 2002, ‘Inference of functional regions in proteins by quantification of

evolutionary constraints’, Proceedings of the National Academy of Sciences, vol. 99, no. 5, pp.2912-2917.

Taft, RJ, Pheasant, M & Mattick, JS 2007, ‘The relationship between non‐protein‐coding DNA and eukaryotic

complexity’, Bioessays, vol. 29, no. 3, pp.288-299.
