Conservation of exonic and non-exonic sequences in the vertebrate genome
Exonic and non-exonic genomic sequences
in vertebrates are similarly conserved.
Nhi Hin
INTRODUCTION
Only a small proportion of the vertebrate genome consists of exonic DNA that is transcribed into mRNA
and translated into proteins (2.2% in total in the human genome, according to Frith et al. 2005). Of the
remaining ~98% of non-exonic genomic DNA, a surprisingly large proportion is highly conserved across many
species, suggesting that these non-exonic sequences also have important biological functions.
It is easy to see why many exonic sequences are conserved. Because exons are spliced together to form
mature mRNA that generally encodes proteins, changes to the nucleotide sequence often result in codon
changes that alter the amino acid sequence. In turn, this alters the folding of the protein, affecting its shape
and function. Most random substitutions result in deleterious mutations and are evolutionarily selected
against. This explains why many exonic sequences tend to have low substitution rates, particularly if the
protein they encode requires a very specific structure to properly execute its function. There are many
examples of these proteins, including transcription factors that bind to specific DNA sequences and enzymes
that recognise and bind certain substrates (Drummond et al. 2006). In contrast, non-exonic DNA sequences
do not code for proteins, making the high conservation of many of these non-exonic sequences surprising.
In recent years, however, research has made significant progress in elucidating the possible functions of these
non-exonic sequences. It is now thought that non-exonic sequences are important in the regulation of a diverse
range of functions, including gene expression, chromosome assembly and DNA replication (Ludwig 2002). In
gene regulation, non-exonic DNA sequences are particularly implicated as cis-acting regulatory elements,
including promoters, enhancers and silencers (Jegga & Aronow 2008). Conservation of sequence is thought
to be important in cis-regulatory elements to ensure that they can recognise and bind specific transcription
factors (Ludwig 2002).
In the present analysis, highly conserved genomic sequences comprising both exonic and non-exonic regions
from 100 different vertebrate species are compared with regard to their substitution rates and element lengths
to determine whether they are differently conserved.
EXPERIMENTAL
The analysis was carried out as described in the Genetics 3111 practical notes ‘Genomics 2 – Conserved Regions’
(Adelson 2016), without modification.
RESULTS
Table 1 shows a summary of the number of regions at each step of the Galaxy analysis. An intersect on the
MAF blocks of the highly conserved intervals (6) and coding exons (4) was done to isolate the highly
conserved coding exon intervals (8). Next, to isolate the highly conserved non-exonic intervals (9), the exons
(4) were subtracted from the MAF blocks of highly conserved intervals (6). These data were used to generate
datasets 10-13, from which histograms of the distributions of substitution rates (Figure
1) and element length (Figure 2) were produced for these highly conserved intervals.
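The intersect and subtract steps can be pictured with a small interval sketch. The Python below is a simplified illustration and not Galaxy's implementation: the function names, the half-open coordinate convention and the `min_overlap` parameter (echoing the 50 bp minimum overlap mentioned in the discussion) are assumptions, and the coordinates are toy values on a single chromosome rather than data from this analysis.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval], min_overlap: int = 1) -> List[Interval]:
    """Return the pieces of `a` that overlap `b` by at least `min_overlap` bp."""
    pieces: List[Interval] = []
    b_merged = merge(b)
    for a_start, a_end in a:
        for b_start, b_end in b_merged:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if end - start >= min_overlap:
                pieces.append((start, end))
    return pieces


def subtract(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the parts of `a` that are not covered by any interval in `b`."""
    pieces: List[Interval] = []
    b_merged = merge(b)
    for a_start, a_end in a:
        cursor = a_start
        for b_start, b_end in b_merged:
            if b_end <= cursor:
                continue          # this b interval lies entirely before the cursor
            if b_start >= a_end:
                break             # remaining b intervals lie after this a interval
            if b_start > cursor:
                pieces.append((cursor, b_start))
            cursor = max(cursor, b_end)
        if cursor < a_end:
            pieces.append((cursor, a_end))
    return pieces


# Toy coordinates (hypothetical, for illustration only).
conserved = [(100, 400), (1000, 1200)]
exons = [(150, 250), (1150, 1300)]

conserved_exonic = intersect(conserved, exons)
conserved_non_exonic = subtract(conserved, exons)
```

Raising `min_overlap` drops short overlaps, which is how a minimum-overlap setting can exclude small conserved pieces from the exonic set.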
The histograms in Figure 1 show the distributions of substitution rates in all highly conserved intervals (Figure
1A), all highly conserved coding exonic intervals (Figure 1B) and all highly conserved non-exonic intervals
(Figure 1C). The histograms in Figure 2 show the distribution of element lengths in all highly conserved
intervals (Figure 2A), all highly conserved coding exonic intervals (Figure 2B) and all highly conserved non-
exonic intervals (Figure 2C). To visualise the association between the substitution rate and element length
for highly conserved coding exonic and non-exonic regions, the scatterplot in Figure 3 was constructed.
Finally, Table 2 shows the total number of base pairs in each interval set: all highly conserved intervals, highly
conserved exonic intervals, and highly conserved non-exonic intervals. From this table, it can be seen that
the number of base pairs sampled in the highly conserved intervals was 2,358,210 bp. The non-exonic intervals
(1,882,164 bp) accounted for a much larger proportion of these intervals compared to the exonic intervals
sampled (96,736 bp).
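The proportions behind this comparison can be worked out directly from the Table 2 totals. The short Python sketch below only restates that arithmetic; the variable names are illustrative.

```python
# Base pair totals as reported in Table 2.
total_bp = 2_358_210       # all highly conserved intervals
exonic_bp = 96_736         # highly conserved coding exonic intervals
non_exonic_bp = 1_882_164  # highly conserved non-exonic intervals

# Share of the highly conserved base pairs in each subset, as percentages.
exonic_pct = 100 * exonic_bp / total_bp
non_exonic_pct = 100 * non_exonic_bp / total_bp

# Base pairs in the full set that fall in neither reported subset.
unassigned_bp = total_bp - exonic_bp - non_exonic_bp

# How many non-exonic base pairs there are per exonic base pair.
ratio = non_exonic_bp / exonic_bp

print(f"exonic: {exonic_pct:.1f}%  non-exonic: {non_exonic_pct:.1f}%")
print(f"unassigned: {unassigned_bp:,} bp  ratio: {ratio:.1f}")
```

On these figures the exonic subset is about 4.1% of the highly conserved base pairs and the non-exonic subset about 79.8%, a ratio of roughly 19.5 to 1. The two subsets do not sum to the total (379,310 bp remain), and the report does not say where that remainder falls.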
Table 1. Number of regions at each step of the Galaxy analysis.
Dataset  Description                                                 Regions
1        Highly conserved intervals                                  4,673
2        Coding exon intervals                                       450,000
3        Merged coding exon intervals                                200,000
4        All exons                                                   560,000
5        Merged all exons                                            240,000
6        MAF blocks of highly conserved intervals                    4,503
7        Substitution rates of highly conserved intervals            4,503
8        Highly conserved coding exon intervals                      482
9        Highly conserved non-exonic intervals                       4,408
10       MAF blocks of highly conserved coding exons                 473
11       MAF blocks of highly conserved non-exonic intervals         4,079
12       Substitution rate of highly conserved coding exons          473
13       Substitution rate of highly conserved non-exonic intervals  4,079
Figure 1. Histograms showing the distributions of substitution rates (nucleotide substitutions per site) of highly
conserved genomic regions across 100 vertebrate species. (A) All highly conserved intervals. (B) Highly conserved
coding exonic intervals. (C) Highly conserved non-exonic intervals. (D) Summary statistics for each distribution.
Figure 1D. Substitution rate (substitutions/nucleotide site).
Interval set                        Mean      St. Dev.
(A) All highly conserved intervals  0.273487  0.082704
(B) Coding exonic intervals         0.244096  0.102126
(C) Non-exonic intervals            0.271711  0.096935
Figure 2. Histograms showing the distributions of element lengths (in base pairs) of highly conserved genomic regions
across 100 vertebrate species. (A) All highly conserved intervals. (B) Highly conserved coding exonic intervals. (C)
Highly conserved non-exonic intervals. (D) Summary statistics for each distribution.
Figure 2D. Element length (base pairs).
Interval set                        Mean     St. Dev.
(A) All highly conserved intervals  303.776  175.207
(B) Coding exonic intervals         184.679  132.311
(C) Non-exonic intervals            255.186  181.031
Figure 3. Scatterplot showing the association between substitution rate (substitutions/nucleotide site) and element
length (bp) for coding exonic regions (blue) and non-exonic regions (black).
Table 2. Base pair coverage in all highly conserved intervals, highly conserved exonic
intervals, and highly conserved non-exonic intervals.
Interval set                    Base pairs
All highly conserved intervals  2,358,210
Coding exonic intervals         96,736
Non-exonic intervals            1,882,164
DISCUSSION
In general, highly conserved sequence intervals have a low substitution rate and tend to have
short element lengths.
Figure 1A is a distribution of the substitution rates of all highly conserved sequence intervals. Although the
substitution rates range from 0-1.0 substitutions/nucleotide site, the mean substitution rate is 0.273
substitutions/nucleotide site and the majority of substitution rates are clustered around this mean, with
substitution rates ranging from ~0.1 to ~0.4. Only a very small number of highly conserved sequences have
substitution rates greater than 0.5 substitutions/nucleotide site. This makes sense in evolutionary terms, as
highly conserved sequences would be expected to have functional value and play important roles in the
survival of the organism by maintaining or increasing fitness. Exonic sequences generally encode
proteins, and although the role of non-exonic sequences is still not well defined, many have been found to
have important regulatory roles (discussed in more detail later). Nucleotide substitutions resulting in
deleterious effects to the function of the products of these sequences would be selected against and have
less chance of becoming fixed in the population, resulting in lower substitution rates for these highly
conserved sequences.
Figure 2A shows a distribution of the element lengths of all highly conserved sequence intervals. The shape
of the distribution is clearly right-skewed (positively skewed), indicating that most highly conserved sequences are short, with
element lengths <600bp. The mean length of these highly conserved sequences is 303.8 bp, and the majority
of highly conserved sequences show lengths that cluster around this mean. There are only a very small
number of sequences with lengths >800bp. This is also consistent with expectations; regardless of whether
they are exonic or non-exonic, highly conserved sequences are likely to have functional value. The function
of proteins is largely influenced by their structure, which places constraints on the possible encoding DNA
sequences. Similarly, the diverse functions of non-exonic sequences are thought to depend on the secondary
structure of their transcripts or, for cis-regulatory sites, on their binding of specific transcription factors. The
specific motif of DNA that is crucial to this function is likely to be highly conserved,
while surrounding DNA that plays a less important functional role may have a higher substitution rate. This
helps to explain why these functionally-significant sequences of DNA that are highly conserved also tend to
be short.
Overall, coding exons and non-exonic intervals show similar trends in their substitution rates
and element lengths.
The histograms in Figure 1 show the distributions of substitution rates in all highly conserved intervals (Figure
1A), all highly conserved coding exonic intervals (Figure 1B) and all highly conserved non-exonic intervals
(Figure 1C). From these histograms and the summary statistics in Figure 1D, it can be seen that coding exonic
intervals have a slightly lower mean substitution rate (0.244 substitutions/nucleotide site) compared to the
non-exonic intervals (0.272 substitutions/nucleotide site), suggesting coding exons may be more highly
conserved than non-exonic intervals. The mean substitution rate of the non-exonic intervals more closely
resembles that of all highly conserved intervals (0.273 substitutions/nucleotide
site). The spread of substitution rates in each distribution in Figures 1A, 1B and 1C is similar (standard
deviations of 0.08-0.10). However, the non-exonic intervals have a larger range of substitution rates
than the exonic intervals, as shown by the tail on the right side, which indicates a small number of
non-exonic intervals with very high substitution rates (0.6-1.0) that are less well
conserved. The sample of non-exonic intervals is also several times larger than that of the coding
exonic intervals, so there was more opportunity to observe extreme values. It should be noted that
despite the different mean substitution rates of the coding exonic and non-exonic intervals, the vast
majority of both have substitution rates between 0 and 0.5
substitutions/nucleotide site.
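One way to put the difference in mean substitution rate in context is a standardised effect size. The sketch below computes Cohen's d from the Figure 1D summary statistics. It assumes the sample sizes are the region counts of datasets 12 and 13 in Table 1 (473 and 4,079), which the report does not state explicitly, and it is offered as an illustration rather than as part of the original analysis.

```python
import math


def cohens_d(mean1: float, sd1: float, n1: int,
             mean2: float, sd2: float, n2: int) -> float:
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean2 - mean1) / math.sqrt(pooled_var)


# Means and standard deviations from Figure 1D.
# Sample sizes are ASSUMED to equal the region counts of datasets 12 and 13.
d = cohens_d(
    mean1=0.244096, sd1=0.102126, n1=473,    # coding exonic intervals
    mean2=0.271711, sd2=0.096935, n2=4079,   # non-exonic intervals
)
print(f"Cohen's d = {d:.2f}")
```

Under these assumptions d comes out at roughly 0.28, which is conventionally read as a small effect, in keeping with the description of the exonic mean as only slightly lower.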
The histograms in Figure 2 show the distribution of element lengths in all highly conserved intervals (Figure
2A), all highly conserved coding exonic intervals (Figure 2B) and all highly conserved non-exonic intervals
(Figure 2C). From these histograms and the summary statistics in Figure 2D, it can be seen that the mean
length of exonic intervals is 184.7 bp, which is smaller than that of non-exonic intervals (255.2 bp). In addition,
the distribution of element lengths for the exonic intervals shows less variation around the mean, with a
smaller standard deviation of 132.3 bp compared to the non-exonic intervals, which have a standard deviation
of 181.0 bp. The shapes of the element length distributions for coding exonic and non-exonic intervals are
similar in that they are both skewed to the right (positively skewed), suggesting that both coding exonic and
non-exonic intervals tend to be short rather than long. However, the distribution of the non-exonic intervals has a larger range,
and the right tail shows a small number of non-exonic intervals with very large lengths (>900 bp). In contrast,
the element lengths of the coding exonic intervals are all below 900 bp. The variation in length
for non-exonic elements may reflect the diverse regulatory roles they play. Examples include microRNAs
(miRNAs) and long non-coding RNAs (lncRNAs), both of which are heavily involved in the regulation of gene
expression in many species. While miRNAs are very small (~22 nt), lncRNAs are often larger than 200 nt
(Ludwig 2002). In addition, many non-exonic sequences are cis-regulatory regions such as promoters and
enhancers, which display wide variation in length (~100-2000 nt, with a median of 455 nt, according to
Kristiansson et al. 2009), depending on the downstream gene. Research by Kristiansson et al. (2009) suggested
that genes that respond to environmental stimuli (e.g. temperature shock, oxidative stress, osmotic stress)
tended to have longer promoters, perhaps to provide more specificity and minimise the chance of
unnecessarily activating these genes.
Figure 3 is a scatterplot that shows the association between the substitution rate and element length for
coding exonic and non-exonic regions. There is substantial overlap between the data points of the coding
exonic and non-exonic regions, implying that the substitution rates and element lengths of the coding exonic
intervals do not differ markedly from those of the non-exonic intervals. However, the non-exonic
intervals have a larger range of element lengths and substitution rates (as shown by the greater spread of data
points) than the exonic intervals. Figure 3 shows that most highly conserved intervals have short
element lengths (<500 bp), which display a wide range of substitution rates (0-0.8 substitutions/nucleotide
site). Longer elements appear to display a slightly smaller range of substitution rates,
although there is still considerable variation in their substitution rates. Taken together, these results
suggest that element length is not clearly correlated with substitution rate.
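The visual impression from Figure 3 could be checked numerically with a correlation coefficient. The sketch below implements Pearson's r in plain Python; the function name and the toy values are illustrative only and are not taken from the analysis, where the inputs would be the element lengths and substitution rates of datasets 12 and 13.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sample has zero variance")
    return cov / math.sqrt(var_x * var_y)


# Toy values (hypothetical): element lengths in bp and substitution rates.
toy_lengths = [100, 200, 300, 400]
toy_rates = [0.30, 0.20, 0.30, 0.20]
r = pearson_r(toy_lengths, toy_rates)
print(f"r = {r:.2f}")
```

A value of r near zero on the real data would support the reading that element length and substitution rate are not strongly related, while values near -1 or 1 would argue against it.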
Overall, the data suggest that exonic and non-exonic intervals are conserved similarly, in
contrast with the hypothesis that they are differently conserved.
The substitution rates and element lengths of coding exonic and non-exonic intervals were found to be similar
with significant overlap. This suggests that their conservation does not significantly differ, a finding which is in
contrast to the initial hypothesis. On the basis of their similar substitution rates and element lengths, it cannot
be concluded that coding exonic and non-exonic intervals are differently conserved.
Evolutionary basis for conservation of exonic intervals.
Coding exonic intervals can be spliced together to produce mRNA sequences that can be translated to form
proteins. These proteins are often required for basic cellular function, stability or reproduction (Jegga &
Aronow 2008). There is a clear relationship between the original DNA sequence, structure, and function of
a protein. For proteins to perform their specific function, they must possess a suitable shape, and this creates
specific constraints on the underlying DNA sequence. Deleterious mutations often result in less than optimal
protein function and are usually selected against, while synonymous mutations that do not alter amino acid
sequence are often allowed to persist as they tend to have minimal phenotypic effect.
In this analysis, it was observed that coding exonic sequence intervals displayed a wide range of substitution
rates. It is well established that proteins evolve at vastly different rates, with expression level, type of function
being performed, structural characteristics and intermolecular interactions all contributing to this wide
variation (Drummond et al. 2006, Kim et al. 2006). Additionally, the evolutionary rate also varies significantly
between different amino acids depending on whether they are actively involved in protein function. For
example, substitutions at enzyme active sites, structurally important residues, and the DNA-binding sites
of transcription factors can be particularly deleterious
(Simon et al. 2002). This implies that sequences that are more highly conserved are preferentially involved in
more biologically important roles.
A significant proportion of highly conserved genomic regions are non-exonic.
In addition to protein-coding sequences, vertebrate genomes contain a significant amount of non-exonic DNA
that is highly conserved. This may seem surprising initially, as non-exonic sequences are not translated
into proteins. However, the established evolutionary conservation of non-exonic sequences across many
species implies that they have important functions. These functions are diverse; some non-exonic
DNA is transcribed into functional RNA molecules (e.g. regulatory RNAs, transfer RNAs), while much
other non-exonic DNA acts as cis-regulatory elements (e.g. enhancers, promoters, insulators, silencers and
matrix or scaffold attachment regions) (Jegga & Aronow 2008). These elements play crucial roles in gene
regulation. For example, enhancers increase transcription from a promoter by allowing transcription
factors to bind to specific conserved sequences. They are usually 100-300 bp long according to Ludwig (2002),
and this length is consistent with the element lengths displayed in the current analysis. Because the structure
and function of these transcription factors is often conserved, this implies that conservation of the sequence
of the cis-regulatory element is also often important to ensure correct recognition and binding. Non-exonic
DNA is considered to play an important role in the complex regulation of gene expression in eukaryotes,
and it is well established that there is a strong correlation between the amount of non-exonic genomic DNA and
biological complexity (Taft et al. 2007).
Factors affecting interpretation of the data
There are several factors that could have influenced interpretation of the data. When intersecting the highly
conserved sequences (PhastCons Elements 100 way intervals) with coding exons, a 50 bp overlap was used.
This means that only sequences that overlapped by at least 50 bp would be included while shorter sequences
with length < 50 bp were automatically excluded, even if they had high sequence similarity. Ba et al. (2012)
identified some highly conserved, short (6-30 nt) DNA sequences that formed specific amino acid motifs in
proteins with similar functions. This may have contributed to the positively skewed distributions of element
lengths, along with the clustering of the majority of scatterplot points towards the left side (smaller element
lengths).
In addition, while conserved DNA sequences are usually indicative of conserved protein function, the
relationship between DNA sequence and non-exonic DNA products is still unclear. In the current analysis,
sequences were considered highly conserved if their nucleotide sequences were very similar (significant
overlap of 50bp). However, this may have excluded certain elements which have conserved structure and
function while having less well-conserved DNA sequences. For example, Johnsson et al. (2014) found evidence
of long non-coding RNAs (lncRNAs) with well-conserved secondary structures despite often having poor
DNA sequence conservation between species. This is an example of a non-exonic DNA sequence product
that can have highly conserved structure and function despite having a higher substitution rate.
CONCLUSION
Highly conserved sequences in vertebrate genomes comprise both exonic and non-exonic elements. Despite
their differing biological functions, both types of sequences display similar trends in their conservation, with
similarly low substitution rates and short lengths. Their evolutionary conservation suggests that these sequences
play biologically important, conserved roles across many vertebrate species.
REFERENCES
Adelson, D 2016, ‘Genomics 2 – Conserved Regions’, practical notes in the course Genetics 3111, University
of Adelaide, viewed 29 May 2016,
<https://myuni.adelaide.edu.au/bbcswebdav/pid-6999491-dt-content-rid-9629976_1/courses/3610_GENETICS_COMBINED_0001/Genomics%202%20%E2%80%93%20Conserved%20Regions%202016_b.pdf>.
Ba, ANN, Yeh, BJ, van Dyk, D, Davidson, AR, Andrews, BJ, Weiss, EL & Moses, AM 2012, ‘Proteome-wide
discovery of evolutionary conserved sequences in disordered regions’, Science Signaling, vol. 5, no. 215, pp.1-14.
Drummond, DA, Raval, A & Wilke, CO 2006, ‘A single determinant dominates the rate of yeast protein
evolution’, Molecular Biology and Evolution, vol. 23, no. 2, pp.327-337.
Frith, MC, Pheasant, M & Mattick, JS 2005, ‘Genomics: The amazing complexity of the human
transcriptome’, European Journal of Human Genetics, vol. 13, no. 8, pp.894-897.
Jegga, AG & Aronow, BJ 2008, ‘Evolutionarily conserved noncoding DNA’, Encyclopedia of Life Sciences.
Johnsson, P, Lipovich, L, Grandér, D & Morris, KV 2014, ‘Evolutionary conservation of long non-coding RNAs;
sequence, structure, function’, Biochimica et Biophysica Acta (BBA)-General Subjects, vol. 1840, no. 3, pp.1063-1071.
Kim, PM, Lu, LJ, Xia, Y & Gerstein, MB 2006, ‘Relating three-dimensional structures to protein networks
provides evolutionary insights’, Science, vol. 314, no. 5807, pp.1938-1941.
Kristiansson, E, Thorsen, M, Tamás, MJ & Nerman, O 2009, ‘Evolutionary forces act on promoter length:
identification of enriched cis-regulatory elements’, Molecular Biology and Evolution, vol. 26, no. 6, pp.1299-1307.
Ludwig, MZ 2002, ‘Functional evolution of noncoding DNA’, Current Opinion in Genetics & Development, vol. 12,
no. 6, pp.634-639.
Simon, AL, Stone, EA & Sidow, A 2002, ‘Inference of functional regions in proteins by quantification of
evolutionary constraints’, Proceedings of the National Academy of Sciences, vol. 99, no. 5, pp.2912-2917.
Taft, RJ, Pheasant, M & Mattick, JS 2007, ‘The relationship between non‐protein‐coding DNA and eukaryotic
complexity’, Bioessays, vol. 29, no. 3, pp.288-299.