Chapter 6. Comparative Genomics - Auburn University

Chapter 6. Comparative Genomics

Contents

6. Comparative Genomics
  6.1. Nucleotide and Amino Acid Substitutions
    6.1.1. Sequence Similarity
    6.1.2. Sequence Comparison by Alignment
    6.1.3. Jukes-Cantor Model for Base Substitution
  6.2. Comparative Genomic Analysis
    6.2.1. Components of Comparative Genomic Analysis
    6.2.2. Molecular Clocks
  6.3. Molecular Phylogeny
    6.3.1. Phylogenetic Trees
    6.3.2. Gene Versus Species Trees
    6.3.3. Methods of Reconstructing Phylogenetic Trees
  6.4. Tree of Life
  6.5. Genome Evolution
    6.5.1. Multigene Families
    6.5.2. Gene Duplication and Gene Conversion
    6.5.3. Domain (Exon) Shuffling

CONCEPTS OF GENOMIC BIOLOGY Page 6- 1

As populations phenotypically change over evolutionary time, so too does their genetic structure. Molecular evolution examines DNA and proteins, addressing two types of questions: 1) How have DNA and protein molecules evolved; and 2) How are genes and organisms evolutionarily related? As we have seen in section 2.7, population genetics focuses on changes in population genetic structure between generations. Molecular evolution considers the hundreds or thousands of generations needed for speciation, where small departures from Hardy–Weinberg equilibrium, random effects, and slight differences in fitness can become very significant in the development of novel species.

Development of techniques in molecular biology makes it possible to study molecular evolution, using genomes as historical records that can reveal the dynamics of evolutionary processes, indicate the chronology of change, and identify phylogenetic relationships between organisms. Such information can find meaningful application in biomedical sciences, food and fiber production, and environmental science.

Before we begin our discussion of comparative genomics, a few discipline-specific terms need to be defined. These terms are commonly misused, which can lead to confusion.

CHAPTER 6. COMPARATIVE GENOMICS (RETURN)

Definition of HOMOLOG, ORTHOLOG AND PARALOG found in the NCBI Glossary

• homolog/homologous – Homologous genes (homologs) are related by descent from a common ancestral DNA sequence. The term, homolog, may apply to the relationship between genes separated by the event of speciation (see ortholog) or to the relationship between genes separated by the event of genetic duplication (see paralog).

• ortholog/orthologous – Orthologous genes are genes from different species that are derived from a common ancestor, i.e., they are direct evolutionary counterparts. Normally, orthologs retain the same function in the course of evolution. Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes.

• paralog/paralogous – describes the relationship of homologous genes that arose by gene duplication within a genome. Orthologs retain the same function in the course of evolution, whereas paralogs evolve new functions, even if these are related to the original one. For example, the mouse α-globin and β-globin genes are paralogs. The relationship between mouse α-globin and chick β-globin is also considered paralogous.

• Speciation is the formation of a new species capable of existence independently from the species from which it arose. Usually speciation involves some barrier to genetic exchange with the parent species.

https://www.ncbi.nlm.nih.gov/books/NBK21106/

https://www.ncbi.nlm.nih.gov/books/NBK21106/def-item/app73/


In section 5.4, we defined various base substitutions in DNA molecules and related these to functional consequences at the genomic and protein levels. Whether such changes alter amino acid sequence of translated proteins or alter the processing and regulation of the transcript, it is clear that such changes can alter the functional performance of proteins in either a positive or negative way. It is fundamental that such changes in performance derived from point mutations in the genome are the basis for evolutionary change.

6.1.1. Sequence Similarity (RETURN)

Patterns of variation within homologous genes show that some amino acid substitutions are found more frequently than others. Substitutions permanently retained in genomes often involve the substitution of one amino acid for another with similar chemical characteristics. This supports two key evolutionary principles: 1) Mutations are rare events, and 2) Most dramatic changes in genes are removed by natural selection.

Chemically similar amino acids tend to have similar codons (Figure 6.1.), so a substitution between them may result from a single base-pair change (mutation), such as a SNP in the coding region of a gene. For example, the change of a CAU codon for histidine to a CGU codon for arginine is a conservative change leading to the substitution of a basic amino acid for a different basic amino acid. Such changes in the amino acid sequence of proteins can alter functional aspects of protein activity.

6.1. NUCLEOTIDE AND AMINO ACID SUBSTITUTIONS (RETURN)

Figure 6.1. Codon table showing groups of similar amino acids with their corresponding codons. All nonpolar amino acids are shown in red, and neutral polar amino acids are shown in green. Basic amino acids are shown in light purple, and acidic amino acids are shown in dark purple.


More substantial alterations of protein primary structure, e.g. the substitution of a basic amino acid for an acidic amino acid, are likely to be deleterious and so be removed from the gene pool. Only rarely does a sequence change produce an amino acid substitution that yields a more fit protein, enhancing the survival of the mutation in the face of natural selection, but the slow accumulation of multiple positive mutations is precisely what evolutionary theory is based upon.

6.1.2. Sequence Comparison by Alignment (RETURN)

Sequence comparison begins with a multiple sequence alignment using computer algorithms based on the idea that the best alignments reflect true ancestral relationships. A number of programs can be used for such alignments, including the COBALT program at NCBI and the Clustal series of programs available at EMBL-EBI (European Bioinformatics Institute). Matching nucleotides are interpreted as unchanged since being derived from a common ancestor. Substitutions, insertions, and deletions can be identified, and gaps can be inserted to maximize the similarity between aligned sequences. Each gap indicates the occurrence of an insertion or deletion (indel). Many alignments are possible between sequences, and the algorithms used typically maximize the number of matching amino acids or nucleotides while invoking the smallest possible number of indel events.
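As a rough illustration of how an alignment algorithm trades matches against indels, the sketch below implements a basic pairwise global alignment by dynamic programming (the Needleman-Wunsch approach). It is not the method used by COBALT or Clustal, which align many sequences at once and use substitution matrices. The function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
def global_align(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a simple global alignment.

    Scoring values are illustrative assumptions, not recommended settings.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned_a.append(seq_a[i - 1])
                aligned_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


if __name__ == "__main__":
    total, top, bottom = global_align("GATTACA", "GATACA")
    print(total)
    print(top)
    print(bottom)
```

When several alignments share the best score, this sketch returns only one of them, which mirrors the point that many alignments are possible between the same sequences.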

Figure 6.2. Clustal-W multiple sequence alignment. Amino acids are color-coded according to similarity group (see Figure 6.1.). Note that this is only a partial alignment of sequence as the full sequence would require several pages.

https://www.ncbi.nlm.nih.gov/tools/cobalt/cobalt.cgi

https://www.ebi.ac.uk/Tools/msa/clustalo/


6.1.3. Jukes-Cantor Model for Base Substitution (RETURN)

When DNA sequences diverge, they begin to collect mutations. The number of substitutions (K) found in an alignment is widely used in molecular evolution analysis. If the alignment shows few substitutions, a simple count is used. If many substitutions have occurred, it is likely that a simple count will underestimate the substitution events, due to the probability of multiple changes at the same site (Figure 6.3.).

Jukes and Cantor (1969) assumed that each nucleotide is equally likely to change into any other nucleotide, and they created a mathematical model to describe multiple base substitutions. As data became available a decade later, the observation that different mutations occur at different rates (e.g., transitions are more common than transversions) revealed over-simplifications in the Jukes–Cantor model. Even so, the model provided a framework to estimate the actual number of substitutions (K) when multiple substitutions were possible.

Figure 6.2. Using the Jukes-Cantor Model. The rate of change to any one of the other three nucleotides is designated as $a$, so the overall rate of substitution for any given nucleotide is $3a$. If the nucleotide at a site was C in the beginning ($t = 0$), the probability ($P$) of the site still being C at the first time point ($t = 1$) is $P_C(1) = 1 - 3a$. After more time has passed ($t = 2$), the probability is calculated from the equation $P_C(2) = (1 - 3a)P_C(1) + a[1 - P_C(1)]$, where the second term accounts for a site that had changed away from C and then changed back. The probability of that site containing C at any given time in the future is defined by the equation $P_C(t) = \frac{1}{4} + \frac{3}{4}e^{-4at}$.
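The behaviour of the model can be checked numerically. The sketch below iterates the one-step rule (a site stays C with probability 1 - 3a, or returns to C from another base with probability a) and compares the result with the closed-form expression 1/4 + (3/4)e^(-4at). It also includes the standard Jukes-Cantor distance correction, K = -(3/4) ln(1 - 4p/3), which is not written out in the text above but follows from the same model; here p is the observed fraction of differing sites. All function names are hypothetical.

```python
import math


def prob_same_recursive(a, steps):
    """Iterate P(t+1) = (1 - 3a) * P(t) + a * (1 - P(t)), starting from P(0) = 1."""
    p = 1.0
    for _ in range(steps):
        p = (1 - 3 * a) * p + a * (1 - p)
    return p


def prob_same_closed_form(a, t):
    """Continuous-time result P_C(t) = 1/4 + (3/4) * exp(-4at)."""
    return 0.25 + 0.75 * math.exp(-4 * a * t)


def jukes_cantor_distance(p_observed):
    """Estimate substitutions per site (K) from the observed fraction of differing sites."""
    if not 0 <= p_observed < 0.75:
        raise ValueError("observed difference must be in [0, 0.75) for this correction")
    return -0.75 * math.log(1 - 4 * p_observed / 3)


if __name__ == "__main__":
    a = 1e-4  # illustrative per-step rate, not a measured value
    for t in (1, 100, 1000, 10000):
        print(t, prob_same_recursive(a, t), prob_same_closed_form(a, t))
    # With many differences, the corrected K exceeds the simple count.
    print(jukes_cantor_distance(0.30))
```

Both versions approach 1/4 as time increases, reflecting that a long-diverged site is equally likely to hold any of the four bases.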

The number of substitutions in homologous sequences since divergence is central to molecular evolution analysis. The number of substitutions per site (K) coupled with divergence time (T) is converted to a rate (r) of substitution in the equation r = K/(2T). Substitutions are assumed to accumulate simultaneously and independently in both species. Substitution rate comparison provides insight into the mechanisms of molecular change and evolutionary events.
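A short worked example may help make the rate equation concrete. The values below (K = 0.10 substitutions per site, T = 10 million years) are invented for illustration and are not taken from any data set; the function names are likewise hypothetical. The factor of 2 reflects that substitutions accumulate along both lineages since the split.

```python
def substitution_rate(k, t_years):
    """Rate r = K / (2T), in substitutions per site per year."""
    if t_years <= 0:
        raise ValueError("divergence time must be positive")
    return k / (2 * t_years)


def divergence_time(k, rate):
    """The same equation rearranged for time: T = K / (2r)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return k / (2 * rate)


if __name__ == "__main__":
    r = substitution_rate(0.10, 10_000_000)
    print(f"r = {r:.2e} substitutions per site per year")
    print(f"T = {divergence_time(0.10, r):,.0f} years")
```

The second function is the form used when a rate is assumed to be known and the divergence time is the quantity being estimated.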

Studies show that different regions of genes appear to evolve at different rates. Distinctions are seen between and within coding and noncoding regions. Examples of noncoding regions include introns, leaders and trailers, nontranscribed flanking regions, and pseudogenes. Even within the coding region, not all nucleotide substitutions create changes in the gene product (e.g., a substitution at the third position of a codon may produce a synonymous codon). Different gene regions evolve at different rates (Table 23.1).

Synonymous changes, which do not alter the amino acids in the protein, are found five times more often than nonsynonymous changes. Both types of change are equally likely to occur, but nonsynonymous changes are usually detrimental to fitness and are eliminated by natural selection. This creates a distinction between mutations and substitutions, i.e. mutations are changes in nucleotide sequences due to errors in replication or repair while substitutions are mutations that have passed through the filter of natural selection. Synonymous substitutions more nearly reflect the actual mutation rate in the genome. Nonsynonymous substitution rates do not.
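The distinction between synonymous and nonsynonymous changes can be illustrated with a small sketch that compares two aligned coding sequences codon by codon using the standard genetic code. It assumes gap-free DNA sequences of equal length that start in frame, and it only counts differing codons; it is not a full dN/dS estimate, which would also normalize by the number of synonymous and nonsynonymous sites. The function name and example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_changes(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, equal length, and a multiple of 3")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue  # unchanged codon
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


if __name__ == "__main__":
    # CAT -> CGT changes histidine to arginine; GGT -> GGC leaves glycine unchanged.
    print(count_codon_changes("CATGGT", "CGTGGC"))
```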

Changes in 3’ flanking regions have no known effect on amino acid sequence, and little effect on gene expression, so most are tolerated by natural selection. Introns have rates of change higher than those of exons, but not as high as 3’ flanking regions, due to their need to retain sequences required at splice junctions and branch points.


In some cases, alternative ORFs are used by alternative splicing that takes place in some tissues but not others.

The 5’ flanking regions have lower rates of change than do 3’ regions, due to the presence of promoters and other gene regulatory elements. Small changes in these sequences may have a large effect on protein production and so be subject to natural selection.

Leader and trailer regions have lower rates than do the 5’ flanking region, because they contain signals for processing and translation of mRNA.

Nonsynonymous coding sequences have the lowest rate of change because most protein-coding sequences produce products optimized for their role and environment. Most substitutions are eliminated by natural selection.

Pseudogenes are nonfunctional gene-like sequences lacking functional promoters. Pseudogenes have the highest evolution rate seen. Since pseudogenes no longer code for proteins, changes in them do not impact fitness and are not eliminated by natural selection.


6.2.1. Components of Comparative Genomic Analysis (RETURN)

Genome sequencing provides a map to genes but does not reveal their function. Comparative genome analysis compares genes with low evolutionary rate and high functional significance. Pseudogenes, which are free to mutate, are used to calculate expected mutation rates. Regions of high sequence similarity in distantly related species are likely to contain functional genes. Between mice and humans, for example, pseudogenes show about five times as many changes as regions that encode proteins or regulate gene expression. Natural selection evaluates the consequences of an enormous number of changes on an evolutionary time scale. Comparative genome analysis can point the way to meaningful experiments, sparing the effort of saturation mutagenesis and allowing the use of model organisms (e.g., yeast).

6.2.2. Molecular Clocks (RETURN)

Genes with similar functions can show very uniform rates of molecular evolution over long periods of time, acting as molecular clocks. This led Zuckerkandl and Pauling (1960) to suggest that amino acid changes accumulate at a constant rate over many tens of millions of years, functioning as a molecular clock that measures divergence from a common ancestor.

The molecular clock runs at different rates with different proteins. Comparison of the divergence between two homologous proteins correlates well with time since speciation. This allows calculation of phylogenetic relationships between species and the time of their divergence (in much the same way as radioactive decay is used to date geological times).

The molecular clock hypothesis has been challenged on two grounds: inconsistencies with morphological (classical) evolution, which is based on a fossil record with a more erratic tempo, and a lack of uniformity in the evolutionary rates of all genes (Figure 23.2).

Divergence dates from the fossil record are of questionable accuracy. As DNA sequence data have become available, the molecular clock premise has been tested. Substitution rates are similar in rats and mice, but substitution rates in humans and apes are about half as rapid as those in rodents. The molecular clock clearly varies among taxonomic groups, complicating the use of molecular divergence to date the last common ancestor. In groups with a uniform clock (e.g., rodents) this model is useful.

6.2. COMPARATIVE GENOMIC ANALYSIS (RETURN)


Several explanations are possible for the observed differences in evolutionary rates. Generation time varies greatly between species, and substitution rates should be related more closely to the number of germ-line replications than to simple divergence times. Other differences between the lineages since the time of divergence may also be involved. These include average repair efficiency, average exposure to mutagens, and opportunities to adapt to new ecological niches and environments. Fossil information can sometimes be used to calibrate rates of molecular divergence.

Figure 6.3. Molecular Clocks shown for 3 highly conserved proteins: 1) The mitochondrial protein cytochrome C; 2) Hemoglobin; and 3) Fibrinopeptides. Note the vast differences in apparent evolutionary rates.


Evolution is defined as genetic change that takes place over time, and so genetic relationships are key to understanding evolutionary relationships. Organisms that are similar at the molecular level are expected to be more closely related than dissimilar organisms. Phylogenetic relationships among living things are inferred from molecular similarity.

Before genomic biology, phenotype was used for evolutionary studies to infer genetic information. Original studies used gross anatomy. Later, behavioral, ultrastructural, and biochemical traits were also used. Evolutionary trees were constructed for many groups of plants and animals, and these continue to provide a basis for evolutionary study.

Phenotypes can be misleading, because they do not always reflect genetic relatedness. Sometimes similarities result from convergent evolution, complicating the study of divergence among organisms (e.g., wings alone would put birds, bats, and insects in the same evolutionary group). Also, not all organisms have easily studied phenotypic features (e.g., bacteria). Among distant relatives (e.g., humans and bacteria), few phenotypic features are shared, and it is difficult to determine how such species should be compared.

Molecular evolution provides important information, because the effects of natural selection are generally less pronounced at the DNA sequence level. Comparison of molecular and morphological phylogenies is valuable for examining the effect of natural selection on phenotypic differences at levels from molecular to gross anatomical. In either case, a phylogenetic tree of some type becomes a way of showing the detailed quantitative relationships of organisms that can be obtained by analysis of all types of information about organisms.

6.3. MOLECULAR PHYLOGENY (RETURN)

Figure 6.4. Example phylogenetic tree.

6.3.1. Phylogenetic Trees (RETURN)

Phylogenetic trees are diagrams used to describe the relationship between species (Figure 6.4.). All living things on Earth share a common ancestor that lived about 4 billion years ago. Every phylogenetic tree uses branches that connect adjacent nodes. Terminal nodes indicate taxa for which molecular information is available. Internal nodes represent common ancestors of the two (or more) groups. Branch length may be scaled to show the amount of divergence between taxa. If all nodes on the tree have a common ancestor, it is possible to make it a rooted tree, indicating an evolutionary path. Unrooted trees show a relationship between nodes and do not indicate an evolutionary path. Roots for unrooted trees can usually be determined by using an outgroup for comparison. In a situation where only three taxa are considered, there are three possible rooted trees and only one unrooted tree (Figure 6.5).

As more taxa are considered, the number of possible trees quickly becomes enormous (Table 23.4). The number of trees can be determined for any number of taxa ($n$). For rooted trees ($N_R$) the equation is:

$$N_R = \frac{(2n - 3)!}{2^{n-2}\,(n - 2)!}$$

For unrooted trees ($N_U$) the equation is:

$$N_U = \frac{(2n - 5)!}{2^{n-3}\,(n - 3)!}$$

The value for $n$ can be as large as every species, or even every individual, but smaller numbers of groups are more practical in this type of analysis.

As an example, consider the table listing the number of possible trees for 2 to 10 sequences.

Figure 6.5. Three rooted phylogenetic trees derived from the unrooted tree on the right, depending on where the root (A) is placed on the unrooted tree.


Clearly, the number of possible trees grows very large as the number of sequences analyzed increases.
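The growth in the number of trees can be reproduced directly from the rooted and unrooted tree-count formulas, (2n - 3)! / [2^(n-2) (n - 2)!] and (2n - 5)! / [2^(n-3) (n - 3)!]. The sketch below is a minimal implementation with hypothetical function names. The unrooted formula is only defined for three or more taxa, so returning 1 for n = 2 is a convention adopted here to match the single tree that joins two sequences.

```python
from math import factorial


def rooted_tree_count(n):
    """Number of rooted bifurcating trees for n taxa (n >= 2)."""
    if n < 2:
        raise ValueError("need at least two taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_tree_count(n):
    """Number of unrooted bifurcating trees for n taxa (n >= 2)."""
    if n < 2:
        raise ValueError("need at least two taxa")
    if n == 2:
        return 1  # convention: the formula itself starts at n = 3
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    print(f"{'n':>3} {'unrooted':>12} {'rooted':>12}")
    for n in range(2, 11):
        print(f"{n:>3} {unrooted_tree_count(n):>12,} {rooted_tree_count(n):>12,}")
```

Integer division is exact here because each count is a whole number, so no floating-point rounding is involved even for larger n.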

6.3.2. Gene Versus Species Trees (RETURN)

A gene tree is a phylogenetic tree based on divergence within a single homologous gene. A gene tree represents the history of the gene, but not necessarily the history of the species, whereas a species tree usually draws on data from multiple genes. Divergence within genes typically occurs prior to speciation (Figure 6.5). This means that members of separate groups may be more similar to each other than they are to members of their own population. Divergence is especially high for loci where diversity is advantageous (e.g., MHC). On the basis of MHC alone, many humans would be grouped with gorillas rather than other humans because the polymorphism predates the split in the two lineages.

Species trees are less influenced by horizontal gene transfer than are gene trees.

6.3.3. Methods of Reconstructing Phylogenetic Trees (RETURN)

Many possibilities exist for the REAL phylogenetic trees, and it is generally impossible to know which is the true tree that represents actual events in evolution. Most phylogenetic trees generated with molecular data are considered inferred trees.

Computer algorithms that generate these inferred trees use three types of approaches: 1) Distance matrix methods; 2) Parsimony-based methods; and 3) Maximum likelihood methods.

Large numbers (e.g., >30 species) of long sequences are difficult to analyze, even with fast computers and streamlined algorithms. Neither distance matrix nor maximum parsimony methods can guarantee the correct tree; but generally, if a similar tree results from both of these fundamentally different methods, it is considered fairly reliable.
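To give a feel for what a distance matrix method does, the sketch below uses UPGMA, one simple member of that family, chosen here only as an example (the text does not single out a particular algorithm). It repeatedly joins the two closest clusters and averages their distances to the remaining clusters, weighted by cluster size. The function name, the input format, and the distances in the usage example are all assumptions made for illustration.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a tree as nested tuples from pairwise distances.

    `distances` maps frozenset({a, b}) to the distance between taxa a and b.
    """
    sizes = {name: 1 for name in names}   # cluster -> number of taxa it contains
    dist = dict(distances)                # working copy that also holds merged clusters
    while len(sizes) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: dist[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance from the new cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        for other in sizes:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D"]
    example = {
        frozenset(("A", "B")): 2, frozenset(("C", "D")): 4,
        frozenset(("A", "C")): 10, frozenset(("A", "D")): 10,
        frozenset(("B", "C")): 10, frozenset(("B", "D")): 10,
    }
    print(upgma(taxa, example))
```

UPGMA assumes a constant rate of change along all lineages, so it is best read as a teaching example rather than a recommended method.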

Number of possible unrooted and rooted trees for 2 to 10 sequences:

Seq #    # unrooted trees    # rooted trees
===========================================
2        1                   1
3        1                   3
4        3                   15
5        15                  105
6        105                 945
7        945                 10,395
8        10,395              135,135
9        135,135             2,027,025
10       2,027,025           34,459,425

The confidence level for portions of inferred trees can be determined by bootstrap tests, in which a subset of the original data is drawn with replacement and a new tree inferred. When this test is repeated hundreds or thousands of times, and the same groupings usually emerge, these parts of the tree are well supported. The fraction of similar groupings is placed next to the nodes in bootstrapped trees to convey the confidence in that part of the tree.

Caution is needed in interpreting bootstrap results. Several hundred iterations are needed for reliability, especially when analyzing large numbers of sequences, and thousands of iterations are recommended. Simulations show that bootstrapping underestimates the confidence level at high values and overestimates it at low values. Correction methods should be used to adjust for estimation biases. Some results may appear statistically significant because they emerge by random chance when a tree with a large number of branches is considered. A method that collapses branches to multifurcations at a stringent threshold of bootstrap values will yield a truer tree.
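The resampling step of a bootstrap test can be sketched as follows. Alignment columns are drawn with replacement (here, as many columns as the original alignment contains), and a grouping is re-inferred from each replicate. To keep the example short, "inferring a tree" is replaced by a much simpler stand-in: finding the pair of sequences with the smallest fraction of differing sites. The function names, the default of 1000 replicates, the fixed random seed, and the toy alignment are all assumptions for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of aligned sites at which two sequences differ."""
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y) / len(seq_a)


def closest_pair(alignment):
    """Stand-in for tree inference: the pair of names with the smallest p-distance.

    Ties are broken by alphabetical order of the names.
    """
    return min(
        combinations(sorted(alignment), 2),
        key=lambda pair: p_distance(alignment[pair[0]], alignment[pair[1]]),
    )


def bootstrap_support(alignment, replicates=1000, seed=0):
    """Return {pair: fraction of replicates in which that pair was closest}."""
    rng = random.Random(seed)
    names = sorted(alignment)
    length = len(alignment[names[0]])
    counts = Counter()
    for _ in range(replicates):
        # Draw column indices with replacement to build one pseudo-alignment.
        columns = [rng.randrange(length) for _ in range(length)]
        resampled = {
            name: "".join(alignment[name][c] for c in columns) for name in names
        }
        counts[closest_pair(resampled)] += 1
    return {pair: count / replicates for pair, count in counts.items()}


if __name__ == "__main__":
    toy = {"A": "ACGTACGTAC", "B": "ACGTACGTTC", "C": "ACGAACTTTC"}
    for pair, support in sorted(bootstrap_support(toy).items()):
        print(pair, support)
```

The fractions returned play the role of the values written next to nodes in a bootstrapped tree, although a real analysis would rebuild a full tree for every replicate.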


DNA and RNA sequences were first used for phylogenetic purposes in the mid-1980s. Woese and Pace constructed an evolutionary tree based upon 16S rRNA sequences, because homologs are found in all organisms as well as in mitochondria and chloroplasts (Figure 6.6).

The tree showed three major domains:

i. Bacteria, including traditional bacteria, mitochondria, and chloroplasts.
ii. Archaea, including extremophiles and little-known organisms.
iii. Eukarya.

Archaea and bacteria, although both prokaryotes, were as different genetically as eubacteria are from eukaryotes. Later work comparing other genes (e.g., 5S rRNAs, large rRNAs, and genes for fundamental proteins) supports this phylogeny and shows that eukaryotic mitochondrial and chloroplast genes have different origins than their nuclear counterparts.

6.4. THE TREE OF LIFE PROJECT (RETURN)

Figure 6.6. The tree of life. Showing the 3 Domains of life: the Bacteria, the Archaea, and the Eukarya.


6.5.1. Multigene Families (RETURN)

Eukaryotes often have tandemly arrayed copies of genes with very similar sequences (multigene families) that appear to be the result of gene duplication. The human globin genes are a classic example of a multigene family, with a general distribution of seven alpha-like genes on chromosome 16 and six beta-like genes on chromosome 11. Globin-like genes are found in many animals and even plants, suggesting an ancient origin.

Animal globin genes have the same general structure (three exons and two introns), but their number and order vary among species (Figure 6.7). Sequence and structure suggest duplication of an ancestral gene, which diverged to produce the alpha-like and beta-like genes. Duplication and divergence would then produce the modern alpha-like and beta-like gene groups. Variation in globin-gene number and distribution found in modern humans suggests that duplication and deletion of genes is an ongoing process still operating today. Duplications and deletions may result from unequal crossing-over. Duplications may also arise through transposition.

6.5.2. Gene Duplication and Gene Conversion (RETURN)

Duplication frees a copy of the sequence to undergo changes, since a functional copy will still exist. Most changes would produce less functional products, or even nonfunctional pseudogenes. A few changes, however, might alter function and/or pattern of expression to something more advantageous for the organism. Selection would allow these genes to become widespread in the population.

6.5. GENOME EVOLUTION (RETURN)

Figure 6.7. Comparison of the Globin gene families and pseudogenes from Human, Mouse, Rabbit, and Goat.


Misalignment between a pseudogene and a functional copy can result in gene conversion through recombination events. The allele on one homolog is copied and replaces the DNA sequence of the allele on the other homolog; it is not reciprocal exchange. Gene conversion gives organisms even more opportunities to create a gene with a new function.

Gene conversion continues to operate in modern humans. An example is two genes for red-green color vision on the X chromosome that undergo gene conversion in most of the known cases of spontaneous deficiencies in green color vision.

6.5.3. Domain (Exon) Shuffling (RETURN)

Often, less than an entire gene is duplicated, resulting in copies of protein domains. An example is human serum albumin, whose gene has three copies of a 195-amino-acid domain. Internal duplication is not a rapid method of producing proteins with new functions, however. Most complex proteins arise from assemblages of several protein domains performing different functions (e.g., substrate binding or membrane spanning). The beginnings and ends of exons and protein domains often correspond.

Gilbert (1978) proposed that most gene families today arose through domain shuffling involving duplication and rearrangement of domains (usually encoded by single exons). Domain shuffling theory proposes that introns were a feature of early life on Earth, even though they are now missing from prokaryotes. Numerous examples of complex genes made from segments of other genes are known, and clearly some novel functions have been created in this way.

[Figure: duplication of Gene X yields two copies, Gene X and Gene X, later shown as Gene X and Gene X’.]
