Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

79
Biology 4900 Biology 4900 Biocomputing

Transcript of Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Page 1: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Biology 4900Biology 4900

Biocomputing

Page 2: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Chapter 7Chapter 7

Molecular Phylogeny and Evolution

Page 3: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

OutlineOutline

• Introduction to evolution and phylogeny

• Nomenclature of relationship trees

• Five stages of molecular phylogeny:1. Selecting sequences2. Multiple sequence alignment3. Models of substitution4. Tree-building5. Tree evaluation

Page 4: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Historical background: EvolutionHistorical background: Evolution

• Evolution: Theory that groups of organisms change over time; descendants are structurally and functionally different than ancestors.

• Darwin (1859) proposed the role of natural selection in hereditary change.

– Heredity is generally conservative: changes in offspring tend to be small.

• Prior to 1950’s, phylogeny (evolution history of organisms) was studied by comparing living species with fossils.

– Modern techniques of molecular biology not available!

Pevsner, Bioinformatics and Functional Genomics, 2009; http://www.ucsd.tv/evolutionmatters/lesson2/study.shtml

Page 5: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Current View of the Tree of LifeCurrent View of the Tree of Life

http://www.allvoices.com/contributed-news/4553607-is-chimps-as-smart-as-human; After Pace NR (1997) Science 276:734; http://en.wikipedia.org/wiki/File:E_coli_at_10000x,_original.jpg

Page 6: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

4 main mechanisms of evolutionary change4 main mechanisms of evolutionary change• Natural Selection

– One or more differences in an organism that increase it’s likelihood of surviving in an environment long enough to reproduce and pass traits to offspring.

– Ex. Green beetles are easier to identify by predators than brown beetles.• Mutation

– Offspring may acquire small mutations in genetic material from 2 parents.– Ex. Green beetle parents produce a brown beetle offspring.

• Genetic drift– Change in frequency of gene variant in population due to random sampling.– Ex. A family of green beetles in a mixed population is killed by insecticide,

leaving a higher % of brown beetles for reproduction.• Migration

– Transfer of genes from one population to another.– Ex. Several brown beetles move into an area full of green beetles and

reproduce with the green beetles.

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 7: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Nucleotide SubstitutionsNucleotide Substitutions

• Nonsynonymous substitution: Nucleotide mutations that alters the amino acid sequence of a protein. • These result in a biological change in the organism, so they are subject

to natural selection.

• Synonymous substitution: Nucleotide mutations that do not alter amino acid sequences. • i.e., Different codons that encode the same AA.

Page 8: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Natural SelectionNatural Selection

• Positive natural selection: Force that drives the increase in prevalence of advantageous traits.– If a change leads to improvement, keep it

• Negative natural selection: The selective removal of harmful alleles.– “If it ain’t broke, don’t fix it”– Helps to explain why detrimental mutations are not

passed along hereditary lines

Page 9: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

History of Molecular EvolutionHistory of Molecular Evolution

• Studies of molecular evolution began with the first sequencing of proteins, beginning in the 1950s (following discovery of DNA in 1953 by Watson and Crick).

• Also in 1953, Frederick Sanger and colleagues determined the primary amino acid sequence of insulin (NP_000198)– Most amino acid changes in insulin A chain were restricted to a

disulfide loop region. These were non-random, but neutral changes.– Later studies at the DNA level showed that rate of nucleotide (and of

amino acid) substitution is about 6- to 10-fold higher in the C peptide (proinsulin, the insulin precursor), relative to the A and B chains.

Pevsner, Bioinformatics and Functional Genomics, 2009; http://en.wikipedia.org/wiki/Frederick_Sanger

Page 10: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Note the sequence divergence in the disulfide loop region of the A chain Fig. 7.3

Page 220

Page 11: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Number of nucleotide substitutions/site/year

0.1 x 10-9

0.1 x 10-91 x 10-9

Fig. 7.3Page 220

Page 12: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Surprisingly, insulin from the guinea pig (and from the related coypu) evolve 7 times faster than insulin from other species. Why?

The answer: Guinea pig and coypu insulin do not bind two zinc ions, while insulin molecules frommost other species do. There was a relaxation on the structural constraints of these molecules, and so the genes diverged rapidly.

Page 219

Sus scrofa insulin (1ZNI.pdb)

Historical background: insulinHistorical background: insulin

Page 13: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Guinea pig and coypu insulin have undergone anGuinea pig and coypu insulin have undergone anextremely rapid rate of evolutionary changeextremely rapid rate of evolutionary change

Arrows indicate positions at which guinea pig insulin (A and B chains) differs from human and mouse

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 220

A chain

B chain

Page 14: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Historical Background: Vasopressin and OxytocinHistorical Background: Vasopressin and Oxytocinvasopressin oxytocin

• Peptides vasopressin and oxytocin also sequenced in 1950’s.• Sequences differ by only 2 residues, but functions differ significantly

• Vasopressin – Reduces the volume of urine formed in the body to retain water

• Oxytocin– Causes contractions in smooth muscles of uterus during labor

Page 15: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Molecular clock hypothesisMolecular clock hypothesis• In 1960s, sequence data accumulated for small, abundant

proteins (e.g., globins, cytochromes c, fibrinopeptides). • Some proteins appeared to evolve slowly, while others

evolved rapidly.• Linus Pauling, Emanuel Margoliash and others proposed

the hypothesis of a molecular clock: • For every given protein, rate of molecular evolution is

approximately constant in all evolutionary lineages.

Species A

Species B

Species C

Species D

Protein X

Mutations over time

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 221

Page 16: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Molecular clock hypothesisMolecular clock hypothesis

• Richard Dickerson (1971) plotted sequence data from three protein families: cytochrome c, hemoglobin, and fibrinopeptides.

• X-axis shows the divergence times of the species, estimated from paleontological data.

• Y-axis shows m, the corrected number of amino acid changes per 100 residues.

• n is observed number of amino acid changes per 100 residues, and it is corrected to m to account for changes that occur but are not observed.

Dickerson, 1972

Millions of years since divergence

corr

ecte

d a

min

o a

cid

ch

ang

es

per

100

res

idu

es (m

)

Page 17: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Molecular clock hypothesisMolecular clock hypothesis

Page 18: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 2225

Molecular Clock: Calculating rates of Molecular Clock: Calculating rates of nucleotide substitutionsnucleotide substitutions

• Rate of nucleotide substitution (r) = number of nucleotide substitutions that occur per site per year (same for AAs)

• Time of divergence (T) = how long ago the two sequences split from a common sequence.

• A constant (K) defines the number of non-synonymous substitutions per site.

• α-globins from rat and human differ by 0.093 non-synonymous substitutions by site• Rate of change = 0.58 × 10-9 nonsynonymous nucleotide substitutions per site per year

Page 19: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Dickerson drew the following conclusions:

• For each protein, the data lie on a straight line. Thus, the rate of amino acid substitution has remained constant for each protein.

• The average rate of change differs for each protein. The time for a 1% change to occur between two lines of evolution is 20 MY (cytochrome c), 5.8 MY (hemoglobin), and 1.1 MY (fibrinopeptides).

• The observed variations in rate of change reflect functional constraints imposed by natural selection.

Molecular clock hypothesis: conclusionsMolecular clock hypothesis: conclusions

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 223

Page 20: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 2225

Molecular clock hypothesis: problemsMolecular clock hypothesis: problems

This hypothesis has largely been discarded due to the following reasons:•Rate of molecular evolution varies among different organisms (e.g., viral sequences changes very rapidly).•Clock varies among different genes and parts of genes – guided in part by selection (e.g., animals with shorter generational times may have faster clocks).•Clock only applicable when gene retains its function over evolutionary time.

– Genes may become nonfunctional– Rate of evolution may accelerate following gene duplication

Page 21: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 2225

Neutral theory of molecular evolutionNeutral theory of molecular evolution

• Proposed after the Molecular Clock hypothesis to address inconsistencies

• E.g., significant amount of DNA polymorphism across all species that is difficult to account for by conventional natural selection.

• Kimura (1969, 1983) proposed this as an alternative model to Darwinian evolution at DNA level.– Observed that rate of AA substitution averages ~1 change per 28 × 106

years for proteins of 100 residues.– Corresponding rate of nucleotide substitution is 1 base pair in population

genome every 2 years.– Kimura concluded that most DNA substitutions must be neutral and that

main cause of evolutionary change at molecular level is random drift of mutant alleles.

Page 22: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Page 231

Molecular phylogeny: nomenclature of treesMolecular phylogeny: nomenclature of trees

• Molecular Phylogeny: Study of evolutionary relationships among organisms or molecules– Determined via molecular biology

• Two main kinds of information inherent to any tree: – topology– branch length

Page 23: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

A

B

C

D

E

F

G

HI

time

6

2

1 1

2

1

2

6

1

2

2

1

A

BC

2

1

2

D

Eone unit

Molecular phylogeny uses trees to depict evolutionaryrelationships among organisms. These trees are basedupon DNA and protein sequence data.

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232

Change in Time Change in Magnitude

Cladogram Phylogram

Page 24: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

A

B

C

D

E

F

G

HI

timeFig. 7.8Page 232

Tree NomenclatureTree Nomenclature

Node (intersection or terminating point of two or more branches)Branch (edge) connecting 2 nodes

Page 25: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

A

B

C

D

E

F

G

HI

time

Tree Nomenclature: NodesTree Nomenclature: Nodes

Root NodeInternal Node (Taxon)Operational taxonomic unit (OTU)

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232

OTUs: Taxa that provide observable features (e.g., existing protein sequences, morphological features)

Page 26: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

A

B

C

D

E

F

G

HI

time

6

2

1 1

2

1

2

6

1

2

2

1

A

BC

2

1

2

D

Eone unit

Tree Nomenclature: Cladogram vs. PhylogramTree Nomenclature: Cladogram vs. Phylogram

Branches are unscaled... Branches are scaled...

…branch lengths areproportional to number ofamino acid changes

…OTUs are neatly aligned,and nodes reflect time

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232

Cladogram Phylogram

Page 27: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

A

B

C

D

E

F

G

HI

time

6

2

1 1

2

1

2

6

1

2

2

1

A

BC

22

D

Eone unit

Tree nomenclature: Internal NodesTree nomenclature: Internal Nodes

bifurcatinginternal node

multifurcatinginternalnode

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232

Page 28: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

A

B

C

D

E

F

G

HI

6

2

1 1

2

1

2

Tree nomenclature: cladesTree nomenclature: clades

Clade ABF (monophyletic group)

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232

Clade: Group of taxa derived from a common ancestor* Groups may include sub-groups

Clade CDH (monophyletic group)

A

B

C

D

E

F

G

HI

6

2

1 1

2

1

2

Clade ABF/CDH/G (paraphyletic)

Page 29: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Tree nomenclature: Newick FormatTree nomenclature: Newick Format

http://en.wikipedia.org/wiki/Newick_format

(,,(,)); no nodes are named (A,B,(C,D)); leaf nodes are named (only the terminal values)(A,B,(C,D)E)F; all nodes are named (:0.1,:0.2,(:0.3,:0.4):0.5); all but root node have a distance to parent (:0.1,:0.2,(:0.3,:0.4):0.5):0.0; all have a distance to parent (A:0.1,B:0.2,(C:0.3,D:0.4):0.5); distances and leaf names (popular) (A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F; distances and all names ((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A; a tree rooted on a leaf node (rare)

Newick tree format is a way to represent graph-theoretical trees with edge lengths using parentheses and commas

Page 30: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Tree nomenclature: Newick FormatTree nomenclature: Newick Format

http://en.wikipedia.org/wiki/Newick_format

(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F; distances and all names

A, B, E are branches related to root FC, D are branches related to node E

Page 31: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

ExamplesExamples

Ex. 1 Convert the following Newick format to a tree format:

(A:0.3,B:0.5,(C:0.7,D:0.9)E:0.6)F

Ex. 2 Convert the following Newick format to a tree format:

(A:0.1,(D:1.1,E:0.8,(I:0.1,J:0.2)F:0.4)B:0.2,(G:0.7,H:0.9)C:0.6)K

A, B, E are branches from root FC, D are branches from node E

A, B, C are branches from root KD, E, F are branches from node BG, H are branches from node CI, J are branches from node F

Page 32: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

ExamplesExamples

Ex. 1 Convert the following tree format to a Newick format:

((((A:0.2, B:0.3)F:0.3,(C:0.5, D:0.3)G:0.2):0.3, E:0.7)E:1.0)H

A, B are branches from node FC, D are branches from node GF, G, E are branches from root H

F

GH

Page 33: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

ExamplesExamples

Ex. 1 Convert the following tree format to a Newick format:

(((A:1.00,B:2.00)n1:3.00,(C:4.00,D:5.00)n2:6.00)n3:7.00,E:8.00)n4

A, B are branches from node n1C, D are branches from node n2n1, n2 are branches from node n3 n3, E are branches from root n4

Page 34: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

http://evolution.berkeley.edu/evosite/evo101/IIBPhylogeniesp2.shtml

Clade: Grouping that includes a common ancestor and all the descendents (living and extinct) of that ancestor. These can be represented by phylogenic trees.

Examples of cladesExamples of clades

Page 35: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Examples Examples of cladesof clades

Lindblad-Toh et al., Nature 438: 803 (2005), fig. 10

Page 36: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Tree RootsTree Roots

• The root of a phylogenetic tree represents the common ancestor of the sequences. Some trees are unrooted, and thus do not specify the common ancestor.

past

present

1

2 3 4

5

6

7 8

9

4

5

87

1

2

36

Rooted tree (specifies evolutionary path)

Unrooted tree

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 233

Page 37: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Tree RootsTree Roots

• A tree can be rooted using an outgroup (that is, a taxon known to be distantly related from all other Operational Taxonomic Units).

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 233

past

present

1

2 3 4

5

6

7 8

9

Rooted tree

1

2 3 4

5 6

Outgroup(used to place the root)

7 9

10

root

8

Page 38: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Cavalii-Sforza and Edwards (1967) derived the numberof possible unrooted trees (NU) for n OTUs (n > 3):

NU =

The number of bifurcating rooted trees (NR)

NR =

For 10 OTUs (e.g. 10 DNA or protein sequences),the number of possible rooted trees is 34 million,and the number of unrooted trees is 2 million.

(2n-5)!2n-3(n-3)!

(2n-3)!2n-2(n-2)!

Unrooted Trees Are Faster to Calculate Than Rooted Unrooted Trees Are Faster to Calculate Than Rooted TreesTrees

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 235

Page 39: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

[1] Selection of sequences for analysis (DNA vs. Protein)

[2] Multiple sequence alignment

[3] Selection of a substitution model

[4] Tree building

[5] Tree evaluation

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 243

Five stages of phylogenetic analysisFive stages of phylogenetic analysis

Page 40: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 240

Stage 1: Which is more useful: DNA or protein?Stage 1: Which is more useful: DNA or protein?DNADNA lets you study synonymous and nonsynonymous changes

Synonymous changes do not have corresponding protein changes.

synonymous substitution rate (dS)

nonsynonymous substitution rate (dN)

If dS > dN, DNA sequence is under negative (purifying) selection.

Selective removal of harmful alleles Limits change in the sequence (e.g. insulin A chain)

If dS < dN, positive selection occurs. Ex. Duplicated gene may evolve rapidly to assume new functions.

Rates of transitions and transversions can be measured

Page 41: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Evaluation of Changes: Transitions vs. TransversionsEvaluation of Changes: Transitions vs. Transversions

GC and AT are normal pairings

Transitions: interchanges of two-ring purines (A G) or of one-ring pyrimidines (C T): These involve bases of similar shape.

Transversions: Interchanges of purine for pyrimidine bases, which therefore involve exchange of one-ring and two-ring structures

Page 42: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 241, Figure 7.16

Stage 1: Use of DNA or proteinStage 1: Use of DNA or proteinDNA•Some substitutions in a DNA sequence alignment can be directly observed:

A A A AAG G → C G → C CC Parallel substitutionT T T TTC C C → G CG Single substitutionC C C → T → A CA Sequential substitutionT T T TTG G G GGT T → A T → C AC Coincidental substitutionT T → G T GTC C → G C → T → G GG Convergent substitutionA A A AAG G → T → G G GG Back substitution

hypothetical ancestral

globin

human globin mouse globin observed alignment

Page 43: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Stage 1: Use of DNA or proteinStage 1: Use of DNA or protein

In this example, there are 2 observed amino acid mismatches and 5 observed nucleotide mismatches.

Page 44: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 243

Stage 1: Use of DNA or proteinStage 1: Use of DNA or protein

Proteins•Proteins frequently more informative than DNA in pairwise alignment and in BLAST searching.•Proteins have 20 states (amino acids) instead of only 4 for DNA, so there is a stronger phylogenetic signal (we can observe changes over longer period).•Nucleotides are unordered characters: any one nucleotide can change to any other in one step.•Amino acid sequences are partially ordered character states: there is a variable number of states between the starting value and the final value (i.e., may require more than 1 nucleotide change).

Page 45: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Five stages of phylogenetic analysisFive stages of phylogenetic analysis

[1] Selection of sequences for analysis

[2] Multiple sequence alignment

[3] Selection of a substitution model

[4] Tree building

[5] Tree evaluation

Page 46: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Stage 2: Multiple sequence alignmentStage 2: Multiple sequence alignment

Fundamental basis of phylogenetic tree is the MSA.

1.Confirm that all sequences are homologous

– Trees can still be generated even with misalignment, or if a non-homologous sequence is included in the alignment.

2.Adjust gap creation and extension penalties as needed to optimize the alignment

3.Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all taxa (delete columns having incomplete data or gaps. Similar to masking).

-----GADEG-YFGPVILAADGEVA---dlgnvGA-EGDYFGPAI--AEGEVArpl

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 47: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Five stages of phylogenetic analysisFive stages of phylogenetic analysis

[1] Selection of sequences for analysis

[2] Multiple sequence alignment

[3] Selection of a substitution model

[4] Tree building

[5] Tree evaluation

Page 48: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Stage 3: Selection of substitution modelStage 3: Selection of substitution model

• There are different ways to mathematically model substitutions that occur between sequences.

• Simplest approach to measuring distances between sequences is to align pairs of sequences, and then to count the number of differences.

• Degree of divergence is called the Hamming distance. • For alignment of length N with n sites at which there are

differences, the degree of divergence D is:D = (n / N) × 100

• This models for observed differences, which do not equal genetic distance!– Genetic distance involves mutations that are not observed directly.

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 49: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

• Here, d is the distance and p is the proportion of residues that differ.

• The Poisson correction distance assumes equality of substitution rates among sites and equal amino acid frequencies while correcting for multiple substitutions at the same site.

Stage 3: Selection of substitution modelStage 3: Selection of substitution model

Pevsner, Bioinformatics and Functional Genomics, 2009

The Poisson correction was proposed to account for unobserved substitutions:

d = - ln(1 – p)

Page 50: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Poisson correction increased the distance values.

Also see reductions in some bootstrap values (red numbers).

Fundamental units of clades did not change.

Page 51: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Jukes and Cantor (1969) proposed a different corrective formula:

• This model describes the probability that one nucleotide will change into another.

• Still imperfect, as it assumes that each residue is equally likely to change into any other (i.e. the rate of transversions equals the rate of transitions).

• In practice, the transition is typically greater than the transversion rate.

Stage 3: Selection of substitution modelStage 3: Selection of substitution model

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 52: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Stage 3: Selection of substitution modelStage 3: Selection of substitution model

Jukes and Cantor (1969) proposed a corrective formula:

Consider an alignment where 3/60 aligned residues differ.The normalized Hamming distance is 3/60 = 0.05.The Jukes-Cantor correction is

When 30/60 aligned residues differ, the Jukes-Cantor correction is more substantial:

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 53: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Five stages of phylogenetic analysisFive stages of phylogenetic analysis

[1] Selection of sequences for analysis

[2] Multiple sequence alignment

[3] Selection of a substitution model

[4] Tree building

[5] Tree evaluation

Page 54: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Stage 4: Tree-building methodsStage 4: Tree-building methods

Distance-based methods•Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score. •Examples of distance-based algorithms are UPGMA and neighbor-joining.Character-based methods•Include maximum parsimony and maximum likelihood.•Parsimony analysis involves the search for the tree with the fewest amino acid (or nucleotide) changes that account for the observed differences between taxa.

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 55: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Tree-building methods: UPGMATree-building methods: UPGMA

1 2

3

4

5

UPGMA (unweighted pair group method using arithmetic mean) is based on clustering of sequences

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 56: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Tree-building methods: UPGMA

Step 1: compute the pairwise distances of allthe proteins. Get ready to put the numbers 1-5at the bottom of your new tree.

1 2

3

4

5

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 57: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Step 2: Find the two proteins with the smallest pairwise distance. Cluster them.

1 2

3

4

5

1 2

6

Tree-building methods: UPGMATree-building methods: UPGMA

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 58: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Tree-building methods: UPGMATree-building methods: UPGMA

Step 3: Do it again. Find the next two proteins with the smallest pairwise distance. Cluster them.

1 2

3

4

5

1 2

6

4 5

7

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 59: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Tree-building methods: UPGMATree-building methods: UPGMA

Step 4: Keep going. Cluster.

1 2

3

4

5 1 2

6

4 5

7

3

8

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 60: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Tree-building methods: UPGMATree-building methods: UPGMA

Step 4: Last cluster! This is your tree.

1 2

3

4

5

1 2

6

4 5

7

3

8

9

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 61: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Another View of UPGMA: Pairwise Distance MatrixAnother View of UPGMA: Pairwise Distance Matrix

Pevsner, Bioinformatics and Functional Genomics, 2009

AA BB CC DD

BB 99 ---- ---- ----

CC 88 1111 ---- ----

DD 1212 1515 1010 ----

EE 1515 1818 1313 55

AA BB CC

BB ddABAB ---- ----

CC ddACAC ddBCBC ----

DD ddADAD ddBDBD ddCDCD

Page 62: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Another View of UPGMA: Pairwise Distance MatrixAnother View of UPGMA: Pairwise Distance Matrix

A B C D E

B 2

C 4 4

D 6 6 6

E 6 6 6 4

F 8 8 8 8 8

Example: http://www.icp.ucl.ac.be/~opperd/private/upgma.html

Start with pair having shortest distance (A, B)The branching point is positioned at a distance of 2 / 2 = 1 substitution. We thus constuct a subtree as follows:

Page 63: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Another View of UPGMA: Pairwise Distance MatrixAnother View of UPGMA: Pairwise Distance Matrix

A, B C D E

C 4

D 6 6

E 6 6 4

F 8 8 8 8

Example: http://www.icp.ucl.ac.be/~opperd/private/upgma.html

Calculate a new distance matrix

dist(A,B),C = (distAC + distBC) / 2 = 4 dist(A,B),D = (distAD + distBD) / 2 = 6 dist(A,B),E = (distAE + distBE) / 2 = 6 dist(A,B),F = (distAF + distBF) / 2 = 8

Page 64: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Another View of UPGMA: Pairwise Distance MatrixAnother View of UPGMA: Pairwise Distance Matrix

A, B C D, E

C 4

D, E 6 6

F 8 8 8

Example: http://www.icp.ucl.ac.be/~opperd/private/upgma.html

Third cycle of matrix

dist(A,B),C = (distAC + distBC) / 2 = 4 dist(A,B),(D,E) = (dist(A,B)D + dist(AB)E) / 2 = 6 distC,(D,E) = (distCD + distCE) / 2 = 6dist(D,E),F = (distDF + distEF) / 2 = 8

Page 65: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Another View of UPGMA: Pairwise Distance MatrixAnother View of UPGMA: Pairwise Distance Matrix

AB, C D, E

D, E 6

F 8 8

Example: http://www.icp.ucl.ac.be/~opperd/private/upgma.html

4th cycle of matrix

dist(AB,C),(D,E) = (dist(AB)DE + dist(C)DE) / 2 = 6 dist(AB,C),F = (dist(AB)F + dist(C)F) / 2 = 8

Page 66: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Another View of UPGMA: Pairwise Distance MatrixAnother View of UPGMA: Pairwise Distance Matrix

ABC, DE

F 8

Example: http://www.icp.ucl.ac.be/~opperd/private/upgma.html

5th cycle of matrix

dist(ABC,DE),(F) = (dist(ABC)F + dist(DE)F) / 2 = 8

Page 67: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

UPGMA is a simple approach for making trees.

• A UPGMA tree is always rooted.

• An assumption of the algorithm is that the molecular clock is constant for sequences in the tree. If there are unequal substitution rates, the tree may be wrong.

• While UPGMA is simple, it is less accurate than the neighbor-joining approach (described next).

Distance-based methods: UPGMA treesDistance-based methods: UPGMA trees

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 68: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

The neighbor-joining method of Saitou and Nei (1987) is especially useful for making trees with: •Large number of taxa•Lineages with varying rates of evolution

Begin by placing all the taxa in a star-like structure.

Making trees using neighbor-joiningMaking trees using neighbor-joining

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 69: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Tree-building methods: Neighbor joiningTree-building methods: Neighbor joining

Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 70: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Define the distance from X to Y by

dXY = 1/2(d1Y + d2Y – d12)

Pevsner, Bioinformatics and Functional Genomics, 2009

Tree-building methods: Neighbor joiningTree-building methods: Neighbor joining

Page 71: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Example of aneighbor-joiningtree: phylogeneticanalysis of 13RBPs

Page 72: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

0.1

1RRO

2PVB

1A75

1BU3

5PAL

1PVA

1RWY

1XO5

1OMR1G8I1S6C A

1K9U2OPO1QV1

1SL8

2SCP

2H2K2HQ8

1SRA

1S6C B

2CCM

1WDC A

2BL0 A 2AAO

1WDC B 2BL0 B

1WDC C2BL0 C

1EXR 1GGZ

1K96 1XK4 A

1E8A

1J55

1QX2

4ICB

1K94

1Y1X

2HQ8

2H2K

1WDC C 2BL0 C

S100 S100

S100

S100

S100

Calbindin d9k

Calbindin d9k

Penta-EF

Penta-EF

Parvalbumin

Parvalbumin

Parvalbumin

Parvalbumin

ParvalbuminParvalbumin

Parvalbumin

Polcalcin

Polcalcin

Osteonectin

R2

R1

Unrooted N-J Phylogenic Tree generated by Treeview

Distribution of SC angles

Page 73: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Parsimony PrincipleParsimony Principle

• The parsimony principle tells us to choose the simplest scientific explanation that fits the evidence. In tree-building, this means that the best hypothesis is the one that requires the fewest evolutionary changes.

http://evolution.berkeley.edu/evolibrary/article/phylogenetics_08

Hypothesis 1 is the most parsimonious because it requires fewer evolutionary changes. Hypothesis 2 requires formation of bony skeleton at 2 different points.

Page 74: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

• Rather than pairwise distances between proteins, aligned columns of AA residues are evaluated.

• The main idea of character-based methods is to find the tree with the shortest branch lengths. Thus we seek the most parsimonious (“simple”) tree.

• Identify informative sites. For example, constant characters are not parsimony-informative.

• Construct trees, counting the number of changes required to create each tree.

• Select the shortest tree (or trees).

Building trees using character-based methodsBuilding trees using character-based methods

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 75: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Pevsner, Bioinformatics and Functional Genomics, 2009

As an example of tree-building using maximum parsimony, consider these four taxa:

AAGAAAGGAAGA

How might they have evolved from a common ancestor such as AAA?

Page 76: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Parsimony PrincipleParsimony Principle

• The parsimony principle tells us to choose the simplest scientific explanation that fits the evidence. In tree-building, this means that the best hypothesis is the one that requires the fewest evolutionary changes.

http://evolution.berkeley.edu/evolibrary/article/phylogenetics_08

AAG AAA GGA AGA

AAAAAA

1 1AGA

AAG AGA AAA GGA

AAAAAA

1 2AAA

AAG GGA AAA AGA

AAAAAA

1 1AAA

1 2

Cost = 3 Cost = 4 Cost = 4

1

In maximum parsimony, choose the tree(s) with the lowest cost (shortest branch lengths).

Page 77: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

[1] Selection of sequences for analysis

[2] Multiple sequence alignment

[3] Selection of a substitution model

[4] Tree building

[5] Tree evaluation

Five stages of phylogenetic analysisFive stages of phylogenetic analysis

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 78: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

Bootstrapping is common approach to measuring robustness of a tree topology. Given a branching order, how consistently does an algorithm find that branching order in a randomly permuted version of the original data set?

To bootstrap, create artificial dataset by randomly sampling columns from your multiple sequence alignment. Make dataset the same size as the original. Do 100 (to 1,000) bootstrap replicates. Observe the percent of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates. >70% is considered significant.

Stage 5: Evaluating trees via bootstrappingStage 5: Evaluating trees via bootstrapping

Pevsner, Bioinformatics and Functional Genomics, 2009

Page 79: Biology 4900 Biocomputing. Chapter 7 Molecular Phylogeny and Evolution.

In 61% of the bootstrapresamplings, ssrbp and btrbp(pig and cow RBP) formed adistinct clade. In 39% of the cases, another protein joinedthe clade (e.g. ecrbp), or oneof these two sequences joinedanother clade.