Post on 22-Jun-2022
Introduction to Molecular Phylogeny
n Starting point: a set of homologous, aligned DNA orprotein sequences
n Result of the process: a tree describing evolutionaryrelationships between studied sequences= a genealogy of sequences= a phylogenetic tree
CLUSTAL W (1.74) multiple sequence alignment
Xenopus ATGCATGGGCCAACATGACCAGGAGTTGGTGTCGGTCCAAACAGCGTT---GGCTCTCTAGallus ATGCATGGGCCAGCATGACCAGCAGGAGGTAGC---CAAAATAACACCAACATGCAAATGBos ATGCATCCGCCACCATGACCAGCAGGAGGTAGCACCCAAAACAGCACCAACGTGCAAATGHomo ATGCATCCGCCACCATGACCAGCAGGAGGTAGCACTCAAAACAGCACCAACGTGCAAATGMus ATGCATCCGCCACCATGACCAGCAGGAGGTAGCACTCAAAACAGCACCAACGTGCAAATGRattus ATGCATCCGCCACCATGACCAGCGGGAGGTAGCTCTCAAAACAGCACCAACGTGCAAATG ****** **** ********* * *** * * *** * * *
Phylogenetic Tree
n Internal branch : between 2 nodes. External branch : between a node and aleaf
n Horizontal branch length is proportional to evolutionary distances betweensequences and their ancestors (unit = substitution / site).
n Tree Topology = shape of tree = branching order between nodes
Homo
Bos
Mus
Rattus0.011
0.025
0.012
0.011
Gallus
0.038
0.066
0.01
Root
Node
Leaf
Branch
Alignment and Gaps
n The quality of the alignment is essential : each column ofthe alignment (site) is supposed to contain homologousresidues (nucleotides, amino acids) that derive from acommon ancestor.
==> Unreliable parts of the alignment must be omittedfrom further phylogenetic analysis.
n Most methods take into account only substitutions ; gaps(insertion/deletion events) are not used.
==> gaps-containing sites are ignored.Xenopus ATGCATGGGCCAACATGACCAGGAGTTGGTGTCggtCCAAACAGCGTT---GGCTCTCTAGallus ATGCATGGGCCAGCATGACCAGCAGGAGGTAGC---CAAAATAACACCaacATGCAAATGBos ATGCATCCGCCACCATGACCAGCAGGAGGTAGCagtCAAAACAGCACCaacGTGCAAATGHomo ATGCATCCGCCACCATGACCAGCAGGAGGTAGCagtCAAAACAGCACCaacGTGCAAATGMus ATGCATCCGCCACCATGACCAGCAGGAGGTAGCactCAAAACAGCACCaacGTGCAAATGRattus ATGCATCCGCCACCATGACCAGCGGGAGGTAGCtctCAAAACAGCACCaacGTGCAAATG
Rooted and Unrooted Trees
n Most phylogenetic methods produce unrooted trees. Thisis because they detect differences between sequences, buthave no means to orient residue changes relatively to time.
n Two means to root an unrooted tree :
l The outgroup method : include in the analysis a group ofsequences known a priori to be external to the group under study;the root is by necessity on the branch joining the outgroup to othersequences.
l Make the molecular clock hypothesis : all lineages aresupposed to have evolved with the same speed since divergencefrom their common ancestor. The root is at the equidistant pointfrom all tree leaves.
Unrooted Tree
0.02
RattusMus
BosHomo
Gallus
Rooted Tree
Xenopus
Homo
Bos
Mus
Rattus
Gallus0.02
Universal phylogeny(1)
deduced from comparisonof SSU and LSU rRNAsequences (2508homologous sites) usingKimura’s 2-parameterdistance and the NJmethod.
The absence of root in thistree is expressed using acircular design.
BacteriaArchaea
Eucarya
Universal phylogeny(2)
Schematic drawing of a universal rRNA tree.The location of the root corresponds to that proposed byreciprocally rooted gene phylogenies.
Brown & Doolittle (1997) Microbiol.Mol.Biol.Rev. 61:456-502
Number of possible tree topologiesfor n taxa
n Ntrees
4 3
5 15
6 105
7 945
... ...
10 2,027,025
... ...
20 ~ 2 x1020
Ntrees =3.5.7...(2n−5)= (2n−5)!2n−3(n−3)!
Methods for Phylogeneticreconstruction
Three main families of methods :l Parsimony
l Distance methods
l Maximum likelihood methods
Parsimony (1)n Step 1: for a given tree topology (shape), and for a given
alignment site, determine what ancestral residues (at treenodes) require the smallest total number of changes in thewhole tree.Let d be this total number of changes.
Example: At this site and for this tree shape, at least 3 substitution events areneeded to explain the nucleotide pattern at tree leaves. Several distinctscenarios with 3 changes are possible.
C
1 2 3
4
56
A
A
G GG
AA A
G
X : ancestral nucleotide : substitution event
1 2 3
4
56
A
A
G G
GG
G GG
C
Parsimony (2)n Step 2:
l Compute d (step 1) for each alignment site.
l Add d values for all alignment sites.
l This gives the length L of tree.
n Step 3:l Compute L value (step 2) for each possible tree shape.
l Retain the shortest tree(s)= the tree(s) that require the smallest number of changes= the most parsimonious tree(s).
Some properties of Parsimony
n Several trees can be equally parsimonious (same length, theshortest of all possible lengths).
n The position of changes on each branch is not uniquelydefined=> parsimony does not allow to define tree branch lengths ina unique way.
n The number of trees to evaluate grows extremely fast withthe number of processed sequences :ð Parsimony can be very computation - intensive.
ð The search for the shortest tree must often be restricted to a fractionof the set of all possible tree shapes (heuristic search)=> there is no mathematical certainty of finding the shortest (mostparsimonious) tree.
Building phylogenetic trees bydistance methods
General principle :
Sequence alignment
ê (1)
Matrix of evolutionary distances between sequence pairs
ê (2)
(unrooted) tree
n (1) Measuring evolutionary distances.
n (2) Tree computation from a matrix of distance values.
A B C
A B CA 0
B 1 0
C 4 3 0
tree Distance matrix
Correspondence between trees anddistance matrices
•Any phylogenetic tree induces a matrix ofdistances between sequence pairs• “Perfect” distance matrices correspond to asingle phylogenetic tree
Evolutionary Distances
n They measure the total number ofsubstitutions that occurred on both lineagessince divergence from last commonancestor.
n Divided by sequence length.
n Expressed in substitutions / site
ancestor
sequence 1 sequence 2
Quantification of evolutionary distances (1):
The problem of hidden or multiple changesn D (true evolutionary distance) ≥ fraction of
observed differences (p)
n D = p + hidden changes
n Through hypotheses about the nature of the residuesubstitution process, it becomes possible toestimate D from observed differences betweensequences.
n Estimated D : d
A
A G
C
A G
G
A A
A
A G
C
Quantification of evolutionary distances(2):
Jukes and Cantor’s distance (DNA)
n Hypotheses of the model (Jukes & Cantor, 1969) :(a) All sites evolve independently and following the same process.
(b) All substitutions have the same probability.
(c) The base substitution process is constant in time.
n Quantification of evolutionary distance (d) as a function ofthe fraction of observed differences (p):
N = number of compared sites
d = −3
4ln(1 −
4
3p)
V(d) =9p(1− p)
(3 − 4 p)2 N
p d0,10 0,110,20 0,230,40 0,570,60 1,210,75 + ∞
Quantification of evolutionary distances (3):
Poisson distances (proteins)
n Hypotheses of the model :(a) All sites evolve independently and following the same process.
(b) All substitutions have the same probability.
(c) The amino acid substitution process is constant in time.
n Quantification of evolutionary distance (d) as a function ofthe fraction of observed differences (p) :
d = - ln(1 - p)
n !! The hypotheses of the Jukes-Cantor and the Poissonmodels are very simplistic !!
Quantification of evolutionary distances (3bis):
PAM and Kimura’s distances (proteins)n Hypotheses of the model (Dayhoff, 1979) :
(a) All sites evolve independently and following the same process.
(b) Each type of amino acid replacement has a given, empirical probability :Large numbers of highly similar protein sequences have been collected;probabilities of replacement of any a.a. by any other have been tabulated.
(c) The amino acid substitution process is constant in time.
n Quantification of evolutionary distance (d) :the number of replacements most compatible with the observedpattern of amino acid changes and individual replacementprobabilities.
n Kimura’s empirical approximation : d = - ln( 1 - p - 0.2 p2 ) (Kimura, 1983) where p = fraction of observed differences
n Hypotheses of the model :(a) All sites evolve independently and following the same process.
(b) Substitutions occur according to two probabilities :
One for transitions, one for transversions.
Transitions : G <—>A or C <—>T Transversions : other changes
(c) The base substitution process is constant in time.
n Quantification of evolutionary distance (d) as a function of thefraction of observed differences (p: transitions, q: transversions):
Quantification of evolutionary distances (4):
Kimura’s two parameter distance (DNA)
d = −1
2ln[(1− 2 p − q) 1 − 2q ]
Kimura (1980) J. Mol. Evol. 16:111
Quantification of evolutionary distances (5):
Synonymous and non-synonymous distances(coding DNA): Ka, Ks
n Hypothesis of previous models :(a) All sites evolve independently and following the same process.
n Problem: in protein-coding genes, there are two classes ofsites with very different evolutionary rates.l non-synonymous substitutions (change the a.a.): slow
l synonymous substitutions (do not change the a.a.): fast
n Solution: compute two evolutionary distancesl Ka = non-synonymous distance
l Ka = nbr. non-synonymous substitutions / nbr. non-synonymous sites
l Ks = synonymous distance
l Ks = nbr. synonymous substitutions / nbr. synonymous sites
The genetic code
TTT Phe TCT Ser TAT Tyr TGT CysTTC Phe TCC Ser TAC Tyr TGC CysTTA Leu TCA Ser TAA stop TGA stopTTG Leu TCG Ser TAG stop TGG Trp
CTT Leu CCT Pro CAT His CGT ArgCTC Leu CCC Pro CAC His CGC ArgCTA Leu CCA Pro CAA Gln CGA ArgCTG Leu CCG Pro CAG Gln CGG Arg
ATT Ile ACT Thr AAT Asn AGT SerATC Ile ACC Thr AAC Asn AGC SerATA Ile ACA Thr AAA Lys AGA ArgATG Met ACG Thr AAG Lys AGG Arg
GTT Val GCT Ala GAT Asp GGT GlyGTC Val GCC Ala GAC Asp GGC GlyGTA Val GCA Ala GAA Glu GGA GlyGTG Val GCG Ala GAG Glu GGG Gly
Substitution rate = f (mutation,selection)
NB: the vast majority of mutations are either neutral (i.e.have no phenotypic effect), or deleterious.Advantageous mutations are very rare.
Mutations
Selection
Substitutions
Quantification of evolutionary distances (6):
Calculation of Ka and Ks
n The details of the method are quite complex. Roughly :l Split all sites of the 2 compared genes in 3 categories :
I: non degenerate, II: partially degenerate, III: totally degenerate
l Compute the number of non-synonymous sites = I + 2/3 II
l Compute the number of synonymous sites = III + 1/3 II
l Compute the numbers of synonymous and non-synonymous changes
l Compute, with Kimura’s 2-parameter method, Ka and Ks
n Frequently, one of these two situations occur :l Evolutionarily close sequences : Ks is informative, Ka is not.
l Evolutionarily distant sequences : Ks is saturated , Ka is informative.
Li, Wu & Luo (1985) Mol.Biol.Evol. 2:150
Ka and Ks : example
# sites observed diffs. J & C K2P KA KS
10254 0.077 0.082 0.082 0.035 0.228
Urotrophin gene of rat (AJ002967) and mouse (Y12229)
Saturation: loss of phylogenetic signaln When compared homologous sequences have experienced too
many residue substitutions since divergence, it is impossibleto determine the phylogenetic tree, whatever the tree-buildingmethod used.
n NB: with distance methods, the saturation phenomenon mayexpress itself through mathematical impossibility to computed. Example: Jukes-Cantor: p › 0.75 => d --› ∞ and V(d) --› ∞
n NB: often saturation may not be detectable
ancestor
seq. 1 seq. 2 seq. 3
Quantification of evolutionary distances (7):
Other distance measures
n Several other, more realistic models of theevolutionary process at the molecular levelhave been used :l Accounting for biased base compositions
(Tajima & Nei).
l Accounting for variation of the evolutionaryrate across sequence sites.
l etc ...
Building phylogenetic trees bydistance methods
General principle :
Sequence alignment
ê (1)
Matrix of evolutionary distances between sequence pairs
ê (2)
(unrooted) tree
n (1) Measuring evolutionary distances.
n (2) Tree computation from a matrix of distance values.
A (bad) method : UPGMAHuman Chimpanzee Gorilla Orang-utan Gibbon
Human - 0.088 0.103 0.160 0.181Chimpanzee 0.094 - 0.106 0.170 0.189Gorilla 0.111 0.115 - 0.166 0.189Orang-utan 0.180 0.194 0.188 - 0.188Gibbon 0.207 0.218 0.218 0.216 -
Proportion ofdifferences (p)(above diagonal)and Kimura’s 2-parameterdistances (d)(below) formitochondrialDNA sequences(895 bp).
Gibbon
Orang-utan
Gorilla
Chimpanzee
Human
0.047
0.047
0.056
0.009
0.093
0.037
0.107
0.014ResultingUPGMA tree
d(Gibbon,[Human+Chimp]) = 1/2 [ d(Gibbon,Human) + d(Gibbon,Chimp) ]
Example of extremely unequal evolutionary rates
Distance-based analysis of 42LSU rRNA sequences frommicrosporidia and othereukaryotes.
Distances were corrected foramong-site rate variation.
Van de peer et al. (2000) Gene 246:1
UPGMA : properties
n UPGMA produces a rooted tree with branchlength.
n It is a very fast method.
n But UPGMA fails if evolutionary ratevaries among lineages.
n UPGMA would not have recovered thefungal evolutionary origin of microsporidia.
==> need methods insensitive to ratevariations.
Distance matrix -> tree (1): preliminary
n Let us consider the following tree :
n Let us consider two sets of distances between sequence pairs :
l d = distance as measured on sequences
l δ = distance induced by the above tree :
δi,j = li + lj δi,k = li + lc + lk
n It is possible (with a computer) to compute branch lengths (li, lj,lc, etc.) so that distances δ correspond “best” to distances d.
”Best" means that the divergence ∆ between d and δ values is
minimal :
n It is then possible to compute the total tree length, S :
S = li + lj + lc + … + lk + ...
i
j
kli
ljlc
lk
∆ = (dx , y −1≤ x< y≤ n∑ x, y )2
Distance matrix -> tree (2):
The Minimum Evolution Method
n Step 1: for a given tree topology (shape), compute branchlengths that minimise ∆; compute tree length S.
n Step 2: repeat step 1 for all possible topologies.Keep the tree with smallest S value.
n Problem: this method is very computation intensive. It ispractically not usable with more than ≈ 25 sequences.=> approximate (heuristic) methods are used.
Example: Neighbor-Joining.
Distance matrix -> tree (3):
The Neighbor-Joining Method: algorithmn Start from a star - topology and progressively construct a
tree as :l Step 1: Use d distances measured between the N sequences
l Step 2: For all pairs i et j: consider the following tree topology, andcompute Si,j , the sum of all “best” branch lengths. (Saitou and Neihave found a simple way to compute Si,j ).
l Step 3: Retain the pair (i,j) with smallest Si,j value . Group i and j inthe tree.
i
j
kli
ljlc
lk
Saitou & Nei (1987) Mol.Biol.Evol. 4:406
Distance matrix -> tree (4):
The Neighbor-Joining Method: algorithm (2)
l Step 4: Compute new distances d between N-1 objects:pair (i,j) and the N-2 remaining sequences.
d(i,j),k = (di,k + dj,k) / 2
l Step 5: Return to step 1 as long as N ≥ 4.When N = 3, an (unrooted) tree is obtained
n Example
1 2
3
45
6
12
3
456
12
3
4
5
6
1
2
3
4
5
6
Distance matrix -> tree (5):
The Neighbor-Joining Method (NJ): properties
n NJ is a fast method, even for hundreds of sequences.
n The NJ tree is an approximation of the minimum evolutiontree (that whose total branch length is minimum).
n In that sense, the NJ method is very similar to parsimonybecause branch lengths represent substitutions.
n NJ produces always unrooted trees, that need to be rootedby the outgroup method.
n NJ always finds the correct tree if distances are tree-like.
n NJ performs well when substitution rates vary amonglineages. Thus NJ should find the correct tree if distancesare well estimated.
Maximum likelihood methods(program fastDNAml, Olsen & Felsenstein)
n Hypotheses
l The substitution process follows a probabilistic modelwhose mathematical expression, but not parametervalues, is known a priori.
l Sites evolve independently from each other.
l All sites follow the same substitution process (somemethods use a more realistic hypothesis).
l Substitution probabilities do not change with time onany tree branch. They may vary between branches.
Maximum likelihood methods (1)
Simple example : one - parameter substitution model :
v = probability that a base changes per unit time
(fastDNAml uses a more elaborate model)
n Let us consider evolution along a tree branch :
n Our probabilistic model allows to compute the probability ofsubstitution x Õ y along this branch :
n Quantity l = 3vt is the average number of substitutions / sitealong this branch, i.e. the branch length.
Maximum likelihood methods (2)
base x
ancestor
base y
descendantt time units
Pl(x,y) =
3
4e
−4
3l
+1
4 if x = y
−1
4e
−4
3l
+1
4 if x ≠ y
with l =3vt
Maximum likelihood algorithm (1)
n Step 1: Let us consider a given rooted tree, a given site, anda given set of branch lengths. Let us compute the probabilitythat the observed pattern of nucleotides at that site hasevolved along this tree.
S1, S2, S3, S4: observed bases at site in seq. 1, 2, 3, 4
S5, S6, S7: unknown and variable ancestral bases
l1, l2, …, l6: given branch lengths
P(S1, S2, S3, S4)=
ΣS7ΣS5ΣS6P(S7) Pl5(S7,S5) Pl6(S7,S6) Pl1(S5,S1) Pl2(S5,S2) Pl3(S6,S3) Pl4(S6,S4)
where P(S7) is estimated by the average base frequencies in studied sequences.
S1S2
S3
S4
S5 S6
S7
l1 l2l3
l4
l5 l6
Maximum likelihood algorithm (2)
n Step 2: Let us compute the probability that entire sequenceshave evolved :
P(Sq1, Sq2, Sq3, Sq4) = Πall sites P(S1, S2, S3, S4)
n Step 2: Let us compute branch lengths l1, l2, …, l6 that givethe highest P(Sq1, Sq2, Sq3, Sq4) value. This is the likelihood ofthe tree.
n Step 3: Let us compute the likelihood of all possible trees.The tree predicted by the method is that having the highestlikelihood.
Maximum likelihood : properties
n This is the best justified method from a theoreticalviewpoint.
n Sequence simulation experiments have shown that thismethod works better than all others in most cases.
n But it is a very computer-intensive method.
n It is nearly always impossible to evaluate all possible treesbecause there are too many. A partial exploration of thespace of possible trees is done. The mathematical certaintyof obtaining the most likely tree is lost.
Reliability of phylogenetic trees: thebootstrap
n The phylogenetic information expressed by an unrooted treeresides entirely in its internal branches.
n The tree shape can be deduced from the list of its internalbranches.
n Testing the reliability of a tree = testing the reliability ofeach internal branch.
1 2
4
3
56
internal branch separating group 1+4 from group 6+2+5+3
Bootstrap procedure
The support of each internal branch is expressed as percent ofreplicates.
1 Nacgtacatagtatagcgtctagtggtaccgtatgaggtacatagtatgg-gtatactggtaccgtatgacgtaaat-gtatagagtctaatggtac-gtatgacgtacatggtatagcgactactggtaccgtatg
real alignment
random sampling, with replacement, of N sites
1 Ngatcagtcatgtataggtctagtggtacgtatattgagagtcatgtatggtgtatactggtacgtaattgac-gtaatgtataggtctaatggtactgtaattgacggtcatgtataggactactggtacgtatat
“artificial” alignments
} 1000 times
tree-building method
same tree-building method
tree = series of internal branches
“artificial” trees
for each internal branch, compute
fraction of “artificial” trees containing this
internal branch
"bootstrapped” tree
Xenopus
Homo
Bos
Mus
Rattus
Gallus0.02
97
91
46
Bootstrap procedure : properties
n Internal branches supported by ≥ 90% of replicates areconsidered as statistically significant.
n The bootstrap procedure only detects if sequence length isenough to support a particular node.
n The bootstrap procedure does not help determining if thetree-building method is good. A wrong tree can have 100% bootstrap support for all its branches!
Gene tree vs. Species tree
n The evolutionary history of genes reflectsthat of species that carry them, except if :l horizontal transfer = gene transfer between
species (e.g. bacteria, mitochondria)
l Gene duplication : orthology/ paralogy
Orthology / Paralogy
Homology : two genes are homologous iff they have a common ancestor.
Orthology : two genes are orthologous iff they diverged following a speciation event.
Paralogy : two genes are paralogous iff they diverged following a duplication event.
Orthology ≠ functional equivalence
PrimatesRodents
Human
ancestral GNS gene
GNS GNS1 GNS1
GNS1 GNS2
GNS2 GNS2
!
speciation
duplication
Rat MouseRat Mouse
Reconstruction of species phylogeny:artefacts due to paralogy
!! Gene loss can occur during evolution : even with complete genome sequences it may bedifficult to detect paralogy !!
Rat Mouse Rat Mouse
GNS1 GNS1
GNS1 GNS2
GNS2 GNS2GNS1 GNS2
Hamster Hamster
speciation duplication
Mouse Rat
GNS GNSGNS
Hamster
true tree tree obtained with a partial sampling of homologous genes
Phylogenetic tree ofthe Bcl-2 familyderived from the NJmethod applied toPAM evolutionarydistances (94homologous sites).
The tree suggestshuman NRH, mouseDiva, chicken Nr-13,and Danio Nr-13 to beorthologous genes.
The tree also suggeststhe 2 mammaliangenes have evolvedmuch faster than otherfamily members.
Exploring the Bcl-2 family of inhibitors of apoptosis
Aouacheria et al. (20001) Oncogene 20:5846
WWW resources for molecular phylogeny (1)
n Compilationsð A list of sites and resources:
http://www.ucmp.berkeley.edu/subway/phylogen.html
ð An extensive list of phylogeny programshttp://evolution.genetics.washington.edu/
phylip/software.html
n Databases of rRNA sequences and associatedsoftware
ð The rRNA WWW Server - Antwerp, Belgium.http://rrna.uia.ac.be
ð The Ribosomal Database Project - Michigan State Universityhttp://rdp.cme.msu.edu/html/
WWW resources for molecular phylogeny (2)
n Database similarity searches (Blast) :http://www.ncbi.nlm.nih.gov/BLAST/http://www.infobiogen.fr/services/menuserv.htmlhttp://bioweb.pasteur.fr/seqanal/blast/intro-fr.htmlhttp://pbil.univ-lyon1.fr/BLAST/blast.html
n Multiple sequence alignmentðClustalX : multiple sequence alignment with a graphical interface(for all types of computers).http://www.ebi.ac.uk/FTP/index.html and go to ‘software’
ðWeb interface to ClustalW algorithm for proteins:http://pbil.univ-lyon1.fr/ and press “clustal”
WWW resources for molecular phylogeny (3)
n Sequence alignment editorð SEAVIEW : for windows and unix
http://pbil.univ-lyon1.fr/software/seaview.html
n Programs for molecular phylogenyð PHYLIP : an extensive package of programs for all platforms
http://evolution.genetics.washington.edu/phylip.html
ð CLUSTALX : beyond alignment, it also performs NJð PAUP* : a very performing commercial package
http://paup.csit.fsu.edu/index.html
ð PHYLO_WIN : a graphical interface, for unix onlyhttp://pbil.univ-lyon1.fr/software/phylowin.html
ð WWW-interface at Institut Pasteur, Parishttp://bioweb.pasteur.fr/seqanal/phylogeny
WWW resources for molecular phylogeny (4)
n Tree drawingNJPLOT (for all platforms)http://pbil.univ-lyon1.fr/software/njplot.html
n Lecture notes of molecular systematicshttp://www.bioinf.org/molsys/lectures.html
WWW resources for molecular phylogeny (5)
n Booksð Laboratory techniques
Molecular Systematics (2nd edition), Hillis,Moritz & Mable eds.; Sinauer, 1996.
ð Molecular evolutionFundamentals of molecular evolution (2ndedition); Graur & Li; Sinauer, 2000.
ð Evolution in generalEvolution (2nd edition); M. Ridley; Blackwell,1996.