Lecture 9: Phylogeny - Federal University of Rio de Janeiro
Basic Bioinformatics: Molecular Phylogeny
Rafael Dias Mesquita
Bioinformatics Laboratory
Department of Biochemistry, Institute of Chemistry - UFRJ
Tree of life
http://en.wikipedia.org/wiki/Scientific_classification
http://bill.srnr.arizona.edu/classes/182/Tree of Life/kingdoms of Life.htm
Anisimova, M. CBRG/ETH Zurich
Sequence based tree of life
Pace described a tree of life based on small subunit rRNA sequences. Pace, N. R. (1997) Science 276, 734-740
This tree shows the three main branches described by Woese and colleagues.
Tekaia, F. Pasteur Institut, France
Evolutionary events
[Figure: from the ancestor to the species genome: phylogeny; expansion (duplication, genesis); exchange (HGT, horizontal gene transfer); deletion (loss)]
Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.
[Figure from Hurles (2004): original version and current version]
Gene relations
• Homologous: Have a common ancestor. Homology is a yes-or-no property; it cannot be measured.
• Orthologous: The same gene in different species. It is the result of speciation (common ancestor).
• Paralogous: Related genes (already diverged) in the same species. It is the result of genomic rearrangements or duplication.
Molecular evolution
GACGACCATAGACCAGCATAG
GACTACCATAGA-CTGCAAAG
*** ******** * *** **
GACGACCATAGACCAGCATAG
GACTACCATAGACT-GCAAAG
*** *********  *** **
Two possible positions for the indel
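The point of the two alignments above can be checked with a minimal sketch: both placements of the gap give the same number of identical columns, so identity alone cannot decide where the indel belongs. The function name `count_matches` and the variable names are illustrative only.

```python
def count_matches(top, bottom):
    """Count columns where both rows carry the same base (a gap never matches)."""
    return sum(1 for a, b in zip(top, bottom) if a == b and a != "-")


reference = "GACGACCATAGACCAGCATAG"
gap_at_13 = "GACTACCATAGA-CTGCAAAG"  # gap in alignment column 13
gap_at_15 = "GACTACCATAGACT-GCAAAG"  # gap in alignment column 15

matches_first = count_matches(reference, gap_at_13)
matches_second = count_matches(reference, gap_at_15)
print(matches_first, matches_second)  # prints: 17 17
```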
Molecular Phylogenetic Analysis
• Study of evolutionary relationships between genes and species
• The actual pattern of evolutionary history is the phylogeny or evolutionary tree which we try to estimate.
• A tree is a mathematical structure which is used to model the actual evolutionary history of a group of sequences or organisms.
• Reconstruction of phylogenetic trees is a statistical problem, and a reconstructed tree is an estimate of a true tree with a given topology and given branch length;
• The accuracy of this estimation should be statistically established;
• In practice, phylogenetic analyses usually generate phylogenetic trees with accurate parts and imprecise parts.
• A phylogenetic tree is characterised by its topology (form) and its length (sum of its branch lengths);
• Each internal node of a tree is an estimate of the ancestor of the elements that descend from it.
Nucleotide x amino-acid sequences
• DNA yields more phylogenetic information than proteins. The nucleotide sequences of a pair of homologous genes have a higher information content than the amino acid sequences of the corresponding proteins, because mutations that result in synonymous changes alter the DNA sequence but do not affect the amino acid sequence. (But amino-acid sequences are more efficiently aligned.)
[Figure: example in which 2 of the nucleotide substitutions are synonymous and one is non-synonymous]
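A small sketch of the synonymous versus non-synonymous distinction, using the standard genetic code. The two coding sequences are hypothetical (they are not the ones in the slide figure), and the count assumes at most one substitution per codon.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def translate(codon):
    """Translate one DNA codon into a one-letter amino acid ('*' is stop)."""
    i, j, k = (BASES.index(base) for base in codon)
    return AMINO_ACIDS[16 * i + 4 * j + k]


def classify_substitutions(cds_a, cds_b):
    """Return (synonymous, non_synonymous) counts over codons that differ."""
    synonymous = non_synonymous = 0
    for start in range(0, len(cds_a), 3):
        codon_a = cds_a[start:start + 3]
        codon_b = cds_b[start:start + 3]
        if codon_a == codon_b:
            continue
        if translate(codon_a) == translate(codon_b):
            synonymous += 1       # DNA changed, protein unchanged
        else:
            non_synonymous += 1   # DNA changed, protein changed
    return synonymous, non_synonymous


# Hypothetical pair: CTT>CTC (Leu>Leu), GAA>GAT (Glu>Asp), GGA>GGG (Gly>Gly)
result = classify_substitutions("CTTGAAGGA", "CTCGATGGG")
print(result)  # prints: (2, 1)
```

The protein alignment would show only one difference here, while the DNA alignment shows three, which is the extra information the slide refers to.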
Key features of DNA-based phylogenetic trees
• An unrooted tree
[Figure: unrooted tree of sequences A, B, C and D, labelled with its branches, internal nodes and external nodes]
• Rooted trees
[Figure: five rooted trees (1 to 5) for the same four sequences, each with a hypothetical ancestor at the root]
Rooted and Unrooted trees
• In rooted trees a single node is designated as a common ancestor, and a unique path leads from it through evolutionary time to any other node.
• Unrooted trees only specify the relationships between nodes and say nothing about the direction in which evolution occurred.
• Roots can usually be assigned to unrooted trees through the use of an outgroup.
[Figure: sequences A, B, C and D drawn as an unrooted tree and as a rooted tree]
Genes x Species Trees
[Figure: a gene tree for genes A to E, whose nodes are mutation events, next to a species tree for species A to E, whose nodes are speciation events]
These two kinds of event, mutation and speciation, are not expected to occur at the same time, so a gene tree does not necessarily represent the species tree.
[Figure: along a time axis, a species tree for A, B and C with its speciation events, and a gene tree for A, B and C that also includes duplication events]
Phylogenetic gene trees: How many?
The numbers of possible rooted ($N_R$) and unrooted ($N_U$) trees for n sequences are given by:
$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$ (for n ≥ 3)
n     N_R         N_U
2     1           1
3     3           1
4     15          3
5     105         15
10    34459425    2027025
Note that only one of all these possible trees can be the true tree that represents the phylogenetic relationships among the sequences.
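The two formulas can be evaluated directly; the sketch below reproduces the table above. The function names are illustrative, and the unrooted count for n = 2 is set to 1 by convention, as in the table, because the formula itself only applies from n = 3.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted bifurcating trees for n sequences (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n sequences (n >= 2)."""
    if n < 3:
        return 1  # convention used in the table; the formula needs n >= 3
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (2, 3, 4, 5, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
```

The rapid growth of these numbers is why exhaustive searches are only feasible for small data sets.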
Phylogenetic gene trees: How to construct?
1. Consider the set of sequences to analyse;
   Almost identical sequences may not carry enough evolutionary information.
   Sequences that are too different may not carry enough evolutionary information either.
2. Align these sequences "properly";
   Even when a DNA alignment is the objective, the alignment can be based on the protein sequences. This keeps the codons correctly aligned.
   Conservation can be seen more clearly at the protein level because of the degenerate genetic code.
3. Apply tree-building methods with bootstrapping;
4. Construct a consensus tree;
5. Statistically evaluate the resulting phylogenetic tree.
Depending on the software package, steps 3 to 5 can be split into several steps or integrated into one.
Phylogenetic tree construction methods
Methods directly based on sequences:
• Maximum Parsimony: find a phylogenetic tree that explains the data with as few evolutionary changes as possible.
• Maximum likelihood: find a tree that maximizes the probability of the genetic data given the tree.
• Bayesian: find a tree that represents the most likely clades, based on the posterior distribution.
Methods indirectly based on sequences (distance based):
• Neighbour Joining, UPGMA and Fitch-Margoliash: find a tree such that the branch lengths of the paths between sequences (species) fit a matrix of pairwise distances between sequences.
• Minimum evolution: the sum of branch lengths measures the fit of the tree to the data. The shortest tree is chosen.
Parsimony
The concept of parsimony is at the heart of all character-based methods of phylogenetic reconstruction.
The 2 fundamental ideas of biological parsimony are:
1- Mutations are exceedingly rare events (?);
2- The more unlikely events a model invokes, the less likely the model is to be correct.
As a result, the relationship that requires the fewest mutations to explain the current state of the sequences being considered is the relationship that is most likely to be correct.
Informative and Uninformative Sites:
Example:
seq   1  2  3  4  5  6
1     G  G  G  G  G  G
2     G  G  G  A  G  T
3     G  G  A  T  A  G
4     G  A  T  C  A  T
Position 1 is invariant and therefore uninformative, because all trees invoke the same number of mutations (0);
Position 2 is uninformative because 1 mutation occurs in all three possible trees;
Position 3 likewise, because 2 mutations occur in every tree; position 4 requires 3 mutations in all possible trees.
Positions 5 and 6 are informative, because one of the trees invokes only one mutation and the other 2 alternative trees both require 2 mutations.
In general, for a position to be informative regardless of how many sequences are aligned, it has to have at least 2 different nucleotides, and each of these nucleotides has to be present at least twice.
Krane & Raymer 2002
[Figure: for each of positions 1 to 6, the three possible unrooted trees for sequences 1 to 4, with the number of mutations each tree requires]
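The mutation counts quoted for the six positions can be reproduced with a short sketch. It scores each of the three possible four-sequence trees with Fitch's set rule (intersection when possible, otherwise union plus one change) and applies the "two nucleotides, each at least twice" test. The function names are illustrative, not taken from any package.

```python
def fitch_quartet_cost(a, b, c, d):
    """Minimum number of changes on the unrooted tree that pairs (a, b) and (c, d)."""
    def join(x, y):
        common = x & y
        # Intersection costs nothing; an empty intersection costs one change.
        return (common, 0) if common else (x | y, 1)

    left, cost_left = join({a}, {b})
    right, cost_right = join({c}, {d})
    _, cost_middle = join(left, right)
    return cost_left + cost_right + cost_middle


def is_informative(column):
    """True if at least 2 different nucleotides each occur at least twice."""
    return sum(1 for base in set(column) if column.count(base) >= 2) >= 2


sequences = ["GGGGGG", "GGGAGT", "GGATAG", "GATCAT"]  # rows of the table above
columns = ["".join(seq[i] for seq in sequences) for i in range(6)]

costs = []
for s1, s2, s3, s4 in columns:
    costs.append([
        fitch_quartet_cost(s1, s2, s3, s4),  # tree grouping 1 with 2
        fitch_quartet_cost(s1, s3, s2, s4),  # tree grouping 1 with 3
        fitch_quartet_cost(s1, s4, s2, s3),  # tree grouping 1 with 4
    ])
informative = [is_informative(column) for column in columns]

for position, (tree_costs, flag) in enumerate(zip(costs, informative), start=1):
    print(position, tree_costs, "informative" if flag else "uninformative")
```

Only positions 5 and 6 give different costs for different trees, so only they help to choose between topologies.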
Maximum Parsimony - MP (Fitch, 1977)
The maximum parsimony algorithm searches for the minimum number of genetic events (nucleotide substitutions or amino-acid changes) to infer the most parsimonious tree from a set of sequences. The best tree is the one which needs the fewest changes.
Good news:
1. There is an evolutionary model!
Bad news:
1. Does evolution always follow the shortest possible route? Is the evolutionary model always correct?
2. Within practical computational limits, this often leads to the generation of tens or more "equally most parsimonious trees", which makes it difficult to justify the choice of a particular tree;
3. Long computation time is needed to construct a tree;
4. No branch lengths.
Maximum Likelihood (ML)
• Similar to the scoring idea of an HMM or a PSSM.
• Alignment positions are assumed to be independent. At each position, the log-likelihood of each base or amino acid is calculated individually under a particular probability model.
• All topologies are tested, and for each one the sum of the log-likelihoods is maximized to estimate the branch lengths of the tree.
• The topology that shows the highest likelihood is chosen as the final tree.
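The idea of summing per-position log-likelihoods and maximizing over a branch length can be illustrated on the smallest possible case: two sequences joined by one branch. This sketch assumes the Jukes-Cantor model, which the slides do not name, and uses two hypothetical ungapped sequences; real ML programs use richer models, full trees and better optimizers than a grid search.

```python
import math


def jc69_log_likelihood(seq_a, seq_b, d):
    """Sum of per-site log-likelihoods for branch length d (substitutions per site)."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability that a base stays the same
    p_diff = 0.25 - 0.25 * decay   # probability of one specific different base
    total = 0.0
    for x, y in zip(seq_a, seq_b):
        site_probability = 0.25 * (p_same if x == y else p_diff)  # 0.25 = base frequency
        total += math.log(site_probability)  # sites are treated as independent
    return total


# Hypothetical ungapped pair, 21 sites, 3 differences.
seq_a = "GACGACCATAGACCAGCATAG"
seq_b = "GACTACCATAGACTAGCAAAG"

# Crude grid search over branch lengths 0.001, 0.002, ..., 2.000.
grid = [step / 1000 for step in range(1, 2001)]
best_d = max(grid, key=lambda d: jc69_log_likelihood(seq_a, seq_b, d))

# Closed-form Jukes-Cantor estimate for comparison.
p = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
analytic_d = -0.75 * math.log(1 - 4 * p / 3)
print(best_d, round(analytic_d, 4))
```

The grid maximum agrees with the closed-form estimate to within the grid spacing, which is what "maximizing the log-likelihood to estimate the branch length" means in this two-sequence case.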
• Since transitions (exchanging a purine for a purine or a pyrimidine for a pyrimidine) are observed roughly 3 times more often than transversions (exchanging a purine for a pyrimidine or vice versa), it can be reasonably argued that the sequences with C and T are more likely to be closely related to each other than either is to the sequence with G.
..C..
..T..
..G..
Good news:
1. ML estimates the branch lengths of the final tree;
2. ML methods are usually consistent;
Bad news:
1. No evolutionary model assumed, no information about ancestor sequence, multiple substitutions can't be considered;
2. Alignment positions may not really be independent (mutation rate);
3. They need long computation time to construct a tree.
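The transition/transversion distinction used in the C, T, G example can be sketched as follows; the function name is illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a pair of bases as identical, a transition or a transversion."""
    if x == y:
        return "identical"
    pair = {x, y}
    # Both bases in the same chemical class: transition; otherwise transversion.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


for x, y in (("C", "T"), ("C", "G"), ("T", "G")):
    print(x, y, substitution_type(x, y))
```

C and T differ by a transition, while both differ from G by a transversion, which is why the C and T sequences are argued to be the closer pair.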
Bayesian
• Similar to Maximum likelihood
• Assumes a prior probability distribution (even if it's "flat", e.g. all values of a variable are equally likely to occur)
• This prior probability x the likelihood (same as calculated in ML) is proportional to the posterior probability
• The best tree maximizes the posterior probability
• Bayesian inference with a flat prior probability still differs from ML, because the two estimate some variables of the evolutionary model differently.
• Faster than ML.
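The relation between prior, likelihood and posterior can be sketched for a handful of candidate trees. The tree labels, log-likelihood values and priors below are invented for illustration, and the sketch treats each tree's likelihood as a single fixed number; it leaves out the integration over branch lengths and model parameters that real Bayesian programs perform.

```python
import math


def posterior_probabilities(log_likelihoods, priors):
    """Posterior = prior x likelihood, normalized to sum to 1 over the candidate trees."""
    best = max(log_likelihoods.values())
    # Subtracting the best log-likelihood avoids numerical underflow in exp().
    weights = {
        tree: priors[tree] * math.exp(log_lik - best)
        for tree, log_lik in log_likelihoods.items()
    }
    total = sum(weights.values())
    return {tree: weight / total for tree, weight in weights.items()}


# Hypothetical values for three candidate topologies.
log_likelihoods = {"T1": -1234.5, "T2": -1236.0, "T3": -1240.0}
flat_prior = {tree: 1 / 3 for tree in log_likelihoods}
strong_prior = {"T1": 0.05, "T2": 0.90, "T3": 0.05}

flat_posterior = posterior_probabilities(log_likelihoods, flat_prior)
strong_posterior = posterior_probabilities(log_likelihoods, strong_prior)
print(max(flat_posterior, key=flat_posterior.get))      # prints: T1
print(max(strong_posterior, key=strong_posterior.get))  # prints: T2
```

With the flat prior the ranking follows the likelihoods alone; a strong prior can change which tree has the highest posterior probability.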
Distance Based Methods
DISTANCE MATRIX
Pairwise distance values calculated based on different models/formulas (can use HMM or PSSM).
METHODS:
CLUSTERING METHODS: Neighbour-joining, UPGMA and Fitch-Margoliash
OPTIMALITY CRITERION: Minimum evolution (ME)
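As a minimal sketch of the first step, the distance matrix below uses the simplest possible formula, the p-distance (the fraction of compared sites that differ). This choice is an assumption made for illustration; the slides leave the distance model open, and model-based corrections are common in practice. The three sequences are the first three rows of the alignment shown on the bootstrapping slide.

```python
def p_distance(seq_a, seq_b):
    """Fraction of ungapped alignment columns at which two sequences differ."""
    compared = differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # columns with a gap are skipped
        compared += 1
        differences += a != b
    return differences / compared


def distance_matrix(sequences):
    """Symmetric matrix of pairwise p-distances."""
    return [[p_distance(a, b) for b in sequences] for a in sequences]


sequences = ["ATAGCCATA", "ATACCCATG", "ATACCCATA"]
matrix = distance_matrix(sequences)
for row in matrix:
    print(["%.3f" % value for value in row])
```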
Distance based - Clustering
• Clustering methods: Neighbour-joining, UPGMA and Fitch-Margoliash
• NJ and UPGMA: the tree topology is constructed using a cluster analysis method. The tree is fitted to the matrix.
• UPGMA assumes a molecular clock: all species have the same mutation rate at any time in evolution, so all tips of the tree lie at the same distance from the root.
• Fitch-Margoliash does the clustering based on the least-squares criterion.
• Closely related sequences are given more weight in the tree construction process, to correct for the increased inaccuracy in measuring distances between distantly related sequences.
• The best fit in the least-squares sense minimizes the sum of squared residuals (a residual being the difference between an observed value and the fitted value provided by a model).
• The closer the tree distances are to the matrix distances, the smaller the least-squares value:
$\sum_{i,j} \frac{(D_{ij} - d_{ij})^2}{D_{ij}^2}$
where $D_{ij}$ = the matrix distance between sequences i and j, and $d_{ij}$ = the sum of the branches on the tree path from i to j.
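The weighted least-squares criterion can be computed in a few lines. The observed distances and the tree path lengths below are hypothetical numbers chosen for illustration, and the function name is not from any package.

```python
def weighted_least_squares(observed, fitted):
    """Sum over pairs of (D_ij - d_ij)^2 / D_ij^2.

    observed: matrix distances D_ij, keyed by pair of sequence names
    fitted:   tree path lengths d_ij, keyed the same way
    """
    return sum(
        (observed[pair] - fitted[pair]) ** 2 / observed[pair] ** 2
        for pair in observed
    )


# Hypothetical distance matrix for four sequences.
observed = {
    ("A", "B"): 0.30, ("A", "C"): 0.50, ("A", "D"): 0.65,
    ("B", "C"): 0.65, ("B", "D"): 0.80, ("C", "D"): 0.40,
}
# Path lengths on a hypothetical tree that pairs A with B and C with D.
fitted = {
    ("A", "B"): 0.30, ("A", "C"): 0.55, ("A", "D"): 0.65,
    ("B", "C"): 0.65, ("B", "D"): 0.75, ("C", "D"): 0.40,
}

error = weighted_least_squares(observed, fitted)
print(round(error, 6))
```

Dividing by the squared matrix distance is what gives closely related pairs more weight: the same absolute misfit counts more when the observed distance is small.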
[Figure: Neighbor-Joining Method (Saitou & Nei 1987), sequences A to H shown as a star tree and as a resolved tree]
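The clustering idea can be sketched with UPGMA, the simplest of the methods named above: repeatedly join the two closest clusters and average their distances to the others, weighting by cluster size. This is an illustrative sketch, not the algorithm of any particular package, and the distance matrix is hypothetical (chosen to be clock-like, which is the case UPGMA assumes).

```python
from itertools import combinations


def upgma(names, matrix):
    """Return (tree, heights): a nested-tuple tree and the height of each join."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    dist = {(i, j): matrix[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    heights = []
    while len(clusters) > 1:
        (i, j), d_ij = min(dist.items(), key=lambda item: item[1])  # closest pair
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            # Size-weighted average distance from the new cluster to cluster k.
            dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        del dist[(i, j)]
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        heights.append(d_ij / 2)  # under a clock, the node sits halfway
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, heights


names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree, heights = upgma(names, matrix)
print(tree, heights)
```

Neighbour-joining follows the same join-and-update pattern but picks the pair to join with a rate-corrected criterion, so it does not need the molecular clock assumption.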
Distance based – Minimum evolution (ME)
• All trees are searched using an optimality criterion. Heuristics can be used.
• The smallest sum of branch lengths gives the best tree
• The distance between two sequences in the tree can't be smaller than in the distance matrix.
Disadvantage: All distance based methods only give a tree, but no information about the changes in the sequences along this tree.
Optimal x Heuristic tree search
• The optimal strategy searches all trees and finds the best one.
• Heuristics do not guarantee the best tree, but give a very good one, as the search is guided to save time.
• Heuristics can be used in MP, ML, Bayesian and ME
• Heuristics can use different strategies:
  • Stepwise addition
  • Star decomposition
  • Branch swapping
Methods comparison
• None of the previous methods of phylogenetic reconstruction guarantees that it yields the one true tree that describes the evolutionary history of a set of aligned sequences.
• There is at present no statistical method allowing comparisons of trees obtained from different phylogenetic methods; nevertheless many attempts have been made to compare the relative consistency of the existing methods.
• The consistency depends on many factors, including the topology and branch lengths of the real tree, the transition/transversion rate and the variability of the substitution rates.
• In practice, one infers phylogeny between sequences which generally do not meet the hypotheses assumed by the methods.
• One expects that if sequences have strong phylogenetic relationships, different methods will result in the same phylogenetic tree.
Outgroup
• Most phylogenetic methods construct unrooted trees.
• It is best to root such trees on biological grounds.
• The most used technique consists of including, in the sequence data set to be analysed, a sequence (the outgroup) which has some relation to the sequences considered without belonging to the same family.
• The aim is to normalize the branches of the unrooted tree relative to the length of the branch leading to the outgroup.
Bootstrapping
• How strong is the evolutionary signal?
• Constructs n new multiple alignments, of the same size, by sampling columns at random from the real alignment.
• 50% of the columns are affected.
• Each chosen column can be sampled zero, one or more times, in different positions.
ATAGCCATA
ATACCCATG
ATACCCATA
ATAGCCATA
ATCCCCCAT
TCAAATGCA
TCGAATCCA
TCAAATCCA
TCAAATGCA
TCAACACCC
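Column resampling can be sketched as below, using the first alignment block above as input. The sketch draws columns with replacement, the usual bootstrap scheme; it does not enforce any fixed fraction of changed columns, and the seed and the number of replicates are arbitrary choices for the example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudo-alignment by drawing columns with replacement."""
    n_columns = len(alignment[0])
    picks = [rng.randrange(n_columns) for _ in range(n_columns)]
    # The same picked columns are applied to every sequence, so columns stay intact.
    return ["".join(seq[c] for c in picks) for seq in alignment]


alignment = [
    "ATAGCCATA",
    "ATACCCATG",
    "ATACCCATA",
    "ATAGCCATA",
    "ATCCCCCAT",
]
rng = random.Random(42)  # fixed seed so the example is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]

for row in replicates[0]:
    print(row)
```

A tree is then built from each replicate, and the fraction of replicate trees containing a given clade is reported as its bootstrap support.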
Consensus Tree
A consensus tree shows clades that are shared by a set of trees.
The strict consensus tree shows a clade only if it is in every tree of a set.
The majority-rule consensus tree shows a clade if it is in >50% of a set.
[Figure: three trees of A to H, their majority-rule consensus (clade support values 67, 100, 100, 100) and their strict consensus]
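The two consensus rules can be sketched by describing each tree as the set of clades it contains. The three trees below are hypothetical five-taxon examples (not the trees in the figure), and the function names are illustrative.

```python
from collections import Counter


def clade(*names):
    """A clade is represented by the set of taxa it contains."""
    return frozenset(names)


def clade_counts(trees):
    """Count in how many trees each clade appears."""
    return Counter(c for tree in trees for c in tree)


def strict_consensus(trees):
    """Clades present in every tree of the set."""
    return {c for c, n in clade_counts(trees).items() if n == len(trees)}


def majority_rule_consensus(trees):
    """Clades present in more than 50% of the trees, with support in percent."""
    return {
        c: round(100 * n / len(trees))
        for c, n in clade_counts(trees).items()
        if 2 * n > len(trees)
    }


trees = [
    {clade("A", "B"), clade("A", "B", "C"), clade("D", "E")},
    {clade("A", "C"), clade("A", "B", "C"), clade("D", "E")},
    {clade("A", "B"), clade("A", "B", "C"), clade("D", "E")},
]
strict = strict_consensus(trees)
majority = majority_rule_consensus(trees)

for c, support in sorted(majority.items(), key=lambda item: sorted(item[0])):
    print("".join(sorted(c)), support)
```

Here the clade AB appears in two of three trees, so it is kept by the majority rule with 67% support but dropped from the strict consensus.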
Anisimova, M. CBRG/ETH Zurich
Sites
Very good, phylogeny from "one click" to advanced options: http://www.phylogeny.fr/
Pasteur online. Has a phylogenetic method comparison pipeline among other software: http://mobyle.pasteur.fr/cgi-bin/portal.py
RAxML: http://embnet.vital-it.ch/raxml-bb/
Online servers list: http://evolution.genetics.washington.edu/phylip/software.serv.html#servers
Protein based CDS alignment
PRANK: http://wasabiapp.org/software/prank/
TranslatorX: http://translatorx.co.uk