Phylogenetics - Distance-Based Methods
CIS 667 March 11, 2204
Phylogenetics
• Attempts to infer the evolutionary history of a group of organisms or sequences of nucleic acids or proteins
  Phylogenetic methods can be used for the study of evolutionary relationships between species of organisms as well as genes
  Attempt to reconstruct evolutionary ancestors
  Estimate time of divergence from ancestor
Phylogenetic Trees
• We can use phylogenetic trees to illustrate the evolutionary relationships among groups of species or genes
• Leaf nodes of the tree are the species or genes we are comparing; interior nodes are inferred common ancestors
Phylogenetic Trees
[Figure: Phylogenetic Tree for Close Human Relatives. Leaves: Humans, Orangutans, Chimpanzees, Gorillas. Interior nodes: Common Ancestor of Gorillas and Chimps; Common Ancestor of Gorillas, Chimps, and Orangs; Common Ancestor of Humans and Apes]
History
• Taxonomists used anatomy and physiology to group and classify organisms
  Morphological features like presence of feathers or number of legs
• When protein sequencing, and later DNA sequencing, became common, amino acid and DNA sequences became the common way to construct trees
[Figure: Phylogenetic tree constructed from amino acid sequences of the cytochrome c protein]
The Big Picture
• Determine the species or genes to be studied
• Acquire homologous sequence data
• Use multiple sequence alignment software like ClustalW to align
• Clean up data by hand
• Use phylogenetic analysis software like Phylip based on techniques we will study
• Verify experimentally
Phylogenetics
• Can be used to solve a number of interesting problems
  Forensics
    HIV mutates rapidly
  Predicting evolution of influenza viruses
  Predicting functions of uncharacterized genes (ortholog detection)
  Drug discovery
  Vaccine development
    Target inferred common ancestor
Types of Data
• Two categories
  Numerical data
    Distance between objects
    E.g. evolutionary distance between two species
    Usually derived from sequence data
  Character data
    Each character has a finite number of states
    E.g. number of legs = 1, 2, 4
    DNA = {A, C, T, G}
Phylogenetic Trees
• Trees are composed of nodes and branches
  Terminal or leaf nodes correspond to a gene or organism for which data has been collected
  Internal nodes usually represent an inferred common ancestor that gave rise to two independent lineages sometime in the past
Rooted and Unrooted Trees
• Some trees make an inference about a common ancestor and the direction of evolution and some don't
  First type is called a rooted tree and has a single node designated as root, which is the common ancestor
  Second type is called an unrooted tree
    Specifies only the relationship between nodes and says nothing about the direction of evolution
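Rooted and Unrooted Trees
[Figure: a rooted tree with root R and leaves A, B, C, D, E drawn along a Time axis, next to an unrooted tree on the same leaves A, B, C, D, E]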
Rooted and Unrooted Trees
• Roots can usually be assigned to unrooted trees using an outgroup
  Species unambiguously separated the earliest from the others being studied
  E.g. baboons in the case of humans and gorillas
• For three species there are 3 possible rooted trees, but only one possible unrooted tree
Rooted and Unrooted Trees
• In fact, the numbers of rooted ($N_R$) and unrooted ($N_U$) trees for $n$ species are
  $N_R = \dfrac{(2n - 3)!}{2^{n-2}\,(n - 2)!}$
  $N_U = \dfrac{(2n - 5)!}{2^{n-3}\,(n - 3)!}$
Data Sets (n)   Rooted Trees                    Unrooted Trees
2               1                               1
3               3                               1
4               15                              3
5               105                             15
10              34,459,425                      2,027,025
15              213,458,046,676,875             7,905,853,580,625
20              8,200,794,532,637,891,559,375   221,643,095,476,699,771,875
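The two counting formulas can be checked directly. The sketch below is only an illustration: the function names are made up for this example, and returning 1 unrooted tree for n = 2 is an assumption taken from the table, since the unrooted formula itself needs n ≥ 3.

```python
from math import factorial

def num_rooted_trees(n: int) -> int:
    """Rooted bifurcating trees for n >= 2 species: (2n-3)! / (2^(n-2) (n-2)!)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def num_unrooted_trees(n: int) -> int:
    """Unrooted bifurcating trees: (2n-5)! / (2^(n-3) (n-3)!) for n >= 3."""
    if n < 3:
        # Assumption: the formula is undefined here; the table lists 1 for n = 2.
        return 1
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (2, 3, 4, 5, 10, 15, 20):
    print(n, f"{num_rooted_trees(n):,}", f"{num_unrooted_trees(n):,}")
```

Note that the number of rooted trees for n species equals the number of unrooted trees for n + 1 species, which is why the counts grow so quickly that exhaustive search over trees is impractical.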
Rooting Trees
• Trees can be rooted by using the outgroup method previously mentioned, or by putting the root midway between the two most distant species as determined by branch length
  Branch length measures the amount of difference that occurred along a branch
  Assumes the species are evolving in a clock-like manner
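As a rough illustration of the midpoint idea, the sketch below only locates the root: it finds the two most distant species and reports how far along the path between them the root would sit. The path lengths are hypothetical values invented for this example, and `midpoint_root` is an illustrative name, not a function from any package.

```python
# Hypothetical leaf-to-leaf path lengths on an unrooted tree (not from the slides).
path_length = {
    ("A", "B"): 3.0,
    ("A", "C"): 7.0,
    ("B", "C"): 8.0,
}

def midpoint_root(path_length):
    """Return the two most distant species and the root's distance from each."""
    pair = max(path_length, key=path_length.get)
    return pair, path_length[pair] / 2

pair, half = midpoint_root(path_length)
print(pair, half)  # ('B', 'C') 4.0
```

Placing the root on the tree itself would additionally require walking the path between the two species until half of its length has been covered.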
[Figure: Rooting a Tree]
More Tree Terminology
• Structure of a phylogenetic tree can be represented in Newick format using nested parentheses: (((B, C), (D, E)), A)
• If we lack data to tell in which order two or more independent lineages occurred in the past, the tree may be multifurcating (an interior node has more than two descendants); otherwise, it is bifurcating (exactly two descendants per interior node)
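A small sketch of both ideas, assuming a tree is stored as nested Python tuples with strings as leaves (a representation chosen only for this example). A complete Newick string conventionally ends with a semicolon, which is added when printing.

```python
def to_newick(node):
    """Render a nested-tuple tree as a Newick string (without the final ';')."""
    if isinstance(node, str):  # a leaf is just its label
        return node
    return "(" + ", ".join(to_newick(child) for child in node) + ")"

def is_bifurcating(node):
    """True if every interior node has exactly two descendants."""
    if isinstance(node, str):
        return True
    return len(node) == 2 and all(is_bifurcating(child) for child in node)

tree = ((("B", "C"), ("D", "E")), "A")
print(to_newick(tree) + ";")            # (((B, C), (D, E)), A);
print(is_bifurcating(tree))             # True
print(is_bifurcating(("A", "B", "C")))  # False: one node with three descendants
```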
Character and Distance Data
• Character-based methods use aligned DNA or protein sequences directly for tree inference
Species A  ATCGAATCGTTCCGGA
Species B  ATCCAATAGTTCCGGA
Species C  AACGAATCCTACCGGT
Species D  ATCGTTTCCAACCGCT
Species E  ATAGATTCGTTCGGGA
Character and Distance Data
• Distance-based methods must first transform the sequence data into a pairwise distance matrix, which is then used for tree inference
Species   A   B   C   D
B         2   -   -   -
C         4   5   -   -
D         7   9   5   -
E         3   5   7   8
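One simple way to obtain such a matrix, assuming the distance is just the number of mismatched positions between two aligned sequences (an assumption; the slides do not say which measure was used), is sketched below with the five sequences shown earlier. For these sequences the count agrees with the listed values for every pair except B and C, where it gives 6 rather than 5, so the slide may use a different measure or contain a typo.

```python
from itertools import combinations

sequences = {
    "A": "ATCGAATCGTTCCGGA",
    "B": "ATCCAATAGTTCCGGA",
    "C": "AACGAATCCTACCGGT",
    "D": "ATCGTTTCCAACCGCT",
    "E": "ATAGATTCGTTCGGGA",
}

def mismatches(s, t):
    """Count positions at which two aligned sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s, t))

# Unordered pairs are keyed by frozenset so that d(A, B) == d(B, A).
distance = {
    frozenset((p, q)): mismatches(sequences[p], sequences[q])
    for p, q in combinations(sequences, 2)
}

for p, q in combinations(sequences, 2):
    print(p, q, distance[frozenset((p, q))])
```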
Distance-Based Methods
• Given such an input matrix, we want to find an edge-weighted tree in which the leaves correspond to the species and the distance measured along the tree between two leaves equals the corresponding matrix value
UPGMA
• UPGMA (Unweighted Pair Group Method with Arithmetic mean) is the oldest distance matrix method
  Uses a distance matrix representing a measure of genetic distance between pairs of species being considered
  Clusters the two closest species
  Computes a new distance matrix using the arithmetic mean of the distances to the first cluster
  Repeats until all species are grouped
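The steps above can be sketched as follows. This is a minimal illustration rather than a reference implementation: it returns only the grouping as nested parentheses (no branch lengths), the function name and the frozenset-keyed dictionary are choices made for this example, and ties are broken by whichever pair is encountered first. The input is the five-species matrix from the earlier slide.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster species; dist maps frozenset({a, b}) to the distance between a and b."""
    size = {name: 1 for name in names}  # cluster label -> number of species in it
    d = dict(dist)
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        # Size-weighted mean, so each distance stays an average over original species.
        for c in size:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))

pairs = {
    ("A", "B"): 2, ("A", "C"): 4, ("A", "D"): 7, ("A", "E"): 3, ("B", "C"): 5,
    ("B", "D"): 9, ("B", "E"): 5, ("C", "D"): 5, ("C", "E"): 7, ("D", "E"): 8,
}
dist = {frozenset(k): v for k, v in pairs.items()}
tree = upgma(["A", "B", "C", "D", "E"], dist)
print(tree)  # ((E,(A,B)),(C,D))
```

With this matrix the sketch groups A and B first (distance 2), then adds E (mean distance 4), then groups C and D (distance 5), and finally joins the two clusters.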
UPGMA
[Figure: UPGMA clustering of species A, B, C, D, E and the resulting tree]
Estimation of Branch Length
• Scaled trees, in which the lengths of the branches correspond to the degree to which sequences have diverged, are called phylograms; cladograms show only the branching order
• If rates of evolution are assumed to be constant in all lineages, then internal nodes are placed at equal distances from each of the species they give rise to on a bifurcating tree (as in the UPGMA example)
UPGMA
• So UPGMA is very simple and generates rooted trees, however…
• Major weakness is that the algorithm assumes that rates of evolution are the same among different lineages
• This does not fit existing biological data, so UPGMA probably should not be used to build phylogenetic trees
Transformed Distance Method
• Several distance matrix-based alternatives to UPGMA allow different rates of evolution within different lineages
  Oldest and simplest is the transformed distance method, which takes advantage of an outgroup
  The other lineages began evolving separately from each other only after they diverged from the outgroup, so the outgroup serves as a frame of reference: differences in distance to the outgroup show how much more one lineage has evolved than another
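The slides do not give the transformation itself. The sketch below assumes the commonly cited form d'(i, j) = (d(i, j) - d(i, D) - d(j, D)) / 2 + mean distance to D, where D is the outgroup and the mean is taken over the ingroup species. The distances are hypothetical values invented for this example, and the function name is illustrative.

```python
from itertools import combinations

def transformed_distances(d, ingroup, outgroup):
    """Assumed form: d'(i,j) = (d(i,j) - d(i,D) - d(j,D)) / 2 + mean_k d(k,D)."""
    to_out = {x: d[frozenset((x, outgroup))] for x in ingroup}
    mean_out = sum(to_out.values()) / len(ingroup)
    return {
        frozenset((i, j)): (d[frozenset((i, j))] - to_out[i] - to_out[j]) / 2 + mean_out
        for i, j in combinations(ingroup, 2)
    }

# Hypothetical distances; D plays the role of the outgroup.
raw = {
    ("A", "B"): 4, ("A", "C"): 10, ("B", "C"): 8,
    ("A", "D"): 12, ("B", "D"): 10, ("C", "D"): 14,
}
d = {frozenset(k): v for k, v in raw.items()}
new_d = transformed_distances(d, ["A", "B", "C"], "D")
print(new_d[frozenset(("A", "B"))])  # 3.0
```

The transformed matrix covers only the ingroup species and can then be clustered in the same way as an ordinary distance matrix.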
Neighbor’s Relation Method
• One variant of UPGMA tries to pair species in such a way as to minimize the sum of the branch lengths
  On a rooted tree, pairs of species separated from each other by only one node are called neighbors
  Important relationships hold between the neighbors of a phylogenetic tree with four species
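Neighbor’s Relation Method
[Figure: unrooted tree with leaves A, B, C, D and terminal branches a, b, c, d; A and B are separated from C and D by the internal branch e]
The following hold for this tree:
$d_{AC} + d_{BD} = d_{AD} + d_{BC} = a + b + c + d + 2e = d_{AB} + d_{CD} + 2e$
$d_{AB} + d_{CD} < d_{AC} + d_{BD}$
$d_{AB} + d_{CD} < d_{AD} + d_{BC}$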
Neighbor’s Relation Method
• Consider all possible pairwise arrangements of four species, and determine which satisfies the four point condition (set of 2 inequalities)
• This process can be iterated to generate a complete tree, but it becomes infeasible for large sets of species
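For four species the check can be written out directly. The sketch below tries the three possible pairings and returns the one whose summed distances are strictly smaller than both alternatives, or None when no pairing is. The function name is illustrative, and the example distances are built from hypothetical branch lengths a = 1, b = 2, c = 3, d = 4, e = 5 on a tree in which A and B are neighbors.

```python
def neighbor_pairing(d, species):
    """Return the pairing of four species that satisfies the four-point condition."""
    w, x, y, z = species
    pairings = [((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))]

    def total(pairing):
        (p, q), (r, s) = pairing
        return d[frozenset((p, q))] + d[frozenset((r, s))]

    ranked = sorted(pairings, key=total)
    # The neighbor pairing must be strictly smaller than both other sums.
    return ranked[0] if total(ranked[0]) < total(ranked[1]) else None

# Path lengths from hypothetical branch lengths a=1, b=2, c=3, d=4, e=5.
raw = {
    ("A", "B"): 1 + 2, ("C", "D"): 3 + 4,
    ("A", "C"): 1 + 5 + 3, ("B", "D"): 2 + 5 + 4,
    ("A", "D"): 1 + 5 + 4, ("B", "C"): 2 + 5 + 3,
}
d4 = {frozenset(k): v for k, v in raw.items()}
result = neighbor_pairing(d4, ["A", "B", "C", "D"])
print(result)  # (('A', 'B'), ('C', 'D'))
```

Here the neighbor pairing sums to 10 while the other two pairings both sum to 20, which is 10 + 2e as the equalities for this tree require.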
Neighbor-Joining Methods
• Other neighborliness approaches are available as well
• Neighbor-joining methods start with all species arranged in a star tree
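[Figure: a star tree with species a, b, c, d, e joined at a single central node, next to the tree after one pair of species has been pulled out]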
Neighbor-Joining Methods
• The pair of nodes pulled out (grouped) at each iteration are chosen so that the total length of the branches on the tree is minimized
• After a pair of nodes is pulled out, it forms a cluster in the tree and is included in further rounds of iteration (and a new distance matrix is generated)
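• The criterion used to rank pairs by the resulting total branch length of the tree is calculated as: $Q_{12} = (N - 2)\,d_{12} - \sum_{i=1}^{N} d_{1i} - \sum_{i=1}^{N} d_{2i}$, where $N$ is the current number of nodes and $d_{ij}$ is the distance between nodes $i$ and $j$; the pair with the smallest $Q$ is grouped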
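A sketch of one selection step, assuming Q is evaluated for every pair and the pair with the smallest value is grouped. It covers only the choice of the pair, not the branch-length estimates or the matrix update, and the function name is illustrative. The input is the five-species matrix from the earlier slide.

```python
from itertools import combinations

def nj_pick_pair(names, d):
    """Return the pair with the smallest Q = (N - 2) d(x, y) - sum d(x, .) - sum d(y, .)."""
    n = len(names)
    # Row sums: total distance from each species to all the others.
    total = {x: sum(d[frozenset((x, y))] for y in names if y != x) for x in names}
    q = {
        (x, y): (n - 2) * d[frozenset((x, y))] - total[x] - total[y]
        for x, y in combinations(names, 2)
    }
    best = min(q, key=q.get)
    return best, q[best]

raw = {
    ("A", "B"): 2, ("A", "C"): 4, ("A", "D"): 7, ("A", "E"): 3, ("B", "C"): 5,
    ("B", "D"): 9, ("B", "E"): 5, ("C", "D"): 5, ("C", "E"): 7, ("D", "E"): 8,
}
d5 = {frozenset(k): v for k, v in raw.items()}
best_pair, best_q = nj_pick_pair(["A", "B", "C", "D", "E"], d5)
print(best_pair, best_q)  # ('C', 'D') -35
```

For this matrix the first pair grouped is C and D, not the closest pair A and B that UPGMA starts with, because Q also accounts for how far each species is from all the others.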