Estimating Evolutionary Distances from DNA Sequences
Lecture 14
©Shlomo Moran, parts based on Ilan Gronau
Distance Based Methods for Reconstructing Phylogenies
1. Compute distances between all taxon pairs.
2. Find an edge-weighted tree that best describes the distances.
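D =
 0
 3  0
 9  8  0
15 14 18  0
17 16 20 22  0
16 15 19 21  9  0
[Figure: an edge-weighted tree on six leaves whose path lengths give D; its edge lengths are 4, 5, 7, 2, 1, 2, 10, 6, 1]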
The distances are implied by the assumed “model tree”
Model Tree: A Probabilistic Model of Evolution
[Figure: a model tree whose vertices are labeled by 7-letter DNA sequences, such as AATCCTG, ATAGCTG and AATGGGC]
DNA distribution at the root
stochastic transition matrices at the edges
mutations along the edges occur with probabilities defined by the transition matrices
From Model Tree to Additive Distances
[Figure: model tree → (assign edge lengths) → additive distance matrix]
We need to assign lengths d(e) to the edges of the tree such that, for all u,v, d(u,v) = ∑{d(e): edge e is on the path connecting u and v}.
We do this for a simple evolutionary model: the CFN model.
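The path-sum definition of d(u,v) can be sketched in a few lines of Python. The tree below is an assumption: a six-leaf topology, with made-up vertex names, chosen to be consistent with the edge lengths and the example matrix D shown earlier. It is meant as an illustration, not as the lecture's own construction.

```python
from collections import defaultdict

# Hypothetical edge list (u, v, length); leaves are "1".."6",
# internal vertices "a", "x", "y", "b" are illustrative names.
EDGES = [
    ("1", "a", 2), ("2", "a", 1), ("a", "x", 1),
    ("3", "x", 6), ("x", "y", 2), ("4", "y", 10),
    ("y", "b", 7), ("5", "b", 5), ("6", "b", 4),
]
LEAVES = ["1", "2", "3", "4", "5", "6"]


def build_adjacency(edges):
    """Undirected adjacency lists with edge lengths."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    return adj


def distances_from(source, adj):
    """Sum edge lengths along the unique tree path from source to each vertex."""
    dist = {source: 0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + w
                stack.append(v)
    return dist


def leaf_distance_matrix(edges, leaves):
    """d(u,v) for all leaf pairs, as a full symmetric matrix."""
    adj = build_adjacency(edges)
    return [[distances_from(u, adj)[v] for v in leaves] for u in leaves]


D = leaf_distance_matrix(EDGES, LEAVES)
for row in D:
    print(row)
```

Because every pairwise distance is a sum of edge lengths along a tree path, the resulting matrix is additive by construction.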
The CFN (Cavender-Farris-Neyman) 2-States Model
The CFN 2-states model distinguishes between two types of DNA bases: purines {A,G} and pyrimidines {C,T}.
[Figure: transitions (labeled α) are changes within a class, A ↔ G or C ↔ T; transversions (labeled β) are changes between purines and pyrimidines]
CFN: ignore transitions, count only transversions. Purines are collapsed to the state {0} and pyrimidines to the state {1}.
The CFN 2-States Model
• Purines are marked by 0 and pyrimidines by 1.
• Uniform distribution on the root: prob(s(r)=0) = prob(s(r)=1) = 0.5.
• On each (directed) edge e=(u,v): 0 < p(s(v)=0|s(u)=1) = p(s(v)=1|s(u)=0) = pe < 0.5.
Transition matrix of edge e (rows: state at u; columns: state at v):
       0      1
0    1-pe    pe
1     pe    1-pe
This implies a uniform distribution at each vertex.
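The uniformity claim can be checked in one step. As a worked equation (written in LaTeX, with pr denoting probability): if s(u) is uniform and e=(u,v), then

```latex
\begin{align*}
\mathrm{pr}(s(v)=0) &= \mathrm{pr}(s(u)=0)\,(1-p_e) + \mathrm{pr}(s(u)=1)\,p_e \\
                    &= \tfrac12 (1-p_e) + \tfrac12\, p_e = \tfrac12 .
\end{align*}
```

Since the root is uniform, applying this edge by edge away from the root gives a uniform distribution at every vertex.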
Mutation probabilities of edges are undirected
The mutation (state-change) probabilities of each (directed) edge in a CFN model tree are the same in both directions. For each edge e=(u,v) and b∈{0,1}:
p(s(v)=b|s(u)=1-b) = p(s(u)=1-b|s(v)=b) = pe.
This follows from Bayes' rule, since s(u) and s(v) are both uniformly distributed. Hence the transition matrix from u to v equals the transition matrix from v to u:
puv = pvu =
       0      1
0    1-pe    pe
1     pe    1-pe
State-Change Probabilities along Paths
Hence, we can ignore the direction of the edges when computing the state-change probabilities for any pair of vertices (u,v).
This state-change probability is symmetric: puv = p(s(v)=0|s(u)=1) = p(s(v)=1|s(u)=0).
Our goal is to convert the values of puv to additive distances d(u,v).
First we express puv, for any pair of vertices u,v, as a function of the mutation probabilities along the path connecting u and v.
Probability of State-Changes along a Path in the CFN 2-States Model
Consider a path of two edges, with state-change probabilities p1 and p2, connecting u and v. Since state-change probabilities are equal in both directions, the directions of the edges can be ignored. The states at u and v differ exactly when one of the two edges changes the state and the other does not, so
\[ p_{u,v} = \mathrm{pr}(s(u) \neq s(v)) = p_1(1-p_2) + p_2(1-p_1) = p_1 + p_2 - 2p_1p_2 . \]
Direct generalization of this formula to longer paths is tedious. However, there is a simple formula for the probability of a state change along an arbitrarily long path from u to v:
State-Changes Probabilities along a Path
•Each edge e has probability 0<pe <0.5 to change states.
•Let e1,…,el be the path of l edges from u to v.
Claim: The probability pu,v = pr(s(u) ≠ s(v)) is given by:
\[ p_{u,v} = \frac{1}{2}\Bigl(1 - \prod_{i=1}^{l} (1 - 2p_{e_i})\Bigr) \]
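The claim can be checked numerically: multiplying the 2×2 CFN transition matrices of the edges along a path gives the exact state-change probability, which should match the closed form. The following is a minimal sketch; the function names and the edge probabilities are illustrative choices, not part of the lecture.

```python
from functools import reduce


def cfn_matrix(p):
    """CFN transition matrix of an edge with state-change probability p."""
    return [[1 - p, p], [p, 1 - p]]


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def path_change_prob_exact(edge_probs):
    """Multiply the edge matrices along the path; entry [0][1] is pr(0 -> 1)."""
    identity = [[1.0, 0.0], [0.0, 1.0]]
    product = reduce(matmul2, (cfn_matrix(p) for p in edge_probs), identity)
    return product[0][1]


def path_change_prob_formula(edge_probs):
    """Closed form: 0.5 * (1 - prod_i (1 - 2 p_i))."""
    prod = 1.0
    for p in edge_probs:
        prod *= 1 - 2 * p
    return 0.5 * (1 - prod)


edge_probs = [0.1, 0.25, 0.4]  # illustrative values, each in (0, 0.5)
print(path_change_prob_exact(edge_probs), path_change_prob_formula(edge_probs))
```

For a path of two edges the closed form reduces to p1 + p2 - 2·p1·p2, the expression obtained directly for the two-edge case.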
State-Changes Probabilities along a Path (cont)
Proof of the formula
\[ p_{u,v} = \frac{1}{2}\Bigl(1 - \prod_{i=1}^{l} (1 - 2p_{e_i})\Bigr) \]
Define the following imaginary stochastic “bond-opening process” for changing a state along an edge e. Initially all edges are “bonded”. For each edge e:
1. With probability 2pe open the bond on e.
2. If the bond was opened, set the state to 0 or 1 with equal probability (0.5).
This process implies that at each edge ei the state is changed with probability 2pei · 0.5 = pei. If no bond on the path is opened, then s(u) = s(v); otherwise the state set at the last opened bond is uniform, so s(v) ≠ s(u) with probability 0.5. With this process we have:
\[ p_{u,v} = 0.5 \cdot \mathrm{prob}(\text{at least one bond was opened}) = 0.5\Bigl(1 - \prod_{i=1}^{l} (1 - 2p_{e_i})\Bigr) \]
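The bond-opening argument can also be checked by simulation. The sketch below is a Monte Carlo illustration under assumed edge probabilities; the function names, trial count and seed are arbitrary choices, and the agreement is only up to sampling noise.

```python
import random


def simulate_bond_process(edge_probs, trials=100_000, seed=0):
    """Estimate pr(s(u) != s(v)) by running the bond-opening process on a path."""
    rng = random.Random(seed)
    changed = 0
    for _ in range(trials):
        start = rng.randint(0, 1)          # uniform state at u
        state = start
        for p in edge_probs:
            if rng.random() < 2 * p:       # the bond on this edge is opened
                state = rng.randint(0, 1)  # state is reset to 0 or 1 uniformly
        changed += (state != start)
    return changed / trials


def closed_form(edge_probs):
    """0.5 * prob(at least one bond was opened)."""
    all_closed = 1.0
    for p in edge_probs:
        all_closed *= 1 - 2 * p
    return 0.5 * (1 - all_closed)


edge_probs = [0.1, 0.25, 0.4]  # illustrative values
print(simulate_bond_process(edge_probs), closed_form(edge_probs))
```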
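Bond Probabilities → Additive distances
The formula
\[ p_{u,v} = \frac{1}{2}\Bigl(1 - \prod_{i=1}^{l} (1 - 2p_{e_i})\Bigr) \]
implies
\[ 1 - 2p_{u,v} = \prod_{i=1}^{l} (1 - 2p_{e_i}) . \]
Define θe = 1-2pe = prob(the bond of e remained closed). Then
\[ \theta_{u,v} = 1 - 2p_{u,v} = \mathrm{prob}(\text{the bonds of all edges in } path(u,v) \text{ remained closed}) = \prod_{i=1}^{l} \theta_{e_i} . \]
Taking logarithms we get
\[ \log \theta_{u,v} = \sum_{i=1}^{l} \log \theta_{e_i} . \]
Thus, d(u,v) = -logθu,v is an additive metric on the tree.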
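Additivity can be illustrated numerically: the distance computed from the end-to-end state-change probability should equal the sum of the edge lengths -log(1-2pe) along the path. This is a minimal sketch with illustrative edge probabilities and hypothetical function names.

```python
import math


def edge_length(p):
    """d(e) = -log(theta_e), where theta_e = 1 - 2 * p_e."""
    return -math.log(1 - 2 * p)


def path_change_prob(edge_probs):
    """p_uv = 0.5 * (1 - prod_i (1 - 2 p_i)) for a path with the given edges."""
    prod = 1.0
    for p in edge_probs:
        prod *= 1 - 2 * p
    return 0.5 * (1 - prod)


def path_distance(edge_probs):
    """d(u,v) = -log(theta_uv), computed from the end-to-end probability p_uv."""
    return -math.log(1 - 2 * path_change_prob(edge_probs))


edge_probs = [0.05, 0.2, 0.3]  # illustrative values
print(path_distance(edge_probs), sum(edge_length(p) for p in edge_probs))
```

Note that 0 < pe < 0.5 gives 0 < θe < 1, so every edge length is strictly positive.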
A Physical Interpretation of d(u,v)=-logθu,v
Common physical models of evolution view mutations as “Poisson
processes”.
For the CFN model, this means that a mutation on edge e is a random
event that occurs at some frequency λe (i.e., λe is the average number
of mutations per site on e).
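With this interpretation, it can be shown that
\[ \lambda_e = -\frac{1}{2}\log(1 - 2p_e) = -\frac{1}{2}\log\theta_e , \]
so the expected number of mutations per site along the path from u to v is
\[ -\frac{1}{2}\log(1 - 2p_{u,v}) = \frac{d(u,v)}{2} . \]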
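One way to see where the factor 1/2 comes from is the following worked derivation. It rests on two assumptions that the slide does not spell out: the number N of mutation events on edge e is Poisson with mean λe, and each event flips the state, so the state changes exactly when N is odd.

```latex
\begin{align*}
p_e &= \mathrm{pr}(N \text{ is odd})
     = \sum_{j \ge 0} e^{-\lambda_e}\,\frac{\lambda_e^{2j+1}}{(2j+1)!}
     = e^{-\lambda_e}\sinh(\lambda_e)
     = \frac{1 - e^{-2\lambda_e}}{2}, \\
\theta_e &= 1 - 2p_e = e^{-2\lambda_e}
\quad\Longrightarrow\quad
\lambda_e = -\tfrac12 \log\theta_e .
\end{align*}
```

Summing over the edges of a path then gives -(1/2) log θu,v for the expected number of mutations per site between u and v.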
[Figure: sequences at the leaves of T]
A : AATGGGC
B : AATCCTG
C : ATAGCTG
D : GAACGTA
E : AAACCGA
F : GGGGATT
G : TCTGGGA
H : TCCGGAA
I : AGCCGTG
J : ACCGTTG
We saw that the values {d(u,v) = -log θuv : u,v are leaves of T} form an additive metric on T's leaves. Hence, if we had these distances, we could reconstruct T in O(n^2) time (e.g. by DLCA).
[Figure: leaf sequences → estimate {d(u,v) = -log θuv} from the sequences → estimated distance matrix]
The distances d(u,v) = -log θuv are estimated by using the fact that θuv = 1-2puv, and puv is naturally approximated by the Hamming distance between the sequences at u and v (divided by the sequence length), as we show next.
Estimating the Additive Distances
Definition: H(u,v) , the Hamming distance between (the sequence
at) u and (the sequence at) v, is the number of sites in which u
and v have different states.
H(u,v) can be used to estimate d(u,v) by the following steps:
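1. \( \hat{p}_{u,v} = H(u,v)/k \;\xrightarrow{k\to\infty}\; p_{u,v} \)
2. \( \hat{\theta}_{u,v} = 1 - 2\hat{p}_{u,v} \;\xrightarrow{k\to\infty}\; 1 - 2p_{u,v} = \theta_{u,v} \)
3. \( \hat{d}(u,v) = -\log\hat{\theta}_{u,v} \;\xrightarrow{k\to\infty}\; -\log\theta_{u,v} = d(u,v) \)
Here k is the sequence length (the number of sites).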
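The three estimation steps can be sketched in Python as follows. The sequences are taken from the leaf labels on the earlier slide; the function names are hypothetical, and returning infinity when the estimated θ is not positive is an assumption made here for very short sequences, not something the lecture prescribes.

```python
import math

PURINES = {"A", "G"}  # CFN state 0; pyrimidines C, T are state 1


def to_cfn(seq):
    """Map a DNA string to CFN states: purine -> 0, pyrimidine -> 1."""
    return [0 if base in PURINES else 1 for base in seq]


def cfn_hamming(x, y):
    """H(u,v): number of sites whose CFN states differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(to_cfn(x), to_cfn(y)))


def estimate_distance(x, y):
    """Steps 1-3: p_hat = H/k, theta_hat = 1 - 2 p_hat, d_hat = -log(theta_hat)."""
    k = len(x)
    p_hat = cfn_hamming(x, y) / k
    theta_hat = 1 - 2 * p_hat
    if theta_hat <= 0:
        return math.inf  # assumption: a saturated pair gets an infinite estimate
    return math.log(1 / theta_hat)


print(estimate_distance("AATCCTG", "ATAGCTG"))  # leaves B and C
print(estimate_distance("TCTGGGA", "TCCGGAA"))  # leaves G and H
```

Leaves G and H differ only by transitions (T ↔ C and G ↔ A), so their CFN Hamming distance, and hence their estimated distance, is zero. With only 7 sites such estimates are far too noisy to be meaningful; the example only illustrates the mechanics.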
Consistency of Distance Based Algorithms in the CFN Model
A tree reconstruction algorithm is said to be “consistent” for a probabilistic model of evolution if the following holds for any phylogenetic tree which fits the model:
When the sequence length goes to ∞, the reconstructed tree is, with high probability (w.h.p.), the true tree.
Thus, the previous slide shows that distance based methods are consistent for the CFN model.
Reconstructing Trees Generated by the CFN Model
The longer the sequences, the more accurate is our estimation of d(u,v) = -log θuv. Thus the accuracy of the estimation is a function of the sequence length, k.
A practical question: How long should the sequences be in order to guarantee an accurate reconstruction?
Much research on this question was done in the last decade. The bottom line is that estimations of long distances are very noisy: the sequence length needed to accurately estimate a distance d grows exponentially with d (recall that d is proportional to the expected number of mutations between vertices). Hence reconstruction should attempt to use only small distances.
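A rough way to see the exponential growth is a first-order (delta-method) approximation. This calculation is not part of the lecture and is only a heuristic: assuming independent sites, the variance of H(u,v)/k is p(1-p)/k, and propagating it through d = -log(1-2p) gives a variance of about (e^(2d) - 1)/k for the estimated distance. The function names below are hypothetical.

```python
import math


def approx_std_error(d, k):
    """Delta-method standard error of the estimate of d from k independent sites."""
    return math.sqrt((math.exp(2 * d) - 1) / k)


def approx_required_length(d, eps):
    """Sequence length at which the approximate standard error drops to eps."""
    return math.ceil((math.exp(2 * d) - 1) / eps ** 2)


for d in (0.5, 1.0, 2.0, 4.0):
    print(d, approx_required_length(d, eps=0.1))
```

Under this approximation, each additional unit of distance multiplies the required sequence length by roughly e^2, which matches the qualitative statement that long distances are expensive to estimate.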
More Involved Mutations Models
More involved models allow different types of mutations to have different probabilities.
The bad news is that the same exponential lower bound on the sequence length needed to estimate the distances holds for these models.
The good news is that when the model allows several types of mutations, there are many different distance functions which can be used, so it is possible to select for each model tree a distance function which is best for this tree.