Estimating Evolutionary Distances from DNA Sequences Lecture 14 ©Shlomo Moran, parts based on Ilan...

Estimating Evolutionary Distances from DNA Sequences

Lecture 14

©Shlomo Moran, parts based on Ilan Gronau

Distance Based Methods for Reconstructing Phylogenies

1. Compute distances between all taxon pairs.
2. Find an edge-weighted tree that best describes the distances.

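Distance matrix D (lower triangle):

0
3   0
9   8   0
15  14  18  0
17  16  20  22  0
16  15  19  21  9   0

[Figure: an edge-weighted tree realizing D, with edge lengths 4, 5, 7, 2, 1, 2, 10, 6, 1.]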

The distances are implied by the assumed “model tree”
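Whether a matrix can be described exactly by an edge-weighted tree can be tested with the four-point condition. The sketch below assumes that the matrix D of this slide is the six-taxon lower triangle listed in the code (an assumed reading of the slide); the function names are illustrative only.

```python
from itertools import combinations

# Lower triangle of D, as read from the slide (an assumption about its layout).
LOWER = [
    [0],
    [3, 0],
    [9, 8, 0],
    [15, 14, 18, 0],
    [17, 16, 20, 22, 0],
    [16, 15, 19, 21, 9, 0],
]

def full_matrix(lower):
    """Expand a lower-triangular list of rows into a full symmetric matrix."""
    n = len(lower)
    return [[lower[max(i, j)][min(i, j)] for j in range(n)] for i in range(n)]

def is_additive(d):
    """Four-point condition: for every quartet, the two largest of the
    three pairwise sums must be equal."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if sums[1] != sums[2]:
            return False
    return True

D = full_matrix(LOWER)
print(is_additive(D))
```

A matrix passing this test is realized by some edge-weighted tree; the test itself does not construct the tree.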


[Figure: the model tree, with a DNA sequence of length 7 at every vertex (for example AATCCTG, ATAGCTG and AATGGGC).]

Model Tree: A Probabilistic Model of Evolution

stochastic transition matrices at the edges

DNA distribution at the root

mutations along the edges occur with probabilities defined by the transition matrices


From Model Tree to Additive Distances

Assign edge lengths to the model tree to obtain an additive distance matrix.

We need to assign lengths d(e) to the edges of the tree, such that for all u,v, d(u,v) = ∑{d(e): edge e is on the path connecting u and v}.

We do this for a simple evolutionary model: the CFN model.
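The path-sum definition of d(u,v) can be sketched in a few lines. The tree, its vertex names and its edge lengths below are hypothetical and serve only to illustrate the definition.

```python
from collections import defaultdict

# Hypothetical edge-weighted tree: leaves a, b, c, d; internal vertices x, y.
EDGES = [("a", "x", 2.0), ("b", "x", 1.0), ("x", "y", 3.0),
         ("c", "y", 4.0), ("d", "y", 5.0)]

def tree_distances(edges, source):
    """Return d(source, v) for every vertex v, summing d(e) along the unique path."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = {source: 0.0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:  # a tree has exactly one path to each vertex
                dist[v] = dist[u] + w
                stack.append(v)
    return dist

from_a = tree_distances(EDGES, "a")
print(from_a["b"], from_a["c"], from_a["d"])
```

Running the traversal once from every leaf gives the full additive distance matrix of the tree.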

The CFN (Cavender-Farris-Neyman) 2-States Model

The CFN 2-states model distinguishes between two types of DNA bases: purines {A,G} and pyrimidines {C,T}.

[Figure: the purines {A,G} form state {0} and the pyrimidines {C,T} form state {1}. Transitions (labeled α) are substitutions within a class, A↔G or C↔T; transversions (labeled β) are substitutions between the two classes.]

CFN: ignore transitions, count only transversions.


The CFN 2-States Model

• Purines are marked by 0 and pyrimidines by 1.
• Uniform distribution on the root: prob(s(r)=0) = prob(s(r)=1) = 0.5.
• On each (directed) edge e=(u,v): 0 < p(s(v)=0|s(u)=1) = p(s(v)=1|s(u)=0) = pe < 0.5.

Transition matrix of edge e (rows: state at u; columns: state at v):

        0       1
0     1-pe     pe
1      pe     1-pe

This implies a uniform distribution at each vertex.
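The claim that the uniform distribution is preserved along an edge can be checked with exact arithmetic. This is a minimal sketch; the value of pe and the helper names are assumptions chosen for illustration.

```python
from fractions import Fraction

def cfn_matrix(p_e):
    """2x2 CFN transition matrix of an edge with state-change probability p_e."""
    return [[1 - p_e, p_e], [p_e, 1 - p_e]]

def push_forward(dist_u, matrix):
    """Distribution at v, given the distribution at u and the edge matrix."""
    return [sum(dist_u[a] * matrix[a][b] for a in range(2)) for b in range(2)]

p_e = Fraction(1, 10)  # hypothetical edge probability, 0 < p_e < 0.5
uniform = [Fraction(1, 2), Fraction(1, 2)]
at_v = push_forward(uniform, cfn_matrix(p_e))
print(at_v == uniform)
```

Since 0.5(1-pe) + 0.5pe = 0.5 for any pe, the same holds at every vertex by induction from the root.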

Mutation probabilities of edges are undirected

The mutation (state-change) probabilities of each (directed) edge in a CFN model tree are the same in both directions:

For each edge e=(u,v) and b ∈ {0,1}:

p(s(v)=b|s(u)=1-b) = p(s(u)=1-b|s(v)=b) = pe.

[Figure: an edge (u,v) with its transition matrix written in both directions; the two matrices are identical, puv = pvu.]

        0       1
0     1-pe     pe
1      pe     1-pe
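The symmetry follows from Bayes' rule together with the uniform distribution at every vertex. A short derivation, given here as a sketch of that step:

$$
\Pr(s(u)=1-b \mid s(v)=b) = \frac{\Pr(s(v)=b \mid s(u)=1-b)\,\Pr(s(u)=1-b)}{\Pr(s(v)=b)} = \frac{p_e \cdot 0.5}{0.5} = p_e
$$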

State-Change Probabilities along Paths

Hence, we can ignore the direction of the edges when computing the state-change probabilities for any pair of vertices (u,v).

This state-change probability is symmetric: puv = p(s(v)=0|s(u)=1) = p(s(v)=1|s(u)=0).

Our goal is to convert the values of puv to additive distances d(u,v).

First we express puv, for any pair of vertices u,v, as a function of the mutation probabilities along the path connecting u and v.

Probability of State-Changes along a Path in the CFN 2-States Model

Consider a path of two edges between u and v, with state-change probabilities p1 and p2. Since state-change probabilities are equal in both directions, the directions of the edges can be ignored.

$$p_{u,v} = \Pr(s(u) \neq s(v)) = p_1(1-p_2) + p_2(1-p_1) = p_1 + p_2 - 2p_1p_2$$

Direct generalization of this formula to longer paths is tedious. However, there is a simple formula for the probability of a state change along an arbitrarily long path from u to v:


State-Changes Probabilities along a Path

• Each edge e has probability 0 < pe < 0.5 to change states.
• Let e1,…,el be the path of l edges from u to v.

Claim: The probability $p_{u,v} = \Pr(s(u) \neq s(v))$ is given by:

$$p_{u,v} = \frac{1}{2}\Bigl(1 - \prod_{i=1}^{l}(1-2p_{e_i})\Bigr)$$
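The claim can be checked numerically by applying the two-edge rule p1(1-p2) + p2(1-p1) one edge at a time and comparing with the closed form. The edge probabilities below are hypothetical values; the function names are illustrative.

```python
from functools import reduce
from math import isclose, prod

def combine(p1, p2):
    """State-change probability over two consecutive edges."""
    return p1 * (1 - p2) + p2 * (1 - p1)

def path_change_prob_iterative(edge_probs):
    """Fold the two-edge rule over the path, one edge at a time."""
    return reduce(combine, edge_probs, 0.0)

def path_change_prob_closed(edge_probs):
    """Closed form: 0.5 * (1 - product of (1 - 2 p_e))."""
    return 0.5 * (1 - prod(1 - 2 * p for p in edge_probs))

edge_probs = [0.1, 0.2, 0.05, 0.3]  # hypothetical values, each in (0, 0.5)
print(isclose(path_change_prob_iterative(edge_probs),
              path_change_prob_closed(edge_probs)))
```

The agreement reflects the identity 1 - 2(p1 + p2 - 2p1p2) = (1-2p1)(1-2p2), which is the two-edge case of the claim.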


State-Changes Probabilities along a Path (cont)

$$p_{u,v} = \frac{1}{2}\Bigl(1 - \prod_{i=1}^{l}(1-2p_{e_i})\Bigr)$$

Proof of the formula

Define the following imaginary stochastic “bond-opening process” for changing a state along an edge e. Initially all edges are “bonded”. For each edge e:

1. With probability 2pe open the bond on e.
2. If the bond was opened, set the state to 0 or 1 with equal probability (0.5).

This process implies that at each edge ei the state is changed with probability pei. If no bond on the path is opened, then s(u) = s(v); if at least one bond is opened, the state at v is uniformly random and differs from s(u) with probability 0.5. With this process we have:

$$p_{u,v} = 0.5 \cdot \Pr(\text{at least one bond was opened}) = 0.5\Bigl(1 - \prod_{i=1}^{l}(1-2p_{e_i})\Bigr)$$
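The bond-opening process is easy to simulate, which gives an empirical check of the formula. This is a Monte Carlo sketch with hypothetical edge probabilities, a fixed seed and an arbitrary number of trials; it illustrates the argument and is not part of the proof.

```python
import random
from math import prod

def simulate_path(edge_probs, rng):
    """Run the bond-opening process along a path.
    Returns True if the final state differs from the initial one."""
    start = rng.randrange(2)
    state = start
    for p in edge_probs:
        if rng.random() < 2 * p:      # open the bond with probability 2*p_e
            state = rng.randrange(2)  # re-draw the state uniformly
    return state != start

def estimate_change_prob(edge_probs, trials=200_000, seed=0):
    """Fraction of simulated runs in which the state changed."""
    rng = random.Random(seed)
    return sum(simulate_path(edge_probs, rng) for _ in range(trials)) / trials

edge_probs = [0.1, 0.2, 0.3]  # hypothetical values, each in (0, 0.5)
closed_form = 0.5 * (1 - prod(1 - 2 * p for p in edge_probs))
print(estimate_change_prob(edge_probs), closed_form)
```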

From Bond Probabilities to Additive Distances

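$$p_{u,v} = \frac{1}{2}\Bigl(1 - \prod_{i=1}^{l}(1-2p_{e_i})\Bigr) \quad\text{implies}\quad 1 - 2p_{u,v} = \prod_{i=1}^{l}(1-2p_{e_i})$$

Define $\theta_e = 1 - 2p_e = \Pr(\text{the bond of } e \text{ remained closed})$. Then

$$\theta_{u,v} = 1 - 2p_{u,v} = \Pr(\text{the bonds of all edges in } path(u,v) \text{ remained closed}) = \prod_{i=1}^{l}\theta_{e_i}$$

Taking logarithms we get

$$\log\theta_{u,v} = \sum_{i=1}^{l}\log\theta_{e_i}$$

Thus, d(u,v) = -log θu,v is an additive metric on the tree.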
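Additivity can be verified numerically: the distance computed from the state-change probability of a whole path equals the sum of the distances of its edges, each edge e having length -log θe > 0. The edge probabilities are hypothetical and the function names are illustrative.

```python
from math import isclose, log, prod

def path_change_prob(edge_probs):
    """p_{u,v} = 0.5 * (1 - product of (1 - 2 p_e)) over the path."""
    return 0.5 * (1 - prod(1 - 2 * p for p in edge_probs))

def cfn_distance(p):
    """d = -log(theta), with theta = 1 - 2p."""
    return -log(1 - 2 * p)

edge_probs = [0.05, 0.1, 0.2]  # hypothetical values, each in (0, 0.5)
d_path = cfn_distance(path_change_prob(edge_probs))
d_sum = sum(cfn_distance(p) for p in edge_probs)
print(isclose(d_path, d_sum))
```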

A Physical Interpretation of d(u,v)=-logθu,v

Common physical models of evolution view mutations as “Poisson processes”.

For the CFN model, this means that a mutation on edge e is a random event that occurs at some frequency λe (i.e., λe is the average number of mutations per site on e).

With this interpretation, it can be shown that

$$\lambda_e = -\frac{\log(1-2p_e)}{2} = -\frac{\log\theta_e}{2}$$

$$\frac{d(u,v)}{2} = -\frac{\log(1-2p_{uv})}{2}$$
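A sketch of why this holds, under the assumption (not spelled out on the slide) that every mutation event on e flips the state and that the number N of events per site on e is Poisson with mean λe. The state changes exactly when N is odd, so

$$
p_e = \Pr(N \text{ odd}) = \sum_{j \ge 0} e^{-\lambda_e}\frac{\lambda_e^{2j+1}}{(2j+1)!} = e^{-\lambda_e}\sinh\lambda_e = \frac{1-e^{-2\lambda_e}}{2}
$$

Hence $\theta_e = 1-2p_e = e^{-2\lambda_e}$, and for a path $e_1,\dots,e_l$ from u to v,

$$
d(u,v) = -\log\theta_{u,v} = \sum_{i=1}^{l} -\log\theta_{e_i} = 2\sum_{i=1}^{l}\lambda_{e_i}
$$

so d(u,v)/2 is the expected number of mutations per site along the path.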

[Figure: the tree T, whose leaves carry the following sequences.]

A : AATGGGC
B : AATCCTG
C : ATAGCTG
D : GAACGTA
E : AAACCGA
F : GGGGATT
G : TCTGGGA
H : TCCGGAA
I : AGCCGTG
J : ACCGTTG

We saw that the values {d(u,v) = -log θuv : u,v are leaves of T} form an additive metric on T’s leaves. Hence, if we had these distances, we could reconstruct T in O(n^2) time (e.g., by DLCA).

The distances d(u,v) = -log θuv are estimated by using the fact that θuv = 1-2puv, and puv is naturally approximated by the fraction of sites at which the sequences at u and v differ (their normalized Hamming distance), as we show next.

[Figure: estimate {d(u,v) = -log θuv} from the sequences to obtain an estimated distance matrix.]

Estimating the Additive Distances

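Definition: H(u,v), the Hamming distance between (the sequence at) u and (the sequence at) v, is the number of sites in which u and v have different states.

H(u,v) can be used to estimate d(u,v) by the following steps, where k is the sequence length (the number of sites):

$$1.\quad \hat{p}_{u,v} = H(u,v)/k \;\xrightarrow[k\to\infty]{}\; p_{u,v}$$

$$2.\quad \hat{\theta}_{u,v} = 1 - 2\hat{p}_{u,v} \;\xrightarrow[k\to\infty]{}\; 1 - 2p_{u,v} = \theta_{u,v}$$

$$3.\quad \hat{d}(u,v) = -\log\hat{\theta}_{u,v} \;\xrightarrow[k\to\infty]{}\; -\log\theta_{u,v} = d(u,v)$$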
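The three steps can be sketched directly on DNA strings. The purine/pyrimidine encoding follows the CFN convention of the earlier slides; returning infinity when the estimated θ is not positive is an assumption of this sketch, since the slides do not say how to treat that case.

```python
from math import log

PURINES = set("AG")  # state 0; pyrimidines C, T are state 1

def to_cfn(seq):
    """Map a DNA string to its CFN 0/1 states."""
    return [0 if base in PURINES else 1 for base in seq]

def hamming(x, y):
    """Number of sites at which two equal-length state sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(x, y))

def estimate_cfn_distance(seq_u, seq_v):
    """Steps 1-3: p_hat = H/k, theta_hat = 1 - 2 p_hat, d_hat = -log(theta_hat)."""
    x, y = to_cfn(seq_u), to_cfn(seq_v)
    k = len(x)
    p_hat = hamming(x, y) / k
    theta_hat = 1 - 2 * p_hat
    if theta_hat <= 0:
        return float("inf")  # assumption: treat a saturated estimate as infinite
    return -log(theta_hat)

print(estimate_cfn_distance("AATCCTG", "ATAGCTG"))
```

For these two sequences the CFN states differ at 3 of the 7 sites, so the estimate is -log(1/7) = log 7; with only 7 sites such an estimate is very noisy.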

Consistency of Distance Based Algorithms in the CFN Model

A tree reconstruction algorithm is said to be “consistent” for a probabilistic model of evolution, if the following holds for any phylogenetic tree which fits the model:

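When the sequence length goes to ∞, the reconstructed tree is w.h.p. (with high probability) the true tree.

Thus, the previous slide shows that distance based methods are consistent for the CFN model: the estimated distances converge to the additive distances d(u,v) as k → ∞.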
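The convergence of the estimate can be illustrated by simulation. The sketch below draws each site independently, changing state with a hypothetical probability p_uv, and compares the estimate with the true d = -log(1 - 2 p_uv) for growing k; the seed and the values of k are arbitrary choices.

```python
import random
from math import log

def simulated_estimate(p_uv, k, rng):
    """Estimate d(u,v) from k independent sites, each differing with prob. p_uv."""
    h = sum(rng.random() < p_uv for _ in range(k))  # simulated Hamming distance
    theta_hat = 1 - 2 * h / k
    return -log(theta_hat) if theta_hat > 0 else float("inf")

p_uv = 0.1                      # hypothetical state-change probability
true_d = -log(1 - 2 * p_uv)
rng = random.Random(0)
for k in (100, 10_000, 100_000):
    print(k, simulated_estimate(p_uv, k, rng), true_d)
```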


Reconstructing Trees Generated by the CFN Model

The longer the sequences, the more accurate our estimate of d(u,v) = -log θuv. Thus the accuracy of the estimate is a function of the sequence length, k.

A practical question: how long should the sequences be in order to guarantee an accurate reconstruction?

Much research on this question was done in the last decade. The bottom line is that estimates of long distances are very noisy: the sequence length needed to accurately estimate a distance d grows exponentially with d (recall that d is proportional to the expected number of mutations between vertices). Hence reconstruction should attempt to use only small distances.
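A rough feel for the exponential growth comes from a first-order (delta-method) approximation: the estimate of p has variance p(1-p)/k, and d = -log(1-2p) has derivative 2/(1-2p), which gives Var(d̂) ≈ (e^{2d} - 1)/k. The sketch below uses only this heuristic; it is not one of the rigorous bounds from the literature referred to above, and the function names are illustrative.

```python
from math import ceil, exp

def approx_std(d, k):
    """First-order approximation of the standard deviation of the estimate of d."""
    return ((exp(2 * d) - 1) / k) ** 0.5

def approx_length_needed(d, eps):
    """Smallest k for which the approximate standard deviation is at most eps."""
    return ceil((exp(2 * d) - 1) / eps ** 2)

for d in (0.5, 1.0, 2.0, 4.0):
    print(d, approx_length_needed(d, eps=0.1))
```

Under this approximation, each additional unit of distance multiplies the required sequence length by roughly e^2.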

More Involved Mutations Models

More involved models allow different types of mutations to have different probabilities.

The bad news is that the same exponential lower bound on the length of the sequences needed to estimate the distances holds for these models.

The good news is that when the model allows several types of mutations, there are many different distance functions which can be used, so it is possible to select for each model tree a distance function which is best for this tree.
