Comp. Genomics Recitation 8 Phylogeny. Outline Phylogeny: Distance based Probabilistic Parsimony.


Comp. Genomics

Recitation 8: Phylogeny

Outline

Phylogeny:
• Distance based
• Probabilistic
• Parsimony

Exercise

• Show that in UPGMA, for some new cluster k

$$C_k = C_i \cup C_j$$

• The distances d_kl are given by:

$$d_{kl} = \frac{d_{il}|C_i| + d_{jl}|C_j|}{|C_i| + |C_j|}$$

for any cluster l

Solution

• Since the members of k are the members of i and j, the sum of distances between members of k and l can be written as:

$$\sum_{x \in C_k,\, y \in C_l} d_{xy} = \sum_{x \in C_i,\, y \in C_l} d_{xy} + \sum_{x \in C_j,\, y \in C_l} d_{xy}$$

• This is equal to:

$$d_{il}|C_i||C_l| + d_{jl}|C_j||C_l|$$

Solution

• By the definition of distance between clusters, we divide the latter sum by |Ck|·|Cl|:

$$d_{kl} = \frac{d_{il}|C_i||C_l| + d_{jl}|C_j||C_l|}{|C_k||C_l|} = \frac{d_{il}|C_i||C_l| + d_{jl}|C_j||C_l|}{(|C_i| + |C_j|)\,|C_l|} = \frac{d_{il}|C_i| + d_{jl}|C_j|}{|C_i| + |C_j|}$$
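The identity above can be checked numerically. The following is a minimal sketch, assuming a small made-up distance matrix and arbitrary cluster choices (the values and the helper names `avg_distance` and `merged_distance` are illustrative, not from the slides). It compares the direct average over all pairs with the update formula.

```python
from itertools import product

def avg_distance(d, A, B):
    """Average pairwise distance between clusters A and B (UPGMA definition)."""
    return sum(d[x][y] for x, y in product(A, B)) / (len(A) * len(B))

def merged_distance(d_il, d_jl, size_i, size_j):
    """Distance from the merged cluster k (i joined with j) to l, via the update formula."""
    return (d_il * size_i + d_jl * size_j) / (size_i + size_j)

# Hypothetical symmetric distances between five leaves (illustrative values only).
d = [
    [0, 2, 6, 7, 9],
    [2, 0, 5, 8, 4],
    [6, 5, 0, 3, 8],
    [7, 8, 3, 0, 6],
    [9, 4, 8, 6, 0],
]
C_i, C_j, C_l = [0, 1], [2], [3, 4]

# Left side: average over all pairs between C_k = C_i + C_j and C_l.
direct = avg_distance(d, C_i + C_j, C_l)

# Right side: size-weighted combination of d_il and d_jl.
via_formula = merged_distance(
    avg_distance(d, C_i, C_l), avg_distance(d, C_j, C_l), len(C_i), len(C_j)
)

print(direct, via_formula)
```

The practical point of the formula is that UPGMA never needs to revisit the leaf-level distances after a merge: the cluster sizes and the two old distances are enough.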

Exercise

• Show that in a tree constructed by UPGMA, no parent node is lower than its daughter nodes

Exercise

[Figure: node k joins i and j at height h_k = d_ij/2; node n joins k and l at height h_n = d_kl/2.]

Can n be lower than k?

Solution

• Since h_n = d_kl/2 and h_k = d_ij/2, we will show that d_kl ≥ d_ij for every cluster l, and therefore node n is at least as high as node k

• According to the previous exercise:

$$d_{kl} = \frac{d_{il}|C_i| + d_{jl}|C_j|}{|C_i| + |C_j|} \ge \min(d_{il}, d_{jl})$$

Solution

• Since i and j were merged and not i and l or j and l, we can conclude that

$$d_{kl} \ge \min(d_{il}, d_{jl}) \ge d_{ij}$$
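The monotonicity argument can also be observed on a concrete run. Below is a minimal UPGMA sketch, assuming a made-up distance matrix and a simple dictionary-based implementation (the function name `upgma` and all values are illustrative). It records the height of every merged node and checks that no parent is lower than its children, up to a small floating-point tolerance.

```python
def upgma(d):
    """Run UPGMA on a symmetric distance matrix.

    Returns (merges, height): merges is a list of (child_i, child_j, parent_k),
    height maps every node id to its height (leaves are at height 0).
    """
    n = len(d)
    sizes = {i: 1 for i in range(n)}                      # cluster id -> size
    dist = {(i, j): d[i][j] for i in range(n) for j in range(i + 1, n)}
    height = {i: 0.0 for i in range(n)}
    merges = []
    next_id = n
    while len(sizes) > 1:
        # Pick the closest pair of current clusters (ties: first found).
        (i, j), d_ij = min(dist.items(), key=lambda kv: kv[1])
        k, next_id = next_id, next_id + 1
        height[k] = d_ij / 2
        merges.append((i, j, k))
        size_i, size_j = sizes.pop(i), sizes.pop(j)
        del dist[(i, j)]
        # Update distances from the new cluster k to every remaining cluster l.
        for l in list(sizes):
            d_il = dist.pop((min(i, l), max(i, l)))
            d_jl = dist.pop((min(j, l), max(j, l)))
            dist[(l, k)] = (d_il * size_i + d_jl * size_j) / (size_i + size_j)
        sizes[k] = size_i + size_j
    return merges, height

# Hypothetical distances between five leaves (illustrative values only).
d = [
    [0, 2, 6, 7, 9],
    [2, 0, 5, 8, 4],
    [6, 5, 0, 3, 8],
    [7, 8, 3, 0, 6],
    [9, 4, 8, 6, 0],
]
merges, height = upgma(d)
monotone = all(height[k] >= max(height[i], height[j]) - 1e-12 for i, j, k in merges)
print(monotone)
```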

Exercise

• Show an example in which the parent node height is equal to the child node height (UPGMA).

Solution

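• Suppose three sequences are pairwise equidistant: all 3 pairs have the same distance d.

• We choose to merge leaves 1 and 2 and produce node 4, with height d/2.

• The new distance, d_43, is exactly d, since it is the average of d_13 = d and d_23 = d.

• So when we merge node 4 and leaf 3, we create a new node 5 of height d/2, equal to the height of its child node 4.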

Solution

[Figure: leaves 1, 2 and 3. Node 4 joins leaves 1 and 2 at height d/2; node 5 joins node 4 and leaf 3, also at height d/2.]

Exercise

• The famous paleontologist R. Geller argued to his sister that the last common ancestor of birds and dinosaurs lived 100 million years ago.

• His sister claimed that the ancestor lived 200 million years ago.

• The evidence is a pair of homologous genes, 1000 nt long, with 350 differences (it's not contamination this time…)

Exercise

• Both accept the Jukes-Cantor model

• Both accept the assumption of a molecular clock

• If mutations occur independently, with a rate of 10^-9 mutations per site per year, whose theory is more likely to be correct?

Solution

• According to Jukes-Cantor, the probability of a nucleotide remaining unchanged over t time units is:

$$P_{x \to x}(t) = \frac{1}{4}\left(1 + 3e^{-4\alpha t}\right)$$

• The probability for a specific change:

$$P_{x \to y}(t) = \frac{1}{4}\left(1 - e^{-4\alpha t}\right), \quad y \ne x$$
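As a sanity check on the two formulas, the sketch below evaluates them and confirms that the four outcomes from any starting nucleotide sum to 1. The function names `jc_same` and `jc_diff` and the parameter values are illustrative assumptions.

```python
import math

def jc_same(alpha, t):
    """Jukes-Cantor probability that a nucleotide is unchanged after time t."""
    return 0.25 * (1 + 3 * math.exp(-4 * alpha * t))

def jc_diff(alpha, t):
    """Jukes-Cantor probability of one specific change x -> y (y != x) after time t."""
    return 0.25 * (1 - math.exp(-4 * alpha * t))

alpha, t = 1e-9 / 3, 100e6   # illustrative values

# One way to stay the same plus three ways to change must sum to 1.
total = jc_same(alpha, t) + 3 * jc_diff(alpha, t)
print(jc_same(alpha, t), jc_diff(alpha, t), total)
```

At t = 0 the first probability is 1 and the second is 0; as t grows, both approach 1/4, the stationary frequency of each nucleotide.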

Solution

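[Figure: tree T. The ancestor is the root, with one branch of length t leading to Bird and one branch of length t leading to Dinosaur.]

Molecular clock – both species evolve at the same rate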

Solution

• The likelihood of the tree at site i is:

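$$L_i(t) = P(\text{bird}_i, \text{dinosaur}_i \mid T, \text{time from parent is } t)$$

$$= \sum_{\text{ancestor}_i} q_{\text{ancestor}_i}\, P(\text{bird}_i \mid \text{ancestor}_i, t)\, P(\text{dinosaur}_i \mid \text{ancestor}_i, t) \quad \text{(likelihood of a tree)}$$

$$= \sum_{\text{ancestor}_i} q_{\text{bird}_i}\, P(\text{dinosaur}_i \mid \text{ancestor}_i, t)\, P(\text{ancestor}_i \mid \text{bird}_i, t) \quad \text{(Jukes-Cantor reversibility property)}$$

$$= q_{\text{bird}_i}\, P(\text{dinosaur}_i \mid \text{bird}_i, 2t) \quad \text{(Jukes-Cantor additivity)}$$

Less work to do: the sum over the ancestral character disappears.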
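The collapse of the sum over the ancestor can be verified numerically. This is a minimal sketch, assuming the uniform prior q = 1/4 and illustrative values for α and t; the helper names are hypothetical. It compares the explicit sum over the four ancestral characters with the closed form using a single branch of length 2t.

```python
import math

Q = 0.25  # uniform prior over the four nucleotides under Jukes-Cantor

def jc_matrix(alpha, t):
    """4x4 Jukes-Cantor transition matrix for time t (nucleotides indexed 0..3)."""
    e = math.exp(-4 * alpha * t)
    same, diff = 0.25 * (1 + 3 * e), 0.25 * (1 - e)
    return [[same if x == y else diff for y in range(4)] for x in range(4)]

def site_likelihood_sum(bird, dino, alpha, t):
    """Sum over the ancestral character of q * P(bird | anc, t) * P(dino | anc, t)."""
    P = jc_matrix(alpha, t)
    return sum(Q * P[anc][bird] * P[anc][dino] for anc in range(4))

def site_likelihood_closed(bird, dino, alpha, t):
    """Closed form: q * P(dino | bird, 2t)."""
    return Q * jc_matrix(alpha, 2 * t)[bird][dino]

alpha, t = 1e-9 / 3, 100e6   # illustrative values
max_gap = max(
    abs(site_likelihood_sum(x, y, alpha, t) - site_likelihood_closed(x, y, alpha, t))
    for x in range(4) for y in range(4)
)
print(max_gap)
```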

Solution

• Since the distance between the species is 2t, the probability of every site in which there is a match is:

$$L_i(t) = q_i \cdot \frac{1}{4}\left(1 + 3e^{-4\alpha(2t)}\right) = \frac{1}{16}\left(1 + 3e^{-4\alpha(2t)}\right)$$

• For a mismatch, the probability is:

$$L_i(t) = \frac{1}{16}\left(1 - e^{-4\alpha(2t)}\right)$$

Solution

• So the likelihood of the tree T is

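$$L(T) = \left[\frac{1}{16}\left(1 + 3e^{-4\alpha(2t)}\right)\right]^{650} \left[\frac{1}{16}\left(1 - e^{-4\alpha(2t)}\right)\right]^{350}$$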

Solution

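• The log of the likelihood ratio between the tree T_1 suggested by Dr. Geller (t = 100·10^6 years) and the tree T_2 suggested by his sister (t = 200·10^6 years) is:

$$\ln\frac{L(T_1)}{L(T_2)} = \ln\frac{\left(1 + 3e^{-4\alpha(2 \cdot 100 \cdot 10^6)}\right)^{650}\left(1 - e^{-4\alpha(2 \cdot 100 \cdot 10^6)}\right)^{350}}{\left(1 + 3e^{-4\alpha(2 \cdot 200 \cdot 10^6)}\right)^{650}\left(1 - e^{-4\alpha(2 \cdot 200 \cdot 10^6)}\right)^{350}}$$

$$= 650 \ln\frac{1 + 3e^{-4\alpha(2 \cdot 100 \cdot 10^6)}}{1 + 3e^{-4\alpha(2 \cdot 200 \cdot 10^6)}} + 350 \ln\frac{1 - e^{-4\alpha(2 \cdot 100 \cdot 10^6)}}{1 - e^{-4\alpha(2 \cdot 200 \cdot 10^6)}}$$

The factors of 1/16 cancel in the ratio. The total substitution rate is 3α = 10^-9, so α = (1/3)·10^-9.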

Solution

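Substituting α, the exponents are approximately 0.26 and 0.52:

$$\ln\frac{L(T_1)}{L(T_2)} = 650 \ln\frac{1 + 3e^{-0.26}}{1 + 3e^{-0.52}} + 350 \ln\frac{1 - e^{-0.26}}{1 - e^{-0.52}}$$

$$= 650 \ln\left(1 + 3e^{-0.26}\right) - 650 \ln\left(1 + 3e^{-0.52}\right) + 350 \ln\left(1 - e^{-0.26}\right) - 350 \ln\left(1 - e^{-0.52}\right)$$

$$\approx 779 - 666 - 516 + 317 = -86$$

The log-ratio is negative, so L(T_2) > L(T_1): the 200-million-year hypothesis is the more likely one.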

Yay!
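The comparison can be reproduced in a few lines. This is a minimal sketch using the numbers from the exercise (650 matching sites, 350 mismatches, 3α = 10^-9); the function name `log_likelihood` is an illustrative choice. It keeps the exponents unrounded, so the result lands within a few units of the hand-computed value above and has the same sign.

```python
import math

ALPHA = 1e-9 / 3                 # from 3 * alpha = 1e-9
MATCHES, MISMATCHES = 650, 350   # 1000 nt with 350 differences

def log_likelihood(t_years):
    """Log-likelihood of the two-leaf tree whose branches each have length t_years."""
    e = math.exp(-4 * ALPHA * 2 * t_years)
    return (MATCHES * math.log((1 + 3 * e) / 16)
            + MISMATCHES * math.log((1 - e) / 16))

# Positive would favour 100 million years, negative favours 200 million years.
log_ratio = log_likelihood(100e6) - log_likelihood(200e6)
print(log_ratio)
```

Working with log-likelihoods rather than raw likelihoods avoids underflow: each likelihood is a product of 1000 factors smaller than 1/4.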

Exercise

• Assume that the substitution cost for a weighted parsimony algorithm is a metric, i.e. it satisfies S(a,a)=0, S(a,b)=S(b,a) and S(a,c)≤S(a,b)+S(b,c).

• Show that the minimal cost of the tree is independent of the position of the root.

Solution

• We have a set of species and we are given a minimal weight tree for it. Denote the root in this tree by k

[Figure: tree T. The root k has children i and j; node i has children l and m.]

We will show that deleting k and moving it to the edge between i and l does not change the cost of the tree.

Solution

• What is the cost of the tree before translocation of the root?

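For a specific choice of character c at the root, the cost of this tree is:

$$S^T_c = \min_a\left[S(c,a) + S_i(a)\right] + \min_b\left[S(c,b) + S_j(b)\right]$$

$$= \min_a\left[S(a,c) + S_i(a)\right] + \min_b\left[S(c,b) + S_j(b)\right]$$

where S_i(a) is the minimal cost of the subtree below i when i carries character a, and likewise for S_j(b).

• And the minimal cost of the tree is:

$$\min_c S^T_c$$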

Solution

• Due to the triangle inequality, S(a,b)≤S(a,c)+S(c,b)

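• So no choice of c can cost less than connecting a and b directly, and setting c to a (or equivalently to b) attains this bound. With c = a and S(a,a) = 0 we get, for a fixed a:

$$S(a,a) + S_i(a) + \min_b\left[S(a,b) + S_j(b)\right] = S_i(a) + \min_b\left[S(a,b) + S_j(b)\right]$$

Minimizing over a gives the minimal cost of the tree:

$$S^T = \min_a\left(S_i(a) + \min_b\left[S(a,b) + S_j(b)\right]\right)$$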

Solution

• Now we move the root:

[Figure: the same tree, with the root k moved from above i and j onto the edge between i and l.]

• Call this tree T’

Solution

• Denote the character at l as d

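[Figure: tree T'. The root k has children l and i; node i has children j and m.]

• The new cost is:

$$\min_{a,d}\left[S'_i(a) + S(a,d) + S'_l(d)\right] = \min_{a,d}\left[S'_i(a) + S(a,d) + S_l(d)\right]$$

where S'_i replaces S_i because the subtree below i has changed (it now holds j and m), while the subtree below l is unchanged, so S'_l = S_l.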

Solution

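[Figure: tree T' (root k with children l and i; i has children j and m) next to tree T (root k with children i and j; i has children l and m).]

Cost of T':

$$\min_{a,d}\left[S'_i(a) + S(a,d) + S_l(d)\right]$$

Cost of T:

$$\min_{a,b}\left[S_i(a) + S(a,b) + S_j(b)\right]$$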

Solution

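[Figure: trees T' and T side by side, as on the previous slide.]

For T', expanding S'_i over its children j and m:

$$\min_{a,d}\left[S'_i(a) + S(a,d) + S_l(d)\right] = \min_{a,d}\left[\min_b\left(S_j(b) + S(a,b)\right) + \min_e\left(S_m(e) + S(a,e)\right) + S(a,d) + S_l(d)\right]$$

For T, expanding S_i over its children l and m:

$$\min_{a,b}\left[S_i(a) + S(a,b) + S_j(b)\right] = \min_{a,b}\left[\min_d\left(S_l(d) + S(a,d)\right) + \min_e\left(S_m(e) + S(a,e)\right) + S(a,b) + S_j(b)\right]$$

Both right-hand sides equal

$$\min_{a,b,d,e}\left[S_j(b) + S(a,b) + S_m(e) + S(a,e) + S(a,d) + S_l(d)\right]$$

so T and T' have the same minimal cost.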

Solution

• We proved that moving the root to an adjacent position does not change the minimal cost.

• Why is the case of moving the root to a non-adjacent position easier to prove?
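Root independence can be observed on a small example. The following is a minimal sketch of weighted parsimony by dynamic programming over a rooted binary tree, assuming a transition/transversion cost table (which is a metric) and made-up leaf characters. The nested-tuple tree encoding and the names `subtree_costs` and `min_cost` are illustrative choices, not from the slides.

```python
ALPHABET = "ACGT"
PURINES = {"A", "G"}

def cost(a, b):
    """Hypothetical metric cost: 0 if equal, 1 for a transition, 2 for a transversion."""
    if a == b:
        return 0
    return 1 if (a in PURINES) == (b in PURINES) else 2

def subtree_costs(tree):
    """Map each character c to the minimal cost of `tree` given its root carries c.

    A tree is either a leaf (a one-letter string) or a tuple of child subtrees.
    """
    if isinstance(tree, str):
        return {c: (0 if c == tree else float("inf")) for c in ALPHABET}
    child_tables = [subtree_costs(child) for child in tree]
    return {
        c: sum(min(cost(c, a) + table[a] for a in ALPHABET) for table in child_tables)
        for c in ALPHABET
    }

def min_cost(tree):
    """Minimal weighted parsimony cost over all root characters."""
    return min(subtree_costs(tree).values())

# Subtrees hanging off node i and the root, with illustrative leaf characters.
l, m, j = "A", "G", ("C", "T")

T = ((l, m), j)          # root k above i and j; i has children l and m
T_prime = (l, (j, m))    # root k moved to the edge between i and l
T_other = (m, (l, j))    # root k moved to the edge between i and m

costs = [min_cost(T), min_cost(T_prime), min_cost(T_other)]
print(costs)
```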

Question

• Does every symmetric distance matrix with 0 on the diagonal have a tree?

Answer

• No!

• Example:

      a      b      c      d
a     0      1      1      0.25
b     1      0      1      0.25
c     1      1      0      0.25
d     0.25   0.25   0.25   0

• If d(a,d) = 0.25 and d(b,d) = 0.25, then in a tree it must be that d(a,b) ≤ 0.5, but here d(a,b) = 1.
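The failure can be found mechanically. This is a minimal sketch, assuming the tree has non-negative edge lengths, so that path lengths must satisfy the triangle inequality; the helper name `triangle_violations` is illustrative. It lists every ordered triple (x, y, z) for which the direct distance from x to y exceeds the detour through z.

```python
from itertools import permutations

labels = ["a", "b", "c", "d"]
D = [
    [0,    1,    1,    0.25],
    [1,    0,    1,    0.25],
    [1,    1,    0,    0.25],
    [0.25, 0.25, 0.25, 0],
]

def triangle_violations(D):
    """Return all ordered triples (x, y, z) with D[x][y] > D[x][z] + D[z][y]."""
    n = len(D)
    return [
        (x, y, z)
        for x, y, z in permutations(range(n), 3)
        if D[x][y] > D[x][z] + D[z][y]
    ]

violations = triangle_violations(D)
for x, y, z in violations:
    print(f"d({labels[x]},{labels[y]}) > d({labels[x]},{labels[z]}) + d({labels[z]},{labels[y]})")
```

Passing this check is necessary but not sufficient for a matrix to be realizable by a tree; it only rules out matrices like the one above.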