Phylogeny II: Parsimony, ML, SEMPHY
  • date post

    19-Dec-2015

Page 1: . Phylogeny II : Parsimony, ML, SEMPHY. Phylogenetic Tree u Topology: bifurcating Leaves - 1…N Internal nodes N+1…2N-2 leaf branch internal node.


Phylogeny II : Parsimony, ML, SEMPHY


Phylogenetic Tree

Topology: bifurcating
  • Leaves: 1…N
  • Internal nodes: N+1…2N-2

[Figure: tree diagram labeling a leaf, a branch, and an internal node]


Character Based Methods

We start with a multiple alignment.

Assumptions:
  • All sequences are homologous
  • Each position in the alignment is homologous
  • Positions evolve independently
  • No gaps

We seek to explain the evolution of each position in the alignment


Parsimony

  • Character-based method
  • A way to score trees (but not to build trees!)

Assumptions:
  • Independence of characters (no interactions)
  • The best tree is the one that requires the fewest changes


A Simple Example

What is the parsimony score of the following tree?

[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]

A (Aardvark): CAGGTA
B (Bison):    CAGACA
C (Chimp):    CGGGTA
D (Dog):      TGCACT
E (Elephant): TGCGTA


A Simple Example

Each column is scored separately. Let’s look at the first column:

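The minimal tree has one evolutionary change (T ↔ C).

[Figure: the tree with first-column leaf states A = C, B = C, C = C, D = T, E = T; internal nodes are labeled C or T so that only one branch carries a change]

A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA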


Evaluating Parsimony Scores

How do we compute the Parsimony score for a given tree?

  • Traditional parsimony: each base change has a cost of 1
  • Weighted parsimony: each change is weighted by the score c(a,b)


Traditional Parsimony

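$$\mathrm{Par}(x_1,\dots,x_n; T) = \min_{\text{internal nodes}} \sum_{(u,v)\in E} 1\{x_u \neq x_v\}$$

  • Solved independently for each position
  • Linear time solution

[Figure: example on a tree with leaves a, g, a; the internal nodes carry the candidate sets {a,g} and {a}, and both end up labeled a]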
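The linear-time solution is usually implemented as a bottom-up pass over candidate state sets, commonly known as Fitch's algorithm (a name the slide itself does not use). The sketch below is a minimal version under stated assumptions: trees are nested tuples of leaf names, and the topology `(((A,B),C),(D,E))` is hypothetical, chosen only to illustrate the five-taxon example.

```python
def fitch(tree, states):
    """Return (candidate state set, change count) for one alignment column."""
    if isinstance(tree, str):               # leaf: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch(left, states)
    rset, rcost = fitch(right, states)
    common = lset & rset
    if common:                              # children share a state: no new change
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1   # disagreement costs one change


def parsimony_score(tree, seqs):
    """Sum the per-column change counts over all alignment positions."""
    length = len(next(iter(seqs.values())))
    return sum(fitch(tree, {name: s[i] for name, s in seqs.items()})[1]
               for i in range(length))


seqs = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA",
        "D": "TGCACT", "E": "TGCGTA"}
tree = ((("A", "B"), "C"), ("D", "E"))      # assumed topology, illustration only
first_column = fitch(tree, {name: s[0] for name, s in seqs.items()})[1]
total = parsimony_score(tree, seqs)
```

On this assumed topology the first column needs a single change, in line with the one-change minimal tree described for that column.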


Evaluating Weighted Parsimony

Dynamic programming on the tree

S(i,a) = cost of the subtree rooted at i if i is labeled by a

  • Initialization: for each leaf i, set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞
  • Iteration: if k is a node with children i and j, then
      S(k,a) = min_b (S(i,b) + c(a,b)) + min_b (S(j,b) + c(a,b))
  • Termination: the cost of the tree is min_a S(r,a), where r is the root
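The recurrence above can be sketched directly as a recursive function. This is a minimal illustration, not a reference implementation: the nested-tuple tree encoding, the topology, and the `ts_tv_cost` weighting (transitions cost 1, transversions cost 2) are all assumptions made for the example.

```python
import math

ALPHABET = "ACGT"
PURINES = {"A", "G"}


def sankoff(tree, states, cost):
    """Return {a: S(node, a)} for the subtree `tree` on one alignment column."""
    if isinstance(tree, str):                       # leaf initialization
        return {a: (0 if a == states[tree] else math.inf) for a in ALPHABET}
    left, right = (sankoff(child, states, cost) for child in tree)
    return {a: min(left[b] + cost(a, b) for b in ALPHABET)
               + min(right[b] + cost(a, b) for b in ALPHABET)
            for a in ALPHABET}


def unit_cost(a, b):
    """Traditional parsimony: every change costs 1."""
    return 0 if a == b else 1


def ts_tv_cost(a, b):
    """Hypothetical weighting: transitions cost 1, transversions cost 2."""
    if a == b:
        return 0
    return 1 if (a in PURINES) == (b in PURINES) else 2


tree = ((("A", "B"), "C"), ("D", "E"))              # assumed topology
col1 = {"A": "C", "B": "C", "C": "C", "D": "T", "E": "T"}
col3 = {"A": "G", "B": "G", "C": "G", "D": "C", "E": "C"}
unit_score = min(sankoff(tree, col1, unit_cost).values())
weighted_score = min(sankoff(tree, col3, ts_tv_cost).values())
```

With unit costs the function reproduces the traditional parsimony score, so the two scoring schemes can share one code path.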


Cost of Evaluating Parsimony

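  • The score is evaluated on each position independently. The scores are then summed over all positions.
  • If there are n nodes, m characters, and k possible values for each character, then the complexity is O(nmk) for traditional parsimony. The weighted dynamic program minimizes over k child labels for each of the k parent labels, so it costs O(nmk²).
  • By keeping traceback information, we can reconstruct the most parsimonious values at each ancestor node.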


Maximum Parsimony

1 2 3 4 5 6 7 8 9 10

Species 1 - A G G G T A A C T G

Species 2 - A C G A T T A T T A

Species 3 - A T A A T T G T C T

Species 4 - A A T G T T G T C G

How many possible unrooted trees?


Maximum Parsimony

How many possible unrooted trees?

[Figure: the three unrooted topologies on species 1-4, pairing species 1 with 2, with 3, or with 4]

            1 2 3 4 5 6 7 8 9 10
Species 1 - A G G G T A A C T G
Species 2 - A C G A T T A T T A
Species 3 - A T A A T T G T C T
Species 4 - A A T G T T G T C G


Maximum Parsimony

How many substitutions?

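[Figure: a tree on species 1-4 with leaf states A, A, G, G, shown with two different internal-node labelings. One labeling needs 1 change, the other needs 5 changes. The tree is marked "MP tree".]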


Maximum Parsimony

1 2 3 4 5 6 7 8 9 10

1 - A G G G T A A C T G

2 - A C G A T T A T T A

3 - A T A A T T G T C T

4 - A A T G T T G T C G

Per-column scores so far, one row per candidate tree:

0
0
0

[Figure: the three unrooted topologies on species 1-4]


Maximum Parsimony

1 2 3 4 5 6 7 8 9 10

1 - A G G G T A A C T G

2 - A C G A T T A T T A

3 - A T A A T T G T C T

4 - A A T G T T G T C G

Per-column scores so far, one row per candidate tree:

0 3
0 3
0 3

[Figure: the three unrooted topologies on species 1-4]


Maximum Parsimony

Column 2: 1 - G, 2 - C, 3 - T, 4 - A

[Figure: the three unrooted topologies on species 1-4 with internal-node states filled in for this column; each topology needs 3 changes]


Maximum Parsimony

1 2 3 4 5 6 7 8 9 10

1 - A G G G T A A C T G

2 - A C G A T T A T T A

3 - A T A A T T G T C T

4 - A A T G T T G T C G

Per-column scores so far, one row per candidate tree:

0 3 2
0 3 2
0 3 2

[Figure: the three unrooted topologies on species 1-4]


Maximum Parsimony

1 2 3 4 5 6 7 8 9 10

1 - A G G G T A A C T G

2 - A C G A T T A T T A

3 - A T A A T T G T C T

4 - A A T G T T G T C G

Per-column scores so far, one row per candidate tree:

0 3 2 2
0 3 2 1
0 3 2 2

[Figure: the three unrooted topologies on species 1-4]


Maximum Parsimony

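Column 4: 1 - G, 2 - A, 3 - A, 4 - G

[Figure: the three unrooted topologies on species 1-4 with internal-node states filled in for this column; two topologies need 2 changes, and the one that groups species 1 with 4 and 2 with 3 needs 1 change]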


Maximum Parsimony

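Per-column scores and totals, one row per candidate tree:

Position:        1 2 3 4 5 6 7 8 9 10  Total
((1,2),(3,4)):   0 3 2 2 0 1 1 1 1 3   14
((1,4),(2,3)):   0 3 2 1 0 1 2 1 2 3   15
((1,3),(2,4)):   0 3 2 2 0 1 2 1 2 3   16

Note: as transcribed, column 10 (G, A, T, G) contains only three distinct states, which can be explained with 2 changes on any of the three topologies. The score of 3 corresponds to a column with four distinct states. The ranking of the trees is the same either way.

[Figure: the three unrooted topologies on species 1-4]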


Maximum Parsimony

1 2 3 4 5 6 7 8 9 10

1 - A G G G T A A C T G

2 - A C G A T T A T T A

3 - A T A A T T G T C T

4 - A A T G T T G T C G

Scores: 0 3 2 2 0 1 1 1 1 3 (total 14)

[Figure: the most parsimonious tree, which groups species 1 with 2 and species 3 with 4]


Searching for Trees

#Taxa   #Trees      #Taxa   #Trees
3       1           10      2 x 10^6
4       3           50      3 x 10^74
5       15          100     2 x 10^182
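These counts match the standard formula for the number of unrooted bifurcating topologies on n labeled leaves, (2n-5)!! = 1 · 3 · 5 · … · (2n-5). The formula is not stated on the slide; the short sketch below (function name is our own) simply reproduces the table from it.

```python
def num_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating topologies on n_taxa >= 3 labeled leaves."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


table = {n: num_unrooted_trees(n) for n in (3, 4, 5, 10, 50, 100)}
```

Python integers are arbitrary precision, so the entries for 50 and 100 taxa are computed exactly rather than as floating-point approximations.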


Searching for the Optimal Tree

  • Exhaustive search: very intensive
  • Branch and bound: a compromise
  • Heuristic: fast; usually starts with NJ (neighbor joining)


Phylogenetic Tree Assumptions

Topology: bifurcating
  • Leaves: 1…N
  • Internal nodes: N+1…2N-2

Lengths: t = {t_i}, one for each branch

Phylogenetic tree = (Topology, Lengths) = (T,t)

[Figure: tree diagram labeling a leaf, a branch, and an internal node]


Probabilistic Methods

The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences.

  • Background probabilities: q(a)
  • Mutation probabilities: P(a|b, t)
  • Models for evolutionary mutations:
      • Jukes Cantor
      • Kimura 2-parameter model

Such models are used to derive the probabilities.


Jukes Cantor model

A model for mutation rates

  • Mutation occurs at a constant rate
  • Each nucleotide is equally likely to mutate into any other nucleotide, with rate α


Kimura 2-parameter model

Allows a different rate for transitions and transversions.
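The slide gives no formulas for this model. As a hedged illustration, the sketch below uses the standard closed-form substitution probabilities for the Kimura 2-parameter model, assuming `alpha` is the transition rate and `beta` is the rate of each of the two possible transversions (this parameterization is our assumption, not the slide's).

```python
import math


def k2p_probs(alpha, beta, t):
    """Return (same, transition, each transversion) probabilities after time t."""
    e1 = math.exp(-4 * beta * t)
    e2 = math.exp(-2 * (alpha + beta) * t)
    same = 0.25 + 0.25 * e1 + 0.5 * e2
    transition = 0.25 + 0.25 * e1 - 0.5 * e2
    transversion = 0.25 - 0.25 * e1         # applies to each of two targets
    return same, transition, transversion


same, ts, tv = k2p_probs(alpha=2.0, beta=0.5, t=0.1)
row_sum = same + ts + 2 * tv
```

Setting `alpha == beta` collapses the model to a single rate, in which case transitions and transversions become equally likely.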


Mutation Probabilities

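The rate matrix R is used to derive the mutation probability matrix S. For a short time interval Δt:

$$S(\Delta t) \approx I + R\,\Delta t$$

S is obtained by integration. For Jukes Cantor:

$$P(a \mid a, t) = S(t)_{aa} = \tfrac{1}{4}\left(1 + 3e^{-4\alpha t}\right)$$

$$P(g \mid a, t) = S(t)_{ag} = \tfrac{1}{4}\left(1 - e^{-4\alpha t}\right)$$

q can be obtained by setting t to infinity, which gives q(a) = 1/4 for every nucleotide.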
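A small numerical sketch of the Jukes Cantor probabilities P(a|a,t) = (1 + 3e^{-4αt})/4 and P(g|a,t) = (1 - e^{-4αt})/4 can make the limiting behavior concrete. The function name and the default rate `alpha=1.0` are assumptions for illustration.

```python
import math

NUCS = "ACGT"


def jc_prob(b, a, t, alpha=1.0):
    """P(b | a, t): probability that a becomes b after time t under Jukes Cantor."""
    decay = math.exp(-4 * alpha * t)
    if a == b:
        return 0.25 * (1 + 3 * decay)
    return 0.25 * (1 - decay)


row = [jc_prob(b, "A", 0.3) for b in NUCS]      # one row of S(0.3)
far = jc_prob("G", "A", 50.0)                   # approaches q(G) as t grows
# Composition over an intermediate state: A -> b in 0.1, then b -> G in 0.2
two_step = sum(jc_prob(b, "A", 0.1) * jc_prob("G", b, 0.2) for b in NUCS)
```

Each row sums to one, long branches forget the starting nucleotide, and composing two short branches gives the same probability as one branch of the combined length.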


Mutation Probabilities

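Both models satisfy the following properties:

  • Lack of memory:
    $$P_{a \to c}(t + t') = \sum_b P_{a \to b}(t)\,P_{b \to c}(t')$$
  • Reversibility: there exist stationary probabilities {P_a} such that
    $$P_a\,P_{a \to b}(t) = P_b\,P_{b \to a}(t)$$

[Figure: diagram of the four nucleotides A, G, C, T]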


Probabilistic Approach

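Given P, q, the tree topology and the branch lengths, we can compute:

$$P(x_1,x_2,x_3,x_4,x_5 \mid T,t) = q(x_5)\,p(x_4 \mid x_5,t_4)\,p(x_3 \mid x_5,t_3)\,p(x_1 \mid x_4,t_1)\,p(x_2 \mid x_4,t_2)$$

[Figure: tree with root x5, whose children are the internal node x4 (branch t4) and the leaf x3 (branch t3); x4 has the leaves x1 (branch t1) and x2 (branch t2) as children]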


Computing the Tree Likelihood

We are interested in the probability of the observed data given the tree and the branch "lengths":

$$P(x_1,x_2,x_3 \mid T,t) = \sum_{x_4,x_5} P(x_1,x_2,x_3,x_4,x_5 \mid T,t)$$

  • Computed by summing over the internal nodes
  • This can be done efficiently using an upward traversal pass over the tree


Tree Likelihood Computation

Define P(L_k | a) = probability of the leaves below node k given that x_k = a

  • Init: for leaves, P(L_k | a) = 1 if x_k = a; 0 otherwise
  • Iteration: if k is a node with children i and j, then
    $$P(L_k \mid a) = \sum_{b,c} P(b \mid a,t_i)\,P(L_i \mid b)\,P(c \mid a,t_j)\,P(L_j \mid c)$$
  • Termination: the likelihood is
    $$P(x_1,\dots,x_n \mid T,t) = \sum_a P(L_{root} \mid a)\,q(a)$$
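The upward pass can be sketched as a short recursion. This is an illustrative version only: it assumes Jukes Cantor probabilities with rate 1, a uniform background q(a) = 1/4, and a hypothetical tree encoding in which an internal node is a pair of `(child, branch_length)` tuples. The example tree mirrors the three-leaf layout (x1 and x2 under x4, x4 and x3 under the root) with made-up branch lengths.

```python
import math

NUCS = "ACGT"


def jc(b, a, t, alpha=1.0):
    """P(b | a, t) under Jukes Cantor."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * decay) if a == b else 0.25 * (1 - decay)


def partial(node, leaves):
    """Return {a: P(L_k | a)} for the subtree rooted at `node`."""
    if isinstance(node, str):                       # leaf: indicator of its state
        return {a: 1.0 if leaves[node] == a else 0.0 for a in NUCS}
    (child_i, t_i), (child_j, t_j) = node
    p_i, p_j = partial(child_i, leaves), partial(child_j, leaves)
    # The double sum over (b, c) factorizes into a product of two single sums.
    return {a: sum(jc(b, a, t_i) * p_i[b] for b in NUCS)
               * sum(jc(c, a, t_j) * p_j[c] for c in NUCS)
            for a in NUCS}


def likelihood(root, leaves, q=0.25):
    """Sum over root states of P(L_root | a) * q(a)."""
    return sum(q * p for p in partial(root, leaves).values())


x4 = (("x1", 0.1), ("x2", 0.2))                     # made-up branch lengths
tree = ((x4, 0.3), ("x3", 0.4))
leaves = {"x1": "A", "x2": "A", "x3": "G"}
lik = likelihood(tree, leaves)
```

For a whole alignment, the same function would be called once per column and the logarithms of the results summed, relying on the independence of positions.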


Maximum Likelihood (ML)

Score each tree by

$$P(X_1,\dots,X_n \mid T,t) = \prod_m P(x_1[m],\dots,x_n[m] \mid T,t)$$

  • Assumption of independent positions

Branch lengths t can be optimized:
  • Gradient ascent
  • EM

We look for the highest scoring tree:
  • Exhaustive search
  • Sampling methods (Metropolis)


Optimal Tree Search

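Perform search over possible topologies.

[Figure: candidate topologies T1, T2, T3, T4, …, Tn. Each topology has its own parameter space, explored by parametric optimization (EM), and each parameter space may contain local maxima.]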


Computational Problem

Such procedures are computationally expensive!

  • Computation of optimal parameters, per candidate, requires a non-trivial optimization step.
  • We spend non-negligible computation on a candidate, even if it is a low scoring one.
  • In practice, such learning procedures can only consider small sets of candidate structures.


Structural EM

Idea: use the parameters found for the current topology to help evaluate new topologies.

Outline: perform the search in (T, t) space, using EM-like iterations:
  • E-step: use the current solution to compute expected sufficient statistics for all topologies
  • M-step: select a new topology based on these expected sufficient statistics


The Complete-Data Scenario

Suppose we observe H, the ancestral sequences.

$$l_{complete}(D,H : T,t) = \sum_m \log P(x_{[1 \dots 2N-2]}[m] \mid T,t)$$

$$= \sum_i \sum_m \log p_{x_i[m]} + \sum_{(i,j)\in T} \sum_m \log \frac{p_{x_i[m] \to x_j[m]}(t_{i,j})}{p_{x_j[m]}}$$

$$= const + \sum_{(i,j)\in T} F(S_{i,j}, t_{i,j})$$

S_{i,j} is a matrix of the number of co-occurrences for each pair (a,b) in the taxa i,j. F is a linear function of S_{i,j}.

Define:

$$w_{i,j} = \max_t F(S_{i,j}, t)$$

Find: the topology T that maximizes

$$\sum_{(i,j)\in T} w_{i,j}$$
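To make the co-occurrence matrix and the pairwise score concrete, here is a minimal sketch under the Jukes Cantor model. It takes F(S, t) to be the sum over pairs (a,b) of S(a,b) · (log p_{a→b}(t) − log p_b), which is linear in S; this explicit form, the function names, and the default rate are assumptions made for the example. For Jukes Cantor the maximizing branch length has a closed form in the fraction p of mismatching positions.

```python
import math
from collections import Counter


def cooccurrence(seq_i, seq_j):
    """S_ij(a, b): number of positions where taxon i shows a and taxon j shows b."""
    return Counter(zip(seq_i, seq_j))


def local_score(S, t, alpha=1.0):
    """F(S, t) under Jukes Cantor, with stationary probabilities p_b = 1/4."""
    decay = math.exp(-4 * alpha * t)
    same, diff = 0.25 * (1 + 3 * decay), 0.25 * (1 - decay)
    return sum(n * (math.log(same if a == b else diff) - math.log(0.25))
               for (a, b), n in S.items())


def best_length(S, alpha=1.0):
    """Branch length maximizing F(S, t); valid when 0 < p < 3/4."""
    total = sum(S.values())
    p = sum(n for (a, b), n in S.items() if a != b) / total
    return -math.log(1 - 4 * p / 3) / (4 * alpha)


S12 = cooccurrence("AGGGTAACTG", "ACGATTATTA")      # species 1 and 2
t12 = best_length(S12)
w12 = local_score(S12, t12)                         # the pairwise weight w_{1,2}
```

Because the score only depends on S through sums, the same functions also accept fractional (expected) counts in place of integer counts.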


Expected Likelihood

Start with a tree (T0,t0).

Compute:

$$E[S_{i,j}(a,b)] = \sum_m P(X_i[m] = a, X_j[m] = b \mid x_{[1 \dots N]}[m], T_0, t_0)$$

Formal justification. Define:

$$Q(T,t) = E[\,l_{complete}(D,H : T,t) \mid D, T_0, t_0\,] = \sum_{(i,j)\in T} F(E[S_{i,j}], t_{i,j}) + const$$

Theorem:

$$l(D : T,t) - l(D : T_0,t_0) \ge Q(T,t) - Q(T_0,t_0)$$

Consequence: improvement in expected score ⇒ improvement in likelihood


Algorithm Outline

Original Tree (T0,t0)

Unlike standard EM for trees, we compute all possible pairwise statistics.

Time: O(N²M)

  • Compute: $E[S_{i,j}(a,b) \mid D, T_0, t_0]$
  • Weights: $w_{i,j} = \max_t F(E[S_{i,j}], t)$


Pairwise weights

This stage also computes the branch length for each pair (i,j).

Algorithm Outline

  • Compute: $E[S_{i,j}(a,b) \mid D, T_0, t_0]$
  • Weights: $w_{i,j} = \max_t F(E[S_{i,j}], t)$
  • Find: $T' = \arg\max_T \sum_{(i,j)\in T} w_{i,j}$


Max. Spanning Tree

Fast greedy procedure to find the tree.

By construction: Q(T',t') ≥ Q(T0,t0)

Thus, l(T',t') ≥ l(T0,t0)

Algorithm Outline

  • Compute: $E[S_{i,j}(a,b) \mid D, T_0, t_0]$
  • Weights: $w_{i,j} = \max_t F(E[S_{i,j}], t)$
  • Find: $T' = \arg\max_T \sum_{(i,j)\in T} w_{i,j}$
  • Construct bifurcation T1
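One standard greedy procedure for a maximum spanning tree is Kruskal's algorithm with a union-find structure. The slide does not say which greedy method is used, so the sketch below is only one plausible choice, and the weights are made-up numbers for four nodes.

```python
def max_spanning_tree(n_nodes, weights):
    """Kruskal's algorithm; `weights` maps (i, j) to w_ij. Returns chosen edges."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    edges = []
    # Visit edges from heaviest to lightest.
    for (i, j), _w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:                # edge joins two components: keep it
            parent[root_i] = root_j
            edges.append((i, j))
    return edges


weights = {(0, 1): -2.0, (0, 2): -5.0, (0, 3): -6.0,
           (1, 2): -4.0, (1, 3): -7.0, (2, 3): -1.0}   # hypothetical w_ij values
tree_edges = max_spanning_tree(4, weights)
total_weight = sum(weights[e] for e in tree_edges)
```

The result is a spanning tree over all nodes with no degree constraint, which is why a separate step is needed afterwards to turn it into a bifurcating topology.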


Fix Tree

  • Remove redundant nodes
  • Add nodes to break large degree

This operation preserves likelihood: l(T1,t') = l(T',t') ≥ l(T0,t0)

Algorithm Outline

  • Compute: $E[S_{i,j}(a,b) \mid D, T_0, t_0]$
  • Weights: $w_{i,j} = \max_t F(E[S_{i,j}], t)$
  • Find: $T' = \arg\max_T \sum_{(i,j)\in T} w_{i,j}$
  • Construct bifurcation T1


Assessing trees: the Bootstrap

Often we don’t trust the tree found as the “correct” one.

Bootstrapping:
  • Sample (with replacement) n positions from the alignment
  • Learn the best tree for each sample
  • Look for tree features which are frequent in all trees

For some models this procedure approximates the tree posterior P(T | X1,…,Xn).
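The resampling loop can be sketched as follows. The tree-building step is deliberately left abstract: `build_tree` is a hypothetical callable, supplied by the caller, that takes an alignment and returns an iterable of hashable tree features (for example, splits). The function names and the default replicate count are assumptions for illustration.

```python
import random
from collections import Counter


def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(seqs.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    # The same column indices are used for every sequence, so columns stay intact.
    return {name: "".join(s[c] for c in cols) for name, s in seqs.items()}


def bootstrap_support(seqs, build_tree, n_replicates=100, seed=0):
    """Fraction of replicates in which each feature returned by build_tree appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_replicates):
        counts.update(set(build_tree(bootstrap_alignment(seqs, rng))))
    return {feature: c / n_replicates for feature, c in counts.items()}


seqs = {"1": "AGGGTAACTG", "2": "ACGATTATTA",
        "3": "ATAATTGTCT", "4": "AATGTTGTCG"}
replicate = bootstrap_alignment(seqs, random.Random(1))
```

Features with support close to 1 are those that survive most resamplings of the columns, which is the sense in which they are "frequent in all trees".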


New Tree

Thm: l(T1,t1) ≥ l(T0,t0)

Algorithm Outline

  • Compute: $E[S_{i,j}(a,b) \mid D, T_0, t_0]$
  • Weights: $w_{i,j} = \max_t F(E[S_{i,j}], t)$
  • Find: $T' = \arg\max_T \sum_{(i,j)\in T} w_{i,j}$
  • Construct bifurcation T1

These steps are then repeated until convergence.