Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change)

evolution.gs.washington.edu/gs541/2005/lecture27.pdf

Joe Felsenstein

Department of Genome Sciences and Department of Biology


The Jukes-Cantor model (1969)

[Figure: the four bases A, G, C, and T, with each of the six connections between them labeled with the rate u/3]

the simplest symmetrical model of DNA evolution


Transition probabilities under the Jukes-Cantor model

All sites change independently.

All sites have the same stochastic process working at them.

Make up a fictional kind of event, such that when it happens the site changes to one of the 4 bases chosen at random (equiprobably).

Assertion: Having these events occur at rate $\frac{4}{3}u$ is the same as having the Jukes-Cantor model events occur at rate $u$. (A fictional event leaves the base unchanged $\frac{1}{4}$ of the time, so actual changes occur at rate $\frac{3}{4} \cdot \frac{4}{3}u = u$.)

The probability that none of these fictional events happens in time $t$ is $\exp\left(-\frac{4}{3}ut\right)$.

No matter how many of these fictional events occur, provided it is not zero, the chance of ending up at a particular base is $\frac{1}{4}$.
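The fictional-event argument can be checked by simulation. The following is a minimal sketch, not part of the lecture: the function name `simulate_same_base`, the seed, the number of simulated sites, and the values of u and t are all illustrative assumptions. It compares the simulated chance of still seeing A after time t with the value the bullet points imply, namely the chance of no event plus one quarter of the chance of at least one event.

```python
import math
import random

BASES = "ACGT"

def simulate_same_base(u, t, n_sites=100_000, seed=1):
    """Estimate Prob(A|A, t) by simulating the fictional events directly."""
    rng = random.Random(seed)
    rate = 4.0 * u / 3.0  # rate of the fictional events
    same = 0
    for _ in range(n_sites):
        base = "A"
        # The waiting time to the first fictional event is exponential.
        # If it exceeds t, no event happened and the base is untouched.
        if rng.expovariate(rate) <= t:
            # One or more events: the final base is uniform over the 4 bases,
            # however many events there were.
            base = rng.choice(BASES)
        same += (base == "A")
    return same / n_sites

u, t = 1.0, 0.5  # illustrative values
estimate = simulate_same_base(u, t)
no_event = math.exp(-4.0 * u * t / 3.0)
exact = no_event + 0.25 * (1.0 - no_event)
print(f"simulated {estimate:.4f}, formula {exact:.4f}")
```

Only the time of the first event needs to be drawn, because once any event has happened the outcome no longer depends on how many followed.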


Jukes-Cantor transition probabilities, cont’d

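Putting all this together, the probability of changing to C, given the site is currently at A, in time $t$ is

$$\mathrm{Prob}(C \mid A, t) = \frac{1}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$$

while

$$\mathrm{Prob}(A \mid A, t) = e^{-\frac{4}{3}ut} + \frac{1}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$$

or

$$\mathrm{Prob}(A \mid A, t) = \frac{1}{4}\left(1 + 3e^{-\frac{4}{3}ut}\right)$$

so that the total probability of change is

$$\mathrm{Prob}(\mathrm{change} \mid t) = \frac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$$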
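These formulas translate directly into code. The sketch below is an illustration rather than anything from the lecture; the names `jc_prob` and `jc_matrix` and the values u = 1, t = 0.5 are assumptions made for the example.

```python
import math

BASES = "AGCT"

def jc_prob(i, j, u, t):
    """Prob(j | i, t) under Jukes-Cantor with total substitution rate u."""
    decay = math.exp(-4.0 * u * t / 3.0)
    if i == j:
        return 0.25 * (1.0 + 3.0 * decay)
    return 0.25 * (1.0 - decay)

def jc_matrix(u, t):
    """All 16 transition probabilities as a nested dict P[from][to]."""
    return {i: {j: jc_prob(i, j, u, t) for j in BASES} for i in BASES}

P = jc_matrix(u=1.0, t=0.5)
row_sum = sum(P["A"].values())   # each row is a probability distribution
prob_change = 1.0 - P["A"]["A"]  # equals 3/4 * (1 - exp(-4ut/3))
print(row_sum, prob_change)
```

The probability of change is three times the probability of reaching any one particular other base, which is why the factor 1/4 becomes 3/4.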


Fraction of sites different, Jukes-Cantor

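[Figure: differences per site (vertical axis, 0 to 1) against branch length (horizontal axis); the curve levels off at 0.75, and a marked point shows 0.49 differences per site at branch length 0.7945]

after branches of different length, under the Jukes-Cantor model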
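The curve in this figure, and the marked point on it, can be reproduced numerically. This is a small sketch under the assumption that branch length means the product ut; the function names are invented for the example.

```python
import math

def jc_fraction_different(branch_length):
    """Expected fraction of differing sites; branch_length is u * t."""
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))

def jc_branch_length(p):
    """Invert the curve: branch length implied by a fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75 under Jukes-Cantor")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

d = jc_branch_length(0.49)
print(round(d, 4))  # 0.7945, the branch length marked in the figure
```

The inverse is undefined at or above 0.75 because the curve never reaches its asymptote; observed fractions that high carry no information about branch length under this model.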


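Kimura’s (1980) K2P model of DNA change

[Figure: the four bases A, G, C, and T; the two transitions (between A and G, and between C and T) are labeled with rate a, and the four transversions are labeled with rate b]

which allows for different rates of transitions and transversions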


Motoo Kimura

Motoo Kimura, with family in Mishima, Japan in the 1960’s


Transition probabilities for the K2P model

with two kinds of events:

I. At rate α, if the site has a purine (A or G), choose one of the two purines at random and change to it. If the site has a pyrimidine (C or T), choose one of the pyrimidines at random and change to it.

II. At rate β, choose one of the 4 bases at random and change to it.

By proper choice of α and β one can achieve the overall rate of change and Ts/Tn ratio R you want. For rate of change 1, the transition probabilities (warning: terminological tangle) are

$$\mathrm{Prob}(\mathrm{transition} \mid t) = \frac{1}{4} - \frac{1}{2}\exp\left(-\frac{2R+1}{R+1}\,t\right) + \frac{1}{4}\exp\left(-\frac{2}{R+1}\,t\right)$$

$$\mathrm{Prob}(\mathrm{transversion} \mid t) = \frac{1}{2} - \frac{1}{2}\exp\left(-\frac{2}{R+1}\,t\right)$$

(the transversion probability is the sum of the probabilities of both kinds of transversions).
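A short sketch of these two formulas follows. It reads the exponent of the first exponential as (2R+1)/(R+1), which is the value implied by events I and II when the overall rate of change is 1. The function names are assumptions, and the expressions for α and β in `k2p_event_rates` are derived here from the event definitions rather than quoted from the slides.

```python
import math

def k2p_probs(t, R):
    """Return (Prob(transition|t), Prob(transversion|t)) for overall rate 1."""
    e1 = math.exp(-(2.0 * R + 1.0) / (R + 1.0) * t)
    e2 = math.exp(-2.0 / (R + 1.0) * t)
    p_ts = 0.25 - 0.5 * e1 + 0.25 * e2
    p_tv = 0.5 - 0.5 * e2
    return p_ts, p_tv

def k2p_event_rates(R):
    """Rates (alpha, beta) of event kinds I and II that give overall rate 1.

    Event II changes to a transversion with probability 1/2, so the total
    transversion rate is beta / 2. Transitions arise from event I with
    probability 1/2 and from event II with probability 1/4. A non-negative
    alpha requires R >= 1/2.
    """
    beta = 2.0 / (R + 1.0)
    alpha = (2.0 * R - 1.0) / (R + 1.0)
    return alpha, beta

R = 2.0  # illustrative Ts/Tn ratio
for t in (0.1, 0.5, 1.0, 3.0):
    p_ts, p_tv = k2p_probs(t, R)
    print(f"t={t:4.1f}  transitions={p_ts:.4f}  transversions={p_tv:.4f}")
```

For very short branches the two probabilities grow at rates R/(R+1) and 1/(R+1), so their ratio starts at R; for long branches they approach 1/4 and 1/2 whatever R is.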


Transitions, transversions expected

[Figure: expected differences per site (vertical axis, 0.00 to 0.60) against time (branch length; horizontal axis, 0.0 to 3.0), with curves for total differences, transitions, and transversions, for R = 10]

in different amounts of branch length under the K2P model, for Ts/Tn = 10


Transitions, transversions expected

[Figure: expected differences per site (vertical axis, 0.00 to 0.70) against time (branch length; horizontal axis, 0.0 to 3.0), with curves for total differences, transitions, and transversions, for R = 2]

in different amounts of branch length under the K2P model, for Ts/Tn = 2


Other commonly used models include:

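Two models that specify the equilibrium base frequencies (you provide the frequencies $\pi_A$, $\pi_C$, $\pi_G$, $\pi_T$ and they are set up to have an equilibrium which achieves them), and also let you control the transition/transversion ratio:

The Hasegawa-Kishino-Yano (1985) model:

$$
\begin{array}{c|cccc}
\text{from} \backslash \text{to} & A & G & C & T \\ \hline
A & - & \alpha\pi_G + \beta\pi_G & \alpha\pi_C & \alpha\pi_T \\
G & \alpha\pi_A + \beta\pi_A & - & \alpha\pi_C & \alpha\pi_T \\
C & \alpha\pi_A & \alpha\pi_G & - & \alpha\pi_T + \beta\pi_T \\
T & \alpha\pi_A & \alpha\pi_G & \alpha\pi_C + \beta\pi_C & -
\end{array}
$$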
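To see that this table really has the supplied frequencies as its equilibrium, one can build the rate matrix and check that the net flow into each base is zero. The sketch below is illustrative only: the function names, the base frequencies, and the values of α and β are assumptions, and the diagonal entries (shown as dashes in the table) are filled in so that each row sums to zero, as is conventional for a rate matrix.

```python
BASES = "AGCT"
PURINES = {"A", "G"}

def is_transition(i, j):
    """True when i and j are both purines or both pyrimidines."""
    return (i in PURINES) == (j in PURINES)

def hky_rate_matrix(pi, alpha, beta):
    """Rate matrix Q[i][j] (from i to j) following the HKY table."""
    Q = {i: {} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                rate = alpha * pi[j]
                if is_transition(i, j):
                    rate += beta * pi[j]
                Q[i][j] = rate
        # The diagonal makes each row sum to zero.
        Q[i][i] = -sum(Q[i][j] for j in BASES if j != i)
    return Q

pi = {"A": 0.3, "G": 0.2, "C": 0.2, "T": 0.3}  # illustrative frequencies
Q = hky_rate_matrix(pi, alpha=1.0, beta=2.0)

# At equilibrium the net flow into each base vanishes: sum_i pi_i Q_ij = 0.
flow = {j: sum(pi[i] * Q[i][j] for i in BASES) for j in BASES}
print(flow)
```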


My F84 model

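$$
\begin{array}{c|cccc}
\text{from} \backslash \text{to} & A & G & C & T \\ \hline
A & - & \alpha\pi_G + \beta\dfrac{\pi_G}{\pi_R} & \alpha\pi_C & \alpha\pi_T \\
G & \alpha\pi_A + \beta\dfrac{\pi_A}{\pi_R} & - & \alpha\pi_C & \alpha\pi_T \\
C & \alpha\pi_A & \alpha\pi_G & - & \alpha\pi_T + \beta\dfrac{\pi_T}{\pi_Y} \\
T & \alpha\pi_A & \alpha\pi_G & \alpha\pi_C + \beta\dfrac{\pi_C}{\pi_Y} & -
\end{array}
$$

where $\pi_R = \pi_A + \pi_G$ and $\pi_Y = \pi_C + \pi_T$ (the equilibrium frequencies of purines and pyrimidines).

Both of these models have formulas for the transition probabilities, and both are subcases of a slightly more general class of models, the Tamura-Nei model (1993).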
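The F84 table differs from HKY only in how the extra transition rate is scaled: by the frequency of the target base within its own class (purines or pyrimidines). A minimal sketch, with invented function names and illustrative parameter values, and with the diagonal again chosen so that rows sum to zero:

```python
BASES = "AGCT"
PURINES = {"A", "G"}

def f84_rate_matrix(pi, alpha, beta):
    """Rate matrix Q[i][j] (from i to j) following the F84 table."""
    pi_R = pi["A"] + pi["G"]  # equilibrium frequency of purines
    pi_Y = pi["C"] + pi["T"]  # equilibrium frequency of pyrimidines
    Q = {i: {} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i == j:
                continue
            rate = alpha * pi[j]
            if (i in PURINES) == (j in PURINES):  # a transition
                group = pi_R if j in PURINES else pi_Y
                rate += beta * pi[j] / group
            Q[i][j] = rate
        # Only off-diagonal entries are present so far; rows must sum to zero.
        Q[i][i] = -sum(Q[i].values())
    return Q

pi = {"A": 0.3, "G": 0.2, "C": 0.2, "T": 0.3}  # illustrative frequencies
Q = f84_rate_matrix(pi, alpha=1.0, beta=2.0)
flow = {j: sum(pi[i] * Q[i][j] for i in BASES) for j in BASES}
print(Q["A"]["G"], Q["C"]["T"])
```

As with HKY, the equilibrium check `flow` comes out as zero for every base, so the supplied frequencies are the stationary distribution.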


Reversibility

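[Figure: two states $i$ and $j$ with equilibrium frequencies $\pi_i$ and $\pi_j$, joined by the transition probabilities $P_{ij}$ and $P_{ji}$; reversibility is the condition $\pi_i P_{ij} = \pi_j P_{ji}$]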


The General Time-Reversible model (GTR)

It maintains “detailed balance” so that the probability of starting at (say) A and ending at (say) T in evolution is the same as the probability of starting at T and ending at A:

$$
\begin{array}{c|cccc}
\text{from} \backslash \text{to} & A & G & C & T \\ \hline
A & - & \alpha\pi_G & \beta\pi_C & \gamma\pi_T \\
G & \alpha\pi_A & - & \delta\pi_C & \varepsilon\pi_T \\
C & \beta\pi_A & \delta\pi_G & - & \upsilon\pi_T \\
T & \gamma\pi_A & \varepsilon\pi_G & \upsilon\pi_C & -
\end{array}
$$

And there is of course the general 12-parameter model which has arbitrary rates for each of the 12 possible changes (from each of the 4 nucleotides to each of the 3 others). (Neither of these has formulas for the transition probabilities, but those can be done numerically.)
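One way to do this numerically is to exponentiate the rate matrix. The sketch below is one possible approach, not the method used in any particular program: it computes exp(Qt) with a Taylor series on a scaled-down matrix followed by repeated squaring, in plain Python. The function names, the parameter names a to f (standing for the six Greek letters of the table), and all numerical values are assumptions for illustration.

```python
import math

def gtr_rate_matrix(pi, a, b, c, d, e, f):
    """GTR rate matrix in base order A, G, C, T; pi is [pi_A, pi_G, pi_C, pi_T]."""
    sym = [[0, a, b, c],
           [a, 0, d, e],
           [b, d, 0, f],
           [c, e, f, 0]]
    Q = [[sym[i][j] * pi[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        Q[i][i] = -sum(Q[i][j] for j in range(4) if j != i)
    return Q

def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_probs(Q, t, squarings=10, terms=12):
    """exp(Q t): Taylor series for Q t / 2**squarings, then repeated squaring."""
    n = len(Q)
    scale = t / (2 ** squarings)
    A = [[Q[i][j] * scale for j in range(n)] for i in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in P]
    for k in range(1, terms + 1):
        term = mat_mul(term, A)                      # now A**k / (k-1)!
        term = [[x / k for x in row] for row in term]  # now A**k / k!
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = mat_mul(P, P)
    return P

pi = [0.3, 0.2, 0.2, 0.3]  # illustrative frequencies for A, G, C, T
Q = gtr_rate_matrix(pi, 1.0, 0.5, 0.4, 0.6, 0.3, 2.0)
P = transition_probs(Q, t=0.5)

# Detailed balance: pi_i * P_ij equals pi_j * P_ji for every pair.
balance_gap = max(abs(pi[i] * P[i][j] - pi[j] * P[j][i])
                  for i in range(4) for j in range(4))

# Sanity check: equal frequencies and all six parameters equal to 4/3
# give off-diagonal rates of 1/3, i.e. Jukes-Cantor with u = 1.
P_jc = transition_probs(gtr_rate_matrix([0.25] * 4, *([4.0 / 3.0] * 6)), t=0.5)
jc_exact = 0.25 * (1.0 + 3.0 * math.exp(-4.0 * 0.5 / 3.0))
print(balance_gap, P_jc[0][0], jc_exact)
```

Production code would more likely use an eigendecomposition of Q, which lets the same decomposition be reused for every branch length.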


Relation between models

There are many other models, but these are the most widely used ones. Here is a general scheme of which models are subcases of which other ones:

[Figure: hierarchy of models, with the number of parameters in parentheses. From most general to most specific: General 12-parameter model (12); General time-reversible model (9); Tamura-Nei (6); HKY (5) and F84 (5); Kimura K2P (2); Jukes-Cantor (1)]

Rate variation among sites

In reality, rates of evolution are not constant among sites.

Fortunately, in the transition probability formulas, rates come in as simple multiples of times.

Thus if we know the rates at two sites, we can compute the probabilities of change by simply, for each site, multiplying all branch lengths by the appropriate rate.

If we don’t know the rates, we can imagine averaging them over a distribution of rates. Usually the Gamma distribution is used.

In practice a discrete histogram of rates approximates the integration. (For the Gamma it seems best to use Generalized Laguerre Quadrature to pick the rates and frequencies in the histogram.)

Also, there are actually autocorrelations, with neighboring sites having similar rates of change.

This can be handled by Hidden Markov Models, which we cover later.
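The averaging step can be sketched as follows. Note the assumption: this example uses a plain midpoint-rule histogram of the Gamma density, which is simpler but less efficient than the Generalized Laguerre Quadrature recommended above. The function names, the shape parameter, the bin count, and the rate cutoff are illustrative choices. The Jukes-Cantor probability of change is used because its average over a Gamma distribution with mean 1 and shape a has a known closed form, (3/4)(1 - (1 + 4t/(3a))^(-a)), which gives something to compare against.

```python
import math

def gamma_rate_histogram(shape, n_bins=400, max_rate=20.0):
    """Midpoint-rule histogram of a Gamma distribution of rates with mean 1."""
    width = max_rate / n_bins
    rates = [(k + 0.5) * width for k in range(n_bins)]
    # Unnormalized Gamma density with the given shape and mean 1.
    density = [r ** (shape - 1.0) * math.exp(-shape * r) for r in rates]
    total = sum(density)
    weights = [x / total for x in density]
    return rates, weights

def jc_change_prob(branch_length):
    """Jukes-Cantor probability that a site differs after this branch length."""
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))

def averaged_change_prob(branch_length, shape):
    rates, weights = gamma_rate_histogram(shape)
    # A site evolving at rate r sees every branch length multiplied by r.
    return sum(w * jc_change_prob(r * branch_length)
               for r, w in zip(rates, weights))

shape, t = 2.0, 0.5  # illustrative values
approx = averaged_change_prob(t, shape)
exact = 0.75 * (1.0 - (1.0 + 4.0 * t / (3.0 * shape)) ** (-shape))
print(f"histogram {approx:.5f}, closed form {exact:.5f}")
```

A midpoint histogram needs hundreds of bins to be accurate and behaves poorly when the shape parameter is below 1, where the density is unbounded at zero; a quadrature rule matched to the Gamma weight achieves similar accuracy with only a few rate categories.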


A pioneer of protein evolution

Margaret Dayhoff, about 1966

Models of amino acid change in proteins

There are a variety of models put forward since the mid-1960’s:

Amino acid transition matrices

  Dayhoff (1968) model. Tabulation of empirical changes in closely related pairs of proteins, normalized. The PAM100 matrix, for example, is the expected transition matrix given 1 substitution per position.

  Jones, Taylor, and Thornton (1992) recalculated PAM matrices (the JTT matrix) from a much larger set of data.

  Jones, Taylor, and Thornton (1994a, 1994b) have tabulated a separate mutation data matrix for transmembrane proteins.

  Koshi and Goldstein (1995) have described the tabulation of further context-dependent mutation data matrices.

  Henikoff and Henikoff (1992) have tabulated the BLOSUM matrix for conserved motifs in gene families.

Goldman and Yang (1994) and Muse and Gaut (1994) pioneered codon-based models (see next screen).
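In the Dayhoff scheme, the matrix for a larger distance is obtained by multiplying the one-step matrix (PAM1) by itself the required number of times. The sketch below illustrates only that mechanism. The 3-letter alphabet and every number in `toy_pam1` are invented for the example and are not Dayhoff's data; a real PAM matrix is 20 by 20.

```python
def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(M, n):
    """M multiplied by itself n times (n >= 0), by repeated squaring."""
    size = len(M)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    base = [row[:] for row in M]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result

# Hypothetical one-step transition matrix for a made-up 3-letter alphabet.
# Each row gives the probabilities of ending at each letter and sums to 1.
toy_pam1 = [[0.990, 0.006, 0.004],
            [0.006, 0.990, 0.004],
            [0.004, 0.004, 0.992]]

toy_pam100 = mat_power(toy_pam1, 100)
for row in toy_pam100:
    print(["%.3f" % x for x in row])
```

After many steps the rows drift toward a common equilibrium distribution, which is why closely and distantly related proteins call for different matrices.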


PAM-like amino acid models

[Figure: a square table whose rows and columns are both labeled with the 20 amino acids, A C D E F G H I K L M N P Q R S T V W Y; the entries are not recoverable]


A PAM2 model tabulation


Approaches to protein sequence models

Making a model for protein sequence evolution (a not-very-practical approach):

Use a good model of DNA evolution.

Use the appropriate genetic code.

When an amino acid changes, accept this with a probability that declines as the amino acids become more different.

Fit this to empirical information on protein evolution.

Take into account variation of rate from site to site.

Take into account correlation of rate variation in neighboring sites.

How about protein structure (Secondary structure? 3D structure?)

This “codon model” is widely used to detect evidence of positive selection, by detecting loci or lineage-site combinations that show evidence of too high a rate of substitution.

It is the best way to take DNA and protein sequences into account simultaneously when inferring phylogenies.

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) – p.21/26

Page 34: Lecture 27. Phylogeny methods, part 4 (Models of DNA and ...evolution.gs.washington.edu/gs541/2005/lecture27.pdfPhylogeny methods, part 4 (Models of DNA and protein change) Πp.5/26

Approaches to protein sequence models

Making a model for protein sequence evolution (a not-very-practicalapproach):

Use a good model of DNA evolution.

Use the appropriate genetic code.

When an amino acid changes, accept this with a probability thatdeclines as the amino acids become more different.

Fit this to empirical information on protein evolution.

Take into account variation of rate from site to site.

Take into account correlation of rate variation in neighboring sites.

How about protein structure (Secondary structure? 3D structure?)

This “codon model" is widely used to detect evidence of positive

selection, by detecting loci or lineage-site combinations that show

evidence of too high a rate of substitution.

It is the best way to take DNA and protein sequences into accountsimultaneously when inferring phylogenies.

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) – p.21/26

Page 35: Lecture 27. Phylogeny methods, part 4 (Models of DNA and ...evolution.gs.washington.edu/gs541/2005/lecture27.pdfPhylogeny methods, part 4 (Models of DNA and protein change) Πp.5/26

Approaches to protein sequence models

Making a model for protein sequence evolution (a not-very-practicalapproach):

Use a good model of DNA evolution.

Use the appropriate genetic code.

When an amino acid changes, accept this with a probability thatdeclines as the amino acids become more different.

Fit this to empirical information on protein evolution.

Take into account variation of rate from site to site.

Take into account correlation of rate variation in neighboring sites.

How about protein structure (Secondary structure? 3D structure?)

This “codon model" is widely used to detect evidence of positive

selection, by detecting loci or lineage-site combinations that show

evidence of too high a rate of substitution.

It is the best way to take DNA and protein sequences into accountsimultaneously when inferring phylogenies.

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) – p.21/26

Page 36: Lecture 27. Phylogeny methods, part 4 (Models of DNA and ...evolution.gs.washington.edu/gs541/2005/lecture27.pdfPhylogeny methods, part 4 (Models of DNA and protein change) Πp.5/26

Approaches to protein sequence models

Making a model for protein sequence evolution (a not-very-practicalapproach):

Use a good model of DNA evolution.

Use the appropriate genetic code.

When an amino acid changes, accept this with a probability thatdeclines as the amino acids become more different.

Fit this to empirical information on protein evolution.

Take into account variation of rate from site to site.

Take into account correlation of rate variation in neighboring sites.

How about protein structure (Secondary structure? 3D structure?)

This “codon model" is widely used to detect evidence of positive

selection, by detecting loci or lineage-site combinations that show

evidence of too high a rate of substitution.

It is the best way to take DNA and protein sequences into accountsimultaneously when inferring phylogenies.

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) – p.21/26

Page 37: Lecture 27. Phylogeny methods, part 4 (Models of DNA and ...evolution.gs.washington.edu/gs541/2005/lecture27.pdfPhylogeny methods, part 4 (Models of DNA and protein change) Πp.5/26

Approaches to protein sequence models

Making a model for protein sequence evolution (a not-very-practicalapproach):

Use a good model of DNA evolution.

Use the appropriate genetic code.

When an amino acid changes, accept this with a probability thatdeclines as the amino acids become more different.

Fit this to empirical information on protein evolution.

Take into account variation of rate from site to site.

Take into account correlation of rate variation in neighboring sites.

How about protein structure (Secondary structure? 3D structure?)

This “codon model" is widely used to detect evidence of positive

selection, by detecting loci or lineage-site combinations that show

evidence of too high a rate of substitution.

It is the best way to take DNA and protein sequences into accountsimultaneously when inferring phylogenies.

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) – p.21/26

Page 38: Lecture 27. Phylogeny methods, part 4 (Models of DNA and ...evolution.gs.washington.edu/gs541/2005/lecture27.pdfPhylogeny methods, part 4 (Models of DNA and protein change) Πp.5/26

Approaches to protein sequence models

Making a model for protein sequence evolution (a not-very-practicalapproach):

Use a good model of DNA evolution.

Use the appropriate genetic code.

When an amino acid changes, accept this with a probability thatdeclines as the amino acids become more different.

Fit this to empirical information on protein evolution.

Take into account variation of rate from site to site.

Take into account correlation of rate variation in neighboring sites.

How about protein structure (Secondary structure? 3D structure?)

This “codon model" is widely used to detect evidence of positive

selection, by detecting loci or lineage-site combinations that show

evidence of too high a rate of substitution.

It is the best way to take DNA and protein sequences into accountsimultaneously when inferring phylogenies.

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) – p.21/26

Page 39: Lecture 27. Phylogeny methods, part 4 (Models of DNA and ...evolution.gs.washington.edu/gs541/2005/lecture27.pdfPhylogeny methods, part 4 (Models of DNA and protein change) Πp.5/26

Approaches to protein sequence models

Making a model for protein sequence evolution (a not-very-practicalapproach):

Use a good model of DNA evolution.

Use the appropriate genetic code.

When an amino acid changes, accept this with a probability thatdeclines as the amino acids become more different.

Fit this to empirical information on protein evolution.

Take into account variation of rate from site to site.

Take into account correlation of rate variation in neighboring sites.

How about protein structure (Secondary structure? 3D structure?)

This “codon model” is widely used to detect evidence of positive selection, by detecting loci or lineage-site combinations that show evidence of too high a rate of substitution.

It is the best way to take DNA and protein sequences into account simultaneously when inferring phylogenies.
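
As a rough illustration of the recipe above, the following Python sketch computes a relative rate of change between two codons. It assumes the standard genetic code, a transition/transversion ratio `kappa`, and, as a deliberate simplification, a single acceptance factor `omega` applied to every amino acid change, instead of a probability that declines as the amino acids become more different. The names `codon_rate`, `kappa`, and `omega` and their default values are illustrative and do not come from the lecture.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
SENSE_CODONS = [codon for codon, aa in GENETIC_CODE.items() if aa != "*"]
# Transitions are changes within the pyrimidines (U, C) or within the purines (A, G).
TRANSITIONS = {frozenset("UC"), frozenset("AG")}


def codon_rate(source, target, kappa=2.0, omega=0.2):
    """Relative instantaneous rate of change from one codon to another.

    Assumed rules (a simplification, not the lecture's exact model):
    changes to or from stop codons and changes at more than one position
    have rate 0; transitions are multiplied by kappa; amino acid changes
    are multiplied by the acceptance factor omega.
    """
    if GENETIC_CODE[source] == "*" or GENETIC_CODE[target] == "*":
        return 0.0
    diffs = [(a, b) for a, b in zip(source, target) if a != b]
    if len(diffs) != 1:
        return 0.0
    rate = 1.0
    if frozenset(diffs[0]) in TRANSITIONS:
        rate *= kappa
    if GENETIC_CODE[source] != GENETIC_CODE[target]:
        rate *= omega
    return rate
```

In this simplified form, setting `omega` above 1 makes amino acid changes more frequent than silent ones, which is the kind of elevated substitution rate that tests for positive selection look for.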

The codon model

[Figure: part of the genetic code table, showing codons and the amino acids they encode]

UUU phe   CUU leu   AUU ile   GUU val
UUC phe   CUC leu   AUC ile   GUC val
UUA leu   CUA leu   AUA ile   GUA val
UUG leu   CUG leu   AUG met   GUG val

UCA ser   UAA stop   UGA stop
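
The codons listed above can be explored with a short Python sketch that enumerates the nine codons reachable from a given codon by a single nucleotide change and labels each change as synonymous, nonsynonymous, or leading to a stop codon. The standard genetic code is assumed, and the function name `one_step_neighbors` is hypothetical, chosen only for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def one_step_neighbors(codon):
    """Return (neighbor, kind) pairs for all single-nucleotide changes of codon."""
    result = []
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue  # not a change
            neighbor = codon[:pos] + base + codon[pos + 1:]
            if GENETIC_CODE[neighbor] == "*":
                kind = "stop"
            elif GENETIC_CODE[neighbor] == GENETIC_CODE[codon]:
                kind = "synonymous"
            else:
                kind = "nonsynonymous"
            result.append((neighbor, kind))
    return result


# Example: classify every one-step change away from the leucine codon UUA.
for neighbor, kind in one_step_neighbors("UUA"):
    print(neighbor, GENETIC_CODE[neighbor], kind)
```

A codon model assigns different rates to these classes of change, so a table like this is the starting point for filling in a codon rate matrix.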

References

Barry, D., and J. A. Hartigan. 1987. Statistical analysis of hominoid molecular evolution. Statistical Science 2: 191-210. [Early use of full 12-parameter model]

Dayhoff, M. O., and R. V. Eck. 1968. Atlas of Protein Sequence and Structure 1967-1968. National Biomedical Research Foundation, Silver Spring, Maryland. [Dayhoff’s PAM model for proteins]

Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Molecular Biology and Evolution 11: 725-736. [codon-based protein/DNA models]

Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22: 160-174. [HKY model]

Henikoff, S., and J. G. Henikoff. 1992. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences, USA 89: 10915-10919. [BLOSUM protein model]

Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Computer Applications in the Biosciences (CABIOS) 8: 275-282. [JTT model for proteins]

Jones, D. T., W. R. Taylor, and J. M. Thornton. 1994a. A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry 33: 3038-3049. [JTT membrane protein model]

Jones, D. T., W. R. Taylor, and J. M. Thornton. 1994b. A mutation data matrix for transmembrane proteins. FEBS Letters 339: 269-275. [JTT membrane protein model]

Jukes, T. H., and C. Cantor. 1969. Evolution of protein molecules. pp. 21-132 in Mammalian Protein Metabolism, ed. H. N. Munro. Academic Press, New York. [Jukes-Cantor model]

Kimura, M. 1980. A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16: 111-120. [Kimura’s 2-parameter model]

Koshi, J. M., and R. A. Goldstein. 1995. Context-dependent optimal substitution matrices. Protein Engineering 8: 641-645. [generating other kinds of protein model matrices]

Lanave, C., G. Preparata, C. Saccone, and G. Serio. 1984. A new method for calculating evolutionary substitution rates. Journal of Molecular Evolution 20: 86-93. [General reversible model]

Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Molecular Biology and Evolution 11: 605-612. [The LogDet distance for correcting for changing base composition]

Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Molecular Biology and Evolution 11: 715-724. [codon-based protein/DNA models]

Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution 10: 512-526. [Tamura-Nei model]

How it was done

This projection was produced

using the prosper style in LaTeX,

using LaTeX to make a .dvi file,

using dvips to turn this into a PostScript file,

using ps2pdf to make it into a PDF file, and

displaying the slides in Adobe Acrobat Reader.

Result: nice slides using freeware.
