1 Introduction to molecular evolution Lecture 13, Statistics 246 March 4, 2004.


Introduction to molecular evolution

Lecture 13, Statistics 246

March 4, 2004


Evolution using molecules: implicit assumptions

Our DNA is inherited from our parents more or less unchanged.

Molecular evolution is dominated by mutations that are neutral from the standpoint of natural selection.

Mutations accumulate at fairly steady rates in surviving lineages.

We can study the evolution of (macro) molecules and reconstruct the evolutionary history of organisms using their molecules.


Some important dates in history (billions of years ago)

Origin of the universe                 15 ± 4
Formation of the solar system          4.6
First self-replicating system          3.5 ± 0.5
Prokaryotic-eukaryotic divergence      1.8 ± 0.3
Plant-animal divergence                1.0
Invertebrate-vertebrate divergence     0.5
Mammalian radiation beginning          0.1

From Doolittle et al. (CSH, 1986).


The three kingdoms


Two important early observations

Different proteins evolve at different rates, and this seems more or less independent of the host organism, including its generation time.

It is necessary to adjust the observed percent difference between two homologous proteins to get a distance more or less linearly related to the time since their common ancestor. (Later we offer a rational basis for doing this.)

A striking early version of these observations is next.


Rates of macromolecular evolution

[Figure, after Dickerson (1971): corrected amino acid changes per 100 residues (vertical axis, 0 to 220) against millions of years since divergence (horizontal axis, 0 to 1400, with geological periods from Huronian to Pliocene). Lines are drawn for fibrinopeptides (1.1 MY), hemoglobin (5.8 MY) and cytochrome c (20.0 MY). Marked on the plot: "Evolution of the globins", "Separation of ancestors of plants and animals", and the divergences mammals, birds/reptiles, reptiles/fish, carp/lamprey, mammals/reptiles, vertebrates/insects.]


Different rates of change for different proteins

Protein                                PAMs(a)/100 residues/10^8 years   Theoretical lookback time(b)
Pseudogenes                            400                               45 (c)
Fibrinopeptides                        90                                200 (c)
Lactalbumins                           27                                670 (c)
Lysozymes                              24                                750 (c)
Ribonucleases                          21                                850 (c)
Hemoglobins                            12                                1.5 (d)
Acid proteases                         8                                 2.3 (d)
Triosephosphate isomerase              3                                 6 (d)
Phosphoglyceraldehyde dehydrogenase    2                                 9 (d)
Glutamate dehydrogenase                1                                 18 (d)

(a) PAMs: accepted point mutations (explained shortly). (b) Useful lookback time = 360 PAMs. (c) Million years. (d) Billion years. From Doolittle (1986).


Rates of change in protein families

Protein                           Rate(a)   Protein                           Rate(a)
Fibrinopeptides                   90        Thyrotropin beta chain            7.4
Growth hormone                    37        Parathyrin                        7.3
Ig kappa chain C region           37        Parvalbumin                       7.0
Kappa casein                      33        BPTI Protease inhibitors          6.2
Ig gamma chain C region           31        Trypsin                           5.9
Lutropin beta chain               30        Melanotropin beta                 5.6
Ig lambda chain C region          27        Alpha crystallin A chain          5.0
Complement C3a                    27        Endorphin                         4.8
Lactalbumin                       27        Cytochrome b5                     4.5
Epidermal growth factor           26        Insulin                           4.4
Somatotropin                      25        Calcitonin                        4.3
Pancreatic ribonuclease           21        Neurophysin 2                     3.6
Lipotropin beta                   21        Plastocyanin                      3.5
Haptoglobin alpha chain           20        Lactate dehydrogenase             3.4
Serum albumin                     19        Adenylate cyclase                 3.2
Phospholipase A2                  19        Triosephosphate isomerase         2.8
Protease inhibitor PST1 type      18        Vasoactive intestinal peptide     2.6
Prolactin                         17        Corticotropin                     2.5
Pancreatic hormone                17        Glyceraldehyde 3-P DH             2.2
Carbonic anhydrase C              16        Cytochrome C                      2.2
Lutropin alpha chain              16        Plant ferredoxin                  1.9
Hemoglobin alpha chain            12        Collagen                          1.7
Hemoglobin beta chain             12        Troponin C, skeletal muscle       1.5
Lipid-binding protein A-II        10        Alpha crystallin B-chain          1.5
Gastrin                           9.8       Glucagon                          1.2
Animal lysozyme                   9.8       Glutamate DH                      0.9
Myoglobin                         8.9       Histone H2B                       0.9
Amyloid A                         8.7       Histone H2A                       0.5
Nerve growth factor               8.5       Histone H3                        0.14
Acid proteases                    8.4       Ubiquitin                         0.1
Myelin basic protein              7.4       Histone H4                        0.1

(a) Percent per 100 My. From Nei (1987) and Dayhoff et al. (1978).


Some terminology

In evolution, homology (here, of proteins) means similarity due to common ancestry.

A common mode of protein evolution is by duplication. Depending on the relations between duplication and speciation dates, we have two different types of homologous proteins. Loosely,

Orthologues: the “same” gene in different organisms; common ancestry goes back to a speciation event.

Paralogues: different genes in the same organism; common ancestry goes back to a gene duplication.

Lateral gene transfer gives another form of homology.


Beta-globins (orthologues)

10 20 30 40

M V H L T P E E K S A V T A L W G K V N V D E V G G E A L G R L L V V Y P W T Q  BG-human
- . . . . . . . . N . . . T . . . . . . . . . . . . . . . . . . . . . . . . . .  BG-macaque
- - M . . A . . . A . . . . F . . . . K . . . . . . . . . . . . . . . . . . . .  BG-bovine
- . . . S G G . . . . . . N . . . . . . I N . L . . . . . . . . . . . . . . . .  BG-platypus
. . . W . A . . . Q L I . G . . . . . . . A . C . A . . . A . . . I . . . . . .  BG-chicken
- . . W S E V . L H E I . T T . K S I D K H S L . A K . . A . M F I . . . . . T  BG-shark

50 60 70 80

R F F E S F G D L S T P D A V M G N P K V K A H G K K V L G A F S D G L A H L D  BG-human
. . . . . . . . . . S . . . . . . . . . . . . . . . . . . . . . . . . . N . . .  BG-macaque
. . . . . . . . . . . A . . . . N . . . . . . . . . . . . D S . . N . M K . . .  BG-bovine
. . . . A . . . . . S A G . . . . . . . . . . . . A . . . T S . G . A . K N . .  BG-platypus
. . . A . . . N . . S . T . I L . . . M . R . . . . . . . T S . G . A V K N . .  BG-chicken
. Y . G N L K E F T A C S Y G - - - - - . . E . A . . . T . . L G V A V T . . G  BG-shark

90 100 110 120

N L K G T F A T L S E L H C D K L H V D P E N F R L L G N V L V C V L A H H F G  BG-human
. . . . . . . Q . . . . . . . . . . . . . . . . K . . . . . . . . . . . . . . .  BG-macaque
D . . . . . . A . . . . . . . . . . . . . . . . K . . . . . . . V . . . R N . .  BG-bovine
D . . . . . . K . . . . . . . . . . . . . . . . N R . . . . . I V . . . R . . S  BG-platypus
. I . N . . S Q . . . . . . . . . . . . . . . . . . . . D I . I I . . . A . . S  BG-chicken
D V . S Q . T D . . K K . A E E . . . . V . S . K . . A K C F . V E . G I L L K  BG-shark

130 140

K E F T P P V Q A A Y Q K V V A G V A N A L A H K Y H  BG-human
. . . . . Q . . . . . . . . . . . . . . . . . . . . .  BG-macaque
. . . . . V L . . D F . . . . . . . . . . . . . R . .  BG-bovine
. D . S . E . . . . W . . L . S . . . H . . G . . . .  BG-platypus
. D . . . E C . . . W . . L . R V . . H . . . R . . .  BG-chicken
D K . A . Q T . . I W E . Y F G V . V D . I S K E . .  BG-shark

. means same as reference sequence

- means deletion


Beta-globins: uncorrected pairwise distances

DISTANCES between protein sequences, calculated over: 1 to 147
Below diagonal: observed number of differences
Above diagonal: number of differences per 100 amino acids

        hum   mac   bov   pla   chi   sha
hum    ----     5    16    23    31    65
mac       7  ----    17    23    30    62
bov      23    24  ----    27    37    65
pla      34    34    39  ----    29    64
chi      45    44    52    42  ----    61
sha      91    88    91    90    87  ----
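As a rough illustration of how the two triangles of such a table are obtained, here is a minimal Python sketch. It assumes each aligned sequence is given as a string of explicit residues with `-` for a deletion (not the `.` shorthand used in the alignment), and that columns with a deletion in either sequence are left out of the comparison. The function names and the toy fragments are illustrative only.

```python
def observed_differences(seq_a, seq_b, gap="-"):
    """Return (n_diff, n_sites) over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    n_sites = 0
    n_diff = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip columns with a deletion in either sequence
        n_sites += 1
        if a != b:
            n_diff += 1
    return n_diff, n_sites


def percent_difference(seq_a, seq_b):
    """Uncorrected distance: observed differences per 100 compared sites."""
    n_diff, n_sites = observed_differences(seq_a, seq_b)
    return 100.0 * n_diff / n_sites


# Toy aligned fragments (hypothetical, not taken from the globin alignment).
x = "MVHLTPEEKS"
y = "-VHLSPEDKS"
print(observed_differences(x, y))
print(round(percent_difference(x, y), 1))
```

The first value goes below the diagonal, the second (rounded) above it.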


Beta-globins: corrected pairwise distances

DISTANCES between protein sequences, calculated over 1 to 147.
Below diagonal: observed number of differences
Above diagonal: estimated number of substitutions per 100 amino acids
Correction method: Jukes-Cantor

        hum   mac   bov   pla   chi   sha
hum    ----     5    17    27    37   108
mac       7  ----    18    27    36   102
bov      23    24  ----    32    46   110
pla      34    34    39  ----    34   106
chi      45    44    52    42  ----    98
sha      91    88    91    90    87  ----


UPGMA tree

[Figure: UPGMA tree with leaves BG-bovine, BG-human, BG-macaque, BG-platypus, BG-chicken and BG-shark.]


Human globins (paralogues)

10 20 30

- V L S P A D K T N V K A A W G K V G A H A G E Y G A E A L E R M F L S F P T T  alpha-human
V H . T . E E . S A . T . L . . . . - - N V D . V . G . . . G . L L V V Y . W .  beta-human
V H . T . E E . . A . N . L . . . . - - N V D A V . G . . . G . L L V V Y . W .  delta-human
V H F T A E E . A A . T S L . S . M - - N V E . A . G . . . G . L L V V Y . W .  epsilon-human
G H F T E E . . A T I T S L . . . . - - N V E D A . G . T . G . L L V V Y . W .  gamma-human
- G . . D G E W Q L . L N V . . . . E . D I P G H . Q . V . I . L . K G H . E .  myo-human

40 50 60 70

K T Y F P H F - D L S H G S A - - - - - Q V K G H G K K V A D A L T N A V A H V  alpha-human
Q R F . E S . G . . . T P D . V M G N P K . . A . . . . . L G . F S D G L . . L  beta-human
Q R F . E S . G . . . S P D . V M G N P K . . A . . . . . L G . F S D G L . . L  delta-human
Q R F . D S . G N . . S P . . I L G N P K . . A . . . . . L T S F G D . I K N M  epsilon-human
Q R F . D S . G N . . S A . . I M G N P K . . A . . . . . L T S . G D . I K . L  gamma-human
L E K . D K . K H . K S E D E M K A S E D L . K . . A T . L T . . G G I L K K K  myo-human

80 90 100 110

D D M P N A L S A L S D L H A H K L R V D P V N F K L L S H C L L V T L A A H L  alpha-human
. N L K G T F A T . . E . . C D . . H . . . E . . R . . G N V . V C V . . H . F  beta-human
. N L K G T F . Q . . E . . C D . . H . . . E . . R . . G N V . V C V . . R N F  delta-human
. N L K P . F A K . . E . . C D . . H . . . E . . . . . G N V M V I I . . T . F  epsilon-human
. . L K G T F A Q . . E . . C D . . H . . . E . . . . . G N V . V T V . . I . F  gamma-human
G H H E A E I K P . A Q S . . T . H K I P V K Y L E F I . E . I I Q V . Q S K H  myo-human

120 130 140

P A E F T P A V H A S L D K F L A S V S T V L T S K Y R - - - - - -  alpha-human
G K . . . . P . Q . A Y Q . V V . G . A N A . A H . . H . . . . . .  beta-human
G K . . . . Q M Q . A Y Q . V V . G . A N A . A H . . H . . . . . .  delta-human
G K . . . . E . Q . A W Q . L V S A . A I A . A H . . H . . . . . .  epsilon-human
G K . . . . E . Q . . W Q . M V T A . A S A . S . R . H . . . . . .  gamma-human
. G D . G A D A Q G A M N . A . E L F R K D M A . N . K E L G F Q G  myo-human


Human globins: uncorrected pairwise distances

DISTANCES between protein sequences, calculated over 1 to 154.
Below diagonal: observed number of differences
Above diagonal: number of differences per 100 amino acids

        alpha  beta  delta  epsil  gamma   myo
alpha    ----    55     55     60     57    74
beta       82  ----      7     25     27    75
delta      82    10   ----     27     29    74
epsil      89    35     39   ----     20    77
gamma      85    39     42     29   ----    76
myo       116   117    116    119    118  ----


Human globins: corrected pairwise distances

DISTANCES between protein sequences, calculated over 1-141
Below diagonal: observed number of differences
Above diagonal: estimated number of substitutions per 100 amino acids
Correction method: Jukes-Cantor

        alpha  beta  delta  epsil  gamma   myo
alpha    ----   281    281    281    313   208
beta       82  ----      7     30     31  1000
delta      82    10   ----     34     33   470
epsil      89    35     39   ----     21   402
gamma      85    39     42     29   ----   470
myo       116   117    116    119    118  ----


Alu sequences (α-globin2 Alu 1, Knight et al., 1996)

10 20 30 40 50 60

C C G A C A G G C A C G G T G G C T C A C A C C T G T A A T C C C A G T A C T T T G G G A G G C T G A G G C G A G A G G  hum-1
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-3
. . . . . . . A . . . . . . . . . . . . . . . . . . . C . . . . . . . C . . . . . . . . . . . . . . . . . . . G . . . .  chimp
. . . . . . . A . . . . . . . . . . . . . . . . . . . C . . . . . . . C . . . . . . . . . . . . . . . . . . . G . . . .  bonob
. . . . . . . . . . . A . . . . . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . C . . . . . . . . G . . . .  goril
. G . . . . . A . . . . . . . . . . . . . G . . . . . . . . . . . . . C . . . . . . . . . . . . C . . . . T . G . C . .  orang

70 80 90 100 110 120

A T C A C C T G A G G T C G G G A G T T T G A G A C C A G C C T G A C C A A T A T G G A G A A A C C C C A G T T A T A C  hum-1
. . . . . . . . . . . . . . . . . . . . . . . . . T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C . . . . .  hum-3
. . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  chimp
. . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  bonob
. . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . T . . . . . . . . . . .  goril
. . . . . . . . . . . . T . . . . . . . C . . A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C . . .  orang

130 140 150 160 170 180

T A A A A A T A C A A A A T T A G C T G G G T G T G G T G G C G C A T G C C T G T A A T C C T A G C T A C T A G G A A G  hum-1
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-3
. . . . . . . . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . G . .  chimp
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . G . .  bonob
. . . . . . . . . . . . . . . . . . . . . . . . G . . . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . G . .  goril
. . . . . . . . . . . . . . . . . . . . . . C . . . . . . . . . . . . . . . . . . . . . T . C . . . . . . . . . . G . .  orang

190 200 210 220 230 240

G C T G A G G C A G G A G A A T C G C T T G A A C C C G G G A G G T G G A G G T T G A G G T G A G C T G A G A T C A C G  hum-1
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-3
. . . . . . . . . . . . A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  chimp
. . . . . . . . . . . . A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  bonob
. . . . . . . . . . . . . . . . . A . . . . . . . . . . . . . A . . . . . . . . . T . . . . . . . . . . . . . . . . . A  goril
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . T . . . . . . . . . . . . . . G . .  orang

250 260 270 280 290 300

C C A T T G C A C T C C A G C C T G G G C A A C A A G A G C A A A A C T C C G T C T C A A A A A A T A A A T A A A T A A  hum-1
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  hum-3
. . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A . . . . . . . . . . . . . . . . C . . . .  chimp
. . . C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A . . . . . . . . . . . . . . . . . . . . .  bonob
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A . . . . . . . . . . . . . . . . . . . . .  goril
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A . . . T . . . . . . . . . . . . . . . . .  orang


Alu sequences: uncorrected pairwise distances

DISTANCES between nucleic acid sequences calculated over: 1 to 300, considering all base positions
Below diagonal: observed number of differences
Above diagonal: number of differences per 100 bases

        hum-1  hum-2  hum-3  chimp  bonob  goril  orang
hum-1    ----      0      0      4      3      5      7
hum-2       1   ----      1      4      4      5      7
hum-3       1      2   ----      4      4      5      7
chimp      12     13     13   ----      1      5      6
bonob      10     11     11      2   ----      4      5
goril      14     15     15     14     12   ----      7
orang      20     21     21     18     16     22   ----


Alu sequences: corrected pairwise distances

DISTANCES between nucleic acid sequences, calculated over: 1 to 300, considering all base positions
Below diagonal: observed number of differences
Above diagonal: estimated number of substitutions per 100 bases
Correction method: Jukes-Cantor

        hum-1  hum-2  hum-3  chimp  bonob  goril  orang
hum-1    ----      0      0      4      3      5      7
hum-2       1   ----      1      4      4      5      7
hum-3       1      2   ----      4      4      5      7
chimp      12     13     13   ----      1      5      6
bonob      10     11     11      2   ----      4      6
goril      14     15     15     14     12   ----      8
orang      20     21     21     18     16     22   ----


Correcting distances between DNA and protein sequences

We mentioned earlier that it is necessary to adjust observed percent differences to get a distance measure which scales linearly with time. This is because we can have multiple and back substitutions at a given position along a lineage.

All of the correction methods (with names like Jukes-Cantor, Kimura 2-parameter, etc.) are justified by simple probabilistic arguments involving Markov chains, whose basis is worth mastering.

The same molecular evolutionary models are used in scoring sequence alignments.


Markov chain

State space = {A,C,G,T}.

$p(i,j) = \Pr(\text{next state } S_j \mid \text{current state } S_i)$

Markov assumption:

$p(i,j) = \Pr(\text{next state } S_j \mid \text{current state } S_i \text{ and any configuration of states before this})$

Only the present state, not previous states, affects the probabilities of moving to next states.


A simple 4-state Markov chain: the Kimura 2-parameter model for nucleotide change

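[Diagram: the four states arranged in a square, with A and G on the top row and T and C on the bottom row.]

Transition probabilities:
  Horizontal: a
  Diagonal and vertical: b
  Self: $c = 1 - a - 2b$

        A   G   C   T
    A   c   a   b   b
    G   a   c   b   b
    C   b   b   c   a
    T   b   b   a   c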
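A minimal Python sketch of this matrix, assuming the row and column order A, G, C, T used on the slide. The function name `kimura_matrix` and the numerical values of `a` and `b` are illustrative choices, not part of the lecture.

```python
STATES = "AGCT"  # row/column order of the slide's matrix


def kimura_matrix(a, b):
    """One-step transition matrix: a for A<->G and C<->T, b for every other change."""
    if a < 0 or b < 0 or a + 2 * b > 1:
        raise ValueError("need a, b >= 0 and a + 2b <= 1")
    c = 1.0 - a - 2.0 * b  # probability of staying in the same state
    return [
        [c, a, b, b],
        [a, c, b, b],
        [b, b, c, a],
        [b, b, a, c],
    ]


P = kimura_matrix(a=0.02, b=0.005)  # illustrative values only
for state, row in zip(STATES, P):
    # Each row is a probability distribution over the next state.
    print(state, row, "row sum =", sum(row))
```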


The multiplication rule

$\Pr(\text{state after next is } S_k \mid \text{current state is } S_i)$

$= \sum_j \Pr(\text{state after next is } S_k,\ \text{next state is } S_j \mid \text{current state is } S_i)$ [addition rule]

$= \sum_j \Pr(\text{next state is } S_j \mid \text{current state is } S_i) \times \Pr(\text{state after next is } S_k \mid \text{current state is } S_i,\ \text{next state is } S_j)$ [multiplication rule]

$= \sum_j p(i,j) \times p(j,k)$ [Markov assumption]

$= (i,k)$-element of $P^2$, where $P = (p(i,j))$.

More generally,

$\Pr(\text{state } t \text{ steps from now is } S_k \mid \text{current state is } S_i) = (i,k)$-element of $P^t$.
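The rule can be checked numerically. The sketch below is only an illustration: it uses a Kimura-type one-step matrix with made-up values of `a` and `b`, and the helper names `mat_mul` and `mat_pow` are hypothetical. It compares the (A,G) element of the squared matrix with the explicit sum over the intermediate state.

```python
def mat_mul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][j] * Y[j][k] for j in range(n)) for k in range(n)]
            for i in range(n)]


def mat_pow(P, t):
    """P multiplied by itself t times (t a non-negative integer)."""
    n = len(P)
    result = [[1.0 if i == k else 0.0 for k in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, P)
    return result


# Illustrative one-step matrix, states ordered A, G, C, T.
a, b = 0.02, 0.005
c = 1.0 - a - 2.0 * b
P = [[c, a, b, b], [a, c, b, b], [b, b, c, a], [b, b, a, c]]

# pr(state after next is G | current state is A), two ways.
two_step = mat_pow(P, 2)[0][1]
explicit = sum(P[0][j] * P[j][1] for j in range(4))
print(two_step, explicit)

# Rows of a higher power are still probability distributions.
print([sum(row) for row in mat_pow(P, 10)])
```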


Continuous-time version

For any $s, t$ write $p_{ij}(t) = \Pr(S_j \text{ at time } t+s \mid S_i \text{ at time } s)$ for the stationary (time-homogeneous) transition probabilities.

Write $P(t) = (p_{ij}(t))$ for the matrix of $p_{ij}(t)$'s.

Then for any $t, u$: $P(t+u) = P(t)\,P(u)$.

It follows that $P(t) = \exp(Qt)$, where $Q = P'(0)$ is the derivative of $P(t)$ at $t = 0$.

$Q$ is called the infinitesimal matrix of $P(t)$, and satisfies

$P'(t) = Q\,P(t) = P(t)\,Q$.
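As a numerical illustration, the sketch below computes exp(Qt) from a truncated power series and checks that P(t+u) = P(t)P(u). The rate matrix is a made-up Kimura-type example, and the helpers `mat_mul` and `mat_exp` are hypothetical names. The plain series is adequate only because Qt is small here; it is not a general-purpose matrix exponential.

```python
def mat_mul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][j] * Y[j][k] for j in range(n)) for k in range(n)]
            for i in range(n)]


def mat_exp(Q, t, terms=40):
    """Truncated series I + Qt + (Qt)^2/2! + ... (fine for small Qt)."""
    n = len(Q)
    A = [[q * t for q in row] for row in Q]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for m in range(1, terms):
        # term now holds (Qt)^m / m!
        term = [[x / m for x in row] for row in mat_mul(term, A)]
        result = [[r + x for r, x in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    return result


# Illustrative rate matrix (states A, G, C, T); each row sums to 0.
a, b = 0.4, 0.1
d = -(a + 2.0 * b)
Q = [[d, a, b, b], [a, d, b, b], [b, b, d, a], [b, b, a, d]]

t, u = 0.3, 0.5
lhs = mat_exp(Q, t + u)
rhs = mat_mul(mat_exp(Q, t), mat_exp(Q, u))
max_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(4) for j in range(4))
print("largest |P(t+u) - P(t)P(u)| entry:", max_gap)
```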


Interpretation of Q

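Roughly, $q(i,j)$ is the rate of transitions of $i$ to $j$ (for $j \ne i$), while $q(i,i) = -\sum_{j \ne i} q(i,j)$, so each row sum is 0. If, under some initial conditions, we have a Markov chain evolving in continuous time with infinitesimal matrix $Q$, and $p_j(t) = \Pr(S_j \text{ at time } t)$, then for small $h$

$p_j(t+h) = \sum_i \Pr(S_i \text{ at } t,\ S_j \text{ at } t+h)$

$= \sum_i \Pr(S_i \text{ at } t)\,\Pr(S_j \text{ at } t+h \mid S_i \text{ at } t)$

$\approx p_j(t) \times (1 + h\,q(j,j)) + \sum_{i \ne j} p_i(t) \times h\,q(i,j)$,

i.e., $h^{-1}[p_j(t+h) - p_j(t)] \approx p_j(t)\,q(j,j) + \sum_{i \ne j} p_i(t)\,q(i,j)$,

which becomes $p'(t) = p(t)\,Q$ as $h \to 0$, where $p(t)$ is the row vector of the $p_j(t)$. Taking each possible initial state in turn gives $P' = PQ$.

Important approximation: when $t$ is small, $P(t) \approx I + Qt$.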


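The Jukes-Cantor model (1969)

$$Q = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}, \qquad P(t) = \begin{pmatrix} r & s & s & s \\ s & r & s & s \\ s & s & r & s \\ s & s & s & r \end{pmatrix}$$

$r = (1 + 3e^{-4\alpha t})/4$, $\quad s = (1 - e^{-4\alpha t})/4$.

[Diagram: a lineage running from the common ancestor of human and orang, $t$ time units ago, to human (now).]

Consider e.g. the 2nd position in α-globin2 Alu 1.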
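A short Python sketch of the closed form, writing `alpha` for the common rate of each specific substitution (the off-diagonal entry of Q). The function name and the chosen times are illustrative assumptions. It shows P(0) = I, rows that sum to 1, and every entry tending to 1/4 as t grows.

```python
import math


def jc_transition_matrix(alpha, t):
    """Jukes-Cantor P(t): r on the diagonal, s everywhere else."""
    e = math.exp(-4.0 * alpha * t)
    r = (1.0 + 3.0 * e) / 4.0   # probability of seeing the same nucleotide
    s = (1.0 - e) / 4.0         # probability of each specific different one
    return [[r if i == j else s for j in range(4)] for i in range(4)]


for t in (0.0, 0.1, 1.0, 10.0):
    P = jc_transition_matrix(alpha=1.0, t=t)
    print(t, round(P[0][0], 4), round(P[0][1], 4), "row sum:", sum(P[0]))

# For small h, (P(h) - I)/h is close to Q: off-diagonal entries near alpha.
h = 1e-6
print(jc_transition_matrix(1.0, h)[0][1] / h)
```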


Definition of PAM

Let $P(t) = \exp(Qt)$. Then the $(A,G)$ element of $P(t)$ is

$\Pr(G \text{ now} \mid A \text{ then}) = (1 - e^{-4\alpha t})/4$.

Same for all pairs of different nucleotides.

Overall rate of change $k = 3\alpha t$.

When $k = .01$, described as 1 PAM.

PAM = accepted point mutation.

Put $\alpha t = .01/3 = 1/300$. Then the resulting $P$ is called the PAM(1) matrix.

Why use PAMs?


Evolutionary time, PAM

Since sequences evolve at different rates, it is convenient to rescale time so that 1 PAM of evolutionary time corresponds to 1% expected substitutions.

For Jukes-Cantor, $k = 3\alpha t$ is the expected number of substitutions per site in $[0,t]$, so is a distance. (Show this.)

Set $3\alpha t = 1/100$, or $t = 1/(300\alpha)$, so 1 PAM $= 1/(300\alpha)$ years.
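To make the rescaling concrete, here is a minimal sketch that builds the Jukes-Cantor PAM(1) matrix from the product of rate and time equal to 1/300 (called `alpha_t` below, an illustrative name). It also checks that multiplying PAM(1) by itself 100 times agrees with the closed form at 100 PAMs, which follows from P(t+u) = P(t)P(u). The helper names are hypothetical.

```python
import math


def jc_matrix(alpha_t):
    """Jukes-Cantor transition matrix as a function of (rate x time)."""
    e = math.exp(-4.0 * alpha_t)
    r, s = (1.0 + 3.0 * e) / 4.0, (1.0 - e) / 4.0
    return [[r if i == j else s for j in range(4)] for i in range(4)]


def mat_mul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][j] * Y[j][k] for j in range(n)) for k in range(n)]
            for i in range(n)]


pam1 = jc_matrix(1.0 / 300.0)   # 3 * (rate x time) = 1/100
p_change = 1.0 - pam1[0][0]     # chance a site looks different after 1 PAM
print("pr(site differs after 1 PAM):", p_change)

# PAM(100) obtained as PAM(1) multiplied by itself 100 times.
pam_n = pam1
for _ in range(99):
    pam_n = mat_mul(pam_n, pam1)
direct = jc_matrix(100.0 / 300.0)
print(pam_n[0][1], direct[0][1])
```

The printed probability is slightly below 0.01: 1 PAM means 1% expected substitutions, and a few of these fall on the same site and are not seen as differences.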


Distance adjustment

For a pair of sequences, $k = 3\alpha t$ is the desired metric, but not observable. Instead, pr(different) is observed. So we use a model to convert pr(different) to $k$.

This is completely analogous to the conversion of

$\theta = \Pr(\text{recombination})$

to genetic (map) distance $d$ (= expected number of crossovers) using the Haldane map function

$\theta = \tfrac{1}{2}(1 - e^{-2d})$,

assuming the no-interference (Poisson) model.
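The analogy can be made concrete with a small sketch of the Haldane map function and its inverse. The symbol names `theta` (recombination fraction) and `d` (map distance in Morgans) and the function names are illustrative choices.

```python
import math


def haldane_theta(d):
    """Recombination fraction for map distance d (Morgans), no interference."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def haldane_distance(theta):
    """Inverse map: map distance d from an observed recombination fraction."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must lie in [0, 0.5)")
    return -0.5 * math.log(1.0 - 2.0 * theta)


for d in (0.01, 0.1, 0.5, 2.0):
    theta = haldane_theta(d)
    # theta is close to d for small d and levels off below 1/2 for large d.
    print(d, round(theta, 4), round(haldane_distance(theta), 4))
```

As with pr(different) and $k$, the observable quantity saturates while the underlying distance keeps growing, which is why the inverse map is needed.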


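Towards Jukes-Cantor adjustment

[Diagram: a common ancestor with two descendant lineages, orang (G) and human (C).]

Still the 2nd position in α-globin Alu 1.

Assume that the common ancestor has A, G, C or T with probability 1/4 each.

Then the chance of the nucleotides differing is

$p_{\ne} = \tfrac{3}{4}(1 - e^{-8\alpha t}) = \tfrac{3}{4}(1 - e^{-4k/3})$, since $k = 2 \times 3\alpha t$.

[Plot: $p_{\ne}$ against $t$, approaching the level 3/4.]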


Jukes-Cantor adjustment

If we suppose all nucleotide positions behave identically and independently, and $n_{\ne}$ differ out of $n$, we can invert this, obtaining

$\hat{k} = -\tfrac{3}{4}\log\left(1 - \tfrac{4}{3}\, n_{\ne}/n\right)$.

This is the corrected or adjusted fraction of differences (under this simple model). Multiply by 100 to get PAMs.

The analogous simple model for amino acid sequences has

$\hat{k} = -\tfrac{19}{20}\log\left(1 - \tfrac{20}{19}\, n_{\ne}/n\right)$.

Multiply by 100 for PAM.
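The forward relation and its inverse can be sketched together. In the sketch below, `n_states` is 4 for nucleotides and 20 for the analogous amino acid model; the function names are illustrative, not from the lecture. The loop shows the observed fraction levelling off while the corrected distance recovers $k$.

```python
import math


def expected_fraction_different(k, n_states=4):
    """p_diff = b (1 - exp(-k/b)), with b = 3/4 for DNA and 19/20 for proteins."""
    b = (n_states - 1) / n_states
    return b * (1.0 - math.exp(-k / b))


def jukes_cantor_distance(p_diff, n_states=4):
    """Corrected distance k from an observed fraction of differing sites."""
    b = (n_states - 1) / n_states
    if not 0.0 <= p_diff < b:
        raise ValueError("observed fraction must lie in [0, (n-1)/n)")
    return -b * math.log(1.0 - p_diff / b)


for k in (0.01, 0.1, 0.5, 1.0, 2.0):
    p = expected_fraction_different(k)
    # The observed fraction stays below 3/4; the correction returns k.
    print(k, round(p, 4), round(jukes_cantor_distance(p), 4))
```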


Illustration

1. Human and bovine beta-globins are aligned with no deletions at 145 out of 147 sites. They differ at 23 of these sites. Thus $n_{\ne}/n = 23/145$, and the corrected distance using the Jukes-Cantor formula is (natural logs)

$-\tfrac{19}{20}\log\left(1 - \tfrac{20}{19} \times \tfrac{23}{145}\right) \approx 17.4 \times 10^{-2}$.

2. The hum-1 and gorilla sequences are aligned without gaps across all 300 bp, and differ at 14 sites. Thus $n_{\ne}/n = 14/300$, and the corrected distance using the Jukes-Cantor formula is

$-\tfrac{3}{4}\log\left(1 - \tfrac{4}{3} \times \tfrac{14}{300}\right) \approx 4.8 \times 10^{-2}$.
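The two calculations can be reproduced with a few lines of Python. The helper name `jc_corrected` is a hypothetical choice; the counts are the ones quoted in the illustration.

```python
import math


def jc_corrected(n_diff, n_sites, n_states):
    """Jukes-Cantor corrected distance from counts (natural logs)."""
    b = (n_states - 1) / n_states
    return -b * math.log(1.0 - (n_diff / n_sites) / b)


# 1. Human vs bovine beta-globin: 23 differences at 145 ungapped sites.
globin = jc_corrected(23, 145, n_states=20)
# 2. hum-1 vs gorilla Alu: 14 differences over 300 bp.
alu = jc_corrected(14, 300, n_states=4)

print(f"beta-globin, human vs bovine: {100 * globin:.2f} per 100 residues")
print(f"Alu, hum-1 vs gorilla:        {100 * alu:.2f} per 100 bases")
```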


Correspondence between observed a.a. differences and the evolutionary distance (Dayhoff et al., 1978)

Observed percent difference    Evolutionary distance in PAMs
 1                               1
 5                               5
10                              11
15                              17
20                              23
25                              30
30                              38
35                              47
40                              56
45                              67
50                              80
55                              94
60                             112
65                             133
70                             159
75                             195
80                             246
85                             328
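For convenience, the correspondence can be held as a lookup table. The sketch below reads the two runs of digits above as observed differences 1, 5, 10, ..., 85 paired with PAM distances 1, 5, 11, ..., 328. It then interpolates linearly between tabulated points, which is an assumption made here for illustration and not something the table itself prescribes. The names `DAYHOFF` and `pam_from_percent` are hypothetical.

```python
# (observed percent difference, evolutionary distance in PAMs)
DAYHOFF = [
    (1, 1), (5, 5), (10, 11), (15, 17), (20, 23), (25, 30),
    (30, 38), (35, 47), (40, 56), (45, 67), (50, 80), (55, 94),
    (60, 112), (65, 133), (70, 159), (75, 195), (80, 246), (85, 328),
]


def pam_from_percent(observed):
    """PAM distance for an observed percent difference (linear interpolation)."""
    if not DAYHOFF[0][0] <= observed <= DAYHOFF[-1][0]:
        raise ValueError("observed percent difference outside tabulated range")
    for (x0, y0), (x1, y1) in zip(DAYHOFF, DAYHOFF[1:]):
        if x0 <= observed <= x1:
            # Straight line between the two neighbouring table entries.
            return y0 + (y1 - y0) * (observed - x0) / (x1 - x0)


print(pam_from_percent(50))   # a tabulated point
print(pam_from_percent(16))   # between the entries for 15 and 20
```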