Extracting Information From DNA Sequences Using Models of Sequence Evolution - Gavin Huttley


© 2013 Gavin Huttley

Extracting information from DNA sequences using models of sequence evolution

Gavin Huttley

John Curtin School of Medical Research, Australian National University


Overview

• Darwin meets Mendel meets …

• Felsenstein applies Fisher

• Markov processes of sequence evolution

• Measuring model support


Darwin & Mendel: genetic change during evolution

Mendel's underpinning

• Population of 1 self-pollinating plant

• Dies at reproduction

• Founder is A/G at one locus

• What is the probability the second generation population is homozygous at this locus?

http://www.flickr.com/photos/22281745@N04/2149169348/


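Founder: A/G

Genotype   A/A    A/G    G/G
Gen 1      0.25   0.5    0.25
Gen 2      0.25   0.5    0.25   (offspring of an A/G plant from Gen 1)

$$\text{Prob(fixed in gen 1)} = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}$$

$$\text{Prob(not fixed in gen 1)} = \frac{1}{2}$$

$$\text{Prob(first fixed in gen 2)} = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$$

The probability of being fixed by generation 2 is then

$$\text{Prob(fixed by gen 2)} = \frac{1}{2} + \frac{1}{4} = \frac{3}{4}$$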
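As a rough check of the arithmetic on this slide, the sketch below follows a single selfing lineage that starts as A/G. Each generation an A/G plant produces a homozygous offspring with probability 1/4 + 1/4 = 1/2. The function name `prob_fixed_by` is an illustrative assumption and not something from the talk.

```python
from fractions import Fraction

def prob_fixed_by(generation):
    """Probability that a single selfing A/G lineage is homozygous by `generation`."""
    still_heterozygous = Fraction(1)  # the founder is A/G
    fixed = Fraction(0)
    for _ in range(generation):
        # Selfing an A/G plant gives A/A (1/4), A/G (1/2), G/G (1/4).
        fixed += still_heterozygous * Fraction(1, 2)
        still_heterozygous *= Fraction(1, 2)
    return fixed

print(prob_fixed_by(1))  # 1/2
print(prob_fixed_by(2))  # 3/4
```

Under these assumptions the probability of fixation by generation n is 1 - (1/2)^n, which approaches 1 quickly for a population of one.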

[Figure: Frequency (A) against Time. Each line is a separate population.]

“[Ne is] the number of individuals in a theoretically ideal population having the same magnitude of random genetic drift as the actual population.” (Hartl & Clark, 2007, p 121)

RGD Summary

• The probability of fixation of a neutral allele is just its current frequency, which for a new mutant is initially 1/(2Ne)

• The expected time to fixation of a new neutral mutant, given that it fixes, is 4Ne generations

• Big populations have more variation than small ones

• Future allele frequencies depend only on the current population frequency, not past frequencies
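The first and last points above can be illustrated with a small exact calculation. This is a minimal sketch assuming an idealised Wright-Fisher population (binomial sampling of a fixed number of gene copies each generation, no mutation or selection); the function names and the choice of 10 gene copies are illustrative assumptions. Because the next generation depends only on the current count, the process is a Markov chain, and iterating it shows the fixation probability matching the starting frequency.

```python
from math import comb

def wright_fisher_matrix(n_copies):
    """P[i][j]: probability of j copies of A next generation given i copies now."""
    matrix = []
    for i in range(n_copies + 1):
        p = i / n_copies  # current frequency of A
        matrix.append([comb(n_copies, j) * p**j * (1 - p)**(n_copies - j)
                       for j in range(n_copies + 1)])
    return matrix

def fixation_probability(n_copies, start, generations=2000):
    """Probability that A is fixed after many generations of drift."""
    P = wright_fisher_matrix(n_copies)
    dist = [0.0] * (n_copies + 1)
    dist[start] = 1.0
    for _ in range(generations):
        # One generation: propagate the distribution through the chain.
        dist = [sum(dist[i] * P[i][j] for i in range(n_copies + 1))
                for j in range(n_copies + 1)]
    return dist[n_copies]

# A single new mutant among 10 gene copies (2N = 10) starts at frequency 1/10.
print(fixation_probability(10, 1))
```

The exact matrix approach is used here instead of random simulation so that the result is reproducible; it only scales to small populations.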


Mutation


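Mutation rate µ (per gene copy per generation)

$$\text{Number of new mutations} = 2N\mu$$

$$\text{Prob(fixation of a new mutant)} = \frac{1}{2N}$$

$$\text{Rate of fixation} = 2N\mu \times \frac{1}{2N} = \mu$$

So population size has no effect!!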
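The cancellation of N can be checked numerically. The sketch below uses exact fractions; the mutation rate and population sizes are hypothetical values chosen only for illustration, and `substitution_rate` is an assumed name.

```python
from fractions import Fraction

def substitution_rate(pop_size, mu):
    """Neutral rate of fixation: new mutants per generation times Prob(fixation)."""
    new_mutations = 2 * pop_size * mu           # 2N gene copies, each mutating at rate mu
    prob_fixation = Fraction(1, 2 * pop_size)   # initial frequency of one new mutant
    return new_mutations * prob_fixation

mu = Fraction(1, 10**8)  # hypothetical mutation rate
for pop_size in (100, 10_000, 1_000_000):
    print(pop_size, substitution_rate(pop_size, mu) == mu)
```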


The Neutral Theory


What does neutral mean?

• That a genetic variant is ‘invisible’ to natural selection. (Hence, selectively neutral.)

• The evolutionary dynamics (changes in frequency) are dictated by random genetic drift and mutation only.

• “functionally less important molecules or parts of a molecule evolve faster than more important ones”


• Say only hydrophilic amino acids are allowed at a specific position

• In this case, mutations that produce non-hydrophilic amino acids will be eliminated by natural selection, so their fixation probability is lower and so is the substitution rate

[Figure: amino acids shown by one-letter code: R, H, K, D, E, S, T, N, Q, C, U, G, P]

Kimura’s rule of thumb: natural selection is only effective against random genetic drift (RGD) when $4N_e s \gg 1$

What happened within populations is what we see between species, i.e. polymorphism and substitution are related.

[Figure: Frequency against Time for four cases: Positive Natural Selection, Negative Natural Selection, Balancing Natural Selection, Neutral Evolution.]

Summary

• The Neutral Theory as the null hypothesis for evolutionary analysis.

• Darwin's filter as a basis for predicting biological significance

• Slow evolution is taken as evidence of functional importance

• In other words, inferring selection requires quantifying a neutral process

Felsenstein applies Fisher: likelihood for phylogenetics

“The likelihood, L(H|R), of the hypothesis H given the data R, and a specific model, is proportional to P(R|H), the constant of proportionality being arbitrary.” (Edwards 1992)

Utility of phylogenetic techniques

• identify relationships among sequences

• understand divergence mechanisms


The phylogenetic “hypothesis”

• The tree topology

• Representation of sequence divergence

• Substitution matrices — P(n) in the figure — for each branch specifying probabilities of change from one sequence state to another

• Ancestral state frequencies


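Likelihood for 3 sequences

[Figure: three-sequence tree with unobserved ancestral states and substitution matrices P(t1), P(t2), P(t3) on the branches]

For this alignment column, the likelihood if the ancestral base was A:

$$L_A = \pi_A \times p_{A,A}(t_1) \times p_{A,G}(t_2) \times p_{A,C}(t_3)$$

The full likelihood is

$$L_1 = L_A + L_G + L_C + L_T$$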
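The sum over unobserved ancestral states can be sketched in a few lines. This is only an illustration under stated assumptions: it uses an unscaled F81 process, for which the transition probabilities have the closed form p(i,j,t) = e^(-t) [i = j] + (1 - e^(-t)) π_j, and the base frequencies and branch lengths are hypothetical. The function names are not from the talk.

```python
from math import exp, prod

BASES = "ACGT"

def f81_prob(i, j, t, pi):
    """P(i -> j) over branch length t under unscaled F81 (q_ij = pi_j for i != j)."""
    decay = exp(-t)
    return decay * (1.0 if i == j else 0.0) + (1.0 - decay) * pi[j]

def column_likelihood(observed, lengths, pi):
    """Likelihood of one alignment column: sum over the unobserved ancestral base."""
    total = 0.0
    for ancestor in BASES:
        total += pi[ancestor] * prod(
            f81_prob(ancestor, base, t, pi) for base, t in zip(observed, lengths)
        )
    return total

pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}   # hypothetical base frequencies
lengths = (0.1, 0.2, 0.3)                        # hypothetical t1, t2, t3
print(column_likelihood("AGC", lengths, pi))     # column with A, G, C as on the slide
```

The likelihood of a whole alignment would be the product of these column likelihoods, given the assumption that positions evolve independently.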


Some Assumptions

• edges are independent

• the same tree holds for all nucleotides

• + assumptions of substitution model


Likelihood & consistency

• Consistency (convergence of an estimate to the true parameter value) occurs by addition of aligned columns, i.e. longer alignments

• Chang, J. T. (1996). Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Mathematical Biosciences, 137(1), 51-73.

Markov processes


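Q for F81/HKY85/GTR

$$P(t_i) = \exp(Q_i t_i)$$

$$Q = \begin{bmatrix}
- & r_{A \leftrightarrow C}\pi_C & r_{A \leftrightarrow G}\pi_G & r_{A \leftrightarrow T}\pi_T \\
r_{A \leftrightarrow C}\pi_A & - & r_{C \leftrightarrow G}\pi_G & r_{C \leftrightarrow T}\pi_T \\
r_{A \leftrightarrow G}\pi_A & r_{C \leftrightarrow G}\pi_C & - & r_{G \leftrightarrow T}\pi_T \\
r_{A \leftrightarrow T}\pi_A & r_{C \leftrightarrow T}\pi_C & r_{G \leftrightarrow T}\pi_G & -
\end{bmatrix}$$

$r_{i \leftrightarrow j}$: exchangeability term

$\pi_i$: probability of base i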
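A minimal sketch of building Q from exchangeabilities and base frequencies, then computing P(t) = exp(Qt). Several things here are assumptions for illustration: the diagonal is set so each row sums to zero (the usual convention for a rate matrix), the matrix exponential is a simple pure-Python scaling-and-squaring routine so the example has no dependencies (in practice a library routine such as `scipy.linalg.expm` would be the normal choice), and the frequencies and κ value are hypothetical. F81 corresponds to all exchangeabilities equal, HKY85 to one shared value for the transitions A↔G and C↔T, and GTR to a separate value for each pair.

```python
from itertools import combinations

BASES = "ACGT"

def make_q(pi, r):
    """q_ij = r[{i, j}] * pi_j for i != j; diagonal makes each row sum to zero."""
    n = len(BASES)
    Q = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                Q[i][j] = r[frozenset((a, b))] * pi[b]
        Q[i][i] = -sum(Q[i])  # diagonal is still 0.0 when the sum is taken
    return Q

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(Q, t, squarings=10, terms=12):
    """exp(Q t) via a truncated Taylor series on Q t / 2**squarings, then squaring."""
    n = len(Q)
    scale = t / 2 ** squarings
    M = [[q * scale for q in row] for row in Q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, M)]  # M**k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}            # hypothetical frequencies
f81 = {frozenset(pair): 1.0 for pair in combinations(BASES, 2)}
hky = dict(f81)
hky[frozenset("AG")] = hky[frozenset("CT")] = 4.0        # hypothetical kappa
P = expm(make_q(pi, hky), 0.5)
for base, row in zip(BASES, P):
    print(base, [round(p, 4) for p in row])
```

Because Q has the form r × π with symmetric r, the resulting process is reversible and has π as its stationary distribution, which ties in with the typical assumptions listed on the next slide.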


Typical assumptions

• positions evolve iid

• time homogeneous (embeddable)

• reversible

• stationary


Comparing models


Comparing nested models

• Tree topology must be the same between null and alternate models

• Processes in null and alternate models must be nested, e.g. HKY and GTR

• THEN (typically), the conventional likelihood ratio test can be employed and related to the χ2 distribution

$$LR = 2(\ln L_1 - \ln L_0)$$
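The test can be sketched for the simplest case of one extra free parameter, such as F81 against HKY85 (which adds κ). With 1 degree of freedom the χ2 upper tail has the closed form erfc(sqrt(x/2)), so no statistics library is needed. The log-likelihood values below are hypothetical, and the function names are illustrative assumptions.

```python
from math import erfc, sqrt

def likelihood_ratio(lnL_null, lnL_alt):
    """LR = 2 (lnL1 - lnL0), with lnL0 from the null and lnL1 from the alternate model."""
    return 2.0 * (lnL_alt - lnL_null)

def chi2_sf_1df(x):
    """P(X >= x) for a chi-square distribution with 1 degree of freedom."""
    return erfc(sqrt(x / 2.0)) if x > 0 else 1.0

lnL_f81, lnL_hky = -1234.5, -1228.1   # hypothetical maximised log-likelihoods
lr = likelihood_ratio(lnL_f81, lnL_hky)
p_value = chi2_sf_1df(lr)
print(lr, p_value)
```

For comparisons with more than one extra parameter the degrees of freedom equal the difference in the number of free parameters, and a general χ2 routine would be needed.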


F81 vs HKY85


The χ2 approximation


Comparing non-nested models

• Also use a LR statistic

• The probability of observing a LR statistic of equal or greater value by chance under the null hypothesis is estimated using a parametric bootstrap procedure in which data are simulated under the fitted null model

• Goldman N. (1993) J. Mol. Evol. 36:2, 182-98
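The bootstrap procedure above can be sketched with a deliberately simple stand-in. Fitting phylogenetic models is beyond a short example, so this toy swaps in two non-nested count models: a geometric distribution as the null and a Poisson as the alternate. Everything named here (the models, the data, the function names, the seed) is an assumption for illustration. The steps are the same: fit both models, compute the LR, simulate data under the fitted null, refit both models to each simulated data set, and report the fraction of simulated LR values at least as large as the observed one.

```python
import random
from math import log, lgamma

def lnL_geometric(data):
    """Maximised log-likelihood of a geometric model on {0, 1, 2, ...}."""
    n, total = len(data), sum(data)
    if total == 0:
        return 0.0
    p = n / (n + total)  # MLE of the success probability
    return n * log(p) + total * log(1.0 - p)

def lnL_poisson(data):
    """Maximised log-likelihood of a Poisson model."""
    n, total = len(data), sum(data)
    if total == 0:
        return 0.0
    lam = total / n  # MLE of the mean
    return total * log(lam) - n * lam - sum(lgamma(x + 1) for x in data)

def lr_statistic(data):
    return 2.0 * (lnL_poisson(data) - lnL_geometric(data))

def simulate_geometric(n, p, rng):
    """Draw n values from the fitted null model by inverse transform sampling."""
    if p >= 1.0:
        return [0] * n
    return [int(log(1.0 - rng.random()) / log(1.0 - p)) for _ in range(n)]

def bootstrap_p_value(data, n_reps=200, seed=1):
    rng = random.Random(seed)
    observed = lr_statistic(data)
    n = len(data)
    p_hat = n / (n + sum(data))  # null model fitted to the real data
    hits = sum(
        lr_statistic(simulate_geometric(n, p_hat, rng)) >= observed
        for _ in range(n_reps)
    )
    return hits / n_reps

data = [5] * 50  # hypothetical counts that a geometric model fits poorly
print(bootstrap_p_value(data))
```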


Other comparison approaches

• information criterion (AIC, BIC)

• estimates of a parameter of interest
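For reference, the two information criteria named above have simple standard definitions: AIC = 2k - 2 lnL and BIC = k ln(n) - 2 lnL, where k is the number of free parameters and n the number of observations. Treating n as the number of aligned columns is an assumption made here for illustration, as are the example values.

```python
from math import log

def aic(lnL, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * lnL

def bic(lnL, n_params, n_obs):
    """Bayesian information criterion; lower is better."""
    return n_params * log(n_obs) - 2 * lnL

# Hypothetical fits of two models to a 1000-column alignment.
print(aic(-1234.5, 3), aic(-1228.1, 4))
print(bic(-1234.5, 3, 1000), bic(-1228.1, 4, 1000))
```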


Model choice considerations


• Consistency is only meaningful for the true (generating) model

• Model specification is THE most important issue

• Black-box model comparison procedures can support choices that are mechanistically invalid

• Models should be mechanistically coherent, interpretable and explain the data well


I will show

• Context dependent models warranted

• For context dependent models, model specification choices have profound consequences

• Not just about number of parameters

• One approach to an empirical check

DNA encodes information in a context dependent manner

Encoding proteins with DNA

• 20 aa are encoded by triplets of nucleotides (codons)

• There are three special “stop” codons

• changes to codons classified as

  • synonymous (syn) changes do not modify encoded aa

  • nonsynonymous (nsyn) changes do

  • nonsense changes create a stop codon
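The classification above can be sketched directly from the standard genetic code. The 64-character string below lists the encoded amino acids with codons ordered by base T, C, A, G at each position (`*` marks a stop codon). The function name and return labels are illustrative assumptions, and the function is intended for changes that start from a sense codon.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_change(codon_from, codon_to):
    """Classify a change between two codons that differ at one position."""
    n_diffs = sum(a != b for a, b in zip(codon_from, codon_to))
    if n_diffs != 1:
        return "identical" if n_diffs == 0 else "multiple differences"
    aa_from, aa_to = GENETIC_CODE[codon_from], GENETIC_CODE[codon_to]
    if aa_to == "*":
        return "nonsense"
    return "synonymous" if aa_from == aa_to else "nonsynonymous"

for pair in (("CTT", "CTC"), ("AAA", "ATA"), ("TGG", "TGA")):
    print(pair, classify_change(*pair))
```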


Modelling codon evolution

• Split alignments into non-overlapping trinucleotides and treat each such column as evolving independently


Readily tested in protein coding sequences

• Nonsynonymous substitutions can be affected by natural selection

• Synonymous substitutions do not modify the encoded amino acid and are presumed “neutral”

• The rate ratio (nsyn/syn), termed ω, is taken as an indicator of the mode of natural selection


[Figure: first pages of published examples, including a paper marked © 1988 Nature Publishing Group and Urwin, R., Holmes, E. C., Fox, A. J., Derrick, J. P., & Maiden, M. C. J. (2002). Phylogenetic evidence for frequent positive selection and recombination in the meningococcal surface antigen PorB. Mol. Biol. Evol. 19(10):1686-1694.]

and thousands more

pic from wikipedia


Modelling Contextual influences


Context dependent rate matrices

$$q(a,b) = \begin{cases} 0 & \text{more than one difference} \\ r(a,b)\,\pi_e & \text{otherwise} \end{cases}$$

• Multi-position changes disallowed

• Definition of $\pi_e$ distinguishes competing model forms

  • GY: frequency of ending tuple

  • MG: frequency of ending base

  • CNF: conditional frequency of ending base
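The three definitions of $\pi_e$ can be made concrete with a small sketch for the change AAA → ATA. Assumptions made here for illustration: tuple frequencies are supplied as a dictionary over all trinucleotides, the MG "frequency of ending base" is taken as the marginal frequency of that base at the changed position, and the example frequencies are a product of hypothetical monomer frequencies. The function names are not from the talk.

```python
from itertools import product

def product_tuple_freqs(pi, length=3):
    """Tuple frequencies as the product of monomer frequencies."""
    freqs = {}
    for bases in product(pi, repeat=length):
        f = 1.0
        for b in bases:
            f *= pi[b]
        freqs["".join(bases)] = f
    return freqs

def pi_e(model, start, end, tuple_freqs):
    """pi_e for a single-position change start -> end under GY, MG or CNF."""
    diffs = [k for k in range(len(start)) if start[k] != end[k]]
    if len(diffs) != 1:
        raise ValueError("multi-position changes are disallowed")
    pos = diffs[0]
    if model == "GY":   # frequency of the ending tuple
        return tuple_freqs[end]
    if model == "MG":   # frequency of the ending base at the changed position
        return sum(f for t, f in tuple_freqs.items() if t[pos] == end[pos])
    if model == "CNF":  # ending base conditional on the flanking context
        context = sum(
            f for t, f in tuple_freqs.items()
            if all(t[k] == end[k] for k in range(len(end)) if k != pos)
        )
        return tuple_freqs[end] / context
    raise ValueError(f"unknown model: {model}")

pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}  # hypothetical monomer frequencies
freqs = product_tuple_freqs(pi)
for model in ("GY", "MG", "CNF"):
    print(model, pi_e(model, "AAA", "ATA", freqs))
```

With product-form tuple frequencies the MG and CNF values coincide; they differ when neighbouring bases are not independent, which is the situation in real data.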


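Two state alphabet {R,Y}

$$\mathrm{K} = \kappa\,\pi(R)\,\pi(Y)$$

MG (states ordered RR, RY, YR, YY):

$$Q_{NF}(\kappa) = \begin{bmatrix}
- & \pi(Y) & \pi(Y) & 0 \\
\pi(R) & - & 0 & \kappa\pi(Y) \\
\pi(R) & 0 & - & \kappa\pi(Y) \\
0 & \kappa\pi(R) & \kappa\pi(R) & -
\end{bmatrix}$$

GY:

$$Q_{TF}(\mathrm{K}) = \begin{bmatrix}
- & \pi(R)\pi(Y) & \pi(Y)\pi(R) & 0 \\
\pi(R)\pi(R) & - & 0 & \mathrm{K}\pi(Y)\pi(Y) \\
\pi(R)\pi(R) & 0 & - & \mathrm{K}\pi(Y)\pi(Y) \\
0 & \mathrm{K}\pi(R)\pi(Y) & \mathrm{K}\pi(Y)\pi(R) & -
\end{bmatrix}$$

NB: GY here has multiplicative form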

Simulated F81 AT-rich seqs

[Figure: panels labelled MG and GY]

Lindsay, H., Yap, V. B., Ying, H., & Huttley, G. A. (2008). Pitfalls of the most commonly used models of context dependent substitution. Biol Direct, 3, 52.

• MG

  • π is multiplicative, meaning it’s the product of the monomer frequencies

  • to get to independent (monomer) processes, you remove context parameters

• GY

  • π is not multiplicative.

  • π is a more realistic representation of tuple frequencies in real data

  • to get to independent (monomer) processes, you add context parameters

The Conditional Nucleotide Frequency (CNF) model

Consider the exchange AAA → ATA. CNF: $\pi_e$ is the conditional probability of T given 5’-A•A-3’.

$$q(a,b) = \begin{cases} 0 & \text{more than one difference} \\ r(a,b)\,\pi_e & \text{otherwise} \end{cases}$$

Yap, V. B., Lindsay, H., Easteal, S., & Huttley, G. (2010). Estimates of the effect of natural selection on protein coding content. Molecular Biology and Evolution, 27(3), 726-34.

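Codon models

HKY form

$$q_{a,b} = \begin{cases}
0 & \text{more than 1 difference} \\
\pi_x & \text{synonymous transversion} \\
\pi_x \cdot \omega & \text{nonsynonymous transversion} \\
\pi_x \cdot \kappa & \text{synonymous transition} \\
\pi_x \cdot \kappa \cdot \omega & \text{nonsynonymous transition}
\end{cases}$$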
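The HKY-form codon rate can be sketched as a function of the two codons. Assumptions made here for illustration: the standard genetic code, `pi_x` passed in as a single number (how it is defined is exactly what separates GY, MG and CNF), changes to or from a stop codon given rate 0, and hypothetical parameter values. Only off-diagonal rates are covered; the function and variable names are not from the talk.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def codon_rate(codon_from, codon_to, pi_x, kappa, omega):
    """Off-diagonal rate q(a, b) between two codons in HKY form."""
    diffs = [(a, b) for a, b in zip(codon_from, codon_to) if a != b]
    if len(diffs) != 1:
        return 0.0  # more than 1 difference (identical codons are not handled here)
    aa_from, aa_to = GENETIC_CODE[codon_from], GENETIC_CODE[codon_to]
    if "*" in (aa_from, aa_to):
        return 0.0  # assumption: stop codons are excluded from the state space
    rate = pi_x
    if frozenset(diffs[0]) in TRANSITIONS:
        rate *= kappa   # transition
    if aa_from != aa_to:
        rate *= omega   # nonsynonymous
    return rate

pi_x, kappa, omega = 0.02, 3.0, 0.5   # hypothetical values
for pair in (("CTT", "CTA"), ("AAA", "ATA"), ("CTT", "CTC"), ("AAA", "AGA")):
    print(pair, codon_rate(*pair, pi_x, kappa, omega))
```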


[Figure: panels labelled Multiplicative and non-Multiplicative; AT-rich, AT≈GC and GC-rich; GY and MG]

Limits of Simulation

http://xkcd.com/221/

[Figure: panels labelled GY (tri, HKY), MG (tri, GTR) and CNF (tri, GTR)]

BUT does a model explain the data well?


How good are the models?

• Comparing models by LR tests or AIC

  • rubbish against rubbish?

• Do they explain the data well?!

• How can we evaluate this?

  • G-statistic (expected values computed from MLEs)

  • Comparison with Goldman’s best likelihood
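Both evaluation approaches can be sketched on counts of distinct alignment column patterns. The G-statistic is G = 2 Σ O ln(O/E). The "best" likelihood is taken here to be the unconstrained multinomial log-likelihood Σ n_i ln(n_i/N), in which each observed pattern is assigned its observed frequency. When the expected counts are N times the model's pattern probabilities, G equals twice the gap between the best and the model log-likelihoods. The counts and probabilities below are hypothetical, and the function names are illustrative assumptions.

```python
from math import log

def g_statistic(observed, expected):
    """G = 2 * sum(O * ln(O / E)); patterns with zero observed count contribute 0."""
    return 2.0 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

def best_log_likelihood(observed):
    """Unconstrained multinomial log-likelihood of the observed pattern counts."""
    total = sum(observed)
    return sum(o * log(o / total) for o in observed if o > 0)

def model_log_likelihood(observed, probs):
    """Log-likelihood of the counts given model pattern probabilities."""
    return sum(o * log(p) for o, p in zip(observed, probs) if o > 0)

observed = [10, 20]       # hypothetical counts of two column patterns
probs = [0.5, 0.5]        # hypothetical pattern probabilities from fitted MLEs
expected = [sum(observed) * p for p in probs]

g = g_statistic(observed, expected)
gap = 2.0 * (best_log_likelihood(observed) - model_log_likelihood(observed, probs))
print(g, gap)
```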


Summary

• Multitude of evolutionary models published

• Underlying rate matrices share assumptions that are not well supported

• Model comparison approaches amongst these do not address the fundamental issue of how well the data are described by the model

• Metrics of tree support are meaningful only if the model explains the data well

• More on this by Dr Ben Kaehler on Friday

Ben, Cam, Steph, Bob, Åsa, Yicheng, Hardip, Jackie, Aaron, Gavin

VB Yap (Nat. Uni. Singapore), H Lindsay (ETS Zurich), H Ying (CSIRO), Peter Maxwell (NZ)

Funding from ARC, NHMRC, BPA