Extracting Information From DNA Sequences Using Models of Sequence Evolution - Gavin Huttley
© 2013 Gavin Huttley
Extracting information from DNA sequences using models of sequence evolution
Gavin Huttley
John Curtin School of Medical Research Australian National University
Overview
• Darwin meets Mendel meets …
• Felsenstein applies Fisher
• Markov processes of sequence evolution
• Measuring model support
Mendel's underpinning
• Population of 1 self-pollinating plant
• Dies at reproduction
• Founder is A/G at one locus
• What is the probability the second generation population is homozygous at this locus?
http://www.flickr.com/photos/22281745@N04/2149169348/
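Offspring genotype probabilities from a selfing A/G plant:

         A/A    A/G    G/G
Gen 1    0.25   0.5    0.25
Gen 2    0.25   0.5    0.25   (offspring of an A/G plant from Gen 1)

$\mathrm{Prob(fixed)} = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}$
$\mathrm{Prob(not\ fixed\ gen\ 1)} = \frac{1}{2}$
$\mathrm{Prob(fixed\ gen\ 2)} = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$
The probability of being fixed by generation 2 is then
$\mathrm{Prob(fixed\ by\ gen\ 2)} = \frac{1}{2} + \frac{1}{4} = \frac{3}{4}$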
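The same arithmetic can be carried forward to any generation. The sketch below assumes the set-up on the slide (a single selfing plant, founder A/G); the function name `prob_fixed_by` is made up for illustration.

```python
from fractions import Fraction


def prob_fixed_by(generation):
    """Probability the selfing lineage is homozygous by `generation`."""
    prob_het = Fraction(1)    # the founder is A/G with certainty
    prob_fixed = Fraction(0)
    for _ in range(generation):
        # an A/G parent yields A/A or G/G with probability 1/4 + 1/4 = 1/2
        prob_fixed += prob_het * Fraction(1, 2)
        prob_het *= Fraction(1, 2)
    return prob_fixed


for gen in (1, 2, 3):
    print(gen, prob_fixed_by(gen))
```

Each generation halves the chance of still being heterozygous, so the probability of fixation by generation n is 1 - (1/2)^n.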
“[Ne is] the number of individuals in a theoretically ideal population having the same magnitude of random genetic drift as the actual population.”
Hartl & Clark, 2007, p 121
Random genetic drift (RGD) summary
• The probability of fixation is just the allele frequency, which for a new mutant is initially 1/(2Ne)
• The expected time to fixation of a new neutral mutant (given that it fixes) is 4Ne generations
• Big populations have more variation than small ones
• Future allele frequencies depend only on the current population frequency, not past frequencies
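The first and last points can be checked numerically. This is a minimal sketch that assumes the standard Wright-Fisher model (binomial sampling of 2N gene copies each generation), which the slide does not name; the function names are illustrative only.

```python
from math import comb


def wright_fisher_matrix(two_n):
    """One-generation transition probabilities between allele counts 0..2N."""
    rows = []
    for i in range(two_n + 1):
        p = i / two_n  # current allele frequency is all that matters (Markov)
        rows.append([comb(two_n, j) * p**j * (1 - p) ** (two_n - j)
                     for j in range(two_n + 1)])
    return rows


def fixation_probability(two_n, start_count, generations=2000):
    """Probability mass on allele count 2N after many generations of drift."""
    matrix = wright_fisher_matrix(two_n)
    dist = [0.0] * (two_n + 1)
    dist[start_count] = 1.0
    for _ in range(generations):
        dist = [sum(dist[i] * matrix[i][j] for i in range(two_n + 1))
                for j in range(two_n + 1)]
    return dist[two_n]


# a new mutant is one copy among 2N = 10
print(fixation_probability(10, 1))
```

The transition matrix depends only on the current count, which is the Markov property in the last bullet; the fixation probability it yields equals the starting frequency.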
Number of new mutations = $2N\mu$
Prob(fixation of a new mutant) = $\frac{1}{2N}$
Rate of fixation = $2N\mu \times \frac{1}{2N} = \mu$
So population size has no effect!!
What does neutral mean?
• That a genetic variant is ‘invisible’ to natural selection. (Hence, selectively neutral.)
• The evolutionary dynamics (changes in frequency) are dictated by random genetic drift and mutation only.
• “functionally less important molecules or parts of a molecule evolve faster than more important ones”
• Say only hydrophilic amino acids are allowed at a specific position
• In this case, mutations that produce non-hydrophilic amino acids will be eliminated by natural selection, so the fixation probability, and hence the substitution rate, will be lower
[Figure: amino acids shown by one-letter code: R, H, K, D, E, S, T, N, Q, C, U, G, P]
What happened in populations is what we see between species, i.e. polymorphism and substitution are related.
[Figure: allele frequency (y-axis, 0 to 1) against time (x-axis) for four cases: Positive Natural Selection, Negative Natural Selection, Balancing Natural Selection, Neutral Evolution]
Summary
• The Neutral Theory serves as the null hypothesis for evolutionary analysis
• Darwin's filter as a basis for predicting biological significance
• Slow evolution is taken as evidence of functional importance
• In other words, inferring selection requires quantifying a neutral process
“The likelihood, L(H|R), of the hypothesis H given the data R, and a specific model, is proportional to P(R|H), the constant of proportionality being arbitrary.”
Edwards 1992
Utility of phylogenetic techniques
• identify relationships among sequences
• understand divergence mechanisms
The phylogenetic “hypothesis”
• The tree topology
• Representation of sequence divergence
• Substitution matrices — P(n) in the figure — for each branch specifying probabilities of change from one sequence state to another
• Ancestral state frequencies
Likelihood for 3 sequences
[Figure: tree joining three sequences to unobserved ancestral states, with branches labelled P(t1), P(t2), P(t3)]
For this alignment column, the likelihood the ancestral base was A
$L(A) = \pi_A \times p_{A,A}(t_1) \times p_{A,G}(t_2) \times p_{A,C}(t_3)$
The full likelihood is
$L_1 = L(A) + L(G) + L(C) + L(T)$
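A minimal sketch of this calculation for one alignment column. The slide does not fix a substitution model here, so the code assumes Jukes-Cantor (JC69) transition probabilities and equal base frequencies; the branch lengths and function names are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"


def jc69_prob(a, b, t):
    """JC69 probability that base a becomes base b along a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def column_likelihood(tips, lengths, pi=None):
    """Sum over the unobserved ancestral base of pi * product of branch terms."""
    pi = pi or {base: 0.25 for base in BASES}
    total = 0.0
    for ancestor in BASES:
        term = pi[ancestor]
        for tip, t in zip(tips, lengths):
            term *= jc69_prob(ancestor, tip, t)
        total += term
    return total


lengths = (0.1, 0.2, 0.3)  # hypothetical t1, t2, t3
print(column_likelihood("AGC", lengths))

# sanity check: the likelihoods of all 64 possible columns sum to 1
print(sum(column_likelihood(col, lengths) for col in product(BASES, repeat=3)))
```

Summing over the four possible ancestral bases is what removes the dependence on the unobserved state.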
Some Assumptions
• edges are independent
• the same tree holds for all nucleotides
• + assumptions of substitution model
Likelihood & consistency
• Consistency (convergence of an estimate to the true parameter value) occurs by addition of aligned columns, i.e. longer alignments
• Chang, J. T. (1996). Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Mathematical Biosciences, 137(1), 51-73.
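Q for F81/HKY85/GTR
$P(t_i) = \exp(Q_i t_i)$
$$Q = \begin{bmatrix}
- & r_{A \leftrightarrow C}\pi_C & r_{A \leftrightarrow G}\pi_G & r_{A \leftrightarrow T}\pi_T \\
r_{A \leftrightarrow C}\pi_A & - & r_{C \leftrightarrow G}\pi_G & r_{C \leftrightarrow T}\pi_T \\
r_{A \leftrightarrow G}\pi_A & r_{C \leftrightarrow G}\pi_C & - & r_{G \leftrightarrow T}\pi_T \\
r_{A \leftrightarrow T}\pi_A & r_{C \leftrightarrow T}\pi_C & r_{G \leftrightarrow T}\pi_G & -
\end{bmatrix}$$
$r_{i \leftrightarrow j}$: exchangeability term
$\pi_i$: probability of base i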
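A sketch of building Q from exchangeabilities and base frequencies and then computing P(t) = exp(Qt). It assumes each diagonal entry is set so the row sums to zero (the usual convention for a rate matrix), and it uses a plain scaling-and-squaring Taylor series rather than a library routine; the function names and example frequencies are illustrative.

```python
import math
from itertools import combinations

ORDER = "ACGT"


def build_q(exch, pi):
    """q[i][j] = r(i<->j) * pi[j] off the diagonal; rows sum to zero."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(ORDER):
        for j, b in enumerate(ORDER):
            if i != j:
                q[i][j] = exch[frozenset((a, b))] * pi[b]
        q[i][i] = -sum(q[i])
    return q


def matmul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(q, t, squarings=10, terms=12):
    """exp(Q t): Taylor series on Q t / 2**squarings, then repeated squaring."""
    n = len(q)
    scaled = [[q[i][j] * t / 2**squarings for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in matmul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


# F81 is the special case with every exchangeability equal (here 1.0)
pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
f81 = {frozenset(pair): 1.0 for pair in combinations(ORDER, 2)}
p = expm(build_q(f81, pi), t=1.0)
# for this F81 parameterisation, p(A->A) has the closed form e^-t + (1 - e^-t) * pi_A
print(p[0][0], math.exp(-1.0) + (1 - math.exp(-1.0)) * pi["A"])
```

HKY85 and GTR differ only in which exchangeability terms are allowed to differ from one another.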
Typical assumptions
• positions evolve iid
• time homogeneous (embeddable)
• reversible
• stationary
Comparing nested models
• Tree topology must be the same between null and alternate models
• Processes in null and alternate models must be nested, e.g. HKY and GTR
• THEN (typically), the conventional likelihood ratio test can be employed and related to the χ² distribution
$LR = 2(\ln L_1 - \ln L_0)$
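A sketch of the test as arithmetic. The log-likelihoods below are invented, and `chi2_sf` is a hand-written helper (a series for the regularised incomplete gamma function) so the block needs no third-party library. The degrees of freedom are the number of extra free parameters in the alternate model; 4 is assumed here for HKY against GTR with base frequencies handled identically in both.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= half_x / (a + n)
        total += term
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def likelihood_ratio_test(lnL_null, lnL_alt, df):
    """Return (LR, p-value) with LR = 2 (lnL1 - lnL0)."""
    lr = 2.0 * (lnL_alt - lnL_null)
    return lr, chi2_sf(lr, df)


# hypothetical fitted log-likelihoods
lr, p_value = likelihood_ratio_test(lnL_null=-2510.4, lnL_alt=-2503.1, df=4)
print(lr, p_value)
```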
Comparing non-nested models
• Also use a LR statistic
• The probability of observing a LR statistic of equal or greater value by chance under the null hypothesis is estimated using a parametric bootstrap procedure in which data are simulated under the fitted null model
• Goldman N. (1993) J. Mol. Evol. 36:2, 182-98
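The bootstrap loop itself is short. In this sketch `simulate` and `lr_statistic` are hypothetical callables that the user would supply: the first draws one data set from the fitted null model, the second refits both models to that data set and returns the LR statistic. The p-value is taken as the plain proportion of simulated statistics at least as large as the observed one, which is one common choice.

```python
import random


def parametric_bootstrap_pvalue(observed_lr, simulate, lr_statistic,
                                n_reps=100, seed=0):
    """Estimate P(LR >= observed) using data simulated under the fitted null."""
    rng = random.Random(seed)
    simulated = [lr_statistic(simulate(rng)) for _ in range(n_reps)]
    exceed = sum(lr >= observed_lr for lr in simulated)
    return exceed / n_reps


# toy stand-ins so the sketch runs; real versions would simulate an alignment
# under the fitted null model and refit both models to it
def toy_simulate(rng):
    return [rng.random() for _ in range(20)]


def toy_lr(data):
    return sum(data)


print(parametric_bootstrap_pvalue(12.0, toy_simulate, toy_lr))
```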
Other comparison approaches
• information criteria (AIC, BIC)
• estimates of a parameter of interest
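For reference, the two criteria written out with their standard definitions. Taking the number of observations to be the alignment length is an assumption, and the log-likelihoods and parameter counts are invented.

```python
import math


def aic(lnL, num_params):
    """Akaike information criterion: 2k - 2 lnL (smaller is better)."""
    return 2 * num_params - 2 * lnL


def bic(lnL, num_params, num_obs):
    """Bayesian information criterion: k ln(n) - 2 lnL (smaller is better)."""
    return num_params * math.log(num_obs) - 2 * lnL


# hypothetical fits of two models to a 1000-column alignment
for name, lnL, k in (("model 0", -2510.4, 5), ("model 1", -2503.1, 9)):
    print(name, aic(lnL, k), bic(lnL, k, num_obs=1000))
```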
• Consistency is meaningful for the true (generating) model
• Model specification is THE most important issue
• Black-box model comparison procedures can support choices that are mechanistically invalid
• Models should be mechanistically coherent, interpretable and explain the data well
I will show
• Context dependent models are warranted
• For context dependent models, model specification choices have profound consequences
• Not just about number of parameters
• One approach to an empirical check
Encoding proteins with DNA
• 20 aa are encoded by triplets of nucleotides (codons)
• There are three special “stop” codons
• Changes to codons are classified as
  • synonymous (syn) changes do not modify the encoded aa
  • nonsynonymous (nsyn) changes do
  • nonsense changes create a stop codon
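The classification can be sketched with a lookup table. This assumes the standard genetic code, written in the usual compact TCAG ordering; the function name is illustrative.

```python
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# codon -> one-letter amino acid, "*" marks a stop codon
GENETIC_CODE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_change(codon_from, codon_to):
    """Label a codon change as synonymous, nonsynonymous or nonsense."""
    aa_from, aa_to = GENETIC_CODE[codon_from], GENETIC_CODE[codon_to]
    if aa_to == "*":
        return "nonsense"
    return "synonymous" if aa_from == aa_to else "nonsynonymous"


print(classify_change("CTT", "CTC"))  # Leu -> Leu
print(classify_change("AAA", "GAA"))  # Lys -> Glu
print(classify_change("TGG", "TGA"))  # Trp -> stop
```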
Modelling codon evolution
• Split alignments into non-overlapping trinucleotides and treat each such column as evolving independently
Readily tested in protein coding sequences
• Nonsynonymous substitutions can be affected by natural selection
• Synonymous substitutions do not modify the encoded amino acid and are presumed “neutral”
• The rate ratio (nsyn/syn), termed ω, is taken as an indicator of the mode of natural selection
[Figure: page image marked “© 1988 Nature Publishing Group”]
[Figure: first page of Urwin, R., Holmes, E. C., Fox, A. J., Derrick, J. P., & Maiden, M. C. J. (2002). Phylogenetic Evidence for Frequent Positive Selection and Recombination in the Meningococcal Surface Antigen PorB. Mol. Biol. Evol. 19(10):1686–1694.]
and thousands more
pic from wikipedia
Context dependent rate matrices
$$q(a,b) = \begin{cases} 0 & \text{more than one difference} \\ r(a,b)\,\pi_e & \text{otherwise} \end{cases}$$
• Multi-position changes disallowed
• Definition of πe distinguishes competing model forms
  • GY: frequency of ending tuple
  • MG: frequency of ending base
  • CNF: conditional frequency of ending base
Two state alphabet {R,Y}
MG
$$Q_{NF}(\kappa) = \begin{bmatrix}
- & \pi(Y) & \pi(Y) & 0 \\
\pi(R) & - & 0 & \kappa\pi(Y) \\
\pi(R) & 0 & - & \kappa\pi(Y) \\
0 & \kappa\pi(R) & \kappa\pi(R) & -
\end{bmatrix}$$
GY
$$Q_{TF}(K) = \begin{bmatrix}
- & \pi(R)\pi(Y) & \pi(Y)\pi(R) & 0 \\
\pi(R)\pi(R) & - & 0 & K\pi(Y)\pi(Y) \\
\pi(R)\pi(R) & 0 & - & K\pi(Y)\pi(Y) \\
0 & K\pi(R)\pi(Y) & K\pi(Y)\pi(R) & -
\end{bmatrix}$$
$K = \kappa\,\pi(R)/\pi(Y)$
NB: GY here has multiplicative form
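A numerical check of how the two matrices relate. The sketch assumes the states are ordered RR, RY, YR, YY, that the context parameter applies to exchanges into or out of YY (as the non-zero entries suggest), that the diagonals make rows sum to zero, and that K = κ π(R)/π(Y). Under those assumptions the tuple-frequency matrix is π(R) times the nucleotide-frequency matrix, so the two describe the same process up to a change of time scale.

```python
STATES = ("RR", "RY", "YR", "YY")


def build_q(pi, context_param, tuple_freq):
    """Dinucleotide rate matrix over {R,Y}; one position may change at a time."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(STATES):
        for j, b in enumerate(STATES):
            diffs = [k for k in range(2) if a[k] != b[k]]
            if len(diffs) != 1:
                continue  # identical states or multi-position change
            if tuple_freq:
                pi_e = pi[b[0]] * pi[b[1]]  # GY: frequency of ending tuple
            else:
                pi_e = pi[b[diffs[0]]]      # MG: frequency of ending base
            scale = context_param if "YY" in (a, b) else 1.0
            q[i][j] = scale * pi_e
        q[i][i] = -sum(q[i])
    return q


pi = {"R": 0.3, "Y": 0.7}        # hypothetical frequencies
kappa = 2.0
big_k = kappa * pi["R"] / pi["Y"]
q_nf = build_q(pi, kappa, tuple_freq=False)
q_tf = build_q(pi, big_k, tuple_freq=True)
print(all(abs(q_tf[i][j] - pi["R"] * q_nf[i][j]) < 1e-12
          for i in range(4) for j in range(4)))
```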
Simulated F81 AT-rich seqs
[Figure labels: MG, GY]
Lindsay, H., Yap, V. B., Ying, H., & Huttley, G. A. (2008). Pitfalls of the most commonly used models of context dependent substitution. Biol Direct, 3, 52.
• MG
  • π is multiplicative, meaning it’s the product of the monomer frequencies*
  • to get to an independent (monomer) process you remove context parameters
• GY
  • π is not multiplicative
  • π is a more realistic representation of tuple frequencies in real data
  • to get to an independent (monomer) process you add context parameters
The Conditional Nucleotide Frequency (CNF) model
$$q(a,b) = \begin{cases} 0 & \text{more than one difference} \\ r(a,b)\,\pi_e & \text{otherwise} \end{cases}$$
Consider the exchange AAA → ATA
CNF: πe is the conditional probability of T given 5’-A•A-3’.
Yap, V. B., Lindsay, H., Easteal, S., & Huttley, G. (2010). Estimates of the effect of natural selection on protein coding content. Molecular Biology and Evolution, 27(3), 726-34.
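The three definitions of πe can be put side by side for the AAA → ATA exchange. In this sketch the tuple frequencies are hypothetical and built as products of monomer frequencies, and the MG term is taken to be the marginal frequency of the ending base at the changed position; both are assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"


def pi_e(start, end, tuple_freqs, form):
    """Frequency term for a single-position exchange under GY, MG or CNF."""
    diffs = [k for k in range(len(start)) if start[k] != end[k]]
    if len(diffs) != 1:
        raise ValueError("exactly one position must differ")
    pos = diffs[0]
    if form == "GY":   # frequency of the ending tuple
        return tuple_freqs[end]
    if form == "MG":   # frequency of the ending base at the changed position
        return sum(f for t, f in tuple_freqs.items() if t[pos] == end[pos])
    if form == "CNF":  # ending base conditional on its flanking context
        context_total = sum(tuple_freqs[end[:pos] + b + end[pos + 1:]]
                            for b in BASES)
        return tuple_freqs[end] / context_total
    raise ValueError("form must be 'GY', 'MG' or 'CNF'")


monomer = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}  # hypothetical, AT-rich
tuple_freqs = {"".join(t): monomer[t[0]] * monomer[t[1]] * monomer[t[2]]
               for t in product(BASES, repeat=3)}
for form in ("GY", "MG", "CNF"):
    print(form, pi_e("AAA", "ATA", tuple_freqs, form))
```

With product-form tuple frequencies, as here, the MG and CNF terms coincide; they separate when tuple frequencies depart from the product of monomer frequencies.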
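Codon models
HKY form
$$q_{a,b} = \begin{cases}
0 & \text{more than 1 difference} \\
\pi_x & \text{synonymous transversion} \\
\pi_x \cdot \omega & \text{nonsynonymous transversion} \\
\pi_x \cdot \kappa & \text{synonymous transition} \\
\pi_x \cdot \kappa \cdot \omega & \text{nonsynonymous transition}
\end{cases}$$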
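A sketch of this rate function. It assumes the standard genetic code, reads κ as the transition/transversion parameter and ω as the nonsynonymous/synonymous ratio, takes the frequency term as a plain argument (its definition depends on the GY, MG or CNF form), and gives stop codons a rate of zero, which is a common convention rather than something stated on the slide.

```python
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
GENETIC_CODE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
PURINES = {"A", "G"}


def codon_rate(codon_a, codon_b, pi_x, kappa, omega):
    """Instantaneous rate of codon_a -> codon_b in the HKY-style codon model."""
    if "*" in (GENETIC_CODE[codon_a], GENETIC_CODE[codon_b]):
        return 0.0  # assumption: stop codons are outside the state space
    diffs = [(x, y) for x, y in zip(codon_a, codon_b) if x != y]
    if len(diffs) != 1:
        return 0.0  # identical codons or more than 1 difference
    x, y = diffs[0]
    rate = pi_x
    if (x in PURINES) == (y in PURINES):
        rate *= kappa  # transition: purine<->purine or pyrimidine<->pyrimidine
    if GENETIC_CODE[codon_a] != GENETIC_CODE[codon_b]:
        rate *= omega  # nonsynonymous change
    return rate


# hypothetical parameter values
print(codon_rate("AAA", "AAG", pi_x=0.1, kappa=2.0, omega=0.5))  # syn transition
print(codon_rate("AAA", "AAC", pi_x=0.1, kappa=2.0, omega=0.5))  # nsyn transversion
```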
How good are the models?
• Comparing models by LR tests or AIC
  • rubbish against rubbish?
• Do they explain the data well?!
• How can we evaluate this?
  • G-statistic (expected counts computed from MLEs)
  • Comparison with Goldman’s best likelihood
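The G-statistic itself is a one-liner, G = 2 Σ O ln(O/E). In this sketch the observed counts would be counts of alignment column patterns and the expected counts would come from the fitted model; the numbers shown are invented.

```python
import math


def g_statistic(observed, expected):
    """G = 2 * sum(O * ln(O / E)); categories with O = 0 contribute nothing."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


observed = [10, 20, 0]         # hypothetical pattern counts
expected = [14.0, 15.0, 1.0]   # hypothetical counts expected under the MLEs
print(g_statistic(observed, expected))
```

A value near zero means the fitted model reproduces the observed counts closely; larger values indicate a poorer description of the data.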
Summary
• Multitude of evolutionary models published
• Underlying rate matrices share assumptions that are not well supported
• Model comparison approaches amongst these do not address the fundamental issue of how well the data are described by the model
• Metrics of tree support are meaningful only if the model explains the data well
• More on this by Dr Ben Kaehler on Friday