Monte Carlo methods for estimating population genetic parameters
Rasmus Nielsen
University of Copenhagen
Outline
Idiosyncratic history and background on ML estimation of demographic parameters based on DNA sequence data.
A new computational approach/modification.
Idiosyncratic history and background on ML estimation of demographic parameters based on SNP data.
Ascertainment and large scale SNP data sets.
Felsenstein’s Equation
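$$\Pr(X \mid \Theta) = \int \Pr(X \mid G)\, p(G \mid \Theta)\, dG = E_{G \mid \Theta}\left[\Pr(X \mid G)\right]$$
So
$$\Pr(X \mid \Theta) \approx \frac{1}{k} \sum_{i=1}^{k} \Pr(X \mid G_i)$$
where $G_i$, $i = 1, 2, \ldots, k$, has been simulated from $p(G \mid \Theta)$.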
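The averaging step can be sketched on a toy model that is not from the slides: assume two sequences, let the genealogy G be summarized by its total length g in expected mutations, take p(g | θ) to be exponential with mean θ, and take Pr(X | g) to be Poisson for s observed segregating sites. Under these assumptions the exact likelihood is θ^s / (1 + θ)^(s+1), which gives something to compare the estimate against. All function names are illustrative.

```python
import math
import random

def poisson_pmf(s, mean):
    """Pr(X = s | G): probability of s mutations on a genealogy of length `mean`."""
    return math.exp(-mean) * mean ** s / math.factorial(s)

def mc_likelihood(s, theta, k=200_000, seed=1):
    """Estimate Pr(X | theta) as the average of Pr(X | G_i), with G_i ~ p(G | theta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(k):
        g = rng.expovariate(1.0 / theta)  # toy assumption: tree length ~ Exp(mean theta)
        total += poisson_pmf(s, g)
    return total / k

def exact_likelihood(s, theta):
    """Closed form for the toy model, used only as a check."""
    return theta ** s / (1.0 + theta) ** (s + 1)

if __name__ == "__main__":
    print(mc_likelihood(3, 2.0), exact_likelihood(3, 2.0))
```

For s = 3 and θ = 2 the exact value is 8/81. In realistic problems most simulated genealogies have Pr(X | G) close to zero, which is why the estimator's coefficient of variation is a concern.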
Coefficient of Variation
[Figure: coefficient of variation (C.V., 0 to 1) plotted against sample size (0 to 10).]
Importance Sampling
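$$\Pr(X \mid \Theta) = \int \Pr(X \mid G)\, p(G \mid \Theta)\, dG = \int \frac{\Pr(X \mid G)\, p(G \mid \Theta)}{h(G)}\, h(G)\, dG = E_{h}\left[\frac{\Pr(X \mid G)\, p(G \mid \Theta)}{h(G)}\right]$$
So
$$\Pr(X \mid \Theta) \approx \frac{1}{k} \sum_{i=1}^{k} \frac{\Pr(X \mid G_i)\, p(G_i \mid \Theta)}{h(G_i)}$$
where $G_i$, $i = 1, 2, \ldots, k$, has been simulated from $h(G)$.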
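A minimal sketch of the importance sampling estimator, again on an assumed toy model rather than anything from the slides: the genealogy is reduced to a tree length g, p(g | θ) is exponential with mean θ, and Pr(X | g) is Poisson for s segregating sites. The proposal h(g) is taken, purely for illustration, to be the same exponential family at a fixed value `theta0`, so that one set of simulated genealogies yields the likelihood at several values of θ.

```python
import math
import random

def poisson_pmf(s, mean):
    """Pr(X = s | G) for a genealogy of total length `mean`."""
    return math.exp(-mean) * mean ** s / math.factorial(s)

def exp_pdf(g, theta):
    """Toy prior p(G | theta): exponential density with mean theta."""
    return math.exp(-g / theta) / theta

def is_likelihood_curve(s, thetas, theta0=2.0, k=200_000, seed=2):
    """Estimate Pr(X | theta) for each theta from genealogies drawn from h(G) = p(G | theta0)."""
    rng = random.Random(seed)
    sums = [0.0] * len(thetas)
    for _ in range(k):
        g = rng.expovariate(1.0 / theta0)      # G_i ~ h(G)
        h = exp_pdf(g, theta0)
        px = poisson_pmf(s, g)
        for j, theta in enumerate(thetas):
            sums[j] += px * exp_pdf(g, theta) / h  # importance weight times Pr(X | G_i)
    return [total / k for total in sums]

if __name__ == "__main__":
    for theta, est in zip([1.0, 2.0, 3.0], is_likelihood_curve(3, [1.0, 2.0, 3.0])):
        print(theta, est, theta ** 3 / (1.0 + theta) ** 4)
```

The estimate degrades as θ moves away from `theta0`, because the weights become more variable; the choice of h(G) is the whole design problem.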
Griffiths and Tavare
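Recursion
$$\Pr(X \mid \Theta) = \sum_{X'_{mut}} p(mut \mid \Theta)\, \Pr(X' \mid \Theta)\, p(X \mid X', mut) + \sum_{X'_{coal}} p(coal \mid \Theta)\, \Pr(X' \mid \Theta)\, p(X \mid X', coal)$$
Simulate mutation (coalescent) from
$$\frac{p(mut \mid \Theta)\, p(X \mid X', mut)}{\sum_{X'_{mut}} p(mut \mid \Theta)\, p(X \mid X', mut) + \sum_{X'_{coal}} p(coal \mid \Theta)\, p(X \mid X', coal)}$$
and correct using importance sampling.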
Example (Nielsen 1998)
• Infinite sites model
• Estimation of T
• Estimation of population phylogenies
Integro-recursion: ugliest equation ever published in a biological journal…
MLE: T=1.8 (36,000 years)
Data from the Caribbean Hawksbill Turtle
MCMC
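Set up a Markov chain on the state space of all supported values of $\Theta$ and $G$, with stationary distribution $p(\Theta, G \mid X)$. Now since
$$p(\Theta, G \mid X) \propto \Pr(X \mid G)\, p(G \mid \Theta)\, p(\Theta)$$
this can easily be done using Metropolis-Hastings sampling, i.e. updates to $\Theta$ and $G$ are proposed from a proposal distribution $q(\Theta, G \to \Theta', G')$ and accepted with probability
$$\min\left\{1,\ \frac{P(X \mid G')\, P(G' \mid \Theta')\, q(\Theta', G' \to \Theta, G)}{P(X \mid G)\, P(G \mid \Theta)\, q(\Theta, G \to \Theta', G')}\right\}$$
(the prior ratio $p(\Theta')/p(\Theta)$ does not appear, which is exact for a uniform prior on $\Theta$).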
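The Metropolis-Hastings scheme can be sketched on a toy model that is an assumption of this example, not the model in the talk: G is a single tree length g with p(g | θ) exponential with mean θ, Pr(X | g) is Poisson for s segregating sites, and θ has a uniform prior on (0, theta_max]. The proposal is a symmetric uniform random walk on both θ and g, so the q terms cancel in the acceptance ratio. Step sizes and names are illustrative.

```python
import math
import random

def log_target(theta, g, s, theta_max):
    """log of Pr(X | G) p(G | theta) p(theta), up to an additive constant."""
    if not (0.0 < theta <= theta_max) or g <= 0.0:
        return -math.inf  # outside the support: always rejected
    return s * math.log(g) - g - math.log(theta) - g / theta

def metropolis_hastings(s, theta_max=10.0, n_iter=200_000, seed=3):
    """Sample (theta, G) jointly; return the theta values after burn-in."""
    rng = random.Random(seed)
    theta, g = 1.0, float(s + 1)
    current = log_target(theta, g, s, theta_max)
    samples = []
    for _ in range(n_iter):
        theta_new = theta + rng.uniform(-4.0, 4.0)  # symmetric proposal for theta
        g_new = g + rng.uniform(-2.0, 2.0)          # symmetric proposal for G
        proposed = log_target(theta_new, g_new, s, theta_max)
        # accept with probability min(1, target ratio); q cancels by symmetry
        if rng.random() < math.exp(min(0.0, proposed - current)):
            theta, g, current = theta_new, g_new, proposed
        samples.append(theta)
    return samples[n_iter // 10:]  # discard the first 10% as burn-in

if __name__ == "__main__":
    draws = metropolis_hastings(3)
    print(sum(draws) / len(draws))
```

The output is a list of θ draws, so any posterior density or likelihood curve has to be built from them with a histogram or other smoother.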
Some problems…
• Histogram estimator or other smoothing must be used.
• Likelihood ratios hard to estimate (e.g. M=0).
A new method
• It is possible to calculate the marginal prior probability of a genealogy
$$P(G) = \int P(G \mid \Theta)\, P(\Theta)\, d\Theta$$
• It turns out that this math is doable for most components of Θ, such as θ and M.
• Then we can sample from the marginal posterior of G using the previously discussed MCMC procedures.
Slide inspired by Jody Hey
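$$P(G \mid X) = \frac{P(X \mid G)\, P(G)}{P(X)} \propto P(X \mid G)\, P(G)$$
We then recover the posterior for $\Theta$ using
$$P(\Theta \mid X) = \int P(\Theta \mid G)\, P(G \mid X)\, dG$$
Approximated by
$$P(\Theta \mid X) \approx \frac{1}{k} \sum_{i=1}^{k} P(\Theta \mid G_i) = \frac{1}{k} \sum_{i=1}^{k} \frac{P(G_i \mid \Theta)\, P(\Theta)}{P(G_i)}$$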
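A minimal sketch of this two-step idea under assumptions chosen only to make the integral tractable: G is a single tree length g with p(g | θ) exponential with mean θ, Pr(X | g) is Poisson for s segregating sites, and θ has an inverse-gamma prior with hypothetical shape `a` and scale `b`. Then P(G) = a b^a / (b + g)^(a+1) and P(θ | G) is inverse-gamma with shape a + 1 and scale b + g, so the chain runs over g alone and the posterior for θ is an average of smooth densities.

```python
import math
import random

def log_post_g(g, s, a, b):
    """log P(G | X) up to a constant: Pr(X | G) times the marginal prior P(G)."""
    if g <= 0.0:
        return -math.inf
    return s * math.log(g) - g - (a + 1) * math.log(b + g)

def sample_genealogies(s, a, b, n_iter=100_000, seed=4):
    """Random-walk Metropolis on G only; theta has been integrated out."""
    rng = random.Random(seed)
    g = float(s + 1)
    current = log_post_g(g, s, a, b)
    out = []
    for _ in range(n_iter):
        g_new = g + rng.uniform(-2.0, 2.0)  # symmetric proposal
        proposed = log_post_g(g_new, s, a, b)
        if rng.random() < math.exp(min(0.0, proposed - current)):
            g, current = g_new, proposed
        out.append(g)
    return out[n_iter // 10:]  # discard burn-in

def inv_gamma_pdf(theta, shape, scale):
    """Inverse-gamma density, here playing the role of P(theta | G)."""
    return math.exp(shape * math.log(scale) - math.lgamma(shape)
                    - (shape + 1) * math.log(theta) - scale / theta)

def posterior_density(theta, genealogies, a, b):
    """P(theta | X) approximated by (1/k) * sum_i P(theta | G_i)."""
    total = sum(inv_gamma_pdf(theta, a + 1, b + g) for g in genealogies)
    return total / len(genealogies)

if __name__ == "__main__":
    gs = sample_genealogies(3, 2.0, 2.0)
    for theta in (0.5, 1.0, 2.0, 4.0):
        print(theta, posterior_density(theta, gs, 2.0, 2.0))
```

Because `posterior_density` can be evaluated at any θ, the resulting curve is smooth and can be handed to an optimizer or used for likelihood ratios, at the cost of evaluating P(G) in every iteration.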
Advantages
• Eliminates problems with covariance between parameters leading to mixing problems.
• Provides a smooth posterior/likelihood function useful for optimization and likelihood ratio estimation.
Disadvantages
• Requires more calculation in each MCMC iteration
Likelihood ratio estimation
6 loci, 15 gene copies, H0: m1=m2
Other approaches
• Kuhner and Felsenstein use a combination of MCMC and importance sampling to estimate surfaces (no prior for the parameters).
• PAC methods suggested by Stephens and Donnelly sample from a close approximation to
$$p(G \mid X, \Theta) = \frac{\Pr(X \mid G)\, p(G \mid \Theta)}{\Pr(X \mid \Theta)}$$
to estimate an approximate likelihood.
• ABC (Beaumont, Pritchard, Tavare and others) methods are a very popular and promising class of methods based on (1) reducing the data to summary statistics, (2) simulating new data from the prior, and (3) accepting the parameter value under which the data were simulated if the difference between simulated and true statistics is less than a small tolerance.
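The three ABC steps listed above can be sketched as a rejection sampler. Everything specific here is an assumption for illustration: independent two-sequence loci, a tree length per locus that is exponential with mean θ, Poisson mutation counts, a uniform prior on (0, theta_max], the total number of segregating sites as the summary statistic, and an integer tolerance `tol`.

```python
import math
import random

def poisson_draw(rng, mean):
    """Draw a Poisson variate by multiplying uniforms (Knuth's method)."""
    limit = math.exp(-mean)
    count = 0
    prod = rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def simulate_total(rng, theta, n_loci):
    """Step 2: simulate data under theta and reduce it to the summary statistic."""
    total = 0
    for _ in range(n_loci):
        g = rng.expovariate(1.0 / theta)  # toy genealogy: tree length ~ Exp(mean theta)
        total += poisson_draw(rng, g)     # mutations on that genealogy
    return total

def abc_rejection(observed_total, n_loci, theta_max=10.0, tol=3,
                  n_sims=20_000, seed=5):
    """Step 3: keep theta when simulated and observed summaries differ by at most tol."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_sims):
        theta = theta_max * (1.0 - rng.random())  # prior draw in (0, theta_max]
        if abs(simulate_total(rng, theta, n_loci) - observed_total) <= tol:
            accepted.append(theta)
    return accepted

if __name__ == "__main__":
    draws = abc_rejection(observed_total=40, n_loci=20)
    print(len(draws), sum(draws) / len(draws))
```

The accepted values approximate the posterior given the summary statistic rather than the full data, and a smaller tolerance trades acceptance rate for accuracy.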
SNP Data
Nielsen and Slatkin (2000)
A more efficient method: Griffiths and Tavare (1998), Nielsen (2000)
Ascertainment Sample vs. Typed Sample
[Figure: diagram contrasting the ascertainment sample with the typed sample.]
[Figure: SNP frequency spectrum for n = 20, d = 4, #SNPs = 1000. x-axis: x (1 to 19); y-axis: frequency (0.00 to 0.30). Series: true frequencies, observed frequencies.]
[Figure: E[D'] (0.5 to 1.0) plotted against 2Nc (0 to 10). Series: no ascertainment bias, ascertainment bias.]
Correcting for ascertainment biases
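Now, for simplicity, consider the case without a sweep, then
$$L(P) = \prod_i \frac{\Pr(X_i = x_i, Asc \mid P)}{\Pr(Asc \mid P)} = \prod_i \frac{p_{x_i} \Pr(Asc \mid X_i = x_i)}{\Pr(Asc \mid P)}$$
where (in the simplest possible case)
$$\Pr(Asc \mid X_i = x_i) = 1 - \frac{\binom{x_i}{d} + \binom{n - x_i}{d}}{\binom{n}{d}}$$
and
$$\Pr(Asc \mid P) = \sum_{j=1}^{n-1} p_j \Pr(Asc \mid X_i = j)$$
In this simple case, the maximum likelihood estimate of $P$ is simply given by
$$\hat{p}_k = \frac{n_k}{\Pr(Asc \mid X = k)} \left[ \sum_{j=1}^{n-1} \frac{n_j}{\Pr(Asc \mid X = j)} \right]^{-1}, \quad k = 1, 2, \ldots, n - 1,$$
where $n_k$ is the number of SNPs with allele frequency $k$.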
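The correction in the simplest case can be sketched directly from these formulas. The sketch assumes, as the ascertainment probability itself does, that a SNP is ascertained when it is variable in a subsample of d of the n typed chromosomes, with d at least 2; function names are illustrative.

```python
import math

def asc_prob(x, n, d):
    """Pr(Asc | X = x): a SNP with x copies among n is variable in a subsample of size d."""
    # math.comb returns 0 when the subsample is larger than the class it is drawn from
    return 1.0 - (math.comb(x, d) + math.comb(n - x, d)) / math.comb(n, d)

def corrected_spectrum(counts, n, d):
    """Maximum likelihood estimate of p_k for k = 1..n-1, where counts[k-1] is n_k."""
    weighted = [counts[k - 1] / asc_prob(k, n, d) for k in range(1, n)]
    total = sum(weighted)
    return [w / total for w in weighted]

if __name__ == "__main__":
    n, d = 20, 4
    # Hypothetical true spectrum proportional to 1/k, then the expected observed counts
    true = [1.0 / k for k in range(1, n)]
    norm = sum(true)
    true = [t / norm for t in true]
    observed = [1000.0 * p * asc_prob(k, n, d) for k, p in zip(range(1, n), true)]
    print(corrected_spectrum(observed, n, d)[:3], true[:3])
```

With expected counts as input the correction returns the true spectrum exactly; with real counts the low-frequency classes, which are divided by the smallest ascertainment probabilities, carry the most noise.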
Selective sweeps:
Similarly define L(P, D, ·) = Pr(X | P, D, ·, Asc)
[Figure: true, observed, and corrected frequencies for 10,000 simulated SNPs with n = 20 and d = 5. x-axis: 1 to 19; y-axis: 0 to 0.3.]
Hudson’s (2001) Estimator when n = 100, m = 5, = 5, and #SNP pairs = 200.
[Figure b: corrected and uncorrected estimates (0 to 0.7) plotted against 2Nc (2 to 10).]
Complications
• Double-hit ascertainment (HapMap)
• Ascertainment based on chimpanzee (HapMap)
• Panel depth may vary among SNPs and/or among regions (HapMap).
• Ascertainment method may vary among SNPs (HapMap).
• Population structure (HapMap).
• Loss of information regarding asc. scheme (HapMap??).
HapMap ascertainment depth distribution (ignores many important components)
[Figure: distribution of ascertainment depth. x-axis: depth (2 to 21); y-axis: proportion (0 to 0.3). Labels: Perlegen, HapMap.]
Data
Directly sequenced polymorphism data from 20 European-Americans, 19 African-Americans and one chimpanzee from 9,316 protein coding genes.
Data set previously described in Bustamante, C.D. et al. 2005. Natural selection on protein-coding genes in the human genome. Nature 437, 1153-7.
Demographic model
[Figure: demographic model for European-Americans and African-Americans, with a bottleneck, population growth, migration, and admixture.]
Estimation
$$L(\Theta) = \prod_{j=1}^{n-1} p_j(\Theta)^{n_j}$$
$p_j(\Theta)$: sampling probabilities from the 2D frequency spectrum
$n_j$: number of SNPs with pattern $j$ in the 2D frequency spectrum
SNPs within a gene are correlated, but the estimator is consistent. The estimate has the same properties as a real likelihood estimator except that it converges slightly slower because of the correlation (Wiuf 2006).
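A minimal sketch of maximizing this product of sampling probabilities. The one-parameter family `toy_spectrum`, with p_j proportional to j^(-alpha), is a hypothetical stand-in for the demographic model, whose real sampling probabilities would come from the 2D frequency spectrum; the grid search is likewise only for illustration.

```python
import math

def composite_log_likelihood(probs, counts):
    """log L = sum_j n_j * log p_j, the log of prod_j p_j^(n_j)."""
    return sum(n * math.log(p) for p, n in zip(probs, counts) if n > 0)

def toy_spectrum(alpha, n_patterns):
    """Hypothetical model: p_j proportional to j^(-alpha) for j = 1..n_patterns."""
    weights = [j ** -alpha for j in range(1, n_patterns + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def grid_mle(counts, grid):
    """Return the grid value of alpha with the highest composite log-likelihood."""
    return max(grid, key=lambda alpha: composite_log_likelihood(
        toy_spectrum(alpha, len(counts)), counts))

if __name__ == "__main__":
    # Counts generated as exact expectations under alpha = 1.5 (illustrative only)
    counts = [1000.0 * p for p in toy_spectrum(1.5, 19)]
    grid = [0.5 + 0.1 * i for i in range(21)]
    print(grid_mle(counts, grid))
```

Treating SNPs as independent is what makes this a composite likelihood; the point estimate remains consistent, but naive confidence intervals from it would be too narrow when SNPs within a gene are correlated.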
African-Americans
[Figure: simulated vs. observed allele frequency spectrum. x-axis: allele frequency (0 to 35); y-axis: % (0.00 to 0.35).]
European-Americans
[Figure: simulated vs. observed allele frequency spectrum. x-axis: allele frequency (0 to 40); y-axis: % (0.00 to 0.45).]
Goodness-of-fit: p = 0.6
Acknowledgements
Jody Hey, John Wakeley, Melissa Hubisz, Andy Clark, Carlos Bustamante, Scott Williamson, Aida Andres, Amit Andip, Adam Boyko, Anders Albrechtsen, Mark Adams, Michelle Cargill and other staff at Celera Genomics and Applied Biosystems.