ch. 4 © 2003 Broman, Churchill, Yandell, Zeng 1
4 Interval Mapping for a Single QTL
• basic idea of interval mapping
• interval mapping by maximum likelihood
– maximum likelihood using EM, MCMC
• Bayesian interval mapping
– "natural" Bayesian priors
– multiple imputation, MCMC
• bootstrapped variance estimates
• advantages & shortcomings of IM
• Haley-Knott regression approximation
4.1 basic idea of interval mapping
• study properties of likelihood at each possible QTL
– treating QTL as missing data
– assuming only a single QTL (for now)
• recall likelihood as mixture over unknown QTL
– likelihood = product of sum of products
– complicated to evaluate--requires iteration

L(θ,λ | Y,X) = pr(Y | X,θ,λ) = prodi pr(Yi | Xi,θ,λ)
             = prodi sumQ pr(Q | Xi,λ) pr(Yi | Q,θ)
uncertainty in QTL genotype Q
• how to improve guess on Q with data, parameters?
– prior recombination: pr(Q | Xi, λ)
– posterior recombination: pr(Q | Yi, Xi, θ, λ)
• main philosophies for assessing likelihood
– maximum likelihood: study peak(s)
– Bayesian analysis: study whole shape
• implementation methodologies
– Expectation-Maximization (EM)
– Markov chain Monte Carlo (MCMC)
– multiple imputation
– genetic algorithms, GEE, …
posterior on QTL genotypes
• full conditional of Q given data, parameters
– proportional to prior pr(Q | Xi, λ)
• weight toward Q that agrees with flanking markers
– proportional to likelihood pr(Yi | Q,θ)
• weight toward Q so that group mean GQ ≈ Yi
• phenotype and flanking markers may conflict
– posterior recombination balances these two weights

pr(Q | Yi, Xi, θ, λ) = pr(Q | Xi, λ) pr(Yi | Q,θ) / pr(Yi | Xi, θ, λ)
how does phenotype Y affect Q?
[Figure: phenotype (bp, 90–120) by genotype at flanking markers D4Mit41/D4Mit214 (AA/AA, AB/AA, AA/AB, AB/AB)]
what are probabilities for genotype Q between markers?
recombinants AA:AB all 1:1 if ignore Y
and if we use Y?
maximum likelihood (ML) idea
• pick QTL locus λ (usually scan whole genome)
• find ML estimates of gene action θ given λ
• maximum likelihood at peak of likelihood
– slope (derivative with respect to θ) is zero
– sometimes maximum is at a boundary (non-zero slope)
• slope is weighted average using posteriors for Q
– cannot write estimate in "closed form"
– need to know θ to estimate it!
– iterate toward the maximum in some clever way

d log L(θ,λ | Y,X)/dθ = sumi sumQ pr(Q | Yi, Xi, θ, λ) d log(pr(Yi | Q,θ))/dθ
Bayesian model posterior
• augment data (Y,X) with unknowns Q
– study unknowns (θ,λ,Q) given data (Y,X)
– Q ~ pr(Q | Yi, Xi, θ, λ)
• no longer need weighted average over Q
– instead we average over Q to study parameters
– pr(θ,λ | Y,X) = sumQ pr(θ,λ,Q | Y,X)
• study properties of posterior
– need to specify priors for (θ,λ)
– denominator is very difficult to compute in practice
– drawing samples from posterior in some clever way

pr(θ,λ,Q | Y,X) = pr(Y | Q,θ) pr(Q | X,λ) pr(λ) pr(θ) / pr(Y | X)
4.2 interval mapping by ML
• search whole genome for putative QTL
• "profile" likelihood across all possible λ
– find ML estimate of θ given λ
– ML estimate of (θ,λ) at maximum over genome

L(θ̂0 | Y) = prodi f(Yi | μ̂0, s0²)    (pooled: no QTL)
L(θ̂,λ | Y,X) = prodi sumQ pr(Q | Xi,λ) f(Yi | ĜQ, σ̂²)
LOD(λ) = log10 L(θ̂,λ | Y,X) − log10 L(θ̂0 | Y)
LOD for hyper dataset
[Figure: LOD profile (0–8) across the 19 autosomes + X chromosome; highest LOD on chromosome 4; other QTL?]
LOD(λ) on chr 4 of hyper
[Figure: LOD (0–8) against map position (0–60 cM) on chromosome 4, comparing EM ("exact"), Haley-Knott regression, and single imputation; all agree at the peak and mostly at markers; note marker spacing]
EM method for interval mapping
• fix a possible QTL λ
• iterate between expectation & maximization
– likelihood increases with each iteration
– stop iterating when the change is "negligible"
• initial values
– PQi = pr(Q | Xi, λ): recombination model in the absence of data
– or use Haley-Knott regression estimates of θ
EM method for interval mapping
• E-step: estimate posterior recombination
– PQi = pr(Q | Yi, Xi, θ, λ)
– estimate for every individual i, genotype Q
– depends on effects θ
• M-step: maximize likelihood for θ
– may be many parameters
– technical point: caution on parallel updates
– solve system of equations: derivatives set to zero
– depends on PQi

0 = sumi sumQ PQi d log(pr(Yi | Q,θ))/dθ
4.2.2 M-step for normal phenotype
• Y = GQ + e, e ~ N(0,σ²)
• pr(Y | Q,θ) = f(Y | GQ, σ²)
• see notes in book for derivative details
• M-step estimates (given E-step posteriors PQi):

ĜQ = sumi PQi Yi / sumi PQi
σ̂² = sumi,Q PQi (Yi − ĜQ)² / n
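The E- and M-steps above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: the function name, the two-genotype setup, and the starting values are ours; `prior` holds the recombination-model probabilities pr(Q | Xi, λ) from flanking markers.

```python
import numpy as np

def em_interval_map(y, prior, max_iter=200, tol=1e-8):
    """EM for a single QTL at a fixed locus lambda (illustrative sketch).

    y     : (n,) phenotypes
    prior : (n, k) recombination-model probabilities pr(Q | X_i, lambda)
    returns (G_hat, sigma2_hat, loglik)
    """
    n, k = prior.shape
    # crude starting values: spread genotype means around the grand mean
    G = y.mean() + np.linspace(-1, 1, k) * y.std()
    sigma2 = y.var()
    old_ll = -np.inf
    for _ in range(max_iter):
        # E-step: posterior P_Qi = pr(Q | Y_i, X_i, theta, lambda)
        dens = (np.exp(-0.5 * (y[:, None] - G[None, :]) ** 2 / sigma2)
                / np.sqrt(2 * np.pi * sigma2))
        mix = prior * dens
        lik = mix.sum(axis=1)              # per-individual mixture likelihood
        P = mix / lik[:, None]
        # M-step: weighted genotype means and pooled variance
        G = (P * y[:, None]).sum(axis=0) / P.sum(axis=0)
        sigma2 = (P * (y[:, None] - G[None, :]) ** 2).sum() / n
        ll = np.log(lik).sum()             # nondecreasing under EM
        if ll - old_ll < tol:              # stop when change is "negligible"
            break
        old_ll = ll
    return G, sigma2, ll
```

With informative genotype priors and a clear effect, the weighted means recover the group means well; the per-iteration log likelihood is the natural diagnostic for the "likelihood increases with each iteration" property.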
4.2.3 ML via MCMC
• basic idea of simulated annealing
– start with non-informative priors on (θ,λ)
– sample from posterior (somehow…)
– gradually shrink priors toward ML estimate
• slight difficulty
– need to know (θ,λ) to sample from posterior
– iteration leads to Markov chain
• point of this section
– MCMC does not imply a Bayesian perspective!
4.3 Bayesian interval mapping
• sample missing genotypes Q
• decouple effects θ from QTL λ
• but Q depends on (θ,λ) and vice versa
• also need to specify priors

λ ~ pr(λ | X,Q) ∝ pr(λ | X) pr(Q | X,λ)
Qi ~ pr(Qi | Yi, Xi, θ, λ)
θ ~ pr(θ | Y,Q) ∝ pr(θ) pr(Y | Q,θ)
4.3.1 Bayesian priors for QTL
• locus λ may be uniform over genome
– pr(λ | X) = 1 / length of genome
• missing genotypes Q
– pr(Q | X, λ)
– recombination model is formally a prior
• effects θ = (G,σ²), G = (GQQ, GQq, Gqq)
– conjugate priors for normal phenotype
– GQ ~ N(μ, κσ²)
– σ² ~ inverse-χ²(ν,τ²), or ντ²/σ² ~ χ²ν
effect of prior variance on posterior
[Figure: normal prior (solid black), posterior for n = 1 (dotted blue), posterior for n = 5 (dashed red), true mean (green arrow); panels for κ = 0.5, 1, 2]
details of phenotype priors
• priors depend on "hyper-parameters"
• GQ ~ N(μ, κσ²)
– center around phenotype grand mean
– κσ² ≈ σG² = genetic variance
– κ ≈ σG²/σ² = h²/(1−h²)
– h² = σG²/(σG² + σ²) = heritability
• σ² ~ inverse-χ²(ν,τ²), or ντ²/σ² ~ χ²ν
– τ² ≈ s² = total sample variance
– ν = prior degrees of freedom = small integer
Bayes for normal data
Y = G + E, posterior for single individual
environ E ~ N(0,σ²), σ² known
likelihood pr(Y | G,σ²) = N(Y | G,σ²)
prior pr(G | σ²,μ,κ) = N(G | μ, κσ²)
posterior N(G | μ + B1(Y − μ), B1σ²)
Yi = G + Ei, posterior for sample of n individuals:

pr(G | Y,σ²,μ,κ) = N(G | μ + Bn(Ȳ − μ), Bn σ²/n)
with Ȳ = sumi Yi / n, Bn = nκ/(nκ + 1) → 1

shrinkage weights Bn go to 1
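The shrinkage form of this posterior is easy to check numerically; a minimal sketch (the helper name is ours, not from the text):

```python
import numpy as np

def normal_posterior(y, mu, kappa, sigma2):
    """Posterior for G with prior G ~ N(mu, kappa*sigma2), known sigma2:
    N(mu + B_n*(ybar - mu), B_n*sigma2/n), B_n = n*kappa/(n*kappa + 1)."""
    y = np.asarray(y, float)
    n = y.size
    B = n * kappa / (n * kappa + 1.0)   # shrinkage weight -> 1 as n grows
    return mu + B * (y.mean() - mu), B * sigma2 / n

# one observation Y = 4, kappa = 1: B_1 = 1/2, so the posterior mean sits
# halfway between the prior mean 0 and the observation
m, v = normal_posterior([4.0], mu=0.0, kappa=1.0, sigma2=1.0)
# m == 2.0, v == 0.5
```

As κ grows (diffuse prior) or n grows, B approaches 1 and the posterior mean approaches the sample mean Ȳ, matching the "shrinkage weights Bn go to 1" remark.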
posterior by QT genetic value
Y = GQ + E, genetic Q = qq, Qq, QQ
environ E ~ N(0,σ²), σ² known
parameters θ = (G,σ²)
likelihood pr(Y | Q,G,σ²) = N(Y | GQ,σ²)
prior pr(GQ | σ²,μ,κ) = N(GQ | μ, κσ²)
posterior:

pr(GQ | Y,Q,σ²,μ,κ) = N(GQ | μ + BnQ(ȲQ − μ), BnQ σ²/nQ)
with nQ = count{i: Qi = Q}, ȲQ = sum{i: Qi = Q} Yi / nQ, BnQ = nQκ/(nQκ + 1) → 1
Empirical Bayes: choosing hyper-parameters
How do we choose hyper-parameters μ, κ?
Empirical Bayes: marginalize over prior, estimate μ, κ from marginal posterior

likelihood pr(Yi | Qi,G,σ²) = N(Yi | G(Qi),σ²)
prior pr(GQ | σ²,μ,κ) = N(GQ | μ, κσ²)
marginal pr(Yi | σ²,μ,κ) = N(Yi | μ, (κ+1)σ²)

estimates: μ̂ = Ȳ, s² = sumi (Yi − Ȳ)² / n, κ̂ + 1 = s²/σ² ≈ 1/(1 − h²)

EB posterior: pr(GQ | Y) = N(GQ | Ȳ + BnQ(ȲQ − Ȳ), BnQ σ̂²/nQ)
What if variance σ² is unknown?
• recall that sample variance is proportional to chi-square
– pr(s² | σ²) = χ²(ns²/σ² | n)
– or equivalently, ns²/σ² | σ² ~ χ²n
• conjugate prior is inverse chi-square
– pr(σ² | ν,τ²) = inv-χ²(σ² | ν,τ²)
– or equivalently, ντ²/σ² | ν,τ² ~ χ²ν
– empirical choice: τ² = s²/3, ν = 6
• E(σ² | ν,τ²) = s²/2, Var(σ² | ν,τ²) = s⁴/4
• posterior given data
– pr(σ² | Y,ν,τ²) = inv-χ²(σ² | ν+n, (ντ² + ns²)/(ν+n))
– weighted average of prior and data
joint effects posterior details
Yi = G(Qi) + Ei, genetic Qi = qq, Qq, QQ
environ E ~ N(0,σ²)
parameters θ = (G,σ²)
likelihood pr(Yi | Qi,G,σ²) = N(Yi | G(Qi),σ²)
prior pr(GQ | σ²,μ,κ) = N(GQ | μ, σ²/κ)
      pr(σ² | ν,τ²) = inv-χ²(σ² | ν,τ²)
posterior:

pr(GQ | Y,Q,σ²,μ,κ) = N(GQ | μ + BnQ(ȲQ − μ), BnQ σ²/nQ), with BnQ = nQ/(nQ + κ)
pr(σ² | Y,Q,ν,τ²) = inv-χ²(σ² | ν+n, (ντ² + ns²)/(ν+n)),
with s² = sumi (Yi − G(Qi))² / n
4.3.2 Bayesian multiple imputation
• basic idea
– impute multiple copies of missing genotypes Q
• sample Q ~ pr(Q | X, λ)
• weighted to appear as draws from posterior
– average out gene effects θ
– study posterior for putative QTL λ
• most effective for multiple QTL
– use single QTL to introduce idea
– consider all loci as possible QTL
• sample on grid Λ of "pseudomarkers" (every 2 cM)
• similar to interval map scan of whole genome
importance sampling idea
• draw samples from one distribution
– Q1, Q2, Q3, …, Qn ~ f(Q)
• weight them appropriately by ω(Q)
• sample summaries from distribution g(Q)
– g(Q) = f(Q)ω(Q) / constant
– mean for f = sumi Qi / n
– mean for g = sumi Qi ω(Qi) / sumi ω(Qi)
example: mean copies of Q

genotype    qq      Qq      QQ      sum
Q copies     0       1       2
true g     0.25     0.5    0.25     1.0
draw f     1/3      1/3     1/3     1.0
weight ω     1       2       1
f×ω        1/3      2/3     1/3     4/3

importance sampling: g = f × ω × 0.75

sample      30      30      40      100
mean f    0×30    1×30    2×40     110/100 = 1.1
mean g  0×1×30  1×2×30  2×1×40     140/130 ≈ 1.08
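The arithmetic in this worked example can be reproduced directly; a small Python check, with labels as in the table:

```python
counts = {"qq": 30, "Qq": 30, "QQ": 40}   # 100 draws from f (uniform 1/3 each)
copies = {"qq": 0, "Qq": 1, "QQ": 2}      # number of Q copies per genotype
weight = {"qq": 1, "Qq": 2, "QQ": 1}      # omega, proportional to g / f

n = sum(counts.values())
# unweighted mean: average copies under the sampling distribution f
mean_f = sum(counts[g] * copies[g] for g in counts) / n
# self-normalized importance-sampling mean: average copies under g
mean_g = (sum(counts[g] * weight[g] * copies[g] for g in counts)
          / sum(counts[g] * weight[g] for g in counts))
# mean_f == 110/100 = 1.1, mean_g == 140/130 ≈ 1.08
```

Dividing by the sum of the weights (self-normalization) is what lets the unknown constant in g(Q) = f(Q)ω(Q)/constant cancel.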
what are appropriate weights?
• ideally draw genotype from posterior
– want sample Q ~ g(Q) = sumθ pr(Q | Y,X,θ,λ) pr(θ)
– but have sample Q ~ f(Q) = pr(Q | X,λ)
• appropriate weights
– ω(Q,λ | Y,X) = pr(λ | X) sumθ pr(Y | Q,θ) pr(θ)
• estimate marginal posterior for QTL λ
– draw N samples from prior at each QTL λ
Q1, Q2, Q3, …, QN ~ pr(Q | X,λ)
pr(λ | Y,X) = sumQ ω(Q,λ | Y,X) pr(Q | X,λ) / constant
            ≈ sumj ω(Qj,λ | Y,X) / constant
– constant is summed over all λ, but not actually needed
relating weights to posterior
• posterior is simply averaged over θ
• weights comprise all terms except pr(Q | X,λ)
• estimating weights: see Sen & Churchill

pr(λ,Q | Y,X) = sumθ pr(λ,θ,Q | Y,X)
             = pr(Q | X,λ) pr(λ | X) sumθ pr(Y | Q,θ) pr(θ) / pr(Y | X)
             = pr(Q | X,λ) ω(Q,λ | Y,X) / pr(Y | X)
estimating effects via imputation
• multiple imputation averages over effects
• difficult to study posterior of effects directly
• can estimate usual summaries

E(θ | Y,X) = sumQ,λ E(θ | Y,Q) pr(Q,λ | Y,X)
           = sumQ,λ E(θ | Y,Q) pr(Q | X,λ) ω(Q,λ | Y,X) / pr(Y | X)
           ≈ sumj E(θ | Y,Qj) ω(Qj,λ | Y,X) / constant
4.3.3 Bayesian MCMC
• Markov chain Monte Carlo
– Monte Carlo samples along a Markov chain
• What is a Markov chain?
• What is MCMC?
– Sampling from full conditionals
– Gibbs sampler, Metropolis-Hastings
What is a Markov chain?
• future given present is independent of past
• update chain based on current value
– can make chain arbitrarily complicated
– chain converges to stable pattern π() we wish to study

[Diagram: two-state chain on {0,1}; move 0→1 with prob p (stay with 1−p), move 1→0 with prob q (stay with 1−q)]
stationary distribution: pr(1) = p/(p+q)
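A quick simulation shows the long-run fraction of time in state 1 approaching p/(p+q); the values of p and q below are illustrative, not from the slides:

```python
import numpy as np

p, q = 0.2, 0.3                     # 0 -> 1 with prob p, 1 -> 0 with prob q
rng = np.random.default_rng(0)
state, visits1, steps = 0, 0, 200_000
for _ in range(steps):
    if state == 0:
        state = 1 if rng.random() < p else 0
    else:
        state = 0 if rng.random() < q else 1
    visits1 += state
frac = visits1 / steps              # long-run fraction of time in state 1
# frac is close to p / (p + q) = 0.4
```

The next state depends only on the current one, yet the time-average converges to the stationary pattern π(1) = p/(p+q) regardless of the starting state.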
Markov chain idea
[Figure: the same two-state chain (p, q) with pr(1) = p/(p+q), plus a realization of the state (0/1) over time 0–30]
Markov chain Monte Carlo
• can study arbitrarily complex models
– need only specify how parameters affect each other
– can reduce to specifying full conditionals
• construct Markov chain with "right" model
– joint posterior of unknowns as limiting "stable" distribution
– update unknowns given data and all other unknowns
• sample from full conditionals
• cycle at random through all parameters
– next step depends only on current values
• nice Markov chains have nice properties
– sample summaries make sense
– consider almost as random sample from distribution
– ergodic theorem and all that stuff
Markov chain Monte Carlo idea
[Figure: target density pr(θ|Y) over θ = 0–10]
have posterior pr(θ|Y), want to draw samples
propose θ ~ pr(θ|Y) (ideal: Gibbs sample)
propose new θ "nearby": accept if more probable, toss coin if less probable, based on relative heights (Metropolis-Hastings)
MCMC realization
[Figure: MCMC sequence of θ over 10,000 steps and resulting histogram overlaid on pr(θ|Y)]
added twist: occasionally propose from whole domain
marginal posteriors
• joint posterior
– pr(λ,Q,θ | Y,X) = pr(θ) pr(λ) pr(Q|X,λ) pr(Y|Q,θ) / constant
• genetic effects
– pr(θ | Y,X) = sumQ pr(θ | Y,Q) pr(Q | Y,X)
• QTL locus
– pr(λ | Y,X) = sumQ pr(λ | X,Q) pr(Q | Y,X)
• QTL genotypes more complicated
– pr(Q | Y,X) = sumλ,θ pr(Q | Y,X,λ,θ) pr(λ,θ | Y,X)
– impossible to separate λ and θ in sum

observed: X, Y; missing: Q; unknown: λ, θ
Why not Ordinary Monte Carlo?
• independent samples of joint distribution
• chaining (or peeling) of effects
– pr(θ | Y,Q) = pr(GQ | Y,Q,σ²) pr(σ² | Y,Q)
• possible analytically here given genotypes Q
• Monte Carlo: draw N samples from posterior
– sample variance σ²
– sample genetic values GQ given variance σ²
• but we know markers X, not genotypes Q!
– would have messy average over possible Q
– pr(θ | Y,X) = sumQ pr(θ | Y,Q) pr(Q | Y,X)
MCMC Idea for QTLs
• construct Markov chain around posterior
– want posterior as stable distribution of Markov chain
– in practice, the chain tends toward stable distribution
• initial values may have low posterior probability
• burn-in period to get chain mixing well
• update components from full conditionals
– update effects θ given genotypes & traits
– update locus λ given genotypes & marker map
– update genotypes Q given traits, marker map, locus & effects

(θ,λ,Q)1 → (θ,λ,Q)2 → … → (θ,λ,Q)N ~ pr(θ,λ,Q | Y,X)
sample from full conditionals
• hard to sample from joint posterior
• update each unknown given all others
• examine posterior: keep terms with the unknown
• normalizing denominator makes it a distribution

λ ~ pr(λ | X,Q) ∝ pr(λ | X) pr(Q | X,λ)
Qi ~ pr(Qi | Yi, Xi, θ, λ)
θ ~ pr(θ | Y,Q) ∝ pr(θ) pr(Y | Q,θ)
sample from full conditionals for model with m QTL
• hard to sample from joint posterior
– pr(λ,Q,θ | Y,X) = pr(θ) pr(λ) pr(Q|X,λ) pr(Y|Q,θ) / constant
• easy to sample parameters from full conditionals
– full conditional for genetic effects
• pr(θ | Y,X,λ,Q) = pr(θ | Y,Q) = pr(θ) pr(Y|Q,θ) / constant
– full conditional for QTL locus
• pr(λ | Y,X,θ,Q) = pr(λ | X,Q) = pr(λ) pr(Q|X,λ) / constant
– full conditional for QTL genotypes
• pr(Q | Y,X,λ,θ) = pr(Q|X,λ) pr(Y|Q,θ) / constant

observed: X, Y; missing: Q; unknown: λ, θ
Gibbs sampler idea
• want to study two correlated normals
• could sample directly from bivariate normal
• Gibbs sampler:
– sample each from its full conditional
– pick order of sampling at random
– repeat N times

(θ1,θ2) | μ,ρ ~ N((μ1,μ2), correlation ρ)
θ1 | θ2, μ, ρ ~ N(μ1 + ρ(θ2 − μ2), 1 − ρ²)
θ2 | θ1, μ, ρ ~ N(μ2 + ρ(θ1 − μ1), 1 − ρ²)
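This Gibbs sampler is only a few lines in Python: each coordinate is drawn from its full conditional in turn. The function below is our illustrative sketch for the standard bivariate normal (unit variances) described above:

```python
import numpy as np

def gibbs_bvn(mu1, mu2, rho, n_samples, seed=0):
    """Gibbs sampler for a bivariate normal with unit variances and
    correlation rho: alternate draws from the two full conditionals."""
    rng = np.random.default_rng(seed)
    t1, t2 = mu1, mu2
    out = np.empty((n_samples, 2))
    sd = np.sqrt(1.0 - rho ** 2)           # conditional standard deviation
    for i in range(n_samples):
        t1 = rng.normal(mu1 + rho * (t2 - mu2), sd)   # theta1 | theta2
        t2 = rng.normal(mu2 + rho * (t1 - mu1), sd)   # theta2 | theta1
        out[i] = t1, t2
    return out

samples = gibbs_bvn(0.0, 0.0, rho=0.6, n_samples=5000)
```

With ρ = 0.6 as in the figure on the next slide, the sample correlation settles near 0.6; the stronger the correlation, the more slowly the chain traverses the joint distribution.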
Gibbs sampler samples: ρ = 0.6
[Figure: trace plots (Markov chain index vs Gibbs means 1 and 2) and scatterplots of mean 1 vs mean 2, for N = 50 and N = 200 samples]
Gibbs Sampler: effects & genotypes
• for given locus λ, can sample effects θ and genotypes Q
– effects parameter vector θ = (G,σ²) with G = (Gqq, GQq, GQQ)
– missing genotype vector Q = (Q1, Q2, …, Qn)
• Gibbs sampler: update one at a time via full conditionals
– randomly select order of unknowns
– update each given current values of all others, locus λ and data (Y,X)
• sample variance σ² given Y, Q and genetic values G
• sample genotype Qi given markers Xi and locus λ
– can do block updates if more efficient
• sample all genetic values G given Y, Q and variance σ²
phenotype model: alternate form
• genetic value GQ in "cell means" form easy
• but often useful to model effects directly
– sort out additive and dominance effects
– useful for reduced models with multiple QTL
• QTL main effects and interactions (pairwise, 3-way, etc.)
• we only consider additive effects here
– Gqq = μ − a, GQq = μ, GQQ = μ + a
• recoding for regression model
– Qi = −1 for genotype qq
– Qi = 0 for genotype Qq
– Qi = 1 for genotype QQ
– G(Qi) = μ + aQi
MCMC run of mean & additive
[Figure: trace plots (MCMC run/100, 0–1000) and frequency histograms for the mean (≈ 9.6–10.4) and the additive effect (≈ 1.8–2.2)]
MCMC run for variance
[Figure: trace plot (MCMC run, 0–1000) and frequency histogram for the variance (≈ 1.0–2.0)]
missing marker data
• sample missing marker data a la QT genotypes
• full conditional for missing markers depends on
– flanking markers
– possible flanking QTL
• can explicitly decompose by individual i
– binomial (or trinomial) probability

pr(Xik = aa, Aa, or AA | Yi, Xi, θ, Qi, λ) = pr(Xik | Xi, Qi, λ)
Metropolis-Hastings idea
• want to study distribution f(θ)
• take Monte Carlo samples
– unless too complicated
• Metropolis-Hastings samples:
– current sample value θ
– propose new value θ* from some distribution g(θ,θ*)
• Gibbs sampler: g(θ,θ*) = f(θ*)
– accept new value with prob A
• Gibbs sampler: A = 1

A = min( 1, f(θ*) g(θ*,θ) / [f(θ) g(θ,θ*)] )

[Figure: target density f(θ) and proposal density g(θ − θ*)]
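For a symmetric proposal g (a random walk), g(θ,θ*) = g(θ*,θ) and the acceptance probability reduces to min(1, f(θ*)/f(θ)). A minimal random-walk Metropolis sketch; the target density and step size below are illustrative choices of ours:

```python
import numpy as np

def metropolis(logf, theta0, step, n_samples, seed=0):
    """Random-walk Metropolis: propose theta* "nearby", accept with
    probability min(1, f(theta*)/f(theta)), else keep the current value."""
    rng = np.random.default_rng(seed)
    theta, out = theta0, np.empty(n_samples)
    for i in range(n_samples):
        prop = theta + rng.normal(0.0, step)          # propose nearby
        if np.log(rng.random()) < logf(prop) - logf(theta):
            theta = prop                              # accept
        out[i] = theta                                # reject keeps current
    return out

# target f proportional to N(3, 1): logf up to an additive constant
draws = metropolis(lambda t: -0.5 * (t - 3.0) ** 2, theta0=0.0,
                   step=1.0, n_samples=20_000)
```

Note that f only enters through ratios, so it need not be normalized, which is exactly why M-H is usable when the normalizing constant of the posterior is intractable.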
Metropolis-Hastings samples
[Figure: MCMC sequences of θ and histograms against pr(θ|Y), for N = 200 and N = 1000 samples, with narrow and wide proposal g]
full conditional for locus
• cannot easily sample from locus full conditional
pr(λ | Y,X,θ,Q) = pr(λ | X,Q) = pr(λ) pr(Q|X,λ) / constant
• cannot explicitly determine full conditional
– difficult to normalize
– need to average over all possible genotypes over entire map
• Gibbs sampler will not work
– but can use method based on ratios of probabilities…
Metropolis-Hastings Step
• pick new locus based upon current locus
– propose new locus from distribution q( )
• pick value near current one?
• pick uniformly across genome?
– accept new locus with probability A
• Gibbs sampler is special case of M-H
– always accept new proposal
• acceptance insures right stable distribution
– accept new proposal with probability A
– otherwise stick with current value

A(λold, λnew) = min( 1, π(λnew | x) q(λnew, λold) / [π(λold | x) q(λold, λnew)] )
MCMC Run for 1 locus at 40 cM
[Figure: trace plot (MCMC run/100, 0–1000) and frequency histogram of the sampled locus (distance ≈ 36–42 cM)]
Care & Use of MCMC
• sample chain for long run (100,000–1,000,000)
– longer for more complicated likelihoods
– use diagnostic plots to assess "mixing"
• standard error of estimates
– use histogram of posterior
– compute variance of posterior--just another summary
• studying the Markov chain
– Monte Carlo error of series (Geyer 1992)
• time series estimate based on lagged auto-covariances
– convergence diagnostics for "proper mixing"
4.4 bootstrapped variance estimates
• (re)sample (Yi,Xi) with replacement
– create bootstrap sample of "new" data of size n
– estimate loci λ and effects θ
• repeat this N times
• construct summaries of these
– mean, variance, median, percentile
• construct 95% confidence intervals for λ and θ
– (2.5%ile, 97.5%ile)
– order estimates, pick numbers .025N and .975N
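The percentile interval above can be sketched as follows; the function name and the simple mean-of-a-sample example are ours (in practice the estimator would be the interval-mapping fit of λ and θ on each resample):

```python
import numpy as np

def bootstrap_ci(data, estimator, n_boot=1000, seed=0):
    """Percentile bootstrap: resample with replacement, re-estimate N times,
    report the (2.5%ile, 97.5%ile) of the bootstrap estimates."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    est = np.array([estimator(data[rng.integers(0, n, n)])  # resample size n
                    for _ in range(n_boot)])
    return np.quantile(est, [0.025, 0.975])

y = np.random.default_rng(1).normal(10.0, 2.0, 200)
lo, hi = bootstrap_ci(y, np.mean)        # 95% interval for the mean
```

`np.quantile` with [0.025, 0.975] does the "order estimates, pick numbers .025N and .975N" step directly.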
4.5 advantages & shortcomings of IM
• advantages over single marker analysis
– can infer position and effect of QTL
– estimated locations & effects almost unbiased
• if only one segregating QTL per chromosome
– requires fewer individuals for detection of QTL
shortcomings of IM
• not an interval test
– cannot say whether or not QTL is in an interval
– not independent of effects of QTL outside interval
• can give false positives due to linkage
– high LOD score due to nearby QTL
– less of a problem for unlinked QTL
• can detect "ghost QTL"
– higher peak between two linked QTL
– estimated position and effect are biased
4.6 Haley-Knott Regression Approximation
• likelihood mixes over missing genotypes
– normal data → mixture of normals
• approximate mixture by one normal
– just estimate mean and variance
• advantages
– works well for closely spaced markers
– mean is correct
– can exploit flanking markers for missing data
– calculations are easy and fast (PLABQTL)
• disadvantages
– variance depends on marker genotypes and spacings
– approximation errors accumulate for multiple QTL
Haley-Knott regression idea
• replace missing genotypes Q by expected values
– Pi = E(Q | Xi,λ) = sumQ pr(Q | Xi,λ) Q
• fit regression model (e.g. additive gene action)
– Yi = μ + αPi + ei, i = 1,…,n
• assume constant variance
• correct mean
– E(Yi | Xi, θ, λ) = μ + αPi
• wrong variance
– V(Yi | Xi, θ, λ) = σ² sumQ [pr(Q | Xi,λ)]²
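A regression sketch of the Haley-Knott idea; the function and the simulated backcross data are illustrative (our names), with `prior[i, j]` playing the role of pr(Q = j | Xi, λ):

```python
import numpy as np

def haley_knott(y, prior, copies):
    """Haley-Knott: regress the phenotype on P_i = E(Q | X_i, lambda),
    the expected number of Q copies, instead of mixing over genotypes."""
    P = prior @ np.asarray(copies, float)        # P_i = sum_Q pr(Q|X_i,lam)*Q
    X = np.column_stack([np.ones_like(P), P])    # intercept mu, slope alpha
    (mu, alpha), *_ = np.linalg.lstsq(X, y, rcond=None)
    return mu, alpha

# simulated backcross: genotypes known here, so prior is 0/1 and P_i = Q_i
rng = np.random.default_rng(2)
q = rng.integers(0, 2, 400)
y = 10.0 + 2.0 * q + rng.normal(0.0, 1.0, 400)
prior = np.eye(2)[q]                             # pr(Q | X_i, lambda)
mu, alpha = haley_knott(y, prior, copies=(0, 1))
```

Because this is a single least-squares fit per locus, it is what makes a whole-genome Haley-Knott scan so much faster than running EM to convergence at every position.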
Haley-Knott and EM
• both use expected value of genotypes Q
– HK: Pi = E(Q | Xi,λ) = prior expectation
– EM: Pi = E(Q | Yi, Xi, θ, λ) = posterior expectation
• both solve regression problems for effects
• difference is in iteration
– HK is first step of iteration
– EM iterates E-step and M-steps to convergence