Estimating recombination rates using three-site likelihoods Jeff Wall Program in Molecular and...
Date posted: 21-Dec-2015
Estimating recombination rates using three-site likelihoods
Jeff Wall
Program in Molecular and Computational Biology, USC
DNA sequence variation
Patterns of DNA sequence variation are affected by:
• mutation
• recombination
• population structure
• changes in population size
• natural selection
• genetic drift
Standard double strand break model of recombination
Outcomes: gene conversion, or crossover (with gene conversion).
Approximated as: gene conversion, or crossover. Patchworks are ignored.
Slide courtesy of M. Przeworski
Gene conversion
• Most population genetic models ignore gene conversion. However, gene conversion has a strong effect on the levels of linkage disequilibrium between closely linked sites.
• Crossing over: recombinants are produced at a rate proportional to the genetic distance between the sites.
• Gene conversion: recombinants are produced at a rate that is roughly independent of the distance between the sites.
Effect of gene conversion on patterns of linkage disequilibrium (LD)
Gene conversion leads to a steeper decay of LD at short distances.
[Figure: average r² (0 to 0.4) versus physical distance between markers (0 to 20,000 bp), with one curve for no gene conversion and one for gene conversion.]
Figure courtesy of M. Przeworski
Implications of high levels of gene conversion
• To detect natural selection (Andolfatto and Nordborg 1998; Berry and Barbadilla 2000)
• For linkage disequilibrium-based association studies
[Figure: panels A, B and C, each showing three sites labeled 1, 2, 3.]
Parameters
ρ = 4 N_e r_co, where N_e is the effective population size and r_co is the crossover rate per bp per generation.
f = r_gc / r_co, where r_gc is the rate of gene conversion initiation per bp per generation.
t = mean gene conversion tract length. We assume that gene conversion tract lengths follow a geometric distribution.
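To see how these three parameters interact, here is a minimal sketch of the scaled recombination rate between two sites d bp apart. It is an illustration, not the talk's own code. It assumes a simple tract model that the slides do not spell out: a tract starts at a uniformly chosen position, extends in one direction, and spans more than d further bp with probability (1 - 1/t)^d. The function name and argument names are hypothetical.

```python
def scaled_recombination_rate(d, rho, f, t):
    """Scaled recombination rate between two sites d bp apart.

    rho : 4 * N_e * r_co per bp (crossover)
    f   : r_gc / r_co (gene conversion initiations per crossover)
    t   : mean gene conversion tract length in bp (geometric, assumed)

    Assumption (not from the slides): a tract separates the two sites
    when it covers exactly one of them, which under the one-directional
    geometric model happens at rate 2 * f * rho * t * (1 - (1 - 1/t)**d).
    """
    if d < 0:
        raise ValueError("d must be non-negative")
    if t < 1:
        raise ValueError("mean tract length t must be at least 1 bp")
    crossover = rho * d
    conversion = 2.0 * f * rho * t * (1.0 - (1.0 - 1.0 / t) ** d)
    return crossover + conversion


if __name__ == "__main__":
    # Illustrative values only.
    for d in (1, 10, 100, 1000, 5000):
        print(d, scaled_recombination_rate(d, rho=0.001, f=4.0, t=125.0))
```

Under this sketch the gene conversion term grows like 2·f·ρ·d for sites much closer than t and levels off at 2·f·ρ·t for distant sites, which matches the statement that gene conversion produces recombinants at a rate roughly independent of distance.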
General Approach
Ideally we would calculate the probability of the data as a function of the recombination parameters.
However, full likelihood methods (e.g., Fearnhead & Donnelly 2001) are too computationally intensive.
The composite likelihood approach calculates likelihoods for small subsets of the data, then multiplies these likelihoods over many subsets.
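Written out, the product over subsets looks roughly as follows. The notation is introduced here for illustration and is not taken from the slides: \mathcal{S} is the collection of site subsets used (for example pairs or triplets of sites), and x_S is the data restricted to the sites in S.

```latex
\mathrm{CL}(\rho, f, t) \;=\; \prod_{S \in \mathcal{S}} L\!\left(x_S ;\, \rho, f, t\right),
\qquad
\log \mathrm{CL}(\rho, f, t) \;=\; \sum_{S \in \mathcal{S}} \log L\!\left(x_S ;\, \rho, f, t\right)
```

Because the subsets overlap and are not independent, this product is not a true likelihood. In practice one would work with the sum of log-likelihoods to avoid numerical underflow.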
Composite likelihood (Frisse et al. 2001)
Sequence 1 a c c g a t g c g t a a g c t
Sequence 2 g t a g a t g c g t c a g c t
Sequence 3 g t a g t c g t g t c g g c c
Sequence 4 a c a g t c g t g t c g g t t
Sequence 5 a c a g t c g t g t a g g t t
Sequence 6 a c c g a c g c c c a a g c t
Sequence 7 a c c g a t g c c c a a g c t
Sequence 8 a c c g a t g c c c a a g c c
Sequence 9 a c c t a t g c g t a a g c t
Sequence 10 a c c g a t a c g t c g g t t
Sequence 11 a c a g a c g c g t c g c c t
Sequence 12 g t a g a t g c c c a a g c t
Composite likelihood (Wall 2004)
Sequence 1 a c c g a t g c g t a a g c t
Sequence 2 g t a g a t g c g t c a g c t
Sequence 3 g t a g t c g t g t c g g c c
Sequence 4 a c a g t c g t g t c g g t t
Sequence 5 a c a g t c g t g t a g g t t
Sequence 6 a c c g a c g c c c a a g c t
Sequence 7 a c c g a t g c c c a a g c t
Sequence 8 a c c g a t g c c c a a g c c
Sequence 9 a c c t a t g c g t a a g c t
Sequence 10 a c c g a t a c g t c g g t t
Sequence 11 a c a g a c g c g t c g c c t
Sequence 12 g t a g a t g c c c a a g c t
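The subsets that the two composite likelihood methods work with can be enumerated directly from the example alignment. The sketch below assumes, based on the "pair method" and "triplet method" labels used later in this talk, that Frisse et al. (2001) uses pairs of segregating sites and Wall (2004) uses triplets. The function names are hypothetical, and the likelihood of each configuration is not computed here.

```python
from collections import Counter
from itertools import combinations

# The 12 example sequences from the slide, one string per sequence.
ALIGNMENT = [
    "accgatgcgtaagct",
    "gtagatgcgtcagct",
    "gtagtcgtgtcggcc",
    "acagtcgtgtcggtt",
    "acagtcgtgtaggtt",
    "accgacgcccaagct",
    "accgatgcccaagct",
    "accgatgcccaagcc",
    "acctatgcgtaagct",
    "accgatacgtcggtt",
    "acagacgcgtcgcct",
    "gtagatgcccaagct",
]


def segregating_sites(seqs):
    """Return 0-based indices of columns with more than one allele."""
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]


def site_subsets(seqs, k):
    """All k-site subsets of the segregating sites (k=2 pairs, k=3 triplets)."""
    return list(combinations(segregating_sites(seqs), k))


def configuration(seqs, sites):
    """Count how often each haplotype occurs at the chosen sites."""
    return Counter("".join(s[i] for i in sites) for s in seqs)


if __name__ == "__main__":
    print(len(site_subsets(ALIGNMENT, 2)), "pairs")
    print(len(site_subsets(ALIGNMENT, 3)), "triplets")
    print(configuration(ALIGNMENT, (0, 1)))
```

Each haplotype configuration returned by `configuration` is the input to one term in the composite likelihood. The triplet method has more terms per locus, and each term carries information about three sites jointly.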
Simulations
We ran simulations of 5 kb loci with n = 50, θ = ρ = 0.001/bp, f = 4 and t = 125 bp.
We analyzed each locus individually as well as in groups of 5, 20 and 100 loci (assuming the loci are evolutionarily independent). For each group, we estimated f over a grid of values using the methods of Frisse et al. (2001) and Wall (2004).
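Treating loci as independent means their log-likelihood curves can simply be added before picking the best grid value. The sketch below illustrates that step only. The grid is an assumption read off the axis labels of the histograms in this talk, the function name is hypothetical, and the per-locus curves in the usage example are made-up numbers, not simulation output.

```python
# Assumed grid of f values (taken from the histogram axis labels).
F_GRID = [0.0, 1.0, 1.4, 2.0, 2.8, 4.0, 5.6, 8.0, 11.2, 16.0]


def estimate_f(per_locus_loglik, grid=F_GRID):
    """Sum composite log-likelihood curves over loci and return the
    grid value of f with the highest total, plus the summed curve.

    per_locus_loglik: one list per locus, each holding one
    log-likelihood value per grid point.
    """
    for curve in per_locus_loglik:
        if len(curve) != len(grid):
            raise ValueError("each curve needs one value per grid point")
    total = [sum(column) for column in zip(*per_locus_loglik)]
    best = max(range(len(total)), key=total.__getitem__)
    return grid[best], total


if __name__ == "__main__":
    # Made-up curves for two loci, for illustration only.
    locus_a = [-9, -5, -4, -3, -3.5, -4, -6, -8, -11, -15]
    locus_b = [-20, -12, -10, -8, -6, -4.5, -4, -3.5, -5, -7]
    print(estimate_f([locus_a]))
    print(estimate_f([locus_a, locus_b]))
```

In this toy example each locus alone points to a different value of f, while the summed curve peaks in between, which is the sense in which grouping loci pools information.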
Distribution of estimates of f (1 locus)
[Figure: frequency of each estimated value of f (grid: 0, 1, 1.4, 2, 2.8, 4, 5.6, 8, 11.2, 16) for the triplet method and the pair method.]
Distribution of estimates of f (5 loci)
[Figure: frequency of each estimated value of f (grid: 0, 1, 1.4, 2, 2.8, 4, 5.6, 8, 11.2, 16) for the triplet method and the pair method.]
Distribution of estimates of f (20 loci)
[Figure: frequency of each estimated value of f (grid: 0, 1, 1.4, 2, 2.8, 4, 5.6, 8, 11.2, 16) for the triplet method and the pair method.]
Distribution of estimates of f (100 loci)
[Figure: frequency of each estimated value of f (grid: 0, 1, 1.4, 2, 2.8, 4, 5.6, 8, 11.2, 16) for the triplet method and the pair method.]
Estimating ρ and f jointly
[Figure: probability (0 to 1) versus number of loci (1 to 1000, log scale) for the triplet method and the pair method.]
Conclusions
• For estimating gene conversion rates, the triplet composite likelihood method is slightly more accurate than the pairwise composite likelihood method.
• Neither method is very accurate on an absolute scale.
Further directions
• Modify method to handle unphased data, missing data, ascertainment bias, etc.
• Variation in recombination rates
• Confounding factors:
– Multiple hits
– Sequencing errors
– Population history
– Natural selection