Gil McVean Department of Statistics, Oxford Approximate genealogical inference.
![Page 1: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/1.jpg)
Gil McVean, Department of Statistics, Oxford
Approximate genealogical inference
![Page 2: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/2.jpg)
Motivation
• We have a genome’s worth of data on genetic variation
• We would like to use these data to make inferences about multiple processes: recombination, mutation, natural selection, demographic history
![Page 3: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/3.jpg)
Example I: Recombination
• In humans, the recombination rate varies along a chromosome
• Recombination has characteristic influences on patterns of genetic variation
• We would like to estimate the profile of recombination from the variation data – and learn about the factors influencing rate
![Page 4: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/4.jpg)
Example II: Genealogical inference
• The genealogical relationships between sequences are highly informative about underlying processes
• We would like to estimate these relationships from DNA sequences
• We could use these to learn about history, selection and the location of disease-associated mutations
![Page 5: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/5.jpg)
Modelling genetic variation
• We have a probabilistic model that can describe the effects of diverse processes on genetic variation: The coalescent
• Coalescent modelling describes the distribution of genealogical relationships between sequences sampled from idealised populations
• Patterns of genetic variation result from mapping mutations on the genealogy
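As a rough illustration of the two steps above (draw a genealogy, then map mutations onto it), here is a minimal Python sketch of the standard coalescent for a sample of size n. The function names are hypothetical, and the scaling is an assumption not stated on the slides: time is in coalescent units (each pair of lineages coalesces at rate 1) and mutations fall at rate theta/2 per lineage per unit time.

```python
import random

def simulate_coalescent_times(n, rng):
    """Waiting times while k = n, n-1, ..., 2 lineages remain.

    Assumed scaling: each pair of lineages coalesces at rate 1.
    """
    times = []
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2  # number of pairs that could coalesce
        times.append(rng.expovariate(rate))
    return times

def poisson_draw(mean, rng):
    """Poisson draw: count rate-1 arrivals in an interval of length mean."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def simulate_segregating_sites(n, theta, rng):
    """Number of mutations mapped onto one simulated genealogy."""
    times = simulate_coalescent_times(n, rng)
    # k lineages persist for the k-th waiting time, so branch length adds k * t.
    total_length = sum(k * t for k, t in zip(range(n, 1, -1), times))
    return poisson_draw(theta / 2 * total_length, rng)

if __name__ == "__main__":
    rng = random.Random(1)
    print(simulate_segregating_sites(n=10, theta=5.0, rng=rng))
```

Under these assumptions the expected number of mutations is theta times the sum of 1/i for i = 1 to n-1, which gives a simple check on the simulation.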
![Page 6: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/6.jpg)
![Page 7: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/7.jpg)
Where do these trees come from?
Present day
![Page 8: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/8.jpg)
Ancestry of current population
Present day
![Page 9: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/9.jpg)
Ancestry of sample
Present day
![Page 10: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/10.jpg)
The coalescent: a model of genealogies
[Figure: genealogy of the sample, with time running back from the present day along ancestral lineages, through coalescence events, to the most recent common ancestor (MRCA)]
![Page 11: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/11.jpg)
Coalescent modelling describes the distribution of genealogies
![Page 12: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/12.jpg)
…and data
![Page 13: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/13.jpg)
Generalising the coalescent
• The impact of many different forces can readily be incorporated into coalescent modelling
• With recombination, the history of the sample is described by a complex graph in which local genealogical trees are embedded – called the ancestral recombination graph or ARG
Ancestral chromosome recombines
![Page 14: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/14.jpg)
Genealogical trees vary along a chromosome
![Page 15: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/15.jpg)
Coalescent-based inference
• We would like to use the coalescent model to drive inference about underlying processes
• Generally, we would like to calculate the likelihood function
• However, there is a many-to-one mapping of ARGs (and genealogies) to data
• Consequently, we have to integrate out the ‘missing-data’ of the ARG
• This can only be done using Monte Carlo methods (except in trivial examples)
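To make "integrating out the missing data" concrete, here is a minimal Python sketch for one of the trivial examples where the answer is also available in closed form: two sequences, no recombination, and s observed differences. The model details are assumptions for illustration (coalescence time T exponential with rate 1, number of differences Poisson with mean theta times T), and the function names are hypothetical.

```python
import math
import random

def poisson_pmf(s, mean):
    """Probability of s events under a Poisson distribution with this mean."""
    if mean == 0:
        return 1.0 if s == 0 else 0.0
    return math.exp(-mean + s * math.log(mean) - math.lgamma(s + 1))

def mc_likelihood(s, theta, draws, rng):
    """Monte Carlo estimate of L(theta) = E_T[ Pr(s | theta, T) ]."""
    total = 0.0
    for _ in range(draws):
        t = rng.expovariate(1.0)  # missing data: the coalescence time
        total += poisson_pmf(s, theta * t)
    return total / draws

def exact_likelihood(s, theta):
    """Closed form for the same toy model: theta^s / (1 + theta)^(s + 1)."""
    return theta ** s / (1 + theta) ** (s + 1)

if __name__ == "__main__":
    rng = random.Random(42)
    print(mc_likelihood(3, 2.0, 50000, rng), exact_likelihood(3, 2.0))
```

With an ARG in place of a single coalescence time the same averaging idea applies, but drawing histories that are compatible with the data becomes the hard part.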
![Page 16: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/16.jpg)
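[Figure: one possible history of the sample, with events ordered back in time from t = 0: coalescence (t1), recombination (t2), coalescence (t3), mutation (t4), coalescence (t5), coalescence (t6), mutation (t7), coalescence at tMRCA]

$$L(\theta \mid \text{History}) = \prod_{i \in \text{events}} \Pr(\text{event}_i \mid \theta)$$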
![Page 17: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/17.jpg)
A problem and a possible solution
• Efficient exploration of the space of ARGs is a difficult problem
• The difficulties of performing efficient exact genealogical inference (at least within a coalescent framework) currently seem insurmountable
• There are several possible solutions
– Dimension-reduction
– Approximate the model
– Approximate the likelihood function
• One approach that has proved useful is to combine information from subsets of data for which the likelihood function can be estimated
– Composite likelihood
![Page 18: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/18.jpg)
Example I: Recombination rate estimation
• We can estimate the likelihood function for the recombination fraction separating two SNPs
• To approximate the likelihood for the whole data set, we simply multiply the marginal likelihoods (Hudson 2001)
• The method performs well in point-estimation
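The multiplication of marginal likelihoods can be sketched as a sum of pairwise log-likelihoods. This is a minimal Python illustration, not the published implementation: `pair_loglik` is a hypothetical callable standing in for the two-SNP likelihood, and the toy surface used to exercise it is a placeholder with no coalescent meaning. Positions are assumed to be sorted, with a single rate `rho` per unit distance.

```python
import math
from itertools import combinations

def composite_loglik(rho, positions, pair_loglik):
    """Sum the pairwise log-likelihoods over all pairs of SNPs.

    pair_loglik(i, j, rho_ij) is a HYPOTHETICAL callable returning the
    two-SNP log-likelihood at recombination parameter rho_ij.
    """
    total = 0.0
    for i, j in combinations(range(len(positions)), 2):
        rho_ij = rho * (positions[j] - positions[i])
        total += pair_loglik(i, j, rho_ij)
    return total

def make_toy_pair_loglik(positions, true_rho):
    """Placeholder surface (NOT a coalescent likelihood): peaks at true_rho."""
    def pair_loglik(i, j, rho_ij):
        target = true_rho * (positions[j] - positions[i])
        return -(math.log(rho_ij) - math.log(target)) ** 2
    return pair_loglik

def maximise_on_grid(grid, positions, pair_loglik):
    """Point estimate: the grid value with the highest composite log-likelihood."""
    return max(grid, key=lambda rho: composite_loglik(rho, positions, pair_loglik))

if __name__ == "__main__":
    snps = [0.0, 1.0, 2.5, 4.0]
    toy = make_toy_pair_loglik(snps, true_rho=2.0)
    print(maximise_on_grid([0.5, 1.0, 2.0, 4.0, 8.0], snps, toy))
```

In practice the pairwise surfaces would be estimated once and looked up, which is what makes the composite approach cheap to evaluate.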
![Page 19: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/19.jpg)
[Figure: log-likelihood (lnL) plotted against R over the range 0 to 10, comparing the full likelihood with the composite-likelihood approximation]
![Page 20: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/20.jpg)
Good and bad things about CL
• Good things
– Estimation using CL can be made very efficient
– It performs well in simulations
– It can generalise to variable recombination rates
• Bad things
– It throws away information
– It is NOT a true likelihood
– It typically underestimates uncertainty because of ‘double-counting’
![Page 21: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/21.jpg)
Fitting a variable recombination rate
• Use a reversible-jump MCMC approach (Green 1995)
[Figure: blocks of constant recombination rate (hot and cold) along the SNP positions, with the four move types: split blocks, merge blocks, change block size, change block rate]
![Page 22: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/22.jpg)
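Acceptance rates

$$\alpha(\Theta, \Theta') = \min\left\{1,\; \frac{C(\Theta')}{C(\Theta)} \cdot \frac{q(\Theta', \Theta)}{q(\Theta, \Theta')} \cdot \frac{\pi(\Theta')}{\pi(\Theta)} \cdot \left|\frac{\partial(\Theta', u')}{\partial(\Theta, u)}\right| \right\}$$

The four factors are, in order: the composite likelihood ratio, the Hastings ratio, the ratio of priors, and the Jacobian of partial derivatives relating changes in parameters to sampled random numbers.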
• Include a prior on the number of change points that encourages smoothing
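The acceptance step can be sketched in Python by combining the four ratios on the log scale, which avoids overflow when the composite likelihood ratio is extreme. This is a generic reversible-jump acceptance calculation under the assumption that the caller supplies each log ratio; the function and argument names are hypothetical.

```python
import math
import random

def rj_acceptance(log_cl_ratio, log_hastings_ratio, log_prior_ratio, log_jacobian):
    """Acceptance probability min(1, product of the four ratios)."""
    log_alpha = log_cl_ratio + log_hastings_ratio + log_prior_ratio + log_jacobian
    if log_alpha >= 0:
        return 1.0
    return math.exp(log_alpha)

def accept_move(alpha, rng):
    """Accept the proposed move with probability alpha."""
    return rng.random() < alpha

if __name__ == "__main__":
    rng = random.Random(0)
    alpha = rj_acceptance(math.log(0.5), 0.0, 0.0, 0.0)
    print(alpha, accept_move(alpha, rng))
```

For moves that keep the number of blocks fixed (changing a block size or rate with a symmetric proposal), the Jacobian and Hastings terms would typically be zero on the log scale.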
![Page 23: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/23.jpg)
rjMCMC in action
• 200kb of the HLA region – strong evidence of LD breakdown
![Page 24: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/24.jpg)
How do you validate the method?
• Concordance with rate estimates from sperm-typing experiments at fine scale
• Concordance with pedigree-based genetic maps at broad scales
![Page 25: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/25.jpg)
Strong concordance between fine-scale rate estimates from sperm and genetic variation
Rates estimated from sperm: Jeffreys et al (2001)
Rates estimated from genetic variation: McVean et al (2004)
![Page 26: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/26.jpg)
We have generated a map of hotspots across the human genome
Myers et al (2005)
![Page 27: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/27.jpg)
We have identified DNA sequence motifs that explain 40% of all hotspots
Myers et al (2008)
![Page 28: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/28.jpg)
Example II: Estimating local genealogies
• Age of mutation?
• Date of population founding?
• Migration and admixture?
![Page 29: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/29.jpg)
The decay of a tree by recombination
![Page 30: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/30.jpg)
The decay of a tree by recombination
[Figure: axis running from 0 to 100]
![Page 31: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/31.jpg)
Two sequence case
• Any pair of haplotypes will have regions of high and low divergence
• We can combine HMM structures with numerical techniques (Gaussian quadrature) to estimate the marginal likelihood surface at a given position, x
• We can further approximate the likelihood surface by fitting a scaled gamma distribution
– This massively reduces the computational load of subsequent steps
– In the case of no recombination the truth is a scaled gamma distribution
[Figure: example pattern along a pair of haplotypes, with the marginal likelihood surface $\Pr(H_1, H_2 \mid t_x)$]
010111000000000000000000000000000111001110000001001010000000000000000000000000101000100010
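The no-recombination case can be checked numerically. In this minimal Python sketch the pairwise likelihood is assumed to be Poisson in the number of differences s with mean theta times t, so as a function of t it is proportional to a gamma density with shape s + 1 and rate theta. The surface is evaluated on a plain grid and a gamma is fitted by matching moments; the slides mention Gaussian quadrature, which is swapped here for the trapezoid rule to keep the example short. All names are hypothetical.

```python
import math

def pair_likelihood(s, theta, t):
    """Assumed model: s differences are Poisson with mean theta * t."""
    mean = theta * t
    if mean == 0:
        return 1.0 if s == 0 else 0.0
    return math.exp(-mean + s * math.log(mean) - math.lgamma(s + 1))

def fit_scaled_gamma(surface, upper, steps):
    """Fit scale * Gamma(shape, rate) to a surface on [0, upper] by moments."""
    h = upper / steps
    grid = [i * h for i in range(steps + 1)]
    values = [surface(t) for t in grid]

    def integrate(ys):
        # Trapezoid rule on the regular grid.
        return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

    scale = integrate(values)
    mean = integrate([t * v for t, v in zip(grid, values)]) / scale
    second = integrate([t * t * v for t, v in zip(grid, values)]) / scale
    variance = second - mean ** 2
    return scale, mean ** 2 / variance, mean / variance

if __name__ == "__main__":
    scale, shape, rate = fit_scaled_gamma(
        lambda t: pair_likelihood(3, 2.0, t), upper=40.0, steps=40000)
    print(scale, shape, rate)
```

Storing only three numbers per pair and position (scale, shape, rate) instead of a whole surface is one way to read the claim that the approximation reduces the computational load of later steps.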
![Page 32: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/32.jpg)
Combining surfaces
• Suppose we have a partially-reconstructed tree
• We can approximate the probability of any further step in the tree using the composite-likelihood
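[Equation and figure: the probability of the next event is approximated by a double product, over lineages i and j, of pairwise terms of the form $\Pr(H_i, H_j \mid t)$; the figure marks the interval between t0 and t on the partially-reconstructed tree]
(assumes un-coalesced ancestors are independent draws from stationary distribution)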
![Page 33: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/33.jpg)
An important detail
• Actually, don’t use exactly this construction
• Use a ‘nearest-neighbour’ construction
– Each lineage chooses a nearest neighbour
– Choose which nearest-neighbour event to occur
– Choose a time for the nearest-neighbour event
• Still uses composite likelihood
![Page 34: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/34.jpg)
Building the tree
• We can use these functions to choose (e.g. maximise or sample) the next event
• The gamma approximation leads to an efficient algorithm for estimating the local genealogy that has the same time and memory complexity as neighbour-joining
• This means it can be applied to large data sets
![Page 35: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/35.jpg)
Desirable properties of the algorithm
• It can be fully stochastic (unlike NJ, UPGMA, ML)
• It returns the prior in the absence of data:
• It returns the truth in the limit of infinite data:
• It is correct for a single SNP:
• It is close to the optimal proposal distribution (as defined by Stephens and Donnelly 2000) in the case of no recombination
• It uses much of the available information
• It is fast – time complexity in n the same as for NJ
![Page 36: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/36.jpg)
Example: mutation rate = recombination rate
The true tree at 0.5
Simulations: θ = 10, R = 10
![Page 37: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/37.jpg)
How to evaluate tree accuracy?
• Specific applications will require different aspects of the estimated trees to be more or less accurate
• Nevertheless, a general approach is to compare the representation of bi-partitions in the true tree to estimated ones
• Rather than require 100% accuracy in predicting a bi-partition, we can (for every observed bi-partition) find the ‘most similar’ bi-partition in the estimated tree
• We should also weight by the branch length associated with each bi-partition
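The comparison described above can be sketched in Python. Each bi-partition is written as a 0/1 membership vector over the sampled sequences, similarity between two bi-partitions is taken to be r squared (the statistic reported on the results slide as "average weighted max r2"), and the score is a branch-length-weighted average over the true tree. The function names and the input layout are assumptions for illustration.

```python
def r_squared(a, b):
    """Squared correlation between two 0/1 membership vectors."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # a split with everyone on one side carries no information
    return (pab - pa * pb) ** 2 / denom

def weighted_max_r2(true_splits, estimated_splits):
    """Branch-length-weighted average of the best match for each true split.

    true_splits: list of (branch_length, membership_vector) pairs.
    estimated_splits: list of membership vectors from the estimated tree.
    """
    total_length = sum(length for length, _ in true_splits)
    score = 0.0
    for length, split in true_splits:
        best = max(r_squared(split, est) for est in estimated_splits)
        score += length * best
    return score / total_length

if __name__ == "__main__":
    truth = [(1.0, (1, 1, 0, 0)), (0.5, (1, 0, 0, 0))]
    estimate = [(1, 1, 0, 0), (0, 1, 1, 1)]
    print(weighted_max_r2(truth, estimate))
```

Because r squared is unchanged when a split is relabelled as its complement, the two sides of a bi-partition do not need a consistent orientation between trees.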
![Page 38: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/38.jpg)
Simple distance weighting (UPGMA): average weighted max r2 = 0.59
Single sample from posterior: average weighted max r2 = 0.96
n = 100, θ = 20, R = 30 (hotspots)
![Page 39: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/39.jpg)
Open questions
• Obtaining useful estimates of uncertainty
– Power transformations of composite likelihood function
• Using larger subsets of the data
– E.g. quartets
![Page 40: Gil McVean Department of Statistics, Oxford Approximate genealogical inference.](https://reader033.fdocuments.us/reader033/viewer/2022051316/56649eb55503460f94bbdf53/html5/thumbnails/40.jpg)
Acknowledgements
• Many thanks to
• Oxford statistics
– Simon Myers, Chris Hallsworth, Adam Auton, Colin Freeman, Peter Donnelly
• Lancaster
– Paul Fearnhead
• International HapMap Project