Gil McVean Department of Statistics, Oxford Approximate genealogical inference.


Motivation

• We have a genome’s worth of data on genetic variation

• We would like to use these data to make inferences about multiple processes: recombination, mutation, natural selection, demographic history


Example I: Recombination

• In humans, the recombination rate varies along a chromosome

• Recombination has characteristic influences on patterns of genetic variation

• We would like to estimate the profile of recombination from the variation data – and learn about the factors influencing rate


Example II: Genealogical inference

• The genealogical relationships between sequences are highly informative about underlying processes

• We would like to estimate these relationships from DNA sequences

• We could use these to learn about history, selection and the location of disease-associated mutations


Modelling genetic variation

• We have a probabilistic model that can describe the effects of diverse processes on genetic variation: the coalescent

• Coalescent modelling describes the distribution of genealogical relationships between sequences sampled from idealised populations

• Patterns of genetic variation result from mapping mutations onto the genealogy
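A minimal sketch of these two points, assuming the standard neutral coalescent for a sample of n sequences: while k lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2 (time in coalescent units), and mutations fall on each branch as a Poisson process with rate θ/2. The function names and the θ scaling are illustrative assumptions, not taken from the slides.

```python
import random


def poisson(mean, rng):
    """Draw a Poisson variate by counting unit-rate exponential arrivals."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_sample(n, theta, rng):
    """Simulate one genealogy for n sequences and map mutations onto it.

    Returns (time to the MRCA, list of mutations), where each mutation is
    the frozenset of sampled sequences that carry it.
    """
    lineages = [frozenset([i]) for i in range(n)]
    time = 0.0
    mutations = []
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time until the next coalescence among k lineages.
        wait = rng.expovariate(k * (k - 1) / 2.0)
        # Mutations on each lineage during this interval.
        for lineage in lineages:
            for _ in range(poisson(theta / 2.0 * wait, rng)):
                mutations.append(lineage)
        time += wait
        # A uniformly chosen pair of lineages coalesces.
        a, b = rng.sample(range(k), 2)
        merged = lineages[a] | lineages[b]
        lineages = [l for idx, l in enumerate(lineages) if idx not in (a, b)]
        lineages.append(merged)
    return time, mutations


tmrca, mutations = simulate_sample(n=5, theta=2.0, rng=random.Random(1))
print("TMRCA:", tmrca)
print("Mutation patterns:", [sorted(m) for m in mutations])
```

Each mutation pattern corresponds to one segregating site in the data, so repeated calls give draws from the joint distribution of genealogies and data under these assumptions.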


Where do these trees come from?

[Figure label: Present day]

Ancestry of current population

[Figure label: Present day]

Ancestry of sample

[Figure label: Present day]

The coalescent: a model of genealogies

[Figure labels: time; coalescence; most recent common ancestor (MRCA); ancestral lineages; present day]

Coalescent modelling describes the distribution of genealogies


…and data


Generalising the coalescent

• The impact of many different forces can readily be incorporated into coalescent modelling

• With recombination, the history of the sample is described by a complex graph in which local genealogical trees are embedded – called the ancestral recombination graph or ARG

[Figure label: Ancestral chromosome recombines]

Genealogical trees vary along a chromosome


Coalescent-based inference

• We would like to use the coalescent model to drive inference about underlying processes

• Generally, we would like to calculate the likelihood function

• However, there is a many-to-one mapping of ARGs (and genealogies) to data

• Consequently, we have to integrate out the ‘missing data’ of the ARG

• This can only be done using Monte Carlo methods (except in trivial examples)
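A toy sketch of integrating out missing data by Monte Carlo, for one of the trivial cases where the answer is also known exactly. It assumes two sequences without recombination: the coalescence time T is exponential with rate 1 (coalescent units), and given T the number of differences is Poisson with mean θT. The function names are illustrative; real ARG-based inference is far harder than this.

```python
import math
import random


def mc_likelihood(theta, n_diffs, n_draws, rng):
    """Estimate Pr(n_diffs | theta) by averaging over genealogies drawn from the prior."""
    total = 0.0
    for _ in range(n_draws):
        t = rng.expovariate(1.0)  # missing data: the pairwise coalescence time
        lam = theta * t           # expected number of differences given t
        total += math.exp(-lam) * lam ** n_diffs / math.factorial(n_diffs)
    return total / n_draws


def exact_likelihood(theta, n_diffs):
    """Closed form for the same integral (a geometric distribution)."""
    return theta ** n_diffs / (1.0 + theta) ** (n_diffs + 1)


estimate = mc_likelihood(theta=2.0, n_diffs=3, n_draws=50000, rng=random.Random(1))
print("Monte Carlo:", estimate, "exact:", exact_likelihood(2.0, 3))
```

Sampling from the prior works here because the missing data are one-dimensional; with an ARG the space is vastly larger, which is the difficulty these slides go on to address.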


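[Figure: an example history along the time axis, from the present day (t = 0) back to the MRCA]

tMRCA: Coalescence
t7: Mutation
t6: Coalescence
t5: Coalescence
t4: Mutation
t3: Coalescence
t2: Recombination
t1: Coalescence
t = 0

$$L(\theta \mid \text{History}) = \prod_{i \in \text{events}} \Pr(\text{event}_i \mid \theta)$$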


A problem and a possible solution

• Efficient exploration of the space of ARGs is a difficult problem

• The difficulties of performing efficient exact genealogical inference (at least within a coalescent framework) currently seem insurmountable

• There are several possible solutions
  – Dimension reduction
  – Approximate the model
  – Approximate the likelihood function

• One approach that has proved useful is to combine information from subsets of data for which the likelihood function can be estimated
  – Composite likelihood


Example I: Recombination rate estimation

• We can estimate the likelihood function for the recombination fraction separating two SNPs

• To approximate the likelihood for the whole data set, we simply multiply the marginal likelihoods (Hudson 2001)

• The method performs well for point estimation
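A sketch of the multiplication step, written on the log scale. The pairwise log-likelihood is passed in as a function; the toy version used here (a quadratic peaked at a chosen rate) is a stand-in invented for illustration, where a real analysis would use a coalescent-based two-SNP likelihood. All names and numbers are hypothetical.

```python
from itertools import combinations


def composite_log_likelihood(rho, positions, pair_loglik):
    """Sum the pairwise log-likelihoods over all pairs of SNPs.

    rho is the recombination rate per unit distance; positions must be sorted.
    """
    total = 0.0
    for i, j in combinations(range(len(positions)), 2):
        distance = positions[j] - positions[i]
        total += pair_loglik(i, j, rho * distance)
    return total


def make_toy_pair_loglik(peak_rate, positions):
    """Toy stand-in: each pairwise surface peaks at peak_rate * distance."""
    def pair_loglik(i, j, rho_ij):
        target = peak_rate * (positions[j] - positions[i])
        return -(rho_ij - target) ** 2
    return pair_loglik


positions = [0.0, 1.0, 2.5, 4.0]  # hypothetical SNP positions
toy = make_toy_pair_loglik(1.3, positions)
grid = [0.1 * k for k in range(1, 51)]
best = max(grid, key=lambda rho: composite_log_likelihood(rho, positions, toy))
print("Composite-likelihood point estimate:", best)
```

The grid search gives a point estimate only; the curvature of the composite surface is not a reliable guide to uncertainty, because the pairs are not independent.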


[Figure: log-likelihood (lnL) plotted against R, comparing the full likelihood with the composite-likelihood approximation]


Good and bad things about CL

• Good things
  – Estimation using CL can be made very efficient
  – It performs well in simulations
  – It can generalise to variable recombination rates

• Bad things
  – It throws away information
  – It is NOT a true likelihood
  – It typically underestimates uncertainty because of ‘double-counting’


Fitting a variable recombination rate

• Use a reversible-jump MCMC approach (Green 1995)

[Figure: rjMCMC moves over blocks of recombination rate (cold to hot) along the SNP positions: split blocks, merge blocks, change block size, change block rate]


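Acceptance rates

$$\alpha(\Theta, \Theta') = \min\left\{1,\ \frac{C(\Theta')}{C(\Theta)}\,\frac{\pi(\Theta')}{\pi(\Theta)}\,\frac{q(\Theta', \Theta)}{q(\Theta, \Theta')}\,\left|\frac{\partial(\Theta', u')}{\partial(\Theta, u)}\right|\right\}$$

  – $C(\Theta')/C(\Theta)$: composite likelihood ratio
  – $\pi(\Theta')/\pi(\Theta)$: ratio of priors
  – $q(\Theta', \Theta)/q(\Theta, \Theta')$: Hastings ratio
  – $|\partial(\Theta', u')/\partial(\Theta, u)|$: Jacobian of partial derivatives relating changes in parameters to sampled random numbers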

• Include a prior on the number of change points that encourages smoothing
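A sketch of how the acceptance step might be coded on the log scale, combining the four factors named on this slide (composite likelihood ratio, ratio of priors, Hastings ratio, Jacobian). The linear penalty on the number of change points is an assumed form of smoothing prior, chosen only for illustration; the slides do not specify the prior, and all names here are hypothetical.

```python
import math
import random


def log_change_point_prior(n_change_points, penalty):
    """Assumed smoothing prior: each extra change point costs `penalty` log units."""
    return -penalty * n_change_points


def log_acceptance(log_cl_new, log_cl_old, log_prior_new, log_prior_old,
                   log_q_reverse, log_q_forward, log_abs_jacobian=0.0):
    """Log of min(1, CL ratio * prior ratio * Hastings ratio * |Jacobian|)."""
    log_ratio = ((log_cl_new - log_cl_old)
                 + (log_prior_new - log_prior_old)
                 + (log_q_reverse - log_q_forward)
                 + log_abs_jacobian)
    return min(0.0, log_ratio)


def accept(log_alpha, rng):
    """Accept the proposed move with probability exp(log_alpha)."""
    return rng.random() < math.exp(log_alpha)


# A split move that adds one change point and improves the composite likelihood.
log_alpha = log_acceptance(
    log_cl_new=-100.0, log_cl_old=-103.0,
    log_prior_new=log_change_point_prior(4, penalty=5.0),
    log_prior_old=log_change_point_prior(3, penalty=5.0),
    log_q_reverse=0.0, log_q_forward=0.0)
print("Acceptance probability:", math.exp(log_alpha))
```

In this example the gain of 3 log units in composite likelihood does not outweigh the penalty of 5, so the split is accepted only with probability exp(-2); a larger penalty gives a smoother rate profile.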


rjMCMC in action

• 200 kb of the HLA region – strong evidence of LD breakdown


How do you validate the method?

• Concordance with rate estimates from sperm-typing experiments at fine scale

• Concordance with pedigree-based genetic maps at broad scales


Strong concordance between fine-scale rate estimates from sperm and genetic variation

Rates estimated from sperm: Jeffreys et al (2001)

Rates estimated from genetic variation: McVean et al (2004)


We have generated a map of hotspots across the human genome

Myers et al (2005)


We have identified DNA sequence motifs that explain 40% of all hotspots

Myers et al (2008)


Example II: Estimating local genealogies

[Figure labels: Age of mutation; Date of population founding; Migration and admixture; ?]


The decay of a tree by recombination



Two sequence case

• Any pair of haplotypes will have regions of high and low divergence

• We can combine HMM structures with numerical techniques (Gaussian quadrature) to estimate the marginal likelihood surface at a given position, x

• We can further approximate the likelihood surface by fitting a scaled gamma distribution
  – This massively reduces the computational load of subsequent steps
  – In the case of no recombination the truth is a scaled gamma distribution

010111000000000000000000000000000111001110000001001010000000000000000000000000101000100010

$\Pr(H_1, H_2 \mid t_x)$
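A sketch of the gamma-fitting step, under stated assumptions. The surface is evaluated on a grid of coalescence times and a scaled gamma, c × Gamma(shape, rate), is fitted by matching the area, mean and variance. For simplicity the integrals use the trapezoid rule rather than the Gaussian quadrature mentioned above, and the surface used is the no-recombination case, where the number of differences given t is taken to be Poisson with mean θt. Function names are illustrative.

```python
import math


def fit_scaled_gamma(ts, values):
    """Moment-match c * Gamma(shape, rate) to a surface sampled on the grid ts."""
    def trapz(ys):
        return sum(0.5 * (ys[k] + ys[k + 1]) * (ts[k + 1] - ts[k])
                   for k in range(len(ts) - 1))

    c = trapz(values)                                          # area under the surface
    mean = trapz([t * v for t, v in zip(ts, values)]) / c
    second = trapz([t * t * v for t, v in zip(ts, values)]) / c
    var = second - mean * mean
    return c, mean * mean / var, mean / var                    # scale, shape, rate


def pair_likelihood_no_recombination(t, n_diffs, theta):
    """Pr(n_diffs differences | coalescence time t) for two sequences."""
    lam = theta * t
    return math.exp(-lam) * lam ** n_diffs / math.factorial(n_diffs)


ts = [0.01 * k for k in range(4001)]  # grid on [0, 40] in coalescent units
values = [pair_likelihood_no_recombination(t, n_diffs=3, theta=2.0) for t in ts]
c, shape, rate = fit_scaled_gamma(ts, values)
print("scale:", c, "shape:", shape, "rate:", rate)
```

With no recombination the surface is exactly gamma-shaped (shape = number of differences + 1, rate = θ), so the fit recovers it up to numerical error; with recombination the gamma is only an approximation.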


Combining surfaces

• Suppose we have a partially-reconstructed tree

• We can approximate the probability of any further step in the tree using the composite-likelihood

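[Equation: Pr(next step in the tree) is approximated by a product over pairs of lineages i, j of pairwise terms of the form $\Pr(H_i, H_j \mid t)$; the accompanying figure marks the times $t_0$ and $t$ on the partially reconstructed tree]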

(assumes un-coalesced ancestors are independent draws from stationary distribution)


An important detail

• Actually, don’t use exactly this construction

• Use a ‘nearest-neighbour’ construction
  – Each lineage chooses a nearest neighbour
  – Choose which nearest-neighbour event occurs
  – Choose a time for the nearest-neighbour event

• Still uses composite likelihood


Building the tree

• We can use these functions to choose (e.g. maximise or sample) the next event

• The gamma approximation leads to an efficient algorithm for estimating the local genealogy that has the same time and memory complexity as neighbour-joining

• This means it can be applied to large data sets
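A heavily simplified sketch of a greedy, nearest-neighbour style tree builder driven by gamma summaries. Everything specific here is an assumption made for illustration and not the algorithm from the talk: each pair of lineages carries a (shape, rate) gamma summary of its coalescence time, the next event is chosen by maximising (smallest gamma mode), and after a merge the two surfaces to any third lineage are multiplied, which for gamma-shaped curves gives shape a1 + a2 - 1 and rate b1 + b2. This naive version also makes no attempt to match neighbour-joining complexity.

```python
def gamma_mode(shape, rate):
    """Mode of a gamma-shaped surface (0 when shape <= 1)."""
    return max(shape - 1.0, 0.0) / rate


def pair_key(a, b):
    return (a, b) if a < b else (b, a)


def build_tree(pair_gamma, n):
    """Greedy agglomeration over leaves 0..n-1.

    pair_gamma maps (i, j), i < j, to an assumed (shape, rate) summary.
    Returns a list of (lineage_a, lineage_b, time); merged lineages get ids n, n+1, ...
    """
    surfaces = dict(pair_gamma)
    active = list(range(n))
    now, next_id, merges = 0.0, n, []

    def event_time(a, b):
        return gamma_mode(*surfaces[pair_key(a, b)])

    while len(active) > 1:
        # Each lineage chooses its nearest neighbour.
        nearest = {x: min((y for y in active if y != x),
                          key=lambda y: event_time(x, y))
                   for x in active}
        # Choose which nearest-neighbour event occurs (here: the earliest).
        a = min(active, key=lambda x: event_time(x, nearest[x]))
        b = nearest[a]
        # Choose a time for the event, never earlier than the previous one.
        now = max(now, event_time(a, b))
        merges.append((a, b, now))
        active = [c for c in active if c not in (a, b)]
        # Combine surfaces by multiplying the two gamma-shaped curves.
        for c in active:
            sa, ra = surfaces[pair_key(a, c)]
            sb, rb = surfaces[pair_key(b, c)]
            surfaces[pair_key(next_id, c)] = (sa + sb - 1.0, ra + rb)
        active.append(next_id)
        next_id += 1
    return merges


pair_gamma = {(0, 1): (3.0, 4.0), (0, 2): (3.0, 1.0), (1, 2): (3.0, 1.0)}
print(build_tree(pair_gamma, 3))
```

Replacing each `min` with a random draw weighted by the surfaces would give a stochastic version, in the spirit of sampling rather than maximising the next event.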


Desirable properties of the algorithm

• It can be fully stochastic (unlike NJ, UPGMA, ML)

• It returns the prior in the absence of data:

• It returns the truth in the limit of infinite data:

• It is correct for a single SNP:

• It is close to the optimal proposal distribution (as defined by Stephens and Donnelly 2000) in the case of no recombination

• It uses much of the available information

• It is fast – time complexity in n the same as for NJ



Example: mutation rate = recombination rate

[Figure panels: The true tree at 0.5; Simulations]

θ = 10, R = 10


How to evaluate tree accuracy?

• Specific applications will require different aspects of the estimated trees to be more or less accurate

• Nevertheless, a general approach is to compare the representation of bi-partitions in the true tree to estimated ones

• Rather than require 100% accuracy in predicting a bi-partition, we can (for every observed bi-partition) find the ‘most similar’ bi-partition in the estimated tree

• We should also weight by the branch length associated with each bi-partition
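One way the comparison could be coded, as a sketch. It assumes that similarity between two bi-partitions is measured by the squared correlation (r²) between their 0/1 membership vectors, and that the overall score is the branch-length-weighted average of the best match for each bi-partition in the true tree. The data layout and function names are illustrative.

```python
def r_squared(split_a, split_b, n):
    """Squared correlation between the membership vectors of two bi-partitions of n leaves."""
    a, b = len(split_a), len(split_b)
    both = len(split_a & split_b)
    denom = a * (n - a) * b * (n - b)
    if denom == 0:
        return 0.0  # an empty or complete split carries no information
    return (n * both - a * b) ** 2 / denom


def weighted_max_r2(true_splits, est_splits, n):
    """Branch-length-weighted average of the best r2 match for each true bi-partition.

    true_splits: list of (frozenset of leaves, branch length)
    est_splits:  list of frozensets of leaves
    """
    total_length = sum(length for _, length in true_splits)
    score = sum(length * max(r_squared(split, est, n) for est in est_splits)
                for split, length in true_splits)
    return score / total_length


true_splits = [(frozenset({0, 1}), 1.0), (frozenset({0, 1, 2}), 0.5)]
est_splits = [frozenset({0, 1}), frozenset({3, 4})]
print(weighted_max_r2(true_splits, est_splits, n=5))
```

Because r² is unchanged when a split is replaced by its complement, {3, 4} matches {0, 1, 2} perfectly among five leaves, so no convention is needed for which side of a branch is recorded.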


Simple distance weighting (UPGMA): average weighted max r² = 0.59

Single sample from posterior: average weighted max r² = 0.96

n = 100, θ = 20, R = 30 (hotspots)


Open questions

• Obtaining useful estimates of uncertainty
  – Power transformations of composite likelihood function

• Using larger subsets of the data
  – E.g. quartets


Acknowledgements

• Many thanks to

• Oxford statistics
  – Simon Myers, Chris Hallsworth, Adam Auton, Colin Freeman, Peter Donnelly

• Lancaster
  – Paul Fearnhead

• International HapMap Project