METHODS FOR HAPLOTYPE RECONSTRUCTION

METHODS FOR HAPLOTYPE RECONSTRUCTION

Andrew MorrisWellcome Trust Centre for Human GeneticsMarch 6, 2003

Outline

Haplotypes and genotypes. Reconstruction in pedigrees. Reconstruction in unrelated

individuals. Interpretation and LD assessment. Two stage analyses.

Haplotypes and genotypes (1)

1 0 0 0 1

1 1 0 0 0

11 01 00 00 01

Haplotypes and genotypes (2)

Individuals that are homozygous at every locus, or heterozygous at just one locus can be resolved.

Individuals that are heterozygous at k loci are consistent with 2k-1 configurations of haplotypes.

Why do we need haplotypes?

Correlation between alleles at closely linked loci…

Fine-scale mapping studies. Association studies with multiple

markers in candidate genes. Investigating patterns of LD across

genomic regions. Inferring population histories.

Molecular methods

Single molecule dilution. Allele specific long range PCR.

Prone to errors. Expensive and inefficient: low

throughput.

Simplex family data (1)

00 01 00 11 x 01 11 01 01(M) (F)

00 01 01 01


00 01 00 11 x 01 11 01 01(M) (F)

00 01 01 01

Inferred haplotypes: 0001 / 0110


00 01 00 01 x 01 01 00 01(M) (F)

00 01 00 01

Cannot be fully resolved…

Pedigree data (1)

11 01 11 01 11 x 00 00 11 11 11

01 01 11 11 11 x 01 00 00 01 00

01 01 01 01 01 11 01 01 01 01 00 00 01 11 01

Pedigree data (1)

11111 / 10101 x 00111 / 00111

11111 / 00111 x 00010 / 10000

11111 / 00000 11111 / 10000 00111 / 00010

Pedigree data (2)

Many combinations of haplotypes may be consistent with pedigree genotype data.

Complex computational problem. Need to make assumptions about

recombination. SIMWALK and MERLIN.

Statistical approaches to reconstruct haplotypes in unrelated individuals

Parsimony methods: Clark’s algorithm. Likelihood methods: E-M algorithm. Bayesian methods: PHASE algorithm.

Aims: reconstruct haplotypes and/or estimate population frequencies.

Clark’s algorithm (1)

Reconstruct haplotypes in unresolved individuals via parsimony.

Minimise number of haplotypes observed in sample.

Microsatellite or SNP genotypes.

Clark’s algorithm (2)

1. Search for resolved individuals, and record all recovered haplotypes.

2. Compare each unresolved individual with list of recovered haplotypes.

3. If a recovered haplotype is identified, individual is resolved.

4. Complimentary haplotype added to list of recovered haplotypes.

5. Repeat 2-4 until all individuals are resolved or no more haplotypes can be recovered.

Example

(A) 00 01 01 00(B) 00 00 00 00(C) 00 01 00 00(D) 01 11 01 11(E) 00 11 01 01(F) 01 11 11 00(G) 00 01 11 01(H) 00 01 01 11(I) 00 00 00 00(J) 00 00 00 11

Example

(A) 00 01 01 00(B) 0000 / 0000(C) 0000 / 0100(D) 01 11 01 11(E) 00 11 01 01(F) 0110 / 1110(G) 00 01 11 01(H) 00 01 01 11(I)0000 / 0000(J)0001 / 0001

Recovered haplotypes:

00000100011011100001

Example

(A) 0000 / 0110(B) 0000 / 0000(C) 0000 / 0100(D) 01 11 01 11(E) 00 11 01 01(F) 0110 / 1110(G) 00 01 11 01(H) 00 01 01 11(I)0000 / 0000(J)0001 / 0001


0000 01110100011011100001

Example

(A) 0000 / 0110(B) 0000 / 0000(C) 0000 / 0100(D) 01 11 01 11(E) 0100 / 0111(F) 0110 / 1110(G) 00 01 11 01(H) 00 01 01 11(I)0000 / 0000(J)0001 / 0001


0000 01110100 0011011011100001

Example

(A) 0000 / 0110(B) 0000 / 0000(C) 0000 / 0100(D) 0111 / 1101(E) 0100 / 0111(F) 0110 / 1110(G) 0110 / 0011(H) 0001 / 0111(I)0000 / 0000(J)0001 / 0001


0000 01110100 00110110 110111100001

Example: problem…

(A) 0000 / 0110(B) 0000 / 0000(C) 0000 / 0100(D) 01 11 01 11(E) 0100 / 0111(F) 0110 / 1110(G) 00 01 11 01(H) 00 01 01 11(I)0000 / 0000(J)0001 / 0001


0000 01110100 0011011011100001

Example: problem…

(A) 0000 / 0110(B) 0000 / 0000(C) 0000 / 0100(D) 01 11 01 11(E) 0100 / 0111(F) 0110 / 1110(G) 00 01 11 01(H) 00 01 01 11(I)0000 / 0000(J)0001 / 0001


0000 01110100 0010011011100001

Clark’s algorithm: problems

Multiple solutions: try many different orderings of individuals.

No starting point for algorithm. Algorithm may leave many

unresolved individuals. How to deal with missing data?

E-M algorithm (1)

Maximum likelihood method for population haplotype frequency estimation.

Allows for the fact that unresolved genotypes could be constructed from many different haplotype configurations.

Microsatellite or SNP genotypes.

E-M algorithm (2)

Observed sample of N individuals with genotypes, G.

Unobserved population haplotype frequencies, h.

Unobserved configurations, H, consisting of a complimentary haplotype pairs Hi = {Hi1,Hi2}.

E-M algorithm (4)

Numerical algorithm used to obtain maximum likelihood estimates of h.

Initial set of haplotype frequencies h(0). Haplotype frequencies h(t) at iteration t

updated from frequencies at iteration t-1 using Expectation and Maximisation steps.

Continue until h(t) has converged.

Maximisation step

Compute haplotype frequencies using procedure equivalent to gene counting.

hs(t+1) = ∑k ∑i Zsi E(Hi|Gk,h(t))

2N Zsi = number of copies (0,1,2) of sth

haplotype in configuration Hi.

E-M algorithm: comments

Can handle missing data. For many loci, the number of

possible haplotypes is large, so population frequencies are difficult to estimate: re-parameterisation.

Does not provide reconstructed haplotype configuration for unresolved individuals: can use “maximum likelihood” configuration.

PHASE algorithm (1)

Treats haplotype configuration for each unresolved individual as an unobserved random quantity.

Evaluate the conditional distribution, given a sample of unresolved genotype data.

Microsatellite or SNP genotypes. Reconstruction and population

haplotype frequency estimation.

PHASE algorithm (2)

Bayesian framework: goal is to approximate posterior distribution of haplotype configurations f(H|G).

Implements Markov chain Monte Carlo (MCMC) methods to sample from f(H|G): Gibbs sampling.

Start at random configuration. Repeatedly select unresolved individuals at

random, and sample from their possible haplotype configurations, assuming all other individuals to be correctly resolved.

PHASE algorithm (3)

Initial haplotype configuration H(0). Subsequent iterations obtain H(t+1) from

H(t) using the following steps: Select an unresolved individual,i, at random. Sample Hi

(t+1) from f(Hi|G,H-i(t)).

Set Hk(t+1) = Hk

(t) for all k ≠ i.

On convergence, each sampled configuration represents random draw from f(H|G).

PHASE algorithm (4)

How to obtain f(Hi|G,H-i)? Base directly on sample frequency of

observed haplotypes in configuration H-i. Better to introduce prior model for

population haplotype frequencies, f(h). Coalescent process used to predict

likely patterns of haplotypes occurring in populations.

PHASE algorithm (5)

Key principle…

Configuration Hi is more likely to consist of haplotypes Hi1 and Hi2 that are exactly the same as, or similar to, haplotypes in the configuration H-i.

Example (1)

Resolved haplotypes 22544 and 22334. Unresolved individual 22 22 35 34 44.

Example (1)

Resolved haplotypes 22544 and 22334. Unresolved individual 22 22 35 34 44. Possible configurations…

(1) 22334 / 22544(2) 22534 / 22344

Example (1)


(1) 22334 / 22544(2) 22534 / 22344 -1 and +1

Example (1)


(1) 22334 / 22544(2) 22534 / 22344 -1 and +1

Assign high probability to sampling configuration (1).

Example (2)

Resolved haplotypes 22544 and 22334. Unresolved individual 22 22 46 34 44.

Example (2)

Resolved haplotypes 22544 and 22334. Unresolved individual 22 22 46 34 44. Possible configurations:

(1) 22434 / 22644(2) 22634 / 22444

Example (2)

Resolved haplotypes 22544 and 22334. Unresolved individual 22 22 46 34 44. Possible configurations:

(1) 22434 / 22644 +1 and +1(2) 22634 / 22444 +3 and -1

Assign high probability to sampling configuration (1).

PHASE algorithm: comments

Allows for uncertainty in haplotype reconstruction in Bayesian framework.

Can handle missing data. Coalescent process does not explicitly

allow for recombination, but performs well even when cross-over events occur (up to ~0.1cM).

Up to 50% more efficient than Clark’s algorithm or the E-M algorithm.

PHASE algorithm: output

“Best” reconstruction output for each individual. Uncertainty in reconstruction indicated by system

of brackets: [] inferred missing genotype uncertain with

posterior probability less than specified threshold; () inferred phase assignment uncertain with

posterior probability less than specified threshold.

[0] [(1)] 0 0 1 (0)[0] [(0)] 1 0 1 (1)

PHASE algorithm: interpretation

“Best” reconstruction not necessarily correct.

Uncertain haplotype configurations should be investigated further.

Effective targeting of additional genotyping costs.

Other Bayesian MCMC algorithms

HAPLOTYPER Prior model for haplotype frequencies given by

Dirichelet distribution. Deals with large number of SNPs by partition

ligation. Outputs “best” reconstruction with uncertainty

measured by posterior probability. HAPMCMC

Log-linear prior model for haplotype frequencies incorporating interactions corresponding to first order LD between SNPs.

Designed specifically for investigating LD across small genomic regions.

Don’t ignore uncertainty…

Tempting to treat “best” configuration as correct in subsequent analyses.

Can seriously affect inferences…

Example: LD across small genomic regions…

False positive error rates

Level E-M PHASE(BEST)

HAPMCMC(BEST)

HAPMCMC(POST)

0.01 0.008 0.204 0.060 0.009

0.05 0.045 0.362 0.183 0.056

0.1 0.103 0.446 0.247 0.091

0.2 0.212 0.566 0.392 0.201

0.5 0.524 0.767 0.640 0.527

Two stage analyses

For many studies, haplotype reconstruction is just an intermediate step required for a second stage of analysis.

We don’t actually care what the haplotypes are: missing data that we must account for in second stage of analysis.

Better to develop methods that allow for uncertainty in haplotype configuration in unresolved individuals, rather than haplotype reconstruction methods per-se.

Summary

Haplotype information required for analysis of high-density marker data.

Many algorithms available. Interpret output with care. Reconstructed haplotypes often not

of interest, but uncertainty must be accounted for in subsequent analyses.

METHODS FOR HAPLOTYPE RECONSTRUCTION

Documents

Transcript of METHODS FOR HAPLOTYPE RECONSTRUCTION