David Evans and Stacey Cherny University of Oxford ...

54
Calculation of IBD probabilities David Evans and Stacey Cherny University of Oxford Wellcome Trust Centre for Human Genetics

Transcript of David Evans and Stacey Cherny University of Oxford ...

Page 1: David Evans and Stacey Cherny University of Oxford ...

Calculation of IBD probabilities

David Evans and Stacey Cherny

University of OxfordWellcome Trust Centre for Human

Genetics

Page 2: David Evans and Stacey Cherny University of Oxford ...

This Session …IBD vs IBSWhy is IBD important?Calculating IBD probabilities

Lander-Green Algorithm (MERLIN)Single locus probabilitiesHidden Markov Model

Other ways of calculating IBD statusElston-Stewart AlgorithmMCMC approaches

MERLINPractical Example

IBD determinationInformation content mapping

SNPs vs micro-satellite markers?

Page 3: David Evans and Stacey Cherny University of Oxford ...

Aim of Gene Mapping Experiments

Identify variants that control interesting traits

Susceptibility to human diseasePhenotypic variation in the population

The hypothesisIndividuals sharing these variants will be more similar for traits they control

The difficulty…Testing ~10 million variants is impractical…

Page 4: David Evans and Stacey Cherny University of Oxford ...

Identity-by-Descent (IBD)

Two alleles are IBD if they are descended from the same ancestral allele

If a stretch of chromosome is IBD among a set of individuals, ALL variants within that stretch will also be shared IBD (markers, QTLs, disease genes)

Allows surveys of large amounts of variation even when a few polymorphisms measured

Page 5: David Evans and Stacey Cherny University of Oxford ...

A Segregating Disease Allele

+/+ +/mut

+/+ +/mut +/mut+/mut

+/mut

+/+

+/+

All affected individuals IBD for disease causing mutation

Page 6: David Evans and Stacey Cherny University of Oxford ...

Segregating Chromosomes

MARKER

DISEASE LOCUS

Affected individuals tend to share adjacent areas of chromosome IBD

Page 7: David Evans and Stacey Cherny University of Oxford ...

Marker Shared Among Affecteds

“4” allele segregates with disease

1/2 3/4

4/41/42/41/33/4

4/4 1/4

Page 8: David Evans and Stacey Cherny University of Oxford ...

Why is IBD sharing important?

1/2 3/4

4/41/42/41/33/4

4/4 1/4

IBD sharing forms the basis of non-parametric linkage statistics

Affected relatives tend to share marker alleles close to the disease locus IBD more often than chance

Page 9: David Evans and Stacey Cherny University of Oxford ...

Linkage between QTL and marker

Marker

QTL

IBD 0 IBD 1 IBD 2

Page 10: David Evans and Stacey Cherny University of Oxford ...

NO Linkage between QTL and marker

Marker

Page 11: David Evans and Stacey Cherny University of Oxford ...

IBD vs IBS

1 4

2 41 3

31

Identical by Descentand

Identical by State

2 1

2 31 1

31

Identical by state only

Page 12: David Evans and Stacey Cherny University of Oxford ...

Example: IBD in Siblings

Consider a mating between mother AB x father CD:

IBD 0 : 1 : 2 = 25% : 50% : 25%

2110BD1201BC1021AD0112AC

BDBCADACSib1

Sib2

Page 13: David Evans and Stacey Cherny University of Oxford ...

IBD can be trivial…

1

1 1

1

/ 2 2/

2/ 2/

IBD=0

Page 14: David Evans and Stacey Cherny University of Oxford ...

Two Other Simple Cases…

1

1 1

1

/

2/ 2/

1 1/

1 12/ 2/

IBD=2

2 2/ 2 2/

Page 15: David Evans and Stacey Cherny University of Oxford ...

A little more complicated…

1 2/

IBD=1(50% chance)

2 2/

1 2/ 1 2/

IBD=2(50% chance)

Page 16: David Evans and Stacey Cherny University of Oxford ...

And even more complicated…

1 1/IBD=? 1 1/

Page 17: David Evans and Stacey Cherny University of Oxford ...

Bayes Theorem

∑=

=

=

jAjBPAjP

AiBPAiPBP

AiBPAiPBP

BBAiP

)|()()|()(

)(|()(

)(),P()|(

)

Ai

Page 18: David Evans and Stacey Cherny University of Oxford ...

Bayes Theorem for IBD Probabilities

∑ ====

=

===

===

jjIBDGPjIBDP

iIBDGPiIBDPGP

iIBDGPiIBDPGP

GiGiIBDP

)|()()|()(

)()|()(

)(), P(IBD)|(

Page 19: David Evans and Stacey Cherny University of Oxford ...

p22p2

3p24A2A2A2A2

0p1p222p1p2

3A1A2A2A2

00p12p2

2A1A1A2A2

0p1p222p1p2

3A2A2A1A2

2p1p2p1p24p12p2

2A1A2A1A2

0p12p22p1

3p2A1A1A1A2

00p12p2

2A2A2A1A1

0p12p22p1

3p2A1A2A1A1

p12p1

3p14A1A1A1A1

k=2k=1k=0

P(observing genotypes / k alleles IBD)Sib 2Sib 1

P(Marker Genotype|IBD State)

Page 20: David Evans and Stacey Cherny University of Oxford ...

Worked Example

1 1/ 1 1/

p1 = 0.5

Page 21: David Evans and Stacey Cherny University of Oxford ...

Worked Example

1 1/ 1 1/ 94

)(4

1)|2(

94

)(2

1)|1(

91

)(4

1)|0(

649

41

21

41)(

41)2|(8

1)1|(16

1)0|(

5.0

21

3

41

21

31

41

21

31

41

1

1

===

===

===

=++=

===

===

===

=

GP

pGIBDP

GP

pGIBDP

GP

pGIBDP

pppGP

pIBDGP

pIBDGP

pIBDGP

p

Page 22: David Evans and Stacey Cherny University of Oxford ...

For ANY PEDIGREE the inheritance pattern at every point in the genome can be completely described by a binary inheritance vector:

v(x) = (p1, m1, p2, m2, …,pn,mn)

whose coordinates describe the outcome of the 2n paternal and maternal meioses giving rise to the n non-founders in the pedigree

pi (mi) is 0 if the grandpaternal allele transmittedpi (mi) is 1 if the grandmaternal allele is transmitted

a c/ b d/

a b/ c d/

v(x) = [0,0,1,1]p1 m1p2 m2

Page 23: David Evans and Stacey Cherny University of Oxford ...

Inheritance Vector

Inheritance vector Prior Posterior------------------------------------------------------------------0000 1/16 1/80001 1/16 1/80010 1/16 00011 1/16 00100 1/16 1/80101 1/16 1/80110 1/16 00111 1/16 01000 1/16 1/81001 1/16 1/81010 1/16 01011 1/16 01100 1/16 1/81101 1/16 1/81110 1/16 01111 1/16 0

In practice, it is not possible to determine the true inheritance vector at every point in the genome, rather we represent partial information as a probability distribution over the 22n possible inheritance vectors

a b

p1

m2p2

m1

b ba c

a ca b 1 2

3 4

5

Page 24: David Evans and Stacey Cherny University of Oxford ...

Computer Representation

Define inheritance vector vℓEach inheritance vector indexed by a different memory locationLikelihood for each gene flow pattern

Conditional on observed genotypes at location ℓ22n elements !!!

At each marker location ℓ

0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111

L L L L L L L L L L L L L L L L

Page 25: David Evans and Stacey Cherny University of Oxford ...

a) bit-indexed array

b) packed tree

c) sparse tree

Legend

Node with zero likelihood

Node identical to sibling

Likelihood for this branch

0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111

L1 L2 L1 L2 L1 L2 L1 L2

L1 L2 L1 L2

L1 L2 L1 L2 L1 L2 L1 L2

Abecasis et al (2002) Nat Genet 30:97-101

Page 26: David Evans and Stacey Cherny University of Oxford ...

Multipoint IBD

IBD status may not be able to be ascertained with certainty because e.g. the mating is not informative, parental information is not availableIBD information at uninformative loci can be made more precise by examining nearby linked loci

Page 27: David Evans and Stacey Cherny University of Oxford ...

a c/ b d/1 1/ 1 2/

a b/1 1/

c d/1 2/

Multipoint IBD

IBD = 0

IBD = 0 or IBD =1?

Page 28: David Evans and Stacey Cherny University of Oxford ...

Complexity of the Problemin Larger Pedigrees

2n meioses in pedigree with n non-founders

Each meiosis has 2 possible outcomesTherefore 22n possibilities for each locus

For each genetic locusOne location for each of m genetic markersDistinct, non-independent meiotic outcomes

Up to 4nm distinct outcomes!!!

Page 29: David Evans and Stacey Cherny University of Oxford ...

0000

0001

0010

1111

2 3 4 m = 10…

Marker

Inheritance vector

(22xn)m = (22 x 2)10 = 1012 possible paths !!!

Example: Sib-pair Genotyped at 10 Markers

1

Page 30: David Evans and Stacey Cherny University of Oxford ...

Lander-Green AlgorithmThe inheritance vector at a locus is conditionally independent of the inheritance vectors at all preceding loci given the inheritance vector at the immediately preceding locus (“Hidden Markov chain”)

The conditional probability of an inheritance vector vi+1 at locus i+1, given the inheritance vector vi at locus i is θi

j(1-θi)2n-j where θ is the recombination fraction and j is the number of changes in elements of the inheritance vector (“transition probabilities”)

[0000]

Conditional probability = (1 – θ)3θ

Locus 2Locus 1[0001]

Example:

Page 31: David Evans and Stacey Cherny University of Oxford ...

0000

0001

0010

1111

1 2 3 m…

Total Likelihood = 1’Q1T1Q2T2…Tm-1Qm1

P[0000]

0

Qi =0

P[1111]

P[0001]00

000

000

00

22n x 22n diagonal matrix of single locus probabilitiesat locus i

(1-θ)4

θ4

Ti =(1-θ)3θ

(1-θ)4

(1-θ)4

…θ4

………

…(1-θ)θ3

(1-θ)3θ(1-θ)θ3

22n x 22n matrix of transitional probabilities betweenlocus i and locus i+1

~10 x (22 x 2)2 operations = 2560 for this case !!!

Page 32: David Evans and Stacey Cherny University of Oxford ...

0000

0001

0010

1111

2 3 4 m = 10…

Marker

Inheritance vector

(L[0000] + L[0101] + L[1010] + L[1111] ) / L[ALL]

P(IBD) = 2 at Marker Three

1

L[IBD = 2 at marker 3] / L[ALL]

Page 33: David Evans and Stacey Cherny University of Oxford ...

0000

0001

0010

1111

2 3 4 m = 10…

Marker

Inheritance vector

P(IBD) = 2 at arbitrary position on the chromosome

1

(L[0000] + L[0101] + L[1010] + L[1111] ) / L[ALL]

Page 34: David Evans and Stacey Cherny University of Oxford ...

Further speedups…

Trees summarize redundant informationPortions of inheritance vector that are repeatedPortions of inheritance vector that are constant or zeroUse sparse-matrix by vector multiplication

Regularities in transition matricesUse symmetries in divide and conquer algorithm (Idury & Elston, 1997)

Page 35: David Evans and Stacey Cherny University of Oxford ...

Lander-Green Algorithm Summary

Factorize likelihood by markerComplexity ∝ m·en

Large number of markers (e.g. dense SNP data)Relatively small pedigreesMERLIN, GENEHUNTER, ALLEGRO etc

Page 36: David Evans and Stacey Cherny University of Oxford ...

Elston-Stewart Algorithm

Factorize likelihood by individualComplexity ∝ n·em

Small number of markersLarge pedigrees

With little inbreeding

VITESSE etc

Page 37: David Evans and Stacey Cherny University of Oxford ...

Other methods

Number of MCMC methods proposed~Linear on # markers~Linear on # people

Hard to guarantee convergence on very large datasets

Many widely separated local minima

E.g. SIMWALK, LOKI

Page 38: David Evans and Stacey Cherny University of Oxford ...

MERLIN-- Multipoint Engine for Rapid Likelihood Inference

Page 39: David Evans and Stacey Cherny University of Oxford ...

Capabilities

Linkage AnalysisNPL and K&C LODVariance Components

HaplotypesMost likelySamplingAll

IBD and info content

Error DetectionMost SNP typing errors are Mendelian consistent

RecombinationNo. of recombinants per family per interval can be controlled

Simulation

Page 40: David Evans and Stacey Cherny University of Oxford ...

MERLIN Website

Reference

FAQ

Source

Binaries

TutorialLinkageHaplotypingSimulationError detectionIBD calculation

www.sph.umich.edu/csg/abecasis/Merlin

Page 41: David Evans and Stacey Cherny University of Oxford ...

Test Case Pedigrees

Page 42: David Evans and Stacey Cherny University of Oxford ...

Timings – Marker Locations

A (x1000) B C DGenehunter 38s 37s 18m16s *

Allegro 18s 2m17s 3h54m13s *Merlin 11s 18s 13m55s *

A (x1000) B C DGenehunter 45s 1m54s * *

Allegro 18s 1m08s 1h12m38s *Merlin 13s 25s 15m50s *

Top Generation Genotyped

Top Generation Not Genotyped

Page 43: David Evans and Stacey Cherny University of Oxford ...

Intuition: Approximate Sparse T

Dense maps, closely spaced markersSmall recombination fractions θReasonable to set θk with zero

Produces a very sparse transition matrix

Consider only elements of v separated by <k recombination events

At consecutive locations

Page 44: David Evans and Stacey Cherny University of Oxford ...

Additional Speedup…Time Memory

Exact 40s 100 MB

No recombination <1s 4 MB≤1 recombinant 2s 17 MB≤2 recombinants 15s 54 MB

Genehunter 2.1 16min 1024MB

Keavney et al (1998) ACE data, 10 SNPs within gene,4-18 individuals per family

Page 45: David Evans and Stacey Cherny University of Oxford ...

Input Files

Pedigree FileRelationshipsGenotype dataPhenotype data

Data FileDescribes contents of pedigree file

Map FileRecords location of genetic markers

Page 46: David Evans and Stacey Cherny University of Oxford ...

Example Pedigree File<contents of example.ped>1 1 0 0 1 1 x 3 3 x x1 2 0 0 2 1 x 4 4 x x1 3 0 0 1 1 x 1 2 x x1 4 1 2 2 1 x 4 3 x x1 5 3 4 2 2 1.234 1 3 2 21 6 3 4 1 2 4.321 2 4 2 2<end of example.ped>

Encodes family relationships, marker and phenotype information

Page 47: David Evans and Stacey Cherny University of Oxford ...

Example Data File<contents of example.dat>T some_trait_of_interestM some_markerM another_marker<end of example.dat>

Provides information necessary to decode pedigree file

Page 48: David Evans and Stacey Cherny University of Oxford ...

Data File Field Codes

Zygosity.Z

Covariate.C

Quantitative Trait.T

Affection Status.A

Marker GenotypeM

DescriptionCode

Page 49: David Evans and Stacey Cherny University of Oxford ...

Example Map File<contents of example.map>CHROMOSOME MARKER POSITION2 D2S160 160.02 D2S308 165.0…<end of example.map>

Indicates location of individual markers, necessary to derive recombination fractions between them

Page 50: David Evans and Stacey Cherny University of Oxford ...

Worked Example

1 1/ 1 1/

94)|2(

94)|1(

91)|0(

5.01

==

==

==

=

GIBDP

GIBDP

GIBDP

p

merlin –d example.dat –p example.ped –m example.map --ibd

Page 51: David Evans and Stacey Cherny University of Oxford ...

Application: Information Content Mapping

Information content: Provides a measure of how well a marker set approaches the goal of completely determining the inheritance outcome

Based on concept of entropyE = -ΣPilog2Pi where Pi is probability of the ith outcome

IE(x) = 1 – E(x)/E0Always lies between 0 and 1Does not depend on test for linkageScales linearly with power

Page 52: David Evans and Stacey Cherny University of Oxford ...

Application: Information Content Mapping

Simulations (sib-pairs with/out parental genotypes)1 micro-satellite per 10cM (ABI)1 microsatellite per 3cM (deCODE)1 SNP per 0.5cM (Illumina)1 SNP per 0.2 cM (Affymetrix)

Which panel performs best in terms of extracting marker information?Do the results depend upon the presence of parental genotypes?

merlin –d file.dat –p file.ped –m file.map --information --step 1 --markerNames

Page 53: David Evans and Stacey Cherny University of Oxford ...

SNPs + parents

microsat + parents

SNP microsat0.2 cM 3 cM0.5 cM 10 cM

Densities

0.0

0.10.2

0.30.4

0.5

0.60.7

0.80.9

1.0

0 10 20 30 40 50 60 70 80 90 100Position (cM)

Info

rmat

ion

Con

tent

SNPs vs Microsatellites with parents

Page 54: David Evans and Stacey Cherny University of Oxford ...

microsat - parents

SNPs - parents

SNP microsat0.2 cM 3 cM0.5 cM 10 cM

Densities

0.0

0.10.2

0.30.4

0.5

0.60.7

0.80.9

1.0

0 10 20 30 40 50 60 70 80 90 100Position (cM)

Info

rmat

ion

Con

tent

SNP microsat0.2 cM 3 cM0.5 cM 10 cM

Densities

SNPs vs Microsatellites without parents