“In Silico” Genotyping for Genome Wide Association Scans · In Silico Genotyping For Family...

“In Silico” Genotyping for Genome Wide Association Scans

Turning a Flood of Data into a Deluge

Gonçalo Abecasis
University of Michigan

Lots of Genotypes Are Good… How About Even More Genotypes?

If millions of genotypes are good, wouldn’t billions be better?

Spend more dollars, euros, pounds, and …
  Examine more individuals …
  Examine more SNPs …

Inexpensive “in silico” genotyping strategies

Estimate genotypes for individuals related to those in GWAS sample
  Intuition for how in silico genotyping works

Estimate additional genotypes for individuals in the GWAS sample
  Facilitate comparisons across studies
  Improve coverage of the genome

In Silico Genotyping For Family Samples

Family members share large segments of chromosomes

If we genotype many related individuals, we will effectively be genotyping a few chromosomes many times

An alternative is to:
  Genotype a few markers on all samples
  Identify shared chromosomal segments that segregate in family
  Use a high-density panel to genotype a few samples per family
  Estimate missing genotypes in samples without high density data

The first two steps are optional, but very helpful

Burdick et al, Nat Genet, 2006

Genotype Inference
Part 1 – Observed Genotype Data

[Figure: pedigree with observed genotype data. Some family members have genotypes at every marker; others are typed at only a subset of markers, with untyped markers shown as ./.]

Genotype Inference
Part 2 – Inferring Allele Sharing

[Figure: the same pedigree with alleles arranged by chromosome to show which segments are shared between relatives; untyped markers shown as dots]

Genotype Inference
Part 3 – Imputing Missing Genotypes

[Figure: the same pedigree after imputation, with most missing genotypes filled in; a few genotypes remain missing (./.)]

Formal Approach

Consider full set of observed genotypes G

Evaluate pedigree likelihood L for each possible value of each missing genotype g_ij

Posterior probability for each missing genotype:

$$P(g_{ij} = x \mid G) = \frac{L(G,\, g_{ij} = x)}{L(G)}$$

Implemented using both Elston-Stewart (1972) and Lander-Green (1987) algorithms
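As a rough illustration of the posterior calculation, the sketch below evaluates the likelihood of a small nuclear family for each possible value of one missing genotype and normalizes by the total. It is a single-marker toy under stated assumptions (Hardy-Weinberg founder priors, Mendelian transmission, no genotyping error, no linkage information), not the Elston-Stewart or Lander-Green implementation. The function names and the allele frequency are hypothetical.

```python
def hwe_prior(g, p):
    """Hardy-Weinberg prior; g is the number of copies of the counted allele."""
    q = 1.0 - p
    return {0: q * q, 1: 2.0 * p * q, 2: p * p}[g]


def child_prob(child, father, mother):
    """Mendelian probability of a child's genotype given both parents."""
    tf, tm = father / 2.0, mother / 2.0  # chance each parent transmits the allele
    return {
        0: (1.0 - tf) * (1.0 - tm),
        1: tf * (1.0 - tm) + (1.0 - tf) * tm,
        2: tf * tm,
    }[child]


def family_likelihood(father, mother, children, p):
    """Likelihood of one fully specified nuclear family at a single marker."""
    like = hwe_prior(father, p) * hwe_prior(mother, p)
    for c in children:
        like *= child_prob(c, father, mother)
    return like


def posterior_missing_father(mother, children, p):
    """P(father = x | observed genotypes) = L(G, father = x) / L(G)."""
    joint = {x: family_likelihood(x, mother, children, p) for x in (0, 1, 2)}
    total = sum(joint.values())  # L(G): sum over all values of the missing genotype
    return {x: joint[x] / total for x in joint}


# Mother carries no copies, both children carry one copy, father is untyped.
post = posterior_missing_father(mother=0, children=[1, 1], p=0.3)
for x in (0, 1, 2):
    print(x, round(post[x], 4))
```

In this example the father must carry at least one copy of the allele, so the posterior for zero copies is exactly 0 and the remaining mass is split between one and two copies. In a real pedigree the likelihood would be computed jointly across linked markers, which is what makes the flanking genotypes informative.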

Model With Inferred Genotypes

Replace genotype score g with its expected value:

$$E(y_i) = \mu + \beta_g \bar{g}_i + \beta_c c_i + \dots$$

Where

$$\bar{g}_i = 2P(g_i = 2 \mid G) + P(g_i = 1 \mid G)$$

Association test implemented as score test or as likelihood ratio test
Variance component framework to allow for relatedness

Alternatives would be to
(a) impute genotypes with large posterior probabilities; or
(b) integrate joint distribution of unobserved genotypes in family
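The expected genotype score is simple to compute once posterior probabilities are available. The sketch below turns posteriors into expected allele counts and regresses a trait on them with ordinary least squares. This is a simplification for illustration only: it assumes unrelated individuals and no covariates, so it does not implement the variance component framework mentioned above. The function names and the data are hypothetical.

```python
def expected_genotype(post):
    """Expected allele count: 2 * P(g = 2 | G) + P(g = 1 | G)."""
    return 2.0 * post[2] + post[1]


def simple_regression(x, y):
    """Least-squares fit of y = mu + beta * x; returns (mu, beta)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta


# Hypothetical posteriors for four individuals (keys are allele counts).
posteriors = [
    {0: 1.0, 1: 0.0, 2: 0.0},  # genotype known with certainty
    {0: 0.0, 1: 1.0, 2: 0.0},
    {0: 0.0, 1: 0.5, 2: 0.5},  # uncertain imputed genotype
    {0: 0.0, 1: 0.0, 2: 1.0},
]
trait = [1.0, 3.0, 4.0, 5.0]  # hypothetical trait values

scores = [expected_genotype(p) for p in posteriors]
mu, beta_g = simple_regression(scores, trait)
print(scores, mu, beta_g)
```

An individual with an uncertain genotype contributes a fractional score (1.5 in the third row) instead of being dropped or forced to a hard call.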

Quantitative Trait GWAS in Sardinia

6,148 Sardinians from 4 towns in Ogliastra
Many close relationships among sampled individuals

Measured 98 aging related quantitative traits

Genotyping:
  10,000 SNPs measured in ~4,500 individuals
  500,000 SNPs measured in ~1,400 individuals

An Example Where We Know The Answer

[Figure: LOD score plotted against genomic position; panels: without imputation, with imputation, linkage]

In Silico Genotyping For Case Control Samples

In families, we expected relatively long stretches of shared chromosome

In unrelated individuals, these stretches will typically be much shorter

Nevertheless, it may still be possible to identify stretches of shared chromosome …

… and by comparing shared stretches between densely genotyped individuals and those with sparser data

Observed Genotypes

Observed Genotypes (Study Sample)

. . . . A . . . . . . . A . . . . A . . .
. . . . G . . . . . . . C . . . . A . . .

Reference Haplotypes (HapMap)

C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G G
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T T C T G T G C

Identify Match Among Reference

Observed Genotypes

. . . . A . . . . . . . A . . . . A . . .

. . . . G . . . . . . . C . . . . A . . .



Phase Chromosome, Impute Missing Genotypes

Observed Genotypes

c g a g A t c t c c c g A c c t c A t g g
c g a a G c t c t t t t C t t t c A t g g



Implementation

Markov model is used to model each haplotype, conditional on all others

Gibbs sampler is used to estimate parameters and update haplotypes
  Each individual is updated conditional on all others
  In parallel to updating haplotypes, estimate “error rates” and “crossover” probabilities

In theory, this should be very close to the Li and Stephens (2003) model
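To make the copying idea concrete, here is a minimal sketch of a haplotype-copying Markov model applied to the ten reference haplotypes and the two sparse study chromosomes from the earlier slides. It is not the MaCH implementation: it swaps the Gibbs sampler for a single Viterbi pass, treats each study chromosome as already phased, and uses fixed, made-up values for the switch ("crossover") and error rates. The function name `impute_haplotype` is hypothetical.

```python
import math

REFERENCE = [
    "CGAGATCTCCTTCTTCTGTGC", "CGAGATCTCCCGACCTCATGG",
    "CCAAGCTCTTTTCTTCTGTGC", "CGAAGCTCTTTTCTTCTGTGC",
    "CGAGACTCTCCGACCTTATGC", "TGGGATCTCCCGACCTCATGG",
    "CGAGATCTCCCGACCTTGTGC", "CGAGACTCTTTTCTTTTGTAC",
    "CGAGACTCTCCGACCTCGTGC", "CGAAGCTCTTTTCTTCTGTGC",
]
OBSERVED = ["....A.......A....A...", "....G.......C....A..."]  # '.' = untyped


def impute_haplotype(observed, reference, switch=0.01, error=0.0001):
    """Most likely mosaic of reference haplotypes (Viterbi path).

    switch and error are assumed, fixed parameters; a sampler would
    estimate them from the data instead.
    """
    k, m = len(reference), len(observed)
    log_stay = math.log(1.0 - switch + switch / k)  # keep copying the same haplotype
    log_jump = math.log(switch / k)                 # switch to one specific other haplotype

    def emit(state, site):
        if observed[site] == ".":
            return 0.0  # untyped site carries no information
        match = reference[state][site] == observed[site]
        return math.log(1.0 - error) if match else math.log(error)

    score = [math.log(1.0 / k) + emit(s, 0) for s in range(k)]
    back = []
    for site in range(1, m):
        best_prev = max(range(k), key=lambda s: score[s])
        new_score, pointers = [], []
        for s in range(k):
            stay = score[s] + log_stay
            jump = score[best_prev] + log_jump
            prev, best = (s, stay) if stay >= jump else (best_prev, jump)
            new_score.append(best + emit(s, site))
            pointers.append(prev)
        score = new_score
        back.append(pointers)

    # Trace back the best path and read the imputed alleles off the reference.
    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return "".join(reference[s][i] for i, s in enumerate(path)), path


for obs in OBSERVED:
    hap, path = impute_haplotype(obs, REFERENCE)
    print(obs, "->", hap)
```

Several reference haplotypes can explain the same three typed alleles equally well, so the Viterbi path breaks ties arbitrarily and need not reproduce the exact mosaic drawn on the slide. Averaging over many sampled paths, as a Gibbs sampler does, reflects that uncertainty instead of hiding it.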

Does This Actually Work? Preliminary Results

Used 11 tag SNPs to predict 84 SNPs in CFH

Predicted genotypes differ from original ~1.8% of the time

Reasonably similar results are possible using methods such as PHASE and fastPHASE

[Figure: Comparison of Test Statistics, Truth vs. Imputed. Chi-square test statistic for disease-marker association; axes: Imputed Data and Experimental Data, each from 0 to 200]

Does This Really Work?

Used ~300,000 SNPs from Illumina HumanHap300 to impute 2.1M HapMap SNPs in 2500 individuals from a study of type II diabetes (Scott et al, Science, 2007)

Compared imputed genotypes with actual experimental genotypes in a candidate region on chromosome 14
  1190 individuals, 521 markers not on Illumina chip

Results of comparison
  Average r² with true genotypes 0.92 (median 0.97)
  1.4% of imputed alleles mismatch original
  2.8% of imputed genotypes mismatch
  Most errors concentrated on worst 3% of SNPs
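The three accuracy measures above can be computed per SNP from paired true and imputed genotypes. The sketch below shows one plausible set of definitions; the talk does not spell out the exact formulas, so treat these as assumptions. Genotypes are coded as allele counts (0, 1, 2) and the data are made up.

```python
def genotype_mismatch_rate(true, imputed):
    """Fraction of individuals whose imputed genotype differs from the truth."""
    return sum(t != i for t, i in zip(true, imputed)) / len(true)


def allele_mismatch_rate(true, imputed):
    """Fraction of alleles that differ, counting two alleles per genotype."""
    return sum(abs(t - i) for t, i in zip(true, imputed)) / (2 * len(true))


def r_squared(true, imputed):
    """Squared Pearson correlation between true and imputed allele counts."""
    n = len(true)
    mean_t = sum(true) / n
    mean_i = sum(imputed) / n
    cov = sum((t - mean_t) * (i - mean_i) for t, i in zip(true, imputed))
    var_t = sum((t - mean_t) ** 2 for t in true)
    var_i = sum((i - mean_i) ** 2 for i in imputed)
    return cov * cov / (var_t * var_i)


# Hypothetical genotypes at one SNP for eight individuals.
true_g = [0, 1, 2, 1, 0, 2, 1, 1]
imputed_g = [0, 1, 2, 1, 0, 2, 2, 1]  # one heterozygote called as homozygote

print(genotype_mismatch_rate(true_g, imputed_g))
print(allele_mismatch_rate(true_g, imputed_g))
print(r_squared(true_g, imputed_g))
```

Under these definitions a genotype that is wrong by a single allele counts as one genotype error but only half an allele error, which is consistent with the reported allele mismatch rate (1.4%) being half the genotype mismatch rate (2.8%).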

Back to Sardinia G6PD Activity Example …

[Figure: association results plotted against genomic position]

After imputing HapMap SNPs, a region on chromosome 1 becomes the top hit after G6PD and HBB

The new hit is upstream of 6PGD

6-phosphogluconate dehydrogenase is an enzyme that is known to metabolize some of the same substrates as G6PD

Combined Lipid Scans

SardiNIA (Schlessinger, Uda, et al.)
  ~4,300 individuals, cohort

FUSION (Mohlke, Boehnke, Collins, et al.)
  ~2,500 individuals

DGI (Kathiresan, Altshuler, Orho-Melander, et al.)
  ~3,000 individuals

Individually, 1-3 hits/scan, mostly known loci

Analysis:
  Impute genotypes so that all scans are analyzed at the same “SNPs”
  Carry out meta-analysis of results across scans
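Once every scan has a test statistic at the same SNP, the per-study results can be combined. The slides do not say which meta-analysis method was used, so the sketch below assumes one common choice: a sample-size-weighted combination of signed z-scores. The z-scores are made up; only the approximate sample sizes come from the slide. The function name `combine_z` is hypothetical.

```python
import math


def combine_z(z_scores, sample_sizes):
    """Sample-size-weighted z-score meta-analysis for one SNP.

    Each z-score must be signed with respect to the same reference allele.
    Returns the combined z-score and its two-sided normal p-value.
    """
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    z = numerator / denominator
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of the standard normal
    return z, p


# Approximate sample sizes for SardiNIA, FUSION, and DGI; z-scores are hypothetical.
sizes = [4300, 2500, 3000]
z_by_study = [2.1, 1.8, 2.4]

z_meta, p_meta = combine_z(z_by_study, sizes)
print(z_meta, p_meta)
```

Three modest signals pointing in the same direction combine into a stronger one, which is why analyzing all scans at a shared set of imputed SNPs can reveal loci that no single scan detects.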

Combined Lipid Scan Results

LDL-C association near LDLR

[Figure panels: SNPs typed by all 3 groups (44,998); Affy panel SNPs (320,681); Imputed SNPs (~2.25 million)]

Does This Work Across Populations?

Conrad et al. (2006) dataset

52 regions, each ~330 kb

Human Genome Diversity Panel
  ~927 individuals, 52 populations

1864 SNPs
  Grid of 872 SNPs used as tags
  Predicted genotypes for the other 992 SNPs
  Compared predictions to actual genotypes

Tag SNP Portability

(Evaluation Using ~1 SNP per 10kb in 52 x 300kb regions For Imputation)

Comparison With IMPUTE

We compared our results with IMPUTE across all the HGDP populations

We found that:

Genotypes imputed by MACH were more concordant with original genotypes in 29/52 populations

Genotypes imputed by IMPUTE were more concordant with original genotypes in 7/52 populations

Overall, the two methods are more concordant with each other than with the real data

Acknowledgements

Sardinia Collaborators, led by:
  David Schlessinger, Antonio Cao, Manuela Uda, Ed Lakatta, Paul Costa
  Analysis by Serena Sanna, Paul Scheet, Weimin Chen

FUSION Investigators, led by:
  Mike Boehnke, Francis Collins, Karen Mohlke, Jaakko Tuomilehto, Richard Bergman
  Analysis by Cristen Willer and Yun Li

DGI Investigators:
  Sekar Kathiresan, David Altshuler and colleagues

MaCH Development
  Yun Li, Paul Scheet, Jun Ding

[email protected]@umich.edu
