“In Silico” Genotyping for Genome Wide Association Scans · In Silico Genotyping For Family...
Transcript of “In Silico” Genotyping for Genome Wide Association Scans · In Silico Genotyping For Family...
“In “In SilicoSilico” Genotyping for” Genotyping forGenome Wide Association ScansGenome Wide Association Scans
Turning a Flood of Data into a DelugeTurning a Flood of Data into a Deluge
Gonçalo AbecasisGonçalo AbecasisUniversity of MichiganUniversity of Michigan
Lots of Genotypes Are Good…Lots of Genotypes Are Good…How About Even More Genotypes?How About Even More Genotypes?
If millions of genotypes are good, wouldn’t billions be better?If millions of genotypes are good, wouldn’t billions be better?
Spend more dollars, euros, pounds, and …Spend more dollars, euros, pounds, and …Examine more individuals …Examine more individuals …Examine more SNPs …Examine more SNPs …
Inexpensive “in Inexpensive “in silicosilico” genotyping strategies” genotyping strategies
Estimate genotypes for individuals related to those in GWAS sampEstimate genotypes for individuals related to those in GWAS sampleleIntuition for how Intuition for how in in silicosilico genotyping worksgenotyping works
Estimate additional genotypes for individuals in the GWAS sampleEstimate additional genotypes for individuals in the GWAS sampleFacilitate comparisons across studiesFacilitate comparisons across studiesImprove coverage of the genomeImprove coverage of the genome
In In SilicoSilico Genotyping For Genotyping For Family SamplesFamily Samples
Family members share large segments of chromosomesFamily members share large segments of chromosomes
If we genotype many related individuals, we will effectively If we genotype many related individuals, we will effectively be genotyping a few chromosomes many timesbe genotyping a few chromosomes many times
An alternative is to:An alternative is to:Genotype a few markers on all samplesGenotype a few markers on all samplesIdentify shared chromosomal segments that segregate in familyIdentify shared chromosomal segments that segregate in familyUse a highUse a high--density panel to genotype a few samples per familydensity panel to genotype a few samples per familyEstimate missing genotypes in samples without high density dataEstimate missing genotypes in samples without high density data
The first two steps are optional, but very helpfulThe first two steps are optional, but very helpful
Burdick et al, Nat Genet, 2006
Genotype InferenceGenotype InferencePart 1 Part 1 –– Observed Genotype DataObserved Genotype Data
A/A A/G A/A A/G
G/G C/C G/G
G/G G/T G/T G/TG/G G/T
A/T A/A A/T T/T
A/AT/T
T/T G/T
G/GT/T
A/AT/G
T/TG/TA/A
A/GT/T
A/G
C/G
T/TC/G
A/GA/TG/TG/TG/AT/TC/G
A/T
A/G./../.
G/T./../.
C/G
A/G./../.
T/T./../.
G/G
A/A./../.
G/G./../.
C/C
A/A./../.
G/G./../.
C/C
G/G./../.
T/T./../.
G/G
G/G./../.
T/T./../.
G/G
Genotype InferenceGenotype InferencePart 2 Part 2 –– Inferring Allele SharingInferring Allele Sharing
A A G A A A G AT A A A A T T TT T T G G G T GG G T G G T T GA G A A G G A AT T T T T T T GC G G G C C G G
A G A GT A A TT T G TG T G TA A G AT T T TC G C G
A G A G A A A A G G G G. . . . . . . . . . . .. . . . . . . . . . . .G T T T G G G G T T T T. . . . . . . . . . . .. . . . . . . . . . . .C G G G C C C C G G G G
Genotype InferenceGenotype InferencePart 3 Part 3 –– Imputing Missing GenotypesImputing Missing Genotypes
A/A A/G A/A A/G
G/G G/T G/T G/GG/G G/T
A/T A/A A/T T/TT/T G/T
T/TA/AT/G
T/T
G/G C/C G/G
A/AT/T
G/G
G/TA/A
A/GT/T
A/G
C/G
T/TC/G
A/GA/TG/TG/TG/AT/TC/G
A/T
A/G./../.
G/T./.
T/TC/G
A/G./../.
T/TA/AT/TG/G
A/AA/TG/TG/GA/GT/TC/C
A/AA/TG/TG/GA/GT/TC/C
G/GA/TT/TT/TA/AT/TG/G
G/GA/TT/TT/TA/AT/TG/G
Formal ApproachFormal ApproachConsider full set of observed genotypes Consider full set of observed genotypes GG
Evaluate pedigree likelihood Evaluate pedigree likelihood LL for each possible value of for each possible value of each missing genotype each missing genotype ggijij
Posterior probability for each missing genotypePosterior probability for each missing genotype
Implemented both using Implemented both using ElstonElston--Stewart (1972) and Stewart (1972) and LanderLander--Green (1987) algorithmsGreen (1987) algorithms
)(),(
)|(GL
xgGLGxgP ij
ij
===
Model With Inferred GenotypesModel With Inferred GenotypesReplace genotype score Replace genotype score gg with its expected value:with its expected value:
WhereWhere
Association test implemented as score test or as likelihood ratiAssociation test implemented as score test or as likelihood ratio testo testVariance component framework to allow for relatednessVariance component framework to allow for relatedness
Alternatives would be to Alternatives would be to (a) impute genotypes with large posterior probabilities; or (a) impute genotypes with large posterior probabilities; or (b) integrate joint distribution of unobserved genotypes in fami(b) integrate joint distribution of unobserved genotypes in familyly
...)( +++= cgyE cgi ββμ
)|1()|2(2 GgPGgPg iii =+==
Quantitative Trait GWASQuantitative Trait GWASin Sardiniain Sardinia
6,148 Sardinians from 4 towns in 6,148 Sardinians from 4 towns in OgliastraOgliastraMany close relationships among sampled individuals Many close relationships among sampled individuals
Measured 98 aging related quantitative traitsMeasured 98 aging related quantitative traits
Genotyping:Genotyping:10,000 SNPs measured in ~4,500 individuals 10,000 SNPs measured in ~4,500 individuals 500,000 SNPs measured in ~1,400 individuals500,000 SNPs measured in ~1,400 individuals
Geno
Genomic PositionGenomic Position
LOD
sco
re
with
out i
mpu
tatio
n
with
impu
tatio
n
Li
nkag
e
An Example Where We Know The AnswerAn Example Where We Know The Answer
In In SilicoSilico Genotyping For Genotyping For Case Control SamplesCase Control Samples
In families, we expected relatively long stretches of In families, we expected relatively long stretches of shared chromosomeshared chromosome
In unrelated individuals, these stretches will typically be In unrelated individuals, these stretches will typically be much shortermuch shorter
Nevertheless, it may still be possible to identify stretches Nevertheless, it may still be possible to identify stretches of shared chromosome …of shared chromosome …
… and by comparing shared stretches between densely … and by comparing shared stretches between densely genotyped individuals and those with sparser datagenotyped individuals and those with sparser data
Observed GenotypesObserved Genotypes
Observed Genotypes
. . . . A . . . . . . . A . . . . A . . .
. . . . G . . . . . . . C . . . . A . . .
Reference Haplotypes
C G A G A T C T C C T T C T T C T G T G CC G A G A T C T C C C G A C C T C A T G GC C A A G C T C T T T T C T T C T G T G CC G A A G C T C T T T T C T T C T G T G CC G A G A C T C T C C G A C C T T A T G CT G G G A T C T C C C G A C C T C A T G GC G A G A T C T C C C G A C C T T G T G CC G A G A C T C T T T T C T T T T G T A CC G A G A C T C T C C G A C C T C G T G CC G A A G C T C T T T T C T T C T G T G C
Study Sample
HapMap
Identify Match Among ReferenceIdentify Match Among Reference
Observed Genotypes
. . . . A . . . . . . . A . . . . A . . .
. . . . G . . . . . . . C . . . . A . . .
Reference Haplotypes
C G A G A T C T C C T T C T T C T G T G CC G A G A T C T C C C G A C C T C A T G GC C A A G C T C T T T T C T T C T G T G CC G A A G C T C T T T T C T T C T G T G CC G A G A C T C T C C G A C C T T A T G CT G G G A T C T C C C G A C C T C A T G GC G A G A T C T C C C G A C C T T G T G CC G A G A C T C T T T T C T T T T G T A CC G A G A C T C T C C G A C C T C G T G CC G A A G C T C T T T T C T T C T G T G C
Phase Chromosome, Phase Chromosome, Impute Missing GenotypesImpute Missing Genotypes
Observed Genotypes
c g a g A t c t c c c g A c c t c A t g gc g a a G c t c t t t t C t t t c A t g g
Reference Haplotypes
C G A G A T C T C C T T C T T C T G T G CC G A G A T C T C C C G A C C T C A T G GC C A A G C T C T T T T C T T C T G T G CC G A A G C T C T T T T C T T C T G T G CC G A G A C T C T C C G A C C T T A T G CT G G G A T C T C C C G A C C T C A T G GC G A G A T C T C C C G A C C T T G T G CC G A G A C T C T T T T C T T T T G T A CC G A G A C T C T C C G A C C T C G T G CC G A A G C T C T T T T C T T C T G T G C
ImplementationImplementationMarkov model is used to model each haplotype, Markov model is used to model each haplotype, conditional on all othersconditional on all others
Gibbs sampler is used to estimate parameters Gibbs sampler is used to estimate parameters and update haplotypesand update haplotypes
Each individual is updated conditional on all othersEach individual is updated conditional on all othersIn parallel to updating haplotypes, estimate “error In parallel to updating haplotypes, estimate “error rates” and “crossover” probabilitiesrates” and “crossover” probabilities
In theory, this should be very close to the Li and In theory, this should be very close to the Li and Stephens (2003) modelStephens (2003) model
Does This Actually Work?Does This Actually Work?Preliminary ResultsPreliminary Results
Used 11 tag SNPs to Used 11 tag SNPs to predict 84 SNPs in CFHpredict 84 SNPs in CFH
Predicted genotypes differ Predicted genotypes differ from original ~1.8% of the from original ~1.8% of the timetime
Reasonably similar results Reasonably similar results possible using methods, possible using methods, such as, PHASE and such as, PHASE and fastPHASEfastPHASE
0 50 100 150 200
050
100
150
200
Chi-Square Test Statistic for Disease-Marker Association
Imputed Data
Expe
rimen
tal D
ata
Comparison of Test Statistics,Truth vs. Imputed
Does This Really Work?Does This Really Work?Used about ~300,000 SNPs from Illumina HumanHap300 Used about ~300,000 SNPs from Illumina HumanHap300 to impute 2.1M HapMap SNPs in 2500 individuals from a to impute 2.1M HapMap SNPs in 2500 individuals from a study of type II diabetes (Scott et al, Science, 2007)study of type II diabetes (Scott et al, Science, 2007)
Compared imputed genotypes with actual experimental Compared imputed genotypes with actual experimental genotypes in a candidate region on chromosome 14genotypes in a candidate region on chromosome 14
1190 individuals, 521 markers not on Illumina chip1190 individuals, 521 markers not on Illumina chip
Results of comparisonResults of comparisonAverage rAverage r22 with true genotypes 0.92 (median 0.97)with true genotypes 0.92 (median 0.97)1.4% of imputed alleles mismatch original1.4% of imputed alleles mismatch original2.8% of imputed genotypes mismatch2.8% of imputed genotypes mismatchMost errors concentrated on worst 3% of SNPsMost errors concentrated on worst 3% of SNPs
Genomic PositionGenomic Position
Back to Sardinia G6PD Activity Example …Back to Sardinia G6PD Activity Example …
After imputing HapMap SNPs a region on chromosome 1 becomes top hit after G6PD and HBB
The new hit is upstream of 6PGD
6-phosphogluconate dehydrogenase is an enzyme that is known to metabolize some of the same substrates as G6PD
Combined Lipid ScansCombined Lipid ScansSardiNIASardiNIA (Schlessinger, (Schlessinger, UdaUda, et al.), et al.)
~4,300 individuals, cohort~4,300 individuals, cohort
FUSION (Mohlke, Boehnke, Collins, et al.)FUSION (Mohlke, Boehnke, Collins, et al.)~2,500 individuals~2,500 individuals
DGI (Kathiresan, DGI (Kathiresan, AltshulerAltshuler, , OrhoOrho--MellanderMellander, et al.), et al.)~3,000 individuals~3,000 individuals
Individually, 1Individually, 1--3 hits/scan, mostly known loci3 hits/scan, mostly known loci
Analysis:Analysis:Impute genotypes so that all scans are analyzed at the same “SNPImpute genotypes so that all scans are analyzed at the same “SNPs”s”Carry out metaCarry out meta--analysis of results across scansanalysis of results across scans
Combined Lipid Scan Results Combined Lipid Scan Results
LDLLDL--C association near LDLRC association near LDLR
SNPs typedby all 3 groups(44,998)
Affy panel SNPs(320,681)
Imputed SNPs(~ 2.25 million)
Does This Work Across Populations?Does This Work Across Populations?
Conrad et al. (2006) datasetConrad et al. (2006) dataset
52 regions, each ~330 kb52 regions, each ~330 kb
Human Genome Diversity PanelHuman Genome Diversity Panel~927 individuals, 52 populations~927 individuals, 52 populations
1864 SNPs1864 SNPsGrid of 872 SNPs used as tagsGrid of 872 SNPs used as tagsPredicted genotypes for the other 992 SNPsPredicted genotypes for the other 992 SNPsCompared predictions to actual genotypesCompared predictions to actual genotypes
Tag SNP Portability
(Evaluation Using ~1 SNP per 10kb in 52 x 300kb regions For Imputation)
Comparison With ImputeComparison With ImputeWe compared our results with IMPUTE across all the We compared our results with IMPUTE across all the HGDP populationsHGDP populations
We found that:We found that:
Genotypes imputed by MACH were more concordant Genotypes imputed by MACH were more concordant with original genotypes in 29/52 populationswith original genotypes in 29/52 populations
Genotypes imputed by IMPUTE were more concordant Genotypes imputed by IMPUTE were more concordant with original genotypes in 7/52 populationswith original genotypes in 7/52 populations
Overall, the two methods are more concordant with each Overall, the two methods are more concordant with each other than with the real dataother than with the real data
AcknowledgementsAcknowledgementsSardinia Collaborators, led by:Sardinia Collaborators, led by:
David Schlessinger, Antonio Cao, Manuela David Schlessinger, Antonio Cao, Manuela UdaUda, Ed , Ed LakattaLakatta, , Paul CostaPaul CostaAnalysis by Serena Analysis by Serena SannaSanna, Paul Scheet, , Paul Scheet, WeiminWeimin ChenChen
FUSION Investigators, led by:FUSION Investigators, led by:Mike Boehnke, Francis Collins, Karen Mohlke, Jaakko Mike Boehnke, Francis Collins, Karen Mohlke, Jaakko Tuomilehto, Richard BergmanTuomilehto, Richard BergmanAnalysis by Analysis by CristenCristen Willer and Willer and YunYun LiLi
DGI Investigators:DGI Investigators:Sekar Kathiresan, David Sekar Kathiresan, David AltshulerAltshuler and colleaguesand colleagues
MaCHMaCH DevelopmentDevelopmentYunYun Li, Paul Scheet, Jun DingLi, Paul Scheet, Jun Ding
[email protected]@umich.edu