HMM structure:


ISBRA 2008

Genotype Error Detection and Imputation using Hidden Markov Models of Haplotype Diversity

J. Kennedy, I. Mandoiu and B. Pasaniuc

Computer Science & Engineering Dept., University of Connecticut

Methods

HMM Parameter Estimation

Previous works use EM training based on unrelated genotype data.

To exploit available pedigree information, we use a 2-step algorithm:

1. Infer haplotypes using the pedigree-aware ENT algorithm based on entropy minimization [GPM’08]

2. Train the HMM based on the inferred haplotypes, using the Baum-Welch algorithm

Genotype Imputation

$G_{g_i \leftarrow x}$ denotes the multi-locus genotype obtained from G by replacing the i-th SNP genotype with x, where x is either 0, 1 or 2.

For a missing (un-called or un-typed) genotype $g_i$, imputation is done by replacing $g_i$ with

$$\operatorname{argmax}_{x \in \{0,1,2\}} P(G_{g_i \leftarrow x} \mid M)$$

Genotype Error Detection and Correction

Genotype $g_i$ is flagged as a potential error whenever the log-likelihood ratio $LLR(g_i)$, computed as below, exceeds a global or locus-specific detection threshold (we used 1.6 in our experiments):

$$LLR(g_i) = \log \frac{\max_{x \in \{0,1,2\}} P(G_{g_i \leftarrow x} \mid M)}{P(G \mid M)}$$

For trio data, LLR is computed as the minimum over log-likelihood ratios computed over the entire trio, parent-child duos, and unrelated individuals [KMP’07], to reduce false positives caused by errors in related individuals.

Error Detection and Imputation Flows:

IMP: un-typed SNP genotypes were imputed using the original genotype data

MDR+IMP: first, missing genotypes at typed SNPs were recovered; then the complete data was used for imputation of un-typed SNP genotypes

EDC+MDR+IMP: first, erroneous genotypes were detected and corrected; second, corrected genotypes were used to recover missing genotypes at typed SNPs; then this data was used to impute genotypes for un-typed SNPs
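The imputation rule and the LLR test above can be sketched in a few lines of Python. This is a minimal illustration, not the GEDI implementation: the function names, the use of the natural logarithm, and the toy per-locus likelihood standing in for P(G|M) are all assumptions made for this example. Because the toy model treats loci independently, it only shows the mechanics of the test.

```python
import math
from typing import Callable, List, Tuple

Genotype = Tuple[int, ...]                # one code per SNP: 0, 1 (homozygous) or 2 (heterozygous)
Likelihood = Callable[[Genotype], float]  # stand-in for P(G | M)


def replace_at(G: Genotype, i: int, x: int) -> Genotype:
    """Return G with the i-th SNP genotype replaced by x."""
    return tuple(G[:i]) + (x,) + tuple(G[i + 1:])


def impute(G: Genotype, i: int, likelihood: Likelihood) -> int:
    """Pick the value x in {0, 1, 2} that maximizes the likelihood of G with g_i := x."""
    return max((0, 1, 2), key=lambda x: likelihood(replace_at(G, i, x)))


def llr(G: Genotype, i: int, likelihood: Likelihood) -> float:
    """Log-likelihood ratio of the best replacement at locus i versus the observed G.

    Assumption: natural logarithm; the source does not state the base.
    """
    best = max(likelihood(replace_at(G, i, x)) for x in (0, 1, 2))
    return math.log(best / likelihood(G))


def flag_errors(G: Genotype, likelihood: Likelihood, threshold: float = 1.6) -> List[int]:
    """Indices whose LLR exceeds the detection threshold (1.6 as in the experiments)."""
    return [i for i in range(len(G)) if llr(G, i, likelihood) > threshold]


# Hypothetical toy likelihood: independent per-locus genotype probabilities.
LOCUS_PROBS = [
    {0: 0.90, 1: 0.05, 2: 0.05},
    {0: 0.90, 1: 0.01, 2: 0.09},
    {0: 0.30, 1: 0.30, 2: 0.40},
]


def toy_likelihood(G: Genotype) -> float:
    p = 1.0
    for probs, g in zip(LOCUS_PROBS, G):
        p *= probs[g]
    return p


observed = (0, 1, 0)
flagged = flag_errors(observed, toy_likelihood)       # loci that look like errors
corrected_value = impute(observed, 1, toy_likelihood)  # replacement for locus 1
```

In a real run the `likelihood` argument would be the HMM-based P(G|M); the same `impute` call serves both missing data recovery and error correction.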

Introduction

Background and Motivation

Single Nucleotide Polymorphisms (SNPs): Positions in the genome at which two of the possible four nucleotides occur in a large percentage of the population (~12 million cataloged to date)

Haplotype: describes the combination of alleles present on one of the homologous chromosomes; alleles are usually denoted as 0 (major) and 1 (minor)

Genotype: combination of SNP alleles present on the two chromosomes: 0/1 means both chromosomes contain the major/minor allele; 2 means the chromosomes contain different alleles

Genome-wide association studies (GWAS): Methodology for mapping disease-associated genes by typing a dense set of markers in large numbers of cases (individuals affected by the disease) and controls (unaffected individuals), followed by a statistical test of association

Enabled by recent advances in SNP genotyping technologies

Higher statistical power compared to other gene mapping methods, such as linkage, for uncovering the genetic basis of complex diseases

GWAS computational challenges:

Genotype error detection: even low levels of genotyping errors decrease statistical power and can invalidate some statistical tests for disease association based on haplotypes

Imputation of un-called genotypes: genotypes at some SNPs are left un-called due to uncertainties in low-level probe intensity data, but many analyses require complete data

Imputation of un-typed genotypes: current genotyping platforms still have limited coverage (500k-1M SNPs) and are thus unlikely to include causal SNPs; imputation of un-typed SNPs has emerged as a powerful technique for increasing GWAS power

HMM of Haplotype Diversity

HMM structure for K=4 founders and 5 SNP loci

Experimental Setup

WTCCC Dataset: Genotype data of the 1958 birth cohort of the Wellcome Trust Case Control Consortium genome-wide association study

1,444 individuals genotyped using the Affymetrix 500k platform

We inserted 1% errors, set 1% of the genotype calls as missing and masked 1% of the SNPs as un-typed

ADHD Dataset: Dataset from the Genetic Association Information Network (GAIN) study on attention deficit hyperactivity disorder (ADHD)

958 mother-father-child trios genotyped using the Perlegen 500k platform

We inserted 1% errors, set 1% of the genotype calls as missing and masked 1% of the SNPs as un-typed

HapMap Dataset: CEU panel consisting of Utah residents with European ancestry

30 mother-father-child trio families genotyped using both the Affymetrix 500k and the Affymetrix 6.0 platforms

Affymetrix 500k genotypes were used to impute un-called genotypes and genotypes of SNPs on the Affymetrix 6.0 platform not covered by the Affymetrix 500k

Actual Affymetrix 6.0 genotypes were assumed to be correct when estimating imputation and missing data recovery accuracy; in particular, whenever Affymetrix 500k and 6.0 calls disagreed, the 6.0 call was assumed to be correct

Accuracy measures

Error Detection True Positive (TP) Rate is the percentage of the inserted genotype errors that get flagged.

Error Detection False Positive (FP) Rate is computed as the percentage of correct genotype calls that get erroneously flagged.

Error Correction Accuracy is measured as the percentage of flagged errors that get corrected to the original value.

The IMP and MDR Error Rates are measured as the percentage of erroneously recovered genotypes out of the total number of masked genotypes.
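The four measures above translate directly into set arithmetic. The sketch below is one possible reading, with hypothetical function names and a made-up ten-call example; it assumes genotype calls are keyed by a position identifier and that "flagged errors" means inserted errors that were flagged.

```python
from typing import Dict, Hashable, Set, Tuple


def detection_rates(inserted: Set[Hashable], flagged: Set[Hashable],
                    total_calls: int) -> Tuple[float, float]:
    """Return (TP rate, FP rate) in percent.

    TP rate: share of inserted errors that were flagged.
    FP rate: share of correct calls (total minus inserted errors) that were flagged.
    """
    tp_rate = 100.0 * len(inserted & flagged) / len(inserted)
    fp_rate = 100.0 * len(flagged - inserted) / (total_calls - len(inserted))
    return tp_rate, fp_rate


def correction_accuracy(inserted: Set[Hashable], flagged: Set[Hashable],
                        corrected: Dict[Hashable, int],
                        original: Dict[Hashable, int]) -> float:
    """Percent of flagged true errors whose corrected value equals the original call."""
    flagged_errors = inserted & flagged
    fixed = sum(1 for pos in flagged_errors if corrected[pos] == original[pos])
    return 100.0 * fixed / len(flagged_errors)


def recovery_error_rate(masked: Set[Hashable], recovered: Dict[Hashable, int],
                        original: Dict[Hashable, int]) -> float:
    """Percent of masked genotypes that were recovered (or imputed) incorrectly."""
    wrong = sum(1 for pos in masked if recovered[pos] != original[pos])
    return 100.0 * wrong / len(masked)


# Made-up example: ten calls, errors inserted at positions 1 and 2.
original = {pos: 0 for pos in range(10)}
inserted = {1, 2}
flagged = {1, 5}                      # one true error found, one false alarm
corrected = dict(original)            # the flagged error at 1 is restored to 0
tp_rate, fp_rate = detection_rates(inserted, flagged, total_calls=10)
ec_accuracy = correction_accuracy(inserted, flagged, corrected, original)

masked = {3, 4, 7, 8}
recovered = {3: 0, 4: 2, 7: 0, 8: 0}  # position 4 recovered incorrectly
mdr_error = recovery_error_rate(masked, recovered, original)
```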

Experimental Results

Estimates of the allele 0 frequencies based on imputed versus true genotypes

Accuracy and missing data rate for imputed genotypes for different confidence thresholds. The solid line shows the discordance between imputed genotypes and original genotype calls, while the dashed line shows the missing data rate

Error Detection (ED), Error Correction (EC), Missing Data Recovery (MDR) and Imputation (IMP) accuracy results obtained on the WTCCC, ADHD, and HapMap datasets

[Figure panels: WTCCC, ADHD, HapMap. x-axis: Calling threshold (0.5 to 0.95); left y-axis: Percentage missing genotypes (0% to 100%); right y-axis: Percentage discordance (0% to 14%)]
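Each point on the threshold curves pairs a missing data rate with a discordance rate. A minimal sketch of how one such point could be computed is given below; the function name and the sample data are hypothetical, and it assumes that genotypes imputed with confidence below the calling threshold are left un-called and that discordance is measured among the genotypes that remain called.

```python
from typing import List, Tuple

# (imputed genotype, confidence of the imputed call, original genotype call)
ImputedCall = Tuple[int, float, int]


def threshold_curve_point(calls: List[ImputedCall], threshold: float) -> Tuple[float, float]:
    """Return (percentage missing, percentage discordance) at one calling threshold."""
    kept = [(imputed, truth) for imputed, conf, truth in calls if conf >= threshold]
    missing = 100.0 * (len(calls) - len(kept)) / len(calls)
    if not kept:
        return missing, 0.0  # nothing called, so no discordance can be measured
    mismatches = sum(1 for imputed, truth in kept if imputed != truth)
    return missing, 100.0 * mismatches / len(kept)


# Made-up sample: four imputed genotypes with their confidences.
sample = [(0, 0.99, 0), (1, 0.60, 2), (2, 0.96, 2), (0, 0.50, 0)]
strict = threshold_curve_point(sample, 0.95)   # only the two confident calls survive
lenient = threshold_curve_point(sample, 0.50)  # every genotype is called
```

Sweeping `threshold` over the x-axis values traces both curves: raising it trades more missing genotypes for lower discordance.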

Conclusions

All proposed flows have been implemented in a software tool called GEDI (for Genotype Error Detection and Imputation); the open-source code will be released at http://dna.engr.uconn.edu

A significant percentage of errors are detected with very low false positive rate (TP rate for HapMap dataset is under-estimated since some of the discordances are caused by errors in Affymetrix 6.0 genotypes)

Over 97% of detected genotype errors are accurately corrected

Un-called genotypes are recovered with high accuracy, but accuracy seems to be sensitive to dataset-specific missing data patterns

Imputation of un-typed genotypes has high accuracy (less than 2% discordance for genotypes imputed with > 0.95 confidence), and imputed allele frequencies match the observed frequencies well

HapMap haplotype frequencies transfer well to related populations for imputation of un-typed variation; however, EDC and MDR benefit from training the HMM based on haplotypes inferred from the population under study

Performing error detection and missing data recovery increases imputation accuracy for all 3 datasets, showing the advantages of the combined approach (EDC+MDR+IMP)

References

1. L.E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41:164-171, 1970.

2. The Wellcome Trust Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447:661-678, 2007.

3. A. Gusev, B. Pasaniuc, and I.I. Mandoiu. Highly scalable genotype phasing by entropy minimization. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5:252-261, 2008.

4. J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype error detection using Hidden Markov models of haplotype diversity. In Proc. 7th Workshop on Algorithms in Bioinformatics, LNCS, 73-84, 2007.

5. G. Kimmel and R. Shamir. A block-free hidden Markov model for genotypes and its application to disease association. Journal of Computational Biology, 12:1243-1260, 2005.

6. Y. Li and G.R. Abecasis. Mach 1.0: Rapid haplotype reconstruction and missing genotype inference. American Journal of Human Genetics, 2290, 2006.

7. J. Marchini, B. Howie, S. Myers, G. McVean, and P. Donnelly. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet., 39:906-913, 2007.

8. P. Rastas, M. Koivisto, H. Mannila, and E. Ukkonen. Phasing genotypes using a hidden Markov model. In I.I. Mandoiu and A. Zelikovsky, editors, Bioinformatics Algorithms: Techniques and Applications, 355-372. Wiley, 2008.

9. P. Scheet and M. Stephens. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. American Journal of Human Genetics, 78:629-644, 2006.

10. R. Schwartz. Algorithms for association study design using a generalized model of haplotype conservation. In Proc. CSB, pages 90-97, 2004.

11. X. Wen and D.L. Nicolae. Association studies for untyped markers with tuna. Bioinformatics, 24:435-437, 2008.

HMM structure:

Left-to-right HMM similar to models proposed by [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05]

Determined by the number n of SNP loci and a user-specified number K of “founder” states at each SNP (set to 7 in our experiments)

Each state is allowed to emit both alleles, but training usually introduces a strong bias towards one of them

Paths with high transition probability correspond to “founder” haplotypes; transition probabilities capture observed (founder-specific) recombination rates

Efficient Likelihood Computations:

A trained HMM M emits haplotypes along left-to-right paths

P(H|M) = sum, over all possible HMM paths, of the joint probability that M follows the path and emits H; efficiently computed in O(nK) time using the forward algorithm

P(G|M) = probability with which M emits any two haplotypes that explain G along any pair of paths; efficiently computed in $O(nK^3)$ time by a 2-path extension of the forward algorithm combined with the speed-up idea of [Rastas07]

A similar speed-up can be used for computing the likelihood of genotype trios in $O(nK^5)$ time
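As a rough illustration of the forward computation of P(H|M), the sketch below runs the textbook forward algorithm on a left-to-right HMM with K states per locus. The data layout (`init`, `trans`, `emit`), the function name, and the two-locus toy parameters are assumptions made for this example. This plain version does K² work per locus and does not attempt to reproduce the running-time bounds quoted above.

```python
from itertools import product
from typing import List, Sequence

# Assumed layout for a left-to-right HMM with K founder states per locus:
#   init[q]        probability of starting in state q at locus 0
#   trans[j][p][q] probability of moving from state p at locus j to state q at locus j+1
#   emit[j][q][a]  probability that state q at locus j emits allele a (0 or 1)


def haplotype_likelihood(H: Sequence[int], init: List[float],
                         trans: List[List[List[float]]],
                         emit: List[List[List[float]]]) -> float:
    """Forward algorithm: sum over all left-to-right paths of P(path, H)."""
    K = len(init)
    # f[q] = probability of emitting H[0..j] and being in state q at locus j
    f = [init[q] * emit[0][q][H[0]] for q in range(K)]
    for j in range(1, len(H)):
        f = [emit[j][q][H[j]] * sum(f[p] * trans[j - 1][p][q] for p in range(K))
             for q in range(K)]
    return sum(f)


# Hypothetical toy model with K = 2 founders and n = 2 loci.
toy_init = [0.5, 0.5]
toy_trans = [[[0.9, 0.1], [0.2, 0.8]]]
toy_emit = [[[0.95, 0.05], [0.10, 0.90]],
            [[0.95, 0.05], [0.10, 0.90]]]

p_00 = haplotype_likelihood((0, 0), toy_init, toy_trans, toy_emit)
# Sanity check: the likelihoods of all 2^n haplotypes should sum to 1.
total = sum(haplotype_likelihood(H, toy_init, toy_trans, toy_emit)
            for H in product((0, 1), repeat=2))
```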

… ataggtccCtatttcgcgcCgtatacacgggActata …

… ataggtccGtatttcgcgcCgtatacacgggTctata …

… ataggtccCtatttcgcgcCgtatacacgggTctata …

genotype: 021200210

two haplotypes per individual: 011000110 and 001100010
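The genotype string in this figure follows from its two haplotypes by the 0/1/2 encoding defined in the introduction: matching alleles keep their value, differing alleles give 2. A small sketch (the function name is chosen for this example) that reproduces the figure's values:

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string.

    Same allele on both chromosomes -> that allele (0 or 1); different alleles -> 2.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP loci")
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# The two haplotypes and the genotype shown in the figure.
figure_genotype = genotype_from_haplotypes("011000110", "001100010")
```

The mapping is many-to-one: several haplotype pairs can explain the same genotype, which is why P(G|M) sums over all pairs that explain G.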

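Forward recurrence for the 2-path extension, where $E(j; q_1, q_2)$ is the probability of emitting the j-th SNP genotype from the state pair $(q_1, q_2)$ and $\gamma(q', q)$ is the transition probability from state $q'$ to state $q$:

$$f(j; q_1, q_2) = E(j; q_1, q_2) \sum_{(q_1', q_2') \in Q_{j-1}^2} f(j-1; q_1', q_2') \, \gamma(q_1', q_1) \, \gamma(q_2', q_2)$$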
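The recurrence can be evaluated directly, without the [Rastas07] speed-up, by looping over all pairs of current and previous states. The sketch below does exactly that and therefore costs on the order of nK⁴ operations; the data layout, function names, and toy parameters are assumptions made for this example, and the pair emission E is taken to be the sum over ordered allele pairs compatible with the genotype code.

```python
from itertools import product
from typing import List, Sequence

# Ordered allele pairs (one allele per haplotype) compatible with each genotype code.
COMPATIBLE = {0: [(0, 0)], 1: [(1, 1)], 2: [(0, 1), (1, 0)]}


def pair_emission(emit_j: List[List[float]], q1: int, q2: int, g: int) -> float:
    """E(j; q1, q2): probability that states q1 and q2 jointly emit genotype g."""
    return sum(emit_j[q1][a] * emit_j[q2][b] for a, b in COMPATIBLE[g])


def genotype_likelihood(G: Sequence[int], init: List[float],
                        trans: List[List[List[float]]],
                        emit: List[List[List[float]]]) -> float:
    """Naive 2-path forward algorithm for P(G | M)."""
    states = range(len(init))
    # f[q1][q2] = probability of emitting G[0..j] with the two paths in q1 and q2
    f = [[init[q1] * init[q2] * pair_emission(emit[0], q1, q2, G[0])
          for q2 in states] for q1 in states]
    for j in range(1, len(G)):
        t = trans[j - 1]  # transitions from locus j-1 to locus j
        f = [[pair_emission(emit[j], q1, q2, G[j])
              * sum(f[p1][p2] * t[p1][q1] * t[p2][q2]
                    for p1 in states for p2 in states)
              for q2 in states] for q1 in states]
    return sum(sum(row) for row in f)


# Hypothetical toy model with K = 2 founders and n = 2 loci.
toy_init = [0.5, 0.5]
toy_trans = [[[0.9, 0.1], [0.2, 0.8]]]
toy_emit = [[[0.95, 0.05], [0.10, 0.90]],
            [[0.95, 0.05], [0.10, 0.90]]]

p_g00 = genotype_likelihood((0, 0), toy_init, toy_trans, toy_emit)
# Sanity check: the likelihoods of all 3^n genotypes should sum to 1.
total_g = sum(genotype_likelihood(G, toy_init, toy_trans, toy_emit)
              for G in product((0, 1, 2), repeat=2))
```

Since the all-zero genotype is explained only by two copies of the all-zero haplotype, `p_g00` equals the square of that haplotype's likelihood under the same toy model.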

 

Dataset  Flow                        ED TP rate  ED FP rate  EC Accuracy  MDR Error Rate  IMP Error Rate
WTCCC    IMP                         -           -           -            -               6.63%
WTCCC    MDR+IMP (HapMap haps)       -           -           -            11.78%          6.63%
WTCCC    MDR+IMP (ENT haps)          -           -           -            10.98%          6.63%
WTCCC    EDC+MDR+IMP (HapMap haps)   79.54%      0.87%       96.58%       11.98%          6.90%
WTCCC    EDC+MDR+IMP (ENT haps)      72.08%      0.21%       97.16%       10.89%          6.49%
ADHD     IMP                         -           -           -            -               9.16%
ADHD     MDR+IMP (HapMap haps)       -           -           -            6.14%           8.91%
ADHD     MDR+IMP (ENT haps)          -           -           -            5.21%           8.88%
ADHD     EDC+MDR+IMP (HapMap haps)   61.55%      0.39%       97.85%       5.98%           8.89%
ADHD     EDC+MDR+IMP (ENT haps)      52.62%      0.07%       98.39%       4.58%           8.74%
HapMap   IMP                         -           -           -            -               8.89%
HapMap   MDR+IMP (HapMap haps)       -           -           -            23.74%          8.76%
HapMap   MDR+IMP (ENT haps)          -           -           -            23.76%          8.80%
HapMap   EDC+MDR+IMP (HapMap haps)   40.43%      0.03%       99.40%       23.04%          8.73%
HapMap   EDC+MDR+IMP (ENT haps)      6.10%       0.03%       100.00%      25.21%          8.84%