HMM structure:

ISBRA 2008

Genotype Error Detection and Imputation using Hidden Markov Models of Haplotype Diversity

J. Kennedy, I. Mandoiu and B. Pasaniuc

Computer Science & Engineering Dept., University of Connecticut

Methods

HMM Parameter Estimation

Previous works use EM training based on unrelated genotype data

To exploit available pedigree info, we use a 2-step algorithm:

1. Infer haplotypes using pedigree-aware ENT algorithm based on entropy-minimization [GPM’08]

2. Train HMM based on inferred haplotypes, using Baum-Welch algorithm

Genotype Imputation

G_{g_i \leftarrow x} denotes the multi-locus genotype obtained from G by replacing the i-th SNP genotype with x, where x is 0, 1, or 2

For a missing (un-called or un-typed) genotype g_i, imputation is done by replacing g_i with the value of x that maximizes P(G_{g_i \leftarrow x} \mid M)
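The imputation rule can be sketched in a few lines. This is only an illustration: `toy_likelihood` and the haplotype frequencies in `HAP_FREQ` are made-up stand-ins for the trained HMM likelihood P(G|M), and the function names are hypothetical.

```python
from typing import Callable, List, Optional

Genotype = List[Optional[int]]  # per-SNP code 0, 1 or 2; None marks a missing call

# Made-up haplotype frequencies standing in for a trained model M (assumption).
HAP_FREQ = {"000": 0.5, "110": 0.3, "011": 0.2}


def toy_likelihood(genotype: Genotype) -> float:
    """Stand-in for P(G|M): total probability of ordered haplotype pairs explaining G."""
    total = 0.0
    for h1, p1 in HAP_FREQ.items():
        for h2, p2 in HAP_FREQ.items():
            pair = [int(a) if a == b else 2 for a, b in zip(h1, h2)]
            # A missing entry (None) is compatible with any genotype value.
            if all(g is None or g == x for g, x in zip(genotype, pair)):
                total += p1 * p2
    return total


def impute(genotype: Genotype, i: int,
           likelihood: Callable[[Genotype], float]) -> int:
    """Return the x in {0, 1, 2} maximising likelihood(G with g_i replaced by x)."""
    def with_x(x: int) -> Genotype:
        return genotype[:i] + [x] + genotype[i + 1:]
    return max((0, 1, 2), key=lambda x: likelihood(with_x(x)))


print(impute([2, None, 0], 1, toy_likelihood))  # 2
```

In this toy model only the haplotype pair 000/110 explains a heterozygous first SNP together with a homozygous-major third SNP, so the missing middle genotype is imputed as heterozygous.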

Genotype Error Detection and Correction

Genotype g_i is flagged as a potential error whenever the log-likelihood ratio LLR(g_i), the logarithm of the ratio of \max_{x \in \{0,1,2\}} P(G_{g_i \leftarrow x} \mid M) to P(G \mid M), exceeds a global or locus-specific detection threshold (we used 1.6 in our experiments)

For trio data LLR is computed as the minimum over log-likelihood ratios computed over the entire trio, parent-child duos, and unrelated individuals [KMP’07] to reduce false positives caused by errors in related individuals
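A minimal sketch of the flagging rule, under stated assumptions: `toy_likelihood`, `HAP_FREQ` and `EPS` are invented stand-ins for the trained HMM, the natural logarithm is used because the base of the log is not stated here, and `is_flagged` simply takes the minimum over whatever LLR values (trio, duo, unrelated) the caller supplies.

```python
import math
from typing import Callable, List, Sequence

# Made-up haplotype frequencies and per-locus error rate (assumptions, not trained values).
HAP_FREQ = {"000": 0.6, "110": 0.3, "011": 0.1}
EPS = 0.02


def toy_likelihood(genotype: List[int]) -> float:
    """Stand-in for P(G|M); every genotype gets positive probability."""
    total = 0.0
    for h1, p1 in HAP_FREQ.items():
        for h2, p2 in HAP_FREQ.items():
            prob = p1 * p2
            for g, a, b in zip(genotype, h1, h2):
                true_g = int(a) if a == b else 2
                prob *= (1 - EPS) if g == true_g else EPS / 2
            total += prob
    return total


def llr(genotype: List[int], i: int,
        likelihood: Callable[[List[int]], float]) -> float:
    """log( max_x P(G with g_i := x | M) / P(G | M) ); always >= 0."""
    best = max(likelihood(genotype[:i] + [x] + genotype[i + 1:]) for x in (0, 1, 2))
    return math.log(best / likelihood(genotype))


def is_flagged(llrs: Sequence[float], threshold: float = 1.6) -> bool:
    """Flag only if the smallest of the supplied LLR values exceeds the threshold."""
    return min(llrs) > threshold


score = llr([2, 1, 0], 1, toy_likelihood)
print(is_flagged([score]))  # True
```

Taking the minimum means a genotype is flagged only when every model considered agrees that an alternative call is much more likely, which is the conservative behaviour the trio rule aims for.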

Error Detection and Imputation Flows:

IMP: un-typed SNP genotypes were imputed using the original genotype data

MDR+IMP: first, missing genotypes at typed SNPs were recovered; then the complete data was used for imputation of un-typed SNP genotypes

EDC+MDR+IMP: first, erroneous genotypes were detected and corrected; second, the corrected genotypes were used to recover missing genotypes at typed SNPs; third, this data was used to impute genotypes for un-typed SNPs
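The three flows differ only in which stages run before imputation. A small sketch of that composition follows; `build_flow` and `tracer` are hypothetical names, and the stage functions are placeholders that only record their order, not implementations of EDC, MDR or IMP.

```python
from typing import Callable, Dict, List

Step = Callable[[dict], dict]


def build_flow(name: str, edc: Step, mdr: Step, imp: Step) -> Step:
    """Chain the stages in the order given by the flow name."""
    orders: Dict[str, List[Step]] = {
        "IMP": [imp],
        "MDR+IMP": [mdr, imp],
        "EDC+MDR+IMP": [edc, mdr, imp],
    }
    steps = orders[name]

    def run(data: dict) -> dict:
        for step in steps:
            data = step(data)  # each stage consumes the previous stage's output
        return data
    return run


def tracer(label: str) -> Step:
    """Placeholder stage that only records that it ran."""
    def step(data: dict) -> dict:
        return {**data, "log": data.get("log", []) + [label]}
    return step


flow = build_flow("EDC+MDR+IMP", tracer("EDC"), tracer("MDR"), tracer("IMP"))
print(flow({})["log"])  # ['EDC', 'MDR', 'IMP']
```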

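LLR(g_i) = \log \frac{\max_{x \in \{0,1,2\}} P(G_{g_i \leftarrow x} \mid M)}{P(G \mid M)}

Imputed value for a missing g_i: \arg\max_{x \in \{0,1,2\}} P(G_{g_i \leftarrow x} \mid M)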

Introduction: Background and Motivation

Single Nucleotide Polymorphisms (SNPs): Positions in the genome at which two of the possible four nucleotides occur in a large percentage of the population (~12 million cataloged to date)

Haplotype: describes the combination of alleles present on one of the homologous chromosomes; alleles are usually denoted as 0 (major) and 1 (minor)

Genotype: combination of SNP alleles present on the two chromosomes: 0/1 means both chromosomes contain the major/minor allele; 2 means the two chromosomes contain different alleles
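The 0/1/2 genotype encoding follows mechanically from a pair of haplotypes. The sketch below uses hypothetical function names and two illustrative 9-SNP haplotypes.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two haplotypes (strings over '0'/'1') into a genotype string.

    0 = both chromosomes carry the major allele, 1 = both carry the minor
    allele, 2 = the two chromosomes carry different alleles.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP loci")
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def explains(h1: str, h2: str, genotype: str) -> bool:
    """True if the haplotype pair (h1, h2) is consistent with the genotype."""
    return genotype_from_haplotypes(h1, h2) == genotype


print(genotype_from_haplotypes("011000110", "001100010"))  # 021200210
```

The mapping is many-to-one: every heterozygous site doubles the number of haplotype pairs that explain a genotype, which is why phase has to be inferred statistically.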

Genome-wide association studies (GWAS): Methodology for mapping disease-associated genes by typing a dense set of markers in large numbers of cases (individuals affected by a disease) and controls (unaffected individuals), followed by a statistical test of association

Enabled by recent advances in SNP genotyping technologies

Higher statistical power compared to other gene mapping methods, such as linkage, for uncovering the genetic basis of complex diseases

GWAS computational challenges:

Genotype error detection: even low levels of genotyping errors decrease statistical power and can invalidate some statistical tests for disease association based on haplotypes

Imputation of un-called genotypes: genotypes at some SNPs are left un-called due to uncertainties in low-level probe intensity data but many analyses require complete data

Imputation of un-typed genotypes: current genotyping platforms still have limited coverage (500k-1M SNPs) and are thus unlikely to include causal SNPs; imputation of un-typed SNPs has emerged as a powerful technique for increasing GWAS power

HMM of Haplotype Diversity

HMM structure for K=4 founders and 5 SNP loci

Experimental Setup

WTCCC Dataset: Genotype data of the 1958 birth cohort of the Wellcome Trust Case Control Consortium genome-association study

1,444 individuals genotyped using the Affymetrix 500k platform

We inserted 1% errors, set 1% of the genotype calls as missing and masked 1% of the SNPs as un-typed

ADHD Dataset: Dataset from the Genetic Association Information Network (GAIN) study on attention deficit hyperactivity disorder (ADHD)

958 mother-father-child trios genotyped using the Perlegen 500k platform

We inserted 1% errors, set 1% of the genotype calls as missing and masked 1% of the SNPs as un-typed

HapMap Dataset: CEU panel consisting of Utah residents with European ancestry

30 mother-father-child trio families genotyped using both the Affymetrix 500k and the Affymetrix 6.0 platforms

Affymetrix 500k genotypes were used to impute un-called genotypes and genotypes of SNPs on the Affymetrix 6.0 platform not covered by the Affymetrix 500k

Actual Affymetrix 6.0 genotypes were assumed to be correct when estimating imputation and missing data recovery accuracy; in particular, where Affymetrix 500k and 6.0 calls disagreed, the 6.0 call was assumed to be correct
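The perturbation applied to the WTCCC and ADHD data (1% errors, 1% missing calls, 1% un-typed SNPs) could be simulated roughly as follows. How erroneous values were chosen and whether the 1% rates are taken over typed cells only is not stated, so the uniform choices below, the function name `perturb` and the `seed` parameter are assumptions.

```python
import random
from typing import List, Optional, Set, Tuple

Matrix = List[List[Optional[int]]]  # rows = individuals, columns = SNPs
Cell = Tuple[int, int]


def perturb(genotypes: Matrix, rate: float = 0.01, seed: int = 0
            ) -> Tuple[Matrix, Set[Cell], Set[Cell], Set[int]]:
    """Return (perturbed data, error cells, missing cells, un-typed SNP columns)."""
    rng = random.Random(seed)
    n_ind, n_snp = len(genotypes), len(genotypes[0])
    data = [row[:] for row in genotypes]

    # Mask a fraction of the SNPs (whole columns) as un-typed.
    untyped = set(rng.sample(range(n_snp), max(1, round(rate * n_snp))))
    typed = [(r, c) for r in range(n_ind) for c in range(n_snp) if c not in untyped]

    # Draw disjoint sets of cells for inserted errors and for missing calls.
    k = max(1, round(rate * len(typed)))
    chosen = rng.sample(typed, 2 * k)
    errors, missing = chosen[:k], chosen[k:]

    for r, c in errors:  # replace with a different genotype, chosen uniformly
        data[r][c] = rng.choice([x for x in (0, 1, 2) if x != genotypes[r][c]])
    for r, c in missing:
        data[r][c] = None
    for row in data:
        for c in untyped:
            row[c] = None
    return data, set(errors), set(missing), untyped


truth = [[0] * 100 for _ in range(100)]
noisy, errors, missing, untyped = perturb(truth)
print(len(errors), len(missing), len(untyped))  # 99 99 1
```

Keeping the sets of perturbed cells makes it possible to score detection, recovery and imputation against the unperturbed data afterwards.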

Accuracy measures

Error Detection True Positive (TP) Rate is the percentage of the inserted genotype errors that get flagged

Error Detection False Positive (FP) Rate is the percentage of correct genotype calls that get erroneously flagged

Error Correction Accuracy is the percentage of flagged errors that get corrected to the original value

The IMP and MDR Error Rates are the percentage of erroneously recovered genotypes out of the total number of masked genotypes
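These measures translate directly into a few set and dictionary operations. The sketch below uses hypothetical function names and identifies each genotype call by an (individual, SNP) pair.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]  # (individual index, SNP index)


def detection_rates(inserted: Set[Cell], flagged: Set[Cell],
                    n_calls: int) -> Tuple[float, float]:
    """Return (TP rate, FP rate) in percent.

    TP rate: share of inserted errors that were flagged.
    FP rate: share of correct calls (n_calls minus inserted errors) that were flagged.
    """
    tp_rate = 100.0 * len(inserted & flagged) / len(inserted)
    fp_rate = 100.0 * len(flagged - inserted) / (n_calls - len(inserted))
    return tp_rate, fp_rate


def correction_accuracy(corrected: Dict[Cell, int],
                        original: Dict[Cell, int]) -> float:
    """Percent of flagged errors whose corrected value equals the original genotype."""
    hits = sum(corrected[c] == original[c] for c in corrected)
    return 100.0 * hits / len(corrected)


def recovery_error_rate(recovered: Dict[Cell, int],
                        truth: Dict[Cell, int]) -> float:
    """Percent of masked genotypes (keys of truth) that were recovered incorrectly."""
    wrong = sum(recovered[c] != truth[c] for c in truth)
    return 100.0 * wrong / len(truth)


inserted = {(0, 0), (0, 1), (1, 2), (2, 3)}
flagged = {(0, 0), (0, 1), (1, 2), (3, 3)}
print(detection_rates(inserted, flagged, n_calls=104))  # (75.0, 1.0)
```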

Experimental Results

Estimates of the allele 0 frequencies based on imputed versus true genotypes
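For reference, the allele 0 frequency at a SNP follows from genotype counts: each homozygous-major genotype (0) contributes two copies of allele 0 and each heterozygous genotype (2) contributes one. A minimal sketch, with a hypothetical function name and with missing calls excluded by assumption:

```python
from typing import Optional, Sequence


def allele0_frequency(genotypes: Sequence[Optional[int]]) -> float:
    """Frequency of allele 0 among called genotypes coded 0, 1 (homozygous) or 2 (het)."""
    called = [g for g in genotypes if g is not None]
    copies_of_0 = 2 * called.count(0) + called.count(2)
    return copies_of_0 / (2 * len(called))


print(allele0_frequency([0, 0, 2, 1]))  # 0.625
```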

Accuracy and missing data rate for imputed genotypes for different confidence thresholds. The solid line shows the discordance between imputed genotypes and original genotype calls, while the dashed line shows the missing data rate
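The trade-off in these plots can be reproduced with a simple calling rule. The sketch assumes that the confidence of an imputed genotype is a probability over the three genotype values (for example, likelihoods normalised to sum to one); that definition, the function names and the example numbers are assumptions rather than details given here.

```python
from typing import List, Optional, Sequence, Tuple


def call_genotype(posterior: Sequence[float], threshold: float) -> Optional[int]:
    """Return the most probable genotype, or None if its confidence is below threshold."""
    best = max(range(3), key=lambda x: posterior[x])
    return best if posterior[best] >= threshold else None


def summarize(posteriors: List[Sequence[float]], truth: List[int],
              threshold: float) -> Tuple[float, float]:
    """Return (percent left missing, percent discordance among called genotypes)."""
    calls = [call_genotype(p, threshold) for p in posteriors]
    made = [(c, t) for c, t in zip(calls, truth) if c is not None]
    missing_rate = 100.0 * (len(calls) - len(made)) / len(calls)
    discordance = 100.0 * sum(c != t for c, t in made) / len(made) if made else 0.0
    return missing_rate, discordance


posteriors = [(0.97, 0.01, 0.02), (0.10, 0.30, 0.60),
              (0.05, 0.90, 0.05), (0.55, 0.05, 0.40)]
truth = [0, 2, 0, 2]
print(summarize(posteriors, truth, 0.5))   # (0.0, 50.0)
print(summarize(posteriors, truth, 0.95))  # (75.0, 0.0)
```

Raising the threshold leaves more genotypes un-called but lowers discordance among the calls that remain.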

Error Detection (ED), Error Correction (EC), Missing Data Recovery (MDR) and Imputation (IMP) accuracy results obtained on the WTCCC, ADHD, and HapMap datasets

[Figures: one panel per dataset (WTCCC, ADHD, HapMap). Threshold plots: x-axis, calling threshold (0.5 to 0.95); y-axes, percentage of missing genotypes (0% to 100%) and percentage discordance (0% to 14%)]

Conclusions

All proposed flows have been implemented in a software tool called GEDI (for Genotype Error Detection and Imputation); the open-source code will be released at http://dna.engr.uconn.edu

A significant percentage of errors are detected with very low false positive rate (TP rate for HapMap dataset is under-estimated since some of the discordances are caused by errors in Affymetrix 6.0 genotypes)

Over 97% of detected genotype errors are accurately corrected

Un-called genotypes are recovered with high accuracy, but accuracy seems to be sensitive to dataset specific missing data patterns

Imputation of un-typed genotypes has high accuracy (less than 2% discordance for genotypes imputed with >.95 confidence), and imputed allele frequencies match the observed frequencies well

HapMap haplotype frequencies transfer well to related populations for imputation of un-typed variation; however, EDC and MDR benefit from training the HMM based on haplotypes inferred from the population under study

Performing error detection and missing data recovery increases imputation accuracy for all 3 datasets, showing the advantages of the combined approach (EDC+MDR+IMP)

References

1. L.E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41:164-171, 1970.
2. The Wellcome Trust Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447:661-678, 2007.
3. A. Gusev, B. Pasaniuc, and I.I. Mandoiu. Highly scalable genotype phasing by entropy minimization. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5:252-261, 2008.
4. J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype error detection using Hidden Markov models of haplotype diversity. In Proc. 7th Workshop on Algorithms in Bioinformatics, LNCS, 73-84, 2007.
5. G. Kimmel and R. Shamir. A block-free hidden Markov model for genotypes and its application to disease association. Journal of Computational Biology, 12:1243-1260, 2005.
6. Y. Li and G. R. Abecasis. Mach 1.0: Rapid haplotype reconstruction and missing genotype inference. American Journal of Human Genetics, 2290, 2006.
7. J. Marchini, B. Howie, S. Myers, G. McVean, and P. Donnelly. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet., 39:906-913, 2007.
8. P. Rastas, M. Koivisto, H. Mannila, and E. Ukkonen. Phasing genotypes using a hidden Markov model. In I.I. Mandoiu and A. Zelikovsky, editors, Bioinformatics Algorithms: Techniques and Applications, 355-372. Wiley, 2008.
9. P. Scheet and M. Stephens. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. American Journal of Human Genetics, 78:629-644, 2006.
10. R. Schwartz. Algorithms for association study design using a generalized model of haplotype conservation. In Proc. CSB, pages 90-97, 2004.
11. X. Wen and D. L. Nicolae. Association studies for untyped markers with tuna. Bioinformatics, 24:435-437, 2008.

HMM structure:

Left-to-right HMM similar to models proposed by [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05]

Determined by the number n of SNP loci and a user-specified number K of “founder” states at each SNP (set to 7 in our experiments)

Each state is allowed to emit both alleles, but training usually introduces a strong bias towards one of them

Paths with high transition probability correspond to “founder” haplotypes; transition probabilities capture observed (founder-specific) recombination rates
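One way to hold such a model in memory is sketched below. The class and field names are hypothetical, the parameters are random rather than trained, and the explicit start distribution over founders is an assumption about how the first locus is handled.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class HaplotypeHMM:
    """Left-to-right HMM with K founder states at each of n SNP loci."""
    start: List[float]              # start[q]: probability of starting in founder q
    trans: List[List[List[float]]]  # trans[j][p][q]: founder p at locus j to founder q at locus j+1
    emit: List[List[float]]         # emit[j][q]: probability that state q at locus j emits allele 1


def random_hmm(n: int, K: int, rng: random.Random) -> HaplotypeHMM:
    """Build an untrained model with random, properly normalised parameters."""
    def dist(size: int) -> List[float]:
        w = [rng.random() + 1e-9 for _ in range(size)]
        total = sum(w)
        return [x / total for x in w]
    start = dist(K)
    trans = [[dist(K) for _ in range(K)] for _ in range(n - 1)]
    emit = [[rng.random() for _ in range(K)] for _ in range(n)]
    return HaplotypeHMM(start, trans, emit)


def sample_haplotype(hmm: HaplotypeHMM, rng: random.Random) -> List[int]:
    """Walk one left-to-right path and emit one allele per locus."""
    K = len(hmm.start)
    q = rng.choices(range(K), weights=hmm.start)[0]
    hap = []
    for j in range(len(hmm.emit)):
        if j > 0:
            q = rng.choices(range(K), weights=hmm.trans[j - 1][q])[0]
        hap.append(1 if rng.random() < hmm.emit[j][q] else 0)
    return hap


rng = random.Random(0)
model = random_hmm(n=5, K=4, rng=rng)
print(sample_haplotype(model, rng))
```

After training, emission probabilities close to 0 or 1 and a few dominant transitions would make most sampled paths resemble one of the K founder haplotypes, with occasional switches playing the role of recombination.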

Efficient Likelihood Computations:

A trained HMM M emits haplotypes along left-to-right paths

P(H|M) = sum, over all possible HMM paths, of the joint probability that M follows the path and emits H; efficiently computed in O(nK) time using the forward algorithm

P(G|M) = probability with which M emits any two haplotypes that explain G along any pair of paths; efficiently computed in O(nK^3) time by a 2-path extension of the forward algorithm combined with the speed-up idea of [Rastas07]

A similar speed-up can be used for computing the likelihood of genotype trios in O(nK^5) time
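A plain forward pass for P(H|M) might look like the sketch below. The parameters `START`, `TRANS` and `EMIT` are invented toy values for K=2 founders and n=3 loci, and this straightforward version sums over all K predecessor states for each state, so it makes no attempt to match the running-time bounds quoted above.

```python
from itertools import product
from typing import List, Sequence

# Toy parameters (assumptions): K = 2 founders, n = 3 SNP loci.
START = [0.6, 0.4]                       # START[q]: initial founder probabilities
TRANS = [[[0.9, 0.1], [0.2, 0.8]]] * 2   # TRANS[j][p][q]: locus j to locus j+1
EMIT = [[0.05, 0.95], [0.9, 0.1], [0.05, 0.95]]  # EMIT[j][q]: P(allele 1)


def emission(j: int, q: int, allele: int) -> float:
    """Probability that state q at locus j emits the given allele."""
    return EMIT[j][q] if allele == 1 else 1.0 - EMIT[j][q]


def haplotype_likelihood(hap: Sequence[int]) -> float:
    """P(H|M): sum over all left-to-right paths of P(path) * P(H | path)."""
    K = len(START)
    # f[q] = probability of emitting hap[0..j] and being in state q at locus j.
    f: List[float] = [START[q] * emission(0, q, hap[0]) for q in range(K)]
    for j in range(1, len(hap)):
        f = [emission(j, q, hap[j]) * sum(f[p] * TRANS[j - 1][p][q] for p in range(K))
             for q in range(K)]
    return sum(f)


# Sanity check: the likelihoods of all 2^n haplotypes sum to one.
print(round(sum(haplotype_likelihood(h) for h in product((0, 1), repeat=3)), 12))  # 1.0
```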

Example (SNP positions shown in upper case):

… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcCgtatacacgggTctata …

genotype: 021200210

two haplotypes per individual: 011000110 and 001100010

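2-path forward recurrence (\gamma denotes transition probabilities, Q_{j-1} the set of states at locus j-1):

f(j; q_1, q_2) = E(j; q_1, q_2) \sum_{(q_1', q_2') \in Q_{j-1}^2} f(j-1; q_1', q_2') \, \gamma(q_1', q_1) \, \gamma(q_2', q_2)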
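The two-path recurrence can be evaluated directly, without the speed-up, at a cost of O(nK^4). In the sketch below the toy parameters, the function names, the form of the pair emission E and the initialisation at the first locus are all assumptions made for illustration.

```python
from itertools import product
from typing import Dict, Optional, Sequence, Tuple

# Toy parameters (assumptions): K = 2 founders, n = 3 SNP loci.
START = [0.6, 0.4]
TRANS = [[[0.9, 0.1], [0.2, 0.8]]] * 2   # TRANS[j][p][q]: locus j to locus j+1
EMIT = [[0.05, 0.95], [0.9, 0.1], [0.05, 0.95]]  # EMIT[j][q]: P(allele 1)


def pair_emission(j: int, q1: int, q2: int, g: Optional[int]) -> float:
    """E(j; q1, q2): probability that states q1, q2 at locus j jointly emit genotype g."""
    if g is None:                        # missing genotype: any allele pair is compatible
        return 1.0
    a, b = EMIT[j][q1], EMIT[j][q2]      # probabilities of emitting allele 1
    if g == 0:
        return (1 - a) * (1 - b)
    if g == 1:
        return a * b
    return a * (1 - b) + (1 - a) * b     # g == 2: heterozygous


def genotype_likelihood(genotype: Sequence[Optional[int]]) -> float:
    """P(G|M) via a forward pass over pairs of states."""
    K = len(START)
    states = list(product(range(K), repeat=2))
    f: Dict[Tuple[int, int], float] = {
        (q1, q2): START[q1] * START[q2] * pair_emission(0, q1, q2, genotype[0])
        for q1, q2 in states}
    for j in range(1, len(genotype)):
        f = {(q1, q2): pair_emission(j, q1, q2, genotype[j]) * sum(
                 f[p1, p2] * TRANS[j - 1][p1][q1] * TRANS[j - 1][p2][q2]
                 for p1, p2 in states)
             for q1, q2 in states}
    return sum(f.values())


# Sanity check: the likelihoods of all 3^n genotypes sum to one.
print(round(sum(genotype_likelihood(g) for g in product((0, 1, 2), repeat=3)), 12))  # 1.0
```

Treating a missing genotype as emitting with probability one marginalises over its value, which is what lets the same routine score genotypes with un-called entries.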

Dataset  Flow                        ED TP rate  ED FP rate  EC Accuracy  MDR Error Rate  IMP Error Rate
WTCCC    IMP                         -           -           -            -               6.63%
WTCCC    MDR+IMP (HapMap haps)       -           -           -            11.78%          6.63%
WTCCC    MDR+IMP (ENT haps)          -           -           -            10.98%          6.63%
WTCCC    EDC+MDR+IMP (HapMap haps)   79.54%      0.87%       96.58%       11.98%          6.90%
WTCCC    EDC+MDR+IMP (ENT haps)      72.08%      0.21%       97.16%       10.89%          6.49%
ADHD     IMP                         -           -           -            -               9.16%
HapMap   IMP                         -           -           -            -               8.89%
