ISBRA 2008
Genotype Error Detection and Imputation using Hidden Markov Models of Haplotype Diversity
J. Kennedy, I. Mandoiu and B. Pasaniuc
Computer Science & Engineering Dept., University of Connecticut
Methods
HMM Parameter Estimation
Previous works use EM training based on unrelated genotype data.
To exploit available pedigree information, we use a 2-step algorithm:
1. Infer haplotypes using the pedigree-aware ENT algorithm, based on entropy minimization [GPM’08]
2. Train the HMM on the inferred haplotypes using the Baum-Welch algorithm
Genotype Imputation
$G_{g_i \leftarrow x}$ denotes the multi-locus genotype obtained from $G$ by replacing the i-th SNP genotype with $x$, where $x$ is 0, 1, or 2.
A missing (un-called or un-typed) genotype $g_i$ is imputed by replacing it with the value $x \in \{0,1,2\}$ that maximizes $P(G_{g_i \leftarrow x} \mid M)$.
Genotype Error Detection and Correction
Genotype $g_i$ is flagged as a potential error whenever its log-likelihood ratio $\mathrm{LLR}(g_i)$, computed as below, exceeds a global or locus-specific detection threshold (we used 1.6 in our experiments).
For trio data, the LLR is computed as the minimum over the log-likelihood ratios obtained for the entire trio, the parent-child duos, and the unrelated individuals [KMP’07], to reduce false positives caused by errors in related individuals.
Error Detection and Imputation Flows:
IMP: un-typed SNP genotypes were imputed using the original genotype data
MDR+IMP: first, missing genotypes at typed SNPs were recovered; then the complete data was used for imputation of un-typed SNP genotypes
EDC+MDR+IMP: first, erroneous genotypes were detected and corrected; second, the corrected genotypes were used to recover missing genotypes at typed SNPs; then this data was used to impute genotypes for un-typed SNPs
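Log-likelihood ratio used for error detection:
\[ \mathrm{LLR}(g_i) = \log \frac{\max_{x \in \{0,1,2\}} P(G_{g_i \leftarrow x} \mid M)}{P(G \mid M)} \]
Value imputed for a missing genotype $g_i$:
\[ \arg\max_{x \in \{0,1,2\}} P(G_{g_i \leftarrow x} \mid M) \]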
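The imputation and error-detection rules above can be sketched in a few lines of Python. This is a minimal sketch, not the GEDI implementation. The function names (`replace_at`, `impute`, `llr`, `flag_errors`) and the `likelihood` callable standing in for P(G|M) are hypothetical. The toy likelihood treats loci as independent instead of using a trained HMM, and the natural logarithm is an assumption, since the poster does not state the base.

```python
import math

def replace_at(G, i, x):
    """Return G with the i-th SNP genotype replaced by x (G_{g_i <- x})."""
    return tuple(G[:i]) + (x,) + tuple(G[i + 1:])

def impute(G, i, likelihood):
    """Value x in {0, 1, 2} maximizing likelihood(G_{g_i <- x})."""
    return max((0, 1, 2), key=lambda x: likelihood(replace_at(G, i, x)))

def llr(G, i, likelihood):
    """Log of the best achievable likelihood over the likelihood of G as called."""
    best = max(likelihood(replace_at(G, i, x)) for x in (0, 1, 2))
    return math.log(best / likelihood(G))  # natural log: an assumption

def flag_errors(G, likelihood, threshold=1.6):
    """Indices of genotypes whose LLR exceeds the detection threshold."""
    return [i for i in range(len(G)) if llr(G, i, likelihood) > threshold]

# Toy stand-in for P(G | M): independent per-locus genotype probabilities.
# A real run would plug in the HMM-based likelihood instead.
TOY_PROBS = [(0.90, 0.02, 0.08), (0.10, 0.10, 0.80), (0.30, 0.30, 0.40)]

def toy_likelihood(G):
    return math.prod(TOY_PROBS[j][g] for j, g in enumerate(G))
```

With this toy model, `flag_errors((1, 2, 0), toy_likelihood)` flags only the first SNP, whose LLR is log(0.90/0.02), about 3.8, and `impute((1, 2, 0), 0, toy_likelihood)` proposes 0 as its replacement. The LLR is never negative, because the called value is always one of the candidates in the maximum.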
Introduction
Background and Motivation
Single Nucleotide Polymorphisms (SNPs): positions in the genome at which two of the four possible nucleotides occur in a large percentage of the population (~12 million cataloged to date)
Haplotype: describes the combination of alleles present on one of the homologous chromosomes; alleles are usually denoted as 0 (major) and 1 (minor)
Genotype: combination of SNP alleles present on the two chromosomes: 0/1 if both chromosomes contain the major/minor allele; 2 if the chromosomes contain different alleles
Genome-wide association studies (GWAS): methodology for mapping disease-associated genes by typing a dense set of markers in large numbers of cases (individuals affected by the disease) and controls (unaffected individuals), followed by a statistical test of association
Enabled by recent advances in SNP genotyping technologies
Higher statistical power compared to other gene mapping methods, such as linkage, for uncovering the genetic basis of complex diseases
GWAS computational challenges:
Genotype error detection: even low levels of genotyping errors decrease statistical power and can invalidate some statistical tests for disease association based on haplotypes
Imputation of un-called genotypes: genotypes at some SNPs are left un-called due to uncertainties in low-level probe intensity data, but many analyses require complete data
Imputation of un-typed genotypes: current genotyping platforms still have limited coverage (500k-1M SNPs) and are thus unlikely to include causal SNPs; imputation of un-typed SNPs has emerged as a powerful technique for increasing GWAS power
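The 0/1/2 genotype encoding defined above can be illustrated with a short Python sketch. The function name `genotype_from_haplotypes` is hypothetical and used only for illustration.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two haplotypes (lists of 0/1 alleles) into a genotype.

    At each SNP the genotype is 0 or 1 when both chromosomes carry that
    allele, and 2 when the two chromosomes carry different alleles.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP loci")
    return [a if a == b else 2 for a, b in zip(h1, h2)]
```

For example, the haplotypes `[0, 1, 0, 1]` and `[0, 1, 1, 0]` give the genotype `[0, 1, 2, 2]`. The mapping is not invertible: a genotype with several heterozygous (2) sites is explained by more than one haplotype pair, which is why haplotypes have to be inferred.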
HMM of Haplotype Diversity
HMM structure for K=4 founders and 5 SNP loci
Experimental Setup
WTCCC Dataset: genotype data of the 1958 birth cohort from the Wellcome Trust Case Control Consortium genome-wide association study
1,444 individuals genotyped using the Affymetrix 500k platform
We inserted 1% errors, set 1% of the genotype calls as missing, and masked 1% of the SNPs as un-typed
ADHD Dataset: dataset from the Genetic Association Information Network (GAIN) study on attention deficit hyperactivity disorder (ADHD)
958 mother-father-child trios genotyped using the Perlegen 500k platform
We inserted 1% errors, set 1% of the genotype calls as missing, and masked 1% of the SNPs as un-typed
HapMap Dataset: CEU panel consisting of Utah residents with European ancestry
30 mother-father-child trio families genotyped using both the Affymetrix 500k and the Affymetrix 6.0 platforms
Affymetrix 500k genotypes were used to impute un-called genotypes and genotypes of SNPs on the Affymetrix 6.0 platform not covered by the Affymetrix 500k
Actual Affymetrix 6.0 genotypes were assumed to be correct when estimating imputation and missing data recovery accuracy; in particular, wherever Affymetrix 500k and 6.0 calls disagree, the 6.0 call was assumed to be the correct one
Accuracy measures
Error Detection True Positive (TP) Rate is the percentage of the inserted genotype errors that get flagged
Error Detection False Positive (FP) Rate is the percentage of correct genotype calls that get erroneously flagged
Error Correction Accuracy is the percentage of flagged errors that get corrected to the original value
The IMP and MDR Error Rates are the percentage of masked genotypes that are recovered erroneously
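The error insertion step and the accuracy measures defined above could be computed along the following lines. This is a sketch under stated assumptions: all function names are hypothetical, genotypes are held in a flat list, and an inserted error replaces a call with one of the two other values chosen uniformly at random (the poster does not say how errors were drawn).

```python
import random

def insert_errors(genotypes, rate, rng):
    """Return a noisy copy of the calls and the set of altered positions."""
    noisy = list(genotypes)
    n_err = round(rate * len(noisy))
    positions = set(rng.sample(range(len(noisy)), n_err))
    for p in positions:
        # Assumption: each error is a uniformly chosen different genotype.
        noisy[p] = rng.choice([x for x in (0, 1, 2) if x != noisy[p]])
    return noisy, positions

def detection_rates(n_calls, inserted, flagged):
    """ED TP rate and FP rate, both as percentages."""
    tp_rate = 100.0 * len(inserted & flagged) / len(inserted)
    fp_rate = 100.0 * len(flagged - inserted) / (n_calls - len(inserted))
    return tp_rate, fp_rate

def correction_accuracy(truth, corrected, inserted, flagged):
    """Percentage of flagged true errors restored to the original value."""
    hits = inserted & flagged
    fixed = sum(1 for p in hits if corrected[p] == truth[p])
    return 100.0 * fixed / len(hits)

def error_rate(truth, recovered, masked):
    """IMP or MDR error rate: percentage of masked genotypes recovered wrongly."""
    wrong = sum(1 for p in masked if recovered[p] != truth[p])
    return 100.0 * wrong / len(masked)
```

A call such as `insert_errors(calls, 0.01, random.Random(0))` mirrors the 1% error insertion described above; the returned position set is what `detection_rates` compares against the flagged positions. The functions assume non-empty denominators and would need guards for degenerate inputs.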
Experimental Results
Estimates of the allele 0 frequencies based on imputed versus true genotypes
Accuracy and missing data rate for imputed genotypes for different confidence thresholds. The solid line shows the discordance between imputed genotypes and original genotype calls, while the dashed line shows the missing data rate
Error Detection (ED), Error Correction (EC), Missing Data Recovery (MDR) and Imputation (IMP) accuracy results obtained on the WTCCC, ADHD, and HapMap datasets
[Figures: one panel per dataset (WTCCC, ADHD, HapMap). Accuracy plots: x-axis is the calling threshold (0.5 to 0.95); y-axes are the percentage of missing genotypes (0% to 100%) and the percentage discordance (0% to 14%).]
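The trade-off between discordance and missing data rate at different calling thresholds can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and it assumes that the confidence of an imputed genotype is the largest of its three posterior probabilities and that a genotype is called only when this confidence exceeds the threshold. The poster does not spell out how confidence is computed.

```python
def call_with_threshold(posteriors, threshold):
    """Call the most probable genotype, or None when confidence is too low.

    posteriors: one (p0, p1, p2) tuple per imputed genotype.
    """
    calls = []
    for p in posteriors:
        best = max(range(3), key=lambda x: p[x])
        calls.append(best if p[best] > threshold else None)
    return calls

def discordance_and_missing(calls, truth):
    """Percentage discordance among called genotypes, and percentage uncalled."""
    called = [(c, t) for c, t in zip(calls, truth) if c is not None]
    missing = 100.0 * (len(calls) - len(called)) / len(calls)
    if not called:
        return 0.0, missing
    discordance = 100.0 * sum(1 for c, t in called if c != t) / len(called)
    return discordance, missing
```

Raising the threshold leaves more genotypes uncalled but tends to remove the least reliable calls first, which is the behavior the accuracy plots summarize.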
Conclusions
All proposed flows have been implemented in a software tool called GEDI (for Genotype Error Detection and Imputation); the open-source code will be released at http://dna.engr.uconn.edu
A significant percentage of errors are detected with very low false positive rate (TP rate for HapMap dataset is under-estimated since some of the discordances are caused by errors in Affymetrix 6.0 genotypes)
Over 97% of detected genotype errors are accurately corrected
Un-called genotypes are recovered with high accuracy, but accuracy seems to be sensitive to dataset specific missing data patterns
Imputation of un-typed genotypes has high accuracy (less than 2% discordance for genotypes imputed with >.95 confidence), and imputed allele frequencies match the observed frequencies well
HapMap haplotype frequencies transfer well to related populations for imputation of un-typed variation; however, EDC and MDR benefit from training the HMM based on haplotypes inferred from the population under study
Performing error detection and missing data recovery increases imputation accuracy for all 3 datasets, showing the advantages of the combined approach (EDC+MDR+IMP)
References
1. L.E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41:164-171, 1970.
2. The Wellcome Trust Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447:661-678, 2007.
3. A. Gusev, B. Pasaniuc, and I.I. Mandoiu. Highly scalable genotype phasing by entropy minimization. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5:252-261, 2008.
4. J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype error detection using Hidden Markov models of haplotype diversity. In Proc. 7th Workshop on Algorithms in Bioinformatics, LNCS, 73-84, 2007.
5. G. Kimmel and R. Shamir. A block-free hidden Markov model for genotypes and its application to disease association. Journal of Computational Biology, 12:1243-1260, 2005.
6. Y. Li and G. R. Abecasis. Mach 1.0: Rapid haplotype reconstruction and missing genotype inference. American Journal of Human Genetics, 2290, 2006.
7. J. Marchini, B. Howie, S. Myers, G. McVean, and P. Donnelly. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet., 39:906-913, 2007.
8. P. Rastas, M. Koivisto, H. Mannila, and E. Ukkonen. Phasing genotypes using a hidden Markov model. In I.I. Mandoiu and A. Zelikovsky, editors, Bioinformatics Algorithms: Techniques and Applications, 355-372. Wiley, 2008.
9. P. Scheet and M. Stephens. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. American Journal of Human Genetics, 78:629-644, 2006.
10. R. Schwartz. Algorithms for association study design using a generalized model of haplotype conservation. In Proc. CSB, pages 90-97, 2004.
11. X. Wen and D. L. Nicolae. Association studies for untyped markers with tuna. Bioinformatics, 24:435-437, 2008.
HMM structure:
Left-to-right HMM similar to the models proposed by [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05]
Determined by the number n of SNP loci and a user-specified number K of “founder” states at each SNP (set to 7 in our experiments)
Each state is allowed to emit both alleles, but training usually introduces a strong bias towards one of them
Paths with high transition probability correspond to “founder” haplotypes; transition probabilities capture observed (founder-specific) recombination rates
Efficient Likelihood Computations:
A trained HMM M emits haplotypes along left-to-right paths
P(H|M) = sum, over all possible HMM paths, of the joint probability that M follows the path and emits H; efficiently computed in $O(nK^2)$ time using the forward algorithm
P(G|M) = probability with which M emits any two haplotypes that explain G along any pair of paths; efficiently computed in $O(nK^3)$ time by a 2-path extension of the forward algorithm combined with the speed-up idea of [Rastas07]
A similar speed-up can be used for computing the likelihood of genotype trios in $O(nK^5)$ time
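The two likelihood computations described above can be sketched in Python as follows. This is a minimal sketch under assumed conventions, not the GEDI code: the model is represented by hypothetical arrays `start[k]` (probability of starting in state k at the first locus), `trans[j][k][l]` (probability of moving from state k at locus j to state l at locus j+1), and `emit[j][k]` (probability that state k at locus j emits allele 1). The genotype routine is the plain 2-path forward algorithm without the speed-up, so as written it takes $O(nK^4)$ time, while the haplotype routine takes $O(nK^2)$ time.

```python
from itertools import product

def emit_prob(emit, j, k, allele):
    """Probability that state k at locus j emits the given allele (0 or 1)."""
    return emit[j][k] if allele == 1 else 1.0 - emit[j][k]

def haplotype_likelihood(H, start, trans, emit):
    """P(H | M) by the forward algorithm over left-to-right paths."""
    K = len(start)
    f = [start[k] * emit_prob(emit, 0, k, H[0]) for k in range(K)]
    for j in range(1, len(H)):
        f = [emit_prob(emit, j, l, H[j])
             * sum(f[k] * trans[j - 1][k][l] for k in range(K))
             for l in range(K)]
    return sum(f)

def genotype_emit(emit, j, k1, k2, g):
    """Probability that the state pair (k1, k2) at locus j emits genotype g."""
    if g == 2:  # heterozygous: either chromosome may carry the minor allele
        return (emit_prob(emit, j, k1, 0) * emit_prob(emit, j, k2, 1)
                + emit_prob(emit, j, k1, 1) * emit_prob(emit, j, k2, 0))
    return emit_prob(emit, j, k1, g) * emit_prob(emit, j, k2, g)

def genotype_likelihood(G, start, trans, emit):
    """P(G | M) by a forward pass over pairs of paths (no speed-up)."""
    K = len(start)
    pairs = list(product(range(K), repeat=2))
    f = {(a, b): start[a] * start[b] * genotype_emit(emit, 0, a, b, G[0])
         for a, b in pairs}
    for j in range(1, len(G)):
        f = {(a, b): genotype_emit(emit, j, a, b, G[j])
             * sum(f[c, d] * trans[j - 1][c][a] * trans[j - 1][d][b]
                   for c, d in pairs)
             for a, b in pairs}
    return sum(f.values())
```

A simple sanity check on such a sketch is that `haplotype_likelihood` sums to 1 over all $2^n$ haplotypes, and that `genotype_likelihood(G, ...)` equals the sum of P(H1|M)·P(H2|M) over all ordered haplotype pairs that explain G.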
… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcCgtatacacgggTctata …
Genotype (two haplotypes per individual): 021200210011000110001100010
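Forward recurrence for the 2-path (genotype) likelihood computation, where $Q_{j-1}$ is the set of states at locus $j-1$, $E$ the emission probability of the state pair, and $\gamma$ the transition probability:
\[ f(j; q_1, q_2) = E(j; q_1, q_2) \sum_{(q_1', q_2') \in Q_{j-1}^2} f(j-1; q_1', q_2')\, \gamma(q_1', q_1)\, \gamma(q_2', q_2) \]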
Dataset  Flow                        ED TP rate  ED FP rate  EC accuracy  MDR error rate  IMP error rate
WTCCC    IMP                         -           -           -            -               6.63%
WTCCC    MDR+IMP (HapMap haps)       -           -           -            11.78%          6.63%
WTCCC    MDR+IMP (ENT haps)          -           -           -            10.98%          6.63%
WTCCC    EDC+MDR+IMP (HapMap haps)   79.54%      0.87%       96.58%       11.98%          6.90%
WTCCC    EDC+MDR+IMP (ENT haps)      72.08%      0.21%       97.16%       10.89%          6.49%
ADHD     IMP                         -           -           -            -               9.16%
ADHD     MDR+IMP (HapMap haps)       -           -           -            6.14%           8.91%
ADHD     MDR+IMP (ENT haps)          -           -           -            5.21%           8.88%
ADHD     EDC+MDR+IMP (HapMap haps)   61.55%      0.39%       97.85%       5.98%           8.89%
ADHD     EDC+MDR+IMP (ENT haps)      52.62%      0.07%       98.39%       4.58%           8.74%
HapMap   IMP                         -           -           -            -               8.89%
HapMap   MDR+IMP (HapMap haps)       -           -           -            23.74%          8.76%
HapMap   MDR+IMP (ENT haps)          -           -           -            23.76%          8.80%
HapMap   EDC+MDR+IMP (HapMap haps)   40.43%      0.03%       99.40%       23.04%          8.73%
HapMap   EDC+MDR+IMP (ENT haps)      6.10%       0.03%       100.00%      25.21%          8.84%
(- : measure not applicable to this flow)