Posted on 14-Dec-2014
University of Pittsburgh Department of Biomedical Informatics
The Application of Naive Bayes Model Averaging to Predict Alzheimer’s Disease from Genome-Wide Data
Wei Wei, Shyam Visweswaran, and Gregory F. Cooper
Motivation: Develop methods for using genome-wide information about an individual to inform clinical care
Background
• Genome-wide association studies (GWASs)
• Single-nucleotide polymorphisms (SNPs)
• High-throughput genotyping technologies
• Alzheimer’s disease (AD):
  • AD afflicts about 10% of persons over 65 and almost half of those over 85
  • ~5.5 million cases currently in the U.S.
  • 95% of all AD cases are Late-Onset AD (LOAD)
Background
• Source: TGEN dataset by Reiman et al.*
• Cases
  • 1,411 individuals: 861 LOAD cases and 550 controls
• SNPs
  • 312,316 SNPs
  • Two additional SNPs (rs429358 and rs7412) genotyped separately (these determine APOE status)
____________________________________________________________________
* Reiman E, Webster J, Myers A, Hardy J, Dunckley T, Zismann V, et al. GAB2 alleles modify Alzheimer's risk in APOE epsilon4 carriers. Neuron. 2007;54(5):713-20.
Background
• Bayesian Model Averaging
  • Represents uncertainty about the correctness of any given model
  • Performs inference by weighting the prediction of each model by our uncertainty in that model
• Model-Averaged Naïve Bayes (MANB)
  • MANB efficiently averages over all naive Bayes models (on a given set of variables) in making a prediction for an individual patient case
Methods: Naive Bayes (NB)
[Figure: NB network structure, with LOAD as the parent of SNP 1, SNP 2, SNP 3, …, SNP 312,318]
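The NB structure above implies that P(LOAD | SNPs) is proportional to P(LOAD) · Πᵢ P(SNPᵢ | LOAD). A toy illustration of this factorization (our own sketch with made-up genotype data, not the authors' code), coding each genotype 0/1/2 as a minor-allele count:

```python
from collections import defaultdict

# Toy training data: each row is (genotype vector, LOAD label).
# Genotypes 0/1/2 = minor-allele count; label 1 = LOAD case, 0 = control.
train = [
    ([2, 0, 1], 1),
    ([2, 1, 0], 1),
    ([0, 0, 2], 0),
    ([1, 0, 2], 0),
]
labels = [0, 1]

# Estimate P(LOAD) and P(SNP_i = g | LOAD) with Laplace smoothing.
label_counts = defaultdict(int)
geno_counts = defaultdict(int)  # keyed by (label, SNP index, genotype)
for x, y in train:
    label_counts[y] += 1
    for i, g in enumerate(x):
        geno_counts[(y, i, g)] += 1

def predict(x):
    """Return P(LOAD = 1 | x) under the naive Bayes factorization."""
    joint = {}
    for y in labels:
        p = (label_counts[y] + 1) / (len(train) + len(labels))
        for i, g in enumerate(x):
            # +3 in the denominator: three possible genotype values
            p *= (geno_counts[(y, i, g)] + 1) / (label_counts[y] + 3)
        joint[y] = p
    return joint[1] / (joint[0] + joint[1])

print(predict([2, 0, 1]))  # clearly above 0.5: resembles the LOAD cases
```

With hundreds of thousands of SNPs, these per-SNP factors are exactly the quantities MANB later replaces with model-averaged versions.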
Methods: Feature Selection Naive Bayes (FSNB)
Perform feature selection using a greedy, forward-stepping search that optimizes the prediction of LOAD
[Figure: FSNB network structure, with LOAD as the parent of only the selected SNPs (e.g., SNP 1,100; SNP 25,920; SNP 104,582; SNP 276,455)]
Methods: Model-Averaged Naive Bayes (MANB)
[Figure: MANB network structure, with LOAD as the parent of SNP 1, SNP 2, …, SNP 312,318, where the presence of each arc is uncertain]
Methods: MANB
[Figure: the space of NB structures model_1, …, model_i, …, model_{2^312,318}]

MANB weights the prediction of each model by its posterior probability:

P(LOAD | Ev) = Σ_{i=1}^{2^312,318} P(LOAD | Ev, model_i) · P(model_i | training data)
Methods: MANB
• We can take advantage of the conditional independence relationships in NB models to make it efficient to model average over all of those many models.
• The computational “trick” is as follows*:
  • For each SNPi we construct a model-averaged conditional probability, PMANB(SNPi | LOAD), by averaging over whether or not there is an arc from LOAD to SNPi.
  • This step can be viewed as a “soft” form of feature selection.
  • We use these model-averaged conditional probabilities to define a new NB model M over which we now perform NB inference.
  • Performing inference with M is the same as model averaging over the exponential number of NB models discussed previously.
____________________________________________________________________
* Dash D, Cooper G. Exact model averaging with naive Bayesian classifiers. International Conference on Machine Learning (2002) 91-98.
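Our reading of this averaging step can be sketched as follows: the arc-present term is the usual conditional P(SNPi | LOAD), the arc-absent term is the marginal P(SNPi), and the two are mixed by the posterior probability of the arc. The counts and arc posterior below are made up for illustration; they are not from the TGEN data:

```python
# Hypothetical genotype counts (0/1/2) for one SNP, split by LOAD status.
counts_load = [30, 15, 5]   # genotype counts among LOAD cases
counts_ctrl = [10, 20, 20]  # genotype counts among controls
p_arc = 0.7                 # assumed posterior P(arc LOAD -> SNP | data)

total_load = sum(counts_load)
total_ctrl = sum(counts_ctrl)
total = total_load + total_ctrl

def p_manb(genotype, load):
    """Model-averaged P(SNP = genotype | LOAD = load): a mixture of the
    arc-present conditional and the arc-absent marginal."""
    counts = counts_load if load else counts_ctrl
    cond = counts[genotype] / (total_load if load else total_ctrl)
    marg = (counts_load[genotype] + counts_ctrl[genotype]) / total
    return p_arc * cond + (1 - p_arc) * marg

# The mixture is still a proper distribution over genotypes.
print(sum(p_manb(g, load=1) for g in range(3)))  # 1, up to float rounding
```

When p_arc is near 0 the SNP's factor collapses toward its marginal and contributes almost nothing to the prediction, which is why this acts as "soft" feature selection.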
Methods: Prior Probabilities
• Structure priors
  • FSNB and MANB assume each arc is present with some probability p, independent of the status of other arcs in the model.
  • Informed by the literature, we chose a value of p that yields an expected number of arcs of 20.
• Parameter priors
  • If we think of P(SNPi | LOAD) as defining a table of probabilities, then we assume that every way of filling in that table (consistent with the axioms of probability) is equally likely a priori.
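Under this arc-independence assumption, the expected number of arcs is simply n · p, so the target of 20 arcs over 312,318 candidate SNP arcs pins down p. A back-of-the-envelope sketch of that calculation (not the paper's exact derivation):

```python
n_arcs = 312_318     # one candidate arc per SNP
expected_arcs = 20   # target expected number of arcs in the model

# Arcs are independent Bernoulli(p), so E[#arcs] = n_arcs * p.
p = expected_arcs / n_arcs
print(f"per-arc prior p = {p:.3e}")
```

The resulting p is on the order of 6e-05, i.e., the prior is strongly biased toward each individual arc being absent.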
Methods: Experimental Design
• Five-fold cross-validation
• Performance measures
  • Area under the ROC curve (AUC) as a measure of discrimination
  • Calibration plots and Hosmer-Lemeshow goodness-of-fit statistics
  • Run time
• Control algorithms
  • NB
  • FSNB
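For reference, AUC equals the probability that a randomly chosen case is scored above a randomly chosen control, with ties counted as one half. A minimal pairwise computation of this quantity (a generic sketch, independent of the authors' evaluation code):

```python
def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive (label 1) is scored above a randomly chosen negative
    (label 0), counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

This pairwise definition is O(cases × controls); rank-based implementations give the same value in O(n log n).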
Results: Run time (in seconds)
[Bar chart of training times: MANB 16.1 s, NB 15.6 s, FSNB 1684.2 s]
Machine parameters: CPU 2.33 GHz, RAM 2 GB. Training time was the average over the five cross-validation folds. Time for loading data into memory is not included, but was about XYZ seconds.
Results: Area under the ROC curve (AUC)
Discussion:
• The AUCs of FSNB and MANB are similar (the 95% confidence interval of their AUC difference is -0.008 to 0.029). Their performance is strongly influenced by several APOE SNPs.
• The AUCs of NB and MANB are strongly statistically different (p < 0.00001).
Results: Calibration plot of NB
Discussion:
NB is poorly calibrated with almost all the test cases having probability predictions near 0 or 1. Such extreme predictions occur because there are such a large number of features in the model.
Results: Calibration plot of NB and FSNB
Discussion:
FSNB is the best calibrated algorithm among the three we evaluated. This result is likely due to the FSNB models containing only a few SNP features (< 4).
Results: Calibration plot of NB, FSNB and MANB
Discussion:
MANB is better calibrated than NB.
MANB is not as well calibrated as FSNB. We believe this result may be due to FSNB having such a small number of features in its models.
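A calibration plot of the kind discussed above bins the predicted probabilities and compares each bin's mean prediction to the observed LOAD frequency; a well-calibrated model lies near the diagonal. A minimal sketch with made-up predictions (not the study's data):

```python
# Hypothetical predicted probabilities and true labels (1 = LOAD).
preds  = [0.05, 0.10, 0.30, 0.40, 0.60, 0.70, 0.90, 0.95]
labels = [0,    0,    0,    1,    1,    0,    1,    1]

# Assign each prediction to one of four equal-width probability bins.
n_bins = 4
bins = [[] for _ in range(n_bins)]
for p, y in zip(preds, labels):
    idx = min(int(p * n_bins), n_bins - 1)  # clamp p = 1.0 into last bin
    bins[idx].append((p, y))

# Compare mean predicted probability to observed frequency per bin.
for i, b in enumerate(bins):
    if not b:
        continue
    mean_pred = sum(p for p, _ in b) / len(b)
    obs_freq = sum(y for _, y in b) / len(b)
    print(f"bin {i}: mean predicted {mean_pred:.2f}, observed {obs_freq:.2f}")
```

The Hosmer-Lemeshow statistic mentioned in the experimental design aggregates exactly these per-bin discrepancies into a single goodness-of-fit test.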
Summary of Results
             NB     FSNB    MANB
AUC                 +       +
Calibration         ++      +
Run time     ++             ++
Algorithm Availability
• A full description of the MANB algorithm is available in the appendix of our paper.
• It provides all the details needed to readily implement the algorithm.
Future Work Includes the Following
• Apply the MANB algorithm to additional datasets
• Predict additional clinical outcomes
• Use both genomic and clinical data to predict clinical outcomes
• Explore the use of additional genome-wide measurement platforms, including next-generation sequencing data
• Include additional control algorithms in future evaluations
Acknowledgements
• We thank Mr. Kevin Bui for his help in data preparation, software development, and the preparation of the appendix. We thank Dr. Pablo Hennings-Yeomans, Dr. Michael Barmada, and the other members of our research group for helpful discussions.
• The research reported here was funded by NLM grant R01-LM010020 and NSF grant IIS-0911032.
Thank you
Questions?