Posted on 14-Dec-2014
University of Pittsburgh Department of Biomedical Informatics
The Application of Naive Bayes Model Averaging to Predict Alzheimer’s Disease from Genome-Wide Data
Wei Wei, Shyam Visweswaran, and Gregory F. Cooper
Motivation: Develop methods for using genome-wide information about an individual to inform clinical care
Background
• Genome-wide association studies (GWASs)
• Single-nucleotide polymorphisms (SNPs)
• High-throughput genotyping technologies
• Alzheimer’s disease (AD):
  • AD afflicts about 10% of persons over 65 and almost half of those over 85
  • ~5.5 million cases currently in the U.S.
  • 95% of all AD cases are Late-Onset AD (LOAD)
Background
• Source: TGEN dataset by Reiman et al.*
• Cases
  • 1,411 individuals: 861 LOAD cases and 550 controls
• SNPs
  • 312,316 SNPs
  • Two additional SNPs (rs429358 and rs7412) genotyped separately (these determine APOE status)
____________________________________________________________________
* Reiman E, Webster J, Myers A, Hardy J, Dunckley T, Zismann V, et al. GAB2 alleles modify Alzheimer's risk in APOE epsilon4 carriers. Neuron. 2007;54(5):713-20.
Background
• Bayesian Model Averaging
  • Represents uncertainty about the correctness of any given model
  • Performs inference by weighting the prediction of each model by our uncertainty in that model
• Model-Averaged Naïve Bayes (MANB)
  • MANB efficiently averages over all naive Bayes models (on a given set of variables) in making a prediction for an individual patient case
Methods: Naive Bayes (NB)
[Figure: NB network structure, with LOAD as the parent of SNP 1, SNP 2, SNP 3, …, SNP 312,318]
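The NB structure above implies that P(LOAD | SNPs) is proportional to P(LOAD) · Πᵢ P(SNPᵢ | LOAD). A toy illustration of this factorization (our own sketch with made-up genotype data, not the authors' code), coding each genotype 0/1/2 as a minor-allele count:

```python
from collections import defaultdict

# Toy training data: each row is (genotype vector, LOAD label).
# Genotypes 0/1/2 = minor-allele count; label 1 = LOAD case, 0 = control.
train = [
    ([2, 0, 1], 1),
    ([2, 1, 0], 1),
    ([0, 0, 2], 0),
    ([1, 0, 2], 0),
]
labels = [0, 1]

# Estimate P(LOAD) and P(SNP_i = g | LOAD) with Laplace smoothing.
label_counts = defaultdict(int)
geno_counts = defaultdict(int)  # keyed by (label, SNP index, genotype)
for x, y in train:
    label_counts[y] += 1
    for i, g in enumerate(x):
        geno_counts[(y, i, g)] += 1

def predict(x):
    """Return P(LOAD = 1 | x) under the naive Bayes factorization."""
    joint = {}
    for y in labels:
        p = (label_counts[y] + 1) / (len(train) + len(labels))
        for i, g in enumerate(x):
            # +3 in the denominator: three possible genotype values
            p *= (geno_counts[(y, i, g)] + 1) / (label_counts[y] + 3)
        joint[y] = p
    return joint[1] / (joint[0] + joint[1])

print(predict([2, 0, 1]))  # clearly above 0.5: resembles the LOAD cases
```

With hundreds of thousands of SNPs, these per-SNP factors are exactly the quantities MANB later replaces with model-averaged versions.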
Methods: Feature Selection Naive Bayes (FSNB)
Perform feature selection using a greedy, forward-stepping search that optimizes the prediction of LOAD
[Figure: FSNB network structure, with LOAD as the parent of only the selected SNPs (e.g., SNP 1,100; SNP 25,920; SNP 104,582; SNP 276,455)]
Methods: Model-Averaged Naive Bayes (MANB)
[Figure: MANB network structure, with LOAD as the parent of SNP 1, SNP 2, …, SNP 312,318, where the presence of each arc is uncertain]
Methods: MANB
[Figure: the space of NB structures model_1, …, model_i, …, model_{2^312,318}]

MANB weights the prediction of each model by its posterior probability:

P(LOAD | Ev) = Σ_{i=1}^{2^312,318} P(LOAD | Ev, model_i) · P(model_i | training data)
Methods: MANB
• We can take advantage of the conditional independence relationships in NB models to make it efficient to model average over all of those many models.
• The computational “trick” is as follows*:
  • For each SNPi we construct a model-averaged conditional probability, PMANB(SNPi | LOAD), by averaging over whether or not there is an arc from LOAD to SNPi.
  • This step can be viewed as a “soft” form of feature selection.
  • We use these model-averaged conditional probabilities to define a new NB model M over which we now perform NB inference.
  • Performing inference with M is the same as model averaging over the exponential number of NB models discussed previously.
____________________________________________________________________
* Dash D, Cooper G. Exact model averaging with naive Bayesian classifiers. International Conference on Machine Learning (2002) 91-98.
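Our reading of this averaging step can be sketched as follows: the arc-present term is the usual conditional P(SNPi | LOAD), the arc-absent term is the marginal P(SNPi), and the two are mixed by the posterior probability of the arc. The counts and arc posterior below are made up for illustration; they are not from the TGEN data:

```python
# Hypothetical genotype counts (0/1/2) for one SNP, split by LOAD status.
counts_load = [30, 15, 5]   # genotype counts among LOAD cases
counts_ctrl = [10, 20, 20]  # genotype counts among controls
p_arc = 0.7                 # assumed posterior P(arc LOAD -> SNP | data)

total_load = sum(counts_load)
total_ctrl = sum(counts_ctrl)
total = total_load + total_ctrl

def p_manb(genotype, load):
    """Model-averaged P(SNP = genotype | LOAD = load): a mixture of the
    arc-present conditional and the arc-absent marginal."""
    counts = counts_load if load else counts_ctrl
    cond = counts[genotype] / (total_load if load else total_ctrl)
    marg = (counts_load[genotype] + counts_ctrl[genotype]) / total
    return p_arc * cond + (1 - p_arc) * marg

# The mixture is still a proper distribution over genotypes.
print(sum(p_manb(g, load=1) for g in range(3)))  # 1, up to float rounding
```

When p_arc is near 0 the SNP's factor collapses toward its marginal and contributes almost nothing to the prediction, which is why this acts as "soft" feature selection.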
Methods: Prior Probabilities
• Structure priors
  • FSNB and MANB assume each arc is present with some probability p, independent of the status of other arcs in the model.
  • Informed by the literature, we chose a value of p that yields an expected number of arcs of 20.
• Parameter priors
  • If we think of P(SNPi | LOAD) as defining a table of probabilities, then we assume that every way of filling in that table (consistent with the axioms of probability) is equally likely a priori.
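Under this arc-independence assumption, the expected number of arcs is simply n · p, so the target of 20 arcs over 312,318 candidate SNP arcs pins down p. A back-of-the-envelope sketch of that calculation (not the paper's exact derivation):

```python
n_arcs = 312_318     # one candidate arc per SNP
expected_arcs = 20   # target expected number of arcs in the model

# Arcs are independent Bernoulli(p), so E[#arcs] = n_arcs * p.
p = expected_arcs / n_arcs
print(f"per-arc prior p = {p:.3e}")
```

The resulting p is on the order of 6e-05, i.e., the prior is strongly biased toward each individual arc being absent.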
Methods: Experimental Design
• Five-fold cross-validation
• Performance measures
  • Area under the ROC curve (AUC) as a measure of discrimination
  • Calibration plots and Hosmer-Lemeshow goodness-of-fit statistics
  • Run time
• Control algorithms
  • NB
  • FSNB
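For reference, AUC equals the probability that a randomly chosen case is scored above a randomly chosen control, with ties counted as one half. A minimal pairwise computation of this quantity (a generic sketch, independent of the authors' evaluation code):

```python
def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive (label 1) is scored above a randomly chosen negative
    (label 0), counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

This pairwise definition is O(cases × controls); rank-based implementations give the same value in O(n log n).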
Results: Run time (in seconds)
[Bar chart of training times: MANB 16.1 s, NB 15.6 s, FSNB 1684.2 s]
Machine parameters: CPU 2.33 GHz, RAM 2 GB. Training time was the average over the five cross-validation folds. Time for loading data into memory is not included, but was about XYZ seconds.
Results: Area under the ROC curve (AUC)
Discussion:
• The AUCs of FSNB and MANB are similar (the 95% confidence interval of their AUC difference is -0.008 to 0.029). Their performance is strongly influenced by several APOE SNPs.
• The AUCs of NB and MANB are strongly statistically different (p < 0.00001).
Results: Calibration plot of NB
Discussion:
NB is poorly calibrated with almost all the test cases having probability predictions near 0 or 1. Such extreme predictions occur because there are such a large number of features in the model.
Results: Calibration plot of NB and FSNB
Discussion:
FSNB is the best calibrated algorithm among the three we evaluated. This result is likely due to the FSNB models containing only a few SNP features (< 4).
Results: Calibration plot of NB, FSNB and MANB
Discussion:
MANB is better calibrated than NB.
MANB is not as well calibrated as FSNB. We believe this result may be due to FSNB having such a small number of features in its models.
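A calibration plot of the kind discussed above bins the predicted probabilities and compares each bin's mean prediction to the observed LOAD frequency; a well-calibrated model lies near the diagonal. A minimal sketch with made-up predictions (not the study's data):

```python
# Hypothetical predicted probabilities and true labels (1 = LOAD).
preds  = [0.05, 0.10, 0.30, 0.40, 0.60, 0.70, 0.90, 0.95]
labels = [0,    0,    0,    1,    1,    0,    1,    1]

# Assign each prediction to one of four equal-width probability bins.
n_bins = 4
bins = [[] for _ in range(n_bins)]
for p, y in zip(preds, labels):
    idx = min(int(p * n_bins), n_bins - 1)  # clamp p = 1.0 into last bin
    bins[idx].append((p, y))

# Compare mean predicted probability to observed frequency per bin.
for i, b in enumerate(bins):
    if not b:
        continue
    mean_pred = sum(p for p, _ in b) / len(b)
    obs_freq = sum(y for _, y in b) / len(b)
    print(f"bin {i}: mean predicted {mean_pred:.2f}, observed {obs_freq:.2f}")
```

The Hosmer-Lemeshow statistic mentioned in the experimental design aggregates exactly these per-bin discrepancies into a single goodness-of-fit test.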
Summary of Results
             NB     FSNB    MANB
AUC                 +       +
Calibration         ++      +
Run time     ++             ++
Algorithm Availability
• A full description of the MANB algorithm is available in the appendix of our paper.
• It provides all the details needed to readily implement the algorithm.
Future Work Includes the Following
• Apply the MANB algorithm to additional datasets
• Predict additional clinical outcomes
• Use both genomic and clinical data to predict clinical outcomes
• Explore the use of additional genome-wide measurement platforms, including next-generation sequencing data
• Include additional control algorithms in future evaluations
Acknowledgements
• We thank Mr. Kevin Bui for his help in data preparation, software development, and the preparation of the appendix. We thank Dr. Pablo Hennings-Yeomans, Dr. Michael Barmada, and the other members of our research group for helpful discussions.
• The research reported here was funded by NLM grant R01-LM010020 and NSF grant IIS-0911032.
Thank you
Questions?